My love of regular expressions keeps expanding… Luckily for those of us who mainly program LotusScript on Windows, I recently found out that the Microsoft VBScript RegExp object supports non-capturing parentheses, which is great when working with sub-matches.
For more information on sub-matches, and some working LotusScript agent code to use when testing it, refer to my sub-matches post.
For the Java programming / non-Windows users
If you need server-side agents on platforms other than Windows you can still use regex. You will need to use Java though. The options:
- Notes / Domino 5.x: Use Apache Jakarta ORO compiled to JVM 1.1 (you will need to recompile from source to make the class format compatible).
- Notes / Domino 6.x: Use the default binary Apache Jakarta ORO distribution.
- Notes / Domino 7.x: Use the regex classes built into Java 1.4, found in the java.util.regex package.
Non-capturing parentheses 101 (some knowledge of regex is required)
When working with regular expressions, everything you put in parentheses will be available for you to reference after a match has been found. These sub-matches are then accessible by index through the SubMatches collection of each Match returned by the Execute() function. The purpose of non-capturing parentheses is to let you group with parentheses without having them capture anything, so nothing extra is added to your collection of sub-matches.
I needed this non-capturing functionality just the other day when writing an agent to extract parts of the body text from incoming e-mail using a “When mail arrives” agent. The incoming e-mails look like the text below.
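To see the difference in isolation, here is a minimal sketch in Java using java.util.regex (the Domino 7.x option listed above) rather than the VBScript object. The class and method names are made up for illustration, and how you index the captured text in Java differs from the VBScript collections, so treat this as a demonstration of the idea only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class GroupBasics {
    // Number of capturing groups the pattern defines.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Text captured by group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "24-12";
        // Both sets of parentheses capture: group 1 is the text before the dash.
        System.out.println(groupCount("(\\d+)-(\\d+)") + " " + firstGroup("(\\d+)-(\\d+)", input));
        // The first set is non-capturing: group 1 is now the text after the dash.
        System.out.println(groupCount("(?:\\d+)-(\\d+)") + " " + firstGroup("(?:\\d+)-(\\d+)", input));
    }
}
```

Run as is, this prints 2 24 and then 1 12: the non-capturing set still takes part in the match, it just no longer gets a slot of its own.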
Direct link to issue: http://www.example.com/somedir/someurl?id=1234
Praesent hendrerit, duis ad ut enim consequat sed consectetuer nulla.
Status: Open
Responsible John Doe
Next followup: 24-12-2005
In the above example we would like to get all the text after the initial URL and have that text available as a single sub-match. Without non-capturing parentheses this just isn’t possible.
Let us look at an example – it’s a little bit long but bear with me…
The easy way out would be to use a regex like this:
\?id=\d+\s+(.*)
This basically means:
- find the string ?id= followed by at least one digit (\?id=\d+)
- after this make sure there is at least one whitespace character, which here is the line break (\s+)
- get all the text following the line break and capture it ((.*))
Result:
Praesent hendrerit, duis ad ut enim consequat sed consectetuer nulla.
The reason this doesn’t work (we only get the first line of text and not the trailing Status, Responsible and Next followup lines) is that the period character matches everything except line breaks.
OK, so we throw in some \s sequences to make sure we match the line breaks as well.
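For anyone following along on the Java side, the first attempt can be reproduced with a small sketch like the one below. It assumes java.util.regex instead of the VBScript object, uses \n for the line breaks in the sample e-mail, and the class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: the first attempt run against the sample e-mail.
final class FirstAttempt {
    // The sample e-mail, with \n standing in for the line breaks (an assumption).
    static final String MAIL =
          "Direct link to issue: http://www.example.com/somedir/someurl?id=1234\n"
        + "Praesent hendrerit, duis ad ut enim consequat sed consectetuer nulla.\n"
        + "Status: Open\n"
        + "Responsible John Doe\n"
        + "Next followup: 24-12-2005";

    // Returns what the single capturing group picks up, or null if nothing matches.
    static String capture(String mail) {
        Matcher m = Pattern.compile("\\?id=\\d+\\s+(.*)").matcher(mail);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The period stops at the first line break, so only one line is captured.
        System.out.println(capture(MAIL));
    }
}
```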
\?id=\d+\s+(.*\s*.*)
Result:
Praesent hendrerit, duis ad ut enim consequat sed consectetuer nulla.
Status: Open
It is getting better but we still don’t get all the lines after the text – only the first one. The problem is that we only match “some text, a possible line break and some text”. This grouping occurs a couple of times so we need to tell the regex to match this continuously.
\?id=\d+\s+((.*\s*.*)*)
Result:
Praesent hendrerit, duis ad ut enim consequat sed consectetuer nulla.
Status: Open
Responsible John Doe
Next followup: 24-12-2005
While the result looks right, it doesn’t live up to the initial requirement, which was only one entry in our collection of sub-matches. With the above regex we will have two sub-matches where the second one is blank (two sets of parentheses).
The reason is that both sets of parentheses are capturing, but only one set has some text to capture. Non-capturing parentheses to the rescue: the only change from above is the ?: added just inside the inner set of parentheses.
\?id=\d+\s+((?:.*\s*.*)*)
Adding ?: just inside the second set of parentheses makes them non-capturing and will make the regex do what we want.
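The progression from the two-line pattern to the non-capturing one can be checked side by side with a sketch like this. Again it is Java with java.util.regex rather than VBScript, the sample e-mail uses \n for line breaks, and the class and constant names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the last three patterns on the sample e-mail.
final class NonCapturingDemo {
    static final String MAIL =
          "Direct link to issue: http://www.example.com/somedir/someurl?id=1234\n"
        + "Praesent hendrerit, duis ad ut enim consequat sed consectetuer nulla.\n"
        + "Status: Open\n"
        + "Responsible John Doe\n"
        + "Next followup: 24-12-2005";

    static final String TWO_LINES     = "\\?id=\\d+\\s+(.*\\s*.*)";
    static final String ALL_CAPTURING = "\\?id=\\d+\\s+((.*\\s*.*)*)";
    static final String NON_CAPTURING = "\\?id=\\d+\\s+((?:.*\\s*.*)*)";

    // Runs the regex against the sample e-mail and returns the positioned matcher.
    static Matcher match(String regex) {
        Matcher m = Pattern.compile(regex).matcher(MAIL);
        if (!m.find()) {
            throw new IllegalStateException("no match for " + regex);
        }
        return m;
    }

    public static void main(String[] args) {
        for (String regex : new String[] {TWO_LINES, ALL_CAPTURING, NON_CAPTURING}) {
            Matcher m = match(regex);
            // groupCount() shows how many slots the pattern creates;
            // group(1) is the text we are actually after.
            System.out.println(regex + " -> " + m.groupCount() + " group(s)");
            System.out.println(m.group(1));
            System.out.println();
        }
    }
}
```

The last two patterns capture the same text in group 1; the only difference is that the non-capturing version leaves a single group instead of two.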
Why not do a Mid$-statement?
You might be asking yourself why go through all this trouble? Why didn’t I just do this using normal string operations and a traditional Instr/Mid combination:
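Dim i As Integer
Dim result As String
' Locate the id parameter of the URL
i = Instr(1, text, "?id=")
' Move to the end of the URL line (assuming lines end with a line feed)
i = Instr(i, text, Chr$(10))
' Take everything after that line break
result = Mid$(text, i + 1)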
Agreed, I could have done that and the result would have been the same, but what if there was a change in the received e-mail and some text was added after the “Next followup” line which we didn’t want to include? What if the format of the e-mail was changed so the URL was at the bottom? What if you needed to support multiple formats of e-mails using the same agent?
The power of regular expressions really shines here since your agent doesn’t change, just the regex.
However, I am as lazy as the next person, and the reason I did it using regex was a different one. I needed to let users specify what to do on specific incoming e-mail patterns: mail rules on steroids. Regex is the only way to do this, since users will continuously add patterns, which would otherwise lead to constant reprogramming of the agent. Using regex, the patterns can be specified using configuration documents in the database and tested by the user before being moved into production.
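As a rough illustration of that idea (not the actual agent), here is a Java sketch where a plain map stands in for the configuration documents. The rule names, the patterns and the MailRules class are all hypothetical; in a real database the rules would be read from documents instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical sketch: user-defined patterns kept as data, not as agent code.
final class MailRules {
    // Lets a user check that a pattern compiles before it goes into production.
    static boolean isValid(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Name of the first rule whose pattern is found in the mail body, if any.
    static Optional<String> firstMatchingRule(Map<String, String> rules, String body) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            if (Pattern.compile(rule.getValue()).matcher(body).find()) {
                return Optional.of(rule.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for configuration documents: rule name -> regex (made-up examples).
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("invoice", "Invoice no\\.\\s*\\d+");
        rules.put("issue", "\\?id=\\d+\\s+((?:.*\\s*.*)*)");

        String body = "Direct link to issue: http://www.example.com/somedir/someurl?id=1234\n"
                    + "Status: Open";
        System.out.println(firstMatchingRule(rules, body).orElse("no rule matched"));
    }
}
```

Adding support for a new e-mail format then means adding an entry to the rules, while the matching code stays untouched.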
Try doing that with an agent and Instr/Mid!! 🙂