Recently a colleague submitted a proposal to our team regarding the storage of regular expressions in a database lookup table. Whilst I could see the obvious benefits from this, it did make me slightly uneasy.

My principal concern was that vast swathes of key business logic could be broken with a simple update, and that we were effectively storing (pseudo) executable code in a database table. I was also concerned that, because these expressions would be defined outside compiled code, they would bypass any syntactic or lexical validation performed by the compiler.

Further concerns were raised about the nature of the expressions to be stored. For example, postcode pattern matching should be acceptable, whereas specific business-related search terms should not.

There was also a concern about the different flavours of regex: Perl, PL/SQL, Unix, JavaScript, C# and so on. How does a platform know which regexes it can use, and which it cannot?

After a bit of parley, we came up with a sensible set of proposals:

  1. Provide an API for their access and use, with rigorous escaping and error trapping. In particular, erroneous or poisonous expressions (such as those that may facilitate SQL injection) should be handled.
  2. Ensure different ‘flavours’ are captured in the lookup table. Provide an API for those platforms wishing to subscribe.
  3. Avoid storing very specific search terms (business-related or otherwise), and give special consideration to terms that do not need a regular expression at all.
  4. Require evidence of testing and impact analysis before the submission of any new regex is accepted.
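
As a rough illustration of proposals 1 and 2, here is a minimal sketch of what the access API might look like on a JavaScript platform. The row shape (name, flavour, pattern) and the function name compileStoredPattern are assumptions made for this example, not part of the original proposal. It covers only the flavour check and error trapping; it does not touch the database itself.

```javascript
// Hypothetical row shape for the lookup table: { name, flavour, pattern }.
// compileStoredPattern is an illustrative name, not an existing API.
function compileStoredPattern(row, platformFlavour) {
    // Refuse patterns that were written for a different regex flavour.
    if (row.flavour !== platformFlavour) {
        return { ok: false, error: "Flavour mismatch: " + row.flavour };
    }
    try {
        // new RegExp throws a SyntaxError when the pattern is malformed.
        return { ok: true, regex: new RegExp(row.pattern) };
    } catch (e) {
        return { ok: false, error: e.message };
    }
}

// Example rows, invented for this sketch.
var digits = { name: "digits", flavour: "javascript", pattern: "^\\d+$" };
var broken = { name: "broken", flavour: "javascript", pattern: "([a-z" };
var perlOnly = { name: "perlOnly", flavour: "perl", pattern: "(?i)abc" };

console.log(compileStoredPattern(digits, "javascript").ok);   // true
console.log(compileStoredPattern(broken, "javascript").ok);   // false
console.log(compileStoredPattern(perlOnly, "javascript").ok); // false
```

Returning a result object rather than throwing means a bad row in the table cannot bring down the caller; it can log the error and fall back to whatever default it sees fit.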

And finally, should you want to learn or brush up on them, you could do far worse than Regex Coach.

I stumbled across an article today regarding Google Code search and some of its amusing side effects.

http://www.kottke.org/06/10/google-code-search

This demonstrates why we should be careful about the code and comments that we write, especially if they are published externally or are available through some code search tool.

What I found most interesting from the above link was the employment of Google Code Search to search for specific programming errors. To this end, I have conjured an example, which can check for a specific sort of JavaScript error:

if (x=y) {

The issue, of course, is that this is an assignment, not a comparison. x is assigned y’s value, and the following block is then executed whenever that value is truthy, regardless of what x held beforehand. The regex for identifying this looks something like:

if\s*\([\s\w]+\=[\s\w]+\)

For completeness, here is the tester code that I wrote. You’ll see from the commented-out line how I gradually built it up:

window.onload = function() {
    // Sample conditions: some contain an accidental assignment, some do not.
    var arr = ["if(x=0)", "if (x=0)", "if (x =0)", "if (y = 0)", "if (z == 0)", "sif (z = 0)", "if (x >= 0)", "if (x)"];
    //var regexes = [/if /, /if\s*/, /if\s*\(/, /if\s*\(.+\=.+\)/, /if\s*\([\s\w]+\=[\s\w]+\)/];
    var regexes = [/if\s*\([\s\w]+\=[\s\w]+\)/];
    var i, j;

    for (i = 0; i < arr.length; i++) {
        for (j = 0; j < regexes.length; j++) {
            // test() returns a boolean. search() returns the match index,
            // so a "> 0" check would wrongly report a match at index 0 as false.
            document.write("Find " + regexes[j] + "  in " + arr[i] + ":" + regexes[j].test(arr[i]) + "<br>");
        }
    }
};

When this is fed into Google Code Search we can see some examples amongst the JS files that are already there:

http://www.google.com/codesearch?hl=en&lr=&q=if%5Cs*%5C%28%5B%5Cs%5Cw%5D%2B%5C%3D%5B%5Cs%5Cw%5D%2B%5C%29&sbtn=Search

It is, of course, slightly frivolous to do this in JavaScript, in that there is a ready-made tool for checking this (and other problems) in JSLint. The above, however, could certainly be employed on C or C++ code. You can probably think of some others as well.
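
To give an idea of how the same pattern might be pointed at C source, here is a small sketch that scans a block of text line by line and reports the suspect lines. The function name findSuspectIfs and the sample C snippet are invented for this example. I have also added a \b word boundary in front of the if, which is not in the regex above, so that identifiers merely ending in "if" are not flagged.

```javascript
// Hypothetical helper: report lines that contain an assignment inside an if condition.
function findSuspectIfs(sourceText) {
    // Same idea as the regex above, plus \b so that "sif (" is not matched.
    var pattern = /\bif\s*\([\s\w]+=[\s\w]+\)/;
    var hits = [];
    var lines = sourceText.split(/\r?\n/);
    var i;

    for (i = 0; i < lines.length; i++) {
        if (pattern.test(lines[i])) {
            // Line numbers are reported 1-based, as an editor would show them.
            hits.push({ line: i + 1, text: lines[i].trim() });
        }
    }
    return hits;
}

// A made-up C fragment to scan.
var sample = [
    "int main(void) {",
    "    int x = 1, y = 0;",
    "    if (x = y) { return 1; }",
    "    if (x == y) { return 2; }",
    "    return 0;",
    "}"
].join("\n");

console.log(findSuspectIfs(sample)); // one hit, on line 3
```

This is only a line-based approximation: it knows nothing about comments, strings or conditions that span several lines, so treat its output as a list of places to look rather than a list of bugs.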

I once answered a question on StackOverflow regarding surnames and regular expressions. I thought this might be worthy of a note here as well.

The questioner wanted to know how to write a regular expression to transform surnames with irregular capitalisations, for example names like

  • MckIntosh
  • MacDonald
  • O’Reily

Quite simply, this is not possible, as there is no reliable rule that holds 100% of the time.

Consider the following names:

  • Mrs Macey
  • Mr Opal
  • Mr Macdonald

They are all correct. Even Mr Macdonald who doesn’t capitalise his ‘D’s. Our regex would churn out:

  • Mrs MacEy
  • Mr O’pal
  • Mr MacDonald

Bad regex!
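
To make the failure concrete, here is a minimal sketch of the sort of rule such a regex would end up applying. The function naiveSurnameFix and its two rules are my own invention for illustration; they are not taken from the StackOverflow question.

```javascript
// Hypothetical "fixer": capitalise the letter after Mac/Mc, and put an
// apostrophe after a leading O. Both rules are deliberately naive.
function naiveSurnameFix(surname) {
    return surname
        .replace(/^(Ma?c)([a-z])/, function (match, prefix, letter) {
            return prefix + letter.toUpperCase();
        })
        .replace(/^O([a-z])/, "O'$1");
}

console.log(naiveSurnameFix("Macey"));     // MacEy
console.log(naiveSurnameFix("Opal"));      // O'pal
console.log(naiveSurnameFix("Macdonald")); // MacDonald
```

Tightening the rules does not help, because the regex only ever sees the letters and has no way of knowing which spelling the owner of the name actually uses.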

We have to be careful when dealing with surnames – these could be our customers after all. And there is little that is more insulting than having your own name churned up and spat out by some half-baked regex, especially as this may be done by several such half-baked regexes at different companies. You may feel like you want to change your name just so they get it right!

It’s as bad as name mispronunciation. I feel for all the people named Cockburn – (pronounced ‘Coeburn’), or McLeod – (‘McCloud’).

Unfortunately, this is all too common. Some systems are programmed only to store uppercase characters, in which case you are scuppered, and you do have to rely on some magical but flawed algorithm.

Others seek to perform some sort of user-input validation or correction. In any such case, the validation system should allow the user to input what they intended and not tell them how to think.

And always make it really really easy in your systems and processes to make minor corrections to a surname. This is a human being after all!

I still get letters from Scottish Gas addressed to Mr G Wiseman. And yet, they know my first name is James. I’ve tried to change it, but I just get passed through numerous levels of call centre and am then told that I need to provide it in writing, and that email is not good enough. Sigh!