Benchmarking Regular Expressions

I've been helping out quite a bit on MSDN Forums over the last month in preperation for the approaching MCAD -> MCPD Windows exam. There is only so much book reading you can do and the good thing about forums are they are a great study tool.

Tonight I got into a discussion with TaDa about what is faster; parsing text using regular expressions or parsing text by iterating character per character. The goal was to extract out a large amount of telephone numbers in the format 0809 983 398, as an example, mixed along with some unwanted text. Before I go into any more detail thanks TaDa I very rarely get to discuss any sort of development with another developer as I'm the only .NET developer in my company.

So the original benchmark was 100,000 records, file size 3.81mb.

The results were not including time to load the file...

Using regular expressions it took 78 milliseconds.

Using iterative method it took 222 milliseconds.

Regular expressions were 2.846 times faster but the time taken by the iterative method wasn't a huge amount of time. In terms of the application (which had a maximum of 50,000), and in term of the discussion, there wasn't that much to be concerned about. Either way would have done the job.

So the benchmark was increase to 1,000,000 records, file size 31.8mb.

TaDa benchmark method and my benchmark method changed somewhat, TaDa read the file line by line while I just loaded the file in one big 31.8 mb hit. The results where interesting...

TaDa's method (using Reader.Peek and ReadLine) resulted in the following times (including file access).

Using regular expressions it took 2760 milliseconds.

Using iterative method it took 4232 milliseconds.

My method after reading the whole file into a string....

Using regular expressions it took 235 milliseconds.

Using iterative method it took 14052 milliseconds.

But then I ran my application again....

Using regular expressions it took 3 milliseconds.

Using iterative method it took 2136 milliseconds.

. o O (WTF? 3 milliseconds!) I thought, how was that possible !!

I expected a bit of difference between the times needed for iterative approaches as there was quite a lot of memory getting used and the GC would have randomly kicked in but how, how can parsing a 31mb file only take 3 milliseconds. Turns out regular expressions are compiled into assemblies the same as XSLT stylesheets so in the initial call there is a bit of a performance hit but after that things are better. I presume that this is case here and that this assembly is cached for later application runs, but that is quite a saving.

So I re-ran TaDa's methods, that used Peek and ReadLine, wondering why the same saving wasn't being made in the regular expression that were ran line per line. In TaDa's methods after reading each line he called the Regex.Matches() method passing in the string that was being search for, and it dawned on me.

Regular expression must be getting compiled based on both pattern and string, if the string changes then a new compile occurs, and this compile must be getting saved to only one assembly, this assembly getting over written per compile. In TaDa's code where he was passing a different string each time to the Matches() method it was causing a new compile as the string he passed was different to the string he passed the last time, a little bit of time again and again for each different string passed to Regex.Matches().

In my method I just threw the whole string into the Regex.Matches() method, each and ever time the string didn't change, no compile was ever needed. So the first time the application ran there was a hit but after that the string never changed so no compile happened and that results in a 3 milliseconds result.

Problem is.... when the regular expression returns a Matches collection containing over 1,000,000 records it take a bit of time to return the count of that collection. However I am not sure what similar performance hit applies to the iterative method.

So I think this is important when it comes to using regular expressions. They are faster than looping character per character. On small amount of data there isn't that much of a difference, not enough to cause any great problems but as the data set increases then regular expressions are still faster but how you use them can have a massive effect on the amount of time it takes, so do you go line by line, do you batch your lines, or do you throw the whole lot at regex. I think the answer, as with most situations, depends on the project and the data, but, with after this test it turns out, with large amounts of data, you really need to think about.

Published Monday, May 21, 2007 5:08 PM by dsmyth
Filed under:

Comments

No Comments

The leading UI suite for ASP.NET - Telerik radControls
Outstanding performance. Full ASP.NET AJAX support. Nearly codeless development.