Comparison to GNU Grep

Grouse Grep hasn't been used by anywhere near as many people as GNU Grep, so regrettably the code is certain to have bugs. Do not rely on the program to perform any important work.

I've tried to minimise the bug level in the code, since it's unacceptable for me to claim a performance improvement if the improvement is a by-product of incorrect behaviour. However, if possible, do try to use the program in parallel with using GNU Grep, and let me know about any problems you find. It's only through use that confidence in the program can build, and it's worth investing this time, as some of the improvements are substantial.

The speedup percentages below are only a rough guide based on "typical" files: Your mileage may vary. An example is searching for "[~^]", where the Grouse FSA can search almost twice as fast as the GNU code: If the search never matches, then Grouse Grep will execute almost twice as quickly as GNU grep; however, if the search matches on the first character of every line, then both searches will spend most of their time skipping forward to the next line (byte search for LF), and so execution times will be very similar.

Note that all comparisons here are against GNU Grep Version 2.3. Version 2.4 runs substantially slower since it uses read(2) to obtain file data instead of mmap(2) , whereas both Grouse Grep and GNU Grep use mmap . The comparisons are drawn based on searching large files; searches covering many small files will show less improvement as Grouse Grep's buffer handling costs are higher than GNU's.

The areas where Grouse Grep performs better than GNU Grep (including rough percentages) are:

· Case-sensitive simple string searches (1-2 chars): 2% faster (buffer handing)

· Case-sensitive simple string searches (3-5 chars): 3-10% (STBM, buffer)

· Case-sensitive simple string searches (6+ chars): 1-5% (STBM, buffer)

· Case-insensitive simple string searches (1-2 chars): 20-50% faster (FSA)

· Case-insensitive simple string searches (3-5 chars): 15-25% (STBM)

· Case-insensitive simple string searches (6+ chars): 10-15% (STBM)

Also note that STBM is less vulnerable to pathological cases.

· Single class: [a-z] 20-25%, [0-9] 15-20%, [~^] 45% (FSA)

· Simple string of classes (e.g. [a-z][0-9][a-z][0-9][a-z] , [0-9][a-z][0-9][a-z][0-9] ): 15- 45% (FSA)

· Single-byte classes, e.g. [j], [jJ]: 40-60% (converted from class to literal or case-insensitive literal)

Other components such as the '?' option and '*' and '+' iteration are far too variable to quantify: they may be substantially faster or be thousands of times slower, or anywhere in between. The slow cases cannot be fixed without rewriting the engine to implement a DFA instead of a NFA. In particular, the results for iterative searches like "x.*y.*z " need to treated with a lot of caution -- the speed of search in this case is mostly just the cost of searching for " z", due to the intervention of EasiestFirst.

Extended RE components such as alternation, parenthesis, backreferences and {n,m} repeat specifiers are not supported.

Many features of GNU grep, such as full directory recursion options and before/after context reporting, have not been implemented. Support for "--" options is uneven.

A number of people have suggested that I add the GNU DFA engine to Grouse Grep, so that I can call on this superb piece of engineering whenever my code can't handle the request or performs too poorly. I've strongly resisted this suggestion, as I believe a new DFA implementation, with the design and implementation intimately based on the Grouse FSA, would be an extremely fast and powerful search engine, so I didn't want to clutter things up by including the existing DFA implementation.