RegExp

(A note on filenames: I've renamed the module's files to regexp00.h and regexp00.c , to avoid any confusion with the wonderful program modules already published by Henry Spencer and others. The name used inside the sources continues to be RegExp.)

"Regular expression" (RE) is a concept that has come from the mathematics community and has become a vital component in computer systems, especially GNU/Linux. What make REs so useful is that they capture concepts that are easy and naturally applied by humans to searching, but which aren't provided by simple search facilities -- concepts like "the start of a line" or "a vowel" or "a string of digits" or "the end of a word".

The text notation used to describe the RE components is derived from the mathematical notation developed along with the concept. Unix utilities such as grep use a compact but readable notation:

literal Most characters merely match themselves (literals)

\literal \ escapes "special" characters to defeat their special meaning

. Match any character, except LF (or CR /LF if -R given)

^ Anchor search to the start of the line

$ Anchor search to the end of the line

[0-9aeiou] Match any member of class

[^0-9aeiou] Match all but the characters specified in the class

[]] Place "]" first in class if it's a class member (otherwise it closes the class spec)

[aeiou-] Place " -" last in class if it's a class member (otherwise it's a range specifier)

[x^] Place " ^" anywhere but first to stop the "all but" meaning

[[:alnum:]] Match any alphabetic or numeric value; also [:alpha:] , [:cntrl:], [:digit:] , [:graph:], [:lower:], [:print:] , [:punct:], [:space:], [:upper:] , [:xdigit:]

\< Match start-of-word boundary

\> Match end-of-word boundary

\b Match word boundary

\B Match word nonboundary

* Match previous RE component 0 or more times

+ Match previous RE component 1 or more times

? Match previous RE component 0 or 1 times (make element optional)

While this notation is quite terse but expressive, it is inconvenient to use when implementing the search, mainly because many elements are variable in length, but also because the interpretation of the text depends on the context, which can be awkward to maintain. RegExp provides a more compact and uniform way of describing the RE, so that these problems are overcome.

Compact RE format

The "compact" format is built around 32-bit numbers, so that most RE elements can be described in a single number, and more verbose components such as classes that can't be described in 32 bits can at least appear in a compact and efficient form using a fixed number of elements:

regexp01.jpgThe fields and their meanings are:

type The RE type. Types near 63 are meta elements: they test the buffer but don't add any text to the match. Other codes (5-55) are reserved for future expansion, (e.g. extended RE support).

0 End of RE

1 Literal

2 Class

3 All-but class

4 Match any

63 Anchor start

62 Anchor end

61 Word beginning

60 Word end

59 Word boundary

58 Word nonboundary

57 Word: Previous nonword

56 Dead end

repeat Element repetition. Codes are (with {min, max} matches):

0 Match self (no modifier): {1, 1}

1 Optional element: ("?" modifier): {0, 1}

2 Repeat 1 or more times ("+" modifier): {1, *}

3 Repeat 0 or more times ("*" modifier): {0, *}

4-7 Reserved for extended REs, probably including:

? Repeat at least n times: {n, *} (n supplied in next 32-bit RE element?)

? Repeat at least n but no more than m times: {n, m} ( n, m in next 2 elements?)

case insensitive Flag marking if element is case insensitive. 0 = case sensitive; 1 = case insensitive

word Restrict this element's match based on word characters ([0-9A-Za-z_]):

0 No word restriction

1 Restrict match to word characters only

2 Restrict match to nonword characters only

3 Reserved for future use

number Storage for a 20-bit number, currently not used. In the future, might help describe structure of extended RE elements (e.g. offset to next alternative in alternation?)

literal Storage for a 16-bit literal, which big enough to handle Unicode glyphs, although Unicode isn't supported elsewhere in the code (yet).

Utility Functions

In addition to publishing the compact format -- including defining a set of macros to help access the fields above -- RegExp also provides a set of utility functions. These include functions to compile a string into the compact format and to display a compact RE in a human-readable format.

During all RE handling (and also during table expansion), the code tries to be careful about locale by using the ANSI standard character class test functions defined in ctype.h . Some inconsistent behaviour might occur if the locale is changed between RE compilation, table expansion, and match engine execution. While we're on the topic of international support, it's worth noting that this architecture can greatly assist the support of international character sets and Unicode -- including UTF-8 -- but that the initial release does not attempt to explore any of this potential.

Another set of utility functions manipulate the RE to implement various optimisations. It's easier and more efficient to analyse and manipulate the RE in this compact format than when the RE has been expanded into the full tables of the Grouse FSA. The two main optimisations are called EasiestFirst and SlashEnds.

EasiestFirst is one of the most critical components of Grouse Grep: It has to try and cover up as much of the problems introduced by the NFA match engine architecture as possible. Quite often, it manages to succeed, but in general the exponential search times of the NFA simply can't be avoided (but changing the implementation to use a DFA would require much, if not all, of Grouse Grep to be rewritten). EasiestFirst works by looking for "easy" RE elements to search (strings of literals, or classes with limited members), and if found, extracting the easy bit into a separate RE so that the search engine can scan for that component first. The definition of "easy" is fairly heavily dependent on guesses about the nature of the data to be searched, so this code is likely to be fooled by pathological cases.

SlashEnds looks at the start and finish of the RE and deletes any optional components. This speeds the search by avoiding extra work where the caller doesn't care exactly what the match is, but merely wants to report the matching lines (or perhaps just count the lines, or even just report the filename). The code edits the RE using the following steps:

1. Remove optional elements at the front, e.g. "a*b?c+defgh+i*j? " becomes "c+defgh+i*j? "

2. Edit any leading ONE_OR_MORE element to remove the iteration, e.g. "c+defgh+i*j* " becomes "cdefgh+i*j? "

3. Remove optional elements from the back, e.g. "cdefgh+i*j? " becomes "cdefgh+ "

4. Edit any trailing ONE_OR_MORE element to remove the iteration, e.g. "cdefgh+ " becomes "cdefgh "

I chose the term "slash" because I used to play lacrosse!

Public routines:

Init -- Prepare module for operation

Compile -- Check and convert text expression to compact form

CompileString -- Convert string without looking for RE syntax

EasiestFirst -- Improve search by deferring complex starts

SlashEnds -- Remove start/end elements for inexact matches

AllOptional -- Check RE specs after end-of-line spec

Diagnose -- Report reason for last failure

ShowCodes -- Disassemble RE coding (for debugging)

Private routines:

ClassName -- Handle named classes (e.g. [:punct:])

CountMembers -- Report how many members are in the set

ClassCompile -- Decode class specifier and emit RE codes

Compact -- Render text expression in compact form

CountStates -- Work out how many states are in RE

ExtractRE -- Copy specified RE components into new spec

Frequency -- Estimate how frequently we might match string

FindOptional -- Find sequence of optional elements in RE

FindEnd -- Find end of RE

RemoveCodes -- Remove optional codes from start or end