RETable

RETable takes a RE specification from RegExp and builds a state machine using the machine specification in MatchEng . This machine is used by MatchGCC to execute the search.

Building the tables is implemented in three phases:

· Each RE component is expanded into its own state table, working from first to last component,

· The tables are scanned from last to first, looking for optimisations between pairs of tables, and

· A match start state is added to the front, to scan for potential match positions within the buffer.

In addition to the tables themselves, RETable also writes a short summary of the contents of each table in a "flags" list. This list is used to direct the optimisation of the tables, since the nature of each table becomes difficult or impossible to infer after expansion. These flags contain some similarities to the types and modifier fields used by RegExp, but capture the information in a format tailored for RETable's own needs.

Expanding the state tables for literals, classes and all-but classes is straightforward: We initialise each table to NO_MATCH , then change the action for each matching character to ADVANCE (including catering for case insensitivity if selected). Finally, we apply any iteration specified to the table, including +, * and ? modifiers. For example, if a table has "* " iteration, we change each ADVANCE entry to AGAIN_PUSH_ADVANCE , and change NO_MATCH entries to BACK_AND_ADVANCE .

The optimisation phase analyses the tables right-to-left. This is done so that optimisations can ripple (cascade) through multiple tables in some cases. The classic optimisation case is iteration, such as searching for " x.*y": During the " .*" state, we can eliminate pushing a backtrack reference to the next state if we can see that it won't match: in the case of "x.*y ", this means that only the entry for "y " contains AGAIN_PUSH_ADVANCE ; all other instances of this action are optimised to become AGAIN since the next table has NO_MATCH in the corresponding entry. This represents an enormous saving in backtracking effort, and lifts the performance of the NFA to near or even past the DFA engine in some circumstances. Sadly, while we can often avoid the NFA's weaknesses, we can't always, and then the performance of the engine degrades exponentially.

The match start state scans the buffer looking for text that matches the leftmost components of the RE. The scan is optimised to use Boyer-Moore string searching or CPU byte scanning if appropriate. If line numbering is requested, this function can be executed in parallel to byte searching for a very low cost. The Boyer-Moore code is also interesting as it extends the concept to include classes, so strings of classes such as "[abc][def][ghi][jkl] " run extremely quickly (Grouse Grep was nearly 2.4 times faster than GNU grep in one test of counting lines with this pattern in a large text file).

Public routines:

Init -- Prepare module for operation

Start -- Begin managing what has to be managed

AllocBacktrack -- (Re-)Allocate space for backtracking

Expand -- Convert "compact" RE codes to optimised table form

Private routines:

SetState -- Write actions and flags for state

Create -- Acquire resources and initialise memory

WordMetaState -- Set up entries to test word characteristics

AddMatchChars -- Write action for chars that match RE code

ApplyIteration -- Apply RE iteration modifier to tables

AnchoredToEnd -- Set up tables where RE is merely end anchor

ScanForLineStart -- Search for LF for anchored search

AnalyseStart -- Look for easy(ish) components at start of RE

ScanByBoyerMoore -- Set up start state to use BM

ScanByByteSearch -- Set up start state to use CPU instruction

ScanSequentially -- Set up start state to scan buffer

PrepareStart -- Set up state to scan buffer for match

IterationOpt -- Use following table to optimise backtracking

ZeroTripOpt -- Use following table to optimise backtracking

MetaMatchOpt -- Optimise zero-with (meta) tables

TrimWordAdvance -- Trim table since next comes a word test

ZeroTripOptEnd -- Eliminate redundancy where z-trip at end

OptimiseTables -- Optimise by comparing table activities

WordFinalState -- Prepare success state for word match

CopyLFtoCR -- Make CR serve as a line terminator