Grouse FSA (MatchEng/RETable/MatchGCC/TblDisp)

Undoubtedly the technology in Grouse Grep with the most potential is the Grouse Finite-State Architecture (Grouse FSA), which is in fact just a couple of extremely old and well-loved software techniques dusted off and given a new coat of paint. These techniques are table-driven finite-state machines and threaded assembly.

Table-driven finite-state machines have been around for years: in "The Mythical Man-Month", Fred Brooks claims (while complaining about the inefficiency of flowcharts) "Show me your flowcharts and I'll continue to be mystified, but show me your tables and all will be revealed." Threaded assembly has also been around for years, most notably in FORTH (and PostScript is the most obvious recent offspring of the FORTH dynasty).

I discovered a particularly compact yet efficient and expressive way of combining these techniques for byte-oriented applications on the Intel Ix86 family: I found that the assembly-language sequence:

        lodsb
        xlat
        jmp ax

formed the basis of an engine that ran extremely quickly, yet was expressive enough that I could implement many optimisations and come up with stunningly fast programs. For example, one program published along with the Dr. Dobb's article was a Huffman decoder that could decompress a bit-aligned compressed memory buffer and produce 25 megabytes of output bytes per second on a 350MHz Pentium system -- that's enough data to draw a 1024x768 display at 31 frames per second.

In this release, I've ported the original MS-DOS program to GNU/Linux, and have used the "computed goto" extension in GCC to implement the FSA in C rather than in assembly. However, GCC doesn't understand all the nuances of the architecture, and so does not implement some of the optimisations that were identified in the original handcrafted assembly. I've fixed this by using a Perl script that analyses the assembly output from GCC and edits the instructions to reinstate the missing optimisations.

What are Finite-State Machines?

The finite-state machine is a simple concept that arose in hardware engineering but has been applied very effectively to software components. It is essentially a technique for organising the operation of a machine into a more understandable, flexible and maintainable framework.

The components of a finite-state machine are:

· A set of states, with each state given a descriptive name,

· A list of events that are handled by the machine, and

· A set of actions that describe how the machine reacts to an event (possibly including changing state).

In a table-driven state machine, the event encoding is chosen so that the event's code can be used as an index into a table. Each state gets its own table, with the table's values describing the action to perform. This is why table-driven finite-state machines execute very quickly: Looking up the action in the table is a very simple indexing operation, and can be implemented very efficiently.

Word/Line Count Utility

Applications such as grep and the Huffman decoder end up with a multitude of actions and a complex set of states once you start striving for performance, so let's introduce the architecture using a much smaller and simpler example: Counting words and lines in a buffer. We'll ignore the line counting at first, and focus on the word count problem:

wordtut1.jpg

As we scan the text (from left to right), we alternate between whitespace characters and word characters. Counting words becomes counting the word starts -- the transition from a whitespace character to a word character.

We define a finite-state machine to implement the word count function. The events are the incoming characters of the buffer (we also need a buffer end event, but we'll look at that later). This state machine requires three actions:

· TO_WHITE: Change state to Whitespace state

· TO_WORD: Change state to Word state, and increment word counter

· NOP: Stay in the current state

The machine's states are (fairly obviously) Whitespace and Word. So the state machine looks like:

wordtut2.jpg

Where the table-driven machine excels is in the implementation of (relatively) complex character classifications such as "word" and "whitespace". The whitespace characters (and their ASCII codes) are TAB (9), LF (10), VT (11), FF (12), CR (13) and space (32); all other characters are word characters. Counting lines is merely counting each LF character seen (while also treating LF as a whitespace character). Here is a "normal" C code fragment that implements not only the word count but also a line count on the characters in a memory buffer. Notice that the compiled code must test each character against the entire list of whitespace characters before concluding that it's a word character:

        inword = 0;
        nwords = 0;
        nlines = 0;
        while (pBuffer != pBufferEnd) {
                switch (*pBuffer++) {
                        case LF:
                                nlines++;
                                /*FALLTHROUGH*/
                        case CR:
                        case ' ':
                        case VT:
                        case FF:
                        case TAB:
                                inword = 0;
                                break;

                        default:
                                if (! inword) {
                                        inword = 1;
                                        nwords++;
                                }
                                break;

                }
        }

This code takes on average (very roughly) 16 instructions per character, mainly in the chain of whitespace character tests. (I'll ignore issues such as the pipelined execution available in modern CPUs.) However, using the Grouse FSA, we can bring this cost down substantially. We extend the finite-state machine to treat LF specially, so that the special actions (LF_WHITE and LF_NOP) treat LF as a whitespace character while also incrementing the line count. We also eliminate the buffer end test by choosing a character we believe is unlikely to appear in the buffer, appending this character to the end of the buffer, and modifying the state machine so that we only test for buffer end when this character is found. For a number of reasons, the favourite Grouse byte value to use as this endmarker is 0xee, which is a word character, so we add actions EE_WORD and EE_NOP to the state machine. The revised state machine looks like:

wordtut3.jpg

The state table for Whitespace is built as follows:

1. Initialise all entries to WORD.

2. Change entries TAB, VT, FF, CR and space to NOP.

3. Change entry LF to LF_NOP.

4. Change entry 0xee to EE_WORD.

The state table for Word is built as follows:

1. Initialise all entries to NOP.

2. Change entries TAB, VT, FF, CR and space to WHITE.

3. Change entry LF to LF_WHITE.

4. Change entry 0xee to EE_NOP.

The state table for White looks like this (shown as index ranges; the interesting entries are the ones that break the long runs of WORD):

        Index (character)        Entry
        0-8                      WORD
        9   (TAB)                NOP
        10  (LF)                 LF_NOP
        11-13 (VT, FF, CR)       NOP
        14-31                    WORD
        32  (space)              NOP
        33-237                   WORD
        238 (0xee)               EE_WORD
        239-255                  WORD

The state table for Word follows a similar pattern:

        Index (character)        Entry
        0-8                      NOP
        9   (TAB)                WHITE
        10  (LF)                 LF_WHITE
        11-13 (VT, FF, CR)       WHITE
        14-31                    NOP
        32  (space)              WHITE
        33-237                   NOP
        238 (0xee)               EE_NOP
        239-255                  NOP

Next, we look at how we encode the actions themselves. We could simply assign numbers from a list: NOP = 0, WORD = 1, WHITE = 2, LF_NOP = 3, LF_WHITE = 4, etc., and then use a switch statement to get to the code associated with each action. Notice that the code selected by each action is very, very simple and streamlined -- the finite-state machine architecture has provided the freedom for each action to become very focussed, as it's executed in a very controlled context. Here's a typical fragment of code written using standard C (some details, such as the remaining actions and the loop exit, have been omitted for brevity):

        for (;;) {
                NextCh = *pBuffer++;
                Action = pCurrentState[NextCh];
                switch (Action) {
                case NOP:
                        break;

                case LF_WHITE:
                        Lines++;
                        /*FALLTHROUGH*/

                case WHITE:
                        pCurrentState = WhiteStateTable;
                        break;

                case EE_WORD:
                        if (pBuffer > pBufferEnd) {
                                goto Finished;
                        }
                        /*FALLTHROUGH*/

                case WORD:
                        Words++;
                        pCurrentState = WordStateTable;
                        break;
                }
        }

When we analyse the performance of this routine, we find that the machine spends almost half of its time translating the action code into a code address. Since there are no restrictions on how we choose to encode the actions, we can remove this cost by using threaded assembly: the code we choose to represent each action in the state table is the address of the code that executes the action. This brings the average number of instructions per character down from (roughly) 8 to 5.

Implementing threaded assembly is demanding: in the DOS version, the engine was written in assembly by hand, and the addresses of the routines were handed to the C program via #defines. However, we have centred ourselves on GCC (and, to a lesser extent, on GNU/Linux running on Intel CPUs): GCC provides a "computed goto" extension that can handle the threaded assembly very neatly. Using this facility, we can implement the machine entirely in C, making coding far less tedious. Another bonus of the GCC memory model is that entry points are implemented as 32-bit pointers, so we are no longer restricted to a 256-byte code page. We can use the following C code to closely match the assembly:

        unsigned char NextCh;
        void *pAction;

                NextCh = *pBuffer++;            /*lodsb*/
                pAction = pStateTable[NextCh];  /*xlat*/
                goto *pAction;                  /*jmp ax*/

We need to find the entry points upon startup, so that we can use these values when setting up the state tables. This is handled by defining a structure containing a pointer for each action, filling this structure in upon startup with the actual addresses, and then using the structure variables when working with the actions.

One more optimisation: jumping to the entry point for the next action does not have to happen from a single place at the top of the loop, as it does in the switch statement -- we can execute this jump from anywhere in the routine. This means that we can optimise the code still further by looking up the next byte of the input and jumping directly to the next action at the tail of the previous action. The accumulation of the state machine and these optimisations means that we only execute on average roughly 4 instructions to handle each input byte.

So, finally, we can present the code to define and initialise this structure, and to implement the full word counting engine:

typedef struct
{
      void *pEE_NOP;
      void *pNOP;
      void *pEE_WORD;
      void *pWORD;
      void *pWHITE;
      void *pLF_WHITE;
      void *pLF_NOP;
} Wc_ActionCodes;

static Wc_ActionCodes gWcAct;

#define EE_NOP           (gWcAct.pEE_NOP)
#define NOP              (gWcAct.pNOP)
#define EE_WORD          (gWcAct.pEE_WORD)
#define WORD             (gWcAct.pWORD)
#define WHITE            (gWcAct.pWHITE)
#define LF_WHITE         (gWcAct.pLF_WHITE)
#define LF_NOP           (gWcAct.pLF_NOP)

void
CountC(Wc_ActionCodes *pActionCodes, 
       UINT16 NrBytes, 
       BYTE *pText)
{
        unsigned char NextCh;
        void *pAction;
        void **pTab = pCurrentTable;
        unsigned char *pEnd = &pText[NrBytes];
        /*Words, Lines, pCurrentTable, WordTable and WhiteTable are
          file-scope variables (declarations not shown here)*/

        /*Does the client want to receive the action entry points?*/

        if (pActionCodes != NULL) {
                /*Yes, write them now and then return*/
                pActionCodes->pEE_NOP   = &&entryEE_NOP;
                pActionCodes->pNOP      = &&entryNOP;
                pActionCodes->pEE_WORD  = &&entryEE_WORD;
                pActionCodes->pWORD     = &&entryWORD;
                pActionCodes->pWHITE    = &&entryWHITE;
                pActionCodes->pLF_WHITE = &&entryLF_WHITE;
                pActionCodes->pLF_NOP   = &&entryLF_NOP;
                return;
        }

        /*Add endmarker to buffer to trigger end test below*/
        *pEnd++ = ENDMARKER;

        /*Loop through each byte*/

entryEE_NOP:
        if (pText == pEnd) goto Finished;
        /*FALLTHROUGH*/
entryNOP:
        NextCh = *pText++;
        pAction = pTab[NextCh];
        goto *pAction;

entryEE_WORD:  
        if (pText == pEnd) goto Finished;
        /*FALLTHROUGH*/
entryWORD:
        Words++;
        pTab = WordTable;
        NextCh = *pText++;
        pAction = pTab[NextCh];
        goto *pAction;

entryWHITE:
        pTab = WhiteTable;
        NextCh = *pText++;
        pAction = pTab[NextCh];
        goto *pAction;

entryLF_WHITE: 
        pTab = WhiteTable;
        /*FALLTHROUGH*/
entryLF_NOP:   
        Lines++;
        NextCh = *pText++;
        pAction = pTab[NextCh];
        goto *pAction;

Finished:
        /*End of buffer found: remember in-word context and exit*/
        pCurrentTable = pTab;
        return;

} /*CountC*/

Improving GCC output

While the code generated by GCC is fast and efficient, an inspection of the generated code shows that there is still some room for improvement. In particular, GCC generates code like the following for NOP:

        movb (%edx),%al
        incl %edx
        andl $255,%eax
        jmp *(%ebx,%eax,4)

GCC does not assume that %eah (the high three bytes of %eax) is zero, and so adds the andl instruction above to complete the byte load into the register. However, this instruction can be eliminated in almost all places it occurs if we force %eah to be zero when we enter the machine, and demand that each action ensure this condition is also met at the instant where control is handed to the next action. This is a substantial benefit: in the case of the word count utility, this optimisation carves nearly 9% off the execution time. (Controlling %eah is part of the original Grouse FSA: for a particularly interesting use of %eah, see the Grouse Huffman decoder, where this register is used to keep track of the bit alignment (0..7) of the input bitstream.)
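With %eah guaranteed to be zero on entry to each action, the andl can be deleted, and the dispatch sequence for NOP reduces to:

        movb (%edx),%al
        incl %edx
        jmp *(%ebx,%eax,4)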

Grouse Grep for GNU/Linux includes a Perl script (called trimasm) that analyses the compiler's output and implements this optimisation. This script is quite sophisticated in its ability to identify the finite-state entry points, to trace code execution to keep track of the %eah register, and then to delete or edit instructions as appropriate where it identifies that the optimisation can be made. In the longer term, this analysis is best done by the compiler itself, perhaps with a switch advising the compiler of this architecture, so that it can deal with the register as part of its normal register allocation algorithm.

I'm also curious to know how the original "lodsb; xlat; jmp ax" code stacks up against the "movb (%edx),%al; incl %edx; jmp *(%ebx,%eax,4)" code. The original 256-byte code page fitted entirely into the on-chip CPU instruction cache, whereas the current GCC-based version requires over 1600 bytes: Does this create a performance bottleneck? Some feedback from the FORTH community suggests that the GCC version should work better since it has more of a load/store architecture, and therefore better exploits the pipelined architecture of modern CPUs. These and many other issues and trade-offs are worth investigating, but I haven't had the time.

Designing for the Grouse FSA

Designing applications to use this architecture is interesting, creative and rewarding:

· Work out a way of framing the incoming data so that you can meaningfully think about it exactly one byte at a time;

· Analyse how you need to react to each byte, and try to organise the context of the overall engine into a small(ish) set of states; and

· Define a small(ish) set of actions that describe how to deal with each byte in each state, including describing how to change state where required.

These steps are usually repeated in an iterative fashion: an unwieldy set of actions indicates problems with the framing or the state decomposition, while a set of states larger than the available memory suggests the actions are too simple or the byte-oriented framing is inappropriate. Analysing the nature of the problem can often suggest ways to modify other components to ease the difficulty.

Framing the input is especially interesting, as the coverage provided by the tables means that unusual combinations of information that would be too awkward for conventional techniques can work extremely well. The table-driven architecture is particularly good at implementing sophisticated and highly efficient classification of the input. For example, I've been toying with a move generator for a chess program, based on summarising 4 squares in one byte by assigning a two-bit code to each cell: 00 = empty, 01 = white piece, 10 = black piece, 11 = outside edge of board. Sadly, this project hasn't received enough attention for me to know if it's worthwhile.

Perhaps the most startling possibility of input framing being an advantage is in searching DNA sequences: if each base (C, G, A and T) is encoded as a two-bit number, then four bases can be packed into a single byte, and fully evaluated with a single table lookup. Setting up the tables and defining the search actions is tricky, but could include support for limited substitution, insertion and/or deletion of bases. Packing bases like this, instead of storing the sequences with one base per character, means that I/O bandwidth requirements are slashed by a factor of 4, and seeing four bases in parallel cuts down the bookkeeping the search has to maintain. The Grouse Huffman decoder already demonstrates that a bit-level engine can operate on an input bitstream at 14 cycles per output byte on a Pentium. This translates to roughly 10 million input bytes per second on a Pentium 350; if this performance were achieved by a DNA search engine, it would result in a search speed of 40 million bases a second on a single CPU.