Design and Coding Style

Designing and implementing a high-performance complex program such as Grouse Grep places significant demands on the designer and coder. Certainly the code is not perfect -- there are deficiencies in many areas, such as gross design issues such as implementing an NFA engine versus the more powerful DFA engine, through mid-level issues such as the unwieldy, rambling and somewhat confused search/reporting structures in matcheng.h and scanfile.h , right down to minor coding style issues such as the dubious use of all capitals in the type names in compdef.h .

However, substantial discipline has been exerted throughout the entire project, as a necessity rather than a luxury, and this has helped improve the design, implementation and testing of the program. This page seeks to lay out the various disciplines that were applied (some are more subtle than others), and to discuss the pros and cons of the choices made. I hope you might find something of value in the following discussion.

Overall philosophy

At a basic level, any project must have a clearly defined set of requirements, including functional requirements, performance demands and resource restrictions. This is a large but well-understood area that I won't cover here. Suffice to say that Grouse Grep was required to provide features comparable to GNU Grep's basic RE search capability, but had to demonstrate outstanding performance using only modest resource consumption. (Incidentally, I'm not satisfied that the memory consumption of Grouse Grep meets this requirement, but this is a by-product of the NFA engine implementation which should be fixed when (if?) a DFA engine is implemented, so I've chosen to compromise on this criteria for now.)

Within the requirements, there is substantial freedom to work towards other goals in the design and implementation decisions. There is one goal that drives my design and implementation decisions far more than any other:

It is imperative to reuse existing engineering rather than reengineer whereever possible. If you are engineering something new, it is vital that you compose reusable pieces as you work to improve your ability to reuse in the future. Reuse is the only viable way to bring down the cost of engineering systems.

That's a rather broad statement, and it's obviously very hard to achieve this goal, but the statement can be broken down into a number of specific directives that shape how you approach various tasks.

The first outcome has a couple of surprising components: You must provide substantial benefits to the engineer if you are to convince him or her to reuse rather than reengineer. Superior performance is a very substantial benefit; a well-designed interface is enticing; reducing resource consumption may be attractive; the engineer may enjoy the reduction in testing that results from reusing well-tested components. The last component -- testing -- is one of the least enticing benefits for most engineers, but only because most people grossly underestimate the effort needed to properly test a new piece of engineering. However, this component can be the most effective, if you can properly demonstrate the savings that result from reusing.

The reason that engineers need to see substantial benefits is because of ego: if you look at virtually any person designing or implementing software, and you ask them, "How did you get into this field?" the answer will always be the same: "I found I got a buzz from being able to create this code myself, and others were willing to pay me to create code for them, so that's why I'm here." It takes an extraordinary effort to convince people that came from this self-motivated background that reusing someone else's work is better than creating stuff for themselves. This is not to blame anyone for any deficiency; it's just the way the industry is. In the case of Grouse Grep, this meant that I demanded that my code outperform GNU Grep in substantial areas and by a substantial margin, and that I had to write a program with very few bugs and with comparable features before I could reasonably invite others to evaluate my code, let alone suggest that they might benefit from reusing it.

The next outcome looks very different, but is in fact a variation on the same theme: Your code must be very readable and conform to widely-used idioms , rather than demanding the reader become accustomed to your style. What I've chosen to interpret as "readable and conforming" is a rather controversial position: I've tried to pluck the best benefits of the literate programming genre and use them within conventional, idiomatic code. The desire to fit into the common programming idioms means that many of the key ideas of literate programming (like presenting lines of code in a different order to the order used by the compiler) are not allowed.

The main idea from literate programming that I've tried to incorporate is that you should try to write your code so that it reads like a terse essay, but not to the point where the essay style overwhelms common code idioms. Reading essay-style code lets people reuse skills they acquired long before they began working with computers.

(One idea in the coding style is that bad code should stick out as looking ugly -- the rule on long variable names below is an example of this. Another rule seeks to produce the same result: 8-column tab stops must be used, and the tabs must be implemented by using spaces, and never the tab character. This rule means that heavily nested code tends to run off the right side of the page, which is a very strong hint that it's time to simplify the code. There is some discretion in applying most of these rules, but exceptions within the Grouse code should be quite rare.)

Organisation

The basic idea is to write ANSI C with the discipline imposed by Modula-2. The improvements to C incorporated in ANSI C are substantial and make coding in C far easier. I've chosen to completely ignore non-ANSI systems, and to embrace common practice in modern systems such as external identifiers larger than 8 characters. Code must be presented using a fixed font, and adding spaces to align code is recommended if the alignment helps point out relationships or patterns in the code.

Using colours to highlight syntax is mildly discouraged: the nature of each token is relatively unimportant, and it's how the code reads in black and white that really matters. Colours added to essay-style text usually pick out relatively unimportant detail, and help obscure more important issues. In many applications, of course, colours are invaluable as an aid to understanding subtle patterns in the information being presented. I have used colours in a limited fashion when presenting the code here, mainly to highlight function headers and to distinguish comments from executable code.

The code is fairly strictly divided into modules, with each module having an interface definition header file (module.h) and an implementation file (module.c). This exactly mirrors Modula-2's .DEF and .MOD files. Each header file must be idempotent (it may be included multiple times without generating multiple declarations). The headers are also protected to define C interfaces if referenced from C++.

Function names are always defined and documented as simple names (e.g. the function Compile in RETable). However, the Modula-2 style means that function names defined to the C compiler, used by the linker, and used by other modules must be made unique by being prefixed with the module name and an underscore, (e.g. RETable_Compile ). This means that the module and the routine are clearly documented each time the routine is referenced in the code.

#include statements are organised into alphabetical order, and a simple rule greatly assists reuse: if you reference it, you include it. This makes the code more reusable: it stops battles over ordering problems with include files, and means that you can reuse any file with confidence, instead of being surprised when a hidden connection is broken.

Slicing

The key to achieving a good design is to decompose the problem into a set of smaller problems, such that the complexity of each subproblem, plus the connections that are needed between the subproblems, are more manageable than the original. In most cases "manageable" is defined as (a) simpler to understand and (b) efficient to implement (although many engineers are tricked into ranking efficiency as more important than simplicity).

It's easy to focus on the details of the internals of each subproblem, and gloss over the interfaces between the components. But this is wrong, for two reasons: the internals are easier to define and specify than the interfaces, and the overall quality of the design is much more heavily dependent on the connections than on the internals.

So to be a good designer, try to learn good ways to slice up big problems, and especially to be able to evaluate the pros and cons of a proposed interface as brutally honestly as you can manage.

Here are the things to consider at the interface when proposing to divide some problem area into two components:

· The names of the routines that define messages that can pass across the interface, and the variables relevant to each routine;

· The protocol of how and when routines are called, and the meaning of each result.

· The imaginary machine that each party believes that it's dealing with on the other side of the interface, including an image and a language to describe the salient features of the image and to issue requests to the machine.

· The actual machine implementation on each side of the fence.

Global variables published by one module but able to manipulated by other modules are strongly discouraged, and (with the possible exception of errno) not used at all inside any of the Grouse code. The goal of introducing an interface is to create a division such that information maintained on each side of the slice and information flowing across the boundary is well controlled; global variables live in no-man's land and control is much more difficult to trace. A notable exception to the global variables rule is the structure containing the entry points for each action in gMatchEngAct. This exception is partially historical (I was very tentative when first playing with this code), and partially because the FSA is such a dominant part of the entire system.

Any display of the internals behind an interface is strongly discouraged, and slicing should aim at letting objects be reused by relinking the object, not by recompiling. This last comment means that I dislike the design of C++, where the private variables of a class are described in the public interface file (despite the fact that the language defines that outsiders can't manipulate or reference those variables) -- the compiler still uses these definitions of these variables to compute various sizes, and so the implementation can't be modified to add or remove variables without forcing all users of the interface to recompile. In this case, the place where the interface ends and the implementation starts is very blurry -- a clear sign of poor slicing.

A corollary of this emphasis on slicing is that it greatly discourages the use of conditional compilation: the code bracketed by a #if...#endif block has a complex and poorly-documented interface between the code inside the block and the code outside. Conditional compilation also makes testing much harder -- an exhaustive test rig should step through each combination of conditions before signing off that a module is correct, so each additional condition doubles the amount of testing required.

The only use of #if...#endif in the Grouse code has been to make the headers idempotent and to cater for C++ users; other uses of conditional compilation are not permitted in the final code (I do use #if 0 ... #endif during development as a convenient way to temporarily/tentatively remove a block of code). Where platform-dependent code is required, such as creating highlighted match output, the platform-specific code is explicitly moved to an externally-provided function, called by an explicit interface.

Hierarchies

Okay, time for probably the most controversial bit of my style -- I do not have nearly as much faith as most people in hierarchies as a way of organising problem decomposition and in organising the integration of the pieces into a workable whole. In particular, I believe that the object-oriented crowd gets it wrong when they propose to use a hierarchy such as trees of classes as a model for analysing and implementing systems.

More exactly, I divide hierarchy into three scales:

· Macro scale, where the technology and concepts at each step of the hierarchy is completely foreign to the steps below or above,

· Micro scale, where you impose priorities on a local set of components to organise their functionality, and

· Midi scale, which is neither macro nor micro, but is something in between.

I find that the macro and micro hierarchy scales are very valuable, but that the midi scale is at best unproductive, and at worst very costly in that it makes code reuse much harder to achieve.

This viewpoint comes partially from observation of real-world systems, such as hardware: Macro-scale and micro-scale hierarchies are easy to find; midi-scale hierarchies are noticeably absent. For example, there are only a handful of macro-scale technology steps between the fibre-optic cables of the internet and the individual transistors in a PC's CPU. (Roughly: Transistor/ macrocell/IC/circuit board/backplane/LAN/WAN/internet.) There are local priorities between modules on a circuit board (such as line driver enables). But in-between organisation such as chips piggybacked on other chips is extremely rare, and boards piggybacked on other boards is quite uncommon (except perhaps to produce a bus-style arrangement, where the modules function as peers anyway).

For an example of macro hierarchy changes, look at the changes in technology at each step of the internet: the LAN wires together systems; each system's backplane wires together circuit boards; each circuit board wires together ICs; each IC is made by wiring together macrocells; the macrocells are formed by wiring together transistors. When the higher technology in a step makes the next lower technology look more or less just like a handful of black boxes wired together, you know you've got the technology jump right. Too much complexity in wiring all the boxes together means the jump is too large; if the interface from the box edges to the wires isn't sharply defined, such that there's blur between where the higher level begins and lower level ends, then the jump is too small.

Software similarly provides examples of macro and micro hierarchy. An example of a macro technology jump is the relationship between command shells such as bash and filter programs such as grep, cut, paste, and sed, where pipes are used to wire the filters together. We can observe micro hierarchy when we observe the use of precedence in expressions and the local interactions between statements. Here, we use hierarchy to organise the local interaction of peer elements. See how defining relationships between components into micro-hierarchies helps organise the code:

        Sum = 0;
        for (i = 0; i < 10; i++) {
                Sum += x[i] * (x[i] + 14);
        }

The problem with midi-scale hierarchies is that there is a mistaken hope that there is a continuum between the self-organising properties of the micro scale and the different-technology organising properties of the macro scale which the midi scale can exploit. This is incorrect: the two styles of organisation are incompatible. Any attempt to blend the two ends up producing a hazy jumble in the mid levels.

Note that the components within each technology step are defined and implemented as peers. Components are placed side-by-side, and then wired together. There is enormous flexibility in choosing components and creating relationships. Contrast this side-by-side peer approach with the class hierarchy approach of the OO approach: if you pick on any class in the middle of a tree of classes, and try to extract the transformations in that class into a stand-alone, reusable component, you're very unlikely to succeed: the details of the class's interface and implementation is intimately controlled by its container classes. The result is base class wars: Which base class are you forced to choose today? (Or equivalently: "The standard libraries are a drag in this application because...")

If you look at the definition of modules in Grouse Grep, you'll see that I've tried to encourage the peer-component approach. The code in Platform, and to a lesser extent in GGrep, instantiate modules as peers, use registration functions to wire the components together, then sets the result going. The generic and reusable Self-Tuning Boyer-Moore code (STBM) is fitted into the grep program by the addition of an peer-level interface (shim) module: STBMShim. This approach gives rise to interface-definition modules like MatchEng that set out how other modules are to interact -- in this case, RETable, MatchGCC and TblDisp. Remember that the interface is a contract between parties, and that there's no reason why a contract between two parties can't be based on a contract written by a third party.

Having large technology jumps provides freedom to engineer better solutions. For example, in the original DOS version of Grouse Grep, either an ANSI C or a highly-optimised assembly search implementation could be used, without the rest of the system realising what technology was being used. This gave freedom to quickly write and debug the high-level state machine states and actions in C, yet allowed the very high performance assembly version to be swapped in whenever desired. Smaller, more incremental steps would have made this style of development difficult or impossible -- each interface makes connections between components explicit, and brings in inertia into the project that fights against change. Some inertia is good: It makes goals more attainable as they don't shift around so erratically. However, too much inertia makes it unreasonably hard to make worthwhile changes to components or interfaces.

Hierarchy is a two-edged sword in another way: Too much emphasis on an hierarchical decomposition can force you to make critical decisions before you're in a position to make a good choice. Each choice should be deferred for as long as possible without impacting the final delivery date (although the scheduling of these decisions is of course an art that takes time and experience to learn). Basically, complex decisions carry high risk, whereas simple decisions are low-risk. If you can work on the low-risk items while analysing the high-risk items, you can achieve substantial progress and buy yourself time to analyse things properly and to better appreciate the ramifications of major decisions on the final design.

But not if you're forced to decompose in an hierarchical fashion. The early decisions about how things are carved up dominate all the future development, and much of the later work may be trying to work around the limitations of the original design. Again, a peer-component design style is a better approach: since the final result will be a set of peer modules all cooperating, and since the modules interfacing with external components are often straightforward drivers, these components can be engineered -- in a reusable fashion for addition to a software library -- while the more difficult central components are being investigated.

And finally, a small idea to help you test technology claims: any product that claims it "Makes things easy by removing the need to make engineering trade-offs" is incorrect: True power and freedom comes from architectures that facilitiate and promote the ability to make engineering trade-offs. This is an extremely powerful and accurate test to use when evaluating the benefits of competing technologies that you might invest in to improve your engineering capabilities.

Expanding, summarising and headlines

Much of programming comes down to choosing good ways of representing data, and this in turn means finding smart ways to summarise groups of data, and to expand broad outlines into detailed descriptions.

Which is exactly the tension between the headline and the body of articles in a newspaper -- the headline seeks to summarise the story details in an interesting way, and the article body expands the story into a complete document. So a good programmer would almost certainly be a good subeditor -- the skills are the same.

In the Grouse style, this structure is made explicit in the way each module and function is introduced. The introduction consists of a name and a headline separated by a em dash (we use two hyphens to represent this dash in the fixed-font world of C code). For example, the headlines for each module are:

Bakslash -- Provide C-style "\" escape encode/decode

FastFile -- Provide low-cost buffer-based file access

GGrep    -- Regular expression search using Grouse architecture

MatchEng -- Match engine interface and supporting definitions

MatchGCC -- GNU C version of match engine

memrchr  -- Fast, right-to-left-scanning memory search

Platform -- Handle platform-specific functions (Linux version)

RegExp   -- Create and manipulate regular expressions

RETable  -- Efficient language for implementing regexp searches

ScanFile -- Manage RE search for a single file

STBM     -- Self-Tuning Boyer-Moore string search

STBMShim -- Interface module to join STBM to ScanFile

TblDisp  -- Describe state tables in concise, readable format

Tracery  -- Allow access to module trace facilities


Within each module, each function gets its own headline. For example, the headlines in RegExp are:

RegExp -- Create and manipulate regular expressions


ShowCodes     -- Disassemble RE coding (for debugging)

ClassName     -- Handle named classes (e.g. [:punct:])

CountMembers  -- Report how many members are in the set

ClassCompile  -- Decode class specifier and emit RE codes

Compact       -- Render text expression in compact form

Compile       -- Check and convert text expression to compact form

CompileString -- Convert string without looking for RE syntax

CountStates   -- Work out how many states are in RE

ExtractRE     -- Copy specified RE components into new spec

Frequency     -- Estimate how frequently we might match string

EasiestFirst  -- Improve search by deferring complex starts

FindOptional  -- Find sequence of optional elements in RE

FindEnd       -- Find end of RE

RemoveCodes   -- Remove optional codes from start or end

SlashEnds     -- Remove start/end elements for inexact matches

AllOptional   -- Check RE specs after end-of-line spec

Diagnose      -- Report reason for last failure

Init          -- Prepare module for operation

Coding

Here are some components of my style:

· All lines of code should (must?) be less than 80 characters wide, so that all the text is readable without requiring scrolling to the left or right.

· Comments must appear on the line preceding the code, and never in parallel to the left or right of the code.

· Divide the code into paragraphs, with a blank line between paragraphs, and add a single comment line describing at a high(er) level what the paragraph does to the start of each paragraph.

· Don't place spaces inside parentheses, and don't add a space between the function name and the opening parenthesis. Otherwise, add spaces to each side of binary operators, but no space between an unary operator and its operand.

Variables -- Names and Storage

Much of the art of programming comes down choosing clear names for things. Choosing short, concise, descriptive and expressive names is an essential step in creating readable code, and should be a major part of choosing an architecture and designing interfaces and implementations, not just during coding.

· Variable names are composed by combining one or more short descriptive words together.

· When choosing names, try to capture the sense of the information: a flag variable named "Ready " is far more useful than a flag variable named " Status".

· Try to avoid long names -- they are a big, big hint that you've lost control of your design. For example, " pInternetModem->Ready" is better than "InternetInterface.pModem->ModemDeviceState.ModemIsReady ". This can be very hard, and requires constant discipline -- you need to be ruthless in stripping off irrelevant outside context when you are focussing on one corner of the design. Try reading your code out loud, and see if a listener could make sense of what's being said. [Some of Grouse Grep code fails this test: see ScanFile .]

· Variable names are written in mixed case: This is more comprehensible than either all-lower case or all-upper case.

· A variable name must look like a single token to the reader, so underscores are not permitted within names.

If the name is truly expressive, it reduces the need to describe the type. Therefore, I do not believe in the full-blown Hungarian notation (although, to be fair, my interfaces are less complex and perhaps need fewer names than Windows programs). However, I do use Hungarian notation to help clarify some of the complex uses of variables: in particular, to help manage pointers in C, and to help resolve global/object/stack lifetimes of variables.

Pointers are a well-known troublesome part of C. I don't believe that "pointer to type" is worth formalising as rigidly as the type itself -- using & to obtain pointers and * to dereference pointers is very free and easy, and the typing should play along rather than seek to stifle this by being too formal. To keep track of each level of referencing, I prefix a "p" per level to the name. You can pair up the asterisks in the declaration and in dereferencing operations to the "p"s in the name. For example, given "int Value; int *pValue; int **ppValue; int ***pppValue; int *pResult; ", statements like "*ppValue = pResult" can be interpreted quite easily.

The other use for variable names is much more subtle. There are three possible lifetimes for any object or variable:

· A variable may be associated with a code module, and live for the lifetime of the module.

· A variable may be a component of an object, dynamically created and destroyed along with the object, and

· A variable may reside on the run-time stack, either as a passed parameter or as a temporary variable.

Each of these lifetimes is critically important, and must not be confused with any of the others. For example, creating an object from a stack variable is lethal. Confusing module-wide variables with object instance variables leads to complex interactions that are hard to debug.

The coding style seeks to make these lifetimes visible every time the variable is used, by using the following prefixes:

· Module variables are prefixed with the module name, plus a Hungarian style "g" meaning "global", e.g. a module called "Window" might define a counter called gWindow.Counter.

· Object-related variables are referenced by the pointer to the object: pObject->Data .

· Stack variables are not prefixed by either gModule or pObject: Sum . Other coding styles have distinguished between variables passed as parameters and variables declared locally, but I find this to be of little value. (Although I acknowledge that some compilers generate slower code for parameter variables.)

All the module variables are collected into a single structure, and the structure name is the module name plus the leading "g" prefix. The entire structure is locally scoped so the module's variables cannot be explicitly referenced by outside parties.

Note especially that static is utterly, utterly forbidden as a attribute added to a variable declared inside a function. This decision was made after an omitted static declaration in an embedded system module led to a command being corrupted after it had gone through critical integrity checks -- the command buffer was on the stack, and so was overwritten when the stack grew in size again. I've gone so far as to limit the use of static to its scoping meanings -- public scope or module scope -- so that the keyword static should not appear anywhere in the sources (except for the line "#define private_scope static" in compdef.h ).