.TH DFA 3 "30 September 1998" .SH NAME dfa, dfa_new, dfa_ignore, dfa_halton, dfa_restore, dfa_begin, dfa_add, dfa_close, dfa_perror, dfa_errstr, dfa_nextstate, dfa_contains, dfa_accepts, dfa_extent, dfa_dump, dfa_scanf, dfa_sscanf, dfa_fscanf, dfa_pscanf, dfa_malloc, dfa_memalloc, dfa_memincr, dfa_memfree, dfa_destruct \- deterministic finite-state automaton subroutines .SH SYNOPSIS .LP .B #include .LP .nf .ft B #define DFA_MUNGEBIT 1 #define DFA_ISACCEPTING(stateptr) DFA_ISMUNGED(stateptr) #define DFA_ISMUNGED(P) ((int)(P) & DFA_MUNGEBIT) #define DFA_MUNGED(P) ((unsigned long)(P) | DFA_MUNGEBIT) #define DFA_UNMUNGED(P) ((unsigned int)(P) & ~DFA_MUNGEBIT) .ft R .fi .LP .nf .ft B typedef char DFA, PNTR DFAPtr; .ft R .fi .LP .nf .ft B typedef struct DFA_State { struct DFA_State PNTR next[1]; } DFA_State, PNTR DFA_StatePtr; .ft R .fi .LP .nf .ft B typedef union { /* Patterns may need to be identified by a variety of means */ void *p; long i; unsigned long u; } DFA_PatID, *DFA_PatIDPtr; .ft R .fi .LP .nf .ft B typedef struct DFA_Patlist { /* linked list of patIDs */ /* the "next" field in last item is NULL */ struct DFA_Patlist PNTR next; DFA_PatID patID; } DFA_Patlist, *DFA_PatlistPtr; .ft R .fi .LP .nf .ft B typedef struct DFA_Accept { DFA_StatePtr retState; /* retState is Munged iff one patID is in the list */ DFA_PatlistPtr listptr; /* pointer to linked list of patIDs */ } DFA_Accept, PNTR DFA_AcceptPtr; .ft R .fi .sp 0.2i .LP .nf .ft B DFAPtr dfa_new(amin, amax) int amin, amax; .ft R .fi .LP .nf .ft B DFA_Error dfa_ignore(dp, a) DFAPtr dp; int a; .ft R .fi .LP .nf .ft B DFA_Error dfa_halton(dp, a) DFAPtr dp; int a; .ft R .fi .LP .nf .ft B DFA_Error dfa_restore(dp, a) DFAPtr dp; int a; .ft R .fi .LP .nf .ft B DFA_Error dfa_begin(dp) DFAPtr dp; .ft R .fi .LP .nf .ft B DFA_PatlistPtr dfa_add(dp, pattern, patlen, patID) DFAPtr dp; unsigned char *pattern; size_t patlen; DFA_PatID patID; .ft R .fi .LP .nf .ft B DFA_Error dfa_close(dp) DFAPtr dp; .ft R .fi .LP .nf .ft B DFA_StatePtr dfa_nextstate(S, ch, pos, report) DFA_StatePtr S; int ch; size_t pos; int (*report)(size_t pos, void *patID); .LP .nf .ft B DFA_Error dfa_destruct(dp) DFAPtr dp; .ft R .fi .LP .nf .ft B DFA_Error dfa_check(dp) DFAPtr dp; .ft R .fi .LP .nf .ft B DFA_PatlistPtr dfa_contains(dp, pattern, patlen, patID) DFAPtr dp; const unsigned char *pattern; size_t patlen; void *patID; .ft R .fi .LP .nf .ft B DFA_PatlistPtr dfa_accepts(dp, pattern, patlen) DFAPtr dp; const unsigned char *pattern; size_t patlen; .ft R .fi .LP .nf .ft B unsigned long dfa_extent(dp) DFAPtr dp; .ft R .fi .LP .nf .ft B unsigned long dfa_size(dp) DFAPtr dp; .ft R .fi .sp 0.2i .LP .nf .ft B DFA_Error dfa_perror(s) char *s; .ft R .fi .LP .nr .ft B char *dfa_errstr(errno) DFA_Error errno; .ft R .fi .LP .nf .ft B DFA_Error dfa_dump(dp, fp) DFAPtr dp; FILE *fp; .ft R .fi .sp 0.2i .LP .nf .ft B void *dfa_malloc(dp, nbytes) DFAPtr dp; size_t nbytes; .ft R .fi .LP .nf .ft B void * (*dfa_memalloc(falloc)) (size_t) void *(*falloc)(size_t); .ft R .fi .LP .nf .ft B void (*dfa_memfree(ffree)) (void *) void (*ffree)(void *); .ft R .fi .LP .nf .ft B size_t dfa_memincr(newincr) size_t newincr; .ft R .fi .sp 0.2i .LP .nf .ft B DFA_Error dfa_scanf(dp, report) DFAPtr dp; int (*report)(size_t pos, DFA_AcceptPtr ap); .ft R .fi .LP .nf .ft B DFA_Error dfa_sscanf(dp, query, querylen, report) DFAPtr dp; const DFA_Letter *query; size_t querylen; int (*report)(size_t pos, DFA_AcceptPtr ap); .ft R .fi .LP .nf .ft B DFA_Error dfa_fscanf(dp, fp, report) DFAPtr dp; int (*report)(size_t pos, DFA_AcceptPtr ap); FILE *fp; .ft R .fi .LP .nf .ft B DFA_Error dfa_pscanf(dp, gettoken, stream, report) DFAPtr dp; int (*gettoken)(void *stream), (*report)(size_t pos, DFA_AcceptPtr ap); void *stream; .ft R .fi .SH DESCRIPTION .LP These functions create a deterministic finite-state automaton (DFA) that accepts or recognizes variable length, metacharacter-free expressions in an input stream. A Mealy machine implementation is employed, wherein output activity is linked to state transitions (\fIaccepting\fR transitions), rather than in association with the states themselves (\fIaccepting\fR states) as in a Moore machine. This can reduce enormously the number of states required in a multi-pattern automaton and marginally increases its search speed over using a lookup table. Compared to a sparsely populated lookup table, an equivalent Mealy machine is compact and may easily reside in some central processor data caches. In cases where an equivalent lookup table would be fully populated, the DFA implementation produces an automaton that is only marginally larger than the lookup table, by the storage equivalent of a single state. As pattern lengths increase, lookup tables quickly (exponentially) become too large to be practical, whereas a DFA as implemented here will handle such situations with ease as long as the pattern space is not significantly explored. A hash table method could be used instead to conserve storage, but at the expense of slower search speeds. .LP An initial DFA structure is created by calling .BR dfa_new . The input alphabet is defined at this point as the set of all integers from .I amin through .I amax. Prior to calling .BR dfa_begin to create the initial state of the automaton, optional calls to .BR dfa_ignore and .BR dfa_halton may be made, to specify individual letter values that the automaton is to either ignore or halt on, respectively, when encountered in the input stream. .BR dfa_restore may be called to restore default behavior for any letter specified in a previous call to .BR dfa_ignore or .BR dfa_halton . .LP A single call to .BR dfa_begin creates the initial state of the automaton. Only after this has been done can consecutive calls be made to .BR dfa_add to augment the skeletal automaton to recognize each .I pattern of length .I patlen bytes. On success, .BR dfa_add returns a pointer to the DFA_Patlist structure for each newly added pattern/patID combination; on failure, .SM NULL is returned and a description of the error is saved in .BR dfaerrno. A visible string description of the error can be displayed to stderr by calling .BR dfa_perror. .LP The structure of a DFA is completed by calling .BR dfa_close . This function does not create any additional states, but it does allocate a temporary wrap-around buffer of DFA_State pointers and demands permanent heap storage for additional accepting transitions that may be required. .LP A DFA is built within longword-aligned storage obtained either by calling the default memory allocator malloc() or by calling the memory allocation function .I falloc specified to .BR dfa_memalloc . .I falloc should return a .SM NULL pointer if no further storage is available or if the DFA is otherwise not permitted to grow any further. .I falloc will be called with a request for a longword-aligned block of size .I memincr bytes, which may be altered from its default value by calling .BR dfa_memincr . Typically, .I memincr has a value of several thousand bytes, or enough storage for at least a few states and their associated accepting transitions. Each block of this storage is then doled out in smaller chunks as necessary. .LP All heap storage acquired for a DFA through calls to the memory allocator can be released by a single call to .BR dfa_destruct . For each successful call to the memory allocator, .BR dfa_destruct calls either the default memory free function .BR free(3) , or the function specified as the .I ffree argument to .BR dfa_memfree , with the address of a block of storage that had been returned by the memory allocator. .LP After a DFA has been completed by calling .BR dfa_close , use of the DFA to search an input stream requires persistent testing of a \*(lqmunge\*(rq flag encoded in the lowest-order bit of .BR DFA_State pointers. A munge flag that is set reveals the encounter of an accepting transition. The .BR DFA_ISACCEPTING , .BR DFA_ISMUNGED , .BR DFA_MUNGED , and .BR DFA_UNMUNGED macros are used to test, set, and clear the .BR munge flag. When processing the input stream, a munged .BR DFA_State pointer indicates that the .I unmunged pointer actually references an accepting transition structure of type .BR DFA_Accept, not the next state; the appropriate next-state pointer must be retrieved from the .BR retState field of the .BR DFA_Accept structure. .LP The following sample code searches standard input using a previously created DFA, .I dp. .IP .nf /* Sample code to search stdin using a DFA */ int report(); DFA_StatePtr S; /* Current state pointer */ DFA_AcceptPtr A; /* for interpreting accepting transitions */ DFA_PatlistPtr P; /* for processing lists of patIDs */ int c; /* holds each input character */ /* Initialize the automaton by placing it in State 0 */ S = dp->state0; for (;;) { do { /* PRIMARY INPUT PROCESSING LOOP */ /* Get next token from the input stream */ c = getchar(); /* Transit to the next state, checking for acceptance */ } while ( !DFA_ISACCEPTING(S = S->next[c]) ); /* Check for end of file condition */ if (c == EOF) return; /* Process the accepting transition */ A = (DFA_AcceptPtr)DFA_UNMUNGED(S); /* Obtain the return state */ S = A->retState; /* Get the head of the linked list of PatIDs */ P = &A->pl; /* Report each pattern accepted in this transition */ do if (report(P->patID) != 0) break; while (P = P->chain); /* Last item in list has NULL chain pointer */ } .fi .LP Further examples of how to use DFAs may be found in the source code for .BR dfa_fscanf . .LP Each DFA structure begins with a header that contains information such as the alphabet in use, the initial state, memory management details, etc. Some of the useful elements within the DFA structure are: .TP 18 .BR opstate the operational state of the automaton. .TP .BR state0 the address of the initial state. .TP .BR statesize the size (in bytes) of each state in the automaton. .TP .BR nstates the number of states in the automaton. .TP .BR asize the number of letters in the alphabet. .TP .BR amin the smallest letter value present in the alphabet. .TP .BR amax the largest letter value present in the alphabet. .LP Other characteristics of a DFA are not readily available as specific elements in the structure. For a few of these cases, special purpose functions are provided. .BR dfa_extent returns the total amount of storage (in bytes) allocated to the DFA, while .BR dfa_size returns the amount of that storage (in bytes) which is actually in use. .LP .BR dfa_dump produces a brief feature summary of the DFA and a full, printable dump of the binary contents of the states, accepting transitions, and .I patID lists on the output stream .I fp. .LP The enumerated operational states maintained in the .BR opstate field of a DFA structure are: .TP 18 dfaOpstateInit basic initialization completed. .TP dfaOpstateMemInit memory management established. .TP dfaOpstateAlphaInit alphabet has been validated and processed. .TP dfaOpstateVirgin the DFA has not been fully initialized by .BR dfa_begin and contains no states. .TP dfaOpstateOpen the DFA is ready for adding a pattern. .TP dfaOpstateBuilding a pattern is being added to the DFA. .TP dfaOpstateClosing the DFA is in the process of being closed. .TP dfaOpstateClosed the DFA is closed. .TP dfaOpstateZombie the DFA is garbage, but somehow still lives. .TP dfaOpstateDestroyed the DFA has been destroyed/deallocated and should not be referenced. .LP .BR dfa_pscanf is a sample input parser using a fully constructed DFA pointed to by .I dp. .I gettoken is called repeatedly by .BR dfa_pscanf with the specified .I stream argument (which is not checked) to obtain each character from the input stream. When an accepting transition is encountered, the user-supplied function .I report is called with the current location in the input stream (\fIpos\fR) and the .I patID of a recognized pattern. .I report should return 0 on success, or return non-zero on failure upon which .BR dfa_pscanf will abort its search of the input stream. Upon obtaining the return value .SM EOF (\fIi.e.\fR, signed integer value -1) from a call to .I gettoken, .BR dfa_pscanf returns normally to its caller. .SH "RETURN VALUES Functions that return a size indicate an error condition with zero return values; a non-zero return value indicates no error. Functions that return an int Functions that return a pointer indicate errors by returning .SM NULL .BR (0) . Upon receipt of an error return value, .BR dfa_error may be called to determine the precise error encountered. .BR dfa_clearerr may be called to reset an error condition. .SH ERRORS The functions will fail under the following enumerated error conditions: .TP 20 dfaErrBadPtr .I dp is not a valid DFA pointer. .TP dfaErrBadParm a bad value was specified for one or more parameters. .TP dfaErroMemincrTooSml .I memincr is too small; see .BR dfa_memincr. .TP dfaErrNoMem insufficient memory available to complete request. .TP dfaErrPatlen .I patlen is zero. .TP dfaErrOpstate .I dp is in the wrong operational state for the called function. .TP dfaErrNonAlpha .I pattern contains a letter not present in the .I alphabet. .TP dfaErrNullParm a required parameter is \s-1NULL\s0. .TP dfaErrAlphaSize invalid alphabet size. .TP dfaErrFileIO stream I/O error encountered by .BR dfa_dump. .TP dfaErrReport the report function called by .BR dfa_scanf, .BR dfa_sscanf, .BR dfa_fscanf, or .BR dfa_pscanf returned non-zero. .TP dfaErrUnknown non-specific error. .SH RESOURCE UTILIZATION .LP Efficiency of DFA construction can be broken down into the separate requirements for states, accepting transitions, linked lists of \fIpatID\fRs associated with the accepting transitions, and working storage. Time and storage requirements for the states are no worse than linear functions of the total length of the patterns. Each state contains one next-state pointer for every letter in the alphabet, so the size of a DFA is linear in the alphabet size. The Mealy machine implementation imposes an upper bound on the number of states that is the summation of .I S**m, for all .I m = 0, 1, 2,... n-1, where .I S is the number of letters in the alphabet and .I n is the length of the longest pattern accepted by the DFA. This worst-case situation is encountered when all possible patterns of length .I n are to be recognized by the DFA, corresponding to situations where the equivalent lookup table has all slots filled. When .I S is a power of two, the number of states is bounded by .I S**n - 1. .LP Since each state contains .I S next-state pointers, the maximum number of next-state pointers in a DFA is the product of .I S and the upper bound on the number of states, or the summation of .I S**m, for all .I m = 1, 2, 3,... n. When .I S is a power of two, the total number of next-state pointers is limited to .I S**(n+1) - S. Compared to a lookup table for any size alphabet (fully populated or not), the lookup table requires storage for .I S**n entries. Essentially the leaf states of a worst-case DFA are equivalent in total size to the lookup table; the root state plus any internal states comprise the additional storage required by the DFA in worst case situations. For arbitrarily sized alphabets, the number of pointers required by a DFA over that required for a lookup table is the summation of .I S**m, for all .I m = 1, 2, 3,... n-1. If the alphabet is a power of two in size, this simplifies to .I S**n - 2. Thus, in worst case situations, the DFA is not quite twice the size of an equivalent, fully populated lookup table. For short pattern lengths, this makes little practical difference. For long pattern lengths, many applications exist which do not require that the entire pattern space be explored. Furthermore, for speed in searching (using logical shift operations rather than multiplies), lookup tables are often chosen to be the largest power of two in size necessary to support the desired alphabet. When such a rule is used to pick the size of a lookup table, depending on how much larger than a power of two the alphabet is, the equivalent worst-case DFA may approach half the size of the lookup table because there is no performance gained by using states that are larger than necessary. .LP Requirements for the accepting transitions are linear when all patterns are equal length, and more complicated when some patterns are internal substrings of others. Accepting transitions individually occupy little storage (4 bytes overhead on common platforms), but their cumulative storage can amount to far more than the storage required for the states themselves. .LP The output lists associated with accepting transitions also have small but additive time and storage dependencies (8 bytes overhead per element in a linked list). In many instances, resources are conserved by joining existing output lists to produce longer linked lists that have both initial and internal entry points; in other cases, lists of \fIpatID\fRs may be re-used without duplication. .SH BUGS .LP This manual page has not been rigorously examined for accuracy and is expected to contain numerous errors, some substantive. .LP Hiding the .BR munge flag in pointers is not the best machine-independent programming practice, but it does conserve significant time and storage when it works. On most platforms encountered, the least-significant bit (bit 0) is used for the .BR munge bit. On the Cray YMP, however, a different bit must be used, for which bit 30 has been selected. .LP In making sweeping changes to the DFA library in February, 1992, .BR dfa_close was left in an incomplete state--it allocates a temporary storage space (WABuf) that is usually larger than necessary but still small enough for most practical applications even on modest-sized microcomputers. When time permits, this function will be brought back into shape so that the temporary storage is smaller and wraps around on itself as it did in earlier versions. .LP The memory allocation and free functions employed, as well as the size of each request made for additional storage, apply globally to all DFA structures currently in use. If necessary, it would be trivial to have these characteristics set at the time a DFA is created by .BR dfa_new and retained throughout the life of the DFA, independent of any subsequent calls to .BR dfa_memalloc , .BR dfa_memfree and .BR dfa_memincr . .LP Note that the .I pattern argument to .BR dfa_add is a pointer to an array of unsigned char, while the .I amin and .I amax arguments to .BR dfa_new are signed integers. A slightly modified version of .BR dfa_add needs to be written that accepts pointers to arrays of signed char or int. .LP Automata created by this library are not guaranteed to contain the smallest number of states needed to recognize a given list of patterns. .LP Fragmentation of the free pool by a temporary wrap-around buffer may occur if the need for more accepting transitions by .BR dfa_close should result in one or more calls to .I falloc. .LP When insufficient storage is available to create a DFA, .BR dfa_begin , .BR dfa_add and .BR dfa_close will fail. These functions do not back out any intermediate modifications made to the DFA. Even if sufficient storage becomes available at a later time, a DFA that was specified in a failed call to .BR dfa_begin , .BR dfa_add or .BR dfa_close should never be used again in calls to these functions. .BR dfa_destruct can and should be called to return any allocated storage to the free pool. .LP .BR dfa_ignore , .BR dfa_halton and .BR dfa_restore may not be applied to a DFA once it has been the subject of a call to .BR dfa_begin . .LP .BR dfa_add may not be applied to a DFA once it has been the subject of a call to .BR dfa_close . .LP No means are currently provided to save a DFA to disk in a re-usable format or to share a DFA image between unrelated processes. .LP The information produced by .BR dfa_dump may be voluminous and could be easier to interpret. .LP No attempt is made by .BR dfa_add to filter duplicate calls using the same combination of .I patID or .I pattern arguments. If recognition of identical patterns by an automaton is to be avoided, the functions .BR dfa_accepts and .BR dfa_contains may be useful for screening the calls made to .BR dfa_add . .LP In linked lists of \fIpatID\fRs, identical patterns are listed in the reverse order from which they were made the subject of calls to .BR dfa_add or .BR dfa_addx . Any pattern which is a suffix of a longer pattern is listed after the longer pattern. .SH SEE ALSO .LP Aho, Hopcroft and Ullman (1974). \fIThe Design and Analysis of Computer Algorithms,\fR page 335. Addison-Wesley, Reading, MA. .LP Aho and Corasick (1975). \fIEfficient String Matching: An Aid to Bibliographic Search\fR, Communications of the Association for Computing Machinery \fB18\fR(6):333\-340. .LP Hopcroft and Ullman (1979). \fIIntroduction to Automata Theory, Languages, and Computation\fR, pages 42-45. Addison-Wesley, Reading, MA. .SH EPILOGUE .LP The methods used in this function library were developed independently by the author during the early 1980s, while a student in the laboratory of Michael Botchan at the University of California, Berkeley. The C language implementation described here was initiated in 1989, while the author was a staff fellow at the NCBI. It was expanded somewhat at Washington University in 1995. Subsequently, a description of many of the ideas implemented here was found in Aho and Corasick (1975). In particular, Aho-Corasick's Algorithm 2 is analogous to .BR dfa_add ; and .BR dfa_close is analogous to a consolidation of Aho-Corasick's Algorithms 3 and 4, just as the authors suggest would be done in practice. Notable differences include the use of a Mealy machine here and a Moore machine by Aho-Corasick; more efficient storage of output lists here when some patterns are suffixes of others; and a more efficient, hierarchical ordering here of tests for the end of the text string being searched and the acceptance of some pattern(s). .SH ACKNOWLEDGEMENTS The author wishes to thank former upperclassman Michael J. Karels for the keen suggestion which led to this development: that automata be used to identify restriction enzyme recognition sites in a single pass through a sequence; Glen Herrmannsfeldt for pointing me to Aho and Corasick (1975); and former classmate J. Michael Cherry for his encouragement. .SH AUTHOR .LP Warren Gish, Washington University School of Medicine, Department of Genetics, 4444 Forest Park Avenue, Saint Louis, MO 63108. gish@watson.wustl.edu. Former address: National Center for Biotechnology Information, National Library of Medicine, Building 38A Room 8N-806, 8600 Rockville Pike, Bethesda, MD 20894. gish@ncbi.nlm.nih.gov.