AB-BLAST

Basic Local Alignment Search Tool
from Advanced Biocomputing, LLC

Index

Description
Licensing
Key Features
Licensing of AB-BLAST
Manifest
Command Line Options and Parameters
Comparable AB/NCBI-BLAST Parameters
Environment Variables
Filters and Masks
Precomputed Statistical Parameters
- nucleotide scoring systems
- protein scoring systems
Bugs <= READ THIS!
Memory Requirements
Supported Platforms
Installation
Differences between AB-BLAST and WU-BLAST
Citing BLAST
Historical Notes
References

Description

AB-BLAST 3.0 is a powerful software package for gene and protein identification, using sensitive, selective and rapid similarity searches of protein and nucleotide sequence databases. The feature list for AB-BLAST is long and continues to expand, while performance is improved. Much of this is outlined below. A complete suite of BLAST search programs (blastp, blastn, blastx, tblastn and tblastx) is provided in the package, along with several database management and support programs that include nrdb, patdb, xdformat, xdget, seg, dust and xnu.

AB-BLAST has been built to be the most trusted database search tool in your software toolbox, doing what you tell it, reporting precisely what it’s doing — even telling you what it could not do because of specific parameter restrictions you might wish to change — and able to handle the biggest jobs with aplomb. Users of other BLAST implementations have suffered every few years through a series of expensive and time-consuming rewrites, alpha releases, beta releases, output format changes, database format changes, specialized spin-off programs, and new program parameters and behaviors that may be important to NCBI operations but not to anyone else. Meanwhile AB-BLAST was built from scratch to offer performance, reliability and flexibility — plus backward compatibility with every AB-BLAST release for over 24 years.

AB-BLAST represents the most rigorous, sensitive implementation of the core BLAST algorithm available, yet it often runs faster than the rest. AB-BLAST has a simple, easy-to-use command line syntax; offers consistent behavior across all search modes; runs on general purpose computer hardware; can uniquely categorize and filter results based on biologically relevant criteria; and much more. All of these features help AB-BLAST users be more productive and save time and money.

AB-BLAST is not a re-hashed version of NCBI-BLAST. AB-BLAST shares virtually no code with NCBI-BLAST except for some portions that both packages copied from the public domain, ungapped BLAST version 1.4 released in 1994 (W. Gish, unpublished). A brief history of AB-BLAST development is available here. The AB-BLAST lineage includes the original gapped BLAST, first released in May 1996. Before that, its author created (and maintained) the "nonredundant" databases and BLAST Network Service but was asked by NCBI management not to publish this work.

Licensing

Please see https://blast.advbiocomp.com/licensing/ for complete licensing information.

Key Features

Some of the key features of AB-BLAST are described below.

AB-BLAST is the premier gapped BLAST with statistics. AB-BLAST is derived from the original gapped BLAST with statistics (W. Gish, 1996, unpublished). Gapped alignment routines are available and used by default in all AB-BLAST search modes (BLASTP, BLASTN, TBLASTN, BLASTX and TBLASTX), with purely ungapped alignments available as an option.
Faster and More Sensitive. AB-BLAST is up to twice as fast as NCBI BLAST in all search modes, while being more sensitive. AB-BLAST uses distinctly different, more sensitive search algorithms than NCBI-BLAST, and these have been painstakingly implemented to be faster as well. Due to algorithmic differences (not to mention differences in statistics), no setting of parameters can guarantee that the two packages will produce the same results; with comparable parameter settings, however, the classical 1-hit BLAST used by default in AB-BLAST is faster and more sensitive than the 1-hit BLAST available as an option in NCBI BLAST. At comparable sensitivity levels, the AB-BLAST 1-hit implementation is nearly as fast as, and uses much less memory than, the 2-hit algorithm described by Altschul et al. (1997). For users who desire maximum speed, an AB-exclusive 2-hit algorithm (W. Gish, unpublished) is available in all search modes (including BLASTN) that is faster, more sensitive and more memory-efficient than the NCBI 2-hit algorithm. (See the hitdist option.)
Formatting databases is faster with AB-BLAST (up to 4x faster), which allows your BLAST servers to start scanning the databases that much sooner.
Multiple Local Alignments with Joint Evaluation. Virtually all database search programs will find sequence similarities (locally optimal alignments or approximations thereof) that are by themselves statistically insignificant and thus are not reported, but AB-BLAST can identify alignments none of which are statistically significant on their own but which are statistically significant as a group. Alignments are clustered into “consistent” sets under a variety of user-configurable constraints, including maximum allowable separation distance and maximum allowable overlap. This combination of sensitivity and selectivity — original to AB-BLAST and used by default in all AB-BLAST search modes — increases the biological relevance of the results. This feature is essential for finding:
- all exons in a multi-exon gene sequence, not just high-scoring exons;
- all complete or partial copies of a repetitive element in a genomic sequence, not just high-scoring ones; and
- multiple, discrete domains of similarity between sequences, not just the highest-scoring domain.
Often More Sensitive/Selective than Smith-Waterman. The combination of well-chosen heuristics and statistics in AB-BLAST can be more sensitive and selective than the full dynamic programming approach of the classical Smith-Waterman (1981) algorithm, which reports only the single highest scoring alignment between two sequences, as well as other approaches or BLAST implementations that may identify multiple regions of local similarity but then only evaluate the alignments individually for their statistical significance.
Full Smith-Waterman Option. With the postsw option, a full Smith-Waterman alignment is performed between pairs of query-subject sequences that are already scheduled to be reported by BLASTP. The Smith-Waterman alignments are combined with the heuristic BLAST results, any redundancy between them is removed, and the statistics are recomputed. In addition to providing alignments guaranteed to be optimal, this post-processing can significantly improve the P-values and relative ranking of database hits, often while increasing the execution time only marginally.
Choice of Statistical Methods. AB-BLAST uses “Sum” statistics (Karlin and Altschul, 1993) by default in all search modes, with Poisson statistics available as an option (poissonp). Sum statistics and Poisson statistics involve joint probability calculations on sets of one or more alignments. To evaluate the significance of individual alignments (or alignment scores), simple Karlin-Altschul (1990) statistics are also available with the kap option.
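As a rough illustration of the simple Karlin-Altschul statistics mentioned above, the sketch below computes the expected number of chance alignments, E = K·m·n·exp(-λS), for a single alignment score S, and the corresponding P-value, 1 - exp(-E). The function names and all numeric inputs are placeholders chosen for this example, not values or code taken from AB-BLAST; Sum and Poisson statistics are not shown.

```python
import math

def karlin_altschul_evalue(score, m, n, lam, k):
    """Expected number of chance alignments scoring >= score.

    m and n are the query and database lengths; lam and k are the
    Karlin-Altschul parameters (lambda, K) for the scoring system.
    """
    return k * m * n * math.exp(-lam * score)

def pvalue_from_evalue(e):
    """Probability of at least one chance alignment, P = 1 - exp(-E)."""
    return -math.expm1(-e)

# Hypothetical inputs chosen only for illustration.
e = karlin_altschul_evalue(score=100, m=250, n=1_000_000, lam=0.3, k=0.1)
p = pvalue_from_evalue(e)
print(f"E = {e:.3g}, P = {p:.3g}")
```

For small E the two quantities are nearly equal, which is why P-values and E-values are often read interchangeably for strong hits.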
BLASTN Flexibility. Unique to AB-BLAST are these features of the BLASTN search mode:
- Nucleotide scoring matrices. AB-BLASTN supports fully-specified scoring matrices, not just simple match/mismatch scoring systems. This allows transitions to be scored differently than transversions, and it allows positive G-A substitution scores for the design of siRNAs (small interfering RNAs), where G-U base pairing is allowed. Scoring matrices can also be tailored to improve the design of PCR primers or applied to areas of research where a simple match/mismatch scoring system cannot adequately discriminate. Contrary to W. Miller (2001), scoring matrices were first supported by the NCBI ungapped BLASTN version 1.4 (Gish, W., 1994, unpublished; see https://blast.advbiocomp.com/pub/blast-1.4). Support for nucleotide scoring matrices was indeed dropped by the NCBI when its blastall program was released in 1997, but this feature was maintained continuously in all WU versions of BLASTN since the migration to Washington University in St. Louis in 1994, continuously through the introduction of the original gapped BLAST (WU-BLAST) in 1996, and on through to today with AB-BLASTN.
- Flexible Word Lengths. AB-BLASTN supports BLAST word lengths as short as 1 (re: the W parameter).
- Nucleotide Neighborhood Words. Nucleotide neighborhood words are supported by AB-BLASTN using the standard neighborhood word score threshold parameter, T. Using neighborhood words, nucleotide sequence similarity can be detected even in the absence of any identical residues between two sequences. Users are cautioned, however, that careless use of the T parameter can result in crushing amounts of memory being requested by BLASTN. For this reason, T should likely be used only in conjunction with very short word lengths.
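To make the effect of the T parameter concrete, here is a minimal brute-force sketch of neighborhood word generation for a nucleotide word. It assumes a simple +5/-4 match/mismatch scoring system picked for the example, and it enumerates every candidate word, which is an illustration only and not how AB-BLAST builds its word list.

```python
from itertools import product

ALPHABET = "ACGT"

def word_score(a, b, match=5, mismatch=-4):
    """Score two equal-length words with a simple match/mismatch system."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))

def neighborhood(word, threshold):
    """All words of the same length scoring >= threshold against `word`."""
    return [
        "".join(cand)
        for cand in product(ALPHABET, repeat=len(word))
        if word_score(word, "".join(cand)) >= threshold
    ]

# With W=3 and +5/-4 scoring, an exact match scores 15 and a single
# mismatch scores 6, so lowering T from 15 to 6 grows the list.
print(len(neighborhood("ACG", 15)))  # exact word only
print(len(neighborhood("ACG", 6)))   # exact word plus one-mismatch words
```

The list grows quickly as T drops or the word length rises, which is the memory hazard the caution above refers to.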
Consistently Accurate Statistics with BLASTN. Since the release of the first gapped BLAST with statistics in 1996 (W. Gish, unpublished), the statistical significance of gapped alignment scores in all search modes, including BLASTN, has been evaluated using appropriately pre-computed “gapped” values for the statistical parameters λ, K and H, rather than the potentially very different values for these parameters that are computed at run-time for evaluating ungapped alignment scores. If precomputed values are not available for the specific combination of scoring system and gap penalties requested by the user, a prominent warning has always been issued by AB-BLAST. In contrast, NCBI-BLASTN only relatively recently began using precomputed gapped values for λ, K and H. For many years prior, going all the way back to 1997, NCBI-BLASTN always used parameter values computed for ungapped alignments to evaluate the significance of gapped alignments, and did so without warning.
Virtual Gene Structures. Linkage information describing “consistent” groups or chains of local alignments (HSPs) is provided by AB-BLAST when the topcomboN or links options are used. This facility can help with construction of overall gene structures from what might otherwise be a barrage of individual local alignments scattered throughout a 2-dimensional search space. The hspsepQmax, hspsepSmax, olfraction, olmax, golfraction and golmax parameters can also help ensure the reported structures are more biologically relevant.
Ease of Database Management. AB-BLAST supports the eXtended Database Format (XDF), a power user’s dream for working with peptide and nucleotide sequences. Both the NCBI-BLAST 2.0 database format and the NCBI implementation of the BLAST search algorithm were originally restricted to sequences under 16 Mbp in length, whereas human genome contigs exceeded 25 Mbp in the last millennium (Hattori et al., 2000) and extended to hundreds of megabytes many years ago. In contrast, XDF databases, which were introduced in 1999, have the facility to accurately store individual sequences of up to 1 Gbp (1 billion bp) in length with ambiguity codes intact. Other BLAST software, such as the NCBI’s, limits database files to 2 gigabytes each, whereas from its inception XDF database files could be of virtually unlimited size — provided of course that the host operating system and file system support such “large files” (as most modern operating systems and file systems do).

To support XDF databases, the database formatting tool named xdformat is provided with AB-BLAST. Other distinct capabilities and advantages of using XDF and xdformat include:

- fast appends of new sequences to existing databases (both protein and nucleotide); there is no need to reformat an XDF database just to add one or more sequences to it;
- xdformat runs 2-4 times faster than the NCBI formatdb program, while offering a superset of features and greater reliability;
- the full complement of NCBI standard “FASTA” sequence identifiers is indexed by xdformat, including standard identifiers the NCBI formatdb program does not index (see this FAQ for further details);
- xdformat reports duplicated sequence identifiers (that should be unique) in public databases distributed by the NCBI, which the NCBI formatdb does not catch;
- safe roll-backs of database updates when file I/O (e.g., disk-full errors) or parse errors are encountered;
- huge databases need not be broken into multiple volumes individually composed of several files but can be managed simply with as few as 3 files (4 files when a sequence identifier index is included), regardless of the database size up to 1 TB;
- flexible indexing of all sequence identifiers (not just a subset of the NCBI identifiers), including user-defined identifiers;
- index support for duplicate occurrences of the same identifier, even the identical “gi” identifiers that cause some indexing programs to abort;
- identifier indexing is supported not only when creating an XDF database but also when appending new sequences to an existing XDF database; and if a database was originally created without an identifier index, an index can be added later in one relatively fast step;
- identifier indexes can be quickly re-built if necessary using different indexing policies, without having to reformat the entire database;
- intelligent retrieval of indexed sequences uses a complementary program named xdget. Xdget can retrieve sequences by identifier even if the program is not told what name space (e.g., gi, accession, locus, user-defined, etc.) the identifier came from. For more on identifier indexing, see this;
- xdformat and xdget accept and work intelligently with identifiers that obey the International DDBJ/EBI/NCBI collaboration’s Accession.Version identifier syntax (e.g., the programs know that BAA84643.2 is a newer version of BAA84643, but will retrieve BAA84643.1 if specifically requested); and the parsing programs that come with AB-BLAST for converting GenBank and EMBL database flat files into “FASTA” format report not only gi identifiers but also Accessions with Versions;
- greatly reduced memory requirements and BLAST search initiation times for databases containing large numbers of entries, which is particularly important when memory is in short supply or when multiple processors are standing by waiting for a single-threaded initialization phase to be completed;
- the ability to dump (or recover) the contents of an XDF database back into FASTA format with the original annotation and ambiguity codes intact;
- both the X and N ambiguity codes are supported in nucleotide sequences, thus permitting the use of distinct substitution scores for these letters and the use of PHRED/PHRAP sequence output “as is” for input to xdformat.
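The Accession.Version handling described in the list above can be pictured with a small lookup sketch. It assumes, for illustration only, that an unversioned request returns the newest stored version while a versioned request returns exactly that version; the in-memory index and the function names are hypothetical and say nothing about how xdget is actually implemented.

```python
def split_accession(identifier):
    """Split 'BAA84643.2' into ('BAA84643', 2); version is None if absent."""
    base, dot, version = identifier.rpartition(".")
    if dot and version.isdigit():
        return base, int(version)
    return identifier, None

def lookup(index, identifier):
    """Return the record for an exact version, or the newest if unversioned.

    `index` maps accession -> {version: record}.
    """
    base, version = split_accession(identifier)
    versions = index.get(base, {})
    if not versions:
        return None
    if version is None:
        version = max(versions)
    return versions.get(version)

# Hypothetical in-memory index with placeholder records.
index = {"BAA84643": {1: "record v1", 2: "record v2"}}
print(lookup(index, "BAA84643"))    # newest version
print(lookup(index, "BAA84643.1"))  # explicitly requested version
```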
Compared to the classical BLAST 1.4 database format, XDF provides the ability to use FASTA/Pearson format input files with unjustified (i.e., ragged or blank) input lines. With nucleotide sequence databases, there is also no longer the need to retain the original FASTA input file in order to access the ambiguity codes during a database search.
Support for XDF by the BLAST search programs does not come at the expense of backward compatibility. AB-BLAST can search databases in either XDF or the classical BLAST 1.4 database formats. Furthermore, by simply installing new versions of setdb and pressdb, the migration to using XDF can be performed swiftly and transparently, without making any changes whatsoever to existing database maintenance scripts. While providing this drop-in upgrade path to XDF, support for legacy databases in the BLAST 1.4 database format is retained transparently, as well: the AB-BLAST search programs automatically identify the database format being used and adjust their operations accordingly. This allows users to migrate incrementally to XDF, at their own pace and as they see fit, without losing the ability to study or reproduce results obtained with older databases. Even so, users are encouraged to make the migration to XDF, as there are definite benefits to the new format, including an improved nucleotide sequence data representation and the ability to index sequences by their identifiers.
When searching very large databases, virtual memory requirements are dramatically reduced in AB-BLAST, eliminating program failures that occurred when system resource limits were unexpectedly reached.
Virtual databases are supported by AB-BLAST. Virtual databases can be specified on the command line as a white space-delimited list of component database names. Virtual databases can be composed of components in either XDF or classical BLAST 1.4 format, as long as the formats are not mixed on the same command line. For example, this command might be used to search the pri, rod, mam, vrt, and htg divisions of GenBank:
```
  blastn "pri rod mam vrt htg" myquery.nt
```
Virtually no file size limits exist for databases and other files, provided the host operating system supports large files. Operating systems such as Linux (kernel version 2.2 and earlier) for 32-bit Intel computing platforms are often incapable of using files larger than 2 GB, although virtual database support (see above) helps avoid this limitation by allowing large databases to be segmented into files of a manageable size. Linux users in need of large file support should use at least a version 2.4.* kernel or ideally a 2.6.* kernel.
AB-BLAST supports segmented query sequences, such as the contigs that result from shotgun sequencing assembly or perhaps multiple short probes for a given gene. For example, all of the contigs from a given clone can be concatenated together with a single hyphen (-) character to delimit each contig. Segment boundaries are therefore clearly distinguishable from purely ambiguous regions of the sequence, while consuming little storage. AB-BLAST honors segment boundaries by guaranteeing that no alignment, be it ungapped or gapped, will cross a boundary. Support for segmented database sequences is in progress.
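A segmented query of the kind described above could be prepared with a few lines of scripting. The sketch below joins contig sequences with single hyphens and writes one FASTA record; the contig sequences, the record name and the 60-column line width are placeholders chosen for the example.

```python
def join_contigs(contigs):
    """Concatenate contig sequences with a single hyphen between them."""
    return "-".join(contigs)

def to_fasta(name, sequence, width=60):
    """Format one FASTA record with fixed-width sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"

# Hypothetical contigs from one clone.
contigs = ["ACGTACGTAC", "GGGTTTCCCA", "TTGACA"]
print(to_fasta("clone1_contigs", join_contigs(contigs)), end="")
```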
Multi-sequence query files are supported, such that every sequence in the FASTA file is searched against the specified database. Previous versions of the software only compared the first sequence in the query file against the database. Each search result is separated from the next by a single ASCII form feed character (control-L). See the new qrecmin and qrecmax options.
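Because each search result is separated from the next by a form feed, downstream scripts can split the combined output on that character. This is a minimal sketch; the sample text merely stands in for real search output.

```python
def split_reports(output_text):
    """Split concatenated search output on ASCII form feed characters."""
    return [part for part in output_text.split("\f") if part.strip()]

# Placeholder text standing in for real search output.
combined = "report for query 1\n\freport for query 2\n\freport for query 3\n"
for i, report in enumerate(split_reports(combined), start=1):
    print(i, report.strip())
```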
The format of all dates reported in BLAST output can be controlled by the UN*X standard CFTIME environment variable. For example, dates will be reported in ISO 8601 standard format, if CFTIME is set to '%Y-%m-%dT%H:%M:%S'. Date strings produced by the xdformat program are also governed by CFTIME. Note that the format of many date and time strings reported for XDF databases is determined by the setting (if any) of CFTIME when the database was created or last modified.
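A candidate CFTIME value can be previewed before it is used. The sketch below assumes CFTIME takes strftime-style conversion specifiers, as the example value in the text suggests, and uses Python's time.strftime purely as a stand-in for how the search programs would render a date.

```python
import time

ISO_8601 = "%Y-%m-%dT%H:%M:%S"

def preview(fmt, when=None):
    """Render a time tuple with a strftime-style format string."""
    return time.strftime(fmt, when if when is not None else time.localtime())

# A fixed time tuple so the result is reproducible.
fixed = time.struct_time((2024, 3, 9, 14, 5, 7, 5, 69, 0))
print(preview(ISO_8601, fixed))
```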
Both sequence filtering and word masking of query sequences are supported. The terms “filter” and “mask” are sometimes used alone and interchangeably; however, there are two distinct techniques that deserve separate names. Lower case alphabetic letters in the query sequence can be used to inform the BLAST search program as to which residues it should either filter (convert to X or N) or mask (skip when generating neighborhood words but otherwise leave the sequence intact). See the lcfilter and lcmask options, respectively.
Multiple filter=<filter> directives can be specified on the BLAST command line. Each of the filters is executed independently and their results are OR-ed at the end.
NCBI-BLAST 2.0 uses either built-in (closed) complexity filters or the original external filtering technique of BLAST 1.3 (Gish, W., unpublished), which uses the UNIX popen() system call and temporary/intermediate files. AB-BLAST provides open access to its complexity filters by using filter programs that are distinct from the search programs, while simultaneously avoiding the use of problematic system call interfaces and temporary files.
One or more word masks can be specified on the command line, using the wordmask=<mask> option, where <mask> may be a classical filter program such as seg, xnu, or dust. Whereas sequence filters convert certain letters in the query sequence into ambiguity codes (X for amino acid and N for nucleotide), word masks do not alter the sequence. Word masks instead cause the indicated portion(s) of the query sequence to be skipped during BLAST neighborhood word generation. This leaves the query sequence intact for generating alignments that are seeded by word hits arising in flanking, unmasked regions of the sequence.
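The difference between filtering and word masking can be shown on a toy query in which lower-case letters mark the affected region. The sketch below is a conceptual illustration with hypothetical function names, not AB-BLAST code: the filter rewrites the sequence, whereas the mask only skips words that touch the marked region.

```python
def hard_filter(seq, ambiguity="N"):
    """Replace lower-case residues with an ambiguity code (sequence changes)."""
    return "".join(ambiguity if c.islower() else c for c in seq)

def soft_mask_words(seq, w):
    """Yield (offset, word) only for words free of lower-case residues.

    The sequence itself is left untouched; masked positions are merely
    skipped when generating words.
    """
    for i in range(len(seq) - w + 1):
        word = seq[i:i + w]
        if word.isupper():
            yield i, word

query = "ACGTacgtACGT"
print(hard_filter(query))
print(list(soft_mask_words(query, 4)))
```

With masking, an alignment seeded in the flanking upper-case words can still extend through the lower-case region, because the residues there are unchanged.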
The BLAST algorithm word length parameter, W, can be set from 1 to 1024 in all search modes (BLASTP, BLASTN, BLASTX, TBLASTN and TBLASTX). This wide-ranging flexibility and the resultant speed are due in part to the highly optimized DFA (deterministic finite-state automaton) used by AB-BLAST.
AB-BLAST reliably supports parallel processing on a variety of SMP (symmetric multiprocessing) computing platforms. AB-BLAST was the first BLAST to support Mac OS X, the first to thread properly (i.e., robustly and efficiently) under Mac OS X, and the first to support 64-bit computing under Mac OS X on PowerPC G5 and Intel X64 processors. POSIX threads are used under Linux and Mac OS X. Although POSIX threads are available under Solaris, Solaris native threads are used instead for slightly better performance.
To illustrate just some of the flexibility available with AB-BLAST, the software bundle includes a PERL script named ab-blastall (formerly wu-blastall) that translates an NCBI blastall command line into a roughly equivalent AB-BLAST command line and then invokes the appropriate AB-BLAST search mode. The output remains in AB-BLAST format, but the ab-blastall script may help users of the NCBI blastall program migrate to AB-BLAST and start to discover its power.
NOTE: the ab-blastall script is not at all intended to provide a literal replacement for the NCBI blastall program and is not an appropriate method for assessing the relative performance (sensitivity, specificity, accuracy or speed) of the NCBI and AB software.
By coercing AB-BLAST to substitute for CrossMatch (Phil Green, unpublished), the unique combination of flexibility and speed of AB-BLAST can yield up to a 30-fold increase in performance of RepeatMasker in its slow mode, while maintaining virtually identical sensitivity.
Many command line options are available. The complete list of options and parameters along with their descriptions is provided in the parameters.html file that comes bundled with the software. The latest version of this file is maintained on-line at https://blast.advbiocomp.com/doc/parameters.html.
For convenience, the AB-BLAST package includes an optimized version of the classic sequence redundancy removal program nrdb. A newer program named patdb is also included that can find identical sequences and perfect substrings of others in the input nearly as fast as nrdb finds identical sequences alone.
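The two kinds of redundancy removal can be sketched as follows. This is a deliberately naive version (a dictionary for identical sequences and a quadratic scan for substrings), not the optimized methods used by nrdb or patdb, and the record names and sequences are made up for the example.

```python
def remove_duplicates(records):
    """Merge records with identical sequences, keeping all their names."""
    merged = {}
    for name, seq in records:
        merged.setdefault(seq, []).append(name)
    return merged

def remove_substrings(sequences):
    """Drop sequences that are perfect substrings of another sequence."""
    unique = sorted(set(sequences), key=lambda s: (-len(s), s))
    kept = []
    for seq in unique:
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept

records = [("p1", "MKVLA"), ("p2", "KVLA"), ("p3", "MKVLA"), ("p4", "GGSW")]
merged = remove_duplicates(records)
print(merged)
print(remove_substrings(merged))
```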

A reverse chronological list of changes to the AB-BLAST software is available in the file named HISTORY that comes bundled with the software. When possible, any bugs that have been found have typically been fixed within 24 hours of their being reported.

Please send us bug reports, questions, or suggestions.

Licensing

Full information about licensing of AB-BLAST is provided here.

Manifest

The AB-BLAST 3.0 package includes the following data analysis and utility programs:

blasta — the unified database search program, which provides blastp, blastn, blastx, tblastn, and tblastx search functionality.

xdformat — the recommended program for rapidly converting sequences from FASTA format into the native XDF format read by blasta. The program can also append new sequences to an existing database, automatically roll back on errors, provide flexible indexing and verification services, and dump data back into FASTA format.

xdget — a flexible tool for retrieving sequences (or segments thereof) from an indexed XDF database; retrieved sequences are optionally reverse-complemented and translated in the case of nucleotide sequences. xdformat and xdget are actually one and the same program to help ensure their mutual compatibility during upgrades.

nrdb — a tool for rapidly removing trivial redundancy (i.e., duplicate sequences) from one or more input files in FASTA format. The nrdb program is often many times faster and uses much less memory than “competing” solutions.

patdb — a tool much like nrdb for rapidly removing trivial redundancy from one or more input files in FASTA format, but with the important option of identifying sequences that are perfect substrings of others. The program attains high speed by using a Patricia tree that is often combined with finite state automata. The substring removal option can be usefully applied to protein sequences that may differ in their inclusion of the initiator methionine, or to mapping short read sequences onto a genome. On protein sequence data, with or without its substring option, patdb runs at about the same speed as nrdb and requires about the same amount of memory. When operating on nucleotide sequences, nrdb may be more practical, because nrdb uses data compression techniques that are unavailable in patdb; nrdb simply cannot identify perfect substrings.
ab-blastall — a PERL script for converting an NCBI blastall command line into a roughly equivalent blasta command line and then invoking blasta. The output is still in AB-BLAST format. This is primarily intended as a technology demonstration tool but may also assist users in their migration from NCBI BLAST to the more accurate AB-BLAST. For benchmarking of BLAST, careful tweaking of parameters may be required, but even with great care, benchmarking for speed can still be confounded by inaccuracies in NCBI BLAST.
ab-formatdb — a PERL script for converting an NCBI formatdb command line into the equivalent xdformat command line and then invoking xdformat. This is primarily intended as a technology demonstration tool but may also assist users in their migration from NCBI BLAST to AB-BLAST.
pam — a program to compute amino acid substitution scoring matrices having arbitrary scales, using the Dayhoff PAM model.
pressdb.real — the legacy pressdb program for any rare users who may be reliant on the NCBI-BLAST 1.4 database format for nucleotide sequences.
setdb.real — the legacy setdb program for any rare users who may be reliant on the NCBI-BLAST 1.4 database format for amino acid sequences.
gb2fasta — a parser to extract nucleotide sequences from GenBank flat files into FASTA format.
gt2fasta — a parser to extract amino acid sequences from CDS features in GenBank flat files and output them in FASTA format.
sp2fasta — a parser to extract protein or nucleotide sequences from EMBL, TrEMBL, or Swiss-Prot database files and output them in FASTA format.
pir2fasta — a parser to extract protein sequences from the old NBRF PIR database files and output them in FASTA format.

seg — a low-complexity filter for protein and nucleotide sequences (Wootton and Federhen, 1993; Wootton and Federhen, 1996). The program identifies low compositional complexity regions.

dust — a low-complexity filter for nucleotide sequences (Hancock and Armstrong, 1994; Tatusov and Lipman, unpublished).

xnu — a low-complexity filter for protein sequences (Claverie and States, 1993). The program identifies short-periodicity repeats.
sysblast.sample — a sample configuration file that system administrators may wish to modify and install as /etc/sysblast. Parameter settings in this file can be used to: limit the number of threads employed by each BLAST search process; change the default number of threads employed per process; alter the “nice” value for BLAST processes; limit the amount of memory utilized by each BLAST process.

AB-BLAST Command Line Options and Parameters

A complete list of command line options and parameters for modifying the behavior of the AB-BLAST search programs is available here.

Comparable AB/NCBI BLAST Parameters

A brief comparison of some of the most important parameters for controlling sensitivity, selectivity and speed of AB-BLAST and NCBI BLAST is available here.

Environment Variables

AB-BLAST can utilize the settings of a few environment variables to adapt its behavior to different computing environments: BLASTDB, BLASTFILTER and BLASTMAT. To allow for triple AB/WU/NCBI BLAST installations, AB-BLAST also supports the environment variables ABBLASTDB, ABBLASTFILTER and ABBLASTMAT, as well as WUBLASTDB, WUBLASTFILTER and WUBLASTMAT. Settings of the AB versions of these variables take precedence over all others, and WU variable settings take precedence over the corresponding base name variables.
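The precedence rule can be expressed as a short lookup. This is a sketch of the rule as stated above, with a hypothetical function name; treating an empty value as unset is an assumption of the example.

```python
import os

def resolve_setting(base_name, env=None):
    """Return the value of AB<name>, else WU<name>, else <name>, else None."""
    env = os.environ if env is None else env
    for prefix in ("AB", "WU", ""):
        value = env.get(prefix + base_name)
        if value:  # assumption: an empty string counts as unset
            return value
    return None

# Hypothetical environment used only for this demonstration.
demo_env = {"BLASTDB": "/data/ncbi", "WUBLASTDB": "/data/wu"}
print(resolve_setting("BLASTDB", demo_env))   # the WU setting wins here
print(resolve_setting("BLASTMAT", demo_env))  # unset
```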

In AB-BLAST, the BLASTDB (or ABBLASTDB) environment variable can be a list of one or more directory names in which the programs are to look for database files. In UNIX parlance, such an environment variable might be called a path for the database files. Directory names should be delimited from one another by a colon (“:”) and listed in the order that they should be searched. If the BLASTDB environment variable is not set, the programs use a default path of .:/usr/ncbi/blast/db, such that the programs first look in the current working directory (“.”) for the requested database and then look in the /usr/ncbi/blast/db directory. For backward compatibility with programs that expect BLASTDB to be a single directory specification and not a path, if the user has set a value for BLASTDB but omitted the current working directory, AB-BLAST will still look for database files in the current working directory as a last resort. This usage is unchanged from NCBI/WU BLAST version 1.4 (1994), except multiple directories could be specified with the BLASTDB variable beginning with WU-BLAST 2.0 ca. 1997.
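The search order described above can be sketched as a path lookup. The example simplifies a database to a single file name, since the actual set of database files is not covered here, and the function names are hypothetical.

```python
import os

DEFAULT_PATH = ".:/usr/ncbi/blast/db"

def search_dirs(blastdb=None):
    """Directories to search, in order, for a given BLASTDB value."""
    dirs = [d for d in (blastdb or DEFAULT_PATH).split(":") if d]
    if "." not in dirs:
        dirs.append(".")  # current directory as a last resort
    return dirs

def find_database(name, blastdb=None, exists=os.path.exists):
    """Return the first candidate path that exists, or None."""
    for d in search_dirs(blastdb):
        candidate = os.path.join(d, name)
        if exists(candidate):
            return candidate
    return None

print(search_dirs())
print(search_dirs("/db/genbank:/db/local"))
```

Passing the existence check as a parameter keeps the lookup easy to exercise without touching the file system.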

The BLASTFILTER (or ABBLASTFILTER) environment variable can be set to the directory containing the sequence filter programs, such as seg and xnu.

The BLASTMAT (or ABBLASTMAT) environment variable can be set to the parent directory for all scoring matrix files. The default directory location for scoring matrix files is beneath the matrix/ subdirectory of the AB-BLAST software installation. Beneath this directory are 4 subdirectories that accommodate the 4 combinations of query-subject alphabets the search programs can use. Before looking anywhere else, though, the search programs first look for the requested scoring matrix in the current working directory.

For more information about environment variables, see the Installation instructions.

Filters and Masks

AB-BLAST provides a highly flexible means of applying both “hard” and “soft” masks to a query sequence, and of supporting alternative, user-defined filter programs as well as non-standard parameters to the standard filters. The filter (for hard masking) and wordmask (for soft masking) command line options provide the basic interface. Multiple specifications of each type are acceptable on the BLAST command line and are executed in left-to-right order.

Individual filter and wordmask specifications may consist of pipelines of commands. For example, three filters are used in succession by this pipeline:

      filter="myfilter1 | myfilter2 | myfilter3 -x5 -"

The first two filters in this case expect to read their input from UN*X standard input (also known as stdin), whereas myfilter3 apparently needs to be told explicitly to read data from stdin, using the conventional “-” symbol for stdin. The standard output (stdout) from myfilter1 will be read via stdin by myfilter2, which in turn processes the query before handing its results to myfilter3; finally, myfilter3 reports its results to stdout, which the BLAST program itself reads to obtain the fully masked sequence. The final output from the filter pipeline is expected by the BLAST program to be in FASTA format.

Instead of running all 3 filters in the above example as part of one pipeline, they could instead be specified as three separate filter options like this:

    filter=myfilter1  filter=myfilter2  filter="myfilter3 -x5 -"

The same choice of running as a pipeline or running separately is available for wordmasks, too. Naturally, the two approaches can also be combined on the same command line. An advantage to using the pipeline approach is that all 3 filters in the example above may complete a little bit faster, because some I/O overhead is avoided. Furthermore, when used in a pipeline, there is no requirement that the output from myfilter1 and myfilter2 actually be in FASTA format. Those two programs could potentially pass information between themselves and to myfilter3 in a proprietary format. The only absolute format requirements are that the first filter in the pipeline must read FASTA data from stdin, and the last filter in the pipeline must output FASTA format. The final output from a filter or pipeline must also have the same length as the original input sequence.
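As a rough illustration of the format requirements above, the following Python sketch shows what a user-defined hard-masking filter might look like. The masking rule (replacing long homopolymer runs with X), the run-length threshold and all function names are hypothetical and chosen only for illustration; this is not one of the bundled filters. The sketch reads FASTA, writes FASTA, and preserves the length of the sequence.

```python
"""Toy hard-masking filter (hypothetical): FASTA in, masked FASTA out."""
import re
import sys


def mask_runs(seq, min_run=6, mask_char="X"):
    """Replace each run of >= min_run identical letters with mask_char.

    The result always has the same length as the input sequence.
    """
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


def filter_fasta(lines, min_run=6, mask_char="X", width=60):
    """Yield output lines: deflines unchanged, residues masked and re-wrapped."""
    chunks = []

    def flush():
        # Join the record first so runs spanning line breaks are detected.
        seq = mask_runs("".join(chunks), min_run, mask_char)
        chunks.clear()
        for i in range(0, len(seq), width):
            yield seq[i:i + width]

    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            yield from flush()
            yield line
        elif line:
            chunks.append(line.strip())
    yield from flush()


def main(stdin=None, stdout=None):
    """Act as a filter: read FASTA from stdin, write masked FASTA to stdout."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    for out_line in filter_fasta(stdin):
        stdout.write(out_line + "\n")
```

If saved as an executable script that ends by calling main(), a program along these lines could be named in a filter option in the same way as myfilter1 above. For nucleotide queries, a mask character such as N would presumably be more appropriate than X.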

It should be noted that with some filter programs, passing the query sequence sequentially through a pipeline of filters may yield a different result than processing the query independently with each filter and OR-ing the results. The script seg+xnu included in the filter/ directory provides an example with which to test this. Specifying filter=seg+xnu on the BLAST command line invokes a seg and xnu pipeline that is built-in to the search programs; whereas specifying filter="seg+xnu -" causes the seg+xnu script to be invoked on the query, which independently executes seg and xnu, then logically “ORs” the results with the bundled pmerge utility program. The built-in seg+xnu pipeline is historically the way these two filters have been invoked together, but the somewhat slower method employed by the seg+xnu script with pmerge may be more desirable.
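The “OR” step of the independent approach can be pictured with a small Python sketch. It assumes each filter returns a copy of the query in which masked positions have been replaced by a mask character, and it marks a position as masked if either output masks it. The function is illustrative only and is not the implementation of the bundled pmerge utility.

```python
def or_merge(original, masked_a, masked_b, mask_char="X"):
    """Return the query masked wherever either filter output is masked."""
    if not (len(original) == len(masked_a) == len(masked_b)):
        raise ValueError("filter outputs must match the query length")
    merged = []
    for orig, a, b in zip(original, masked_a, masked_b):
        # Logical OR of the two masks; unmasked positions keep the query residue.
        masked = (a == mask_char) or (b == mask_char)
        merged.append(mask_char if masked else orig)
    return "".join(merged)
```

In a sequential pipeline, by contrast, the second filter sees the already-masked output of the first, which is why the two approaches need not agree.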

The echofilter option can be used to display the filtered sequence near the beginning of search program output.

Precomputed Statistical Parameters

Nucleotide Scoring Systems

Precomputed values for λ, K and H are available for BLASTN searches with the following match,mismatch (M,N) scoring systems, using the sets of gap penalties {Q,R}:

Precomputed Nucleotide Scoring Systems
M	N	{Q,R}
+1	−3	{3,3} {3,2} {3,1} {7,2}
+1	−2	{2,2} {2,1} {1,1}
+3	−5	{10,5} {6,3} {5,5}
+4	−5	{10,5}
+1	−1	{3,1} {2,1}
+5	−4	{20,10} {10,10}
+5	−11	{22,22} {22,11} {12,2} {11,11}

Precomputed values are also available for a Purine-Pyrimidine scoring matrix named “pupy”:

`PuPy` Matrix
Q	R
20	10
10	10

Protein Scoring Systems

Precomputed values for λ, K and H are available for protein-level searches (BLASTP, BLASTX, TBLASTN and TBLASTX) with the following scoring matrix and gap penalty combinations (or gap penalty ranges for R) {Q, R}:

`BLOSUM50`
Q	R
16	1–4
15	1–4, 6, 8
14	1–5, 8
13	1–5, 8
12	2–5, 7
11	2–4, 6, 8
10	2–6, 8
9	3–5, 7
8	4–8
7	6, 7

`BLOSUM55`
Q	R
16	1–4
15	1–4, 6, 8
14	1–5, 7
13	2–5, 8
12	2–5, 8
11	2–6, 8
10	3–6, 9
9	3–5, 7
8	4–8
7	7

`BLOSUM62`
Q	R
12	1–3
11	1–3
10	1–4
9	1–5
8	2–7
7	2–6
6	3–5
5	5

`BLOSUM80`
Q	R
12	2–12
11	2–11
10	2–10
9	3–9
8	4–8
7	5–7

`PAM40`
Q	R
12	1, 2, 6
11	1, 2, 7
10	1–3, 7
9	1–3, 6
8	1–4
7	1–4
6	2–5
5	2–5
4	3, 4

`PAM120`
Q	R
12	1, 2, 4
11	1–3
10	1–3, 5
9	1–3, 5
8	1–4, 6
7	2–4, 6
6	2–5
5	3–5

`PAM250`
Q	R
16	1–4
15	1–5
14	1–6
13	1–6
12	2–7
11	2–7
10	3–8
9	3–7
8	5–7
7	7
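The R column in the tables above mixes single values with inclusive ranges. As a reading aid, the Python sketch below expands that notation and checks whether a given {Q,R} pair is listed for BLOSUM62. The dictionary is transcribed from the BLOSUM62 table in this section; the function names are hypothetical, and the lookup mirrors only this document, not the tables built into the search programs.

```python
def expand_r(spec):
    """Expand an R specification such as "1-5, 8" into a set of integers."""
    values = set()
    # Accept both the en dash used in the tables and a plain hyphen.
    for part in spec.replace("\u2013", "-").split(","):
        part = part.strip()
        if "-" in part:
            low, high = (int(x) for x in part.split("-"))
            values.update(range(low, high + 1))  # ranges are inclusive
        elif part:
            values.add(int(part))
    return values


# Transcribed from the BLOSUM62 table above: Q -> listed R values.
BLOSUM62_LISTED = {
    12: "1-3", 11: "1-3", 10: "1-4", 9: "1-5",
    8: "2-7", 7: "2-6", 6: "3-5", 5: "5",
}


def is_listed(q, r, table=BLOSUM62_LISTED):
    """True if the {Q,R} combination appears in the given table."""
    return q in table and r in expand_r(table[q])
```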

Bugs

AB-BLAST is certainly not bug free, but historically, bugs have typically been fixed within a day of being reported. The currently known bugs are:

The scale of the included BLOSUM80 scoring matrix is 1/3 bit, rather than the 1/2 bit scale used otherwise for BLOSUM60 and above (BLOSUM60, 62, 70, 90, and 100). This anomaly—which goes all the way back to NCBI BLAST 1.3 in 1993—may be corrected (along with revised gapped lambda, K and H parameters) in a future release.

If you think you might be experiencing the effects of a bug, please contact us.

AB-BLAST exhibits a few different behaviors worth mentioning here, because they could trip up or confuse even the most knowledgeable of BLAST users. Any unexpected behavior might rightfully be construed as being a bug, so the following information is provided here in the Bugs section to help avoid the unexpected. If you should encounter problems or confusing areas other than those described below, or if you have questions or suggestions for improvement, please send them to us.

With the December 2018 release of AB-BLAST, a valid license must be installed to run some of the most important programs in the suite. See the Installation section for details.
AB-BLAST 3.0 establishes a new default scoring system for BLASTN. The new scoring system is match and mismatch scores M=1 N=−3 and gap penalties Q=7, R=2. The +1/−3 scoring system is more efficient at finding nearly identical sequences — the most frequent use for BLASTN — compared to the old +5/−4 scoring system. The +1/−3 scoring system is also more consistent with the BLASTN default word length of 11, which also selects for nearly identical sequences. The WU-BLASTN default scoring system (M=5 N=-4 Q=10 R=10) will be restored if the compat2.0 option is specified.
Much confusion had been caused over the years by the default WU-BLASTN scoring system, which dates back to the earliest incarnations of NCBI BLAST in 1989. The NCBI changed its default scoring system to +1/-3 upon introduction of blastall in 1997. For the sake of long-term compatibility and consistency, the scoring system had been left unchanged in WU-BLAST.
NOTE: The amino acid scoring system used by default in blastp, blastx, tblastn and tblastx remains unchanged in AB-BLAST but differs slightly from the NCBI amino acid scoring system. A major difference in default behavior between AB-BLAST and NCBI-BLAST (whether to filter query sequences for low-complexity regions) remains unchanged in AB-BLAST. Namely, just as WU-BLAST did not filter query sequences by default, AB-BLAST does not filter query sequences by default either.
With its newer “BLAST+” package, the NCBI seems intent on confusing the marketplace and squelching the competing AB-BLAST effort, by abruptly using conflicting program names (blastp, blastn, blastx, tblastn and tblastx) that it had abandoned in 1997. These program names had been in use by AB-BLAST and WU-BLAST before it for 17+ years. The NCBI BLAST+ programs use an entirely different command line syntax from vintage 1994 NCBI/WU-BLAST (as well as vintage 1997 NCBI-BLAST). In contrast, through considerable effort, WU-BLAST and AB-BLAST have maintained a high degree of backward compatibility in command line usage and have used the original BLAST program names continuously since 1994. Advanced Biocomputing suggests the behavior of the U.S. Government agency is anti-competitive and amounts to abuse of its monopoly position. The conflict can be mitigated on an individual basis by renaming the search programs in either package. We encourage you to seek a better solution by writing to your U.S. Congressional Representative and Senators. The Congressional mandate that created the NCBI in 1988 did not endow the agency with the right to impede outside R&D or discourage business, nor did Congress intend it to do so. Thank you for your support.
Due to support added for the amino acid codes J and O, XDF protein sequence databases produced with the AB- version of xdformat are not readable by programs in the old WU-BLAST package. For an XDF protein sequence database named “foo” created by WU-xdformat, the AB-xdformat command:
```
     xdformat -p -i foo
```
will report the alphabet name as “NCBIstdaa(1)” (NCBIstdaa version 1). The larger amino acid alphabet normally used by the AB- version of xdformat is named “NCBIstdaa(2)” (NCBIstdaa version 2). WU-BLAST only uses the version 1 alphabet, whereas AB-BLAST creates new databases using version 2, can read and update existing databases in either alphabet, and can read combinations of the two alphabets in virtual databases.
N.B. If a protein database created by WU-xdformat is updated using AB-xdformat, the alphabet is silently updated to NCBIstdaa version 2, which will render the database subsequently unreadable by programs in the WU-BLAST package. This warning does not currently apply to nucleotide sequence databases, because no change has thus far been necessary in the nucleotide alphabet used by AB-BLAST.
The amino acid codes U (selenocysteine or Sec) and O (pyrrolysine or Pyl) are acceptable in query and database sequences, but the scoring matrices distributed with AB-BLAST do not specify scores for these letters. By default these letters are scored the same as alignment with an X (unknown residue) would be scored, except for their self-alignment scores (i.e., U with U and O with O) which are set to 0 by default. If more meaningful scores are known, alternative scores for these letters can be set explicitly in the amino acid scoring matrices.
The only accepted way to specify an alternative scoring matrix file is to refer to the file by name (e.g., matrix=BLOSUM55) and for the file to reside in the current working directory or for the path to the file to be listed in the BLASTMAT environment variable. If both a path and file name to a scoring matrix file are specified, such as in matrix=/usr/local/blast/matrix/aa/BLOSUM62 or matrix=aa/blosum62, the search programs will claim not to be able to find the file even though it may indeed exist and be readable. This is a security measure that may allow managers of network- or web-based search services to expose the command line to users without opening up access to potentially any file on the server, when the mere knowledge that a file exists might be considered a breach of security.
The gap penalty parameters Q and R of AB-BLAST have similar but important differences in interpretation from the parameters G and E of NCBI Gapped BLAST. While the two extension penalties R (AB-BLAST) and E (NCBI-BLAST) are analogous, Q (AB-BLAST) is analogous to the sum of G and E with NCBI-BLAST. In other words, where Q represents the total penalty for a gap of length 1, NCBI Gapped BLAST computes this penalty as G + E.
The default sort order for reporting database hits is by increasing E-value (most-to-least significant ordering), but for a given database hit, the alignments or HSPs with that sequence are sorted primarily by query strand, secondarily by the database (“subject”) strand, and only then by E-value. For example, if any alignments of a given database sequence are to the minus strand of the query, they will be reported after any alignments to the plus strand, even if alignments to the plus strand are less significant. In a TBLASTX search, in which both the query and subject are translated nucleotide sequences, for each strand of the query, hits to the plus strand of the subject will be reported before any hits to its minus strand. Consequently, identifying the HSP ascribed with the greatest statistical significance may require many lower-significance alignments to be parsed first. Naturally, this consideration is not an issue for BLASTP searches, where only one “strand” of query and subjects is searched.
On those rare computing platforms today that do not support “large” files (files >2 GB in size), users will be unable to search databases larger than about 8 billion nucleotides or 2 billion amino acids. Migrating to a contemporary 32-bit operating system, or to a 64-bit computing platform that provides “large file” support, is sufficient to break through the “2 GB barrier”.
The statistical significance of gapped alignment scores is computed using values for λ, K and H obtained from built-in, precomputed tables. (The values for λ, K and H used to assess the significance of ungapped alignment scores are still computed at run time, as is practical.) These parameter values are determined by the scoring matrix and gap penalties being used. Precomputed values are necessarily not available for all scoring matrix and gap penalty combinations, though, and the precomputed values may not be well-suited to an unusual residue composition of the query or database sequences. When precomputed values are unavailable, the programs issue a relevant WARNING message and proceed to evaluate gapped alignment scores using values for λ, K and H that are likely to be incorrect: the values computed at run time for ungapped alignments. In such cases, the reported significance estimates may be highly inaccurate and will be biased towards being overly significant. If the user knows more accurate parameter values for their situation, however, the gapK, gapL and gapH command line options can be used to set them.
Selecting an alternative scoring matrix does not alter the gap penalties (Q and R) from their default values. Leaving gap penalties at their default values when choosing an alternative scoring matrix can not only result in alignments with undesirable gap characteristics but can create a situation in which the programs do not have precomputed values in their built-in tables for λ, K and H. Worst-case, the end result can be that the alignments represent horribly inaccurate mappings between the query and subject sequences and the P-values ascribed to the alignments are horribly inaccurate as well. (Actually, a worst-case scenario might be when the alignments and statistics are bad but not bad enough to be noticed by the user, who then proceeds to use the results—both false positives and false negatives—as though they were meaningful.) As described earlier, a WARNING message will be displayed when precomputed values are not available, but nevertheless the search will go on and the alignments and statistics may be anywhere from slightly to horribly misleading.
The hspsepqmax and hspsepsmax parameters are measures of distance in residues along the sequences in the specific form in which they are actually compared. For instance, in a BLASTX search (conceptually translated nucleotide query compared against a protein sequence database), hspsepqmax refers to a distance measured in amino acid residues, not the underlying nucleotides in the query.
ASN.1 formatted output is not available from AB-BLAST. XML and tab-delimited output formats are recommended instead. (See the mformat parameter.)
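Two of the notes above, the relationship between the Q/R and G/E gap penalties and the reporting order of HSPs within one database hit, can be made concrete with a short Python sketch. The conversion follows directly from the statement that Q is the total penalty for a gap of length 1 (Q = G + E, R = E). The HSP record layout and all function names are hypothetical and do not correspond to any AB-BLAST output format.

```python
from collections import namedtuple


def ncbi_to_ab_gap_penalties(g, e):
    """Convert NCBI Gapped BLAST (G, E) into AB-BLAST (Q, R): Q = G + E, R = E."""
    return g + e, e


def ab_to_ncbi_gap_penalties(q, r):
    """Inverse conversion: G = Q - R, E = R."""
    return q - r, r


def gap_cost(q, r, length):
    """Total penalty for a gap of `length` residues under the Q/R convention."""
    return q + r * (length - 1)


# Hypothetical record: strands are "+" or "-", evalue is a float.
Hsp = namedtuple("Hsp", "query_strand subject_strand evalue")

_STRAND_RANK = {"+": 0, "-": 1}


def report_order(hsps):
    """Order HSPs by query strand, then subject strand, then E-value."""
    return sorted(
        hsps,
        key=lambda h: (_STRAND_RANK[h.query_strand],
                       _STRAND_RANK[h.subject_strand],
                       h.evalue),
    )


def most_significant(hsps):
    """The best HSP may sit anywhere in the report, so scan all of them."""
    return min(hsps, key=lambda h: h.evalue)
```

For example, NCBI penalties G=11, E=1 correspond to Q=12, R=1 under this relationship, and a parser looking for the most significant HSP of a hit cannot simply take the first alignment reported.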

Supported Platforms

The computing platforms currently supported for AB-BLAST are listed below.

Linux kernel versions 2.6+ for 64-bit X64
Apple macOS 10.9+ (“Mavericks” and later) for 64-bit X64

The list of supported platforms is subject to change without notice.
Through multithreading, multiple processors and processor cores are supported by AB-BLAST on all of the above platforms.

Installation

Prior to software installation, make sure you have obtained a license file named license.xml. This file will normally have been sent as an email attachment from Advanced Biocomputing, LLC. An active license is required to run all of the AB-BLAST search programs plus several of the support programs (nrdb, patdb and all of the *2fasta programs). An active license will never be required to run the xdformat and xdget programs. This ensures you can recover your data from AB-BLAST databases even if your license has expired, and allows creation of searchable databases for users who do have active AB-BLAST licenses.

At the confidential URL that was emailed to you, download the compressed tar.gz archive for your computing platform. The compressed tar archive will unpack into a subdirectory named ab-blast-YYYYMMDD-os-arch, where the date or version of the AB-BLAST package is indicated by YYYYMMDD. The operating system is indicated by os (e.g., linux or macos) and arch describes the hardware architecture (e.g., x64).

Mac users should download the ab-blast-YYYYMMDD-macos-x64.dmg file instead, then double-click the file to open and reveal an installer. After successfully running the installer, you should find the AB-BLAST programs installed beneath the /usr/local/ab-blast directory. You will likely want to add that directory to the PATH environment variable for your login shell.

Individual users of AB-BLAST must place their license.xml file in the directory ~/.config/ab-blast, where ~ (tilde) signifies the home directory. If the directory does not exist, it must first be created. For site licensees only, the license.xml file can be conveniently placed in the same directory as the AB-BLAST software, to enable access for all users of the computer system.

Note that the programs blastp, blastn, blastx, tblastn and tblastx are actually “hard links” to the same executable program (blasta) that encodes the integrated capabilities of all 5 search methods. If desired, these links can be renamed, as long as the original names appear as substrings within the new names. For instance, a link named ab-blastp will still invoke the search program in its blastp operational mode.

Similarly, the xdformat and xdget programs are hard links to the very same program that operates differently depending on the name by which it is invoked.
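The name-based dispatch described for the search programs can be pictured with the following Python sketch. It only illustrates the substring rule stated above and is not the actual logic inside blasta. One assumption is made here: because blastn and blastx are themselves substrings of tblastn and tblastx, the sketch tests the longer names first.

```python
import os

# Longer names first, because "blastn" is a substring of "tblastn",
# and "blastx" is a substring of "tblastx".
SEARCH_MODES = ("tblastn", "tblastx", "blastp", "blastn", "blastx")


def mode_from_invocation(argv0):
    """Return the search mode implied by the invoked program name, or None."""
    name = os.path.basename(argv0)
    for mode in SEARCH_MODES:
        if mode in name:
            return mode
    return None
```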

If you previously had AB-BLAST (or WU-BLAST) installed with BLAST-able databases, your installation or update of AB-BLAST is likely complete. If you did not have AB-BLAST or WU-BLAST already installed, read on...

Low-complexity sequence filters or masking programs — e.g., seg, xnu and dust — are included in AB-BLAST distributions. The bundled versions of these programs are precompiled and optimized. While these filter programs are not required for running the search programs, they can enormously reduce search times, the amount of garbage output produced, and the memory used by the programs.

NOTE: unlike NCBI-BLAST, AB-BLAST does not employ sequence complexity filtering by default. This behavior might change in the future, though. In case the search programs are updated to a version that does perform complexity filtering by default and you wish to guarantee an automated analysis pipeline will not perform this filtering, you can specify filter=none on the BLAST command line to maintain the behavior.

The databases themselves are not included with the AB-BLAST software. Once the source databases have been downloaded from any of many Internet sites, the database files are typically uncompressed and processed into FASTA format, if they are not in FASTA format already. Included in the tar archives are several utility programs for converting plain text database files into FASTA format:

gb2fasta converts the nucleotide sequences in GenBank flat files into FASTA format.
gt2fasta converts the CDS translations in GenBank flat files into FASTA format.
sp2fasta converts EMBL or UniProt/Swiss-Prot flat files into FASTA format.

The NCBI software Toolbox also contains some relevant parsers. One of these is asn2fsa, which converts both nucleotide and peptide sequences in GenBank ASN.1 format into FASTA format files. The asn2ff parser, which converts GenBank ASN.1 data into other flat file formats, may also come in handy, especially if you are inclined to parse GenBank into FASTA using your own routines or use the gb2fasta and gt2fasta programs mentioned above.

All of the above parsers can read from standard input (signified by a hyphen, “-”), so their input files can be maintained on disk in compressed format and streamed uncompressed directly into the parsers with zcat, gunzip or another suitable decompression program. Because command line options themselves start with hyphens, if a hyphen is needed to specify standard input for the input file name, some of these programs require that a double-dash (--) be entered on the command line before the single-dash. This double-dash signifies the end of options and the start of the required filename arguments.

Once a source database is in FASTA format, the xdformat program should be used to convert it into “blastable” format. Concise usage instructions for xdformat (and xdget) can be obtained by invoking each program without any command line arguments. By default, xdformat produces 3 output files whose names are derived from the name of the FASTA input file. The 3 output files have distinct file name extensions and together comprise the blastable database. If sequence identifiers are optionally indexed during database creation, the blastable database will consist of a total of 4 output files. Databases formatted by xdformat retain full ambiguity code information.

By default, if any unrecognized amino acid or nucleotide codes are encountered or if the FASTA input file should otherwise appear corrupt, xdformat will emit an error message and halt. In such cases, if the blastable database was to be newly created, xdformat will remove the blastable database files it was creating before halting. If an existing blast database was being appended with new sequences when the error arose, the blastable database will be rolled back to its original state prior to the attempted update, with none of the new sequences appended.

While formatting a database, the xdformat program can optionally (-I option) index the sequence identifiers for later identifier-based retrieval with the xdget program. XDF databases that were formatted without an identifier index can have an index created post hoc by xdformat with its -X option. It may be of interest to note for the purposes of their maintenance that xdformat and xdget are actually one-and-the-same program file, merely invoked under the two different names to obtain the two different program behaviors. This helps ensure that the index created with xdformat will be compatible with xdget. See the file "FAQ-Indexing.html" for more details on identifier indexing.

For compatibility with legacy BLAST installations, the xdformat program can function in a setdb- and pressdb-compatibility mode, wherein its behavior is similar to that of setdb and pressdb. In its compatibility mode, a similar command line structure is used and the output files produced have the same names as those produced by setdb and pressdb. Compatibility mode is invoked when xdformat is renamed or has links pointing to it named setdb and pressdb. While the files produced in compatibility mode have the same file names as those produced by the original setdb and pressdb programs (setdb.real and pressdb.real), the content of these files is always XDF. Versions of the BLASTA search program dated on or after 1999-12-14 are able to work with the more-capable XDF databases.

Note that two XDF databases — one protein and one nucleotide — can be created with the exact same name and exist in the exact same directory, because the 3-letter file name extensions of XDF databases are completely distinct for protein and nucleotide sequence databases.

If xdformat and the legacy setdb and pressdb programs have all been used to create databases with the same name that reside in the same directory, the BLAST search programs will preferentially search the databases created with xdformat, which will have the standard XDF database file name extensions.

Using the -t option to xdformat, a descriptive name or title can be assigned to a database that will appear in BLAST search output. The title of an existing database can be changed after its creation, by appending an empty FASTA database and specifying the -t option with the desired new title. For example,


     xdformat -n -a mydb -t "Fancy New Title" /dev/null

The blastable database files can be placed anywhere, but for convenience the BLASTDB environment variable should include their directory location. If the BLASTDB environment variable is not set, the programs look for databases by default in /usr/ncbi/blast/db and in the current working directory. If the old pressdb program (instead of xdformat) is used to create the blastable database, the associated nucleotide sequence FASTA file must be located in the same directory as the three output files from pressdb, if the BLAST search programs are to find the FASTA file. It may sometimes be useful to maintain the FASTA files in a separate directory — even on another disk partition — and provide UNIX soft links in the BLASTDB directory that point to the real location of the FASTA files. In addition, on systems where NCBI BLAST will not be in use, blastable databases can be maintained in multiple directories listed in the BLASTDB environment variable, with each directory name delimited from the next by a colon (:), just as directory names are often delimited in the PATH environment variable.
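The directory lookup described above can be summarized in a brief Python sketch. Several details are assumptions made for illustration: the function name, the order in which the two default locations are tried, and the choice to consult ABBLASTDB before BLASTDB. Whether the current working directory is also searched when the variable is set is not stated here, so the sketch leaves it out.

```python
import os

DEFAULT_DB_DIR = "/usr/ncbi/blast/db"


def database_dirs(environ=None):
    """Return the directories in which a database name would be looked up."""
    environ = os.environ if environ is None else environ
    # Assumption: the AB-specific variable takes precedence when both are set.
    value = environ.get("ABBLASTDB") or environ.get("BLASTDB")
    if value:
        # Directories are colon-delimited, as in PATH; skip empty entries.
        return [d for d in value.split(":") if d]
    return [DEFAULT_DB_DIR, os.getcwd()]
```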

On multi-processor computer systems, the search programs will employ as many CPUs as are installed. When more than about 4 CPUs are used, this default behavior causes efficiency of hardware utilization to be quite low, compared to running individual single-threaded BLAST jobs on each CPU. Memory use also increases linearly with the number of CPUs or threads employed. One way to govern the number of processors employed is to wrap the search programs in a shell script that sets a lower number of CPUs via the cpus=# command line option. Another, simpler approach to changing the default number of CPUs for all users follows below, for implementation by BLAST system managers possessing “root” or “SuperUser” privileges.

Distributions of AB-BLAST include a sample file named sysblast.sample that illustrates the system-wide configuration parameters that can be established to govern the execution of BLAST jobs and, thereby, provide a more productive, trouble-free level of service. When the sysblast file is installed under the name /etc/sysblast, all BLAST jobs executed on a given computer system can be made subject to the following parameters:

cpusmax=<n>: a hard limit on the number of CPUs or threads employed by each BLAST job; it is possible to prohibit BLAST searches entirely on a given computer by configuring a negative value for cpusmax;
cpus=<n>: the default number of CPUs or threads employed per BLAST job;
nice=<n>: a “nice” value for altering the priority of BLAST processes. As is standard for UNIX operating systems:
- positive nice values correspond to lower priority;
- only the root user can run at negative nice values (higher priority);
- any nice value set in /etc/sysblast is added to the current nice value of a BLAST process.
memmax=<n>: the maximum amount of memory that may be allocated by any single BLAST job. The interpretation and recommended usage of memmax are:
- memmax is expressed in units of bytes, with optional modifiers k (kilobytes), m (megabytes), and g (gigabytes);
- it is almost certainly a bad idea to set memmax to a value that is greater than the actual amount of memory (silicon RAM) installed in the computer;
- if memmax=0, the effective limit is “unlimited”, or the natural upper limit for a process executing under the given operating system;
- values of memmax < 0 are ignored, in which case the standard UNIX datasize resource limit set by the user’s command shell governs BLAST memory usage instead.
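The memmax rules listed above can be expressed as a small Python sketch. It is a reading aid only, not the parser used by the search programs. The function name and return convention are hypothetical, and the sketch assumes the k, m and g modifiers denote binary multiples (1024-based), which the text above does not specify.

```python
import re

# Assumption: modifiers are binary multiples of bytes.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def interpret_memmax(text):
    """Classify a memmax setting as ("limit", bytes), ("unlimited", None)
    or ("ignored", None)."""
    match = re.fullmatch(r"\s*(-?\d+)\s*([kmg]?)\s*", text)
    if match is None:
        raise ValueError("unrecognized memmax value: %r" % text)
    value = int(match.group(1)) * _UNITS[match.group(2)]
    if value < 0:
        # Ignored: the shell's datasize resource limit governs instead.
        return ("ignored", None)
    if value == 0:
        # No limit beyond the operating system's natural upper limit.
        return ("unlimited", None)
    return ("limit", value)
```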

The sysblast file is only effective when installed in the /etc directory. The /etc directory generally resides locally to any given computer system, so parameter settings can be tailored to each computer, even if the BLAST software is maintained on a shared disk partition. The /etc directory should only be writable by “root”. Unlike the shell script wrapper approach described above, the limits set in /etc/sysblast typically can not be circumvented by normal (non-root) users of a computer system. See the comments included in the sample sysblast file for further details.

Differences between AB-BLAST and WU-BLAST

Apart from bug fixes, the most outward differences in usage and appearance of AB-BLAST and WU-BLAST include:

The default scoring system for AB-BLASTN is match/mismatch scores M=+1 N=−3 with gap penalties Q=7 R=2; whereas WU-BLASTN uses M=+5 N=−4 with gap penalties Q=10 R=10 by default.
In all search modes, the default value for the gapped alignment drop-off score gapX is ≈50% higher for AB-BLAST, which will tend to make the AB-BLAST search programs slightly more sensitive and just slightly slower.
AB-BLAST supports an expanded amino acid alphabet, compared to the amino acid alphabet used by WU-BLAST. Programs in the WU-BLAST package are consequently unable to search or modify protein sequence databases that were created or modified by AB-xdformat. Once a protein sequence database created with WU-xdformat has been modified by AB-xdformat, it can no longer be searched or modified by any of the WU-BLAST programs. Databases created by WU-xdformat can be searched and modified by the AB-BLAST programs. At least for the time being, the AB-BLAST search programs can also search virtual databases that are a combination of databases created with WU-xdformat and AB-xdformat.
No difference currently exists between the nucleotide alphabets used by AB-BLAST and WU-BLAST or the ability of programs in either package to search/modify nucleotide sequence databases created/modified by programs in the other package.
The bundled BLOSUM30 and BLOSUM35 scoring matrices have been re-scaled to provide better precision.
The bundled amino acid scoring matrices — and the matrices output by the pam program — now contain a J row and a J column. These matrices are incompatible with WU-BLAST, which does not support the letter J and will report a FATAL error when reading the files. The AB-BLAST amino acid scoring matrices are slightly different from the matrices distributed by the NCBI, which also indicate scores for the letter J, but at the time of this writing the NCBI matrices are cross-compatible with AB-BLAST.
AB-BLAST supports the amino acid letter code O (“oh”) normally used to represent Pyrrolysine (Pyl), whereas WU-BLAST does not. The letter O may appear in query sequences, database sequences, scoring matrix files and with command line parameters such as the altscore option, but the scoring matrices bundled with AB-BLAST do not actually utilize this letter. The default score for aligning any other letter with O is the same score as for aligning with X, whereas the O self-alignment score defaults to zero (0).
The AB-BLAST 3.0 search programs support a new compat2.0 option to obtain roughly equivalent parameter settings to those used by WU-BLAST 2.0.
The analog to wu-blastall is named ab-blastall.
The analog to wu-formatdb is named ab-formatdb.
AB-BLAST programs preferentially use settings of the new environment variables ABBLASTMAT, ABBLASTDB and ABBLASTFILTER. See the section on Environment Variables for important details. When upgrading from WU-BLAST to AB-BLAST, due to the support for the letter J in AB-BLAST, it is important to ensure that the AB-BLAST search programs use the bundled scoring matrices rather than the old matrices that were distributed with WU-BLAST, because of the latter matrices’ lack of support for the letter J.
The maximum allowable value for the dbslice parameter has been increased.
The sp2fasta program parses input more reliably.
The first few lines of output, including the program declaration line and copyright notice, are different. See Citing BLAST for examples of the program declaration line from AB-BLAST.
Programs in the AB-BLAST package that use the UNIX standard getopt() function to parse the command line will now uniformly across all computing platforms produce “POSIXLY_CORRECT” behavior. (N.B. The BLAST search programs do not use getopt(), but most other programs in the package, including xdformat and xdget, do.) This means some command lines that are acceptable to WU-BLAST on some computing platforms (usually Linux) may be rejected by their AB-BLAST counterpart and need to be restructured. This mostly can happen if not all options are specified before (to the left of) required arguments.
Better thread management under energy conservation conditions.
Better memory management under macOS.
AB-BLAST no longer looks for matrix files or complexity filter programs beneath /usr/ncbi/blast. This behavior was a relic of WU-BLAST's early lineage.
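
The “POSIXLY_CORRECT” behavior noted above for the getopt()-based programs can be illustrated with a minimal sketch using Python's standard getopt module, which offers both a POSIX-style parser and a GNU-style parser. This is only an analogy for how the two parsing conventions differ; it is not the AB-BLAST implementation. The option letters (-v, -o) and the file names are hypothetical and are not taken from the xdformat or xdget documentation.

```python
import getopt

# Hypothetical option letters for illustration only: -v (flag), -o VALUE.
SHORT_OPTS = "vo:"


def parse_posix(argv):
    """POSIX-style parsing: scanning stops at the first non-option argument."""
    return getopt.getopt(argv, SHORT_OPTS)


def parse_gnu(argv):
    """GNU-style parsing: options and arguments may be intermixed.

    Note: Python's gnu_getopt itself falls back to POSIX-style parsing
    when the POSIXLY_CORRECT environment variable is set.
    """
    return getopt.gnu_getopt(argv, SHORT_OPTS)


options_first = ["-v", "-o", "out", "input.fa"]
options_last = ["input.fa", "-v", "-o", "out"]

# Options before the required argument: both conventions agree.
print(parse_posix(options_first))  # ([('-v', ''), ('-o', 'out')], ['input.fa'])

# Options after the required argument: the POSIX-style parser treats
# everything from 'input.fa' onward as plain arguments.
print(parse_posix(options_last))   # ([], ['input.fa', '-v', '-o', 'out'])

# The GNU-style parser still recognizes the trailing options.
print(parse_gnu(options_last))
```

In the POSIX-style case, the trailing -v and -o are not seen as options at all, which is why a command line written with options after the required arguments may need to be reordered.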

Citing BLAST

Citations or acknowledgments of AB-BLAST usage are greatly appreciated, as are any personal accounts of how the software is being used that you might wish to share. When URLs are acceptable, please cite with:

   Gish, W. (1996-2019) https://blast.advbiocomp.com

When URLs are not acceptable, please use:

   Gish, W. (unpublished).

In scientific communications, it is important to report both the program name and the specific version used. In the case of AB-BLAST 3.0, the version is a combination of the version number, release date, target platform, and build date. The release date is the first (left-most) date displayed on the first line of output and corresponds to the freeze date of the source code. The build date is the second date reported and corresponds to the date and time the executables were built for the indicated target platform. Both dates are reported in program output in ISO 8601 format.

For example, consider this introductory line of output from AB-BLAST 3.0:

  BLASTN 3.0 [2018-12-16] [macos-x64 2018-12-19T22:14:33]

Here the program name is BLASTN, the software version is “3.0”, the release date is December 16, 2018, the target platform is macos-x64 (64-bit macOS), and the build date and time are December 19, 2018, at 22:14:33 (10:14 PM).
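
For scripts that record the version for later citation, the fields described above can be extracted from the program declaration line. The following is a minimal Python sketch; the regular expression is inferred solely from the single example line shown above, so it is an assumption that other releases or platforms follow exactly the same layout. The function name parse_declaration is hypothetical.

```python
import re
from datetime import date, datetime

# Pattern inferred from the one example line in this section (an assumption):
#   PROGRAM VERSION [RELEASE-DATE] [PLATFORM BUILD-DATETIME]
DECLARATION = re.compile(
    r"^\s*(?P<program>\S+)\s+(?P<version>\S+)\s+"
    r"\[(?P<release>\d{4}-\d{2}-\d{2})\]\s+"
    r"\[(?P<platform>\S+)\s+(?P<build>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\]"
)


def parse_declaration(line):
    """Split a program declaration line into its citable components."""
    match = DECLARATION.match(line)
    if match is None:
        raise ValueError("not a recognized program declaration line: %r" % line)
    return {
        "program": match["program"],
        "version": match["version"],
        # Both dates are reported in ISO 8601 format.
        "release_date": date.fromisoformat(match["release"]),
        "platform": match["platform"],
        "build_time": datetime.fromisoformat(match["build"]),
    }


info = parse_declaration("  BLASTN 3.0 [2018-12-16] [macos-x64 2018-12-19T22:14:33]")
print(info["program"], info["version"], info["release_date"], info["platform"])
```

Raising an error on an unrecognized line, rather than guessing, keeps a changed output format from silently producing a wrong version string in a report.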

Historical Notes

The original description of the (1-hit) BLAST algorithm was published by Altschul et al. (1990). In addition to the algorithm itself, BLASTP and BLASTN functionality are described, without referring to the programs by name. BLASTX-like functionality is briefly mentioned as being in progress (again not by name), but TBLASTN was actually the third BLAST search mode implemented. Statistical significance of the ungapped alignments found by the programs was assessed using “Karlin-Altschul” statistics — sometimes also referred to as “Karlin-Dembo-Altschul” statistics, due to a major contribution of Amir Dembo.
In December 1989, prior to the development of the World Wide Web, the NCBI Experimental BLAST Network Service was opened to the public. The BLAST network service provided fast, convenient client-server access from anywhere on the Internet to the very latest versions of the recently parallelized BLAST search programs running on powerful 8–16 processor Silicon Graphics servers at the NCBI. The BLAST servers searched against a comprehensive set of public sequence databases that were updated daily. Users could access the BLAST servers transparently using a UN*X command line client that was invoked just like the BLAST application programs themselves, or via a graphical client named HyperBLAST (J.M. Cherry, 1990, unpublished) created with HyperCard. At about this time, the “nr” (quasi-non-redundant) protein and nucleotide sequence databases were also established (W. Gish, unpublished). The nr database — protein and nucleotide — quickly became the standard database searched with BLAST, and users could often do so in a matter of just a few seconds. The experimental BLAST service was ultimately discontinued a decade later, in March 2000. Experience gained from providing a service that could arbitrate many simultaneous and diverse requests for BLAST helped Gish design a more flexible and robust network service architecture known as the NCBI “Dispatcher”, which was then largely implemented by others at the NCBI (principally Jonathan Epstein) and went into operation ca. 1995. At the request of NCBI management, the experimental BLAST service was never published and remains W. Gish (unpublished). Awareness of the service nevertheless spread quickly by word-of-mouth, as was the case for the later WU-BLAST.
The BLASTX program first appeared in the release of BLAST version 1.1 in July 1990. The program was later described and evaluated by Gish and States (1993). The BLAST3 program (Altschul and Lipman, 1990) was also folded into the BLAST 1.1 release and parallelized. The use of Poisson statistics to evaluate the joint probability of multiple HSPs from a given (query, subject) sequence pair, as had been suggested by Karlin and Altschul (1990), was also first featured in BLAST 1.1.
The BLASTC program, a specialized version of BLASTX that considered codon usage information in addition to sequence similarity (States and Gish, 1994), appeared only once, in the BLAST 1.3 distribution. The BLAST 1.3 distribution was also the last to include the BLAST3 program.
BLAST 1.4 (W. Gish, 1994, unpublished) was the first version to use Karlin and Altschul (1993) “Sum” statistics to evaluate the joint probability of finding multiple HSPs between a given pair of sequences. Sum statistics were found to be more practical in a biological context than the Poisson statistics utilized by default in BLAST 1.3.
The TBLASTX program first appeared in BLAST 1.4 and remains attributable to W. Gish (1994, unpublished).
All five of the supported BLAST programs in BLAST 1.4 (BLASTP, BLASTN, TBLASTN, BLASTX and TBLASTX) were for the first time coded using a standard API (application programming interface) to a generalized BLAST function library. This function library made maintenance and improvements to the five core programs easier and aided the development of more specialized BLAST applications, such as Entrez sequence neighboring tools and specialized EST analysis tools.
The first release of WU-BLAST was numbered 1.4, which was virtually identical to the public domain NCBI BLAST 1.4, save for a few bug fixes. The WU-BLAST Archives (original URL http://blast.wustl.edu) first appeared on the Internet in 1995, to provide continuity of support for the work Warren Gish began at the NCBI, as well as to provide a central resource where the community could find BLAST-related software, information and earlier versions.
In late 1994, at the invitation of Warren Gish, who had recently moved to Washington University in St. Louis, Stephen Altschul and he engaged in a collaboration to test several of Gish’s hypotheses:
- Sum statistics (Karlin and Altschul, 1993) allowed the evaluation of multiple ungapped alignment scores, using the analytically computed ungapped parameters λ_u, K_u and H_u. Extreme Value statistics — analogous to the statistics for ungapped alignment scores published by Karlin and Altschul (1990) — had been shown empirically to be good estimators of the statistical significance of individual gapped alignments from Smith-Waterman comparisons, using empirical estimates for the gapped parameters λ_g and K_g (Collins and Coulson, 1990; Mott, 1992; Waterman and Vingron, 1994). It stood to reason that Sum statistics might be empirically extended to evaluating multiple gapped alignment scores, using empirically estimated parameters λ_g, K_g and H_g;
- While good estimates for λ_g, K_g and H_g could be computed through lengthy (computationally expensive) Monte Carlo simulations for a specific scoring system and particular pair of sequences, fixed estimates for these parameters precomputed for sequences of “average” composition would work well enough as to be of practical use in comprehensive database searches; and
- For an improved search algorithm, multiple, locally optimal gapped alignments between two sequences could be approximated by a two-stage BLAST implementation that would: remain fast, yet be far more sensitive than ungapped BLAST; produce more-easily interpreted alignments; and yield alignment scores suitable for evaluation with the expanded role proposed for Sum statistics.
If the effort panned out as hoped, the new gapped BLAST method would in some cases be more sensitive and selective than even the standard Smith-Waterman algorithm, due to the newer method’s ability to find multiple gapped alignments between a pair of sequences and to evaluate their significance jointly with Sum statistics.
While Altschul set to work empirically testing Sum statistics on gapped alignment scores, Gish focused on the alignment problem. Early results from their work appeared in Altschul and Gish (1996) and provided much of the foundation for WU-BLAST 2.0 and later NCBI blastall.
The first complete implementation of gapped BLAST (BLASTP, BLASTN, BLASTX, TBLASTN and TBLASTX) with statistical significance estimates (both Poisson and Sum statistics) was publicly released as WU-BLAST version 2.0d1 (W. Gish, unpublished), in time for presentation at the Cold Spring Harbor conference on Genome Mapping and Sequencing in May 1996.
The NCBI published its BLAST version 2 or “Gapped BLAST”, including a description of a new 2-hit ungapped BLAST algorithm and the PSI-BLAST program, in Altschul et al. (1997), in September 1997. All search modes, except BLASTN, used the new 2-hit algorithm by default. Within days of their publication, a faster, more sensitive 2-hit algorithm was deployed in WU-BLAST 2.0.
The NCBI published a description of PHI-BLAST in Zhang et al. (1998).
In late 2008, rights to WU-BLAST were acquired from Washington University in St. Louis by the author, Warren R. Gish. The rights to license the software to the community were acquired by Advanced Biocomputing, LLC in 2009.

References

Altschul, SF, and W Gish (1996). Local alignment statistics. ed. R. Doolittle. Methods Enzymol. 266:460–80.

Altschul, SF, and DJ Lipman (1990). Protein database searches for multiple alignments. Proc. Natl. Acad. Sci. USA 87:5509–13.

Altschul, SF, Gish, W, Miller, W, Myers, EW, and DJ Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 215:403–10.

Altschul, SF, Madden, TL, Schäffer, AA, Zhang, J, Zhang, Z, Miller, W, and DJ Lipman (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25(17):3389–402.

Claverie, JM, and DJ States (1993). Information enhancement methods for large scale sequence analysis. Computers in Chemistry 17:191–201.

Collins, JF, and AF Coulson (1990). Significance of protein sequence similarities. Methods Enzymol. 183:474–7.

Dembo, A, and S Karlin (1991). Strong limit theorems of empirical functionals for large exceedances of partial sums of i.i.d. variables. Ann. Probab. 19:1737–55.

Dembo, A, and S Karlin (1992). Limit distributions of maximal segmental score among Markov dependent partial sums. Adv. Appl. Probab. 24:113–40.

Gish, W, and DJ States (1993). Identification of protein coding regions by database similarity search. Nat. Genet. 3:266–72.

Hancock, JM, and JS Armstrong (1994). SIMPLE34: an improved and enhanced implementation for VAX and Sun computers of the SIMPLE algorithm for analysis of clustered repetitive motifs in nucleotide sequences. Comput. Appl. Biosci. 10:67–70.

Karlin, S, and SF Altschul (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87:2264–8.

Karlin, S, and SF Altschul (1993). Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA 90:5873–7.

Karlin, S, Dembo, A, and T Kawabata (1990). Statistical composition of high scoring segments from molecular sequences. Ann. Stat. 18:571–81.

Mott, RF (1992). Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bull. Math. Biol. 54:59–75.

Smith, TF, and MS Waterman (1981). Identification of common molecular subsequences. J. Mol. Biol. 147:195–7.

States, DJ, and W Gish (1994). Combined use of sequence similarity and codon bias for coding region identification. J. Comp. Biol. 1:39–50.

Waterman, MS, and M Vingron (1994). Rapid and accurate estimates of statistical significance for sequence data base searches. Proc. Natl. Acad. Sci. USA 91:4625–8.

Wootton, JC, and S Federhen (1993). Statistics of local complexity in amino acid sequences and sequence databases. Computers in Chemistry 17:149–63.

Wootton, JC, and S Federhen (1996). Analysis of compositionally biased regions in sequence databases. ed. R. Doolittle. Methods Enzymol. 266:554–71.

Zhang, Z, Schäffer, AA, Miller, W, Madden, TL, Lipman, DJ, Koonin, EV, and SF Altschul (1998). Protein sequence similarity searches using patterns as seeds. Nucl. Acids Res. 26:3986–90.

Last updated: 2022-12-01
