AB-BLAST 3.0 is a powerful
software package for gene and protein identification,
using sensitive, selective and rapid similarity searches of
protein and nucleotide sequence databases.
The feature list for AB-BLAST is long and continues to expand,
while performance is improved.
Much of this is outlined
below.
A complete suite of BLAST search programs
(blastp, blastn, blastx, tblastn and tblastx)
is provided in the package,
along with several database management and support programs that
include
nrdb,
patdb,
xdformat,
xdget,
seg,
dust
and
xnu.
AB-BLAST has been built to be the most trusted database search tool
in your software toolbox,
doing what you tell it, reporting precisely what it’s doing
— even telling you what it could not do because
of specific parameter restrictions you might wish to change —
and able to handle even your biggest jobs with aplomb.
Users of other BLAST implementations have suffered every few years through a series
of expensive and time-consuming rewrites, alpha releases, beta releases, format changes,
specialized spin-off programs, and bewildering arrays of new program parameters,
options, behaviors and interactions.
Meanwhile AB-BLAST was built from scratch
to offer consistently superior performance and flexibility —
combined with painstaking effort by the developer
to ensure near-absolute backward compatibility
with every AB-BLAST release for over 17 years.
AB-BLAST represents the most rigorous,
sensitive implementation of BLAST available,
yet it typically runs faster than the rest.
AB-BLAST has a simple, easy-to-use command line structure;
offers consistent behavior across all search modes;
runs on general purpose computer hardware;
can uniquely categorize and filter results based on biological criteria;
and much more.
All of these features help AB-BLAST users
be more productive and save money.
AB-BLAST is not a re-hashed version of NCBI-BLAST.
AB-BLAST shares virtually no code with NCBI-BLAST
except for some portions that both packages copied
from the public domain, ungapped BLAST version 1.4 released in 1994
(W. Gish, unpublished).
A brief history of AB-BLAST development
is available here.
Some of the key features of AB-BLAST are described below.
AB-BLAST is the premier gapped BLAST with statistics.
AB-BLAST is derived from the original gapped BLAST with statistics
(W. Gish, 1996, unpublished).
Gapped alignment routines are available and used by default
in all AB-BLAST search modes
(BLASTP, BLASTN, TBLASTN, BLASTX and TBLASTX),
with purely ungapped alignments available as an option.
Faster and More Sensitive.
AB-BLAST is up to twice as fast as NCBI BLAST in all search modes, while being more sensitive.
AB-BLAST is not a re-hashed version of NCBI BLAST
but uses distinctly different, more-sensitive search algorithms
that have been painstakingly implemented to be faster, as well.
Due to algorithmic differences (not to mention differences in statistics),
no set of parameters can guarantee the
same results will be produced by the two packages,
but with comparable parameter settings,
the classical 1-hit BLAST used by default in AB-BLAST
is faster and more sensitive
than the 1-hit BLAST available as an option in NCBI BLAST.
At comparable sensitivity levels,
the AB-BLAST 1-hit implementation is nearly as fast
and uses much less memory than the 2-hit algorithm described and implemented
by Altschul et al. (1997).
For users who desire maximum speed,
an AB-exclusive 2-hit algorithm (W. Gish, unpublished)
is available in all search modes (including BLASTN)
that is faster,
more sensitive and more memory-efficient than the NCBI 2-hit algorithm.
(See the
hitdist
option.)
Formatting databases is faster
with AB-BLAST—up to 4x faster—which
allows your computers to start searching the databases that much sooner.
Multiple Local Alignments with Joint Evaluation.
Virtually all database search programs will find sequence similarities
(locally optimal alignments or approximations thereof)
that are by themselves statistically insignificant
and thus are not reported,
but AB-BLAST can identify alignments
none of which are statistically significant on their own
but which are statistically significant as a group.
Alignments are clustered into “consistent” sets
under a variety of user-configurable constraints,
including maximum allowable separation distance and maximum allowable overlap.
This combination of sensitivity and selectivity
— original to AB-BLAST and used by default in all AB-BLAST search modes —
increases the biological relevance of the results.
This feature is essential for finding:
all exons in a multi-exon gene sequence, not just high-scoring exons;
all complete or partial copies of a repetitive element
in a genomic sequence, not just high-scoring ones;
and
multiple, discrete domains of similarity between sequences,
not just the highest-scoring domain.
More Sensitive/Selective than Smith-Waterman.
The combination of well-chosen heuristics and statistics in AB-BLAST
can be more sensitive and selective than
the full dynamic programming approach of the classical
Smith-Waterman (1981) algorithm,
which reports only the single highest scoring alignment between two sequences,
as well as other approaches or BLAST implementations that may
identify multiple regions of local similarity but then only evaluate the alignments
in isolation for their statistical significance.
Full Smith-Waterman Option.
With the
postsw
option,
a full Smith-Waterman alignment is performed between pairs of query-subject
sequences that are already scheduled to be reported by BLASTP.
The Smith-Waterman alignments are combined with
the heuristic BLAST results, any redundancy between them is removed, and the statistics are recomputed.
In addition to providing alignments guaranteed to be optimal,
this post-processing can significantly improve the P-values and relative ranking of database hits,
often while increasing the execution time only marginally.
Choice of Statistical Methods.
AB-BLAST uses
“Sum” statistics
(Karlin and Altschul, 1993)
by default in all search modes,
with Poisson statistics available as an option
(poissonp).
Sum statistics and Poisson statistics involve
joint probability calculations on sets of one or more alignments.
To evaluate the significance of individual alignments (or alignment scores),
simple
Karlin-Altschul (1990)
statistics
are also available with the
kap
option.
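As a rough illustration of how the significance of a single alignment score is evaluated with Karlin-Altschul statistics, the sketch below computes the expected number of chance alignments, E = K·m·n·exp(−λS), and converts it to a P-value under a Poisson assumption. The function names and the numeric values of λ, K, m and n are assumptions chosen for illustration; they are not taken from AB-BLAST or its output.

```python
import math


def karlin_altschul_evalue(score, m, n, lam, k):
    """Expected number of chance alignments scoring >= score.

    E = K * m * n * exp(-lambda * S), with m and n the query and
    database lengths.
    """
    return k * m * n * math.exp(-lam * score)


def evalue_to_pvalue(evalue):
    """Probability of at least one such chance alignment: P = 1 - exp(-E)."""
    return -math.expm1(-evalue)


if __name__ == "__main__":
    # Illustrative values only (assumptions, not AB-BLAST output).
    lam, k = 0.318, 0.134
    m, n = 250, 1_000_000
    for s in (40, 60, 80):
        e = karlin_altschul_evalue(s, m, n, lam, k)
        print(f"S={s}  E={e:.3g}  P={evalue_to_pvalue(e):.3g}")
```

Sum and Poisson statistics extend this idea to sets of alignments; that joint calculation is not reproduced here.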
BLASTN Flexibility.
Unique to AB-BLAST are these features of the BLASTN search mode:
Nucleotide scoring matrices.
AB-BLASTN supports fully-specified scoring matrices,
not just simple match/mismatch scoring systems.
This allows transitions to be scored differently from transversions,
and allows positive G-A substitution scores for the design of siRNAs (small interfering RNAs),
where G-U base pairing is allowed.
Scoring matrices can also be tailored to improve the design of PCR primers
or applied to areas of research where a simple match/mismatch scoring system
cannot adequately discriminate.
Contrary to
W. Miller (2001),
scoring matrices were first supported by the NCBI ungapped BLASTN version 1.4
(Gish, W., 1994, unpublished;
see http://blast.advbiocomp.com/blast-1.4).
Support for nucleotide scoring matrices was indeed dropped
by the NCBI when its blastall program was released in 1997,
but this feature was maintained continuously in all WU versions of BLASTN
since the migration to Washington University in St. Louis in 1994,
continuously through the introduction of the original gapped BLAST (WU-BLAST) in 1996,
and on through to today with AB-BLASTN.
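To make the idea of a fully specified nucleotide scoring matrix concrete, here is a minimal sketch that scores transitions differently from transversions. The specific scores (+5, −2, −4) and the function names are hypothetical, and the in-memory dictionary is only a stand-in; it does not represent the matrix file format read by AB-BLASTN.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def build_matrix(match=5, transition=-2, transversion=-4):
    """Build a 4x4 scoring matrix as a dict keyed by (base, base).

    The default scores are hypothetical examples, not recommended values.
    """
    matrix = {}
    for a in "ACGT":
        for b in "ACGT":
            if a == b:
                score = match
            elif {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
                score = transition      # purine<->purine or pyrimidine<->pyrimidine
            else:
                score = transversion    # purine<->pyrimidine
            matrix[(a, b)] = score
    return matrix


def ungapped_score(matrix, x, y):
    """Sum of substitution scores for two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(x, y))
```

With a simple match/mismatch system the A-G and A-C substitutions would score the same; here they differ, which is the kind of discrimination a full matrix makes possible.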
Flexible Word Lengths.
AB-BLASTN supports BLAST word lengths as short as 1
(see the
W
parameter).
Nucleotide Neighborhood Words.
Nucleotide neighborhood words are supported by AB-BLASTN
using the standard neighborhood word score threshold parameter,
T.
Using neighborhood words, nucleotide sequence similarity can be detected
even in the absence of any identical residues between two sequences.
Users are cautioned, however, that careless use of the
T
parameter can result in crushing amounts
of memory being requested by BLASTN.
For this reason,
T
should likely be used only in conjunction with very short word lengths.
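The memory caution above can be seen in a small sketch of neighborhood word generation: every candidate word of length W is scored against the query word and kept if its score reaches the threshold T. This brute-force enumeration is an illustration only and is not how AB-BLASTN builds its word list; the +5/−4 scores and the names `MATRIX` and `neighborhood` are assumptions.

```python
from itertools import product

# Hypothetical match/mismatch matrix (+5/-4) used only for this example.
MATRIX = {(a, b): (5 if a == b else -4) for a in "ACGT" for b in "ACGT"}


def neighborhood(word, matrix, threshold, alphabet="ACGT"):
    """Return all words whose score against `word` is >= threshold.

    Enumerates len(alphabet) ** len(word) candidates, which is why
    long words combined with a low threshold quickly become expensive.
    """
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(matrix[(a, b)] for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append("".join(candidate))
    return hits
```

For the 3-letter word ACG, a threshold of 15 keeps only the exact word, while a threshold of 6 also admits the nine single-mismatch words. The candidate space grows as 4^W, so lowering T at longer word lengths inflates the word list rapidly.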
Consistently Accurate Statistics with BLASTN.
Since the release of the first gapped BLAST with statistics in 1996
(W. Gish, unpublished),
the statistical significance of gapped alignment scores
in all search modes — including BLASTN —
has been evaluated using appropriately pre-computed “gapped” values
for the statistical parameters λ, K and H,
rather than the often very different values for these parameters that are computed
at run-time for evaluating ungapped alignment scores.
If precomputed values are not available for the specific
combination of scoring system and gap penalties requested by the user,
a prominent warning has always been issued.
In contrast, NCBI-BLASTN only relatively recently began
using precomputed gapped values for λ, K and H
and warning users when appropriate values are unavailable.
Virtual Gene Structures.
Linkage information describing “consistent” groups or chains of local alignments (HSPs)
is provided by AB-BLAST
when the
topcomboN
or
links
options are used.
This facility can help with construction of overall gene structures
from what might otherwise be a barrage of local alignments
scattered throughout a 2-dimensional search space.
The
hspsepQmax,
hspsepSmax,
olfraction,
olmax,
golfraction
and
golmax
parameters can also help ensure the reported structures are more biologically relevant.
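The following sketch shows one plausible way to test whether two HSPs could belong to the same "consistent" chain, given limits on separation and overlap. It is loosely inspired by the hspsepQmax, hspsepSmax and olmax parameters, but the exact rules AB-BLAST applies are not described here, so the predicate, its argument names and its coordinate convention (1-based, inclusive) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HSP:
    q_start: int  # query coordinates, 1-based inclusive (assumed convention)
    q_end: int
    s_start: int  # subject coordinates, 1-based inclusive
    s_end: int


def gap_and_overlap(prev_end, next_start):
    """Return (separation, overlap) between two consecutive intervals."""
    delta = next_start - prev_end - 1
    return max(delta, 0), max(-delta, 0)


def is_consistent(a, b, sep_q_max, sep_s_max, ol_max):
    """True if b may follow a in a colinear chain under the given limits."""
    # b must start after a starts on both query and subject.
    if b.q_start <= a.q_start or b.s_start <= a.s_start:
        return False
    q_sep, q_ol = gap_and_overlap(a.q_end, b.q_start)
    s_sep, s_ol = gap_and_overlap(a.s_end, b.s_start)
    return (q_sep <= sep_q_max and s_sep <= sep_s_max
            and q_ol <= ol_max and s_ol <= ol_max)
```

For example, two exon-like HSPs that are adjacent on a cDNA query but separated by a 3,900-base intron on the genomic subject would pass only if the subject separation limit is at least that large. Fractional overlap limits (olfraction, golfraction) are not modeled.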
Ease of Database Management.
AB-BLAST supports the eXtended Database Format (XDF),
a power user’s dream
for working with peptide and nucleotide sequences.
Both the NCBI-BLAST 2.0 database format and the NCBI implementation
of the BLAST search algorithm were originally restricted to sequences under 16 Mbp
in length,
whereas human genome contigs exceeded 25 Mbp in the last millennium
(Hattori et al., 2000)
and extended to hundreds of megabytes many years ago.
In contrast, XDF databases, which were introduced in 1999,
have the facility to accurately store individual sequences
of up to 1 Gbp (1 billion bp) in length with ambiguity codes intact.
Other BLAST software, such as the NCBI’s,
limits database files to 2 gigabytes each,
whereas from its inception XDF database files
could be of virtually unlimited size —
provided of course that the host operating system and file system
support such “large files” (as most modern operating systems and file systems do).
To support XDF databases,
the database formatting tool named xdformat
is provided with AB-BLAST.
Among the distinct capabilities
and advantages of using XDF and xdformat are:
fast appends of new sequences to existing databases
(both protein and nucleotide).
There is no need to reformat
an XDF database just to add one or more sequences to it.
xdformat runs 2-4 times faster than the NCBI formatdb program,
while offering a superset of features and greater reliability;
the full complement of NCBI standard “FASTA” sequence identifiers is indexed
by xdformat,
including standard identifiers the NCBI formatdb program does not index
(see this FAQ for further details);
xdformat reports duplicated sequence identifiers (that should be unique)
in public databases distributed by the NCBI,
which the NCBI formatdb program does not catch;
safe roll-backs of database updates when file I/O
(e.g., disk-full errors) or parse errors are encountered;
huge databases need not be broken into multiple volumes
individually composed of several files but can be managed simply
with as few as 3 files (4 files when sequence identifier index is included),
regardless of the database size up to 1 TB;
flexible indexing of all sequence identifiers
— not just a subset of the NCBI identifiers —
including user-defined identifiers;
index support for duplicate occurrences of the same identifier,
even the identical “gi” identifiers that cause some indexing programs
to abort;
identifier indexing is supported not only when creating an XDF database
but when appending new sequences to an existing XDF database;
and if a database was originally created without an identifier index,
an index can be added later in one, relatively fast step;
identifier indexes can be quickly re-built
if necessary using different indexing policies,
without having to reformat the entire database;
intelligent retrieval of indexed sequences
uses a complementary program named xdget.
Xdget can retrieve sequences by identifier
even if the program is not told what name space
(e.g., gi, accession, locus, user-defined, etc.)
the identifier came from.
For more on identifier indexing, see
this;
xdformat and xdget accept and work intelligently
with identifiers that obey the International DDBJ/EBI/NCBI collaboration’s
Accession.Version identifier syntax (e.g.,
the programs know that BAA84643.2 is a newer version of BAA84643,
but will retrieve BAA84643.1 if specifically requested);
and the parsing programs that come with AB-BLAST
for converting GenBank and EMBL database flat files into “FASTA” format
report not only gi identifiers but also Accessions with Versions;
greatly reduced memory requirements and BLAST search initiation times
for databases containing large numbers of entries,
which is particularly important when memory is in short supply
or when multiple processors are standing by
waiting for a single-threaded initialization phase to be completed;
the ability to dump (or recover) the contents of an XDF database back into FASTA format
with the original annotation and ambiguity codes intact;
both the X and N ambiguity codes are supported in nucleotide sequences,
thus permitting the use of distinct substitution
scores for these letters and the use of PHRED/PHRAP sequence output
“as is” for input
to xdformat.
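The Accession.Version handling mentioned in the list above can be sketched as follows. This is a toy lookup over an in-memory dictionary, not xdget's implementation; the function names, the index layout, and the choice to return the highest version when none is requested are assumptions made for illustration.

```python
def split_accession(identifier):
    """Split 'BAA84643.2' into ('BAA84643', 2).

    An identifier without a numeric version suffix yields (identifier, None).
    """
    accession, dot, version = identifier.rpartition(".")
    if dot and accession and version.isdigit():
        return accession, int(version)
    return identifier, None


def lookup(index, identifier):
    """Fetch a record from `index`, a dict of accession -> {version: record}.

    Assumption: an unversioned request returns the newest version.
    """
    accession, version = split_accession(identifier)
    versions = index.get(accession)
    if not versions:
        return None
    if version is None:
        version = max(versions)
    return versions.get(version)


if __name__ == "__main__":
    index = {"BAA84643": {1: "record for version 1", 2: "record for version 2"}}
    print(lookup(index, "BAA84643"))
    print(lookup(index, "BAA84643.1"))
```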
Compared to the classical BLAST 1.4 database format,
XDF provides
the ability to use FASTA/Pearson format
input files with unjustified (i.e., ragged or blank) input lines.
With nucleotide sequence databases,
there is also no longer the need to retain the original FASTA input file
in order to access the ambiguity codes during a database search.
Support for XDF by the BLAST search programs
does not come at the expense of backward compatibility.
AB-BLAST can search databases in either XDF or the classical BLAST 1.4
database formats.
Furthermore, by simply installing new versions of setdb and pressdb,
the migration to using XDF can be performed swiftly and transparently,
without making any changes whatsoever to existing database maintenance scripts.
While providing this drop-in upgrade path to XDF,
support for legacy databases in the BLAST 1.4 database format
is retained transparently, as well:
the AB-BLAST search programs automatically
identify the database format being used
and adjust their operations accordingly.
This allows users to migrate incrementally to XDF,
at their own pace and as they see fit,
without losing the ability to study or reproduce results
obtained with older databases.
Even so, users are encouraged to make the migration to XDF,
as there are definite benefits to the new format,
including an improved nucleotide sequence data representation
and the ability to index sequences by their identifiers.
When searching very large databases,
virtual memory requirements are dramatically reduced in AB-BLAST,
eliminating program failures that occurred
when system resource limits were unexpectedly reached.
Virtual databases are supported by AB-BLAST.
Virtual databases can be
specified on the command line as a white space-delimited list of
component database names.
Virtual databases can be composed of
components in either XDF or classical BLAST 1.4 format,
as long as the formats are not mixed on the same command line.
For example, this command might be used to search the pri, rod, mam,
vrt, and htg divisions of GenBank:
blastn "pri rod mam vrt htg" myquery.nt
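When such a command is launched from a script, the component database names have to reach the program as one white space-delimited argument. The sketch below builds the argument vector for the example above; the helper name `virtual_db_command` is hypothetical, and actually running the command (for instance with `subprocess.run`) is left out because it depends on the local installation.

```python
import shlex


def virtual_db_command(program, databases, query):
    """Return an argv list in which all database names form one argument."""
    return [program, " ".join(databases), query]


if __name__ == "__main__":
    argv = virtual_db_command("blastn", ["pri", "rod", "mam", "vrt", "htg"], "myquery.nt")
    # shlex.join quotes the database list so it can be pasted into a shell.
    print(shlex.join(argv))
```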
Virtually no file size limits exist for databases and other files,
provided the host operating system supports large files.
Operating systems such as Linux (kernel version 2.2 and earlier)
for 32-bit Intel computing platforms
are often incapable of using files larger than 2 GB,
although virtual database support (see above) helps avoid this limitation
by allowing large databases to be segmented into files of a manageable size.
Linux users in need of large file support should use at least a version 2.4.*
kernel or ideally a 2.6.* kernel.
AB-BLAST supports segmented query sequences,
such as the contigs that result from shotgun sequencing assembly
or perhaps multiple short probes for a given gene.
For example,
all of the contigs from a given clone can be concatenated together with a single hyphen (-) character
to delimit each contig.
Segment boundaries are therefore clearly distinguishable from purely ambiguous regions
of the sequence, while consuming little storage.
AB-BLAST honors segment boundaries by guaranteeing that no alignment,
be it ungapped or gapped, will cross a boundary.
Support for segmented database sequences is in progress.
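A minimal sketch of preparing a segmented query is shown below: contigs are joined with a single hyphen, and a helper reports whether a given interval of the combined sequence would cross a segment boundary. The function names and the 0-based, half-open coordinate convention are assumptions for illustration; AB-BLAST itself enforces the boundaries during alignment.

```python
def join_segments(contigs):
    """Concatenate contigs with a single hyphen delimiting each segment."""
    return "-".join(contigs)


def segment_spans(segmented):
    """Return 0-based, half-open (start, end) spans of each segment."""
    spans = []
    start = 0
    for piece in segmented.split("-"):
        spans.append((start, start + len(piece)))
        start += len(piece) + 1  # skip over the hyphen
    return spans


def crosses_boundary(segmented, start, end):
    """True if the half-open interval [start, end) contains a delimiter."""
    return "-" in segmented[start:end]


if __name__ == "__main__":
    query = join_segments(["ACGT", "GG", "TTTAA"])
    print(query, segment_spans(query))
```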
Multi-sequence query files are supported,
such that every sequence in the FASTA file is searched against the specified database.
Previous versions of
the software only compared the first sequence in the query file against the database.
Each search result is separated from the next by a single ASCII form feed
character (control-L). See the new qrecmin and qrecmax options.
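Because each search result is separated from the next by a form feed, a multi-query output file can be split back into per-query reports with a few lines of code. The sketch below assumes the whole output has been read into a string; the function name is hypothetical.

```python
def split_reports(output):
    """Split concatenated search output on ASCII form feed (control-L)."""
    return [report for report in output.split("\f") if report.strip()]


if __name__ == "__main__":
    combined = "report for query 1\n\freport for query 2\n"
    for number, report in enumerate(split_reports(combined), start=1):
        print(number, report.strip())
```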
The format of all dates reported in BLAST output can be controlled by the
UN*X standard CFTIME environment variable.
For example, dates will be reported in
ISO 8601
standard format,
if CFTIME is set to '%Y-%m-%dT%H:%M:%S'.
Date strings produced by the
xdformat program are also governed by CFTIME.
Note that the format of many date and time strings reported for
XDF databases is determined by the setting (if any)
of CFTIME when the database was created or last modified.
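The format string quoted above uses the standard strftime conversion specifiers, so its effect can be previewed outside of BLAST. The sketch below formats a fixed, arbitrary timestamp and prepares an environment mapping that could be handed to a child process; how that environment is passed to the search programs is left to the caller.

```python
import os
from datetime import datetime

CFTIME = "%Y-%m-%dT%H:%M:%S"  # the ISO 8601 style format from the text

# An arbitrary example timestamp, used only to preview the format.
stamp = datetime(2009, 3, 14, 15, 9, 26)
formatted = stamp.strftime(CFTIME)
print(formatted)  # 2009-03-14T15:09:26

# Copy of the current environment with CFTIME set, e.g. for subprocess.run(..., env=env).
env = dict(os.environ, CFTIME=CFTIME)
```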
Both sequence filtering and word masking of query sequences
are supported.
The terms “filter” and “mask” are sometimes used alone
and interchangeably,
but there are two distinct techniques that people can use,
and they deserve separate names.
Lower case alphabetic letters in the query sequence can be used to
inform the BLAST search program as to which residues it should either filter
(convert to X or N)
or mask
(skip when generating neighborhood words but otherwise leave the sequence intact).
See the
lcfilter
and
lcmask
options, respectively.
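The difference between the two treatments of lower-case letters can be sketched as follows. One function rewrites lower-case residues as an ambiguity code (filtering), and the other leaves the sequence untouched and only reports which intervals would be skipped during word generation (masking). Both functions are illustrative stand-ins with hypothetical names; they are not the code behind the lcfilter and lcmask options.

```python
def lc_filter(seq, ambiguity="N"):
    """Filter: replace lower-case residues with an ambiguity code (N or X)."""
    return "".join(ambiguity if c.islower() else c for c in seq)


def lc_mask_intervals(seq):
    """Mask: return 0-based, half-open intervals of lower-case residues.

    The sequence itself is left intact for alignment.
    """
    intervals = []
    start = None
    for i, c in enumerate(seq):
        if c.islower():
            if start is None:
                start = i
        elif start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(seq)))
    return intervals
```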
Multiple filter=<filter> directives can be specified
on the BLAST command line.
Each of the filters is executed independently
and their results are OR-ed at the end.
NCBI-BLAST 2.0 uses either built-in (closed) complexity filters
or the original external filtering technique
of BLAST 1.3
(Gish, W., unpublished),
which uses the UNIX popen() system call and temporary/intermediate files.
AB-BLAST provides open access to its complexity filters by using
filter programs that are distinct from the search programs,
while simultaneously avoiding the use of problematic
system call interfaces and temporary files.
One or more word masks can be specified on the command line, using
the wordmask=<mask> option,
where <mask> may be a classical
filter program such as seg, xnu, or dust.
Whereas sequence filters
convert certain letters in the query sequence into ambiguity codes (X
for amino acid and N for nucleotide), word masks do not alter the
sequence. Word masks instead cause the indicated portion(s) of the query
sequence to be skipped during BLAST neighborhood word generation.
This leaves the query sequence intact for generating alignments that are
seeded by word hits arising in flanking, unmasked regions of the
sequence.
The BLAST algorithm word length parameter,
W, can be set from 1 to 1024 in all search modes
(BLASTP, BLASTN, BLASTX, TBLASTN and TBLASTX).
This wide-ranging flexibility and the resultant speed are due in part
to the highly optimized DFA (deterministic finite-state automaton) used
by AB-BLAST.
AB-BLAST reliably supports parallel processing on a variety of SMP
(symmetric multiprocessing) computing platforms.
AB-BLAST was the first BLAST to
support Mac OS X,
the first to thread properly (i.e., robustly and efficiently) under Mac OS X,
and the first to support 64-bit computing
under Mac OS X on PowerPC G5 and Intel X64 processors.
POSIX threads are used under
Linux and Mac OS X.
Although POSIX threads are available under Solaris,
Solaris native threads are used instead for slightly better performance.
To illustrate just some of the flexibility available with AB-BLAST,
the software bundle includes a
PERL script named ab-blastall
(formerly wu-blastall) that translates an NCBI blastall command line
into a roughly equivalent AB-BLAST
command line and then invokes the appropriate AB-BLAST search mode.
The output remains in AB-BLAST format,
but the ab-blastall script may help
users of the NCBI blastall program migrate
to AB-BLAST and start to discover its power.
NOTE:
the ab-blastall script is not at all intended to provide
a literal replacement for the NCBI blastall program
and is not an appropriate method
for assessing the relative performance
(sensitivity, specificity, accuracy or speed)
of the NCBI and AB software.
By coercing AB-BLAST
to substitute for CrossMatch
(Phil Green, unpublished),
the unique combination of flexibility
and speed of AB-BLAST can yield
up to a 30-fold increase in performance of
RepeatMasker in its slow mode,
while maintaining virtually identical sensitivity.
Many command line options are available.
The complete list of options and parameters along with their descriptions
is provided in the parameters.html file that comes bundled
with the software.
The latest version of this file is maintained on-line at
http://blast.advbiocomp.com/doc/parameters.html.
For convenience,
the AB-BLAST package includes an optimized version
of the classic sequence redundancy removal program
nrdb.
A newer program named patdb is also included
that can find identical sequences
and perfect substrings of others in the input
nearly as fast as nrdb finds identical sequences alone.
A reverse chronological list of changes to the AB-BLAST software
is available in the file named HISTORY that comes bundled
with the software.
When possible,
any bugs that have been found have typically been fixed within 24 hours
of their being reported.
Please send
us
bug reports, questions, or suggestions.
Licensing
Full information about licensing of AB-BLAST is provided
here.
Manifest
The AB-BLAST 3.0 package
includes the following data analysis and utility programs:
blasta — the unified database search program, which provides
blastp, blastn, blastx, tblastn, and tblastx
search functionality.
xdformat — the recommended program for rapidly converting sequences from
FASTA format into the native XDF format read by blasta. The program
can also append new sequences to an existing database;
automatically roll back on errors;
provide flexible indexing and verification services;
and dump data back into FASTA format.
xdget — a flexible tool for retrieving sequences
(or segments thereof)
from an indexed XDF database;
retrieved sequences are
optionally reverse-complemented and translated
in the case of nucleotide sequences.
xdformat and xdget are actually one and the same program
to help ensure their mutual compatibility during upgrades.
nrdb — a tool for rapidly removing trivial redundancy
(i.e., duplicate sequences)
from one or more input files in FASTA format.
The nrdb program is often many times faster and uses much less memory
than “competing” solutions.
patdb — a tool much like nrdb for rapidly removing trivial redundancy
from one or more input files in FASTA format,
but with the important option
of identifying sequences that are perfect substrings of others.
The program attains high speed by using a Patricia tree
that is often combined with finite state automata.
The substring removal option can be usefully applied to protein sequences
which may differ in their inclusion of the initiator methionine;
or in mapping short read sequences onto a genome.
On protein sequence data,
with or without its substring option,
patdb
runs at about the same speed as nrdb and requires about the
same amount of memory.
When operating on nucleotide sequences,
nrdb may be more practical,
because nrdb uses data compression techniques
that are unavailable in patdb;
nrdb simply cannot identify perfect substrings.
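What "removing trivial redundancy" and "perfect substrings" mean can be shown with a deliberately naive sketch. It uses plain set membership and quadratic substring tests rather than the Patricia tree and automata described above, so it illustrates only the result, not the method or the speed of nrdb or patdb. The function name and its `drop_substrings` flag are hypothetical.

```python
def nonredundant(sequences, drop_substrings=False):
    """Remove exact duplicates and, optionally, perfect substrings.

    Naive O(n^2) illustration; not the algorithm used by nrdb or patdb.
    """
    unique = list(dict.fromkeys(sequences))  # order-preserving de-duplication
    if not drop_substrings:
        return unique
    kept = []
    # Visit longer sequences first so shorter ones can be tested against them.
    for seq in sorted(unique, key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    # Two proteins differing only in the initiator methionine, plus a duplicate.
    proteins = ["MKTAYIAK", "KTAYIAK", "MKTAYIAK", "GGSG"]
    print(nonredundant(proteins))
    print(nonredundant(proteins, drop_substrings=True))
```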
ab-blastall — a PERL script for converting an NCBI blastall command line
into a roughly equivalent blasta command line
and then invoking blasta.
The output is still in AB-BLAST format.
This is primarily intended as a technology demonstration tool
but may also assist users in their migration
from NCBI BLAST to the more accurate AB-BLAST.
For benchmarking of BLAST,
careful tweaking of parameters may be required, but even with great care,
benchmarking for speed can still be confounded by inaccuracies in NCBI BLAST.
ab-formatdb — a PERL script
for converting an NCBI formatdb command line
into the equivalent xdformat command line and then invoking xdformat.
This is primarily intended as a technology demonstration tool but may also
assist users in their migration from NCBI BLAST to AB-BLAST.
pam — a program to compute amino acid substitution scoring matrices
having arbitrary scales, using the Dayhoff PAM model.
pressdb.real — the legacy pressdb program for any rare users
who may be reliant on the NCBI-BLAST 1.4 database format for nucleotide sequences.
setdb.real — the legacy setdb program for any rare users
who may be reliant on the NCBI-BLAST 1.4 database format
for amino acid sequences.
gb2fasta — a parser to extract nucleotide sequences from GenBank flat files
into FASTA format.
gt2fasta — a parser to extract amino acid sequences from CDS features
in GenBank flat files and output them in FASTA format.
sp2fasta — a parser to extract protein or nucleotide sequences from
EMBL, TrEMBL, or Swiss-Prot database files and output them in FASTA format.
pir2fasta — a parser to extract protein sequences from NBRF PIR database
files and output them in FASTA format.
dust — a low-complexity filter for nucleotide sequences
(Hancock and Armstrong, 1994;
Tatusov and Lipman, unpublished).
xnu — a low-complexity filter for protein sequences
(Claverie and States, 1993).
The program identifies short-periodicity repeats.
sysblast.sample — a sample configuration file that system
administrators may wish to modify and install as /etc/sysblast.
Parameter settings in this file can be used to:
limit the number of threads employed by each BLAST process;
change the default number of threads employed per process;
alter the “nice” value for BLAST processes;
limit the amount of memory utilized by each BLAST process.
AB-BLAST Command Line Options and Parameters
A complete list of command line options and parameters
for modifying the behavior of the AB-BLAST search programs
is available
here.
Comparable AB/NCBI BLAST Parameters
A brief comparison of some of the most important
parameters for controlling sensitivity, selectivity and speed
of AB-BLAST and NCBI BLAST
is available
here.
Environment Variables
AB-BLAST can utilize the settings
of a few environment variables
to adapt its behavior to different computing environments:
BLASTDB, BLASTFILTER and BLASTMAT.
To allow for triple AB/WU/NCBI BLAST installations,
AB-BLAST also supports the environment variables
ABBLASTDB, ABBLASTFILTER and ABBLASTMAT,
as well as
WUBLASTDB, WUBLASTFILTER and WUBLASTMAT.
Settings of the AB versions of these variables take precedence over all others,
and WU variable settings take precedence
over the corresponding base name variables.
In AB-BLAST, the BLASTDB (or ABBLASTDB) environment variable
can be a list of one or more directory names in which the programs
are to look for database files.
In UNIX parlance, such an environment variable might be called a path
for the database files.
Directory names should be delimited from one another by a colon
(“:”) and listed in the order that they should be searched.
If the BLASTDB environment variable is not set, the programs use a default
path of .:/usr/ncbi/blast/db, such that the programs first look in the
current working directory (“.”) for the requested database
and then look in the /usr/ncbi/blast/db directory.
For backward compatibility with
programs that expect BLASTDB to be a single directory specification and
not a path, if the user has set a value for BLASTDB but omitted the current
working directory,
AB-BLAST will still look for database files
in the current working directory as a last resort.
This usage is unchanged from NCBI/WU BLAST version 1.4 (1994),
except multiple directories could be specified with the BLASTDB
variable beginning with WU-BLAST 2.0 ca. 1997.
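The precedence and fallback rules described above can be summarized in a short sketch. It takes an environment mapping and returns the ordered list of directories that would be searched. The function name is hypothetical, and details such as treating an empty value as unset, or recognizing the current directory only when written exactly as ".", are simplifying assumptions.

```python
import os

DEFAULT_PATH = ".:/usr/ncbi/blast/db"  # default stated in the text


def database_search_path(env):
    """Return directories to search, honoring AB > WU > base precedence."""
    for name in ("ABBLASTDB", "WUBLASTDB", "BLASTDB"):
        value = env.get(name)
        if value:  # assumption: an empty value is treated as unset
            dirs = [d for d in value.split(":") if d]
            if "." not in dirs:
                dirs.append(".")  # current directory as a last resort
            return dirs
    return DEFAULT_PATH.split(":")


if __name__ == "__main__":
    print(database_search_path(os.environ))
```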
The BLASTFILTER (or ABBLASTFILTER) environment variable
can be set to the directory containing the sequence filter programs,
such as
seg and
xnu.
The default directory for the filter programs is /usr/ncbi/blast/filter.
This usage is unchanged from NCBI/WU BLAST version 1.4.
The BLASTMAT (or ABBLASTMAT)
environment variable can be set to the parent
directory for all scoring matrix files.
The default directory for these files is /usr/ncbi/blast/matrix,
beneath which are expected nt and aa subdirectories
for storing scoring matrix
files for nucleotide and amino acid alphabets, respectively.
This usage is unchanged from NCBI/WU-BLAST version 1.4.
For more information about environment variables, see the
Installation instructions.
Filters and Masks
AB-BLAST provides a highly flexible means
of applying both “hard” and “soft” masks to a query sequence,
of supporting alternative, user-defined filter programs,
and of passing non-standard parameters to the standard filters.
The filter (for hard masking) and
wordmask (for soft masking)
command line options provide the basic interface.
Multiple specifications of each type are acceptable
on the BLAST command line.
Furthermore, individual filter and wordmask specifications may
consist of entire pipelines of commands.
For example, three filters are used in succession by this pipeline:
filter="myfilter1 | myfilter2 | myfilter3 -x5 -"
The first two filters in this case expect to read their input from UN*X
standard input (also known as stdin),
whereas myfilter3 apparently needs to be told
to read data from stdin,
using the usual “-” or
hyphen argument.
The standard output (stdout)
from myfilter1 will be read via stdin
by myfilter2,
which in turn processes the query
before handing its results to myfilter3;
finally, myfilter3
reports its results to stdout,
which the BLAST program itself reads to obtain the fully masked sequence.
The final output from the filter pipeline is expected by the BLAST
program to be in FASTA format.
Instead of running all 3 filters in the above example as part of one
pipeline, they could instead be specified as three separate filter options
like this:
filter=myfilter1 filter=myfilter2 filter="myfilter3 -x5 -"
The same choice of running as a pipeline or running separately is available
for wordmasks, too.
Naturally, the two approaches can also be combined on the same command line.
An advantage to using the pipeline approach is that all 3 filters
in the example above may complete a little bit faster,
because much of the I/O overhead is avoided.
Furthermore,
when used in the pipeline,
there is no requirement that the output from myfilter1
and myfilter2 actually be in FASTA format.
Those two programs could potentially pass any information between
themselves and to myfilter3.
The only absolute requirements are that the first filter in the pipeline,
myfilter1, must read FASTA data
from stdin, and the last filter in the pipeline, myfilter3,
must output FASTA data (that is also of the same length
as the query!) to stdout.
It should be noted that with some filter programs,
passing the query sequence sequentially through
a pipeline of filters may yield
a different result than processing the query independently with each filter
and OR-ing the results.
The script seg+xnu included in the filter/ directory provides
an example with which to test this.
Specifying filter=seg+xnu on the BLAST command line
invokes a seg and xnu pipeline that is built-in to the search programs;
whereas specifying filter="seg+xnu -"
causes the seg+xnu script to be invoked on the query, which independently
executes seg and xnu,
then logically “ORs” the results with the pmerge
utility program.
(The echofilter option can be used to display the results of filtering
in the search program output.)
The built-in seg+xnu pipeline is historically the way these two filters have
been invoked together,
but the somewhat slower method employed
by the seg+xnu script with pmerge may be more desirable.
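The difference between a sequential pipeline and independent filtering followed by an OR merge can be seen with a toy model. The sketch below does not reproduce seg, xnu or pmerge; the two filters (mask_runs and mask_alternations) and the or_masks helper are hypothetical stand-ins chosen only to show that a later filter may behave differently once an earlier one has already masked part of the query.

```python
import re


def mask_runs(seq):
    """Toy filter A: mask runs of three or more identical letters."""
    return re.sub(r"(.)\1{2,}", lambda m: "X" * len(m.group()), seq)


def mask_alternations(seq):
    """Toy filter B: mask 4-letter windows of the form ABAB (A != B)."""
    out = list(seq)
    for i in range(len(seq) - 3):
        w = seq[i:i + 4]
        if "X" not in w and w[0] == w[2] and w[1] == w[3] and w[0] != w[1]:
            out[i:i + 4] = "XXXX"
    return "".join(out)


def or_masks(a, b):
    """Mask a position if either input masked it (a stand-in for a merge step)."""
    return "".join("X" if "X" in (x, y) else x for x, y in zip(a, b))


query = "MKAAAKAKL"

# Pipeline: filter B only sees what filter A left behind.
pipeline = mask_alternations(mask_runs(query))

# Independent: both filters see the original query, then the masks are OR-ed.
merged = or_masks(mask_runs(query), mask_alternations(query))

print(pipeline)  # MKXXXKAKL
print(merged)    # MKXXXXXXL
```

In this toy case the pipeline masks less than the OR merge, because filter A destroys the pattern that filter B would otherwise have found. Both results keep the length of the query, as the search programs require.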
Precomputed Statistical Parameters
Nucleotide Scoring Systems
Precomputed values for λ, K and H
are available for BLASTN searches
with the following match/mismatch
(M, N)
scoring systems,
using the sets of gap penalties
{Q,R}:
Precomputed Nucleotide Scoring Systems
M     N      {Q,R}
+1    −3     {3,3} {3,2} {3,1} {7,2}
+1    −2     {2,2} {2,1} {1,1}
+3    −5     {10,5} {6,3} {5,5}
+4    −5     {10,5}
+1    −1     {3,1} {2,1}
+5    −4     {20,10} {10,10}
+5    −11    {22,22} {22,11} {12,2} {11,11}
Precomputed values are also available for a Purine-Pyrimidine scoring matrix
named “pupy”:
PuPy Matrix
Q     R
20    10
10    10
Protein Scoring Systems
Precomputed values for λ, K and H
are available for protein-level searches
(BLASTP, BLASTX, TBLASTN and TBLASTX)
with the following scoring matrix and
gap penalty combinations (or gap penalty ranges for R) {Q, R}:
BLOSUM50
Q     R
16    1–4
15    1–4, 6, 8
14    1–5, 8
13    1–5, 8
12    2–5, 7
11    2–4, 6, 8
10    2–6, 8
9     3–5, 7
8     4–8
7     6, 7
BLOSUM55
Q     R
16    1–4
15    1–4, 6, 8
14    1–5, 7
13    2–5, 8
12    2–5, 8
11    2–6, 8
10    3–6, 9
9     3–5, 7
8     4–8
7     7
BLOSUM62
Q     R
12    1–3
11    1–3
10    1–4
9     1–5
8     2–7
7     2–6
6     3–5
5     5
BLOSUM80
Q     R
12    2–12
11    2–11
10    2–10
9     3–9
8     4–8
7     5–7
PAM40
Q     R
12    1, 2, 6
11    1, 2, 7
10    1–3, 7
9     1–3, 6
8     1–4
7     1–4
6     2–5
5     2–5
4     3, 4
PAM120
Q     R
12    1, 2, 4
11    1–3
10    1–3, 5
9     1–3, 5
8     1–4, 6
7     2–4, 6
6     2–5
5     3–5
PAM250
Q     R
16    1–4
15    1–5
14    1–6
13    1–6
12    2–7
11    2–7
10    3–8
9     3–7
8     5–7
7     7
Bugs
AB-BLAST is certainly not bug free, but
bugs have typically been fixed within 24 hours of being reported.
The currently known bugs are:
The scale of the included BLOSUM80 scoring matrix is 1/3 bit, rather than
the 1/2 bit scale used otherwise for BLOSUM60 and above (BLOSUM60, 62, 70, 90, and 100).
This anomaly—which goes all the way back to NCBI BLAST 1.3 in 1993—may
be corrected (along with revised gapped lambda, K and H parameters) in a future release.
If you think you might be experiencing the effects of a bug,
please contact
us.
AB-BLAST exhibits a few different behaviors worth mentioning here,
because they could trip up or confuse even the most knowledgeable of BLAST users.
Any unexpected behavior might rightfully be construed as being a bug,
so the following information is provided here in the Bugs section to help avoid the unexpected.
If you should encounter problems or confusing areas
other than those described below,
or if you have questions or suggestions for improvement,
please send them to
us.
With its recently released “BLAST+” package,
the NCBI seems intent on confusing the marketplace and squelching
the innovative AB-BLAST effort,
by suddenly using program names (blastp, blastn, blastx, tblastn and tblastx)
that it abandoned over 14 years ago, in 1997.
These program names are identical to—and therefore conflict with—the program names used
by AB-BLAST and WU-BLAST before it for the past 17+ years.
The NCBI BLAST+ programs utilize an entirely different command line syntax
than vintage 1994 NCBI/WU-BLAST (as well as vintage 1997 NCBI-BLAST).
In contrast, through considerable independent effort,
WU-BLAST and AB-BLAST have maintained a high degree of backward
compatibility in command line usage and have used the original BLAST program names
continuously since 1994.
Advanced Biocomputing suggests the behavior of the U.S. Government agency is
anti-competitive and amounts to abuse of its monopoly position.
The conflict can be mitigated on an individual basis
by renaming the search programs in the NCBI BLAST+ package.
We encourage you to seek a better solution by writing to your
U.S. Congressional
Representative
and
Senators.
The Congressional mandate that created the NCBI in 1988 did not endow the agency
with the right to impede outside R&D or discourage business, nor did Congress intend it to do so.
Thank you for your support.
For dramatically increased speed—usually
accompanied by only a marginal loss of sensitivity—
a two-hit BLAST algorithm is now used by default in all protein-level searches
(i.e., in all search modes except BLASTN).
The classical one-hit BLAST algorithm can be invoked instead by specifying
hitdist=0 or by specifying any of the compat*
options such as
compat2.0
for WU-BLAST 2.0 compatibility.
For more information, see the description of the
hitdist
parameter.
Due to support added for the amino acid codes J and O,
XDF protein sequence databases produced with the AB- version of xdformat
are not readable by programs in the old WU-BLAST package.
For an XDF protein sequence database named “foo”
created by WU-xdformat,
the AB-xdformat command:
xdformat -p -i foo
will report the alphabet name as
“NCBIstdaa(1)” (NCBIstdaa version 1).
The larger amino acid alphabet normally used
by the AB- version of xdformat
is named “NCBIstdaa(2)” (NCBIstdaa version 2).
WU-BLAST only uses the version 1 alphabet,
whereas AB-BLAST creates and updates databases using version 2,
can read databases in either alphabet,
and can read combinations of the two alphabets in virtual databases.
N.B.: If a protein database created by WU-xdformat
is updated using AB-xdformat,
the alphabet is silently updated to NCBIstdaa version 2, which will render
the database unreadable by programs in the WU-BLAST package.
This warning does not currently apply to nucleotide sequence databases,
because no change has thus far been necessary in the nucleotide alphabet
used by AB-BLAST.
AB-BLAST 3.0 establishes a new default scoring system for BLASTN.
Much confusion
— and at times much consternation —
has been caused by the default WU-BLASTN scoring system
being quite different from the default nucleotide scoring system
used by the NCBI blastall program.
For the sake of long-term compatibility and consistency,
the WU-BLASTN default +5/−4 (match/mismatch) scoring system,
which dates back to the earliest incarnations of BLASTN at the NCBI
in 1989,
had been left unchanged.
The NCBI changed its default nucleotide scoring system to +1/−3 upon introduction
of blastall in 1997.
The +1/−3 scoring system is more efficient at finding
nearly identical sequences than the old +5/−4 scoring system,
and finding identical (or nearly identical) sequences is the most frequent
use for BLASTN.
The newer +1/−3 scoring system is also more consistent
with the BLASTN default word length of 11,
which also selects for nearly identical sequences.
With the transition of WU-BLAST to AB-BLAST,
the decision was made to eliminate the inconsistency
between the default scoring system, the default word length,
and the most frequent use for the program,
by making the default nucleotide scoring system used by AB-BLASTN
be the same as that used by NCBI blastall
(i.e., match and mismatch scores
M=1,
N=−3;
and gap penalties
Q=7,
R=2).
NOTE: The default amino acid scoring system used in all other search modes remains unchanged
in AB-BLAST and is somewhat different from the NCBI amino acid scoring system.
Another major difference in default behavior
between WU- and NCBI-BLAST
—
whether to filter query sequences for low-complexity regions
—
also remains unchanged in AB-BLAST.
Namely,
just as WU-BLAST does not filter query sequences by default,
neither does AB-BLAST filter sequences by default.
The amino acid codes
U (selenocysteine or Sec)
and
O (pyrrolysine or Pyl)
are acceptable in query and database sequences,
but the scoring matrices distributed with AB-BLAST do not specify scores
for these letters.
By default these letters are scored the same as alignment with an X (unknown residue)
would be scored,
except for their self-alignment scores
(i.e., U with U and O with O)
which are set to 0 by default.
If more meaningful scores are known, alternative scores for these letters
can be set explicitly in the amino acid scoring matrices.
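The default treatment of U and O described above can be written out as a small lookup rule. This is only a sketch of the stated behavior, not the search programs' code: default_score and the tiny toy matrix (with made-up values) are hypothetical, and scoring U against O as X against X is an assumption, since the text does not cover that case.

```python
SPECIAL = {"U", "O"}  # selenocysteine and pyrrolysine


def default_score(a, b, matrix):
    """Score a pair of residues when the matrix has no U or O entries."""
    if a == b and a in SPECIAL:
        return 0  # self-alignment of U or O defaults to zero
    # Otherwise U and O are scored as if they were X (unknown residue).
    a = "X" if a in SPECIAL else a
    b = "X" if b in SPECIAL else b
    return matrix[(a, b)]


# Made-up scores for illustration only; not taken from any bundled matrix.
toy = {("A", "A"): 4, ("A", "X"): -1, ("X", "A"): -1, ("X", "X"): -1}

print(default_score("A", "U", toy))  # -1, same as A with X
print(default_score("O", "O", toy))  # 0
```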
The only accepted way to specify an alternative scoring matrix file
is to refer to the file by name
(e.g., matrix=BLOSUM55)
and for the file to reside in the current working directory
or for the path to the file to be
listed in the BLASTMAT environment variable.
If both a path and file name to a scoring matrix file are specified,
such as in matrix=/usr/local/blast/matrix/aa/BLOSUM62
or matrix=aa/blosum62,
the search programs will claim not to be able to find the file even though
it may indeed exist and be readable.
This is a security measure that may allow managers
of network- or web-based search services to expose
the command line to users without opening up access to potentially any file
on the server,
when the mere knowledge that a file exists might be considered a breach of security.
The gap penalty parameters
Q
and
R
of AB-BLAST
are similar to the parameters G and E of NCBI Gapped BLAST,
but are interpreted differently in one important respect.
While the two extension penalties
R
(AB-BLAST) and
E
(NCBI-BLAST) are analogous,
Q
(AB-BLAST) is analogous to the sum of G and E
with NCBI-BLAST.
In other words, where
Q
represents the total penalty for a gap of length 1,
NCBI Gapped BLAST computes this penalty as G + E.
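The relationship between the two conventions reduces to simple arithmetic. The helper names below are hypothetical. The gap cost formulas follow from the description above: Q is the cost of a gap of length 1, and R or E is charged for each additional residue.

```python
def ncbi_to_ab(g, e):
    """Convert NCBI (G, E) gap penalties to AB-BLAST (Q, R)."""
    return g + e, e


def ab_to_ncbi(q, r):
    """Convert AB-BLAST (Q, R) gap penalties to NCBI (G, E)."""
    return q - r, r


def ab_gap_cost(q, r, length):
    """Total penalty for a gap of the given length under Q and R."""
    return q + r * (length - 1)


def ncbi_gap_cost(g, e, length):
    """Total penalty for a gap of the given length under G and E."""
    return g + e * length


print(ncbi_to_ab(11, 1))  # (12, 1)
print(ab_to_ncbi(7, 2))   # (5, 2)
```

Under this conversion, both conventions assign the same total penalty to a gap of any length.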
The default sort order for reporting database hits
is by increasing E-value (most-to-least significant ordering),
but for a given database hit,
the alignments or HSPs with that sequence are sorted
primarily by query strand, secondarily by the database (“subject”) strand,
and only then by E-value.
For example,
if any alignments of a given database sequence are to the minus strand of the query,
they will be reported after any alignments to the plus strand,
even if alignments to the plus strand are less significant.
In a TBLASTX search,
in which both the query and subject are translated nucleotide sequences,
for each strand of the query,
hits to the plus strand of the subject will be reported
before any hits to its minus strand.
Consequently, identifying the HSP ascribed with the greatest statistical significance
may require many lower-significance alignments to be parsed first.
Naturally, this consideration is not an issue for BLASTP searches,
where only one “strand” of query and subjects is searched.
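A parser that needs the most significant HSP for a database hit therefore cannot simply take the first one reported. The sketch below models the ordering described above with a hypothetical HSP record, and shows that the best alignment may come last.

```python
from collections import namedtuple

# Hypothetical record; real parsers would carry many more fields.
HSP = namedtuple("HSP", "query_strand subject_strand evalue")

STRAND_RANK = {"+": 0, "-": 1}  # plus strand is reported before minus


def report_order(hsps):
    """Order HSPs by query strand, then subject strand, then E-value."""
    return sorted(
        hsps,
        key=lambda h: (STRAND_RANK[h.query_strand],
                       STRAND_RANK[h.subject_strand],
                       h.evalue),
    )


def most_significant(hsps):
    """Scan every HSP; the first one reported is not necessarily the best."""
    return min(hsps, key=lambda h: h.evalue)


hsps = [HSP("-", "+", 1e-50), HSP("+", "-", 1e-5), HSP("+", "+", 1e-3)]
ordered = report_order(hsps)
best = most_significant(ordered)
print(ordered.index(best))  # 2: the best HSP is reported last
```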
On those rare computing platforms today that do not support “large” files
(files >2 GB in size),
users will be unable to search sequence databases
larger than about 8 billion nucleotides or 2 billion amino acids.
Migrating to a contemporary 32-bit operating system
—
or to a 64-bit computing platform that provides “large file” support
—
is sufficient to break through the “2 GB barrier”.
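The two figures quoted above are consistent with a 2 GB file limit if nucleotides are packed four to a byte and amino acids are stored one per byte. That storage layout is an inference from the numbers, not something the text states, so the sketch below is only a plausibility check.

```python
LIMIT_BYTES = 2 ** 31  # the "2 GB barrier"


def max_residues(residues_per_byte):
    """Largest number of residues that fits in a single 2 GB file."""
    return LIMIT_BYTES * residues_per_byte


print(max_residues(4))  # 8589934592 nucleotides, assuming 2 bits each
print(max_residues(1))  # 2147483648 amino acids, assuming 1 byte each
```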
The statistical significance of gapped alignment scores is computed using
values for λ, K and H
obtained from built-in,
precomputed tables.
(The values for λ, K and H used to assess the
significance of ungapped alignment scores are still computed at run time,
as is practical.)
These parameter values are
determined by the scoring matrix and gap penalties being used.
Precomputed values are necessarily not available for
all scoring matrix and gap penalty combinations, though;
and the precomputed values may not be well-suited
to an unusual residue composition of the query or database sequences.
In cases when precomputed values are unavailable,
the programs issue a relevant WARNING message and proceed to evaluate gapped alignment scores
using values for λ, K and H
that are likely to be incorrect:
the values computed at run time for ungapped alignments.
In such cases, the reported significance estimates may be highly inaccurate
and will be biased towards being overly significant.
If the user knows more accurate parameter values for their situation,
however,
the
gapK,
gapL
and
gapH
command line options
can be used to set them.
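The direction of the bias can be illustrated with the basic Karlin-Altschul relation E = K·m·n·exp(−λS). The sketch below is a simplification: AB-BLAST may apply further corrections that are not modeled here, and the parameter values are illustrative assumptions rather than entries from the built-in tables. Ungapped λ is generally larger than gapped λ, so applying ungapped parameters to a gapped score yields a smaller E-value than is warranted.

```python
import math


def expect(score, m, n, lam, k):
    """Basic Karlin-Altschul expectation for a score S in an m x n search space."""
    return k * m * n * math.exp(-lam * score)


# Illustrative values only; assumptions, not taken from AB-BLAST.
UNGAPPED = {"lam": 0.32, "k": 0.13}
GAPPED = {"lam": 0.27, "k": 0.04}

score, m, n = 100, 300, 100_000_000

e_fallback = expect(score, m, n, **UNGAPPED)  # what the fallback would report
e_gapped = expect(score, m, n, **GAPPED)      # what gapped parameters would give

print(e_fallback < e_gapped)  # True: the fallback looks more significant
```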
Selecting an alternative scoring matrix does not alter
the gap penalties
(Q
and
R)
from their default values.
Leaving gap penalties at their default values when choosing an alternative
scoring matrix
can not only result in alignments with undesirable gap characteristics but
can create a situation in which
the programs do not have precomputed values in their built-in tables
for λ, K and H.
Worst-case, the end result can be that the alignments represent horribly inaccurate
mappings between the query and subject sequences and the P-values ascribed
to the alignments are horribly inaccurate as well.
(Actually, a worst-case scenario might be when the alignments and statistics
are bad but not bad enough to be noticed by the user, who then proceeds to use
the results—both false positives and false negatives—as though they were meaningful.)
As described earlier, a WARNING message will be displayed when precomputed
values are not available,
but nevertheless the search will go on
and the alignments and statistics may be anywhere from
slightly to horribly misleading.
The
hspsepqmax
and
hspsepsmax
parameters are measures of distance
in residues along the sequences in the specific form in which they are
actually compared.
For instance, in a BLASTX search (conceptually translated nucleotide
query compared against a protein sequence database),
hspsepqmax
refers to a distance measured in amino acid residues, not the underlying
nucleotides in the query.
ASN.1
formatted output is not available from AB-BLAST.
XML and tab-delimited output formats are recommended instead.
(See the
mformat
parameter.)
Supported Platforms for Standard & Enterprise Editions
The computing platforms currently supported for
AB-BLAST Standard Edition and Enterprise Edition are listed below.
(The list of platforms supported for AB-BLAST Personal Edition is much
shorter).
Software for computing platforms other than those listed here
may be available upon request, but additional charges may apply.
Linux kernel versions 2.4 and 2.6 for 32-bit i786 (Pentium4) and 64-bit X64
Apple Mac OS X 10.4+ on 32-bit X86 and X64
Apple Mac OS X 10.4+ on 32-bit PowerPC G4 and 64-bit PowerPC G5
Sun Solaris 10 on 64-bit X64
The list of supported platforms is subject to change without notice.
Multiple processors
(multithreading or parallel processing)
are effectively and efficiently
supported by AB-BLAST on all of the above platforms.
AB-BLAST also supports large files
(files greater than 2 GB in size),
when the host operating system and file system support large files.
Installation
To install AB-BLAST,
the first step is to download the UN*X tar archive of executable files appropriate
for your computing platform
from the Advanced Biocomputing, LLC website.
To locate the software,
licensed users will have received a confidential URL via e-mail.
Please note that scoring matrix files and documentation,
which are not generally platform-specific,
are nevertheless included in each package.
No databases are included, however.
Unpack the tar archive in a new, empty directory.
For convenience, precompiled and optimized versions
of the low-complexity sequence filters
(e.g., seg,
xnu,
and
dust)
are included (see the filter/ subdirectory that gets created),
along with two sequence redundancy removal programs
nrdb
and patdb.
Users of Mac OS X 10.6 and 10.7 (“Snow Leopard” and “Lion”) Only
To ensure proper, complete unpacking of tar archives on normal, case-insensitive HFS+ file systems,
use the Terminal app to execute the command:
gnutar zxf archive.tar.gz
where archive.tar.gz is substituted with the name of the AB-BLAST archive you downloaded.
The use of gnutar is needed to avoid a bug in the version of tar currently distributed with Snow Leopard
(at least up to version 10.6.1) that involves the treatment of hard-linked files.
If your web browser uncompressed the archive after downloading,
the file will lack the .gz extension, in which case the “z” should be
omitted from the gnutar command.
The executable programs from the tar archive may be moved as desired
into any directory listed in the PATH environment variable,
whether this means adding the newly created directory to the PATH
or moving the executables
into an existing directory already listed in the PATH.
(Lots of information about interrogating and setting environment variables
—
and about the PATH environment variable itself
—
can be found in Google and other
search engines using the query “path environment variable”).
If the software is installed in a directory that was already listed
in the PATH,
it may be necessary to exit the currently open shell and open a new one
in order for the shell to recognize the existence
of the newly installed programs.
Note that the files
blastp, blastn, blastx, tblastn and tblastx
are actually “hard links” to the same executable program,
blasta,
that encodes the integrated capabilities of all 5 search methods.
If desired, the links can be renamed, as long as the original names appear as substrings
within the new names.
Alphabetic case is unimportant.
For instance, a link named ab-blastp will still invoke blasta
in its blastp operational mode.
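The naming rule can be modeled as a case-insensitive substring test on the name under which the program was invoked. How blasta actually resolves the name is not documented here, so the function below is a hypothetical sketch. It checks the longer names first, an assumption made so that a link containing "tblastn" is not mistaken for "blastn".

```python
import posixpath

# Longer names first, so "tblastn" and "tblastx" win over "blastn" and "blastx".
MODES = ("tblastn", "tblastx", "blastp", "blastn", "blastx")


def search_mode(invoked_as):
    """Guess the search mode from the name a link was invoked under."""
    name = posixpath.basename(invoked_as).lower()
    for mode in MODES:
        if mode in name:
            return mode
    return None  # no recognizable mode in the name


print(search_mode("/usr/local/bin/ab-blastp"))  # blastp
print(search_mode("AB-TBLASTN"))                # tblastn
```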
A Note to Mac OS X Users
AB-BLAST software is intended to be invoked via a CLI (command line interface).
Programs will need to be invoked either using the Terminal application
(located in the /Applications/Utilities folder)
or from within a script or other application provided by a third party.
The programs bundled with AB-BLAST are not themselves intended to be double-clicked to execute.
A Note About File Permissions and File Copying
The AB-BLAST package is copyrighted and only available under license.
To help ensure users of the software
do not unintentionally copy or distribute it,
all copies of binary files are recommended to be maintained
with execute-only permissions.
As delivered in the software archives from Advanced Biocomputing, LLC,
execute-only permissions have already been set,
but if the binary files should be copied by you,
these permissions may become altered and thus allow other users
to then copy the software in an unauthorized manner.
Restoration of execute-only permissions to an executable program file
can be accomplished by running the command:
chmod 0111 filename
where filename is the name of the executable file.
If you already had AB-BLAST (or WU-BLAST) installed (with BLAST-able databases),
your installation or update of AB-BLAST is essentially complete.
If you did not have AB-BLAST or WU-BLAST installed,
read on...
Unpacking the tar archive creates a matrix/ subdirectory containing scoring
matrix files.
Wherever this directory ultimately resides,
the BLASTMAT (or ABBLASTMAT) environment variable should be set to point there.
In the absence of this environment variable being set,
AB-BLAST programs first look for scoring matrix files
in any matrix/ subdirectory
of the directory in which the search programs reside
and then in the /usr/ncbi/blast/matrix directory.
Low-complexity sequence filters or masking programs —
e.g.,
seg,
xnu
and
dust —
are now included in the tar archives described here.
The bundled versions of these programs are precompiled and optimized.
While these filter programs are not required for running the search programs,
they can enormously reduce the amount of garbage output produced,
memory used, and search time taken.
Hence, it is highly recommended that these programs be made available to users.
Whatever directory you install the filter programs in,
the BLASTFILTER (or ABBLASTFILTER) environment variable should be set to point there.
In the absence of this environment variable being set,
the programs look for masking programs in any filter/
subdirectory of the directory in which the search programs themselves reside
and then in /usr/ncbi/blast/filter.
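The lookup order for scoring matrices and filter programs described above can be summarized in a few lines. The function is a hypothetical sketch: it assumes each environment variable names a single directory, and that the AB-prefixed variable takes precedence over the older one when both are set.

```python
import posixpath

ENV_VARS = {
    "matrix": ("ABBLASTMAT", "BLASTMAT"),
    "filter": ("ABBLASTFILTER", "BLASTFILTER"),
}


def resource_dirs(kind, env, program_dir):
    """Return the directories searched for 'matrix' or 'filter' resources."""
    for var in ENV_VARS[kind]:
        if env.get(var):
            return [env[var]]  # an environment variable overrides the defaults
    # Defaults: a subdirectory beside the search programs, then the legacy path.
    return [posixpath.join(program_dir, kind), "/usr/ncbi/blast/" + kind]


print(resource_dirs("matrix", {}, "/opt/ab-blast"))
# ['/opt/ab-blast/matrix', '/usr/ncbi/blast/matrix']
```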
NOTE: unlike NCBI BLAST,
the AB-BLAST search programs do not employ sequence filtering by default.
This behavior might change in the future, though.
In case the search programs are updated on your system without warning
and you wish to guarantee in an automated analysis pipeline
that no filtering will ever be performed,
just specify filter=none on the command line.
The databases themselves are obviously not included with the software.
Once the source databases have been downloaded from any of many Internet sites,
the database files are typically uncompressed and processed into FASTA format,
if they are not in FASTA format already.
Included in the tar archives are several utility programs for converting
textual database files:
gb2fasta converts the nucleotide sequences in
GenBank flat files
into FASTA format.
gt2fasta converts the CDS translations in
GenBank flat files
into FASTA format.
sp2fasta converts
EMBL
or
Swiss-Prot
flat files into FASTA format.
The NCBI software
Toolbox also contains some relevant parsers.
One of these is
asn2fsa, which converts both nucleotide and peptide sequences in
GenBank ASN.1 format
into FASTA format files.
The asn2ff parser,
which converts GenBank ASN.1 data into other flat file formats,
may also come in handy, especially if you are inclined to parse GenBank
into FASTA using your own routines
or to use the gb2fasta and gt2fasta programs mentioned above.
All of the above parsers can read from standard input (sometimes signified by a single dash, “-”),
so their input files can be maintained on disk in compressed format and dynamically zcat-ed or gunzip-ed
directly into the parsers, thus saving the time and storage required for the uncompressed data.
Because a dash is often used to signify the start of each command line option,
if a dash is needed to specify standard input for the required input file name argument,
some of these programs require that a double-dash (--)
be specified on the command line before the single-dash.
This double-dash signifies the end of the command line options and the
start of the required arguments.
Once a source database is in FASTA format,
the xdformat program
should be used to convert it into “blastable” format.
Concise usage instructions for xdformat (and xdget)
can be obtained by invoking each program without any command line arguments.
By default,
xdformat produces 3 output files whose names are derived from the name
of the FASTA input file.
The 3 output files have distinct file name extensions and
together comprise the blastable database.
If sequence identifiers are optionally indexed during database creation,
the blastable database will consist of a total of 4 output files.
Databases formatted by xdformat
retain full ambiguity code information within the blastable database files.
By default, if any unrecognized amino acid or nucleotide codes are encountered
or if the FASTA input file should otherwise appear corrupt,
xdformat will emit an error message and halt.
In such cases, if the blastable database was to be newly created,
xdformat will remove the blastable database files before halting.
If an existing blastable database was being appended with new sequences when the error arose,
the blastable database will be rolled back to its original state prior to the update attempt
with none of the new sequences appended.
While formatting the database,
the xdformat program can optionally (-I option)
index the sequence identifiers
for later identifier-based retrieval with the xdget program.
XDF databases that were formatted without an identifier index
can have an index created post hoc by xdformat
with its -X option.
It may be of interest to note
for the purposes of their maintenance
that xdformat and xdget are actually one-and-the-same
program file,
merely invoked under the two different names to obtain the two
different program behaviors.
This helps ensure that the index created with xdformat
will be compatible with xdget.
See the file "FAQ-Indexing.html" for more details on identifier indexing.
For compatibility with legacy BLAST installations,
the xdformat program can function
in a setdb- and pressdb-compatibility mode,
wherein its behavior is similar to that of setdb and pressdb.
In its compatibility mode, a similar command line structure is used
and the output files produced have the same names as those produced by setdb and pressdb.
Compatibility mode is invoked
when xdformat is renamed or has links pointing
to it named setdb and pressdb.
While the files produced in compatibility mode have the same file names as those
produced by the original setdb and pressdb programs
(setdb.real and pressdb.real),
the content of these files is always XDF.
Versions of the BLASTA search program dated on or after 1999-12-14
are able to work with the more-capable XDF databases.
Note that two XDF databases —
one protein and one nucleotide —
can be created with the exact same name and exist in the exact same directory,
because the 3-letter extensions of XDF database file names are distinct
for protein sequence databases and nucleotide sequence databases.
If xdformat and the legacy setdb and pressdb programs
have all been used to create databases with the same name that reside in the same directory,
the BLAST search programs will preferentially search the databases
created with xdformat which will have the standard XDF database
file name extensions.
Using the -t option to xdformat,
a descriptive name or title can be assigned to a database that will appear in BLAST search output.
The title of an existing database can be changed after its creation,
by appending an empty FASTA database and specifying the -t option with the desired new title.
For example,
xdformat -n -a mydb -t "Fancy New Title" /dev/null
The blastable database files can be placed anywhere,
but for convenience the BLASTDB environment variable
should include their directory location.
If the BLASTDB environment variable is not set,
the programs look for databases by default
in /usr/ncbi/blast/db and in the current working
directory.
If the old pressdb program (instead of xdformat)
is used to create the blastable database,
the associated nucleotide sequence FASTA file must be located
in the same directory as the three output files from pressdb,
if the BLAST search programs are to find the FASTA file.
It may sometimes be useful to maintain the FASTA files
in a separate directory — even on another disk partition — and
provide UNIX soft links in the BLASTDB directory that point to the
real location of the FASTA files.
In addition, on systems where NCBI BLAST will not be in use,
blastable databases can be maintained in multiple directories listed
in the BLASTDB environment variable,
with each directory name delimited from the next by a colon (:),
just as directory names are often delimited
in the PATH environment variable.
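A colon-delimited BLASTDB value is split in the same way as PATH. The sketch below uses a hypothetical helper. The order of the two default locations is an assumption, since the text names them without stating which is tried first.

```python
def database_dirs(blastdb, cwd):
    """Return the directories in which blastable databases are sought."""
    if blastdb:
        # Split on colons and ignore empty entries such as a trailing ':'.
        return [d for d in blastdb.split(":") if d]
    # BLASTDB is unset: fall back to the documented default locations.
    return ["/usr/ncbi/blast/db", cwd]


print(database_dirs("/data/db:/scratch/db", "/home/user"))
# ['/data/db', '/scratch/db']
```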
On multi-processor computer systems,
the search programs will employ
as many CPUs as are installed;
when more than about 4 CPUs are used,
this default behavior can cause the efficiency of hardware utilization
to be quite low,
compared to running individual single-threaded BLAST jobs
on each CPU.
Memory use also increases linearly with the number of CPUs or threads
employed.
One way to govern the number of processors employed is
to wrap the search programs in a shell script that sets a lower
number of CPUs via the cpus=# command line option.
Another, simpler approach to changing the default number of CPUs
for all users follows below,
for implementation by BLAST system managers possessing
“root” or “SuperUser” privileges.
Distributions of AB-BLAST
include a sample file named sysblast.sample,
that illustrates the system-wide configuration parameters
that can be established to govern the execution of BLAST jobs
and, thereby, provide a more productive, trouble-free level of service.
When the sysblast file is installed under the name /etc/sysblast,
all BLAST jobs executed on a given computer system can be made
subject to the parameters:
cpusmax=<n>: a hard limit on the number of CPUs or threads employed
by each BLAST job;
it is possible to prohibit BLAST searches entirely
on a given computer by configuring a negative value for cpusmax;
cpus=<n>: the default number of CPUs or threads employed per BLAST job;
nice=<n>: a “nice” value for altering the priority of BLAST processes.
As is standard for UNIX operating systems:
positive nice values correspond to lower priority;
only the root user can run at negative nice values (higher priority);
any nice value set in /etc/sysblast is added to the current nice value of a BLAST process;
memmax=<n>: the maximum amount of memory that may be
allocated by any single BLAST job.
The interpretation and recommended usage of memmax are:
memmax is expressed in units of bytes,
with optional modifiers
k (kilobytes), m (megabytes), and g (gigabytes).
It is almost certainly a bad idea to set memmax to a value that is
greater than the actual amount of memory (silicon RAM) installed
in the computer;
If memmax=0, the effective limit is “unlimited”, or the natural
upper limit for a process executing under the given operating system;
Values of memmax < 0 are ignored, in which case
the standard UNIX datasize resource limit set by the
user’s command shell governs BLAST memory usage instead;
The sysblast file is only effective when installed in the /etc directory.
The /etc directory generally resides locally to any given computer system,
so parameter settings can be tailored to each computer,
even if the BLAST software is maintained on a shared disk partition.
The /etc directory should only be writable by “root”.
Unlike the shell script wrapper approach described above,
the limits set in /etc/sysblast typically cannot be circumvented
by normal (non-root) users of a computer system.
See the comments included in the sample sysblast file for further details.
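The interpretation of memmax and nice values given above can be sketched as follows. These helpers are hypothetical and do not read /etc/sysblast. Treating k, m and g as binary multiples (1024-based) is an assumption, because the text does not say whether decimal or binary units are meant.

```python
# Assumption: binary multiples; the documentation does not specify.
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_memmax(text):
    """Convert a memmax value such as '512', '64m' or '2g' to bytes."""
    text = text.strip().lower()
    if text[-1:] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def memory_policy(memmax):
    """Describe which limit governs a BLAST job for a given memmax."""
    if memmax < 0:
        return "ignored: shell datasize limit applies"
    if memmax == 0:
        return "unlimited"
    return "%d bytes" % memmax


def effective_nice(current, configured):
    """The configured nice value is added to the process's current value."""
    return current + configured


print(parse_memmax("2g"))                # 2147483648
print(memory_policy(parse_memmax("0")))  # unlimited
print(effective_nice(5, 10))             # 15
```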
Differences between AB-BLAST and WU-BLAST
Apart from bug fixes, the most outward differences in usage and appearance
of AB-BLAST and WU-BLAST include:
The default scoring system for AB-BLASTN is
match/mismatch scores M=+1 N=−3 with gap penalties Q=7 R=2;
whereas WU-BLASTN uses
M=+5 N=−4 with gap penalties Q=10 R=10 by default.
In all search modes, the default value for the gapped alignment drop-off score
gapX
is ≈50% higher for AB-BLAST,
which will tend to make the AB-BLAST search programs
slightly more sensitive and just slightly slower.
AB-BLAST
supports an expanded amino acid alphabet,
compared to the amino acid alphabet used by WU-BLAST.
Programs in the WU-BLAST package are consequently unable to search or modify
protein sequence databases that were created or modified by AB-xdformat.
Once a protein sequence database created with WU-xdformat
has been modified by AB-xdformat,
it can no longer be searched or modified by any of the WU-BLAST programs.
Databases created by WU-xdformat can be searched
and modified by the AB-BLAST programs.
At least for the time being,
the AB-BLAST search programs can also search virtual databases
that are a combination of databases created
with WU-xdformat and AB-xdformat.
No difference currently exists between the nucleotide alphabets used
by AB-BLAST and WU-BLAST
or the ability of programs in either package
to search/modify nucleotide sequence databases created/modified
by programs in the other package.
The bundled BLOSUM30 and BLOSUM35 scoring matrices
have been re-scaled to provide better precision.
The bundled amino acid scoring matrices
— and the matrices output by the pam program —
now contain a J row and a J column.
These matrices are incompatible with WU-BLAST,
which does not support the letter J and will report a FATAL error
when reading the files.
The AB-BLAST amino acid scoring matrices are slightly different
from the matrices distributed by the NCBI,
which also indicate scores for the letter J,
but at the time of this writing the NCBI matrices are cross-compatible with AB-BLAST.
AB-BLAST supports the amino acid letter code O (“oh”)
normally used to represent Pyrrolysine (Pyl),
whereas WU-BLAST does not.
The letter O may appear in query sequences, database sequences, scoring matrix files
and with command line parameters such as the
altscore
option,
but the scoring matrices bundled with AB-BLAST do not actually utilize this letter.
The default score for aligning any other letter
with O is the same score as for aligning with X,
whereas the O self-alignment score defaults to zero (0).
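The default treatment of O described above can be sketched in a few lines of Python. The dictionary representation, the function name and the toy scores are assumptions made purely for illustration; AB-BLAST's internal matrix handling is not described in this document.

```python
def with_default_O(matrix: dict) -> dict:
    """Return a copy of a {(a, b): score} matrix extended with the letter O.

    Pairing any other letter with O scores the same as pairing it with X,
    and O aligned with itself scores 0, as described in the text.
    """
    letters = {a for a, _ in matrix}
    extended = dict(matrix)
    for a in letters:
        extended[(a, "O")] = matrix[(a, "X")]
        extended[("O", a)] = matrix[("X", a)]
    extended[("O", "O")] = 0
    return extended


# Toy two-letter matrix with hypothetical scores (not a bundled matrix).
toy = {
    ("A", "A"): 4, ("A", "X"): 0,
    ("X", "A"): 0, ("X", "X"): -1,
}
extended = with_default_O(toy)
```

A matrix file that supplies its own O row and column would presumably take precedence over these defaults, but that is an inference from the text rather than documented behavior.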
The AB-BLAST 3.0 search programs support a new
compat2.0
option to obtain roughly equivalent parameter
settings to those used by WU-BLAST 2.0.
The analog to wu-blastall is named ab-blastall.
The analog to wu-formatdb is named ab-formatdb.
AB-BLAST programs preferentially use settings
of the new environment variables
ABBLASTMAT, ABBLASTDB and ABBLASTFILTER.
See the section on
Environment Variables for important details.
When upgrading from WU-BLAST to AB-BLAST,
it is important to ensure that the AB-BLAST search programs
use the bundled scoring matrices
rather than the old matrices that were distributed with WU-BLAST,
because the old matrices lack support for the letter J.
The maximum allowable value for the
dbslice
parameter has been increased.
Each release of AB-BLAST is generally distributed in 3 “Editions”
— Personal, Standard and Enterprise —
which differ from each other in the degree of parallelism they support,
whereas WU-BLAST was distributed in a single version.
The first few lines of output, including the program declaration line
and copyright notice, are different.
See
Citing BLAST for examples of the program declaration line
from AB-BLAST.
Programs in the AB-BLAST package that use the UNIX standard
getopt() function to parse the command line
now produce “POSIXLY_CORRECT” behavior
uniformly across all computing platforms.
(N.B. The BLAST search programs do not use getopt(),
but most other programs in the package,
including xdformat and xdget, do.)
This means some command lines that are acceptable to WU-BLAST
on some computing platforms (usually Linux)
may be rejected by AB-BLAST and need to be restructured.
This can happen when any parameter or flag is specified
after (to the right of) a required command line argument;
all parameters and flags must come before (to the left of) the required arguments.
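The difference between POSIX-style and GNU-style option parsing can be demonstrated with Python's standard getopt module, whose getopt() stops at the first non-option argument while gnu_getopt() accepts options placed after operands. This is only an analogy to the C getopt() behavior described above. The option letters and file names are placeholders, not a statement of the options accepted by xdformat or xdget.

```python
import getopt

# Hypothetical option spec: -p is a flag, -o takes a value.
SHORTOPTS = "po:"


def parse_posix(argv):
    """POSIX style: parsing stops at the first non-option argument."""
    return getopt.getopt(argv, SHORTOPTS)


def parse_gnu(argv):
    """GNU style: options may follow operands (unless POSIXLY_CORRECT is set)."""
    return getopt.gnu_getopt(argv, SHORTOPTS)


options_first = ["-p", "-o", "mydb", "seqs.fasta"]
options_last = ["seqs.fasta", "-p", "-o", "mydb"]

# Options placed first are recognized by the POSIX-style parser.
opts, operands = parse_posix(options_first)
# opts == [("-p", ""), ("-o", "mydb")], operands == ["seqs.fasta"]

# Options placed last are not parsed at all: everything becomes an operand.
late_opts, late_operands = parse_posix(options_last)
# late_opts == [], late_operands == ["seqs.fasta", "-p", "-o", "mydb"]
```

A program that requires exactly one operand would reject the second command line under POSIX-style parsing, which is why such command lines need to be restructured with the options first.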
Better thread management under energy conservation conditions.
Better memory management under Mac OS X.
Citing BLAST
Citations or acknowledgments of AB-BLAST usage are greatly appreciated,
as are any personal accounts of how the software is being used
that you might wish to share.
When URLs are acceptable, please cite with:
Gish, W. (1996-2009) http://blast.advbiocomp.com
When URLs are not acceptable, please use:
Gish, W. (unpublished).
In scientific communications,
it is important to report
both the program name and the specific version used.
In the case of AB-BLAST,
the version is a combination of the version number, edition (Personal, Standard, or Enterprise),
release date,
target platform,
and build date.
The release date is the first (left-most) date displayed on the first line
of output and corresponds to the completion date of the source code.
The build date is the second date reported
and corresponds to the date and time the executables were built for the indicated target platform.
Both dates are reported in
ISO 8601 format.
For example, consider this introductory line of output
from AB-BLAST 3.0 Standard Edition:
Here the program name is BLASTN, the software version is “3.0SE”
from “AB” (Advanced Biocomputing, LLC),
the release date is May 29, 2009,
and the build date of the 64-bit Solaris 10 X64 binary
is May 30, 2009, at 1:25 AM.
“ILPF64” in the target platform description indicates
integers (I), long integers (L), memory pointers (P), and file pointers (F) were all compiled with 64-bit precision.
The first line of output from AB-BLAST Personal Edition
substitutes the letters “PE” for SE,
as shown in this example:
The original description of the
(1-hit)
BLAST algorithm
was published by
Altschul et al. (1990).
In addition to the algorithm itself,
BLASTP and BLASTN functionality are described,
without referring to the programs by name.
BLASTX-like functionality is briefly mentioned as being in progress
(again not by name),
but TBLASTN was actually the third BLAST search mode implemented.
Statistical significance of the ungapped alignments found by the programs
was assessed using
“Karlin-Altschul” statistics
— sometimes also referred to as
“Karlin-Dembo-Altschul” statistics,
due to the major contribution of
Amir Dembo.
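For readers unfamiliar with Karlin-Altschul statistics, the central relationship is that the expected number of ungapped alignments scoring at least S between sequences of lengths m and n is E = K·m·n·e^(−λS), with the corresponding probability of seeing at least one given by P = 1 − e^(−E). The Python sketch below computes both. The function names are assumptions, and any λ and K values passed in are up to the caller, since the parameters actually used by a given program depend on the scoring system.

```python
import math


def expected_hsps(score: float, m: int, n: int, lam: float, K: float) -> float:
    """Expected number of HSPs with score >= `score` (the E-value)."""
    return K * m * n * math.exp(-lam * score)


def p_value(score: float, m: int, n: int, lam: float, K: float) -> float:
    """Probability of at least one HSP with score >= `score`: 1 - exp(-E)."""
    return -math.expm1(-expected_hsps(score, m, n, lam, K))
```

Because E falls exponentially with the score, a modest increase in S reduces the expected number of chance hits by orders of magnitude, and for small E the P-value and E-value are nearly equal.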
In December 1989,
prior to the development of the World Wide Web,
the NCBI
Experimental BLAST Network Service
was opened to the public.
The BLAST network service provided fast, convenient client-server access
from anywhere on the Internet
to the very latest versions of the recently parallelized BLAST search programs
running on powerful 8–16 processor Silicon Graphics servers at the NCBI.
The BLAST servers searched against a comprehensive set of public sequence databases
that were updated daily.
Users could access the BLAST servers transparently using a UN*X command line client
that was invoked just like the BLAST application programs themselves,
or via a graphical client named HyperBLAST (J.M. Cherry, 1990, unpublished)
created with
HyperCard.
At about this time, the
“nr”
(quasi-non-redundant) protein and nucleotide sequence databases
were also established (W. Gish, unpublished).
The nr database — protein and nucleotide —
quickly became the standard database searched with BLAST,
and users could often do so in a matter of just a few seconds.
The experimental BLAST service was ultimately discontinued
a decade later, in March 2000.
Experience gained from providing a service that could
arbitrate many simultaneous and diverse requests for BLAST
helped Gish design a more flexible and robust network service
architecture known as the NCBI “Dispatcher”,
which was then largely implemented by others at the NCBI
(principally Jonathan Epstein) and went into operation ca. 1995.
At the request of NCBI management,
the experimental BLAST service was never published
and remains citable only as W. Gish (unpublished).
Awareness of the service nevertheless spread quickly by word-of-mouth,
as was the case for the later WU-BLAST.
The BLASTX program first appeared in the release of
BLAST version 1.1 in July 1990.
The program was later described and evaluated by
Gish and States (1993).
The BLAST3 program
(Altschul and Lipman, 1990) was also folded into the BLAST 1.1 release
and parallelized.
The use of Poisson statistics
to evaluate the joint probability of multiple HSPs from a given (query,subject)
sequence pair,
as had been suggested by
Karlin and Altschul (1990),
was also first featured in BLAST 1.1.
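A common textbook form of the Poisson calculation treats the number of chance HSPs scoring at least S as Poisson-distributed with mean a = K·m·n·e^(−λS), so the probability of observing r or more such HSPs is 1 − e^(−a)·Σ a^i/i! for i from 0 to r − 1. The sketch below implements that form in Python. Whether BLAST 1.1 computed it in exactly this way is not stated here, and the function name and arguments are assumptions.

```python
import math


def poisson_p_value(r: int, score_r: float, m: int, n: int,
                    lam: float, K: float) -> float:
    """Probability of at least `r` chance HSPs each scoring >= `score_r`.

    `score_r` is the lowest score among the r HSPs being evaluated jointly.
    """
    if r < 1:
        raise ValueError("r must be at least 1")
    a = K * m * n * math.exp(-lam * score_r)  # Poisson mean
    # Probability of seeing fewer than r HSPs, i.e. 0 .. r-1 of them.
    fewer = math.exp(-a) * sum(a ** i / math.factorial(i) for i in range(r))
    return 1.0 - fewer
```

With r = 1 this reduces to the single-HSP P-value, and requiring more HSPs at the same score threshold always yields a smaller probability, which is how several individually unremarkable HSPs can become jointly significant.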
The BLASTC program,
a specialized version of BLASTX that considered codon usage information
in addition to sequence similarity
(States and Gish, 1994),
appeared only once,
in the
BLAST 1.3
distribution.
The BLAST 1.3 distribution was also the last to include the BLAST3
program.
BLAST 1.4
(W. Gish, 1994, unpublished)
was the first version to use
Karlin and Altschul (1993)
“Sum” statistics
to evaluate the joint probability of finding multiple HSPs between a given pair of sequences.
Sum statistics were found to be more practical in a biological context
than the Poisson statistics utilized by default in BLAST 1.3.
The TBLASTX program first appeared in BLAST 1.4
and remains attributable to W. Gish (1994, unpublished).
All five of the supported BLAST programs in BLAST 1.4
(BLASTP, BLASTN, TBLASTN,
BLASTX and TBLASTX)
were for the first time coded
using a standard API (application programming interface)
to a generalized BLAST function library.
This function library made maintenance and improvements to the five core programs easier
and aided the development of more specialized BLAST applications,
such as Entrez sequence neighboring tools and specialized EST analysis tools.
The first release of WU-BLAST was numbered 1.4,
which was virtually identical to the public domain NCBI BLAST 1.4,
save for a few bug fixes.
The WU-BLAST Archives (original URL http://blast.wustl.edu)
first appeared on the Internet in 1995,
to provide continuity of support for the work Warren Gish began at the NCBI,
as well as to provide a central resource where the community could find
BLAST-related software, information and earlier versions.
In late 1994,
at the invitation of Warren Gish,
who had recently moved to Washington University in St. Louis,
Stephen Altschul joined him in a collaboration to test
several of Gish’s hypotheses:
Sum statistics
(Karlin and Altschul, 1993)
allowed the evaluation of multiple ungapped alignment scores,
using the analytically computed ungapped parameters
λu, Ku and Hu.
Extreme Value statistics
— analogous to the statistics for ungapped alignment scores published by
Karlin and Altschul (1990)
—
had been shown empirically to be good estimators
of the statistical significance of individual gapped alignments
from Smith-Waterman comparisons,
using empirical estimates for the gapped parameters
λg and Kg
(Collins and Coulson, 1990;
Mott, 1992;
Waterman and Vingron, 1994).
It stood to reason that
Sum statistics might be empirically extended to evaluating
multiple gapped alignment scores, using empirically estimated parameters
λg, Kg and Hg;
While good estimates for
λg, Kg and Hg could be
computed through lengthy (computationally expensive) Monte Carlo simulations
for a specific scoring system and particular pair of sequences,
fixed estimates for these parameters precomputed
for sequences of “average” composition
would work well enough as to be of practical use
in comprehensive database searches;
and
For an improved search algorithm,
multiple, locally optimal gapped alignments between two sequences
could be approximated by a two-stage BLAST implementation
that would:
remain fast, yet be far more sensitive than ungapped BLAST;
produce more-easily interpreted alignments;
and yield alignment scores suitable for evaluation
with the expanded role proposed for Sum statistics.
If the effort panned out as hoped,
the new gapped BLAST method
would in some cases be more sensitive and selective
than even the standard Smith-Waterman algorithm,
due to the newer method’s ability to find multiple gapped alignments
between a pair of sequences
and to evaluate their significance jointly with Sum statistics.
While Altschul set to work empirically testing Sum statistics
on gapped alignment scores,
Gish focused on the alignment problem.
Early results from their work appeared in
Altschul and Gish (1996)
and provided much of the foundation for WU-BLAST 2.0
and later NCBI blastall.
The first complete implementation of gapped BLAST
(BLASTP, BLASTN, BLASTX, TBLASTN and TBLASTX)
with statistical significance estimates (both Poisson and Sum statistics)
was publicly released as WU-BLAST version 2.0d1
(W. Gish, unpublished),
in time for presentation
at the Cold Spring Harbor conference on Genome Mapping and Sequencing
in May 1996.
The NCBI published its BLAST version 2 or “Gapped BLAST”,
including a description of a new 2-hit ungapped BLAST algorithm
and the PSI-BLAST program,
in
Altschul et al. (1997),
in September 1997.
All search modes, except BLASTN, used the new 2-hit
algorithm by default.
Within days of that publication,
a faster, more sensitive 2-hit algorithm
was deployed in WU-BLAST 2.0.
In late 2008,
rights to WU-BLAST were acquired from
Washington University in St. Louis
by the author, Warren R. Gish.
The right to license the software to the community was acquired by
Advanced Biocomputing, LLC in 2009.
Dembo, A, and S Karlin (1991).
Strong limit theorems of empirical functionals for large exceedances
of partial sums of i.i.d. variables.
Ann. Probab. 19:1737–55.
Dembo, A, and S Karlin (1992).
Limit distributions of maximal segmental score among Markov dependent
partial sums.
Adv. Appl. Probab. 24:113–40.
Mott, RF (1992).
Maximum-likelihood estimation of the statistical distribution of Smith-Waterman
local sequence similarity scores.
Bull. Math. Biol. 54:59–75.