AB-BLAST 3.0 is a powerful
software package for gene and protein identification,
using sensitive, selective and rapid similarity searches of
protein and nucleotide sequence databases.
The feature list for AB-BLAST is long and continues to expand,
while performance is improved.
Much of this is outlined
below.
A complete suite of BLAST search programs
(blastp, blastn, blastx, tblastn and tblastx)
is provided in the package,
along with several database management and support programs that
include
nrdb,
patdb,
xdformat,
xdget,
seg,
dust
and
xnu.
AB-BLAST has been built to be the most trusted database search tool
in your software toolbox,
doing what you tell it, reporting precisely what it’s doing
— even telling you what it could not do because
of specific parameter restrictions you might wish to change —
and able to handle the biggest jobs with aplomb.
Users of other BLAST implementations have suffered every few years through a series
of expensive and time-consuming rewrites, alpha releases, beta releases, output format changes,
database format changes, specialized spin-off programs, and new program parameters and behaviors
that may be important to NCBI operations but not to anyone else.
Meanwhile AB-BLAST was built from scratch
to offer performance, reliability and flexibility —
plus backward compatibility
with every AB-BLAST release for over 24 years.
AB-BLAST represents the most rigorous,
sensitive implementation of the core BLAST algorithm available,
yet it often runs faster than the rest.
AB-BLAST has a simple, easy-to-use command line syntax;
offers consistent behavior across all search modes;
runs on general purpose computer hardware;
can uniquely categorize and filter results based on biologically relevant criteria;
and much more.
All of these features help AB-BLAST users
be more productive and save time and money.
AB-BLAST is not a re-hashed version of NCBI-BLAST.
AB-BLAST shares virtually no code with NCBI-BLAST
except for some portions that both packages copied
from the public domain, ungapped BLAST version 1.4 released in 1994
(W. Gish, unpublished).
A brief history of AB-BLAST development
is available here.
The AB-BLAST lineage includes the original gapped BLAST,
first released in May 1996.
Before that, its author created (and maintained) the "nonredundant" databases and BLAST Network Service
but was asked by NCBI management not to publish this work.
Some of the key features of AB-BLAST are described below.
AB-BLAST is the premier gapped BLAST with statistics.
AB-BLAST is derived from the original gapped BLAST with statistics
(W. Gish, 1996, unpublished).
Gapped alignment routines are available and used by default
in all AB-BLAST search modes
(BLASTP, BLASTN, TBLASTN, BLASTX and TBLASTX),
with purely ungapped alignments available as an option.
Faster and More Sensitive.
AB-BLAST is up to twice as fast as NCBI BLAST in all search modes, while being more sensitive.
AB-BLAST uses distinctly different, more-sensitive search algorithms
than NCBI-BLAST, that have been painstakingly implemented to be faster, as well.
Due to algorithmic differences (not to mention differences in statistics),
no setting of parameters can guarantee the
same results will be produced by the two packages,
but with comparable parameter settings,
the classical 1-hit BLAST used by default in AB-BLAST
is faster and more sensitive
than the 1-hit BLAST available as an option in NCBI BLAST.
At comparable sensitivity levels,
the AB-BLAST 1-hit implementation is nearly as fast
and uses much less memory than the 2-hit algorithm described
by Altschul et al. (1997).
For users who desire maximum speed,
an AB-exclusive 2-hit algorithm (W. Gish, unpublished)
is available in all search modes (including BLASTN)
that is faster,
more sensitive and more memory-efficient than the NCBI 2-hit algorithm.
(See the
hitdist
option.)
Formatting databases is faster
with AB-BLAST—up to 4x faster—which
allows your BLAST servers to start scanning the databases that much sooner.
Multiple Local Alignments with Joint Evaluation.
Virtually all database search programs will find sequence similarities
(locally optimal alignments or approximations thereof)
that are by themselves statistically insignificant
and thus are not reported,
but AB-BLAST can identify alignments
none of which are statistically significant on their own
but which are statistically significant as a group.
Alignments are clustered into “consistent” sets
under a variety of user-configurable constraints,
including maximum allowable separation distance and maximum allowable overlap.
This combination of sensitivity and selectivity
— original to AB-BLAST and used by default in all AB-BLAST search modes —
increases the biological relevance of the results.
This feature is essential for finding:
all exons in a multi-exon gene sequence, not just high-scoring exons;
all complete or partial copies of a repetitive element
in a genomic sequence, not just high-scoring ones;
and
multiple, discrete domains of similarity between sequences,
not just the highest-scoring domain.
Often More Sensitive/Selective than Smith-Waterman.
The combination of well-chosen heuristics and statistics in AB-BLAST
can be more sensitive and selective than
the full dynamic programming approach of the classical
Smith-Waterman (1981) algorithm,
which reports only the single highest scoring alignment between two sequences,
as well as other approaches or BLAST implementations that may
identify multiple regions of local similarity but then only evaluate the alignments
individually for their statistical significance.
Full Smith-Waterman Option.
With the
postsw
option,
a full Smith-Waterman alignment is performed between pairs of query-subject
sequences that are already scheduled to be reported by BLASTP.
The Smith-Waterman alignments are combined with
the heuristic BLAST results, any redundancy between them is removed, and the statistics are recomputed.
In addition to providing alignments guaranteed to be optimal,
this post-processing can significantly improve the P-values and relative ranking of database hits,
often while increasing the execution time only marginally.
Choice of Statistical Methods.
AB-BLAST uses
“Sum” statistics
(Karlin and Altschul, 1993)
by default in all search modes,
with Poisson statistics available as an option
(poissonp).
Sum statistics and Poisson statistics involve
joint probability calculations on sets of one or more alignments.
To evaluate the significance of individual alignments (or alignment scores),
simple
Karlin-Altschul (1990)
statistics
are also available with the
kap
option.
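As a rough illustration of the simple Karlin-Altschul statistics mentioned above, the following sketch computes the expected number of chance alignments, E = K·m·n·e^(−λS), and the corresponding P-value, 1 − e^(−E), for a single alignment score S. The values used for λ and K are placeholders chosen for illustration only, not values taken from AB-BLAST, and the function names are hypothetical. Sum and Poisson statistics, which evaluate sets of alignments jointly, are not shown.

```python
import math

def expect_value(score, m, n, lam, k):
    """Expected number of chance alignments scoring >= score."""
    return k * m * n * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one chance alignment, given expectation e."""
    return -math.expm1(-e)  # equals 1 - exp(-e), accurate for small e

# Placeholder parameters (assumptions for illustration only).
LAMBDA, K = 0.318, 0.134
e = expect_value(score=80, m=250, n=1_000_000, lam=LAMBDA, k=K)
p = p_value(e)
```

Here m and n stand for the query length and the database length; a higher score always yields a smaller E and therefore a smaller P-value.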
BLASTN Flexibility.
Unique to AB-BLAST are these features of the BLASTN search mode:
Nucleotide scoring matrices.
AB-BLASTN supports fully-specified scoring matrices,
not just simple match/mismatch scoring systems.
This allows transitions to be scored differently from transversions,
and allows positive G-A substitution scores in the design of siRNAs (small interfering RNAs),
where G-U base pairing is allowed.
Scoring matrices can also be tailored to improve the design of PCR primers
or applied to areas of research where a simple match/mismatch scoring system
cannot adequately discriminate.
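To make the idea of a fully specified nucleotide scoring matrix concrete, here is a minimal sketch that scores transitions differently from transversions. The score values (+5, −1, −4) and all function names are assumptions for illustration; they are not a matrix distributed with AB-BLAST, and the code does not read or write AB-BLAST matrix files.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_score(a, b, match=5, transition=-1, transversion=-4):
    """Score a pair of nucleotides; the default values are illustrative only."""
    if a == b:
        return match
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return transition if same_class else transversion

# Fully specified 4x4 matrix, keyed by (query base, subject base).
MATRIX = {(a, b): substitution_score(a, b) for a in "ACGT" for b in "ACGT"}

def ungapped_score(query, subject, matrix=MATRIX):
    """Sum the pairwise scores of two equal-length, ungapped sequences."""
    if len(query) != len(subject):
        raise ValueError("sequences must have the same length")
    return sum(matrix[pair] for pair in zip(query, subject))
```

Because every cell is set independently, an asymmetric or application-specific matrix (for example one that rewards a particular mismatch) only requires changing individual entries of `MATRIX`.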
Contrary to
W. Miller (2001),
scoring matrices were first supported by the NCBI ungapped BLASTN version 1.4
(Gish, W., 1994, unpublished;
see https://blast.advbiocomp.com/pub/blast-1.4).
Support for nucleotide scoring matrices was indeed dropped
by the NCBI when its blastall program was released in 1997,
but this feature was maintained continuously in all WU versions of BLASTN
since the migration to Washington University in St. Louis in 1994,
continuously through the introduction of the original gapped BLAST (WU-BLAST) in 1996,
and on through to today with AB-BLASTN.
Flexible Word Lengths.
AB-BLASTN supports BLAST word lengths as short as 1
(see the
W
parameter).
Nucleotide Neighborhood Words.
Nucleotide neighborhood words are supported by AB-BLASTN
using the standard neighborhood word score threshold parameter,
T.
Using neighborhood words, nucleotide sequence similarity can be detected
even in the absence of any identical residues between two sequences.
Users are cautioned, however, that careless use of the
T
parameter can result in crushing amounts
of memory being requested by BLASTN.
For this reason,
T
should likely be used only in conjunction with very short word lengths.
Consistently Accurate Statistics with BLASTN.
Since the release of the first gapped BLAST with statistics in 1996
(W. Gish, unpublished),
the statistical significance of gapped alignment scores
in all search modes — including BLASTN —
has been evaluated using appropriately pre-computed “gapped” values
for the statistical parameters λ, K and H,
rather than the potentially very different values for these parameters that are computed
at run-time for evaluating ungapped alignment scores.
If precomputed values are not available for the specific
combination of scoring system and gap penalties requested by the user,
a prominent warning has always been issued by AB-BLAST.
In contrast, NCBI-BLASTN only relatively recently began
using precomputed gapped values for λ, K and H.
For many years prior, going all the way back to 1997,
NCBI-BLASTN was without warning
always using parameter values computed for ungapped alignments
to evaluate the significance of gapped alignments.
Virtual Gene Structures.
Linkage information describing “consistent” groups or chains of local alignments (HSPs)
is provided by AB-BLAST
when the
topcomboN
or
links
options are used.
This facility can help with construction of overall gene structures
from what might otherwise be a barrage of individual local alignments
scattered throughout a 2-dimensional search space.
The
hspsepQmax,
hspsepSmax,
olfraction,
olmax,
golfraction
and
golmax
parameters can also help ensure the reported structures are more biologically relevant.
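One way to picture these constraints is as a pairwise test between two HSPs. The sketch below is a simplified assumption of how a maximum separation and a maximum overlap might be applied; it is not AB-BLAST's implementation. The `HSP` tuple is hypothetical, and the single `max_sep` and `max_overlap` arguments are stand-ins for the separate query and subject limits (hspsepQmax, hspsepSmax, olmax) named above.

```python
from typing import NamedTuple

class HSP(NamedTuple):
    q_start: int  # inclusive query coordinates
    q_end: int
    s_start: int  # inclusive subject coordinates
    s_end: int

def gap_between(end_first, start_second):
    """Positive: residues separating two intervals; negative: overlap length."""
    return start_second - end_first - 1

def consistent(a, b, max_sep, max_overlap):
    """True if b may follow a in a chain on both query and subject."""
    # b must begin after a begins on both axes (colinear ordering).
    if b.q_start <= a.q_start or b.s_start <= a.s_start:
        return False
    gaps = (gap_between(a.q_end, b.q_start), gap_between(a.s_end, b.s_start))
    for gap in gaps:
        if gap > max_sep or -gap > max_overlap:
            return False
    return True
```

A chain of exons, for example, would be a sequence of HSPs in which each adjacent pair passes such a test.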
Ease of Database Management.
AB-BLAST supports the eXtended Database Format (XDF),
a power user’s dream
for working with peptide and nucleotide sequences.
Both the NCBI-BLAST 2.0 database format and the NCBI implementation
of the BLAST search algorithm were originally restricted to sequences under 16 Mbp
in length,
whereas human genome contigs exceeded 25 Mbp in the last millennium
(Hattori et al., 2000)
and extended to hundreds of megabases many years ago.
In contrast, XDF databases, which were introduced in 1999,
have the facility to accurately store individual sequences
of up to 1 Gbp (1 billion bp) in length with ambiguity codes intact.
Other BLAST software, such as the NCBI’s,
limits database files to 2 gigabytes each,
whereas from its inception XDF database files
could be of virtually unlimited size —
provided of course that the host operating system and file system
support such “large files” (as most modern operating systems and file systems do).
To support XDF databases,
the database formatting tool named xdformat
is provided with AB-BLAST.
Among other distinct capabilities
and advantages to using XDF and xdformat are:
fast appends of new sequences to existing databases
(both protein and nucleotide).
There is no need to reformat
an XDF database just to add one or more sequences to it.
xdformat runs 2-4 times faster than the NCBI formatdb program,
while offering a superset of features and greater reliability;
the full complement of NCBI standard “FASTA” sequence identifiers is indexed
by xdformat,
including standard identifiers the NCBI formatdb program does not index;
See this
FAQ
for further details;
duplicated sequence identifiers (that should be unique) in public databases
distributed by the NCBI
are reported by xdformat that the NCBI formatdb does not catch;
safe roll-backs of database updates when file I/O
(e.g., disk-full errors) or parse errors are encountered;
huge databases need not be broken into multiple volumes
individually composed of several files but can be managed simply
with as few as 3 files (4 files when sequence identifier index is included),
regardless of the database size up to 1 TB;
flexible indexing of all sequence identifiers
— not just a subset of the NCBI identifiers —
including user-defined identifiers;
index support for duplicate occurrences of the same identifier,
even the identical “gi” identifiers that cause some indexing programs
to abort;
identifier indexing is supported not only when creating an XDF database
but when appending new sequences to an existing XDF database;
and if a database was originally created without an identifier index,
an index can be added later in one, relatively fast step;
identifier indexes can be quickly re-built
if necessary using different indexing policies,
without having to reformat the entire database;
intelligent retrieval of indexed sequences
uses a complementary program named xdget.
Xdget can retrieve sequences by identifier
even if the program is not told what name space
(e.g., gi, accession, locus, user-defined, etc.)
the identifier came from.
For more on identifier indexing, see
this;
xdformat and xdget accept and work intelligently
with identifiers that obey the International DDBJ/EBI/NCBI collaboration’s
Accession.Version identifier syntax (e.g.,
the programs know that BAA84643.2 is a newer version of BAA84643,
but will retrieve BAA84643.1 if specifically requested);
And the parsing programs that come with AB-BLAST
for converting GenBank and EMBL database flat files into “FASTA” format
not only report gi identifiers but Accessions with Versions;
greatly reduced memory requirements and BLAST search initiation times
for databases containing large numbers of entries,
which is particularly important when memory is in short supply
or when multiple processors are standing by
waiting for a single-threaded initialization phase to be completed;
the ability to dump (or recover) the contents of an XDF database back into FASTA format
with the original annotation and ambiguity codes intact;
both the X and N ambiguity codes are supported in nucleotide sequences,
thus permitting the use of distinct substitution
scores for these letters and the use of PHRED/PHRAP sequence output
“as is” for input
to xdformat.
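The Accession.Version behavior described in the list above can be sketched as follows: a request without a version resolves to the newest stored version, while a request with an explicit version resolves only to that version. The index structure and function names are hypothetical; this is not how xdget is implemented, only an illustration of the lookup rule.

```python
def split_accession(identifier):
    """Split 'BAA84643.2' into ('BAA84643', 2); version is None if absent."""
    base, dot, version = identifier.rpartition(".")
    if dot and version.isdigit():
        return base, int(version)
    return identifier, None

def lookup(identifier, index):
    """Return the stored identifier matching the request, or None."""
    base, version = split_accession(identifier)
    versions = index.get(base, [])
    if version is not None:
        return f"{base}.{version}" if version in versions else None
    return f"{base}.{max(versions)}" if versions else None

# Hypothetical index: accession -> versions present in the database.
INDEX = {"BAA84643": [1, 2]}
```

With this index, `lookup("BAA84643", INDEX)` selects version 2, whereas `lookup("BAA84643.1", INDEX)` returns the older record that was specifically requested.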
Compared to the classical BLAST 1.4 database format,
XDF provides
the ability to use FASTA/Pearson format
input files with unjustified (i.e., ragged or blank) input lines.
With nucleotide sequence databases,
there is also no longer the need to retain the original FASTA input file
in order to access the ambiguity codes during a database search.
Support for XDF by the BLAST search programs
does not come at the expense of backward compatibility.
AB-BLAST can search databases in either XDF or the classical BLAST 1.4
database formats.
Furthermore, by simply installing new versions of setdb and pressdb,
the migration to using XDF can be performed swiftly and transparently,
without making any changes whatsoever to existing database maintenance scripts.
While providing this drop-in upgrade path to XDF,
support for legacy databases in the BLAST 1.4 database format
is retained transparently, as well:
the AB-BLAST search programs automatically
identify the database format being used
and adjust their operations accordingly.
This allows users to migrate incrementally to XDF,
at their own pace and as they see fit,
without losing the ability to study or reproduce results
obtained with older databases.
Even so, users are encouraged to make the migration to XDF,
as there are definite benefits to the new format,
including an improved nucleotide sequence data representation
and the ability to index sequences by their identifiers.
When searching very large databases,
virtual memory requirements are dramatically reduced in AB-BLAST,
eliminating program failures that occurred
when system resource limits were unexpectedly reached.
Virtual databases are supported by AB-BLAST.
Virtual databases can be
specified on the command line as a white space-delimited list of
component database names.
Virtual databases can be composed of
components in either XDF or classical BLAST 1.4 format,
as long as the formats are not mixed on the same command line.
For example, this command might be used to search the pri, rod, mam,
vrt, and htg divisions of GenBank:
blastn "pri rod mam vrt htg" myquery.nt
Virtually no file size limits exist for databases and other files,
provided the host operating system supports large files.
Operating systems such as Linux (kernel version 2.2 and earlier)
for 32-bit Intel computing platforms
are often incapable of using files larger than 2 GB,
although virtual database support (see above) helps avoid this limitation
by allowing large databases to be segmented into files of a manageable size.
Linux users in need of large file support should use at least a version 2.4.*
kernel or ideally a 2.6.* kernel.
AB-BLAST supports segmented query sequences,
such as the contigs that result from shotgun sequencing assembly
or perhaps multiple short probes for a given gene.
For example,
all of the contigs from a given clone can be concatenated together with a single hyphen (-) character
to delimit each contig.
Segment boundaries are therefore clearly distinguishable from purely ambiguous regions
of the sequence, while consuming little storage.
AB-BLAST honors segment boundaries by guaranteeing that no alignment,
be it ungapped or gapped, will cross a boundary.
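A segmented query of this kind can be prepared with a few lines of code. The sketch below joins contigs with a hyphen, recovers the segment coordinates, and checks whether a given interval would span a boundary. The function names are hypothetical, and the boundary check only mirrors the stated guarantee; it says nothing about how AB-BLAST enforces it internally.

```python
def join_contigs(contigs):
    """Concatenate contigs into one segmented query, delimited by '-'."""
    if any("-" in c for c in contigs):
        raise ValueError("contigs must not already contain '-'")
    return "-".join(contigs)

def segment_bounds(query):
    """Return half-open (start, end) coordinates of each segment."""
    bounds, start = [], 0
    for i, ch in enumerate(query + "-"):
        if ch == "-":
            bounds.append((start, i))
            start = i + 1
    return bounds

def crosses_boundary(start, end, query):
    """True if the half-open interval [start, end) spans a '-' delimiter."""
    return "-" in query[start:end]
```

Note that the ambiguity code N inside a contig is left untouched, so ambiguous regions and segment boundaries remain distinguishable.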
Support for segmented database sequences is in progress.
Multi-sequence query files are supported,
such that every sequence in the FASTA file is searched against the specified database.
Previous versions of
the software only compared the first sequence in the query file against the database.
Each search result is separated from the next by a single ASCII form feed
character (control-L). See the new qrecmin and qrecmax options.
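Because each result is separated by a form feed, the combined output of a multi-sequence search can be split back into per-query reports with very little code. This sketch assumes the whole output has been read into a string; the report text shown is a made-up placeholder, not real AB-BLAST output.

```python
def split_reports(output):
    """Split combined search output into one report per query sequence."""
    return [part for part in output.split("\f") if part.strip()]

# Placeholder text standing in for two consecutive search reports.
combined = "BLASTN report for query1\n\fBLASTN report for query2\n"
reports = split_reports(combined)
```

Empty fragments are dropped, so a trailing form feed at the end of the output does not produce a spurious empty report.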
The format of all dates reported in BLAST output can be controlled by the
UN*X standard CFTIME environment variable.
For example, dates will be reported in
ISO 8601
standard format,
if CFTIME is set to '%Y-%m-%dT%H:%M:%S'.
Date strings produced by the
xdformat program are also governed by CFTIME.
Note that the format of many date and time strings reported for
XDF databases is determined by the setting (if any)
of CFTIME when the database was created or last modified.
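The CFTIME pattern uses strftime-style conversion specifiers, so its effect can be previewed outside of BLAST. The sketch below formats a fixed timestamp with the ISO 8601 pattern quoted above. The fallback pattern used when no CFTIME value is given is an assumption made for this example, not AB-BLAST's documented default.

```python
from datetime import datetime

ISO_8601 = "%Y-%m-%dT%H:%M:%S"  # the CFTIME value quoted above

def format_date(when, cftime=None):
    """Format a timestamp with a CFTIME-style pattern, else a fallback."""
    pattern = cftime or "%a %b %d %H:%M:%S %Y"  # fallback is an assumption
    return when.strftime(pattern)

stamp = format_date(datetime(2024, 1, 2, 3, 4, 5), ISO_8601)
```

With the ISO 8601 pattern, the example timestamp is rendered as `2024-01-02T03:04:05`.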
Both sequence filtering and word masking of query sequences
are supported.
The terms “filter” and “mask” are sometimes used alone
and interchangeably;
however, there are two distinct techniques people can use,
which deserve separate names.
Lower case alphabetic letters in the query sequence can be used to
inform the BLAST search program as to which residues it should either filter
(convert to X or N)
or mask
(skip when generating neighborhood words but otherwise leave the sequence intact).
See the
lcfilter
and
lcmask
options, respectively.
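The difference between the two treatments of lower-case letters can be sketched as follows. The first function hard-masks lower-case residues by converting them to the ambiguity code, as filtering does; the second only reports which positions would be skipped and leaves the sequence unchanged, as masking does. Both functions are hypothetical illustrations, not the code behind the lcfilter and lcmask options.

```python
def lc_filter(seq, nucleotide=True):
    """Hard-mask: replace lower-case residues with the ambiguity code."""
    code = "N" if nucleotide else "X"
    return "".join(code if ch.islower() else ch for ch in seq)

def lc_mask_positions(seq):
    """Soft-mask: list lower-case positions to skip; sequence is unchanged."""
    return [i for i, ch in enumerate(seq) if ch.islower()]
```

For instance, `lc_filter("ACGTacgtACGT")` replaces the four lower-case bases with N, while `lc_mask_positions` on the same input merely returns their indices.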
Multiple filter=<filter> directives can be specified
on the BLAST command line.
Each of the filters is executed independently
and their results are OR-ed at the end.
NCBI-BLAST 2.0 uses either built-in (closed) complexity filters
or the original external filtering technique
of BLAST 1.3
(Gish, W., unpublished),
which uses the UNIX popen() system call and temporary/intermediate files.
AB-BLAST provides open access to its complexity filters by using
filter programs that are distinct from the search programs,
while simultaneously avoiding the use of problematic
system call interfaces and temporary files.
One or more word masks can be specified on the command line, using
the wordmask=<mask> option,
where <mask> may be a classical
filter program such as seg, xnu, or dust.
Whereas sequence filters
convert certain letters in the query sequence into ambiguity codes (X
for amino acid and N for nucleotide), word masks do not alter the
sequence. Word masks instead cause the indicated portion(s) of the query
sequence to be skipped during BLAST neighborhood word generation.
This leaves the query sequence intact for generating alignments that are
seeded by word hits arising in flanking, unmasked regions of the
sequence.
The BLAST algorithm word length parameter,
W, can be set from 1 to 1024 in all search modes
(BLASTP, BLASTN, BLASTX, TBLASTN and TBLASTX).
This wide-ranging flexibility and the resultant speed are due in part
to the highly optimized DFA (deterministic finite-state automaton) used
by AB-BLAST.
AB-BLAST reliably supports parallel processing on a variety of SMP
(symmetric multiprocessing) computing platforms.
AB-BLAST was the first BLAST to
support Mac OS X,
the first to thread properly (i.e., robustly and efficiently) under Mac OS X,
and the first to support 64-bit computing
under Mac OS X on PowerPC G5 and Intel X64 processors.
POSIX threads are used under
Linux and Mac OS X.
Although POSIX threads are available under Solaris,
Solaris native threads are used instead for slightly better performance.
To illustrate just some of the flexibility available with AB-BLAST,
the software bundle includes a
PERL script named ab-blastall
(formerly wu-blastall) that translates an NCBI blastall command line
into a roughly equivalent AB-BLAST
command line and then invokes the appropriate AB-BLAST search mode.
The output remains in AB-BLAST format,
but the ab-blastall script may help
users of the NCBI blastall program migrate
to AB-BLAST and start to discover its power.
NOTE:
the ab-blastall script is not at all intended to provide
a literal replacement for the NCBI blastall program
and is not an appropriate method
for assessing the relative performance
(sensitivity, specificity, accuracy or speed)
of the NCBI and AB software.
By coercing AB-BLAST
to substitute for CrossMatch
(Phil Green, unpublished),
the unique combination of flexibility
and speed of AB-BLAST can yield
up to a 30-fold increase in performance of
RepeatMasker in its slow mode,
while maintaining virtually identical sensitivity.
Many command line options are available.
The complete list of options and parameters along with their descriptions
is provided in the parameters.html file that comes bundled
with the software.
The latest version of this file is maintained on-line at
https://blast.advbiocomp.com/doc/parameters.html.
For convenience,
the AB-BLAST package includes an optimized version
of the classic sequence redundancy removal program
nrdb.
A newer program named patdb is also included
that can find identical sequences
and perfect substrings of others in the input
nearly as fast as nrdb finds identical sequences alone.
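The two kinds of redundancy mentioned here can be illustrated with a deliberately naive sketch: identical sequences are merged under one entry that keeps every identifier, and sequences that are perfect substrings of a longer sequence are reported. This is a quadratic-time illustration with hypothetical function names; it makes no attempt to reproduce the data structures or the speed of nrdb and patdb.

```python
def remove_duplicates(records):
    """Merge identical sequences, keeping every identifier."""
    merged = {}
    for ident, seq in records:
        merged.setdefault(seq, []).append(ident)
    return merged

def perfect_substrings(sequences):
    """Map each sequence to a longer sequence that contains it, if any."""
    contained = {}
    by_length = sorted(set(sequences), key=len, reverse=True)
    for i, short in enumerate(by_length):
        for longer in by_length[:i]:
            if len(longer) > len(short) and short in longer:
                contained[short] = longer
                break
    return contained
```

The substring case covers, for example, two protein records that differ only in whether the initiator methionine is included.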
A reverse chronological list of changes to the AB-BLAST software
is available in the file named HISTORY that comes bundled
with the software.
When possible,
any bugs that have been found have typically been fixed within 24 hours
of their being reported.
Please send
us
bug reports, questions, or suggestions.
Licensing
Full information about licensing of AB-BLAST is provided
here.
Manifest
The AB-BLAST 3.0 package
includes the following data analysis and utility programs:
blasta — the unified database search program, which provides
blastp, blastn, blastx, tblastn, and tblastx
search functionality.
xdformat — the recommended program for rapidly converting sequences from
FASTA format into the native XDF format read by blasta. The program
can also append new sequences to an existing database;
automatically roll back on errors;
provide flexible indexing and verification services;
and dump data back into FASTA format.
xdget — a flexible tool for retrieving sequences
(or segments thereof)
from an indexed XDF database;
retrieved sequences are
optionally reverse-complemented and translated
in the case of nucleotide sequences.
xdformat and xdget are actually one and the same program
to help ensure their mutual compatibility during upgrades.
nrdb — a tool for rapidly removing trivial redundancy
(i.e., duplicate sequences)
from one or more input files in FASTA format.
The nrdb program is often many times faster and uses much less memory
than “competing” solutions.
patdb — a tool much like nrdb for rapidly removing trivial redundancy
from one or more input files in FASTA format,
but with the important option
of identifying sequences that are perfect substrings of others.
The program attains high speed by using a Patricia tree
that is often combined with finite state automata.
The substring removal option can be usefully applied to protein sequences
which may differ in their inclusion of the initiator methionine;
or in mapping short read sequences onto a genome.
On protein sequence data,
with or without its substring option,
patdb
runs at about the same speed as nrdb and requires about the
same amount of memory.
When operating on nucleotide sequences,
nrdb may be more practical,
because nrdb uses data compression techniques
that are unavailable in patdb;
nrdb just cannot identify perfect substrings.
ab-blastall — a PERL script for converting an NCBI blastall command line
into a roughly equivalent blasta command line
and then invoking blasta.
The output is still in AB-BLAST format.
This is primarily intended as a technology demonstration tool
but may also assist users in their migration
from NCBI BLAST to the more accurate AB-BLAST.
For benchmarking of BLAST,
careful tweaking of parameters may be required, but even with great care,
benchmarking for speed can still be confounded by inaccuracies in NCBI BLAST.
ab-formatdb — a PERL script
for converting an NCBI formatdb command line
into the equivalent xdformat command line and then invoking xdformat.
This is primarily intended as a technology demonstration tool but may also
assist users in their migration from NCBI BLAST to AB-BLAST.
pam — a program to compute amino acid substitution scoring matrices
having arbitrary scales, using the Dayhoff PAM model.
pressdb.real — the legacy pressdb program for any rare users
who may be reliant on the NCBI-BLAST 1.4 database format for nucleotide sequences.
setdb.real — the legacy setdb program for any rare users
who may be reliant on the NCBI-BLAST 1.4 database format
for amino acid sequences.
gb2fasta — a parser to extract nucleotide sequences from GenBank flat files
into FASTA format.
gt2fasta — a parser to extract amino acid sequences from CDS features
in GenBank flat files and output them in FASTA format.
sp2fasta — a parser to extract protein or nucleotide sequences from
EMBL, TrEMBL, or Swiss-Prot database files and output them in FASTA format.
pir2fasta — a parser to extract protein sequences from the old NBRF PIR database
files and output them in FASTA format.
dust — a low-complexity filter for nucleotide sequences
(Hancock and Armstrong, 1994;
Tatusov and Lipman, unpublished).
xnu — a low-complexity filter for protein sequences
(Claverie and States, 1993).
The program identifies short-periodicity repeats.
sysblast.sample — a sample configuration file that system
administrators may wish to modify and install as /etc/sysblast.
Parameter settings in this file can be used to:
limit the number of threads employed by each BLAST search process;
change the default number of threads employed per process;
alter the “nice” value for BLAST processes;
limit the amount of memory utilized by each BLAST process.
AB-BLAST Command Line Options and Parameters
A complete list of command line options and parameters
for modifying the behavior of the AB-BLAST search programs
is available
here.
Comparable AB/NCBI BLAST Parameters
A brief comparison of some of the most important
parameters for controlling sensitivity, selectivity and speed
of AB-BLAST and NCBI BLAST
is available
here.
Environment Variables
AB-BLAST can utilize the settings
of a few environment variables
to adapt its behavior to different computing environments:
BLASTDB, BLASTFILTER and BLASTMAT.
To allow for triple AB/WU/NCBI BLAST installations,
AB-BLAST also supports the environment variables
ABBLASTDB, ABBLASTFILTER and ABBLASTMAT,
as well as
WUBLASTDB, WUBLASTFILTER and WUBLASTMAT.
Settings of the AB versions of these variables take precedence over all others,
and WU variable settings take precedence
over the corresponding base name variables.
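The precedence rule can be expressed as a short lookup: try the AB-prefixed variable first, then the WU-prefixed one, then the base name. The sketch below takes the base name (for example `BLASTDB`) and an optional mapping in place of the real environment. The function name is hypothetical, and treating an empty string as "set" is an assumption, since the text does not say how empty values are handled.

```python
import os

def resolve_setting(base, environ=None):
    """Return the value of AB<base>, else WU<base>, else <base>, else None."""
    environ = os.environ if environ is None else environ
    for prefix in ("AB", "WU", ""):
        value = environ.get(prefix + base)
        if value is not None:  # assumption: an empty string counts as set
            return value
    return None
```

Passing a plain dictionary makes it easy to check what a given combination of settings would resolve to without changing the real environment.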
In AB-BLAST, the BLASTDB (or ABBLASTDB) environment variable
can be a list of one or more directory names in which the programs
are to look for database files.
In UNIX parlance, such an environment variable might be called a path
for the database files.
Directory names should be delimited from one another by a colon
(“:”) and listed in the order that they should be searched.
If the BLASTDB environment variable is not set, the programs use a default
path of .:/usr/ncbi/blast/db, such that the programs first look in the
current working directory (“.”) for the requested database
and then look in the /usr/ncbi/blast/db directory.
For backward compatibility with
programs that expect BLASTDB to be a single directory specification and
not a path, if the user has set a value for BLASTDB but omitted the current
working directory,
AB-BLAST will still look for database files
in the current working directory as a last resort.
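The search order just described can be sketched as follows: split the path on colons, fall back to the default path when the variable is unset, and append the current working directory as a last resort when it was omitted. The function names are hypothetical, and for simplicity the database is treated as a single file name, which is an assumption; real databases consist of several files.

```python
import os

DEFAULT_PATH = ".:/usr/ncbi/blast/db"  # default quoted in the text above

def search_dirs(blastdb=None):
    """Directories to search, in order, for a requested database."""
    dirs = [d for d in (blastdb or DEFAULT_PATH).split(":") if d]
    if "." not in dirs:
        dirs.append(".")  # current working directory as a last resort
    return dirs

def find_database(name, blastdb=None, exists=os.path.exists):
    """Return the first candidate path that exists, or None."""
    for d in search_dirs(blastdb):
        candidate = os.path.join(d, name)
        if exists(candidate):
            return candidate
    return None
```

The `exists` argument lets the lookup be exercised against a pretend file system, which is convenient when checking a path setting before deploying it.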
This usage is unchanged from NCBI/WU BLAST version 1.4 (1994),
except multiple directories could be specified with the BLASTDB
variable beginning with WU-BLAST 2.0 ca. 1997.
The BLASTFILTER (or ABBLASTFILTER) environment variable
can be set to the directory containing the sequence filter programs,
such as
seg and
xnu.
The BLASTMAT (or ABBLASTMAT)
environment variable can be set to the parent
directory for all scoring matrix files.
The default directory location for scoring matrix files is beneath the
matrix/ subdirectory of the AB-BLAST software installation.
And beneath this directory exist 4 subdirectories to accommodate
the 4 combinations of query-subject alphabets the search programs can use.
Before looking anywhere else, though, the search programs first look for the requested
scoring matrix in the current working directory.
For more information about environment variables, see the
Installation instructions.
Filters and Masks
AB-BLAST provides a highly flexible means
of applying both “hard” and “soft” masks to a query sequence,
and supports alternative, user-defined filter programs
as well as non-standard parameters for the standard filters.
The filter (for hard masking) and
wordmask (for soft masking)
command line options provide the basic interface.
Multiple specifications of each type are acceptable
on the BLAST command line and are executed in left-to-right order.
Individual filter and wordmask specifications may
consist of pipelines of commands.
For example, three filters are used in succession by this pipeline:
filter="myfilter1 | myfilter2 | myfilter3 -x5 -"
The first two filters in this case expect to read their input from UN*X
standard input (also known as stdin),
whereas myfilter3 apparently needs to be told
explicitly to read data from stdin,
using the conventional “-” symbol for stdin.
The standard output (stdout)
from myfilter1 will be read via stdin
by myfilter2,
which in turn processes the query
before handing its results to myfilter3;
finally, myfilter3
reports its results to stdout,
which the BLAST program itself reads to obtain the fully masked sequence.
The final output from the filter pipeline is expected by the BLAST
program to be in FASTA format.
Instead of running all 3 filters in the above example as part of one
pipeline, they could instead be specified as three separate filter options
like this:
filter=myfilter1 filter=myfilter2 filter="myfilter3 -x5 -"
The same choice of running as a pipeline or running separately is available
for wordmasks, too.
Naturally, the two approaches can also be combined on the same command line.
An advantage to using the pipeline approach is that all 3 filters
in the example above may complete a little bit faster,
because some I/O overhead is avoided.
Furthermore,
when used in a pipeline,
there is no requirement that the output from myfilter1
and myfilter2 actually be in FASTA format.
Those two programs could potentially pass information between
themselves and to myfilter3 in a proprietary format.
The only absolute format requirements are that the first filter in the pipeline
must read FASTA data
from stdin, and the last filter in the pipeline
must output FASTA format.
The final output from a filter or pipeline must also have the same length
as the original input sequence.
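As a minimal sketch of a filter that meets these requirements, the hypothetical Python script below reads a single FASTA record, hard-masks long homopolymer runs, and emits FASTA of identical length. The function names, the run-length threshold of 8 and the masking letter N are illustrative assumptions and are not part of AB-BLAST; for protein queries X would be the more usual masking letter.

```python
#!/usr/bin/env python3
"""Hypothetical hard-masking filter: FASTA in on stdin, FASTA out on stdout."""
import re
import sys


def mask_runs(seq, min_run=8, mask_char="N"):
    """Replace each run of >= min_run identical letters with mask_char.

    The result always has the same length as the input.
    """
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


def filter_fasta(text, min_run=8, mask_char="N", width=60):
    """Mask the sequence of a single-record FASTA string."""
    lines = text.splitlines()
    has_header = bool(lines) and lines[0].startswith(">")
    header = lines[0] if has_header else ">query"
    body = lines[1:] if has_header else lines
    seq = "".join(line.strip() for line in body)
    masked = mask_runs(seq, min_run, mask_char)
    wrapped = [masked[i:i + width] for i in range(0, len(masked), width)]
    return "\n".join([header] + wrapped) + "\n"


# Like myfilter3 in the example above, this sketch reads stdin only
# when told to do so explicitly with the conventional "-" argument.
if __name__ == "__main__" and sys.argv[1:] == ["-"]:
    sys.stdout.write(filter_fasta(sys.stdin.read()))
```

If the script were saved under a hypothetical name such as myfilter.py and made executable, it could be tried with filter="myfilter.py -".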
It should be noted that with some filter programs,
passing the query sequence sequentially through
a pipeline of filters may yield
a different result than processing the query independently with each filter
and OR-ing the results.
The script seg+xnu included in the filter/ directory provides
an example with which to test this.
Specifying filter=seg+xnu on the BLAST command line
invokes a seg and xnu pipeline that is built-in to the search programs;
whereas specifying filter="seg+xnu -"
causes the seg+xnu script to be invoked on the query, which independently
executes seg and xnu,
then logically “ORs” the results with the bundled pmerge
utility program.
The built-in seg+xnu pipeline is historically the way these two filters have
been invoked together,
but the somewhat slower method employed
by the seg+xnu script with pmerge may be more desirable.
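To make the distinction concrete, here is a rough sketch of what OR-ing two independently masked copies of a query means. It is not the pmerge implementation; the function name and the use of X as the masking letter are assumptions made for illustration.

```python
def or_merge(original, masked_a, masked_b, mask_char="X"):
    """Mask a position if either filter masked it (logical OR)."""
    if not (len(original) == len(masked_a) == len(masked_b)):
        raise ValueError("all three sequences must have the same length")
    merged = []
    for orig, a, b in zip(original, masked_a, masked_b):
        # Keep the original residue unless at least one filter masked it.
        merged.append(mask_char if mask_char in (a, b) else orig)
    return "".join(merged)


query = "MKTAYIAKQR"
print(or_merge(query, "MKXXYIAKQR", "MKTAYIXXQR"))
```

In a sequential pipeline, by contrast, the second filter sees the first filter's masked output rather than the original query, which is why the two approaches can disagree.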
The echofilter option can be used to display the filtered sequence
near the beginning of search program output.
Precomputed Statistical Parameters
Nucleotide Scoring Systems
Precomputed values for λ, K and H
are available for BLASTN searches
with the following match,mismatch
(M,N)
scoring systems,
using the sets of gap penalties
{Q,R}:
Precomputed Nucleotide Scoring Systems
M     N      {Q,R}
+1    −3     {3,3} {3,2} {3,1} {7,2}
+1    −2     {2,2} {2,1} {1,1}
+3    −5     {10,5} {6,3} {5,5}
+4    −5     {10,5}
+1    −1     {3,1} {2,1}
+5    −4     {20,10} {10,10}
+5    −11    {22,22} {22,11} {12,2} {11,11}
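One way to use the table above in an automated pipeline is to check a BLASTN parameter combination before launching a search. The sketch below simply transcribes the table into a Python dictionary; the names PRECOMPUTED_BLASTN and has_precomputed are hypothetical, and the table built into the search programs remains the authority.

```python
# (M, N) -> set of (Q, R) pairs with precomputed lambda, K and H,
# transcribed from the table above.
PRECOMPUTED_BLASTN = {
    (1, -3): {(3, 3), (3, 2), (3, 1), (7, 2)},
    (1, -2): {(2, 2), (2, 1), (1, 1)},
    (3, -5): {(10, 5), (6, 3), (5, 5)},
    (4, -5): {(10, 5)},
    (1, -1): {(3, 1), (2, 1)},
    (5, -4): {(20, 10), (10, 10)},
    (5, -11): {(22, 22), (22, 11), (12, 2), (11, 11)},
}


def has_precomputed(m, n, q, r):
    """Return True if the table lists gap penalties {q,r} for scores m,n."""
    return (q, r) in PRECOMPUTED_BLASTN.get((m, n), set())


print(has_precomputed(1, -3, 7, 2))    # AB-BLAST 3.0 default system
print(has_precomputed(1, -3, 10, 10))  # not listed in the table
```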
Precomputed values are also available for a Purine-Pyrimidine scoring matrix
named “pupy”:
PuPy Matrix
Q     R
20    10
10    10
Protein Scoring Systems
Precomputed values for λ, K and H
are available for protein-level searches
(BLASTP, BLASTX, TBLASTN and TBLASTX)
with the following scoring matrix and
gap penalty combinations (or gap penalty ranges for R) {Q, R}:
BLOSUM50
Q     R
16    1–4
15    1–4, 6, 8
14    1–5, 8
13    1–5, 8
12    2–5, 7
11    2–4, 6, 8
10    2–6, 8
9     3–5, 7
8     4–8
7     6, 7

BLOSUM55
Q     R
16    1–4
15    1–4, 6, 8
14    1–5, 7
13    2–5, 8
12    2–5, 8
11    2–6, 8
10    3–6, 9
9     3–5, 7
8     4–8
7     7

BLOSUM62
Q     R
12    1–3
11    1–3
10    1–4
9     1–5
8     2–7
7     2–6
6     3–5
5     5

BLOSUM80
Q     R
12    2–12
11    2–11
10    2–10
9     3–9
8     4–8
7     5–7

PAM40
Q     R
12    1, 2, 6
11    1, 2, 7
10    1–3, 7
9     1–3, 6
8     1–4
7     1–4
6     2–5
5     2–5
4     3, 4

PAM120
Q     R
12    1, 2, 4
11    1–3
10    1–3, 5
9     1–3, 5
8     1–4, 6
7     2–4, 6
6     2–5
5     3–5

PAM250
Q     R
16    1–4
15    1–5
14    1–6
13    1–6
12    2–7
11    2–7
10    3–8
9     3–7
8     5–7
7     7
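The R column in these tables mixes ranges and single values (for example "1–4, 6, 8"). As a small sketch of how such entries might be checked programmatically, the hypothetical helper below expands a specification into a set of integers and applies it to the BLOSUM62 rows transcribed from above. The function and variable names are assumptions, not part of AB-BLAST.

```python
def expand_r_values(spec):
    """Expand a spec such as '1–4, 6, 8' into the set {1, 2, 3, 4, 6, 8}."""
    values = set()
    # Accept the en dash used in the tables as well as a plain hyphen.
    for part in spec.replace("–", "-").split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = (int(x) for x in part.split("-"))
            values.update(range(low, high + 1))
        else:
            values.add(int(part))
    return values


# BLOSUM62 rows transcribed from the table above: Q -> R specification.
BLOSUM62 = {12: "1–3", 11: "1–3", 10: "1–4", 9: "1–5",
            8: "2–7", 7: "2–6", 6: "3–5", 5: "5"}


def blosum62_has_precomputed(q, r):
    """Return True if the BLOSUM62 table lists the gap penalties {q,r}."""
    return r in expand_r_values(BLOSUM62.get(q, ""))


print(blosum62_has_precomputed(9, 2))
print(blosum62_has_precomputed(5, 4))
```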
Bugs
AB-BLAST is certainly not bug free, but historically
bugs have typically been fixed within a day of being reported.
The currently known bugs are:
The scale of the included BLOSUM80 scoring matrix is 1/3 bit, rather than
the 1/2 bit scale used otherwise for BLOSUM60 and above (BLOSUM60, 62, 70, 90, and 100).
This anomaly—which goes all the way back to NCBI BLAST 1.3 in 1993—may
be corrected (along with revised gapped lambda, K and H parameters) in a future release.
If you think you might be experiencing the effects of a bug,
please contact
us.
AB-BLAST exhibits a few different behaviors worth mentioning here,
because they could trip up or confuse even the most knowledgeable of BLAST users.
Any unexpected behavior might rightfully be construed as being a bug,
so the following information is provided here in the Bugs section to help avoid the unexpected.
If you should encounter problems or confusing areas
other than those described below,
or if you have questions or suggestions for improvement,
please send them to
us.
With the December 2018 release of AB-BLAST,
a valid license must be installed
to run some of the most important programs in the suite.
See the Installation section for details.
AB-BLAST 3.0 establishes a new default scoring system for BLASTN.
The new scoring system uses
match and mismatch scores
M=1, N=−3
and gap penalties
Q=7,
R=2.
The +1/−3 scoring system is more efficient at finding
nearly identical sequences
— the most frequent use for BLASTN —
compared to the old +5/−4 scoring system.
The +1/−3 scoring system is also more consistent
with the BLASTN default word length of 11,
which also selects for nearly identical sequences.
The WU-BLASTN default scoring system
(M=5 N=-4 Q=10 R=10)
will be restored if the
compat2.0
option is specified.
Much confusion had been caused over the years
by the default WU-BLASTN scoring system,
which dates back to the earliest incarnations of NCBI BLAST in 1989.
The NCBI changed its default scoring system to +1/-3 upon
introduction of blastall in 1997.
For the sake of long-term compatibility and consistency,
the scoring system had been left unchanged in WU-BLAST.
NOTE: The amino acid scoring system used by default in blastp, blastx, tblastn and tblastx
remains unchanged
in AB-BLAST but differs slightly from the NCBI amino acid scoring system.
A major difference in default behavior between AB-BLAST and NCBI-BLAST
—
whether to filter query sequences for low-complexity regions
—
remains unchanged in AB-BLAST.
Namely,
just as WU-BLAST did not filter query sequences by default,
AB-BLAST does not filter query sequences by default either.
With its newer “BLAST+” package,
the NCBI seems intent on confusing the marketplace and squelching
the competing AB-BLAST effort,
by abruptly using conflicting program names (blastp, blastn, blastx, tblastn and tblastx)
that it had abandoned in 1997.
These program names had been in use
by AB-BLAST and WU-BLAST before it for 17+ years.
The NCBI BLAST+ programs use an entirely different command line syntax
than vintage 1994 NCBI/WU-BLAST (as well as vintage 1997 NCBI-BLAST).
In contrast, through considerable effort,
WU-BLAST and AB-BLAST have maintained a high degree of backward
compatibility in command line usage and have used the original BLAST program names
continuously since 1994.
Advanced Biocomputing suggests the behavior of the U.S. Government agency is
anti-competitive and amounts to abuse of its monopoly position.
The conflict can be mitigated on an individual basis
by renaming the search programs in either package.
We encourage you to seek a better solution by writing to your
U.S. Congressional
Representative
and
Senators.
The Congressional mandate that created the NCBI in 1988 did not endow the agency
with the right to impede outside R&D or discourage business, and Congress did not intend it to do so.
Thank you for your support.
Due to support added for the amino acid codes J and O,
XDF protein sequence databases produced with the AB- version of xdformat
are not readable by programs in the old WU-BLAST package.
For an XDF protein sequence database named “foo”
created by WU-xdformat,
the AB-xdformat command:
xdformat -p -i foo
will report the alphabet name as
“NCBIstdaa(1)” (NCBIstdaa version 1).
The larger amino acid alphabet normally used
by the AB- version of xdformat
is named “NCBIstdaa(2)” (NCBIstdaa version 2).
WU-BLAST only uses the version 1 alphabet,
whereas AB-BLAST creates new databases using version 2,
can read and update existing databases in either alphabet,
and can read combinations of the two alphabets in virtual databases.
N.B. If a protein database created by WU-xdformat
is updated using AB-xdformat,
the alphabet is silently updated to NCBIstdaa version 2, which will render
the database subsequently unreadable by programs in the WU-BLAST package.
This warning does not currently apply to nucleotide sequence databases,
because no change has thus far been necessary in the nucleotide alphabet
used by AB-BLAST.
The amino acid codes
U (selenocysteine or Sec)
and
O (pyrrolysine or Pyl)
are acceptable in query and database sequences,
but the scoring matrices distributed with AB-BLAST do not specify scores
for these letters.
By default these letters are scored the same as alignment with an X (unknown residue)
would be scored,
except for their self-alignment scores
(i.e., U with U and O with O)
which are set to 0 by default.
If more meaningful scores are known, alternative scores for these letters
can be set explicitly in the amino acid scoring matrices.
The only accepted way to specify an alternative scoring matrix file
is to refer to the file by name
(e.g., matrix=BLOSUM55)
and for the file to reside in the current working directory
or for the path to the file to be
listed in the BLASTMAT environment variable.
If both a path and file name to a scoring matrix file are specified,
such as in matrix=/usr/local/blast/matrix/aa/BLOSUM62
or matrix=aa/blosum62,
the search programs will claim not to be able to find the file even though
it may indeed exist and be readable.
This is a security measure that may allow managers
of network- or web-based search services to expose
the command line to users without opening up access to potentially any file
on the server,
when the mere knowledge that a file exists might be considered a breach of security.
The gap penalty parameters
Q
and
R
of AB-BLAST
are similar to the parameters G and E of NCBI Gapped BLAST
but differ importantly in interpretation.
While the two extension penalties
R
(AB-BLAST) and
E
(NCBI-BLAST) are analogous,
Q
(AB-BLAST) is analogous to the sum of G and E
with NCBI-BLAST.
In other words, where
Q
represents the total penalty for a gap of length 1,
NCBI Gapped BLAST computes this penalty as G + E.
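The relationship can be written out as a short conversion sketch. The function names are hypothetical, and the cost formulas assume the usual affine model in which every residue after the first in a gap costs R (AB-BLAST) or every residue costs E on top of a one-time G (NCBI-BLAST).

```python
def ncbi_to_ab(g, e):
    """Convert NCBI gap penalties (G, E) to AB-BLAST (Q, R)."""
    return g + e, e


def ab_to_ncbi(q, r):
    """Convert AB-BLAST gap penalties (Q, R) to NCBI (G, E)."""
    return q - r, r


def ab_gap_cost(q, r, length):
    """Q for the first gapped residue, R for each additional one."""
    return q + r * (length - 1)


def ncbi_gap_cost(g, e, length):
    """G once per gap, plus E for every gapped residue."""
    return g + e * length


q, r = ncbi_to_ab(11, 1)
print(q, r)
print(ab_gap_cost(q, r, 3) == ncbi_gap_cost(11, 1, 3))
```

For instance, NCBI penalties G=11, E=1 correspond to Q=12, R=1, and both forms charge the same total for a gap of any length.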
The default sort order for reporting database hits
is by increasing E-value (most-to-least significant ordering),
but for a given database hit,
the alignments or HSPs with that sequence are sorted
primarily by query strand, secondarily by the database (“subject”) strand,
and only then by E-value.
For example,
if any alignments of a given database sequence are to the minus strand of the query,
they will be reported after any alignments to the plus strand,
even if alignments to the plus strand are less significant.
In a TBLASTX search,
in which both the query and subject are translated nucleotide sequences,
for each strand of the query,
hits to the plus strand of the subject will be reported
before any hits to its minus strand.
Consequently, identifying the HSP ascribed with the greatest statistical significance
may require many lower-significance alignments to be parsed first.
Naturally, this consideration is not an issue for BLASTP searches,
where only one “strand” of query and subjects is searched.
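The consequence for parsers can be sketched as follows. The HSP record and its field names are hypothetical stand-ins for whatever a parser produces from BLAST output; the sort key only mirrors the ordering described above.

```python
from collections import namedtuple

# Hypothetical parsed record for one alignment against a database sequence.
HSP = namedtuple("HSP", "query_strand subject_strand evalue")


def report_order(hsps):
    """Order HSPs of one database hit: query strand (plus first),
    then subject strand (plus first), then increasing E-value."""
    strand_rank = {"+": 0, "-": 1}
    return sorted(hsps, key=lambda h: (strand_rank[h.query_strand],
                                       strand_rank[h.subject_strand],
                                       h.evalue))


def most_significant(hsps):
    """Scan every HSP; the best one is not necessarily reported first."""
    return min(hsps, key=lambda h: h.evalue)


hsps = [HSP("-", "+", 1e-50), HSP("+", "-", 1e-5), HSP("+", "+", 1e-3)]
for hsp in report_order(hsps):
    print(hsp)
print("best:", most_significant(hsps))
```

Here the most significant alignment is on the minus strand of the query, so it appears last in the reported order.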
On those rare computing platforms today that do not support “large” files
(files >2 GB in size),
users will be unable to search sequence databases
larger than about 8 billion nucleotides or 2 billion amino acids.
Migrating to a contemporary 32-bit operating system
—
or to a 64-bit computing platform that provides “large file” support
—
is sufficient to break through the “2 GB barrier”.
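The two limits quoted above follow from simple arithmetic on a 2 GB file. The packing densities in this sketch (four nucleotides per byte, one amino acid per byte) are assumptions inferred from those figures rather than a statement about the XDF file layout.

```python
LIMIT_BYTES = 2**31  # the "2 GB barrier" for a single file

# Assumed packing densities, inferred from the figures quoted above.
NUCLEOTIDES_PER_BYTE = 4   # 2 bits per nucleotide
AMINO_ACIDS_PER_BYTE = 1   # one byte per residue

max_nucleotides = LIMIT_BYTES * NUCLEOTIDES_PER_BYTE
max_amino_acids = LIMIT_BYTES * AMINO_ACIDS_PER_BYTE

print(f"{max_nucleotides:,} nucleotides")  # 8,589,934,592 (about 8 billion)
print(f"{max_amino_acids:,} amino acids")  # 2,147,483,648 (about 2 billion)
```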
The statistical significance of gapped alignment scores is computed using
values for λ, K and H
obtained from built-in,
precomputed tables.
(The values for λ, K and H used to assess the
significance of ungapped alignment scores are still computed at run time,
as is practical.)
These parameter values are
determined by the scoring matrix and gap penalties being used.
Precomputed values are necessarily not available for
all scoring matrix and gap penalty combinations, though;
and the precomputed values may not be well-suited
to an unusual residue composition of the query or database sequences.
In cases when precomputed values are unavailable,
the programs issue a relevant WARNING message and proceed to evaluate gapped alignment scores
using values for λ, K and H
that are likely to be incorrect:
the values computed at run-time for ungapped alignments.
In such cases, the reported significance estimates may be highly inaccurate
and will be biased towards being overly significant.
If the user knows more accurate parameter values for their situation,
however,
the
gapK,
gapL
and
gapH
command line options
can be used to set them.
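The direction of the bias can be illustrated with the Karlin-Altschul expectation, E = K·m·n·exp(−λS). The sketch below is only illustrative: the parameter values are of the magnitude commonly published for BLOSUM62 and are not taken from AB-BLAST's built-in tables, and the plain product m·n ignores any edge-effect correction the programs may apply.

```python
import math


def expected_hits(score, m, n, lam, k):
    """Karlin-Altschul expectation E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


# Illustrative parameter sets (assumptions, not AB-BLAST table values).
UNGAPPED = {"lam": 0.318, "k": 0.134}
GAPPED = {"lam": 0.267, "k": 0.041}

score = 80          # raw gapped alignment score
m = 300             # query length
n = 1_000_000       # database length

e_with_ungapped = expected_hits(score, m, n, **UNGAPPED)
e_with_gapped = expected_hits(score, m, n, **GAPPED)

print(f"E using ungapped parameters: {e_with_ungapped:.2e}")
print(f"E using gapped parameters:   {e_with_gapped:.2e}")
```

Because the ungapped λ is larger, the same gapped score receives a smaller E-value than it should, which is the overstatement of significance described above.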
Selecting an alternative scoring matrix does not alter
the gap penalties
(Q
and
R)
from their default values.
Leaving gap penalties at their default values when choosing an alternative
scoring matrix
can not only result in alignments with undesirable gap characteristics but
can create a situation in which
the programs do not have precomputed values in their built-in tables
for λ, K and H.
Worst-case, the end result can be that the alignments represent horribly inaccurate
mappings between the query and subject sequences and the P-values ascribed
to the alignments are horribly inaccurate as well.
(Actually, a worst-case scenario might be when the alignments and statistics
are bad but not bad enough to be noticed by the user, who then proceeds to use
the results—both false positives and false negatives—as though they were meaningful.)
As described earlier, a WARNING message will be displayed when precomputed
values are not available,
but nevertheless the search will go on
and the alignments and statistics may be anywhere from
slightly to horribly misleading.
The
hspsepqmax
and
hspsepsmax
parameters are measures of distance
in residues along the sequences in the specific form in which they are
actually compared.
For instance, in a BLASTX search (conceptually translated nucleotide
query compared against a protein sequence database),
hspsepqmax
refers to a distance measured in amino acid residues, not the underlying
nucleotides in the query.
ASN.1
formatted output is not available from AB-BLAST.
XML and tab-delimited output formats are recommended instead.
(See the
mformat
parameter.)
Supported Platforms
The computing platforms currently supported for
AB-BLAST are listed below.
Linux kernel versions 2.6+ for 64-bit X64
Apple macOS 10.9+ (“Mavericks” and later) for 64-bit X64
The list of supported platforms is subject to change without notice.
Through multithreading, multiple processors and processor cores
are supported by AB-BLAST on all of the above platforms.
Installation
Prior to software installation,
make sure you have obtained a license file named license.xml.
This file will normally have been sent
as an email attachment from Advanced Biocomputing, LLC.
An active license is required to run all of the
AB-BLAST search programs plus several of the support programs
(nrdb, patdb and all of the *2fasta programs).
An active license will never be required to run
the xdformat and xdget programs,
to ensure you can recover your data from
AB-BLAST databases even if your license has expired;
and to allow creation of searchable databases
for users who do have active AB-BLAST licenses.
At the confidential URL that was emailed to you,
download the compressed tar.gz archive
for your computing platform.
The compressed tar archive will unpack into a subdirectory named
ab-blast-YYYYMMDD-os-arch,
where the date or version of the AB-BLAST package
is indicated by YYYYMMDD.
The operating system is indicated by os
(e.g., linux or macos)
and arch describes the hardware architecture (e.g., x64).
Mac users should download the
ab-blast-YYYYMMDD-macos-x64.dmg
file instead,
then double-click the file to open and reveal an installer.
After successfully running the installer, you should find the AB-BLAST
programs installed beneath the /usr/local/ab-blast directory.
You will likely want to add that directory to the PATH environment variable
for your login shell.
Individual users of AB-BLAST must place their license.xml file
in the directory ~/.config/ab-blast,
where ~ (tilde) signifies the home directory.
If the directory does not exist, it must first be created.
For site licensees only, the license.xml file can be conveniently placed in the same directory
as the AB-BLAST software, to enable access for all users of the computer system.
Note that the programs
blastp, blastn, blastx, tblastn and tblastx
are actually “hard links” to the same executable program
(blasta)
that encodes the integrated capabilities of all 5 search methods.
If desired, these links can be renamed, as long as the original names appear as substrings
within the new names.
For instance, a link named ab-blastp will still invoke the search program
in its blastp operational mode.
Similarly, the xdformat and xdget programs are hard links to the very same program
that operates differently depending on the name by which it is invoked.
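The name-based dispatch described above can be pictured with a short sketch. This is not the blasta source code; the function name is hypothetical, and checking the longer names first is an assumption about how a name such as tblastn (which also contains blastn) would have to be resolved.

```python
import os

# Longer names first, so that "tblastn" is not mistaken for "blastn"
# and "tblastx" is not mistaken for "blastx".
MODES = ("tblastn", "tblastx", "blastp", "blastn", "blastx")


def mode_from_name(argv0):
    """Pick the search mode from the name a program was invoked under."""
    name = os.path.basename(argv0)
    for mode in MODES:
        if mode in name:
            return mode
    return None


print(mode_from_name("/usr/local/ab-blast/ab-blastp"))
print(mode_from_name("my-tblastn-v3"))
```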
If you previously had AB-BLAST (or WU-BLAST) installed with BLAST-able databases,
your installation or update of AB-BLAST is likely complete.
If you did not have AB-BLAST or WU-BLAST already installed,
read on...
Low-complexity sequence filters or masking programs —
e.g.,
seg,
xnu
and
dust —
are included in AB-BLAST distributions.
The bundled versions of these programs are precompiled and optimized.
While these filter programs are not required for running the search programs,
they can enormously reduce search times, the amount of garbage output produced,
and the memory used by the programs.
NOTE: unlike NCBI-BLAST,
AB-BLAST does not employ sequence complexity filtering by default.
This behavior might change in the future, though.
In case the search programs are updated to a version that does perform
complexity filtering by default
and you wish to guarantee an automated analysis pipeline
will not perform this filtering,
you can specify filter=none
on the BLAST command line to maintain the behavior.
The databases themselves are not included with the AB-BLAST software.
Once the source databases have been downloaded from any of many Internet sites,
the database files are typically uncompressed and processed into FASTA format,
if they are not in FASTA format already.
Included in the tar archives are several utility programs for converting
plain text database files into FASTA format:
gb2fasta converts the nucleotide sequences in
GenBank flat files
into FASTA format.
gt2fasta converts the CDS translations in
GenBank flat files
into FASTA format.
The NCBI software
Toolbox also contains some relevant parsers.
One of these is
asn2fsa, which converts both nucleotide and peptide sequences in
GenBank ASN.1 format
into FASTA format files.
The asn2ff parser,
which converts GenBank ASN.1 data into other flat file formats,
may also come in handy, especially if you are inclined to parse GenBank
into FASTA using your own routines
or use the gb2fasta and gt2fasta programs mentioned above.
All of the above parsers can read from standard input (signified by a hyphen, “-”),
so their input files can be maintained on disk in compressed format and streamed
uncompressed directly into parsers with zcat, gunzip or other relevant decompression program.
Because command line options themselves start with hyphens,
if a hyphen is needed to specify standard input for the input file name,
some of these programs require that a double-dash (--)
be entered on the command line before the single-dash.
This double-dash signifies the end of options and the
start of the required filename arguments.
Once a source database is in FASTA format,
the xdformat program
should be used to convert it into “blastable” format.
Concise usage instructions for xdformat (and xdget)
can be obtained by invoking each program without any command line arguments.
By default,
xdformat produces 3 output files whose names are derived from the name
of the FASTA input file.
The 3 output files have distinct file name extensions and
together comprise the blastable database.
If sequence identifiers are optionally indexed during database creation,
the blastable database will consist of a total of 4 output files.
Databases formatted by xdformat
retain full ambiguity code information within the blastable database files.
By default, if any unrecognized amino acid or nucleotide codes are encountered
or if the FASTA input file should otherwise appear corrupt,
xdformat will emit an error message and halt.
In such cases, if the blastable database was to be newly created,
xdformat will remove the blastable database files it was creating before halting.
If an existing blastable database was being appended with new sequences when the error arose,
the blastable database will be rolled back to its original state prior to the attempted update,
with none of the new sequences appended.
While formatting a database,
the xdformat program can optionally (-I option)
index the sequence identifiers
for later identifier-based retrieval with the xdget program.
XDF databases that were formatted without an identifier index
can have an index created post hoc by xdformat
with its -X option.
It may be of interest to note
for the purposes of their maintenance
that xdformat and xdget are actually one-and-the-same
program file,
merely invoked under the two different names to obtain the two
different program behaviors.
This helps ensure that the index created with xdformat
will be compatible with xdget.
See the file "FAQ-Indexing.html" for more details on identifier indexing.
For compatibility with legacy BLAST installations,
the xdformat program can function
in a setdb- and pressdb-compatibility mode,
wherein its behavior is similar to that of setdb and pressdb.
In its compatibility mode, a similar command line structure is used
and the output files produced have the same names as those produced by setdb and pressdb.
Compatibility mode is invoked
when xdformat is renamed or has links pointing
to it named setdb and pressdb.
While the files produced in compatibility mode have the same file names as those
produced by the original setdb and pressdb programs
(setdb.real and pressdb.real),
the content of these files is always XDF.
Versions of the BLASTA search program dated on or after 1999-12-14
are able to work with the more-capable XDF databases.
Note that two XDF databases —
one protein and one nucleotide —
can be created with the exact same name and exist in the exact same directory,
because the 3-letter file name extensions of XDF databases are completely distinct
for protein and nucleotide sequence databases.
If xdformat and the legacy setdb and pressdb programs
have all been used to create databases with the same name that reside in the same directory,
the BLAST search programs will preferentially search the databases
created with xdformat which will have the standard XDF database
file name extensions.
Using the -t option to xdformat,
a descriptive name or title can be assigned to a database that will appear in BLAST search output.
The title of an existing database can be changed after its creation,
by appending an empty FASTA database and specifying the -t option with the desired new title.
For example,
xdformat -n -a mydb -t "Fancy New Title" /dev/null
The blastable database files can be placed anywhere,
but for convenience the BLASTDB environment variable
should include their directory location.
If the BLASTDB environment variable is not set,
the programs look for databases by default
in /usr/ncbi/blast/db and in the current working
directory.
If the old pressdb program (instead of xdformat)
is used to create the blastable database,
the associated nucleotide sequence FASTA file must be located
in the same directory as the three output files from pressdb,
if the BLAST search programs are to find the FASTA file.
It may sometimes be useful to maintain the FASTA files
in a separate directory — even on another disk partition — and
provide UNIX soft links in the BLASTDB directory that point to the
real location of the FASTA files.
In addition, on systems where NCBI BLAST will not be in use,
blastable databases can be maintained in multiple directories listed
in the BLASTDB environment variable,
with each directory name delimited from the next by a colon (:),
just as directory names are often delimited
in the PATH environment variable.
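As a rough sketch of how a colon-delimited BLASTDB setting might be resolved, the hypothetical helpers below split the variable into directories and return the first one that holds files for a named database. The fallback list mirrors the defaults mentioned above. Matching on "name.*" rather than on specific file name extensions is an assumption, as is ignoring the ABBLASTDB variable that AB-BLAST consults preferentially.

```python
import glob
import os

# Defaults used when BLASTDB is not set, as described above.
DEFAULT_DIRS = ["/usr/ncbi/blast/db", "."]


def database_dirs(environ=os.environ):
    """Return the directories to search, in order."""
    value = environ.get("BLASTDB")
    if not value:
        return list(DEFAULT_DIRS)
    return [d for d in value.split(":") if d]


def find_database(name, environ=os.environ):
    """Return the first directory containing files named '<name>.<ext>'."""
    for directory in database_dirs(environ):
        pattern = os.path.join(directory, glob.escape(name) + ".*")
        if glob.glob(pattern):
            return directory
    return None


print(database_dirs({"BLASTDB": "/data/blast/nt:/data/blast/aa"}))
```

The directory names in the example are placeholders only.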
On multi-processor computer systems,
the search programs will employ
as many CPUs as are installed;
when more than about 4 CPUs are used,
this default behavior can cause the efficiency of hardware utilization
to be quite low,
compared to running individual single-threaded BLAST jobs
on each CPU.
Memory use also increases linearly with the number of CPUs or threads
employed.
One way to govern the number of processors employed is
to wrap the search programs in a shell script that sets a lower
number of CPUs via the cpus=# command line option.
Another, simpler approach to changing the default number of CPUs
for all users follows below,
for implementation by BLAST system managers possessing
“root” or “SuperUser” privileges.
Distributions of AB-BLAST
include a sample file named sysblast.sample,
that illustrates the system-wide configuration parameters
that can be established to govern the execution of BLAST jobs
and, thereby, provide a more productive, trouble-free level of service.
When the sysblast file is installed under the name /etc/sysblast,
all BLAST jobs executed on a given computer system can be made
subject to the parameters:
cpusmax=<n>: a hard limit on the number of CPUs or threads employed
by each BLAST job;
it is possible to prohibit BLAST searches entirely
on a given computer by configuring a negative value for cpusmax;
cpus=<n>: the default number of CPUs or threads employed per BLAST job;
nice=<n>: a “nice” value for altering the priority of BLAST processes.
As is standard for UNIX operating systems,
positive nice values correspond to lower priority;
only the root user can run at negative nice values (higher priority);
any nice value set in /etc/sysblast is added to the current nice value of a BLAST process;
memmax=<n>: the maximum amount of memory that may be
allocated by any single BLAST job.
The interpretation and recommended usage of memmax are:
memmax is expressed in units of bytes,
with optional modifiers
k (kilobytes), m (megabytes), and g (gigabytes).
It is almost certainly a bad idea to set memmax to a value that is
greater than the actual amount of memory (silicon RAM) installed
in the computer;
If memmax=0, the effective limit is “unlimited”, or the natural
upper limit for a process executing under the given operating system;
Values of memmax < 0 are ignored, in which case
the standard UNIX datasize resource limit set by the
user’s command shell governs BLAST memory usage instead.
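The memmax rules above can be summarized in a small sketch. This is not how the search programs parse /etc/sysblast; the function name is hypothetical, and treating k, m and g as binary multiples (1024-based) and as case-insensitive are assumptions.

```python
import re

# Assumed multipliers; the source does not say whether k means 1000 or 1024.
UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_memmax(text):
    """Return a byte limit, float('inf') for unlimited, or None if ignored."""
    match = re.fullmatch(r"\s*(-?\d+)\s*([kmg]?)\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognized memmax value: {text!r}")
    value = int(match.group(1)) * UNITS[match.group(2)]
    if value < 0:
        return None           # ignored; the shell's datasize limit applies
    if value == 0:
        return float("inf")   # effectively unlimited
    return value


for setting in ("512m", "2g", "0", "-1"):
    print(setting, "->", parse_memmax(setting))
```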
The sysblast file is only effective when installed in the /etc directory.
The /etc directory generally resides locally to any given computer system,
so parameter settings can be tailored to each computer,
even if the BLAST software is maintained on a shared disk partition.
The /etc directory should only be writable by “root”.
Unlike the shell script wrapper approach described above,
the limits set in /etc/sysblast typically can not be circumvented
by normal (non-root) users of a computer system.
See the comments included in the sample sysblast file for further details.
Differences between AB-BLAST and WU-BLAST
Apart from bug fixes, the most outward differences in usage and appearance
of AB-BLAST and WU-BLAST include:
The default scoring system for AB-BLASTN is
match/mismatch scores M=+1 N=−3 with gap penalties Q=7 R=2;
whereas WU-BLASTN uses
M=+5 N=−4 with gap penalties Q=10 R=10 by default.
In all search modes, the default value for the gapped alignment drop-off score
gapX
is ≈50% higher for AB-BLAST,
which will tend to make the AB-BLAST search programs
slightly more sensitive and just slightly slower.
AB-BLAST
supports an expanded amino acid alphabet,
compared to the amino acid alphabet used by WU-BLAST.
Programs in the WU-BLAST package are consequently unable to search or modify
protein sequence databases that were created or modified by AB-xdformat.
Once a protein sequence database created with WU-xdformat
has been modified by AB-xdformat,
it can no longer be searched or modified by any of the WU-BLAST programs.
Databases created by WU-xdformat can be searched
and modified by the AB-BLAST programs.
At least for the time being,
the AB-BLAST search programs can also search virtual databases
that are a combination of databases created
with WU-xdformat and AB-xdformat.
No difference currently exists between the nucleotide alphabets used
by AB-BLAST and WU-BLAST
or the ability of programs in either package
to search/modify nucleotide sequence databases created/modified
by programs in the other package.
The bundled BLOSUM30 and BLOSUM35 scoring matrices
have been re-scaled to provide better precision.
The bundled amino acid scoring matrices
— and the matrices output by the pam program —
now contain a J row and a J column.
These matrices are incompatible with WU-BLAST,
which does not support the letter J and will report a FATAL error
when reading the files.
The AB-BLAST amino acid scoring matrices are slightly different
from the matrices distributed by the NCBI,
which also indicate scores for the letter J,
but at the time of this writing the NCBI matrices are cross-compatible with AB-BLAST.
AB-BLAST supports the amino acid letter code O (“oh”)
normally used to represent Pyrrolysine (Pyl),
whereas WU-BLAST does not.
The letter O may appear in query sequences, database sequences, scoring matrix files
and with command line parameters such as the
altscore
option,
but the scoring matrices bundled with AB-BLAST do not actually utilize this letter.
The default score for aligning any other letter
with O is the same score as for aligning with X,
whereas the O self-alignment score defaults to zero (0).
The AB-BLAST 3.0 search programs support a new
compat2.0
option to obtain roughly equivalent parameter
settings to those used by WU-BLAST 2.0.
The analog to wu-blastall is named ab-blastall.
The analog to wu-formatdb is named ab-formatdb.
AB-BLAST programs preferentially use settings
of the new environment variables
ABBLASTMAT, ABBLASTDB and ABBLASTFILTER.
See the section on
Environment Variables for important details.
When upgrading from WU-BLAST to AB-BLAST,
due to the support for the letter J in AB-BLAST,
it is important to ensure that the AB-BLAST search programs
use the bundled scoring matrices
rather than the old matrices that were distributed with WU-BLAST,
because of the latter matrices’ lack of support for the letter J.
The maximum allowable value for the
dbslice
parameter has been increased.
The first few lines of output, including the program declaration line
and copyright notice, are different.
See
Citing BLAST for examples of the program declaration line
from AB-BLAST.
Programs in the AB-BLAST package that use the UNIX standard
getopt() function to parse the command line
will now uniformly across all computing platforms
produce “POSIXLY_CORRECT” behavior.
(N.B. The BLAST search programs do not use getopt(),
but most other programs in the package,
including xdformat and xdget, do).
This means some command lines that are acceptable to WU-BLAST
on some computing platforms (usually Linux)
may be rejected by their AB-BLAST counterpart and need to be restructured.
This mostly happens when not all options are specified
before (to the left of) the required arguments.
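The difference between the two parsing styles is easy to see with Python's standard getopt module, which happens to offer both. This only illustrates the concept and is not the AB-BLAST code; the command line shown is hypothetical.

```python
import getopt
import os

# Hypothetical command line with an option placed after a required argument.
args = ["mydb.fa", "-t", "New Title"]

# getopt.getopt stops at the first non-option argument (POSIX behavior),
# so "-t" is never recognized as an option here.
posix_opts, posix_rest = getopt.getopt(args, "t:")

# getopt.gnu_getopt permutes arguments (GNU behavior) unless the
# POSIXLY_CORRECT environment variable is set; clear it for this demo.
os.environ.pop("POSIXLY_CORRECT", None)
gnu_opts, gnu_rest = getopt.gnu_getopt(args, "t:")

print("POSIX:", posix_opts, posix_rest)
print("GNU:  ", gnu_opts, gnu_rest)
```

Under the POSIX behavior the -t option is left among the plain arguments, so a program expecting a single file name would reject the command line. Moving the options to the left of the arguments avoids the problem in both styles.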
Better thread management under energy conservation conditions.
Better memory management under macOS.
AB-BLAST no longer looks for matrix files or complexity filter programs
beneath /usr/ncbi/blast.
This behavior was a relic of WU-BLAST's early lineage.
Citing BLAST
Citations or acknowledgments of AB-BLAST usage are greatly appreciated,
as are any personal accounts of how the software is being used
that you might wish to share.
When URLs are acceptable, please cite with:
Gish, W. (1996-2019) https://blast.advbiocomp.com
When URLs are not acceptable, please use:
Gish, W. (unpublished).
In scientific communications,
it is important to report
both the program name and the specific version used.
In the case of AB-BLAST 3.0,
the version is a combination of the version number,
release date,
target platform,
and build date.
The release date is the first (left-most) date displayed on the first line
of output and corresponds to the freeze date of the source code.
The build date is the second date reported
and corresponds to the date and time the executables were built for the indicated target platform.
Both dates are reported in program output in
ISO 8601 format.
For example, consider this introductory line of output
from AB-BLAST 3.0:
Here the program name is BLASTN, the software version is “3.0”,
the release date is December 16, 2018,
and the build date and time of the 64-bit macOS X64 binary
is December 19, 2018, at 10:14PM.
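When recording the version for a publication or a pipeline log, the two ISO 8601 dates can be parsed mechanically. The sketch below assumes plausible ISO 8601 renderings of the release and build dates described above; the exact layout that AB-BLAST prints is not reproduced here, so both strings are assumptions.

```python
from datetime import datetime

# Assumed ISO 8601 forms of the dates described in the text
# (hypothetical strings, not copied from actual program output).
release_text = "2018-12-16"
build_text = "2018-12-19T22:14"

release = datetime.fromisoformat(release_text)
build = datetime.fromisoformat(build_text)

# Time elapsed between the source freeze and the build of the executable.
lag = build - release

print("release:", release.date().isoformat())
print("build:  ", build.isoformat(timespec="minutes"))
print("build lag in whole days:", lag.days)
```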
Historical Notes
The original description of the
(1-hit)
BLAST algorithm
was published by
Altschul et al. (1990).
In addition to the algorithm itself,
BLASTP and BLASTN functionality are described,
without referring to the programs by name.
BLASTX-like functionality is briefly mentioned as being in progress
(again not by name),
but TBLASTN was actually the third BLAST search mode implemented.
Statistical significance of the ungapped alignments found by the programs
was assessed using
“Karlin-Altschul” statistics
— sometimes also referred to as
“Karlin-Dembo-Altschul” statistics,
due to a major contribution of
Amir Dembo.
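For readers unfamiliar with these statistics, the following is a minimal sketch of the standard Karlin-Altschul relationships for a single ungapped alignment score S: the expected number of chance HSPs, E = K·m·n·exp(−λS), and the probability of at least one, P = 1 − exp(−E). The function names and all numeric inputs are illustrative assumptions, not AB-BLAST defaults, and the edge-effect corrections applied by real search programs are omitted.

```python
import math

def karlin_altschul_evalue(score, m, n, lam, K):
    """Expected number of chance HSPs with score >= `score`.

    m and n are the query and database lengths; lam and K are the
    Karlin-Altschul parameters of the scoring system.
    """
    return K * m * n * math.exp(-lam * score)

def pvalue_from_evalue(evalue):
    """Probability of at least one such HSP: P = 1 - exp(-E)."""
    # expm1 keeps precision when E is very small.
    return -math.expm1(-evalue)

# Illustrative inputs only (assumptions, not values taken from AB-BLAST).
lam, K = 0.318, 0.134
m, n = 250, 50_000_000

for s in (60, 80, 100):
    e = karlin_altschul_evalue(s, m, n, lam, K)
    print(f"S={s:3d}  E={e:.3g}  P={pvalue_from_evalue(e):.3g}")
```

Because E falls exponentially with S, a modest increase in score changes the significance by orders of magnitude, while doubling the database size only doubles E.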
In December 1989,
prior to the development of the World Wide Web,
the NCBI
Experimental BLAST Network Service
was opened to the public.
The BLAST network service provided fast, convenient client-server access
from anywhere on the Internet
to the very latest versions of the recently parallelized BLAST search programs
running on powerful 8–16 processor Silicon Graphics servers at the NCBI.
The BLAST servers searched against a comprehensive set of public sequence databases
that were updated daily.
Users could access the BLAST servers transparently using a UN*X command line client
that was invoked just like the BLAST application programs themselves,
or via a graphical client named HyperBLAST (J.M. Cherry, 1990, unpublished)
created with
HyperCard.
At about this time, the
“nr”
(quasi-non-redundant) protein and nucleotide sequence databases
were also established (W. Gish, unpublished).
The nr database — protein and nucleotide —
quickly became the standard database searched with BLAST,
and users could often do so in a matter of just a few seconds.
The experimental BLAST service was ultimately discontinued
a decade later, in March 2000.
Experience gained from providing a service that could
arbitrate many simultaneous and diverse requests for BLAST
helped Gish design a more flexible and robust network service
architecture known as the NCBI “Dispatcher”,
which was then largely implemented by others at the NCBI
(principally Jonathan Epstein) and went into operation ca. 1995.
At the request of NCBI management,
the experimental BLAST service was never published
and can be cited only as W. Gish (unpublished).
Awareness of the service nevertheless spread quickly by word of mouth,
as it later did for WU-BLAST.
The BLASTX program first appeared in the release of
BLAST version 1.1 in July 1990.
The program was later described and evaluated by
Gish and States (1993).
The BLAST3 program
(Altschul and Lipman, 1990) was also folded into the BLAST 1.1 release
and parallelized.
The use of Poisson statistics
to evaluate the joint probability of multiple HSPs from a given (query,subject)
sequence pair,
as had been suggested by
Karlin and Altschul (1990),
was also first featured in BLAST 1.1.
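As a rough illustration of the Poisson approach, the sketch below computes the probability of observing at least r chance HSPs when E such HSPs are expected, using the Poisson tail 1 − exp(−E)·Σ E^i/i! for i from 0 to r−1. Here E is understood as the expected number of HSPs reaching the lowest score among the r considered. The function name is hypothetical, and any additional constraints that BLAST 1.1 applied when grouping HSPs are not modeled.

```python
import math

def poisson_pvalue(r, evalue):
    """Probability of at least r chance HSPs when `evalue` are expected."""
    if r < 1:
        raise ValueError("r must be at least 1")
    # Poisson probability mass for 0 .. r-1 occurrences, without exp(-E).
    below_r = sum(evalue**i / math.factorial(i) for i in range(r))
    return 1.0 - math.exp(-evalue) * below_r

# Illustrative values: with one HSP expected by chance, each additional
# HSP required at that score makes the joint observation less probable.
for r in (1, 2, 3):
    print(f"r={r}  P={poisson_pvalue(r, 1.0):.4f}")
```

With r = 1 the expression reduces to the familiar single-HSP probability 1 − exp(−E).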
The BLASTC program,
a specialized version of BLASTX that considered codon usage information
in addition to sequence similarity
(States and Gish, 1994),
appeared only once,
in the
BLAST 1.3
distribution.
The BLAST 1.3 distribution was also the last to include the BLAST3
program.
BLAST 1.4
(W. Gish, 1994, unpublished)
was the first version to use
Karlin and Altschul (1993)
“Sum” statistics
to evaluate the joint probability of finding multiple HSPs between a given pair of sequences.
Sum statistics were found to be more practical in a biological context
than the Poisson statistics utilized by default in BLAST 1.3.
The TBLASTX program first appeared in BLAST 1.4
and remains attributable to W. Gish (1994, unpublished).
All five of the supported BLAST programs in BLAST 1.4
(BLASTP, BLASTN, TBLASTN,
BLASTX and TBLASTX)
were for the first time coded
using a standard API (application programming interface)
to a generalized BLAST function library.
This function library made maintenance and improvements to the five core programs easier
and aided the development of more specialized BLAST applications,
such as Entrez sequence neighboring tools and specialized EST analysis tools.
The first release of WU-BLAST, numbered 1.4,
was virtually identical to the public domain NCBI BLAST 1.4,
save for a few bug fixes.
The WU-BLAST Archives (original URL http://blast.wustl.edu)
first appeared on the Internet in 1995,
to provide continuity of support for the work Warren Gish began at the NCBI,
as well as to provide a central resource where the community could find
BLAST-related software, information and earlier versions.
In late 1994,
at the invitation of Warren Gish,
who had recently moved to Washington University in St. Louis,
Stephen Altschul joined him in a collaboration to test
several of Gish’s hypotheses:
Sum statistics
(Karlin and Altschul, 1993)
allowed the evaluation of multiple ungapped alignment scores,
using the analytically computed ungapped parameters
λu, Ku and Hu.
Extreme Value statistics
— analogous to the statistics for ungapped alignment scores published by
Karlin and Altschul (1990)
—
had been shown empirically to be good estimators
of the statistical significance of individual gapped alignments
from Smith-Waterman comparisons,
using empirical estimates for the gapped parameters
λg and Kg
(Collins and Coulson, 1990;
Mott, 1992;
Waterman and Vingron, 1994).
It stood to reason that
Sum statistics might be empirically extended to evaluating
multiple gapped alignment scores, using empirically estimated parameters
λg, Kg and Hg;
While good estimates for
λg, Kg and Hg could be
computed through lengthy (computationally expensive) Monte Carlo simulations
for a specific scoring system and particular pair of sequences,
fixed estimates for these parameters precomputed
for sequences of “average” composition
would work well enough as to be of practical use
in comprehensive database searches;
and
For an improved search algorithm,
multiple, locally optimal gapped alignments between two sequences
could be approximated by a two-stage BLAST implementation
that would:
remain fast, yet be far more sensitive than ungapped BLAST;
produce more-easily interpreted alignments;
and yield alignment scores suitable for evaluation
with the expanded role proposed for Sum statistics.
If the effort panned out as hoped,
the new gapped BLAST method
would in some cases be more sensitive and selective
than even the standard Smith-Waterman algorithm,
due to the newer method’s ability to find multiple gapped alignments
between a pair of sequences
and to evaluate their significance jointly with Sum statistics.
While Altschul set to work empirically testing Sum statistics
on gapped alignment scores,
Gish focused on the alignment problem.
Early results from their work appeared in
Altschul and Gish (1996)
and provided much of the foundation for WU-BLAST 2.0
and later NCBI blastall.
The first complete implementation of gapped BLAST
(BLASTP, BLASTN, BLASTX, TBLASTN and TBLASTX)
with statistical significance estimates (both Poisson and Sum statistics)
was publicly released as WU-BLAST version 2.0d1
(W. Gish, unpublished),
in time for presentation
at the Cold Spring Harbor conference on Genome Mapping and Sequencing
in May 1996.
The NCBI published its BLAST version 2 or “Gapped BLAST”,
including a description of a new 2-hit ungapped BLAST algorithm
and the PSI-BLAST program,
in
Altschul et al. (1997),
in September 1997.
All search modes, except BLASTN, used the new 2-hit
algorithm by default.
Within days of that publication,
a faster, more sensitive 2-hit algorithm
was deployed in WU-BLAST 2.0.
In late 2008,
rights to WU-BLAST were acquired from
Washington University in St. Louis
by the author, Warren R. Gish.
The rights to license the software to the community were acquired by
Advanced Biocomputing, LLC in 2009.
Dembo, A, and S Karlin (1991).
Strong limit theorems of empirical functionals for large exceedances
of partial sums of i.i.d. variables.
Ann. Probab. 19:1737–55.
Dembo, A, and S Karlin (1992).
Limit distributions of maximal segmental score among Markov dependent
partial sums.
Adv. Appl. Probab. 24:113–40.
Karlin, S, Dembo, A, and T Kawabata (1990).
Statistical composition of high scoring segments from molecular sequences.
Ann. Stat. 18:571–81.
Mott, RF (1992).
Maximum-likelihood estimation of the statistical distribution of Smith-Waterman
local sequence similarity scores.
Bull. Math. Biol. 54:59–75.