AB-BLAST and NCBI BLAST are distinctly different software packages,
with different default behaviors and different command line options.
For existing users of NCBI BLAST,
to ease the transition to AB-BLAST,
a Perl script named ab‑blastall is bundled with AB-BLAST.
It converts NCBI blastall command line arguments
into their (sometimes rough) equivalent AB-BLAST parameters
and then invokes the appropriate AB-BLAST program.
The ab‑blastall output remains in one of the AB-BLAST native formats,
depending on the format requested on the ab‑blastall command line.
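As a rough illustration of the kind of translation such a wrapper performs, here is a minimal Python sketch. It is not the bundled ab‑blastall script and makes no claim about that script's logic. The function name `translate_blastall` is hypothetical, and only the three blastall options that appear in the examples on this page (`-p`, `-d`, `-i`) are handled, mapped to the positional form `program database query` used in those examples.

```python
import shlex


def translate_blastall(argv):
    """Map a minimal blastall argument list to an AB-BLAST-style one.

    Hypothetical helper: handles only -p (program), -d (database)
    and -i (query file); anything else is rejected.
    """
    opts = {}
    it = iter(argv)
    for flag in it:
        if flag not in ("-p", "-d", "-i"):
            raise ValueError(f"option not handled in this sketch: {flag}")
        try:
            opts[flag] = next(it)
        except StopIteration:
            raise ValueError(f"missing value for {flag}") from None
    missing = [f for f in ("-p", "-d", "-i") if f not in opts]
    if missing:
        raise ValueError(f"required options missing: {missing}")
    # Positional order as in the AB-BLAST examples: program, database, query.
    return [opts["-p"], opts["-d"], opts["-i"]]


cmd = translate_blastall(shlex.split("-p blastp -d nr -i query.aa"))
print(" ".join(cmd))  # prints: blastp nr query.aa
```

A real wrapper would also have to translate scoring, filtering and output-format options, which is where the "sometimes rough" equivalences arise.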
The remainder of this page is primarily devoted to highlighting the differences between NCBI BLAST and AB-BLAST and illustrating some of the ways that these differences can be smoothed out or eliminated.
For fair performance comparisons between any two approaches, one must be cognizant of the factors affecting speed, sensitivity and selectivity. Through parameter settings, as well as one's choice of test data, the performance characteristics of software tools can often be dramatically altered to achieve any desired goal, whether the goal is to improve the perceived performance of an existing tool or showcase the performance of a new one. When different algorithms and statistical approaches are employed, “apples-to-oranges” comparisons may be entirely unavoidable. That said, rough observations of performance may be informative and useful, if sufficient care is taken in the preparations.
At the bottom of this page,
parameter sets (or command line options) are provided that may be useful
for comparing the relative performance (sensitivity, selectivity and speed)
of AB-BLAST 3.0 (blasta) and NCBI Gapped BLAST 2.0 (blastall)
in the various search modes these programs offer.
The command line arguments shown for NCBI BLAST are merely those
that are required for any search,
thus yielding the “default” behavior of this software.
The optional parameter settings indicated for AB-BLAST reduce
its sensitivity to approximately that of the NCBI BLAST defaults.
As the speed of the BLAST algorithm is inversely related to its
sensitivity,
any speed comparisons should be made at comparable sensitivity levels.
Outlined specifically with respect to NCBI and AB-BLAST, the relevant speed, sensitivity and selectivity considerations include:
NCBI blastall filters query sequences for low-complexity regions by default,
using seg or dust,
whereas AB-BLAST must be told explicitly to filter query sequences
and which filter method to use.
For speed and selectivity comparisons between the two programs,
it is important that the presence or absence of query filtering be factored out.
With filtering turned on,
some hits may be missed but specificity may be higher.
All search modes of NCBI blastall, except BLASTN,
utilize a 2-hit BLAST algorithm by default,
whereas AB-BLAST continues to use the classical 1-hit BLAST algorithm
by default in all search modes.
A more sensitive and efficient version of the 2-hit algorithm is available
as an option in all search modes of AB-BLAST.
In all search modes but BLASTN,
both programs utilize a BLAST algorithm word length of 3 amino acid residues,
index on BLOSUM62 neighborhood words by default,
and apparently use the same value for the neighborhood word score threshold, T.
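The difference between 1-hit and 2-hit seeding mentioned in the list above can be illustrated with a toy Python sketch. It is a simplified textbook-style version of the 2-hit criterion (two non-overlapping word hits on the same diagonal within a maximum distance) and is not the implementation used by either program. The function name `two_hit_seeds` is hypothetical, the distance of 40 simply echoes the hitdist=40 setting in the example command lines, and whether either program measures the distance in exactly this way is an assumption.

```python
from collections import defaultdict


def two_hit_seeds(word_hits, word_len=3, hitdist=40):
    """Return (diagonal, first_q, second_q) for qualifying hit pairs.

    word_hits is a list of (query_pos, subject_pos) word matches.
    Simplification: only adjacent hits on a diagonal are compared.
    """
    by_diag = defaultdict(list)
    for q, s in word_hits:
        by_diag[s - q].append(q)  # hits on one diagonal share s - q

    seeds = []
    for diag, q_positions in by_diag.items():
        q_positions.sort()
        for prev, cur in zip(q_positions, q_positions[1:]):
            gap = cur - prev
            # Non-overlapping (gap >= word_len) and close enough (<= hitdist).
            if word_len <= gap <= hitdist:
                seeds.append((diag, prev, cur))
    return seeds


hits = [(5, 105), (20, 120), (50, 10), (200, 300)]
# A 1-hit scheme would try to extend all four word hits.
print(len(hits))            # prints: 4
# The 2-hit scheme extends only where two hits pair up on a diagonal.
print(two_hit_seeds(hits))  # prints: [(100, 5, 20)]
```

Fewer extensions mean less work per database sequence, which is the source of the speed gain, and also the reason some weaker similarities are missed.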
By normalizing for factors such as those described above, a reasonably fair evaluation of relative performance can often be obtained, but it is certainly not guaranteed. Differences may exist between the NCBI's built-in low-complexity filters and the external filters employed by AB-BLAST. With AB-BLAST, the filters are external plug-in programs provided with the software distribution (or users can plug in filters of their own design), so the user can generate filtered sequences independently of performing an actual search. In addition, the AB-BLAST echofilter option allows the user to capture in the output the precise filtered sequence used internally by the search programs. All of this is just to say that with AB-BLAST, one has more complete control and can more easily verify correct behavior of the software, while differences with the NCBI software can be difficult to eliminate with complete confidence.
Other differences in alignment procedures and statistics remain, as well, some of which can impact speed, sensitivity and selectivity. For example, NCBI BLASTP does not use "Sum" statistics to identify multiple regions of similarity; and NCBI BLASTN curiously uses the same lambda, K and H values to evaluate the significance of gapped alignments as it uses for ungapped alignments, regardless of how relaxed the gap penalties are.
Last, but certainly not least, NCBI blastall has been known to report lower values for score thresholds than the values it actually used, which can confound even the most careful of performance comparisons. While the inaccuracies may seem small and therefore benign, their effect on speed can be exponential and make NCBI blastall appear significantly faster than it really is. Even more important than speed, though, reporting of incorrect parameter values conveys wrong information about the sensitivity of a search.
In the examples below, the hitdist option invokes the 2-hit algorithm of AB-BLAST. Alternate AB-BLAST command lines are shown that increase the value of the T parameter for the 1-hit BLAST algorithm, to yield roughly the same level of sensitivity (and speed) as the default parameterization of the NCBI 2-hit algorithm. The more-efficient 2-hit BLAST implementation in AB-BLAST 3.0 may be used to obtain still more speed if desired — running significantly faster than the NCBI 2-hit BLAST — albeit with the reduced sensitivity associated with the 2-hit algorithm.
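To make the role of the T parameter more concrete, the following Python sketch enumerates the neighborhood of a 3-letter word under different thresholds. Everything in it is a toy assumption chosen for brevity: a 4-letter alphabet and a match/mismatch score of +4/-1 stand in for the 20 amino acids and BLOSUM62, and the names `toy_score` and `neighborhood` are hypothetical. It shows only the general principle that raising T shrinks the set of words that can seed an alignment.

```python
from itertools import product

ALPHABET = "ACDE"  # toy alphabet, not the real amino acid alphabet


def toy_score(a, b):
    """Toy substitution score (an assumption, not BLOSUM62)."""
    return 4 if a == b else -1


def neighborhood(word, threshold):
    """All same-length words scoring at least `threshold` against `word`."""
    words = []
    for cand in product(ALPHABET, repeat=len(word)):
        score = sum(toy_score(x, y) for x, y in zip(word, cand))
        if score >= threshold:
            words.append("".join(cand))
    return words


for t in (12, 7, 2):
    print(t, len(neighborhood("ACD", t)))
# prints:
# 12 1
# 7 10
# 2 37
```

With the highest threshold only the exact word qualifies; each step down admits words with one more mismatch. More neighborhood words produce more word hits to extend, which is why a lower T raises sensitivity at the cost of speed.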
Benchmarking should be performed on computer systems over which one has full control. For example, avoid benchmarking via a web server whose configuration and operational state are unknown. As an example of how surprisingly important this can be, users of SGI IRIX 6.x may have noted that versions of this operating system released from 1997, until about 1999-2000, reported extremely inaccurate (i.e., low) execution times for programs like BLAST that use POSIX threads. Typically, the CPU time reported was actually 1/N of its actual value, where N was the number of CPUs or threads employed. Only for about 1 in N searches would the correct CPU time be reported. NCBI computers at the time were typically configured with 8-16 CPUs, so the CPU time reported was typically 8- to 16-fold lower than its actual value. This explains why the NCBI BLAST servers usually reported execution times of just a few seconds for lengthy database searches. It is also curious that the BLAST binaries and source code posted by the NCBI for users to download for local database searching did not report execution times at all, whereas supposedly the same software running on their servers did report CPU times. In any case, this particular bug seems to have been fixed in IRIX 6.5, the release of which correlates well with when the NCBI stopped reporting CPU times on their BLAST servers. ;-)
Database I/O can be a significant contributor, even the major contributor, to the overall search time. To minimize the overhead and impact on search speed of database I/O, timed searches should be performed on cached database files. Working with cached files is generally recommended, not just when benchmarking, to avoid contention for slow physical devices such as disk drives. Contemporary operating systems more-or-less do a good job of automatically caching files in what would otherwise be unused memory; hence, BLAST software moved away from using System V shared memory segments for storing database files and instead began using memory-mapped I/O and file caching, starting with BLAST version 1.4 (W. Gish, unpublished). Pre-caching of files can be accomplished by first performing an untimed search to prime the cache with the desired database files before the actual benchmark run(s) are executed. Of course, the host computer must have sufficient free memory available that the relevant database files can indeed be cached.
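As an alternative to an untimed priming search, the database files can simply be read once from start to finish before the timed runs. This is a different technique from the one described above, shown here as a minimal Python sketch. The function name `prime_cache` is hypothetical, the file names a real database would have are not assumed, and whether the operating system actually keeps the data cached still depends on available memory.

```python
import os
import tempfile


def prime_cache(paths, chunk_size=1 << 20):
    """Read each file sequentially so the OS may cache it; return bytes read."""
    total = 0
    for path in paths:
        with open(path, "rb") as fh:
            while True:
                chunk = fh.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    return total


# Demonstration on a throwaway file standing in for a database file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 3_000_000)
try:
    print(prime_cache([tmp.name]))  # prints: 3000000
finally:
    os.remove(tmp.name)
```

Comparing the returned byte count with the free memory on the host gives a quick sanity check on whether the files can plausibly stay cached.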
Even when copious amounts of physical memory are present, operating systems sometimes seem to limit the amount of file system data that can be cached. Sometimes these limits are configurable, as in Solaris, but other times there may be no apparent way to increase the amount of unused memory that can be utilized for file caching. Personal experience with Linux 2.4 falls in the latter category. Your “mileage” may vary.
When file caching cannot be exploited, the overhead of database I/O may be reduced by using longer (less trivial) query sequences, such that the search programs spend relatively more time actually comparing sequences than they do reading and parsing the database.
While the default behavior of NCBI blastall is to use a single thread,
for at least some versions of this program,
even when a single thread was requested,
some fraction of its time has been observed running multithreaded
and consuming CPU time faster than wall clock time on multiprocessor computers.
This can make it difficult to benchmark fairly,
unless an actual single processor system is used.
With some incarnations of NCBI BLAST,
the wall clock time is significantly greater than the total of user and
system CPU time, even with database files cached.
Consequently, it is important for benchmarking procedures
to collect both real and CPU time measurements.
NCBI blastall has been observed on multiple occasions to report
far more than 50 periods (.) for a complete database search,
whereas 1 period consistently corresponds to 2% of the database
searched by AB-blasta.
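Collecting both real and CPU time, as recommended above, can be done with a small harness such as the following Python sketch. It assumes a Unix-like system, where `os.times()` reports the CPU time of child processes that have been waited for. The function name `time_command` is hypothetical, and the search program's output is discarded here only to keep the example short.

```python
import os
import subprocess
import sys
import time


def time_command(argv):
    """Run a command and return its real, user, system and total CPU time."""
    cpu_before = os.times()
    wall_before = time.perf_counter()
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    wall_after = time.perf_counter()
    cpu_after = os.times()

    user = cpu_after.children_user - cpu_before.children_user
    system = cpu_after.children_system - cpu_before.children_system
    return {
        "real": wall_after - wall_before,
        "user": user,
        "sys": system,
        "cpu": user + system,
    }


# Stand-in workload; a search command line would be passed the same way,
# e.g. ["blastall", "-p", "blastp", "-d", "nr", "-i", "query.aa"],
# assuming that program is installed and on the PATH.
result = time_command([sys.executable, "-c", "sum(range(10**6))"])
print(sorted(result))  # prints: ['cpu', 'real', 'sys', 'user']
```

A "cpu" value noticeably larger than "real" suggests the program ran multithreaded, while "real" far exceeding "cpu" points to I/O waits or other idle time. Either condition is worth noting before two timings are compared.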
The command lines below are presented in pairs for
NCBI blastall and AB-blasta 3.0 with its optional 2-hit algorithm.
NCBI: blastall -p blastp -d nr -i query.aa
AB:   blastp nr query.aa cpus=1 hitdist=40 T=11 kap s2=41 gaps2=62 x=16 gapx=38 \
      q=12 r=1 gapL=.27 gapK=.047 gapH=.23 filter=seg

NCBI: blastall -p blastx -d nr -i query.nt
AB:   blastx nr query.nt cpus=1 hitdist=40 T=12 s2=41 gaps2=68 x=16 gapx=38 \
      q=12 r=1 gapL=.27 gapK=.047 gapH=.23 filter=seg

NCBI: blastall -p tblastn -d nr -i query.nt
AB:   tblastn nr query.nt cpus=1 hitdist=40 T=13 s2=41 gaps2=62 x=16 gapx=38 \
      q=12 r=1 gapL=.27 gapK=.047 gapH=.23 filter=seg

NCBI: blastall -p tblastx -d nr -i query.nt
AB:   tblastx nr query.nt cpus=1 nogaps hitdist=40 T=13 s2=41 x=16 filter=seg

NCBI: blastall -p blastn -d nr -i query.nt
AB:   blastn nr query.nt cpus=1 kap x=6 gapx=25 filter=dust
Last modified: 2009-10-17