Comparable BLAST Parameters

Comparable AB-BLAST and NCBI BLAST Parameters

Introduction

AB-BLAST and NCBI BLAST are distinctly different software packages, with different default behaviors and different command line options. To ease the transition for existing users of NCBI BLAST, a Perl script named ab‑blastall is bundled with AB-BLAST. It converts NCBI blastall command line arguments into their (sometimes rough) AB-BLAST equivalents and then invokes the appropriate AB-BLAST program. The output of ab‑blastall remains one of the AB-BLAST native formats, depending on the format requested on the ab‑blastall command line.

The remainder of this page is primarily devoted to highlighting the differences between NCBI BLAST and AB-BLAST and illustrating some of the ways that these differences can be smoothed out or eliminated.

Performance Comparisons

For fair performance comparisons between any two approaches, one must be cognizant of the factors affecting speed, sensitivity and selectivity. Through parameter settings, as well as one's choice of test data, the performance characteristics of software tools can often be dramatically altered to achieve any desired goal, whether the goal is to improve the perceived performance of an existing tool or showcase the performance of a new one. When different algorithms and statistical approaches are employed, “apples-to-oranges” comparisons may be entirely unavoidable. That said, rough observations of performance may be informative and useful, if sufficient care is taken in the preparations.

At the bottom of this page, parameter sets (or command line options) are provided that may be useful for comparing the relative performance (sensitivity, selectivity and speed) of AB-BLAST 3.0 (blasta) and NCBI Gapped BLAST 2.0 (blastall) in the various search modes these programs offer. The command line arguments shown for NCBI BLAST are merely those that are required for any search, thus yielding the “default” behavior of this software. The optional parameter settings indicated for AB-BLAST reduce its sensitivity to approximately that of the NCBI BLAST defaults. As the speed of the BLAST algorithm is inversely related to its sensitivity, any speed comparisons should be made at comparable sensitivity levels.

Outlined specifically with respect to NCBI and AB-BLAST, the relevant speed, sensitivity and selectivity considerations include:

By default, NCBI BLAST filters query sequences for low-complexity regions using seg or dust, whereas AB-BLAST must be told explicitly to filter query sequences and which filter method to use. For speed and selectivity comparisons between the two programs, it is important that the presence or absence of query filtering be factored out. With filtering turned on, some hits may be missed but specificity may be higher.
When run on a multiprocessor computer system, the NCBI software uses one CPU or thread by default, whereas AB-BLAST Standard Edition and Enterprise Edition will attempt to use multiple CPUs by default (up to a maximum of 4 CPUs for BLASTN, unless more are specifically requested). As the use of multiple CPUs (or threads) is never as efficient as using a single CPU, and because memory-use is a linear function of the number of CPUs employed, speed and memory-use comparisons should be made between the two programs when using the same number of CPUs.
All of the search modes of NCBI blastall — except BLASTN — utilize a 2-hit BLAST algorithm by default, whereas AB-BLAST continues to use the classical 1-hit BLAST algorithm by default in all search modes. A more sensitive and efficient version of the 2-hit algorithm is available as an option in all search modes of AB-BLAST. In all search modes but BLASTN, both programs utilize a BLAST algorithm word length of 3 amino acid residues, index on BLOSUM62 neighborhood words by default, and apparently use the same value for the neighborhood word score threshold, T.
NCBI TBLASTX can not perform gapped alignments, whereas AB-TBLASTX produces gapped alignments by default, with the option to turn them off. Turning off gapped alignments will make searches execute faster and bias the specificity of the search toward finding more highly conserved sequences.
The default gap penalties, score thresholds and other statistical parameters also differ between the two sets of programs. NCBI score cutoffs are typically set to higher values by default, which demands exponentially less post-processing time but also reduces sensitivity. Confounding matters, the thresholds actually used by NCBI BLASTN have been known to be higher (faster and less sensitive) than the values reported by the software.

By normalizing for factors such as those described above, a reasonably fair evaluation of relative performance can often be obtained, but is certainly not guaranteed. Differences may exist between the NCBI's built-in low-complexity filters and the external filters employed by AB-BLAST. With AB-BLAST, the filters are external plug-in programs provided with the software distribution (or users can plug in filters of their own design), so the user can generate filtered sequences independently of performing an actual search. In addition, the AB-BLAST echofilter option allows the user to capture in the output the precise filtered sequence used internally by the search programs. All of this is just to say that with AB-BLAST, one has more complete control and can more easily verify correct behavior of the software, while differences with the NCBI software can be difficult to eliminate with complete confidence.

Other differences in alignment procedures and statistics remain, as well, some of which can impact speed, sensitivity and selectivity. For example, NCBI BLASTP does not use "Sum" statistics to identify multiple regions of similarity; and NCBI BLASTN curiously uses the same lambda, K and H values to evaluate the significance of gapped alignments as it uses for ungapped alignments, regardless of how relaxed the gap penalties are.

Last, but certainly not least, NCBI blastall has been known to report lower values for score thresholds than the values it actually used, which can confound even the most careful of performance comparisons. While the inaccuracies may seem small and therefore benign, their effect on speed can be exponential and make NCBI blastall appear significantly faster than it really is. Even more important than speed, though, reporting of incorrect parameter values conveys wrong information about the sensitivity of a search.
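To give a feel for why a small discrepancy in a score threshold matters, the sketch below applies the standard Karlin-Altschul expectation, E = K·m·n·exp(-λS), to compare the number of chance HSPs expected at a reported cutoff with the number expected at the cutoff actually used. The function names and the two cutoff values are hypothetical, chosen only for illustration; λ = 0.27 is the gapL value that appears in the command lines on this page. Treating post-processing work as proportional to the number of chance HSPs is an assumption, not a measurement.

```python
import math


def expected_hsps(score, m, n, lam, k):
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


def chance_hsp_ratio(reported_cutoff, actual_cutoff, lam):
    """Ratio of chance HSPs expected at the reported cutoff to those
    expected at the cutoff actually used. K, m and n cancel out."""
    return math.exp(lam * (actual_cutoff - reported_cutoff))


# Hypothetical example: the program reports a cutoff of 62 but uses 67.
ratio = chance_hsp_ratio(reported_cutoff=62, actual_cutoff=67, lam=0.27)
print(f"chance HSPs at reported vs. actual cutoff: {ratio:.2f}x")
```

Under these assumed numbers, a difference of only 5 score units changes the expected count of chance HSPs nearly fourfold, which is why a misreported threshold can distort a speed comparison so badly.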

In the examples below, the hitdist option invokes the 2-hit algorithm of AB-BLAST. Alternate AB-BLAST command lines are shown that increase the value of the T parameter for the 1-hit BLAST algorithm, to yield roughly the same level of sensitivity (and speed) as the default parameterization of the NCBI 2-hit algorithm. The more-efficient 2-hit BLAST implementation in AB-BLAST 3.0 may be used to obtain still more speed if desired — running significantly faster than the NCBI 2-hit BLAST — albeit with the reduced sensitivity associated with the 2-hit algorithm.

Benchmarking should be performed on computer systems over which one has full control. For example, avoid benchmarking via a web server whose configuration and operational state are unknown. As an example of how surprisingly important this can be, users of SGI IRIX 6.x may have noted that versions of this operating system released from 1997, until about 1999-2000, reported extremely inaccurate (i.e., low) execution times for programs like BLAST that use POSIX threads. Typically, the CPU time reported was actually 1/N of its actual value, where N was the number of CPUs or threads employed. Only for about 1 in N searches would the correct CPU time be reported. NCBI computers at the time were typically configured with 8-16 CPUs, so the CPU time reported was typically 8- to 16-fold lower than its actual value. This explains why the NCBI BLAST servers usually reported execution times of just a few seconds for lengthy database searches. It is also curious that the BLAST binaries and source code posted by the NCBI for users to download for local database searching did not report execution times at all, whereas supposedly the same software running on their servers did report CPU times. In any case, this particular bug seems to have been fixed in IRIX 6.5, the release of which correlates well with when the NCBI stopped reporting CPU times on their BLAST servers. ;-)

Database I/O can be a significant contributor, even the major contributor, to the overall search time. To minimize the overhead and impact of database I/O on search speed, timings should be taken on cached database files. Working with cached files is generally recommended, not just when benchmarking, to avoid contention for slow physical devices such as disk drives. Contemporary operating systems generally do a good job of automatically caching files in what would otherwise be unused memory; hence, BLAST software moved away from using System V shared memory segments for storing database files and instead began using memory-mapped I/O and file caching, starting with BLAST version 1.4 (W. Gish, unpublished). Pre-caching of files can be accomplished by first performing an untimed search to prime the cache with the desired database files before the actual benchmark runs are executed. Of course, the host computer must have enough free memory that the relevant database files can indeed be cached.
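The priming step can also be approximated without running a search at all. The sketch below simply reads each database file once from start to finish, so that the operating system has the chance to keep the pages in its file cache. This is a swapped-in technique, not the untimed search described above, and whether the data actually stays cached depends on the operating system and on available memory. The function name and the file names in the usage comment are hypothetical.

```python
def prime_file_cache(paths, chunk_size=1 << 20):
    """Read every file in `paths` sequentially and return the byte count.

    The data is discarded; the only purpose is to pull the files through
    the operating system's file cache before a timed run.
    """
    total = 0
    for path in paths:
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    return total


# Hypothetical usage; the actual database file names depend on the
# database format in use:
# prime_file_cache(["nr.file1", "nr.file2"])
```

Comparing the returned byte count with the amount of free memory gives a quick indication of whether the whole database can plausibly remain cached.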

Even when copious amounts of physical memory are present, operating systems sometimes seem to limit the amount of file system data that can be cached. Sometimes these limits are configurable, as in Solaris, but other times there may be no apparent way to increase the amount of unused memory that can be utilized for file caching. Personal experience with Linux 2.4 falls in the latter category. Your “mileage” may vary.

When file caching can not be exploited, the overhead of database I/O may be reduced by using longer (less trivial) query sequences, such that the search programs spend relatively more time actually comparing sequences than they do reading and parsing the database.

General Tips for Benchmarking BLAST

If accurate benchmarks are desired, score thresholds will often have to be altered from their values below, but doing this accurately can be a challenge (see above).
Be sure to use an accurate timing method that measures both CPU time and real (or wall clock) time. The default for NCBI blastall is to use a single thread, yet at least some versions of this program, even when a single thread was requested, have been observed to spend some fraction of their time running multithreaded, consuming CPU time faster than wall clock time on multiprocessor computers. This can make it difficult to benchmark fairly, unless an actual single-processor system is used. With some incarnations of NCBI BLAST, the wall clock time is significantly greater than the total of user and system CPU time, even with database files cached. Consequently, it is important for benchmarking procedures to collect both real and CPU time measurements.
Be sure to let benchmark runs proceed to completion, rather than counting a few periods in the "Searching..." output and halting time-consuming runs early. Extrapolation to the total run time from the number of periods emitted during a partial search can be inaccurate. NCBI blastall has been observed on multiple occasions to report far more than 50 periods (.) for a complete database search, whereas 1 period consistently corresponds to 2% of the database searched by AB-blasta.
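One way to collect both measurements for a run that proceeds to completion is sketched below. It is a minimal example for Unix-like systems only, using Python's standard resource and subprocess modules; the function name time_command is hypothetical. Discarding the program's output is a simplification made here for brevity, and it removes the cost of writing results from the measurement.

```python
import resource
import subprocess
import time


def time_command(argv):
    """Run `argv` to completion and return (wall, user_cpu, system_cpu)
    in seconds. CPU times cover the child process and any descendants
    it waited for."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = after.ru_utime - before.ru_utime
    system = after.ru_stime - before.ru_stime
    return wall, user, system


# Example with a command line taken from this page:
# wall, user, system = time_command(
#     ["blastall", "-p", "blastp", "-d", "nr", "-i", "query.aa"])
```

If user plus system time exceeds the wall clock time, the program ran on more than one CPU for part of the run. If the wall clock time is much larger than user plus system time, the run spent time waiting, for example on database I/O.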

Comparable Commands

The command lines below are presented in pairs for NCBI blastall and AB-blasta 3.0 with its optional 2-hit algorithm.

BLASTP

NCBI:  blastall -p blastp -d nr -i query.aa

AB:  blastp nr query.aa cpus=1 hitdist=40 T=11 kap s2=41 gaps2=62 x=16 gapx=38 \
		q=12 r=1 gapL=.27 gapK=.047 gapH=.23 filter=seg

BLASTX

NCBI:  blastall -p blastx -d nr -i query.nt

AB:  blastx nr query.nt cpus=1 hitdist=40 T=12 s2=41 gaps2=68 x=16 gapx=38 \
		q=12 r=1 gapL=.27 gapK=.047 gapH=.23 filter=seg

TBLASTN

NCBI:  blastall -p tblastn -d nr -i query.aa

AB:  tblastn nr query.aa cpus=1 hitdist=40 T=13 s2=41 gaps2=62 x=16 gapx=38 \
		q=12 r=1 gapL=.27 gapK=.047 gapH=.23 filter=seg

TBLASTX

NCBI:  blastall -p tblastx -d nr -i query.nt

AB:  tblastx nr query.nt cpus=1 nogaps hitdist=40 T=13 s2=41 x=16 filter=seg

BLASTN

NCBI:  blastall -p blastn -d nr -i query.nt

AB:  blastn nr query.nt cpus=1 kap x=6 gapx=25 filter=dust
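For scripted benchmarks, the paired command lines can be held as data and expanded into argument lists. The sketch below does this for two of the search modes, using only the options shown on this page; the function names and the AB_OPTIONS table are hypothetical conveniences, not part of either software package.

```python
# Options copied from the AB command lines on this page.
AB_OPTIONS = {
    "blastn": ["cpus=1", "kap", "x=6", "gapx=25", "filter=dust"],
    "tblastx": ["cpus=1", "nogaps", "hitdist=40", "T=13", "s2=41",
                "x=16", "filter=seg"],
}


def ncbi_command(program, database, query):
    """Argument list for NCBI blastall with its default behavior."""
    return ["blastall", "-p", program, "-d", database, "-i", query]


def ab_command(program, database, query):
    """Argument list for the comparable AB-BLAST search."""
    return [program, database, query] + AB_OPTIONS[program]


for argv in (ncbi_command("blastn", "nr", "query.nt"),
             ab_command("blastn", "nr", "query.nt")):
    print(" ".join(argv))
```

Keeping each pair in one place makes it harder to time one program with stale options while the other has been updated.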

Last modified: 2009-10-17
