Comparable AB-BLAST and NCBI BLAST Parameters

Introduction

AB-BLAST and NCBI BLAST are distinct software packages, with different default behaviors and different command line options. To ease the transition for existing users of NCBI BLAST, AB-BLAST bundles a Perl script named ab-blastall that converts NCBI blastall command line arguments into their (sometimes rough) AB-BLAST equivalents and then invokes the appropriate AB-BLAST program. The output of ab-blastall is still one of the native AB-BLAST formats, selected by the format requested on the ab-blastall command line.
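
One low-effort way to try the wrapper, sketched here on the assumption that it is installed on PATH under the name ab-blastall, is to shadow blastall with a shell function so that existing command lines reach the wrapper without being edited. The AB_BLASTALL variable is an illustrative name and not part of AB-BLAST.

```shell
# Path to the wrapper script; the default assumes ab-blastall is on PATH.
AB_BLASTALL=${AB_BLASTALL:-ab-blastall}

# Shadow "blastall" with a shell function so that existing command lines
# are handed, unchanged, to the ab-blastall wrapper.
blastall() {
    "$AB_BLASTALL" "$@"
}

# Existing invocations then need no editing, for example:
#   blastall -p blastp -d nr -i query.aa
```

Unsetting the function (unset -f blastall) restores the NCBI program for side-by-side runs.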

The remainder of this page is primarily devoted to highlighting the differences between NCBI BLAST and AB-BLAST and illustrating some of the ways that these differences can be smoothed out or eliminated.

Performance Comparisons

For fair performance comparisons between any two approaches, one must be cognizant of the factors affecting speed, sensitivity and selectivity. Through parameter settings, as well as one's choice of test data, the performance characteristics of software tools can often be dramatically altered to achieve any desired goal, whether the goal is to improve the perceived performance of an existing tool or showcase the performance of a new one. When different algorithms and statistical approaches are employed, “apples-to-oranges” comparisons may be entirely unavoidable. That said, rough observations of performance may be informative and useful, if sufficient care is taken in the preparations.

At the bottom of this page, parameter sets (or command line options) are provided that may be useful for comparing the relative performance (sensitivity, selectivity and speed) of AB-BLAST 3.0 (blasta) and NCBI Gapped BLAST 2.0 (blastall) in the various search modes these programs offer. The command line arguments shown for NCBI BLAST are merely those that are required for any search, thus yielding the “default” behavior of this software. The optional parameter settings indicated for AB-BLAST reduce its sensitivity to approximately that of the NCBI BLAST defaults. As the speed of the BLAST algorithm is inversely related to its sensitivity, any speed comparisons should be made at comparable sensitivity levels.

Outlined specifically with respect to NCBI and AB-BLAST, the relevant speed, sensitivity and selectivity considerations include:

By normalizing for factors such as those described above, a reasonably fair evaluation of relative performance can often be obtained, but it is certainly not guaranteed. Differences may exist between the NCBI's built-in low-complexity filters and the external filters employed by AB-BLAST. With AB-BLAST, the filters are external plug-in programs provided with the software distribution (or users can plug in filters of their own design), so the user can generate filtered sequences independently of performing an actual search, and the AB-BLAST echofilter option allows the user to capture in the output the precise filtered sequence used internally by the search programs. In short, AB-BLAST gives one more complete control and makes correct behavior easier to verify, while differences with the NCBI software can be difficult to eliminate with complete confidence.
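
As a minimal sketch of capturing the filtered query, the helper below appends echofilter to an AB-BLAST command line in the same keyword style as the filter=seg option used in the command lines on this page. The function name search_with_echo is hypothetical, and the bare-keyword spelling of echofilter is an assumption that should be checked against the AB-BLAST parameter documentation.

```shell
# Run an AB-BLAST search with a named filter and ask the program to echo
# the filtered query it actually searched with (the echofilter option).
# Usage: search_with_echo PROGRAM DATABASE QUERY FILTER
search_with_echo() {
    "$1" "$2" "$3" filter="$4" echofilter
}

# Example (not run here):
#   search_with_echo blastp nr query.aa seg > blastp.out
```

The echoed sequence can then be compared against the masked query used by the other program before any timings are interpreted.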

Other differences in alignment procedures and statistics remain, as well, some of which can impact speed, sensitivity and selectivity. For example, NCBI BLASTP does not use "Sum" statistics to identify multiple regions of similarity; and NCBI BLASTN curiously uses the same lambda, K and H values to evaluate the significance of gapped alignments as it uses for ungapped alignments, regardless of how relaxed the gap penalties are.

Last, but certainly not least, NCBI blastall has been known to report lower values for score thresholds than the values it actually used, which can confound even the most careful of performance comparisons. While the inaccuracies may seem small and therefore benign, their effect on speed can be exponential and make NCBI blastall appear significantly faster than it really is. Even more important than speed, though, reporting of incorrect parameter values conveys wrong information about the sensitivity of a search.


In the examples below, the hitdist option invokes the 2-hit algorithm of AB-BLAST. Alternate AB-BLAST command lines are shown that increase the value of the T parameter for the 1-hit BLAST algorithm, to yield roughly the same level of sensitivity (and speed) as the default parameterization of the NCBI 2-hit algorithm. The more-efficient 2-hit BLAST implementation in AB-BLAST 3.0 may be used to obtain still more speed if desired — running significantly faster than the NCBI 2-hit BLAST — albeit with the reduced sensitivity associated with the 2-hit algorithm.

Benchmarking should be performed on computer systems over which one has full control. For example, avoid benchmarking via a web server whose configuration and operational state are unknown. As an example of how surprisingly important this can be, users of SGI IRIX 6.x may have noted that versions of this operating system released from 1997 until about 1999-2000 reported extremely inaccurate (i.e., low) execution times for programs like BLAST that use POSIX threads. Typically, the CPU time reported was 1/N of its actual value, where N was the number of CPUs or threads employed. Only for about 1 in N searches would the correct CPU time be reported. NCBI computers at the time were typically configured with 8-16 CPUs, so the CPU time reported was typically 8- to 16-fold lower than its actual value. This explains why the NCBI BLAST servers usually reported execution times of just a few seconds for lengthy database searches. It is also curious that the BLAST binaries and source code posted by the NCBI for users to download for local database searching did not report execution times at all, whereas supposedly the same software running on their servers did report CPU times. In any case, this particular bug seems to have been fixed in IRIX 6.5, the release of which correlates well with when the NCBI stopped reporting CPU times on their BLAST servers. ;-)

Database I/O can be a significant contributor, even the major contributor, to the overall search time. To minimize the impact of database I/O on search speed, timings should be taken on cached database files. Working with cached files is generally recommended, not just when benchmarking, to avoid contention for slow physical devices such as disk drives. Contemporary operating systems generally do a good job of automatically caching files in what would otherwise be unused memory; hence, BLAST software moved away from using System V shared memory segments for storing database files and instead began using memory-mapped I/O and file caching, starting with BLAST version 1.4 (W. Gish, unpublished). Files can be pre-cached by first performing an untimed search to prime the cache with the desired database files before the actual benchmark run(s) are executed. Of course, the host computer must have sufficient free memory available that the relevant database files can indeed be cached.
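
The priming procedure can be sketched as a small shell helper: one untimed run to warm the file cache, followed by a number of timed runs. The function name prime_then_time is hypothetical, and the sketch assumes a date command that supports +%s (seconds since the epoch), which gives wall-clock times at one-second resolution only.

```shell
# Run a command once, untimed, to pull the database files into the file
# cache, then report the wall-clock time of each benchmark run.
# Usage: prime_then_time RUNS command [args...]
prime_then_time() {
    runs=$1
    shift
    "$@" > /dev/null 2>&1          # untimed priming run; output discarded
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s)
        "$@" > /dev/null           # timed run; search output discarded
        end=$(date +%s)
        echo "run $i: $((end - start)) s wall-clock"
        i=$((i + 1))
    done
}

# Example (not run here):
#   prime_then_time 3 blastall -p blastp -d nr -i query.aa
```

If the timed runs differ widely from one another, the database files are probably not staying in the cache between runs.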

Even when copious amounts of physical memory are present, operating systems sometimes seem to limit the amount of file system data that can be cached. Sometimes these limits are configurable, as in Solaris, but other times there may be no apparent way to increase the amount of unused memory that can be utilized for file caching. Personal experience with Linux 2.4 falls in the latter category. Your “mileage” may vary.

When file caching cannot be exploited, the overhead of database I/O may be reduced by using longer (less trivial) query sequences, such that the search programs spend relatively more time actually comparing sequences than they do reading and parsing the database.
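
A quick way to confirm that a query file is not trivially short is to count its residues before benchmarking. The sketch below assumes FASTA-format input (header lines beginning with ">"); the function name fasta_stats is hypothetical.

```shell
# Report the number of sequences and total residues in a FASTA file, as a
# quick check that benchmark queries are long enough to be non-trivial.
# Usage: fasta_stats FILE
fasta_stats() {
    awk '
        /^>/ { nseq++; next }               # header line: count a sequence
        { gsub(/[ \t\r]/, "")               # strip whitespace from the line
          total += length($0) }             # add the residues on this line
        END { printf "%d sequences, %d residues\n", nseq, total }
    ' "$1"
}

# Example (not run here):
#   fasta_stats query.aa
```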

General Tips for Benchmarking BLAST

Comparable Commands

The command lines below are presented in pairs for NCBI blastall and AB-blasta 3.0 with its optional 2-hit algorithm.

BLASTP

NCBI:  blastall -p blastp -d nr -i query.aa

AB:  blastp nr query.aa cpus=1 hitdist=40 T=11 kap s2=41 gaps2=62 x=16 gapx=38 \
		q=12 r=1 gapL=.27 gapK=.047 gapH=.23 filter=seg

BLASTX

NCBI:  blastall -p blastx -d nr -i query.nt

AB:  blastx nr query.nt cpus=1 hitdist=40 T=12 s2=41 gaps2=68 x=16 gapx=38 \
		q=12 r=1 gapL=.27 gapK=.047 gapH=.23 filter=seg

TBLASTN

NCBI:  blastall -p tblastn -d nr -i query.aa

AB:  tblastn nr query.aa cpus=1 hitdist=40 T=13 s2=41 gaps2=62 x=16 gapx=38 \
		q=12 r=1 gapL=.27 gapK=.047 gapH=.23 filter=seg

TBLASTX

NCBI:  blastall -p tblastx -d nr -i query.nt

AB:  tblastx nr query.nt cpus=1 nogaps hitdist=40 T=13 s2=41 x=16 filter=seg

BLASTN

NCBI:  blastall -p blastn -d nr -i query.nt

AB:  blastn nr query.nt cpus=1 kap x=6 gapx=25 filter=dust
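
For scripted comparisons, the option sets above can be collected in one place. The sketch below only prints the paired command lines and runs nothing. The function names ab_options and print_pair are hypothetical, and the database and query names are passed in as arguments.

```shell
# Print the AB-BLAST options listed above for a given search mode.
ab_options() {
    case "$1" in
        blastp)  echo "cpus=1 hitdist=40 T=11 kap s2=41 gaps2=62 x=16 gapx=38 q=12 r=1 gapL=.27 gapK=.047 gapH=.23 filter=seg" ;;
        blastx)  echo "cpus=1 hitdist=40 T=12 s2=41 gaps2=68 x=16 gapx=38 q=12 r=1 gapL=.27 gapK=.047 gapH=.23 filter=seg" ;;
        tblastn) echo "cpus=1 hitdist=40 T=13 s2=41 gaps2=62 x=16 gapx=38 q=12 r=1 gapL=.27 gapK=.047 gapH=.23 filter=seg" ;;
        tblastx) echo "cpus=1 nogaps hitdist=40 T=13 s2=41 x=16 filter=seg" ;;
        blastn)  echo "cpus=1 kap x=6 gapx=25 filter=dust" ;;
        *)       echo "unknown search mode: $1" >&2; return 1 ;;
    esac
}

# Print (without running) the NCBI and AB-BLAST command lines for one mode.
# Usage: print_pair MODE DATABASE QUERY
print_pair() {
    opts=$(ab_options "$1") || return 1
    echo "NCBI: blastall -p $1 -d $2 -i $3"
    echo "AB:   $1 $2 $3 $opts"
}

print_pair blastn nr query.nt
```

Keeping the options in a single function makes it harder for the two sides of a comparison to drift apart when a parameter is changed.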


Last modified: 2009-10-17

