This is a final copy release of version 1.4 of the BLAST application programs.
This is the Washington University, St. Louis, version, contact:
gish *AT* watson.wustl.edu

This software is UNIX-compatible only.

For a reverse chronological commentary on improvements, changes, and bug
fixes that have been made to this software, see the HISTORY file.

To build the BLASTP, BLASTN, BLASTX, TBLASTN, and TBLASTX programs, the
following compressed UNIX tar archives should first be downloaded and built
in the order shown:

    ncbi.tar.Z      -- NCBI Toolbox for UNIX (see
                       /toolbox/ncbi_tools/ncbi.tar.Z on ncbi.nlm.nih.gov)
    gish.tar.Z      -- personal function library, including dfa functions
    blast.tar.Z     -- blast function library
    blastapp.tar.Z  -- version 1.4 blast application programs

Note:  significant portions of the NCBI software Toolbox are required for
building the BLAST 1.4 software, not just a few of the .h header files as was
the case with the 1.3 software.

An ANSI C compiler is also required (gcc will do, but be sure it is a recent
version and that it has been installed properly).

The BLAST database format is unchanged with this release -- the same old
pressdb and setdb programs are used to create blastable databases from
FASTA-format input files.

Features of the BLAST 1.4 distribution:

  o Karlin and Altschul (1993) "Sum" statistics is the default method used
    to evaluate the statistical significance of sets of HSPs, rather than
    Poisson statistics.  Poisson statistics remains an option, but Sum
    statistics produces a relative ordering of the database matches that
    makes more intuitive sense and Sum statistics in many cases is more
    sensitive.

  o Fewer false positive reports are anticipated, through the use of more
    stringent HSP consistency rules than before.  This has permitted HSP
    score thresholds (S2 parameter) to be lowered somewhat for improved
    sensitivity without too adversely affecting selectivity.
    The amount of HSP overlapping permitted with consistent HSPs can be
    adjusted with the parameter -olfraction, current default value 0.125,
    or 12.5% of the length of each HSP.

  o The wordlength (W parameter) in BLASTN can now be safely adjusted from
    its new default value of 11 down to as low as 1, to increase
    sensitivity but at the expense of speed.  Many users may find no need
    to go below W=6.

  o BLASTN 1.4 uses a real scoring matrix to score alignments, instead of
    the simple match-mismatch scoring done by BLASTN 1.3, so that partial
    matching can be scored for ambiguous nucleotide codes (e.g., A vs. R).
    Of less utility, BLASTN 1.4 also has the capacity to generate
    "neighborhoods" on the W-mer words, an option that is invoked by using
    the T parameter.  The price paid for using matrices and shorter word
    lengths is that BLASTN 1.4 uses more memory and is 30% slower or more.
    Be careful when using the T parameter to generate neighborhood words --
    this parameter can cause the program's memory use to skyrocket with
    little trouble.

  o BLASTN 1.4 uses the E2 and S2 parameters in the same way these
    parameters had been used by previous versions of the other blast
    programs.

  o A new program TBLASTX is included that uses a nt. query and a nt.
    database and translates both in all 6 reading frames prior to
    comparison.  With the 6 x 6 = 36 combinations for comparison, TBLASTX
    is considerably slower than BLASTX or TBLASTN, but it may be put to
    good use in searching databases such as dbEST and dbSTS with other
    anonymous sequences.

  o BLASTP can search with multiple scoring matrices in parallel.  This
    feature is invoked by specifying multiple -matrix options on one BLASTP
    command line.  Searching with multiple PAM matrices, for example, may
    provide better sensitivity in detecting similarity between proteins
    having domains that evolved at different rates.  Note:  the p-values
    and expectations reported by the program when using more than one
    matrix may be unduly low.

  o Combinations of two or more -sort_by... options can now be used
    together.

  o The "score vs. frequency of occurrence" histogram of the version 1.3
    programs has been replaced with an "expected frequency of occurrence
    vs. actual frequency of occurrence" histogram.

  o New "-asn1" and "-asn1bin" options to all of the programs cause them to
    produce ASN.1 structured output ("print value" and binary encoded,
    respectively).

XOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOXOX

While it is recommended that users switch to using the version 1.4 programs
for the improved sensitivity and selectivity, reasons not to switch to
version 1.4 might include the following:

  o Automated parsers may break on the version 1.4 output because it is
    somewhat different from version 1.3 output.  Perhaps the easiest way to
    test this is to run a query through the NCBI BLAST E-mail server
    (blast@ncbi.nlm.nih.gov) and see if the output returned can be
    successfully parsed.

  o If reproducibility of results from the version 1.3 programs is highly
    important, note that the different statistics, cutoff scores, and
    re-written functions of version 1.4 can yield different results from
    the version 1.3 programs.  The -compat1.3 option goes a long way
    towards making the new programs behave like the old ones, but not the
    whole way.

  o Overall, the version 1.4 programs are about 10% slower than the
    version 1.3 programs.

  o The version 1.4 programs support parallel processing (threads) under
    Solaris 2.2 and higher, DEC OSF/1 (DEC UNIX) 3.2, and SGI IRIX 4 and
    higher.  This of course benefits only those users who have multiple
    CPUs in one box.

  o BLAST3 is not included in the new distribution.  This program must
    still be obtained from the 1.3 distribution.  Eventually, the source
    code for BLAST3 may be folded into the same distribution as the 1.4
    programs.

  o BLASTN 1.4 does not support the "noclean" option of previous versions.
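
Illustration of the -olfraction parameter mentioned in the feature list
above.  The Python sketch below is not taken from the BLAST sources; it only
shows one plausible reading of "12.5% of the length of each HSP".  It assumes
that each HSP can be described by a single inclusive (start, end) interval on
one sequence and that the overlap must stay within the allowed fraction of
both HSPs.  The function names overlap_length and overlap_permitted are
hypothetical, and the real programs track both query and subject coordinates.

```python
def overlap_length(a, b):
    """Number of positions shared by two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def overlap_permitted(a, b, olfraction=0.125):
    """Return True if the overlap is at most olfraction of each HSP's length.

    Assumption: the limit is applied to both HSPs; the actual consistency
    rules in BLAST 1.4 may differ.
    """
    shared = overlap_length(a, b)
    len_a = a[1] - a[0] + 1
    len_b = b[1] - b[0] + 1
    return shared <= olfraction * len_a and shared <= olfraction * len_b


if __name__ == "__main__":
    # Two 80-residue HSPs sharing 6 positions: within the 10-position limit.
    print(overlap_permitted((1, 80), (75, 154)))
    # The same HSPs sharing 21 positions: over the limit.
    print(overlap_permitted((1, 80), (60, 139)))
```

Under these assumptions, raising olfraction (for example to 0.25) would accept
the second pair as well, since 21 shared positions is no more than 25% of
either 80-residue HSP.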