This is the reverse chronological modification history for BLAST 1.4. It does not encompass changes made to BLAST 2.0 (gapped alignments with statistics), whose development was forked off in late 1994, was first released in May 1996, and upon which work continues today.
***************************************************************************
Modification history:
10/21/96 gb2fasta and gt2fasta now, respectively, parse NID and /db_xref="PID:g#" fields from GenBank flat files and report NCBI "gi" identifiers in their FASTA output.
9/3/96 Brought all genetic codes into synchrony with the NCBI Version 3.3.
8/30/96 Added a check for a failed call to mmap() in the gish library, and somewhat improved the error message reported when the database is too large for memory.
6/6/96 Added PDF format documentation (use Adobe's Acrobat Reader) and fixed a problem with the PostScript version.
5/9/96 Added an "identity" scoring matrix for BLASTN searches. It is not perfect, though: it ascribes a penalty of only -10000 to mismatches, so it is possible to have one mismatch every 10 KB or so and still achieve a positive score.
4/29/96 Fixed statistical calculation in case of multiple consistent HSPs and sum statistics. When r consistent alignments were combined, the p-values computed were too low by a factor of about r! (r factorial).
Fixed a bug that would cause the unthreaded versions of the *blast* executables to crash immediately.
11/4/95 Fixed a bug in the parsing of sequence identifiers that could yield incorrectly justified text in the initial, one-line summary section of blast program output. When this bug arose, there were 25 columns of white space at the beginning of each line.
11/3/95 Updated the list of built-in genetic codes in blast/blast/gcode.h using the latest NCBI Toolbox ASN.1 data (toolbox/data/gc.prt).
10/26/95 Fixed a multiprocessing bug in the blast programs that could arise when searching small databases (<500 sequences).
10/3/95 Added support for the NCBI (Wootton & Federhen) "nseg" program on the BLASTN command line, using the "-filter seg" option.
9/27/95 Added a "-WashU" tag to the program version numbers, to ensure there is no mistaking the WashU distribution of these programs for the NCBI distribution.
9/26/95 Fixed a long-standing bug in pressdb regarding which sequences are tagged as having "ambiguous" nucleotide codes. Thanks to Colin Watanabe at Genentech for pointing this out.
9/18/95 The PRESSDB program (pressdb.c) can now append sequences to an existing BLAST database, using the -a option. (The SETDB program has not been so modified yet.)
8/22/95 The file locking described on 6/7/95 has been disabled at least temporarily because it is not functioning in the intended manner with files that reside on NFS-mounted partitions.
8/14/95 gb2fasta now parses NCBI "gi" identifiers from the GenBank flat files.
6/7/95 See note on 8/22/95!
Database file locking has been added to the BLAST search programs and to the database maintenance programs setdb and pressdb, to eliminate (or optionally reduce) the opportunity for collisions between database search and database maintenance activities. Previously, a setdb or pressdb invocation would cause active BLAST searches of the same database to fail. File locking now prevents the blastable database files from being modified by setdb/pressdb until they are no longer in use by a search program.
This doesn't necessarily come without some risk. With strict file locking in force (the default), deadlock or near-deadlock may now be a concern within a production environment, as multiple simultaneous BLAST search production lines involving one database can effectively block setdb or pressdb forever -- unless all production lines happen to finish their searches at the same time. Having all production lines finish at virtually the same time may be an infrequent event if more than just a couple are running.
This new situation seems more desirable, though, than not using file locks and unwittingly allowing setdb and pressdb to blow away databases out from under any searches. As an aid to diagnosing deadlock situations should they arise, when blocked, setdb and pressdb report their blocked status every 60 seconds.
If deadlock is a real problem, one can revert to the former, ungoverned situation by completely disabling file locking with the new -l option to the setdb/pressdb programs. Significant file lock protection can still be obtained, though -- and without the risk of deadlock -- by using the -b option to setdb/pressdb instead of completely disabling it with -l. The -b option simply blocks any subsequently invoked BLAST searches until the current setdb/pressdb operation is finished; however, any search that happened to be in progress when setdb/pressdb was invoked will get trashed.
Through the use of locks, it is possible to update databases that are actively being searched or that reside on-line in a production area, without the need for off-line, ancillary working storage equivalent to a full copy of the database.
N.B. One area not addressed by the present file locking is that of the FASTA-format nt. sequence file accessed by BLASTN, TBLASTN, and TBLASTX, which still causes problems if updated in the middle of a search.
6/1/95 Fixed a long-standing deadlock problem in the Solaris multithreaded executables (and more recently the OSF/1 executables).
5/28/95 Removed the link between X & S that existed in blastapp/lib/context.c.
5/24/95 Threads support (parallel processing) added for DEC OSF/1 3.0 (Digital UNIX).
5/20/95 Switched to using Robinson & Robinson (PNAS 1991) amino acid residue frequencies.
Fixed a minor slowness problem in BLASTN, TBLASTN, and TBLASTX (the programs that access the FASTA-format database file, which they had been doing more often than necessary).
Changed the name of the recently added "pgsper" command line option to the simpler name "progress".
It's now described in the documentation file, blast1.1, too.
4/26/95 Added "-pgsper #" command line option to adjust the time-out period in progress messages. Alarm clock errors when using Solaris threads prompted the creation of this parameter. To avoid any possibility of the alarm clock error, set a time-out of 0.
Changed basename() to misc_basename() for Linux compatibility.
3/30/95 Made memory management a little more flexible and robust.
V & B command line options are supported in the ASN.1 form of the output now.
Made changes for VMS compatibility kindly suggested by Scott Rose (GCG, Madison, WI).
3/8/95 pressdb and setdb now parse arbitrarily large FASTA input databases, expanding their memory buffers as much as necessary. No more need to modify ENTRY_MAX.
3/7/95 I lied on 2/1/95. Solaris threads support promises to be robust now. Famous last words.
2/13/95 The dfa library was consolidated into the gish library.
2/1/95 Too optimistic on 1/24/95 -- the Solaris threads/alarm problem was not fixed then. It truly seems to be fixed now.
Also, fixed a bug in BLASTN's calculation of the Karlin-Altschul K value.
Plus some slight performance improvements to BLASTN, TBLASTN and TBLASTX, related to the FASTA file access; because of this improvement, BLASTN is set to use up to 4 processors by default instead of the previous default of 3.
1/24/95 Fixed (for the last time?!) the interaction between Solaris threads and SIGALRM signals in the "gish" library.
12/19/94 Fixed a multiprocessing bug in all of the programs. The bug would often produce crashes (segmentation faults) when searching tiny databases.
hsp_max is now used to truncate HSP lists _after_ statistical significance estimates have been made and after the list has been sorted for output.
12/16/94 Fixed handling of gap characters in the query sequence by blastx, tblastn, and tblastx.
12/15/94 blastp was stripping gap characters (-) from the query sequence. Fixed.
10/16/94 Fixed a severe bug in the support for multiprocessing under Solaris 2. Some of the code involved in this bug fix is in the "gish" library. Program version numbers are unchanged by this fix, but the code release date displayed in the programs' introductory output is updated to today's date.
10/6/94 First "final copy" release of BLAST 1.4 software.
10/4/94 Changed "-overlap", "-overlap1", and "-overlap2" command line option names to "-span", "-span1", and "-span2", respectively. "-span2" is the default.
9/30/94 I'm now employed by the Department of Genetics, Washington University School of Medicine, St. Louis, MO 63108.
9/3/93 Fixed bug in gb2fasta's concatenation of long definitions.
8/8/93 Added -qoffset option to BLASTP, BLASTX, TBLASTN, and BLASTN, to permit segments of long sequences to be used as queries and still have their residues numbered correctly in alignments.
7/28/93 Changed the format of substitution matrix files read by BLASTP, BLASTX, TBLASTN and BLAST3. Substitution scores in the matrix files can now properly have non-integral values. The blast programs still do their scoring using integral data types. Upon being read by the blast programs, each score value is rounded to the nearest integer. Matrices in the new format are generated by the pam program.
Fixed the display of query sequence segments in BLASTX when its -codoninfo option is invoked.
7/7/93 Prompted by Erik Sonnhammer, a "-overlap2" command line option (also available as simply "-over2") was added to make the criteria for HSP overlap detection tighter. This option has a positive effect on the number of HSPs reported (fewer of them will satisfy the overlap2 criteria) for sequences that contain internal repeats, but will have a negative effect on their associated statistics.
The additionally reported HSPs may have Poisson statistics inappropriately applied, because the HSPs may be incompatible with others in the same global alignment and hence cannot be considered as independent events.
For query sequences too short to satisfy the cutoffs or expectation thresholds, the minimum acceptable expect values that were reported by BLASTP, BLASTN, and TBLASTN were incorrect. Now fixed.
7/2/93 Changed the way the cutoff score, S, and expectation cutoff, E, are reported. All output is now filtered based on its estimated statistical significance (E value), rather than using cutoff scores directly.
6/22/93 Fixed bug in consistp.c's implementation of R(i,3) found by Phil Green. Followed another suggestion of Phil Green's for making Poisson probability calculations more efficient.
6/21/93 Fixed bug in the calculation of "consistent N counts" for those HSPs found on minus strands in BLASTN, BLASTX, and TBLASTN. Plus strand hit counts were not affected.
Pressdb on 64-bit platforms now produces databases that are readable on all platforms.
6/16/93 Fixed a conflict between static and global variables in bldaa.c and bldxa.c. This produced a bug in the blast software under DEC Alpha OSF/1.
6/9/93 Added "-gapdecayrate" parameter (default=0.5), as suggested by Phil Green (Washington University, St. Louis). This parameter defines a geometric progression used to adjust Poisson probabilities upward, to account for the fact that many values for the N parameter in Poisson P(N) are considered when choosing the "best" alignments. If r is the decay rate (0 < r < 1) for the progression and n is the number of segments under consideration, then the number of gaps is n-1 and the Poisson probabilities will be _divided_ by the quantity:
        (1-r) * r^(n-1)
For n=1 (one HSP) and the default r=0.5, the adjustment is by a factor of 1/(1-0.5) = 2.
Fixed a bug in lib/consistp.c that produced undetected overflows in factorial calculations.
This was occasionally problematic in TBLASTN queries with hits against extremely long database sequences.
5/9/93 In TBLASTN, fixed discrepancies in alignments when a database sequence contained one or more ambiguity (non-ACGT) codes. Previously, the original FASTA format database sequence was only examined at the end of the search; now it is examined during the search, so that it is known up front what the real alignment score and extent of alignment are.
The HSP cutoff score in TBLASTN is now S2. Previously, there had to be at least one match scoring at least as high as S, after which the database sequence was re-scanned using a cutoff of S2. Now each database sequence is scanned only once, using the lower cutoff. Better sensitivity results for short exons. Something not done now, however, is to scan the entire diagonal on which an HSP is found.
5/8/93 Fixed severe bug in BLASTN. Word hits on the plus- and minus- strands were being managed in a single pool, rather than separate pools. Consequence: hits on one strand could obscure hits on the other strand. In typical use, this would rarely cause a problem because of the improbably long wordlength used by BLASTN (W=12) and the requirement for the word hits to appear in a particular order. This bug was present since BLASTN's inception.
In BLASTN, fixed discrepancies in alignments when a database sequence contained one or more ambiguity (non-ACGT) codes. Previously, the original FASTA format database sequence was only examined at the end of the search; now it is examined during the search, so that it is known up front what the real alignment score and extent of alignment are.
5/6/93 Fixed a bug introduced to BLASTN on 5/4/93, wherein the first residue in the complementary strand (i.e., the complementary residue to the last residue on the "plus" strand) was not initialized.
This bug would reveal itself iff the query contained one or more non-ACGT codes and the first residue on the complementary strand should have continued a match with a database sequence.
Tweaked the default value of E2 upward from 0.1 to 0.15, in reaction to the bug-fix on 5/5/93 which had raised the value of S2 calculated from E2.
5/5/93 Stupid bug fixed in all blast programs. The units that had been assumed for the Karlin-Altschul H statistic in the function stolen() were "nats per position", whereas the karlin() function was calculating H in units of "bits per position". The karlin() function was modified to calculate H in nats, and all equations that were functions of H and had been (correctly) assuming H was in units of bits were modified to account for the change to nats. H is still reported in units of bits, because of the automated parsers in the world. The consequences of this error were (1) that the expected length estimated for an alignment of any particular score was too short by a factor of log(2); and (2) the probability estimates reported by the programs were often higher (lower in statistical significance) than they should have been.
5/4/93 In BLASTN, ambiguous nucleotides in the query sequence are handled consistently throughout the program as mismatching all other letters, so that, e.g., strings of N's can be used to mask a query sequence. In addition, gap letters (hyphens) in the query sequence will never appear in an alignment (although they may appear in the database sequence half of an alignment). Ambiguity codes in the database sequences (only) can still lead to discrepancies between the scores obtained during the search and the scores reported after the search.
4/23/93 Recently, in all of the blast programs, a "consistent" N parameter was used in the Poisson statistics, to reflect the number of HSPs likely to be consistent with one another in the same gapped alignment.
Now, all of the blast programs build upon this by using another enhancement of Stephen Altschul's, which is to adjust the Poisson probabilities downwards (making them more significant) to account for the consistency requirement. There is no effect on single-HSP probabilities. Some reordering of the database sequences will be observed in the output, with multiple-hit cases often moving up a few notches relative to the single-hit cases.
With the consistency-adjusted Poisson P-values, sensitivity is expected to be marginally improved, the improvement being practically confined to matches that would come close to satisfying the statistical significance threshold anyway. If the threshold is set at a point within or just above background, the new program will report false positives more often than the previous version did. Improved sensitivity will also be noticed more often with longer sequences, which provide greater opportunity to accumulate multiple hits with a single database sequence.
The consistency feature (which includes both the consistent N and consistent Poisson statistics) can be turned off with the "-consistency" command line option.
The statistics of consistent HSPs are discussed by Karlin and Altschul in a manuscript recently submitted to Proc. Natl. Acad. Sci. USA.
4/6/93 HSP == high-scoring segment pair, the unit of BLAST output.
In all of the BLAST programs, the Poisson event count (or the N parameter used in the Poisson statistics) assigned to each HSP is now estimated more accurately, using positional information as well as scores. A simple midpoint rule of Stephen Altschul's design is used to estimate the number of HSPs that would be consistent with each other in the same gapped alignment. Let (x,y) represent the location in 2-dimensional space of the midpoint of an HSP. In a "consistent" set of HSPs, if the HSPs are sorted in increasing order of their x coordinates, then the y coordinates of the sorted list also produce a strictly increasing sequence.
For any given HSP, the maximum number of other HSPs that can be made consistent with it (plus 1 for the HSP under consideration) becomes the Poisson N parameter. The effect of this change is to reduce the number of false positives reported (improved selectivity), which sets the stage for the following...
In BLASTP and TBLASTN, a much lower cutoff score (S2 instead of S) for reporting HSPs is used in conjunction with the consistent event count. HSPs are filtered from the output based on their statistical significance as estimated using Poisson statistics. Due to Altschul's consistency rule, a lower cutoff score can be used without introducing too much extra noise in the output, while providing increased sensitivity in detecting homologs in the presence of insertion/deletion errors and mutations. This change has not yet been documented in the blast manual page, and the values of S2 and E2 (E2 defined to be the number of chance matches expected when comparing two random sequences each 300 amino acids in length) cannot currently be modified from their default values through the NCBI BLAST E-mail Service.
With previous versions of BLASTP and TBLASTN, a database sequence had to produce at least one segment (HSP) scoring at least as high as the cutoff score, S, in order to be reported. And if this high threshold was met, the database sequence was scanned a second time using a lower cutoff, S2. This repeat scanning no longer occurs--all database sequences are scanned using the lower cutoff. The former cutoff score parameter, S, and expect parameter, E, now establish a threshold of statistical significance that must be satisfied by the Poisson P-values of the HSPs regardless of their individual scores.
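The midpoint rule described earlier in this entry can be illustrated with a short sketch. This is not the code in the BLAST sources; it is a minimal Python illustration under the assumption that each HSP is reduced to its (x,y) midpoint, and the function name consistent_counts is hypothetical. For each HSP it finds the size of the largest set containing that HSP in which both coordinates increase strictly, which is the quantity the entry says becomes the Poisson N parameter.

```python
def consistent_counts(midpoints):
    """For each (x, y) midpoint, return the size of the largest consistent
    set that contains it, counting the HSP itself.

    A set is consistent when sorting by x gives strictly increasing y.
    Illustrative O(n^2) sketch; names and data layout are assumptions.
    """
    n = len(midpoints)
    order = sorted(range(n), key=lambda i: midpoints[i])
    ending = [1] * n    # longest consistent chain that ends at HSP i
    starting = [1] * n  # longest consistent chain that starts at HSP i

    # Forward pass: extend chains from HSPs lying strictly below and left.
    for pos, i in enumerate(order):
        xi, yi = midpoints[i]
        for j in order[:pos]:
            xj, yj = midpoints[j]
            if xj < xi and yj < yi:
                ending[i] = max(ending[i], ending[j] + 1)

    # Backward pass: extend chains to HSPs lying strictly above and right.
    for pos in range(n - 1, -1, -1):
        i = order[pos]
        xi, yi = midpoints[i]
        for j in order[pos + 1:]:
            xj, yj = midpoints[j]
            if xi < xj and yi < yj:
                starting[i] = max(starting[i], starting[j] + 1)

    # The HSP itself is counted in both chains, so subtract one.
    return [ending[i] + starting[i] - 1 for i in range(n)]


if __name__ == "__main__":
    hsps = [(10, 10), (20, 25), (30, 15), (40, 40), (50, 5)]
    print(consistent_counts(hsps))
```

In this example the last midpoint, (50, 5), lies to the right of and below every other midpoint, so it is consistent with none of them and keeps an event count of 1, while each of the other four belongs to some chain of three.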
The evaluation of HSPs works like this: if a single database sequence yields one or more HSPs each scoring S2 or higher with the query, the list of HSPs is first sorted by score just as before; consistent event counts are then assigned; Poisson probabilities are calculated; and finally the list is truncated after the last HSP having a Poisson P-value that satisfies the S or E significance threshold. If no Poisson P-values satisfy the threshold, then the whole list is thrown away and none of the HSPs is reported. S might be thought of as the score that must be achieved by an HSP observed in isolation (Poisson event count = 1) for it to be reported.
While use of a lower cutoff score is the default for BLASTP and TBLASTN, a similar low cutoff has been made an option for BLASTX, which may become the future default. It is presently only an option because it is feared that some automated parsers of BLASTX output might break if the lower cutoff method were suddenly instituted as the default. To invoke the option in BLASTX, specify a value for either E2 or S2 on the BLASTX command line. E2 is the number of HSPs expected to be observed by chance when comparing a random sequence 100 codons in length against another random sequence 300 amino acids in length. A suggested starting choice for E2 is 0.1. This change to BLASTX has not yet been documented in the blast manual page, and the option is also not presently selectable through the NCBI BLAST E-mail Service.
A lower cutoff was not introduced to BLASTN, because the sensitivity of this program with its fixed wordlength W=12 is low.
BLAST3 has always used a low cutoff.
Symmetric multiprocessing can now be employed by the BLAST programs under SunSoft's Solaris 2.2 operating system, as well as the previous Silicon Graphics' IRIX operating system. The code has only been tested under a beta release of Solaris 2.2.
Code is also included to putatively use threads in an OSF/1 environment such as Digital's OSF/1 on the Alpha AXP platform; however, it has not been possible to test this code.
Many more enhancements in the software are included, not all of which are documented yet or bundled here--e.g., support for the low-compositional complexity SEG filter of Wootton and Federhen (wootton@ncbi.nlm.nih.gov) and the short-periodicity repeat XNU filter of Claverie and States (jmc@ncbi.nlm.nih.gov). Also, optional use by BLAST of codon bias information read from *.cdi files (States and Gish, manuscript submitted). The interfaces to these features are not well developed, subject to change, and are presently provided "as is" in an effort to expedite moving the earlier-mentioned improvements into users' hands.
3/25/93 The default neighborhood word score threshold (T parameter) was raised a notch in TBLASTN only, to obtain a roughly compensatory increase in speed for the performance hit that was incurred in the switch to using the new default BLOSUM62 matrix on 3/19/93.
3/19/93 Changed the default substitution matrix used by BLASTP, BLASTX, TBLASTN and BLAST3 from PAM120 to BLOSUM62. Speed declines by about 30-40% as a result.
3/5/93 Changed the format of the sequence identifiers output by the programs gb2fasta, gt2fasta, pir2fasta, and sp2fasta. LOCUS and ACCESSION identifiers are now included.
9/19/91 Removed one last dependency of the software on the alphabetical case of residues in the FASTA databases. This change was localized to one line in blastn.c.
9/20/91 Better compatibility with Cray UNICOS (version 7.0).
9/23/91 Marginal improvement in speed of BLASTP and TBLASTN (re: zero-ing of diagonal hit structures in search_aa()), with a concomitant correction to the hit statistics reported by these programs. Only a minor change was made with respect to BLAST3, but since all three of these programs include the same searcha.inc file, the version number on BLAST3 was bumped up one.
9/25/91 Improved reporting of individual HSP statistics (including the number of bits of information associated with the alignment scores), and a more consistent report style across all blast programs.
9/27/91 BLASTN is now rigid in its interpretation of matching/mismatching. Residues must be A, C, G, or T (U) to match any other residue. And T now matches U. There is no concept of a partial match with BLASTN. For example, R (purine) does not half-match with a G or A, but rather is scored as a complete MISMATCH.
The blast1.1 manual page is better.
10/4/91 Hits on opposite strands of a query or database sequence are now considered to be distinguishable events, and so are counted separately in the Poisson statistics calculations.
The default value for E used by BLASTP, BLASTN, BLASTX, and TBLASTN has been reduced from 25 down to 10, to avoid reporting quite so many hits which are statistically insignificant under the random sequence model. The experienced user may well want to routinely use an even lower value for E, e.g. E=1 or E=2.
10/23/91 Fixed frame reference bug in blastx.print_parms.
11/11/91 Neglected to initialize the pts[] array to NULL pointers in blast3.c.
11/13/91 The mode parameter of mfile.mfil_open() was not being passed to fopen() when USE_SHM was undefined.
12/11/91 Fixed bug in blast3.print_p which arose if USE_MPROC was _not_ defined and the database was not resident in shared memory.
Fixed semaphore SETVAL bug in shmutil.c and minor bug in memfile.c.
12/18/91 Improved signal handling in multiprocessing situations.
12/23/91 Improved command line parsing. New -overlap option added to all blast programs to turn off HSP overlap detection and removal.
12/24/91 Fixed filesize bug in shmutil.c. Only applicable to users of shared memory.
12/29/91 Fixed bug in blastx.c and others, in vicinity of isspace() macro usage.
12/30/91 Added sp2fasta utility for converting SWISS-PROT text format into FASTA format.
12/31/91 In searchn.inc, which is used by BLASTN, the strand (frame) of each HSP was not being set.
1/2/92 Fixed severe multiprocessing bug in TBLASTN--has no effect on uniprocessing.
1/6/92 Only the frequencies of occurrence of unambiguous letters (non-X for protein and non-N for nucleotide sequences) are used to calculate the Karlin parameters K and Lambda (and H). This change can lead to occasional warning messages (usually not fatal errors and not serious) about the score probabilities not adding up to 1.0.
The "pam" v1.0.3 utility program now calculates a weighted average substitution score against the ambiguity letter X; a command line option permits the user to set a constant substitution score instead.
Several .h and .c files had some ANSI-incompatibilities fixed; in particular "Boolean" parameters were changed to "int" because of the use of old-style function declarations.
1/17/92 Minor bug fix in lib/mfile.c and a major bug fix in BLAST3's out3.c. Both bugs were introduced recently; the former one prevented compilation of mfile.c; the latter one sent the 3-way search phase of BLAST3 into an infinite loop on single-processor architectures. Version numbers are not being incremented.
1/23/92 Fixed bug in sp2fasta.c that caused the last character of each DE line to be omitted.
2/10/92 Changed SGI IRIX compiler optimization flag from -O3 to -O2 in main copy of Makefile.sgi, for compatibility with IRIX 4.0.
2/18/92 Switched the BLAST application programs over to using a new version of the dfa library. The new dfa library is required.
2/20/92 Made changes to the Makefiles. Verified that all required libraries (ncbi, gish, dfa) and programs can be built. New copies of all dependent source code should be obtained.
3/9/92 Faster K calculations now performed. Accuracy is 2+ decimal places for the PAM120 and 2- places for PAM250.
This generally translates into only a small error (<1%) in the dependent P-values, expectations, and bit scores, which seems acceptable for an approximate 20-fold improvement in the speed of calculating K. Furthermore, the error in K is on the high side, so P-values etc. tend to be conservative. The speed is achieved by performing fewer iterations in the main K loop and compensating for this by adding in several corrective terms from a geometric progression of Altschul's design.
3/27/92 Better handling by BLASTN of cases where the database sequence contains ambiguity letters. BLASTN now does not require the original FASTA-format nucleotide sequence database file. (TBLASTN still does, however.)
3/28/92 Better handling by TBLASTN of cases where the database sequence contains nucleotide ambiguity codes. Now neither BLASTN nor TBLASTN requires the original FASTA-format nucleotide sequence database file.
Long strings that had been static are now allocated dynamically.
3/29/92 blastp, blastn, blastx, tblastn, and blast3 have no theoretical limit on the line length in the query sequence file; setdb and pressdb have no theoretical limit on the length of lines in the input FASTA database files.
Several programs were modified to accommodate a change in the gish library's misc/basename() function--an updated copy of the gish library must be obtained for compatibility.
3/30/92 Fixed bug in blastn's overlap checking function, ovlap_n(), that caused minus-strand HSPs to be reported that were intended to be filtered out.
Merged versions of pvals_a(), pvals_n(), and pvals_t() into a single pvals() function.
Fixed a bug in pressdb that would appear only if each sequence in the input FASTA-format database file resided on a single (possibly very long) line.
3/31/92 Added a "gap" character, '-', to the amino acid alphabet used by BLASTP, BLASTX, TBLASTN, and BLAST3, which breaks alignments into separate segments. BLASTN does not support gap characters.
Fixed a severe bug in the multiprocessing version of TBLASTN: the translate() function failed to set s_len, the database sequence length, in frame 1. Until the gap letter was introduced to the amino acid alphabet today, it is not clear that this deficiency caused any problems. It certainly did not affect the results on uniprocessing platforms.
Default value for the H (histogram) parameter is now 0 to omit reporting the histogram.
4/2/92 Added function etop(), which uses new function fct_expm1() in the gish library, to calculate probabilities from expect values.
Changed the letter 'X' in the nucleotide alphabet to '-', which is supposed to represent a gap (as it does in the amino acid alphabet), but currently is treated by BLASTN like a mismatch character.
4/8/92 Pressdb still requires sequence lines to be of equal length (except for the last line of each sequence, which can be shorter), but it now tolerates one or more blank lines at the end of each sequence.
4/17/92 Fixed a bug in the single-processor version of blast3 (out3.c) that produced an infinite loop.
5/15/92 Fixed a bug in blast3 that caused it to produce an unexpected number of pair-wise alignments. Often no pairwise alignments were displayed at all. This bug had no effect on the 3-way alignments produced.
6/16/92 Added several Hitlist sorting options to each of the BLAST programs except BLAST3.
        -sort_by_pvalue is the default for all.
        -sort_by_count sorts by the number of HSPs in each database sequence's hitlist.
        -sort_by_highscore sorts by the highest HSP score in a hitlist.
        -sort_by_totalscore sorts by the total of all HSP scores in a hitlist.
Example:
        blastp pir myquery -sort_by_totalscore
6/18/92 In blastx, corrected the statistic reported for the highest observed score in each reading frame.
6/25/92 Corrected the way averaging was performed to calculate substitution scores against letters B and Z in the matrices produced by the pam program (pam.c).
The standard Dayhoff PAM-250 matrix is now included in the distribution, under the filename "dayhoff".
7/1/92 Corrected a bug in lib/getseq.c that would cause BLASTN and TBLASTN to crash when reporting hits on single-processor platforms when the compressed nucleotide database file *.csq was loaded in shared memory. No effect if shared memory was not actively in use.
8/5/92 Fixed a bug in the single-processor version of blast3 (out3.c) that produced an infinite loop. (How does this bug keep reappearing??)
8/14/92 Changed one fatal error message to what should have been merely a warning in BLASTN. Added a warning message to BLASTP and TBLASTN. No change in version numbers.
8/25/92 Made the software compatible with DEC Ultrix and other operating systems running on "little endian" platforms. BLAST databases, which contain binary encoded integers, can be shared between big and little endian platforms. Big endian platforms will be only marginally more efficient.
9/3/92 Corrected the substitution scores for B-X and Z-X reported by the pam program. Current version of pam is 1.0.5.
9/4/92 Added several BLOSUM matrix files to the distribution. Moved all matrix files into a new "matrix" subdirectory. Renamed BLASTPAM environment variable to BLASTMAT, and changed its default value from "/usr/ncbi/blast/pam" to "/usr/ncbi/blast/matrix".
9/4/92 Corrected a bug in lib/hsppool.c that caused occasional bus errors and segmentation violations.
9/7/92 Moved bulk of the low-level multiprocessing support into the "gish" library.
10/1/92 Added gt2fasta program for extracting coding sequence (CDS) feature translations from files in the GenBank(R) flat file format, saving the results in a FASTA format file.
10/2/92 Made code compatible with architectures having 8-byte long integers, e.g. DEC Alpha.
10/26/92 Fixed a bug in searcha.inc regarding the handling of segmented sequences in BLASTP and TBLASTN.
During examination of a diagonal for hits while ignoring X, the programs had been halting the diagonal search when a gap character was encountered in either the query or the database sequence.
11/4/92 Renamed include/blast.h to include/blastapp.h, to prepare for migration to using a blast function library which contains blast.h.
11/5/92 Moved lib/shmutil.c and lib/mfile.c into the "gish" library, and removed the USE_SHM macro.
11/16/92 BLASTP prunes its hitlists at the point where the expectation E/S is no longer satisfied. E2/S2 is now the cutoff for saving HSPs for subsequent pruning by the E/S criterion; after pruning, no HSPs may remain. Noise is reduced by the pruning, and better sensitivity is obtained by using a lower cutoff score followed by filtering on Poisson P-values.
12/8/92 sp2fasta now strips carriage-return characters from the definition lines, so the program now works well when parsing sequence files on the EMBL CD-ROM.