NOTE: PRE-BUILT BINARIES OF nrdb ARE INCLUDED WITH DISTRIBUTIONS OF THE LICENSED VERSION OF WU BLAST 2.0. See http://blast.wustl.edu.

Source code last updated 2001/03/19
Description last updated 1998/05/25

The file nrdb2.tar.Z is a compressed UNIX tar archive of UNIX-compatible C source code for version 2 of a program called "nrdb", which can be used to generate quasi-nonredundant protein and nucleotide sequence databases. The program merges 100% identical sequence entries into a single entry in the output, with the associated descriptions concatenated into a single description. The program reads one or more input databases in FASTA/Pearson format and produces a single, compacted output file that is also in FASTA format.

Data sources for producing a comprehensive protein sequence database include the SWISS-PROT, PIR, PDB, and "GenPept" databases, and the cumulative daily GenPept updates. A quasi-nonredundant nucleotide sequence database can be built from the GenBank major release and the cumulative GenBank daily updates.

All of the aforementioned input databases are presently available in their native formats via anonymous FTP. GenBank/GenPept, PIR, SWISS-PROT and PDB are on ncbi.nlm.nih.gov; the EMBL Data Library and its updates are available on ftp.ebi.ac.uk. A variety of parsers exist to convert these databases' native formats into FASTA format (see below). The FASTA-format output from nrdb can then be processed with "setdb" and "pressdb" to produce blastable databases.

The size of a comprehensive, "nonredundant" protein sequence database is roughly half of the total size of the input databases. A nonredundant database is consequently easier and faster to search, yet a search of it is no less informative than searching the input databases individually. The statistical significance ascribed to BLAST alignment scores is also increased by roughly a factor of two, because the size of the search space is cut roughly in half.

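The merging behavior described above can be illustrated with a small sketch. This is not the nrdb source: the Entry type, the add_record() and same_seq() names, and the linear scan are all hypothetical, chosen only to keep the example short, and the caller is assumed to supply an array with enough room. The sketch joins descriptions with the Control-A separator that this README describes and compares sequences without regard to letter case, as version 2 is documented to do.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define DEF_SEP '\001'  /* Control-A between merged definitions */

/* Hypothetical in-memory record; not a structure from the nrdb source. */
typedef struct {
    char *defline;  /* one or more definitions joined by DEF_SEP */
    char *seq;      /* residues, stored in upper case */
} Entry;

/* Return 1 if a and b spell the same sequence, ignoring letter case. */
static int same_seq(const char *a, const char *b)
{
    while (*a && *b) {
        if (toupper((unsigned char)*a) != toupper((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;  /* equal only if both strings ended together */
}

/*
 * Add one FASTA record to db[0..*n-1].  If an identical sequence is
 * already stored, append def to its description after a DEF_SEP;
 * otherwise store a new entry (the caller guarantees there is room).
 * Returns the index of the entry used, or -1 if allocation fails.
 */
static long add_record(Entry *db, size_t *n, const char *def, const char *seq)
{
    size_t i, len;

    assert(db != NULL && n != NULL && def != NULL && seq != NULL);

    /* Linear scan for brevity only; a real tool would index sequences. */
    for (i = 0; i < *n; i++) {
        if (same_seq(db[i].seq, seq)) {
            size_t old = strlen(db[i].defline);
            size_t add = strlen(def);
            char *p = realloc(db[i].defline, old + 1 + add + 1);
            if (p == NULL)
                return -1;
            p[old] = DEF_SEP;
            memcpy(p + old + 1, def, add + 1);  /* copies the final NUL too */
            db[i].defline = p;
            return (long)i;
        }
    }

    len = strlen(seq);
    db[*n].defline = malloc(strlen(def) + 1);
    db[*n].seq = malloc(len + 1);
    if (db[*n].defline == NULL || db[*n].seq == NULL) {
        free(db[*n].defline);
        free(db[*n].seq);
        return -1;
    }
    strcpy(db[*n].defline, def);
    for (i = 0; i < len; i++)
        db[*n].seq[i] = (char)toupper((unsigned char)seq[i]);
    db[*n].seq[len] = '\0';
    return (long)(*n)++;
}
```

Feeding this sketch two records with the sequences "MKV" and "mkv" leaves a single entry whose description holds both definitions separated by Control-A; a third record with a different sequence becomes a second entry.
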
When definition lines for identical sequences are concatenated, the component definitions are separated from one another in the output by a '\001' character (ASCII Control-A or SOH [start of header]). The BLAST application programs use the Control-A character(s) to know where to break the sequence descriptions for output in BLAST ASN.1 format.

A single file is acceptable input to the nrdb program -- duplicate entries are found by the program both within files and between files.

A comprehensive nonredundant protein sequence database can be generated in about 10 minutes by a 200 MHz PentiumPro-based system with 128 MB RAM and fast disk drives. This may be only a little faster than a Perl script written to accomplish the same thing. Memory use for the nrdb program will be less, however; this is particularly true when operating on nucleotide sequences, where the nrdb program rapidly compresses all-ACGT sequences 4:1 and all-IUPAC nucleotide sequences (including ambiguity codes) 2:1 to conserve memory.

Additional source code is required to compile and link the nrdb program: the ncbi.tar.Z and gish.tar.Z archives posted beneath the /blast-14 directory. This additional code may already be available on your system if you have already built the BLAST database search software there (see /blast-14).

The BLAST software distributed here also includes parsers to convert GenBank, PIR, and SWISS-PROT flat files into FASTA format; the SWISS-PROT parser can be used to parse EMBL flat files, too. The NCBI software Toolbox includes a demonstration program called "asn2fast" that converts NCBI ASN.1 sequence data into FASTA format.

See Makefile for customizations that may be necessary to build the nrdb program on your system.

CAUTION: the nrdb program performs no validation of the letter codes it reads.

Warren Gish
NCBI/NLM, May 9, 1992
Washington University, St. Louis, 1994-

Modification history:

2001/03/19 Cleaned up some plock()-related code for platforms such as Linux that don't support it.

1999/11/11 Converted from using signed long to unsigned long values for accumulating statistics.

1998/07/15 Changes to system errno-related code.

1998/05/25 Posted nrdb version 2.0.1 -- bug fix in nrf.c

1997/05/23 Posted nrdb version 2, which is copyrighted. The public domain version 1 is archived. Added 2:1 compression for IUPAC nucleotide sequences containing ambiguity codes (e.g., ESTs) -- but any ambiguities must be identical for the sequences to be merged. Removed alphabetic case-sensitivity from the sequence comparisons; any input sequences that contain lower-case letters will be converted to upper case.

1994/09/15 Fixed a bug in util.c that could arise when processing long sequences (>256 KB).

1994/06/30 The character separating the concatenated definitions was changed from a '>' to '\001' (ASCII Control-A, start of header). This permits the current versions of the BLAST programs to unambiguously determine the start points of each definition line by searching for each occurrence of Control-A, since Control-A is a nonprinting character that should not appear in normal definition lines.

1992/07/17 Corrected some errors in the Makefile and now #include in nrdb.c. Added Makefile.gcc for use with the GNU gcc 2.+ compiler. The bundled Sun Microsystems cc compiler has not been tested for compatibility.

1992/09/21 Sequences which are entirely ACGT are now compressed in memory to conserve RAM. All FASTA input files are now read by calls to vfgets(), which confounds attempts to act intelligently when the -L option is specified and physical memory is exceeded -- the -L option should not be used until vfgets is enhanced to permit substitution of alternatives to malloc() and realloc() for its use.
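
The 4:1 in-memory compression of all-ACGT sequences mentioned in this README corresponds to storing each base in two bits. The following is a minimal sketch of that idea only, not code from nrdb: the function names pack_acgt() and unpack_acgt(), the code assignment A=0, C=1, G=2, T=3, and the placement of the first base in the high-order bits are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map A, C, G, T (either case) to 0..3; return -1 for any other letter. */
static int base_code(char c)
{
    switch (c) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return -1;
    }
}

/*
 * Pack len bases from seq into dst, four bases per byte, with the first
 * base of each group in the two high-order bits.  dst must hold at least
 * (len + 3) / 4 bytes.  Returns the number of bytes used, or 0 if seq
 * contains a letter other than A, C, G or T (or if len is 0), in which
 * case the caller would keep the sequence in some other form.
 */
static size_t pack_acgt(const char *seq, size_t len, unsigned char *dst)
{
    size_t i;

    assert(seq != NULL && dst != NULL);
    memset(dst, 0, (len + 3) / 4);
    for (i = 0; i < len; i++) {
        int code = base_code(seq[i]);
        if (code < 0)
            return 0;
        dst[i / 4] |= (unsigned char)(code << (6 - 2 * (i % 4)));
    }
    return (len + 3) / 4;
}

/*
 * Reverse of pack_acgt(): write len upper-case bases plus a terminating
 * NUL into out, which must hold at least len + 1 bytes.
 */
static void unpack_acgt(const unsigned char *src, size_t len, char *out)
{
    static const char letters[] = "ACGT";
    size_t i;

    assert(src != NULL && out != NULL);
    for (i = 0; i < len; i++)
        out[i] = letters[(src[i / 4] >> (6 - 2 * (i % 4))) & 3];
    out[len] = '\0';
}
```

Because the last byte may be only partly filled, the original sequence length has to be kept alongside the packed bytes. A 2:1 scheme for sequences with IUPAC ambiguity codes would follow the same pattern with four bits per letter.
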