NCBI Helper

Background
On January 25, 2024, the NCBI announced that, starting in April, the NCBI BLAST databases posted on its FTP server would no longer be accompanied by the same data in FASTA format. In recent years, the posted FASTA databases have included nr, nt, pdbaa and swissprot, but just a few years earlier they also included env_nr, env_nt, mito, pataa, patnt, vector and several more. The NCBI has posted nr, nt, swissprot and others in FASTA format since 1990.

Problem
FASTA is a simple, accessible, widely supported format for storing and processing sequence data. NCBI BLAST databases, by contrast, cannot be readily processed by many existing tools, and the difficulty of accessing data in this specialized format impedes experimentation and discovery. Henceforth, people who work with FASTA-format data and wish to use data available only in NCBI BLAST format will need to install the NCBI BLAST software, download and store the NCBI BLAST databases of interest, unpack the tar.gz files, and convert the databases to FASTA format themselves, optionally compressing the result for archival storage.

Mitigation
To simplify the FASTA conversion process and improve on the speed and storage requirements of the naïve, slow approach offered by the NCBI in its announcement, a few scripts are provided here. These scripts can use multiple CPU cores to accelerate both the conversion to FASTA and the optional data compression. Besides the NCBI BLAST software (blastdbcmd), they require three third-party tools: GNU parallel, pigz and zstd. If installed, pushover is used to send a notification when a job starts and ends. The scripts posted here are configured to use pigz, but un-commenting the appropriate lines switches them to the faster, more efficient zstd.
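
Before running any of the scripts, a quick pre-flight check along the following lines could confirm that the tools named above are on the PATH. This is a sketch, not part of the posted scripts: the function name missing_tools is hypothetical, and the executable names (parallel, pigz, zstd, blastdbcmd, pushover) are assumed to match a typical installation.

```shell
# Hypothetical helper: print the name of every listed tool that is NOT on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: report required tools that are absent (pushover is optional).
#   missing_tools blastdbcmd parallel pigz zstd
```

An empty result means every listed tool was found.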

Two of the scripts (pn2f and pn2fz) do not require the NCBI BLAST database to be unpacked before conversion to FASTA. This removes the need for enough storage to hold the entire NCBI BLAST database in its more voluminous unpacked (untarred) form. Instead, these scripts work directly with “packed” NCBI BLAST databases in tar.gz files. Components of the database are unpacked transiently, only for as long as necessary, and the unpacked data are then removed. The packed tar.gz files are not touched and remain intact.
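
The transient-unpacking idea can be sketched roughly as follows. This is an illustration only, not the pn2f or pn2fz code. It assumes that each tar.gz file holds one self-contained database volume named after the tarball (for example, nt.00.tar.gz unpacking to nt.00.*), and the function names with_unpacked_volumes and dump_volume are hypothetical.

```shell
# Hypothetical helper: unpack each tarball into a scratch directory, run a
# command on the unpacked volume, then delete the scratch copy. The tarballs
# themselves are only read, never modified.
with_unpacked_volumes() {
    local cmd="$1" tarball workdir vol
    shift
    for tarball in "$@"; do
        workdir=$(mktemp -d) || return 1
        if tar -xzf "$tarball" -C "$workdir"; then
            vol=$(basename "$tarball" .tar.gz)   # nt.00.tar.gz -> nt.00 (assumed naming)
            "$cmd" "$workdir/$vol"
        fi
        rm -rf "$workdir"                        # unpacked data live only this long
    done
}

# Assumed dump step: write one unpacked volume to stdout as FASTA.
dump_volume() { blastdbcmd -db "$1" -entry all; }

# Example (requires blastdbcmd and the downloaded tarballs):
#   with_unpacked_volumes dump_volume nt.*.tar.gz > nt.fa
```

Processing one volume at a time keeps peak disk usage near the size of a single unpacked volume rather than the whole database.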

A script named pmd5 is provided here that can dramatically speed up the calculation of MD5 checksums. It accomplishes this by first calculating MD5 checksums in parallel on individual segments of a specified file until the entire file has been processed. The resultant list of MD5 checksums is then piped through MD5 once more, to calculate a single MD5 checksum that is reported for the entire file. The script also supports a “-c filename” option to reevaluate and compare a new result against a previously stored result. It is recommended to store pmd5 output in a file with a .pmd5 extension to its name, to distinguish this checksum-of-checksums from a standard MD5 checksum often stored with a .md5 extension.
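
The checksum-of-checksums idea described above can be sketched as follows. This is not the pmd5 script, and its output should not be expected to match pmd5's: the segment size (64 MiB), the job count, the one-hex-digest-per-line list format, and the function name segmented_md5 are all assumptions made for illustration. It relies on md5sum plus tail and head with the -c option, as found on typical Linux systems.

```shell
# Hypothetical sketch: hash fixed-size segments of a file concurrently, then
# hash the ordered list of segment digests to get a single value.
segmented_md5() {
    local file="$1" seg_bytes="${2:-67108864}" jobs="${3:-4}"
    local size nseg tmpdir i
    size=$(wc -c < "$file") || return 1
    nseg=$(( (size + seg_bytes - 1) / seg_bytes ))
    [ "$nseg" -gt 0 ] || nseg=1           # an empty file still gets one segment
    tmpdir=$(mktemp -d) || return 1

    i=0
    while [ "$i" -lt "$nseg" ]; do
        (
            # Segment i starts at byte offset i * seg_bytes (tail -c is 1-based).
            tail -c +"$(( i * seg_bytes + 1 ))" "$file" | head -c "$seg_bytes" |
                md5sum | cut -d' ' -f1 > "$tmpdir/$(printf '%08d' "$i")"
        ) &
        i=$(( i + 1 ))
        [ $(( i % jobs )) -ne 0 ] || wait # let each batch of jobs finish
    done
    wait

    # Zero-padded file names make the glob expand in segment order.
    cat "$tmpdir"/* | md5sum | cut -d' ' -f1
    rm -rf "$tmpdir"
}

# Example: segmented_md5 nt.fa.gz > nt.fa.gz.pmd5
```

Because the final value depends on the segment size, a result can only be compared against another computed with the same segment size, which is one reason to keep such values apart from standard .md5 files.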

Limitations
These tools convert entire NCBI BLAST databases to FASTA, not subsets (such as taxonomic subsets). The good news is that much, if not most, of the speed-up of the scripts n2fz and pn2fz derives from the compression utilities pigz and zstd, which employ parallel processing very effectively. Simply piping FASTA output from blastdbcmd into pigz or (even better) zstd is likely to confer a major speed-up over using gzip to compress the output. With default parameters, zstd is not only faster than gzip but also compresses somewhat better.
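
A minimal sketch of that piping approach follows. It is not taken from n2fz or pn2fz; the function names are hypothetical, and the fallback to plain gzip when neither parallel compressor is installed is an added assumption. The zstd option -T0 asks it to use all available cores, and pigz uses multiple cores by default.

```shell
# Hypothetical helper: name the preferred compressor that is installed.
choose_compressor() {
    if command -v zstd >/dev/null 2>&1; then echo zstd
    elif command -v pigz >/dev/null 2>&1; then echo pigz
    else echo gzip
    fi
}

# Compress stdin to stdout with the named tool.
compress_stdin() {
    case "$1" in
        zstd) zstd -T0 -q -c ;;   # -T0: one worker thread per detected core
        pigz) pigz -c ;;
        *)    gzip -c ;;
    esac
}

# Matching decompressor, useful for spot-checking the result.
decompress_stdin() {
    case "$1" in
        zstd) zstd -d -q -c ;;
        pigz) pigz -d -c ;;
        *)    gzip -d -c ;;
    esac
}

# Example (requires blastdbcmd and an unpacked nt database):
#   blastdbcmd -db nt -entry all | compress_stdin zstd > nt.fa.zst
#   blastdbcmd -db nt -entry all | compress_stdin pigz > nt.fa.gz
```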

All of the parallel methods described here are likely to perform faster or more smoothly if the data are stored on fast, solid-state storage media. This of course makes the storage of voluminous datasets significantly more expensive.

To download any of the scripts listed below, click on its name. Then, in a terminal window, run the command "chmod a+x filename" to make the downloaded file executable, and mv the script to the desired directory.
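
The two steps above can be wrapped in a small function, sketched here. The function name install_script and the default destination $HOME/bin are assumptions; any directory on the PATH would serve.

```shell
# Hypothetical helper: make a downloaded script executable and move it into place.
install_script() {
    local script="$1" dest="${2:-$HOME/bin}"
    chmod a+x "$script" &&
        mkdir -p "$dest" &&
        mv "$script" "$dest/"
}

# Example: install_script ~/Downloads/n2fz
```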

NOTE: The NCBI monitors usage of its software by default. To opt out of this monitoring, set the variable BLAST_USAGE_REPORT=false in a file named .ncbirc stored in your home directory, as illustrated in the .ncbirc file posted here.
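
For those who prefer to create the file from the command line, a sketch follows. It assumes the setting belongs under a [BLAST] section header, as in the NCBI's BLAST documentation; the posted .ncbirc file remains the reference. The function name write_ncbirc is hypothetical, and it deliberately refuses to overwrite an existing file.

```shell
# Hypothetical helper: create a minimal .ncbirc that opts out of usage reporting.
# Assumption: the setting lives in a [BLAST] section.
write_ncbirc() {
    local rc="${1:-$HOME/.ncbirc}"
    if [ -e "$rc" ]; then
        echo "$rc already exists; add the setting by hand" >&2
        return 1
    fi
    printf '[BLAST]\nBLAST_USAGE_REPORT=false\n' > "$rc"
}

# Example: write_ncbirc    # creates ~/.ncbirc if it does not exist yet
```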

File     Description                                                         Last Modified
.ncbirc  Sample .ncbirc to opt-out of the NCBI monitoring your usage         2024-02-24T09:32:18
n2f      Dump an NCBI BLAST database to stdout in FASTA format               2025-05-27T11:09:58
n2fz     Convert an NCBI BLAST database to a single, compressed FASTA file   2025-05-27T11:10:58
pn2f     Dump a packed NCBI BLAST database to stdout in FASTA format         2025-05-27T11:11:40
pn2fz    Convert a packed BLAST database to a single, compressed FASTA file  2025-05-27T11:12:50
pmd5     Parallel calculation of MD5 checksum on a file                      2024-04-22T15:38:56

 Permission is granted to copy, modify and redistribute the code posted
 here as long as acknowledgement of the original author is maintained within.

 Send questions, comments, snide remarks to:

 Warren Gish
 gish@advbiocomp.com
 Advanced Biocomputing, LLC

For information about licensing AB-BLAST, see here.