NCBI Helper

Background
Posting of comprehensive sequence databases like nr and nt in FASTA format by the NCBI dates back to 1990. In January 2024, the NCBI announced that starting in April of that year, BLAST databases posted on its FTP server would no longer be accompanied by the same data in FASTA format. In recent years prior to this discontinuation, the posted FASTA databases included nr, nt, pdbaa and swissprot, but just a few years earlier, the FASTA databases also included env_nr, env_nt, mito, pataa, patnt, vector and several more.

Problem
FASTA is a simple, accessible, widely supported format for storing and processing sequence data, whereas BLAST databases are rather opaque and not readily usable by many tools. The difficulty of accessing data in this specialized format impedes experimentation and discovery. People who work with FASTA-format data and wish to use data that is only distributed in NCBI BLAST format need to install the NCBI BLAST+ software, download and store the BLAST databases of interest, unpack the tar.gz files, and convert the databases to FASTA format themselves, with the possible extra expense of compressing the FASTA for archival storage. In short, people must install software they otherwise might not need and store databases they otherwise don't need, all in order to run the NCBI-recommended, notoriously slow blastdbcmd tool to convert the BLAST-format data to FASTA format.

Mitigation
To speed up the conversion of NCBI BLAST databases to FASTA and to reduce its storage requirements, a few scripts are provided here. The scripts n2f and pn2f boost speed by parallelizing blastdbcmd using GNU Parallel.
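As a rough illustration of the idea (not the code of n2f or pn2f, which use GNU Parallel), the Python sketch below runs one blastdbcmd process per database volume concurrently and emits the results in volume order. It assumes that each volume (for example nr.00, nr.01) can be dumped on its own with "blastdbcmd -db VOLUME -entry all". The names blastdbcmd_args and dump_volumes, the temporary-file spooling and the default of four jobs are hypothetical choices made for this example.

```python
import shutil
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor


def blastdbcmd_args(volume):
    """Command line that dumps every sequence of one volume as FASTA."""
    return ["blastdbcmd", "-db", volume, "-entry", "all"]


def dump_volumes(volumes, out, build_args=blastdbcmd_args, jobs=4):
    """Convert volumes concurrently; write FASTA to `out` in volume order.

    `out` must be a binary file object. `build_args` is a hypothetical hook
    that maps a volume name to the command to run for it.
    """

    def run(volume):
        # Spool each job's output to an anonymous temporary file so that
        # concurrent jobs never interleave their records.
        spool = tempfile.TemporaryFile()
        subprocess.run(build_args(volume), check=True, stdout=spool)
        spool.seek(0)
        return spool

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # map() yields results in input order, whatever order jobs finish in.
        for spool in pool.map(run, volumes):
            with spool:
                shutil.copyfileobj(spool, out)
```

A call such as dump_volumes(["nr.00", "nr.01"], sys.stdout.buffer) would write both volumes to stdout; the volume names are placeholders.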

When compressed FASTA output is desired, the scripts n2fz and pn2fz obtain a further speed boost by using the parallel compression utilities zstd and pigz. With default parameters, zstd is not only faster than gzip but usually compresses somewhat better as well. Both compression utilities and GNU Parallel are readily available through standard software distribution channels such as Homebrew.
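The pipeline idea, a FASTA producer feeding a multithreaded compressor, can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of n2fz or pn2fz: the helper names compressor_args and compress_stream are hypothetical, and the only external options relied on are zstd's -T (worker threads, with 0 meaning auto-detect), -q and -c, and pigz's -p and -c.

```python
import subprocess


def compressor_args(tool="zstd", threads=0):
    """Command line for a multithreaded compressor that writes to stdout."""
    if tool == "zstd":
        # -T0 asks zstd to pick the number of worker threads itself.
        return ["zstd", "-q", "-c", "-T{}".format(threads)]
    if tool == "pigz":
        args = ["pigz", "-c"]
        if threads > 0:
            args += ["-p", str(threads)]
        return args
    raise ValueError("unsupported compressor: {}".format(tool))


def compress_stream(producer_args, compressor, out_path):
    """Pipe the producer's stdout through the compressor into out_path."""
    with open(out_path, "wb") as out:
        producer = subprocess.Popen(producer_args, stdout=subprocess.PIPE)
        packer = subprocess.Popen(compressor, stdin=producer.stdout, stdout=out)
        # Drop our copy of the pipe so the producer is signalled if the
        # compressor exits early.
        producer.stdout.close()
        packer_status = packer.wait()
        producer_status = producer.wait()
    if packer_status != 0 or producer_status != 0:
        raise RuntimeError("conversion or compression failed")
```

For example, compress_stream(["blastdbcmd", "-db", "swissprot", "-entry", "all"], compressor_args("zstd"), "swissprot.fa.zst") would stream a database straight into a compressed file without storing uncompressed FASTA; the output file name is a placeholder.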

Two of the scripts (pn2f and pn2fz) do not even require source BLAST databases to be unpacked prior to FASTA conversion, as these scripts work directly with “packed” databases in tar.gz files: Individual tar.gz components of the database are only transiently unpacked by the scripts for as long as necessary to convert the component to FASTA, before the unpacked copy is erased. The packed tar.gz files are not touched and remain intact.
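The transient-unpack pattern can be sketched in Python as below. It illustrates the idea only and is not the code of pn2f or pn2fz: the function names with_unpacked and convert_packed and the convert callback are hypothetical, components are handled one at a time for simplicity, and the archives are assumed to come from a trusted source because extraction is not sandboxed.

```python
import tarfile
import tempfile
from pathlib import Path


def with_unpacked(archive, convert):
    """Unpack one tar.gz component, run convert(directory), then clean up.

    The archive is opened read-only and never modified. The unpacked copy
    lives in a scratch directory that is deleted when this function returns,
    even if convert() raises.
    """
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)  # assumes a trusted archive
        return convert(Path(scratch))


def convert_packed(archives, convert):
    """Process components in order so only one is unpacked at any moment."""
    return [with_unpacked(archive, convert) for archive in archives]
```

Here convert would be whatever turns the unpacked files into FASTA, for instance a function that runs blastdbcmd on the volume found in the scratch directory.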

A script named pmd5 is provided here that can dramatically speed up the calculation of MD5 checksums of huge files, such as a single-file FASTA dump of nr or nt. It does so by calculating, in parallel, the MD5 checksums of non-overlapping segments of the file until the entire file has been processed. The resulting segment checksums are then themselves piped through MD5, yielding a single checksum that is reported for the entire file. Because this is a checksum of checksums, it will not match the standard MD5 checksum of the same file. The script also supports a “-c filename” option to recompute the checksum and compare it against a previously stored result. It is recommended to store pmd5 output in a file with a .pmd5 extension, to distinguish it from a standard MD5 checksum, which is often stored with a .md5 extension.
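The checksum-of-checksums scheme can be sketched in Python as follows. This is an illustration, not the pmd5 script itself: the segment size, the newline-terminated hexadecimal form in which segment digests are fed to the outer MD5, and the function names are all assumptions, so its output should not be expected to match that of pmd5.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

SEGMENT_BYTES = 64 * 1024 * 1024  # hypothetical segment size
BLOCK_BYTES = 1024 * 1024         # read granularity within a segment


def segment_md5(path, offset, length):
    """Hex MD5 of the bytes in [offset, offset + length) of the file."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        handle.seek(offset)
        remaining = length
        while remaining > 0:
            block = handle.read(min(BLOCK_BYTES, remaining))
            if not block:  # the last segment may be shorter than `length`
                break
            digest.update(block)
            remaining -= len(block)
    return digest.hexdigest()


def parallel_md5(path, segment_bytes=SEGMENT_BYTES, jobs=None):
    """MD5 of the per-segment MD5 digests, taken in file order."""
    offsets = range(0, os.path.getsize(path), segment_bytes)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # hashlib releases the GIL while hashing large buffers, so threads
        # can hash different segments at the same time.
        digests = list(pool.map(
            lambda offset: segment_md5(path, offset, segment_bytes), offsets))
    outer = hashlib.md5()
    for hexdigest in digests:
        outer.update((hexdigest + "\n").encode("ascii"))
    return outer.hexdigest()
```

Because the result depends on the segment size, a stored value can only be verified by recomputing it with the same segment size.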

Limitations
These tools can only convert entire NCBI BLAST databases to FASTA, not subsets thereof. However, besides the enormous nr and nt, the NCBI distributes several subset databases, any of which can be converted to FASTA with these scripts.

You will likely need to edit the *n2f* scripts to indicate the particular directory in which you have stored the BLAST databases (packed and/or unpacked) and the directory where you would like any compressed FASTA output files to be saved. These directories can alternatively be configured with environment variables set automatically in login scripts.

zstd typically operates considerably faster than gzip while compressing somewhat better (with default parameters).

All of the parallel methods described here are likely to perform faster and more smoothly if the input and output data are stored on fast, solid-state storage media. This of course makes storage of the datasets more expensive.


To download any of the scripts listed below, click on its name. Then, in a terminal window, run the command "chmod a+x filename" to make the downloaded file executable, and mv the script to the desired directory. If pushover is installed and you have registered for the service, it can be used to send notifications of the start and end of lengthy jobs via the Pushover smartphone app.

NOTE: The NCBI monitors usage of its software by default. To opt out of monitoring, set the variable BLAST_USAGE_REPORT=false in a file named .ncbirc stored in your home directory as illustrated in the .ncbirc file posted below. (If this file is downloaded, the leading dot will likely be stripped by your browser.)

File      Description                                                         Last Modified
n2lib     Shared script library required by the n2f scripts below             2026-06-01T05:59:17
n2f       Dump an NCBI BLAST database to stdout in FASTA format               2026-06-07T11:58:58
n2fz      Convert an NCBI BLAST database to a single, compressed FASTA file   2026-06-01T06:13:46
pn2f      Dump a packed NCBI BLAST database to stdout in FASTA format         2026-06-01T06:13:41
pn2fz     Convert a packed BLAST database to a single, compressed FASTA file  2026-06-01T06:13:50
pmd5      Parallel calculation of MD5 checksum of a huge file                 2026-05-29T17:41:07
pushover  Sample Pushover smartphone notification script                      2026-05-29T18:31:08
.ncbirc   Sample .ncbirc to opt out of NCBI monitoring your usage             2024-02-24T09:32:18
 Permission is granted to copy, modify and redistribute the code posted
 here as long as acknowledgement of the original author is maintained within.

 Send questions, comments, snide remarks to:

 Warren Gish
 gish@advbiocomp.com
 Advanced Biocomputing, LLC

For information about licensing AB-BLAST, see here.