Background
Posting of comprehensive sequence databases like nr and nt in FASTA format
by the NCBI dates back to 1990.
In January 2024, the NCBI
announced
that starting in April of that year,
BLAST databases posted on its FTP server would no longer
be accompanied by the same data in FASTA format.
In recent years prior to this discontinuation,
the posted FASTA databases included
nr, nt, pdbaa and swissprot,
but just a few years earlier, the FASTA databases also included
env_nr, env_nt, mito, pataa, patnt, vector
and several more.
Problem
FASTA format is a simple, accessible, widely supported format for storing and processing
sequence data,
whereas BLAST databases are rather opaque and not readily usable by many tools.
The difficulty of accessing data
in this specialized format impedes experimentation and discovery.
People who work with FASTA-format data and wish to utilize
data that is only distributed in NCBI BLAST format
need to install the NCBI Toolbox software,
download and store the BLAST databases of interest,
unpack the tar.gz files,
and convert the databases to FASTA format themselves,
possibly with the extra expense of compressing the FASTA for archival storage.
In short, people must install software they might not otherwise need
and store databases they do not otherwise need,
just to run the NCBI-recommended
but notoriously slow blastdbcmd tool
that converts BLAST-format data to FASTA format.
Mitigation
To speed up the conversion of NCBI BLAST databases to FASTA
and reduce its storage requirements,
a few scripts are provided here.
The scripts n2f and pn2f
boost speed by parallelizing blastdbcmd using
GNU Parallel.
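The idea of running several blastdbcmd jobs side by side can be sketched as follows. This is only an illustration in Python, not the n2f or pn2f scripts themselves: it uses a Python thread pool in place of GNU Parallel, and it assumes the work is split by database volume (for example nr.00, nr.01), which may not be how the scripts divide it. The function names, the .fa output naming and the job count are hypothetical; the blastdbcmd options shown (-db, -entry all, -out) are standard ones.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def dump_command(volume, out_dir):
    """Build the blastdbcmd call that dumps one volume (e.g. 'nr.00') to FASTA."""
    # Assumption: each volume is dumped to its own file named <volume>.fa.
    return ["blastdbcmd", "-db", volume, "-entry", "all",
            "-out", f"{out_dir}/{volume}.fa"]


def dump_volumes(volumes, out_dir, jobs=4):
    """Run one blastdbcmd per volume, at most `jobs` at a time."""
    commands = [dump_command(v, out_dir) for v in volumes]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # list() forces every job to finish, so a failed blastdbcmd raises here.
        list(pool.map(lambda c: subprocess.run(c, check=True), commands))
    # Return the FASTA paths that were written, in volume order.
    return [c[-1] for c in commands]
```

A call such as dump_volumes(["nr.00", "nr.01"], "/path/to/fasta") would then leave one FASTA file per volume, which can be concatenated afterwards if a single file is wanted.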
When compressed FASTA output is desired, the scripts n2fz and pn2fz obtain a further speed boost by using the parallel compression utilities zstd and pigz. zstd is not only faster than gzip but usually also compresses somewhat better with default parameters. Both compression utilities and GNU Parallel are readily available through standard software distribution channels such as Homebrew.
Two of the scripts (pn2f and pn2fz) do not even require the source BLAST databases to be unpacked prior to FASTA conversion, as these scripts work directly with “packed” databases in tar.gz files. Each tar.gz component of the database is unpacked only transiently, for as long as it takes to convert that component to FASTA, and the unpacked copy is then erased. The packed tar.gz files are not touched and remain intact.
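The transient-unpacking pattern described above can be sketched in Python as follows. This is a minimal illustration, not the code of pn2f or pn2fz: the function name with_transient_unpack and the convert callback are hypothetical stand-ins for whatever conversion step is run on the unpacked component.

```python
import tarfile
import tempfile
from pathlib import Path


def with_transient_unpack(archive, convert):
    """Unpack one tar.gz into a scratch directory, run `convert`, then erase the copy."""
    with tempfile.TemporaryDirectory() as scratch:
        # The archive is opened read-only, so the packed file is never modified.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        # `convert` is a hypothetical callback, e.g. a wrapper around blastdbcmd.
        return convert(Path(scratch))
    # Leaving the `with` block deletes the scratch directory and its contents.
```

Because only one component is unpacked per call, the extra disk space needed at any moment is roughly the unpacked size of the components being processed at that moment, rather than of the whole database.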
A script named pmd5 is provided here that can dramatically speed up the calculation of MD5 checksums of huge files, such as a single-file FASTA dump of nr or nt. It first calculates, in parallel, the MD5 checksums of non-overlapping segments of the file until the entire file has been processed. The resulting segment checksums are then themselves piped through MD5, yielding a single MD5 checksum that is reported for the entire file. The script also supports a “-c filename” option to recompute the checksum and compare it against a previously stored result. It is recommended to store pmd5 output in a file with a .pmd5 extension, to distinguish this checksum-of-checksums from a standard MD5 checksum, which is often stored with a .md5 extension.
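The checksum-of-checksums idea can be sketched in Python as follows. This is an illustration of the technique only and will not reproduce pmd5's output: the segment size, the worker count and the way segment digests are joined (one hex digest per line) are all assumptions, as are the function names.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

SEGMENT_BYTES = 64 * 1024 * 1024  # hypothetical segment size; pmd5 may use another


def md5_of_segment(path, offset, length, block=1024 * 1024):
    """Return the hex MD5 of `length` bytes of `path` starting at `offset`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        handle.seek(offset)
        remaining = length
        while remaining > 0:
            data = handle.read(min(block, remaining))
            if not data:
                break
            digest.update(data)
            remaining -= len(data)
    return digest.hexdigest()


def checksum_of_checksums(path, segment_bytes=SEGMENT_BYTES, workers=4):
    """Hash non-overlapping segments in parallel, then hash the ordered digests."""
    size = os.path.getsize(path)
    offsets = range(0, size, segment_bytes)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in segment order, keeping the result deterministic.
        digests = list(pool.map(
            lambda off: md5_of_segment(path, off, min(segment_bytes, size - off)),
            offsets))
    # Assumption: digests are joined one per line before the final MD5 pass.
    combined = "".join(d + "\n" for d in digests)
    return hashlib.md5(combined.encode("ascii")).hexdigest()
```

The result depends on the segment size, so a stored value can only be verified by recomputing it with the same segmentation. This is one reason to keep such values apart from standard MD5 checksums.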
Limitations
These tools can only convert entire NCBI BLAST databases
to FASTA, not subsets thereof.
However, besides the enormous nr and nt,
the NCBI distributes several subset databases, any of which can be converted
to FASTA with these scripts.
You will likely need to edit the *n2f* scripts to indicate the particular directory in which you have stored the BLAST databases (packed and/or unpacked) and the directory where you would like any compressed FASTA output files to be saved. These directories can alternatively be configured with environment variables set automatically in login scripts.
All of the parallel methods described here are likely to perform faster and more smoothly if the input and output data are stored on fast, solid-state storage media. This of course makes storage of the datasets more expensive.
To download any of the scripts listed below, click on its name. Then, in a terminal window, run the command "chmod a+x filename" to make the downloaded file executable, and mv the script to the desired directory. If pushover is installed and you have registered for the service, it can send notifications of the start and end of lengthy jobs via the Pushover smartphone app.
NOTE: The NCBI monitors usage of its software by default. To opt out of monitoring, set the variable BLAST_USAGE_REPORT=false in a file named .ncbirc stored in your home directory as illustrated in the .ncbirc file posted below. (If this file is downloaded, the leading dot will likely be stripped by your browser.)