Background
On January 25, 2024,
the NCBI
announced
that starting in April,
NCBI BLAST databases posted on its FTP server will no longer
be accompanied by the same data in FASTA format.
In recent years, the posted FASTA databases have included
nr, nt, pdbaa and swissprot,
but just a few years prior, the FASTA databases also included
env_nr, env_nt, mito, pataa, patnt, vector
and several more.
The NCBI's posting of nr, nt, swissprot and other databases in FASTA format
dates back to 1990.
Problem
FASTA format is a simple, accessible, widely supported format for storing and processing
sequence data,
whereas
NCBI BLAST databases cannot be readily processed by many existing tools
and the difficulty of accessing data
in this specialized format impedes experimentation and discovery.
Henceforth, people who work with FASTA-format data and wish to use
data available only in NCBI BLAST format
will need to install the NCBI BLAST software,
download and store the NCBI BLAST databases of interest,
unpack the tar.gz files,
and convert the databases to FASTA format themselves,
optionally compressing the result for archival storage.
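The manual steps above (after installing the software and downloading the archives) can be sketched as a small shell function. This is only an illustration, not one of the scripts provided here: the function name unpack_and_convert, the directory arguments, and the assumption that archives are named like nr.00.tar.gz or swissprot.tar.gz are all hypothetical. The blastdbcmd options used (-db, -entry all, -out) are its standard ones for dumping an entire database as FASTA.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the posted scripts): unpack every
# downloaded archive of a database, then dump the whole database as FASTA.
# Usage: unpack_and_convert DB SRC_DIR WORK_DIR
unpack_and_convert() {
    local db=$1 src_dir=$2 work_dir=$3
    local archive
    mkdir -p "$work_dir" || return 1
    # Assumed naming: DB.tar.gz or DB.NN.tar.gz (e.g. nr.00.tar.gz)
    for archive in "$src_dir/$db".*tar.gz; do
        if [ ! -e "$archive" ]; then
            echo "no archives found for $db in $src_dir" >&2
            return 1
        fi
        tar -xzf "$archive" -C "$work_dir" || return 1
    done
    # Run from the unpacked directory so blastdbcmd can find the database
    ( cd "$work_dir" && blastdbcmd -db "$db" -entry all -out "$db.fa" )
}
```

For example, "unpack_and_convert swissprot ~/downloads ~/blastdb" would leave swissprot.fa in ~/blastdb alongside the unpacked database files. Note that this serial approach needs enough storage for the archives, the unpacked database and the FASTA output all at once.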
Mitigation
To simplify the FASTA conversion process and
to improve on the speed and storage requirements
of the naïve, slow approach offered by the NCBI in its announcement,
a few scripts are provided here.
These scripts can utilize multiple CPU cores to accelerate
both the conversion to FASTA and optional data compression.
Additional third-party tools are required (GNU parallel,
pigz and zstd),
along with the NCBI BLAST software (blastdbcmd), of course.
If pushover is installed, it is used to send notifications when a job starts and ends.
The scripts posted here are configured to use pigz, but by uncommenting the
appropriate lines, they can be made to work with the faster, more efficient zstd.
Two of the scripts (pn2f and pn2fz) do not require the NCBI BLAST database to be unpacked in advance of conversion to FASTA. This obviates the need for storage sufficient to hold the entire NCBI BLAST database in its more voluminous unpacked (untarred) form. Instead, these scripts work directly with “packed” NCBI BLAST databases in tar.gz files. Components of the database are unpacked transiently, only for as long as necessary, and the unpacked data are then removed. The packed tar.gz files are not touched and remain intact.
A script named pmd5 is provided here that can dramatically speed up the calculation of MD5 checksums. It accomplishes this by first calculating MD5 checksums in parallel on individual segments of a specified file until the entire file has been processed. The resultant list of MD5 checksums is then piped through MD5 once more, to calculate a single MD5 checksum that is reported for the entire file. The script also supports a “-c filename” option to reevaluate and compare a new result against a previously stored result. It is recommended to store pmd5 output in a file with a .pmd5 extension to its name, to distinguish this checksum-of-checksums from a standard MD5 checksum often stored with a .md5 extension.
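The checksum-of-checksums idea can be sketched as follows. This is not the pmd5 script itself and its output should not be expected to match pmd5's: the function name segmented_md5, the default 64 MiB segment size and the default of 4 jobs are assumptions made for illustration, and xargs -P stands in for GNU parallel so that the sketch needs only md5sum, dd and other common utilities.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a segmented, parallel MD5 (not the posted pmd5).
# Usage: segmented_md5 FILE [SEGMENT_BYTES] [JOBS]
segmented_md5() {
    local file=$1
    local seg_bytes=${2:-$((64 * 1024 * 1024))}   # assumed segment size
    local jobs=${3:-4}                            # assumed parallelism
    local size nseg
    size=$(wc -c < "$file") || return 1
    nseg=$(( (size + seg_bytes - 1) / seg_bytes ))
    [ "$nseg" -gt 0 ] || nseg=1                   # empty file: one empty segment
    # Hash each segment in parallel, tagging each result with its index
    seq 0 $((nseg - 1)) |
        xargs -P "$jobs" -I{} sh -c '
            sum=$(dd if="$1" bs="$2" skip="$3" count=1 2>/dev/null |
                  md5sum | cut -d" " -f1)
            printf "%s %s\n" "$3" "$sum"' sh "$file" "$seg_bytes" {} |
        # Restore segment order, then hash the list of per-segment hashes
        sort -n | cut -d' ' -f2 | md5sum | cut -d' ' -f1
}
```

Because the final value depends on the segment size, a stored result can only be verified later by rerunning with the same segment size. That dependence is one more reason to keep such results apart from standard MD5 checksums.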
Limitations
These tools convert entire NCBI BLAST databases to FASTA, not subsets
(such as taxonomic subsets).
The good news is that much, if not most, of the speed-up
from the scripts n2fz and pn2fz derives
from their use of the compression utilities
pigz and zstd,
which employ parallel processing very effectively.
Just piping FASTA output from blastdbcmd
into pigz or (even better) zstd
is likely to confer a major speed-up over using gzip to compress the output.
zstd not only operates faster than gzip
but also compresses somewhat better with default parameters.
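A minimal sketch of that simple pipe is shown below. The function name blastdb_to_fasta and the output file names are hypothetical and are not taken from the posted scripts. The zstd option -T0 asks it to use all available cores, while pigz uses all cores by default.

```shell
#!/usr/bin/env bash
# Hypothetical helper: stream an entire BLAST database to compressed FASTA.
# Usage: blastdb_to_fasta DB [zstd|pigz|gzip]
blastdb_to_fasta() {
    local db=$1 compressor=${2:-zstd}
    case $compressor in
        # zstd: -T0 = use all cores, -c = write to stdout
        zstd) blastdbcmd -db "$db" -entry all | zstd -T0 -c > "$db.fa.zst" ;;
        # pigz: parallel gzip, output remains gzip-compatible
        pigz) blastdbcmd -db "$db" -entry all | pigz -c > "$db.fa.gz" ;;
        # gzip: single-threaded baseline for comparison
        gzip) blastdbcmd -db "$db" -entry all | gzip -c > "$db.fa.gz" ;;
        *)    echo "unknown compressor: $compressor" >&2; return 2 ;;
    esac
}
```

Running "blastdb_to_fasta swissprot pigz" from the directory holding the unpacked database would write swissprot.fa.gz, which any gzip-aware tool can read. The zstd variant writes swissprot.fa.zst instead.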
All of the parallel methods described here are likely to perform faster or more smoothly if the data are stored on fast, solid-state storage media. This of course makes the storage of voluminous datasets significantly more expensive.
To download any of the scripts listed below, click on its name. Then, in a terminal window, run the command "chmod a+x filename" to make the downloaded file executable, and mv the script to the desired directory.
NOTE: The NCBI monitors usage of its software by default. To opt out of this monitoring, set the variable BLAST_USAGE_REPORT=false in a file named .ncbirc stored in your home directory, as illustrated in the .ncbirc file posted here.
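As a rough sketch, the setting could be added from the command line as shown below. The function name disable_blast_usage_report is hypothetical, and the [BLAST] section header reflects my understanding of the NCBI configuration file layout; the posted .ncbirc file remains the authoritative example. To avoid damaging an existing configuration, the function refuses to modify a file that already contains a [BLAST] section or the variable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append the usage-report opt-out to an .ncbirc file.
# Usage: disable_blast_usage_report [PATH]   (default: ~/.ncbirc)
disable_blast_usage_report() {
    local rc=${1:-$HOME/.ncbirc}
    # Do not touch a file that already has the section or the variable
    if [ -f "$rc" ] && grep -q -e '^\[BLAST\]' -e '^BLAST_USAGE_REPORT' "$rc"; then
        echo "$rc already has BLAST settings; please edit it by hand" >&2
        return 1
    fi
    # Assumed layout: the variable sits under a [BLAST] section header
    printf '[BLAST]\nBLAST_USAGE_REPORT=false\n' >> "$rc"
}
```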