
Frequently Asked Questions: Indexing of Sequence Identifiers

Last updated: 2002-09-26

What sequence identifiers do xdformat and xdget recognize?
Are Accessions from the DDBJ/EBI/NCBI collaboration treated specially?
What is a compound identifier?
What is a compound definition?
Can I use my own identifiers?
Are all identifiers indexed in a compound definition line?
How can indexing and retrieval be restricted to my own identifiers?
What is a “redundant” identifier?
What is a “duplicate” identifier?
What are “qualified” and “unqualified” identifiers?
How are unqualified identifiers looked up in the index?
How can an entire class of identifiers be omitted from the index if it is not needed?
Can new sequences be appended to an existing XDF database – and will they be indexed?
How are ACCESSION.VERSION identifiers managed when indexing?
What limitations exist on identifier indexes?

What sequence identifiers do xdformat and xdget recognize?

All NCBI standard FASTA sequence identifiers (NSIDs) are supported for indexing. User-definable, uncontrolled identifiers (UCIDs) consisting of arbitrary text strings are also supported. The complete list of NSIDs is presented in Table 1. Note: the NSIDs include three user types denoted by the tags: lcl, gnl, and oth. In contrast, identifiers of the flexible UCID class do not use any tags. For a more complete description of UCIDs, see below.

Table 1. The complete list of NCBI standard FASTA sequence identifiers

Tag and Identifier Syntax	Identifier Source Description
bbm\|integer	NCBI GenInfo Backbone database identifier
bbs\|integer	NCBI GenInfo Backbone database identifier
dbj\|coll-accession\|locus	DNA Database of Japan
emb\|coll-accession\|entry	EBI EMBL Database
gb\|coll-accession\|locus	NCBI GenBank database
gi\|integer	NCBI GenInfo Integrated Database (“jee-aye”)
gim\|integer	NCBI GenInfo Import identifier
gnl\|database\|idstring	General (user-definable) database and identifier
gp\|coll-accession\|locus_cds#	GenPept (GenBank protein) identifier
lcl\|integer	Local (user-definable) identifier
oth\|accession\|name\|release	Other (user-definable) identifier*
pat\|country\|patentid\|serialno	Patent sequence identifier
pdb\|entry\|chainid	Brookhaven Protein Database
pir\|accession\|entry	Protein Information Resource International
prf\|accession\|name	Protein Research Foundation
ref\|coll-accession\|locus	NCBI RefSeq
sp\|coll-accession\|locus	SWISS-PROT database
tpd\|coll-accession\|name	Third party annotation, DDBJ
tpe\|coll-accession\|name	Third party annotation, EMBL
tpg\|coll-accession\|name	Third party annotation, GenBank

*The NCBI has discontinued support for “oth” identifiers, but support for them is maintained in xdformat/xdget.

Are Accessions from the DDBJ/EBI/NCBI collaboration treated specially?

Yes. The term “accession” appears in several of the identifiers described above, but Accessions assigned by the International Nucleotide Sequence Database Collaboration between the DDBJ, EBI (EMBL) and NCBI are guaranteed unique by these organizations. To reflect their special nature, the collaboration’s Accessions are labeled coll-accession in Table 1. These Accessions are all treated as belonging to the same identifier name space. Consequently, xdget can retrieve a sequence by Accession (or rather coll-accession) without having to know which of the collaborating organizations assigned the identifier. Locus and Entry identifiers do not work this way, however, because the uniqueness of these identifiers is not controlled between the collaborators.

What is a compound identifier?

A compound identifier is a concatenation of multiple NCBI standard FASTA sequence identifiers (NSIDs) each separated from the next by a single vertical-bar character, ‘|’ (also known as the “logical-or”, “pipe”, “pling”, “gozinta” or “pipesinta” character). White space (e.g., one or more blank or tab characters) is used to delimit the identifier string from the accompanying sequence description.

Here is an example of a definition line containing a simple or atomic sequence identifier:

>gi|12346 hypothetical protein 185 – wheat chloroplast

Here is an example of a compound identifier, containing both a gi and a gp (GenPept) identifier:

>gi|12346|gp|CAA44030.1|CHTAHSRA_4 hypothetical protein 185 – wheat chloroplast

While the order of identifiers in a compound identifier is technically irrelevant, gi identifiers typically appear first.

What is a compound definition?

A compound definition is a concatenation of multiple component definitions, each separated from the next by a single Control-A character (sometimes symbolized ^A; hex 0x01; or ASCII SOH [start of header]). Compound definitions are frequently seen in “nr” (quasi-non-redundant) databases, where multiple instances of the exact same sequence are replaced by a single instance of the sequence with a concatenated definition line. Note: each component of a compound definition begins with an identifier which itself may be compound.
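The two levels of structure described above (components separated by Control-A, and an identifier string separated from its description by white space) can be illustrated with a minimal Python sketch. This is not code from xdformat; the function name `split_definition` and the second component of the example line are made up for illustration.

```python
CTRL_A = "\x01"  # ASCII SOH, the separator between component definitions


def split_definition(defline):
    """Return one (identifier_string, description) pair per component definition."""
    if defline.startswith(">"):
        defline = defline[1:]
    pairs = []
    for component in defline.split(CTRL_A):
        # White space delimits the identifier string from the description.
        parts = component.split(None, 1)
        ident = parts[0] if parts else ""
        desc = parts[1] if len(parts) > 1 else ""
        pairs.append((ident, desc))
    return pairs


# Hypothetical compound definition: the second component is invented for this example.
example = ">gi|12346|gp|CAA44030.1|CHTAHSRA_4 hypothetical protein 185\x01gi|99999 another copy"
for ident, desc in split_definition(example):
    print(ident, "=>", desc)
```

Each identifier string returned here may itself be compound and would still need to be split on vertical bars.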

Can I use my own identifiers?

Yes, xdformat can index uncontrolled identifiers of your choosing (UCIDs), either alone or in combination with NCBI standard FASTA sequence identifiers (NSIDs). A UCID consists merely of a non-blank string of text, lacking any identifier tag that would be required of an NSID.

UCIDs are subject to a few restrictions:

only one UCID is permitted per definition line (or per definition line component in a compound definition line);
a UCID must appear last, following any NSIDs in a compound identifier;
no escape character exists for including special characters;
UCIDs must not contain the following special characters:

white space (e.g., blank, tab or linefeed characters);
a vertical-bar character (‘|’);
control-A character (hexadecimal byte 0x01; or ASCII SOH [start of header]), which is used to delimit component definitions in a compound definition line.

The purpose of imposing the above restrictions on UCIDs is to aid in the detection of syntax errors on input.

When an error is encountered in the left-to-right parsing of a string of identifiers, parsing stops and all subsequent identifiers in the current identifier string are ignored. Any identifiers parsed correctly prior to the error are indexed. In the case of a compound definition line, parsing and indexing resume at the identifier string in the next component definition. Regardless of whether any syntax errors are detected in the identifiers, the entire definition line will be stored in the XDF database “as is”.

Here are a few examples of definition lines whose identifiers will all be completely parsed and indexed. All but the first two examples contain a compound identifier.

>gi|12346

>MYID001 my first sequence (NOTE: UCID is acceptable as the first identifier, iff it is the only identifier in the string)

>gi|5902966|gp|AAD55586.1|AF055084_1 very large GPCR-1 [Homo sapiens]

>gp|AAD55586.1|AF055084_1|gi|5902966 (NOTE: order of NSIDs is unimportant)

>gp|AAD55586.1|AF055084_1| very large GPCR-1 [Homo sapiens] (NOTE: vertical-bar is acceptable at end of identifier string)

>gp|AAD55586.1|AF055084_1|gi|5902966|MYID001 my first sequence (NOTE: UCID at end of identifier string will be properly indexed)

Here are a few examples of improperly constructed strings that will cause an identifier – or the entire string of identifiers – to be omitted from the index.

>gi|5902966|gp|AAD55586.1 very large GPCR-1 [Homo sapiens] (NOTE: gp identifier is missing the locus token and will be skipped)

>fb|AAD55586.1|AF055084_1|gi|5902966 (NOTE: unrecognized tag “fb”; none of the identifiers will be indexed)

>gi|5902966|MYID001|gp|AAD55586|AF055084_1 (NOTE: UCID not listed last; gp identifier will not be indexed)

>MYID001|gp|AAD55586.1|AF055084_1|gi|5902966 (NOTE: UCID not listed last; none of the subsequent identifiers will be indexed)
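The left-to-right parsing rules illustrated by the examples above can be approximated with the following Python sketch. It is only a model of the documented behavior, not the xdformat implementation. The field counts are read off Table 1; the function name and the “user” label for UCIDs are illustrative. One assumption goes beyond the text: a UCID that is not listed last is treated as an error and is not itself indexed, a case the source does not settle.

```python
# Number of data fields that follow each tag, as listed in Table 1.
NSID_FIELDS = {
    "bbm": 1, "bbs": 1, "dbj": 2, "emb": 2, "gb": 2, "gi": 1, "gim": 1,
    "gnl": 2, "gp": 2, "lcl": 1, "oth": 3, "pat": 3, "pdb": 2, "pir": 2,
    "prf": 2, "ref": 2, "sp": 2, "tpd": 2, "tpe": 2, "tpg": 2,
}


def parse_identifier_string(ident):
    """Return (parsed, error): identifiers accepted before any error, and a message or None."""
    tokens = ident.split("|")
    parsed = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        is_last = i == len(tokens) - 1
        if token in NSID_FIELDS:
            n = NSID_FIELDS[token]
            fields = tokens[i + 1 : i + 1 + n]
            if len(fields) < n:
                # Too few fields: stop, keeping what was parsed so far.
                return parsed, f"{token}: expected {n} field(s), found {len(fields)}"
            parsed.append((token, tuple(fields)))
            i += 1 + n
        elif is_last:
            if token:  # an empty final token is just a trailing vertical bar
                parsed.append(("user", (token,)))
            i += 1
        else:
            # Unrecognized tag, or a UCID that is not last (assumption: not indexed).
            return parsed, f"unexpected token {token!r} before end of string"
    return parsed, None


print(parse_identifier_string("gp|AAD55586.1|AF055084_1|gi|5902966|MYID001"))
print(parse_identifier_string("gi|5902966|gp|AAD55586.1"))
```

The sketch does not validate field contents (for example, that a gi is an integer); it only checks tags and field counts.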

Are all identifiers indexed in a compound definition line?

Yes, assuming no parse errors are encountered in any of the identifier strings among all component definition lines, all of the identifiers are indexed by default. If only a subset of identifier types needs to be indexed for later use in retrieval, indexing can be restricted to a subset of types with one or more ‑T specifications on the xdformat command line. Similarly, indexed retrieval can be restricted to a subset of identifier types by specifying one or more ‑T specifications on the xdget command line. Of course, ‑T restrictions are only effective if the corresponding identifiers actually appear in the database.

Any ‑T index restrictions imposed during database creation on the xdformat command line automatically (and unconditionally) remain in effect during appends of additional data to the same database; the restrictions need not be replicated on the xdget command line unless even tighter restrictions are desired during retrieval. Tighter restrictions upon retrieval can be obtained by specifying a subset of the ‑T restrictions originally indicated on the xdformat command line.

The size of the index and the speed of index creation and retrieval will be improved by limiting the index to those identifiers of interest.

NOTE: The left-to-right order of multiple ‑T specifications may be important in future versions of xdformat and xdget.

How can indexing and retrieval be restricted to my own identifiers?

Just as the ‑T<tag> option can be used to restrict indexing and retrieval to a subset of NSIDs, the special tag specification ‑Tuser will restrict indexing to UCIDs. NSID and UCID restrictions can be combined on the same command line. For example, “xdformat ‑Tuser ‑Tgi …” will restrict indexing to UCIDs and NCBI gi identifiers.

What is a “redundant” identifier?

When the definition line for a single sequence record contains multiple instances of the same identifier within the same name space, each instance following the first is called redundant. Redundant identifiers may appear in the same or different components of a compound definition line. Depending on circumstances, redundant identifiers may or may not be problematic, because they all refer to (are associated with) the same sequence record.

The xdformat program reports redundant identifiers.

What is a “duplicate” identifier?

When a database contains instances of the same identifier in a name space in different sequence records, the identifiers are called duplicate. Duplicate identifiers are more prone to being problematic than redundant identifiers, because the association between database records (sequences) and duplicate identifiers is not unique. An identifier can be both redundant and duplicate.

The xdformat program reports duplicate identifiers.
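The distinction between redundant and duplicate identifiers can be made concrete with a short Python sketch. It assumes each sequence record has already been reduced to a list of (name space, identifier) pairs; the function name and the sample data are invented for illustration and say nothing about how xdformat detects or reports these cases.

```python
from collections import defaultdict


def classify(records):
    """records: one list of (namespace, identifier) pairs per sequence record.

    Returns (redundant, duplicate) as sets of (namespace, identifier) keys.
    """
    seen_in = defaultdict(set)  # key -> indices of the records that contain it
    redundant = set()
    for idx, idents in enumerate(records):
        seen_here = set()
        for key in idents:
            if key in seen_here:
                redundant.add(key)  # repeated within the same record
            seen_here.add(key)
            seen_in[key].add(idx)
    # A key found in more than one record is a duplicate.
    duplicate = {key for key, recs in seen_in.items() if len(recs) > 1}
    return redundant, duplicate


# Made-up records: gi 1 is both redundant (twice in record 0) and duplicate (records 0 and 2).
records = [
    [("gi", "1"), ("gi", "1"), ("user", "A")],
    [("gi", "2"), ("user", "A")],
    [("gi", "1")],
]
redundant, duplicate = classify(records)
print(sorted(redundant), sorted(duplicate))
```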

What are “qualified” and “unqualified” identifiers?

A qualified identifier is one that conforms to the NCBI standard FASTA identifier (NSID) syntax outlined in Table 1. An unqualified identifier is just a bare word, lacking any indication of the database domain or name space in which it was assigned. For instance, while “U38670” could represent a GenBank Accession, it might also be an uncontrolled identifier (UCID). The string “gb|U38670|” tells us unambiguously that the identifier is a GenBank Accession.

Table 2. Examples of unqualified and qualified identifiers

Unqualified ID	Qualified ID	Interpretation
U85245	gb\|U85245\|	U85245 is a GenBank ACCESSION
1857636	gi\|1857636	1857636 is a GenBank gi identifier
HSU85245	gb\|\|HSU85245	HSU85245 is a GenBank LOCUS
AF218085.2	gb\|AF218085.2\|	AF218085.2 is a GenBank ACCESSION.VERSION
P18646	sp\|P18646\|	P18646 is a SWISS-PROT ACCESSION
11S3_HELAN	sp\|\|11S3_HELAN	11S3_HELAN is a SWISS-PROT ENTRY name
A00008	pir\|A00008\|	A00008 is a PIR accession

Note that all fields in a qualified identifier must be accounted for by vertical-bars, but all fields need not contain data. A field can be left empty if its value is unset or unknown. Furthermore, retrieval of the corresponding database entry will succeed if one or more fields in a qualified identifier are instantiated.

How are unqualified identifiers looked up in an index?

First of all, it is important to know that when indexing, all identifiers are assigned to a specific name space, with unqualified or uncontrolled identifiers in the UCID class being assigned to an ad hoc “user” name space. The xdformat and xdget programs maintain an internal priority list of the possible name spaces. When provided with an unqualified identifier, xdget works its way down the priority list, successively looking for the requested identifier in each name space. The program stops at the name space in which the first matching identifier is found; any further work the program must do (e.g., to identify the earliest appearance of the identifier in the database) will be performed in this one name space.

Name spaces are examined in the decreasing priority order shown in Table 3. The qualifiers 1 and 2 on any given tag correspond respectively to the 1st and 2nd fields in the tag’s full identifier syntax. Note that the nonstandard “accession” tag may be used with the -T option as a synonym for the unified name space of Accessions assigned by the DDBJ/EBI/NCBI collaboration. The nonstandard tags “locus” and “entry” are both synonyms for the 2nd field in all dbj, emb, gb, gp, ref, and sp identifiers, although xdformat actually stores these identifiers in four distinct name spaces; xdget then looks up unqualified identifiers using the priority list in Table 3.

Table 3. Priority order of identifier name spaces, from highest to lowest

*-T<tag>*	Description	Synonyms
user	Uncontrolled UCID class
lcl
gi
dbj1, emb1, gb1, gp1, sp1, ref1	DDBJ/EMBL/GenBank Accession*	-Taccession
gb2, gp2, ref2	GenBank locus	-Tlocus, -Tentry
emb2	EMBL ID	-Tlocus, -Tentry
dbj2	DDBJ ID	-Tlocus, -Tentry
sp2	SWISS-PROT entry	-Tlocus, -Tentry
pdb	PDB entry\|chain
pir1	PIR accession
pir2	PIR entry
prf1	PRF accession
prf2	PRF entry
pat	country\|number\|seqno
gnl	database\|idstring
oth	database\|accession\|release

NOTE: the priority list of Table 3 is currently used both in the presence and the absence of any -T options when xdget looks up unqualified identifiers. Future versions of xdformat and xdget will likely use the left-to-right order of -T specifications as the priority order for lookups; in the absence of any -T specifications on either program’s command line, the order shown in Table 3 will be used by default.
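The priority-ordered lookup of an unqualified identifier can be sketched in Python as follows. The index layout (a dictionary of name spaces, each mapping identifiers to record numbers) and the name-space labels are assumptions made for this example; in particular, “accession” stands for the unified Accession name space and “gb2” for the shared GenBank locus name space of Table 3. The real index format is not described here.

```python
# Name spaces in the decreasing priority order of Table 3 (labels are illustrative).
PRIORITY = [
    "user", "lcl", "gi", "accession", "gb2", "emb2", "dbj2", "sp2",
    "pdb", "pir1", "pir2", "prf1", "prf2", "pat", "gnl", "oth",
]


def lookup_unqualified(index, ident):
    """Return (namespace, hits) for the first name space containing ident, else None."""
    for ns in PRIORITY:
        hits = index.get(ns, {}).get(ident)
        if hits:
            # Stop at the first match; lower-priority name spaces are never consulted.
            return ns, hits
    return None


# Hypothetical index: the same bare word exists as a UCID and as an Accession.
index = {
    "accession": {"U38670": [7]},
    "user": {"U38670": [3]},
}
print(lookup_unqualified(index, "U38670"))
```

Because “user” precedes “accession” in the list, the UCID wins in this example; a qualified query such as “gb|U38670|” would avoid the ambiguity.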

How can an entire class of identifiers be omitted from the index if it is not needed?

Tag specifications similar to those shown in Table 3 can be used to suppress indexing of certain classes of identifiers, while permitting all others to be indexed. If a tag specification simply ends with a 0 (zero), then that tag will be suppressed. For instance, to suppress indexing of identifiers appearing in the 2nd field of GenBank, EMBL, and DDBJ identifiers, one would specify -Tlocus0. Or to suppress indexing of gi identifiers, use -Tgi0. Such tag specifications may also be provided on the xdget command line to suppress the use of particular classes of identifiers during retrieval.

Can new sequences be appended to an existing database – and will they be indexed?

Yes, the rapid append mode (-a option) of xdformat is available for indexed databases; appends are only marginally slower when an index is being maintained. Appended sequences will have their identifiers indexed using the same -T restrictions (if any) that were specified when the database was first created. Indexing of identifiers occurs automatically and unconditionally during appends to a previously indexed database, without the need to specify the -I or -X option when appending.

How are ACCESSION.VERSION identifiers managed?

The numerical .VERSION extension that frequently accompanies Accessions assigned by the NCBI/EBI/DDBJ collaboration is automatically included in the index created by xdformat. Version information can then be used by xdget to identify the latest version of a sequence, when keyed by its Accession alone. Specific versions can also be retrieved if xdget is provided with an identifier of the form ACCESSION.VERSION (e.g., AAB33294.2). The -N option of xdget can be used to report instead the first (-N0) or last (-Nn) instance of an Accession in the database; the -A0 option can be used to report the lowest-numbered Version present in the database rather than the highest (the default or -An). All instances of an accession will be reported by xdget if the -d option is specified.

Indexing and retrieval can be restricted to Accessions assigned by the NCBI/EBI/DDBJ collaboration using the special option ‑Taccession (or ‑Tacc for short).

Remember: Version numbers assigned by NCBI/EBI/DDBJ are only tied to changes in the sequence data, not the associated annotation. The annotation of a database record may change greatly, while the Version will remain the same if the sequence itself has not changed.
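The version-selection behavior described in this section can be modeled with a small Python sketch. It is an illustration only: the function names are invented, the `lowest` flag merely mirrors the effect attributed to -A0 above, and treating a stored Accession without a version as version 0 is an assumption not stated in the text.

```python
def split_accession(ident):
    """Split 'AAB33294.2' into ('AAB33294', 2); the version is None when absent."""
    base, dot, ver = ident.rpartition(".")
    if dot and ver.isdigit():
        return base, int(ver)
    return ident, None


def select_version(stored, query, lowest=False):
    """Pick the stored ACCESSION.VERSION string that answers query, or None."""
    acc, wanted = split_accession(query)
    versions = {}
    for s in stored:
        s_acc, s_ver = split_accession(s)
        if s_acc == acc:
            versions[s_ver or 0] = s  # assumption: no version counts as version 0
    if not versions:
        return None
    if wanted is not None:
        return versions.get(wanted)  # a specific version was requested
    return versions[min(versions) if lowest else max(versions)]


# Made-up index contents for demonstration.
stored = ["AAB33294.1", "AAB33294.2", "AF218085.2"]
print(select_version(stored, "AAB33294"))               # highest version by default
print(select_version(stored, "AAB33294", lowest=True))  # lowest version
```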

What limitations exist on identifier indexes?

Assuming the underlying computer operating system and hardware have the capacity, index files produced by xdformat are currently limited to 8 TB (8,192 GB) in size, a limit that can be readily increased to 256 TB in the program if necessary. With its current configuration, however, an index of 50+ million entries requires less than 3 GB storage; and because storage requirements for the index increase only marginally faster than linearly with the number of entries, the current limit seems likely to suffice for some time. If the size of an index is problematic, or if faster retrieval is required, indexing can be restricted to the most important classes of identifiers using the -T option.
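As a rough back-of-the-envelope check of the figures above, the following Python lines estimate the storage per entry and the number of entries that would fit under the 8 TB limit. The calculation assumes strictly linear growth and binary units (1 GB = 1024^3 bytes), both simplifications; since growth is stated to be slightly faster than linear, the result is an order-of-magnitude figure only.

```python
GB = 1024 ** 3
TB = 1024 * GB

entries = 50_000_000   # "50+ million entries"
index_bytes = 3 * GB   # "less than 3 GB", used here as an upper bound
limit_bytes = 8 * TB   # current index size limit

bytes_per_entry = index_bytes / entries
entries_at_limit = limit_bytes / bytes_per_entry  # linear extrapolation (assumption)

print(f"about {bytes_per_entry:.0f} bytes per entry")
print(f"about {entries_at_limit:.2e} entries at the 8 TB limit")
```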
