TITLE

Compound sequence identifiers containing "gi" identifiers to appear in NCBI
BLAST server output.

SUMMARY

During the weekend of July 3-4, for many of the sequences that have been
assigned NCBI "gi" identifiers, these identifiers will be included with the
usual sequence identifiers in BLAST server output.  gi identifiers may be
suppressed from BLAST Network server output by using the "-gi" command line
option.

INTRODUCTION

"gi" identifiers are assigned to sequences managed in the NCBI Integrated
Database (Sirotkin & Ostell, unpublished).  Currently the "ID" database is
used at the NCBI to track sequences from several databases, including
GenBank/EMBL/DDBJ, PIR, SWISS-PROT, PDB and PRF.  Each discrete sequence
within ID has its own unique gi identifier.  If the sequence changes, the
original version is archived within the Integrated Database and a new gi is
assigned to the altered sequence.  gi identifiers thus provide a common
avenue for referencing specific instances of sequences across a wide variety
of source databases, a capability that is gaining increasing importance as
more people come to rely upon the results of computation on sequence data.

Users of the NCBI BLAST servers are already familiar with sequence
identifiers represented in "FASTA long" format, such as:

    gb|L28077|TMYRRTRA
    gb|M34830|ABCAARAA
    gb|S63770|S63770
    emb|I04231|I04231
    sp|P18646|10KD_VIGUN
    gnl|dbest|2643

A table illustrating some of the syntax rules for identifiers originating
from a variety of source databases is presented here:

    Source Database Name            Identifier Syntax
    ============================    ========================
    GenInfo Integrated Database     gi|number
    GenBank                         gb|accession|locus
    EMBL Data Library               emb|accession|name
    NBRF PIR                        pir|accession|entry
    SWISS-PROT                      sp|accession|name
    Brookhaven Protein Data Bank    pdb|name|chain
    General database                gnl|dbname|entry, ex. gnl|dbest|2643

A complete list of database tags and "FASTA long" identifier formats
supported by the NCBI Toolbox software is quoted at the end of this message.
Note that GenPept ("gp") identifiers are not supported by the NCBI Toolbox,
but until an alternative identification scheme is available for protein
sequences from GenBank, the BLAST server will continue to report gp
identifiers of the form "gp|accession|locus_cds#" for these sequences.

COMPOUND IDENTIFIERS

The syntax of "FASTA long" identifiers allows them to be concatenated to form
compound identifiers, using a vertical bar to separate the individual
identifiers from one another.  For those sequences that have been assigned gi
identifiers, the BLAST server will report compound identifiers such as:

    gi|495710|gb|S69295|S69295S1
    gi|495711|gb|S69302|S69295S2
    gi|507128|gb|X72932|SPEMMFCR

A full FASTA-format definition line, with a single space separating the
compound identifier from the sequence description, might look like this:

    >gi|507128|gb|X72932|SPEMMFCR S.pyogenes genes for Fcr protein and M protein

AVAILABILITY OF gi IDENTIFIERS

In two major releases of GenBank now, gi identifiers have been included in
COMMENT and FEATURE fields of the flat files.  In Entrez and NetEntrez, gi
identifiers may be used as sequence retrieval keys, where they are referred
to as "NCBI Seq IDs".  Within the GenBank ASN.1 files, gi identifiers are a
choice of Seq-id.

gi identifiers are currently to be made available on the BLAST server only
for records derived directly from the GenBank flat files.  Eventually, more
gi identifiers will be reported for sequences from other sources.  Therefore,
you may note that some sequences in BLAST output are missing gi identifiers.

SUPPRESSION OF gi IDENTIFIERS

gi identifiers may be suppressed entirely from the human-readable BLAST
server output by using the "-gi" option of the BLAST application programs.
Example:

    blastp nr dxch.aa -gi > dxch.out

Warren Gish
NCBI/NLM


The database tags and identifier formats supported by the NCBI Toolbox
software, posted on ncbi.nlm.nih.gov beneath the /toolbox directory.  This
source code is quoted from the file api/sequtil.c.

static char * txtid [16] = {    /* FASTA_LONG formats */
    "???" ,     /* not-set = ??? */
    "lcl",      /* local = lcl|integer or string */
    "bbs",      /* gibbsq = bbs|integer */
    "bbm",      /* gibbmt = bbm|integer */
    "gim",      /* giim = gim|integer */
    "gb",       /* genbank = gb|accession|locus */
    "emb",      /* embl = emb|accession|locus */
    "pir",      /* pir = pir|accession|name */
    "sp",       /* swissprot = sp|accession|name */
    "pat",      /* patent = pat|country|patent number (string)|seq number (integer) */
    "oth",      /* other = oth|accession|name|release */
    "gnl",      /* general = gnl|database(string)|id (string or number) */
    "gi",       /* gi = gi|integer */
    "dbj",      /* ddbj = dbj|accession|locus */
    "prf",      /* prf = prf|accession|name */
    "pdb" };    /* pdb = pdb|entry name (string)|chain id (char) */


Sample blast server output, illustrating the report style when gi identifiers
are included.

BLASTP 1.3.11MP [29-Oct-93] [Build 12:24:06 Jul  1 1994]

Reference:  Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers,
and David J. Lipman (1990).  Basic local alignment search tool.  J. Mol. Biol.
215:403-410.

Query=  DXCH 232 Gene X protein - Chicken (fragment)
        (232 letters)

Database:  CDS Translations from GenBank(R) Release 83.0, June 15, 1994
           100,312 sequences; 30,038,452 total letters.
Searching..................................................done

                                                                     Smallest
                                                                      Poisson
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

gi|212897|gp|J00920|CHKX4_1 chicken x gene: exon7 and fla...  1191  1.7e-159  1
gi|212900|gp|J00922|CHKY_1 chicken y gene, including flan...   949  1.6e-126  1
gi|63060|gp|V00387|GGALB6_1 Part of the chicken ovalbumin...   690  1.5e-90   1
gi|212503|gp|M34352|CHKOVA8_1 Chicken ovalbumin gene, exo...   645  7.5e-85   1
gi|212505|gp|J00895|CHKOVAL_1 Chicken ovalbumin gene, com...   645  7.5e-85   1

WARNING:  Descriptions of 208 database sequences were not reported due to the
          limiting value of parameter V = 5.


>gi|212897|gp|J00920|CHKX4_1 chicken x gene: exon7 and flanking sequences.
            [Gallus gallus]
            Length = 232

 Score = 1191 (542.5 bits), Expect = 1.7e-159, P = 1.7e-159
 Identities = 232/232 (100%), Positives = 232/232 (100%)

Query:     1 QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNN 60
             QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNN
Sbjct:     1 QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNN 60

Query:    61 SFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEK 120
             SFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEK
Sbjct:    61 SFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEK 120

Query:   121 RRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELS 180
             RRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELS
Sbjct:   121 RRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELS 180

Query:   181 EDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP 232
             EDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP
Sbjct:   181 EDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP 232


>gi|212900|gp|J00922|CHKY_1 chicken y gene, including flanking sequences.
            [Gallus gallus]
            Length = 388

 Score = 949 (432.2 bits), Expect = 1.6e-126, P = 1.6e-126
 Identities = 180/232 (77%), Positives = 203/232 (87%)

Query:     1 QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNN 60
             QIKDLLVSSS D  TT+V +N IYFKG+WK AFN EDTREMPF +TK+ESKPVQMMCMNN
Sbjct:   157 QIKDLLVSSSIDFGTTMVFINTIYFKGIWKIAFNTEDTREMPFSMTKEESKPVQMMCMNN 216

Query:    61 SFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEK 120
             SFNVATLPAEKMKILELP+ASGDLSMLVLLPDEVS LERIEKTINF+KL EWT+ N M K
Sbjct:   217 SFNVATLPAEKMKILELPYASGDLSMLVLLPDEVSGLERIEKTINFDKLREWTSTNAMAK 276

Query:   121 RRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELS 180
             + +KVYLP+MKIEEKYNLTS+LMALGMTDLF  SANLTGISS ++L IS AVHG FME++
Sbjct:   277 KSMKVYLPRMKIEEKYNLTSILMALGMTDLFSRSANLTGISSVDNLMISDAVHGVFMEVN 336

Query:   181 EDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP 232
             E+G E  GSTG I +IKHS E E+FRADHPFLF I++NPTN I++FGRYWSP
Sbjct:   337 EEGTEATGSTGAIGNIKHSLELEEFRADHPFLFFIRYNPTNAILFFGRYWSP 388


>gi|63060|gp|V00387|GGALB6_1 Part of the chicken ovalbumin X gene (codes for
            exon 7). [Gallus gallus]
            Length = 133

 Score = 690 (314.3 bits), Expect = 1.5e-90, P = 1.5e-90
 Identities = 133/133 (100%), Positives = 133/133 (100%)

Query:   100 IEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTG 159
             IEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTG
Sbjct:     1 IEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTG 60

Query:   160 ISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNP 219
             ISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNP
Sbjct:    61 ISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNP 120

Query:   220 TNTIVYFGRYWSP 232
             TNTIVYFGRYWSP
Sbjct:   121 TNTIVYFGRYWSP 133

WARNING:  HSPs involving 210 database sequences were not reported due to the
          limiting value of parameter B = 3.

Parameters:
  E = 10., S = 60 (27.3 bits), E2 = 0.11, S2 = 36
  W = 3, T = 20 (9.1 bits), X = 22 (10.0 bits)
  M = BLOSUM62
  H = 0, V = 5, B = 3
  -gapdecayrate 0.5 (the default)

Statistics:
  Lambda = 0.316 nats/unit score, Lambda/ln2 = 0.455 bits/unit score
  K = 0.132, H = 0.534 bits/position
  Expected/Observed high score = 64 (29.1 bits) / 1191 (542.5 bits)
  # of letters in query:  232
  # of neighborhood words in query:  17
  # of exact words scoring below T:  216
  Database:  CDS Translations from GenBank(R) Release 83.0, June 15, 1994
  # of letters in database:  30,038,452
  # of word hits against database:  1,402,734
  # of failed hit extensions:  1,308,427
  # of excluded hits:  93,703
  # of successful extensions:  604
  # of overlapping HSPs discarded:  30
  # of HSPs reportable:  574
  # of sequences in database:  100,312
  # of database sequences with at least one HSP:  213
  No. of states in DFA:  185 (19 KB)
  Total size of DFA:  21 KB (64 KB)
  Time to generate neighborhood:  0.00u 0.00s 0.00t  Real: 00:00:00
  No. of processors used:  8
  Time to search database:  27.69u 17.71s 45.40t  Real: 00:00:15
  Total cpu time:  27.71u 17.81s 45.52t  Real: 00:00:15

WARNINGS ISSUED:  2
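

Programs that read identifiers from BLAST output may need to cope with the
compound form described under COMPOUND IDENTIFIERS.  The following C fragment
is a minimal sketch of one way to separate a leading gi component from the
identifier that follows it.  The function name split_gi is hypothetical; it is
not part of the NCBI Toolbox or of api/sequtil.c, and it assumes that the gi
component, when present, comes first, as it does in the sample output above.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * split_gi: hypothetical helper, NOT part of the NCBI Toolbox.
 *
 * If `id` starts with a "gi|number" component, store the number in *gi.
 * When another identifier follows ("gi|number|..."), return a pointer to
 * it; for a bare "gi|number", return `id` itself.  If `id` has no
 * well-formed leading gi component, set *gi to 0 and return `id`.
 */
const char *split_gi(const char *id, long *gi)
{
    char *end;
    long value;

    assert(id != NULL && gi != NULL);
    *gi = 0;

    /* Require the literal tag "gi|" followed by at least one digit. */
    if (strncmp(id, "gi|", 3) != 0 || !isdigit((unsigned char)id[3]))
        return id;

    errno = 0;
    value = strtol(id + 3, &end, 10);
    if (errno == ERANGE)
        return id;                      /* number too large for a long */

    if (*end == '|' && end[1] != '\0' && end[1] != ' ') {
        *gi = value;                    /* compound: gi|number|tag|... */
        return end + 1;
    }
    if (*end == '\0' || *end == ' ') {
        *gi = value;                    /* bare gi|number */
        return id;
    }
    return id;                          /* malformed, e.g. "gi|12x" */
}
```

Called with "gi|507128|gb|X72932|SPEMMFCR", the sketch stores 507128 in *gi and
returns a pointer to "gb|X72932|SPEMMFCR".  An identifier without a gi
component, such as "sp|P18646|10KD_VIGUN", is returned unchanged with *gi set
to 0.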