TITLE

Compound sequence identifiers containing "gi" identifiers to appear
in NCBI BLAST server output.


SUMMARY

During the weekend of July 3-4, the BLAST servers will begin reporting NCBI
"gi" identifiers, together with the usual sequence identifiers, for many of
the sequences that have been assigned them.  gi identifiers may be suppressed
from BLAST Network server output by using the "-gi" command line option.

INTRODUCTION

"gi" identifiers are assigned to sequences managed in the NCBI Integrated
Database (Sirotkin & Ostell, unpublished).  Currently the "ID" database is used
at the NCBI to track sequences from several databases, including
GenBank/EMBL/DDBJ, PIR, SWISS-PROT, PDB and PRF.  Each discrete sequence within
ID has its own unique gi identifier.  If the sequence changes, the original
version is archived within the Integrated Database and a new gi is assigned to
the altered sequence.  gi identifiers thus provide a common avenue for
referencing specific instances of sequences across a wide variety of source
databases, a capability that is gaining increasing importance as more people
come to rely upon the results of computation on sequence data.

Users of the NCBI BLAST servers are already familiar with sequence identifiers
represented in "FASTA long" format, such as:

  gb|L28077|TMYRRTRA
  gb|M34830|ABCAARAA
  gb|S63770|S63770
  emb|I04231|I04231
  sp|P18646|10KD_VIGUN
  gnl|dbest|2643

A table illustrating some of the syntax rules for identifiers originating from
a variety of source databases is presented here:

  Source Database Name              Identifier Syntax
  ============================      ========================
  GenInfo Integrated Database       gi|number
  GenBank                           gb|accession|locus
  EMBL Data Library                 emb|accession|name
  NBRF PIR                          pir|accession|entry
  SWISS-PROT                        sp|accession|name
  Brookhaven Protein Data Bank      pdb|name|chain
  General database                  gnl|dbname|entry, e.g. gnl|dbest|2643
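
As an illustration of the syntax in the table above, a single identifier can
be taken apart by splitting it at each vertical bar: the first part is the
database tag and the remaining parts are its fields.  The following is only a
minimal sketch in C, not code from the NCBI Toolbox.  The function name
split_id and the constant MAX_PARTS are hypothetical, and the sketch assumes
that no field itself contains a vertical bar.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PARTS 8   /* hypothetical limit on tag + fields per identifier */

/*
 * Copy `id` into `buf` and split the copy at each '|'.
 * parts[0] receives the database tag, parts[1..] the fields.
 * Returns the number of parts, or -1 if `buf` or `parts` is too small.
 */
static int split_id(const char *id, char *buf, size_t buflen,
                    char *parts[], int maxparts)
{
    size_t n;
    int count = 0;
    char *p;

    assert(id != NULL && buf != NULL && parts != NULL);

    n = strlen(id);
    if (n + 1 > buflen || maxparts < 1)
        return -1;
    memcpy(buf, id, n + 1);            /* copy including the terminating NUL */

    parts[count++] = buf;              /* first part is the database tag */
    for (p = buf; *p != '\0'; p++) {
        if (*p == '|') {
            if (count == maxparts)
                return -1;             /* more parts than the caller allows */
            *p = '\0';                 /* terminate the previous part */
            parts[count++] = p + 1;    /* next part starts after the bar */
        }
    }
    return count;
}
```

Called on gb|L28077|TMYRRTRA with a sufficiently large buffer, split_id
returns 3 and leaves the parts "gb", "L28077" and "TMYRRTRA".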

A complete list of database tags and "FASTA long" identifier formats supported
by the NCBI Toolbox software is quoted at the end of this message.  Note that
GenPept ("gp") identifiers are not supported by the NCBI Toolbox, but until an
alternative identification scheme is available for protein sequences from
GenBank, the BLAST server will continue to report gp identifiers of the form
"gp|accession|locus_cds#" for these sequences.


COMPOUND IDENTIFIERS

The syntax of "FASTA long" identifiers allows them to be concatenated to form
compound identifiers, using a vertical bar to separate the individual
identifiers from one another.  For those sequences that have been assigned gi
identifiers, the BLAST server will report compound identifiers such as:

  gi|495710|gb|S69295|S69295S1
  gi|495711|gb|S69302|S69295S2
  gi|507128|gb|X72932|SPEMMFCR
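
Because each database tag implies a fixed number of fields, a compound
identifier can be divided back into its component identifiers by reading a
tag, consuming that tag's fields, and repeating.  A minimal sketch in C
follows; it is not taken from the NCBI Toolbox.  The names tag_info,
known_tags, fields_for_tag, component_length and count_components are
hypothetical.  The field counts are read off the identifier formats listed in
this message (including the "gp" form reported by the BLAST server), and
fields are assumed to contain no vertical bars or spaces.  A leading '>' and
a trailing description separated by a space are tolerated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct tag_info {
    const char *tag;   /* database tag, e.g. "gb" */
    int nfields;       /* number of '|'-separated fields after the tag */
};

/* Field counts inferred from the formats listed in this message. */
static const struct tag_info known_tags[] = {
    {"gi", 1},  {"lcl", 1}, {"bbs", 1}, {"bbm", 1}, {"gim", 1},
    {"gb", 2},  {"emb", 2}, {"dbj", 2}, {"pir", 2}, {"sp", 2},
    {"prf", 2}, {"pdb", 2}, {"gnl", 2}, {"gp", 2},
    {"pat", 3}, {"oth", 3}
};

/* Return the field count for a tag of length taglen, or -1 if unknown. */
static int fields_for_tag(const char *tag, size_t taglen)
{
    size_t i;

    for (i = 0; i < sizeof known_tags / sizeof known_tags[0]; i++) {
        if (strlen(known_tags[i].tag) == taglen &&
            strncmp(known_tags[i].tag, tag, taglen) == 0)
            return known_tags[i].nfields;
    }
    return -1;
}

/*
 * Return the length of the first component identifier in s,
 * or 0 if the tag is unknown or has fewer fields than expected.
 */
static size_t component_length(const char *s)
{
    size_t taglen = strcspn(s, "| ");
    int nfields = fields_for_tag(s, taglen);
    const char *p = s + taglen;

    if (nfields < 0)
        return 0;
    while (nfields-- > 0) {
        if (*p != '|')
            return 0;              /* a required field is missing */
        p++;                       /* skip the bar */
        p += strcspn(p, "| ");     /* skip the field itself */
    }
    return (size_t)(p - s);
}

/* Count the component identifiers at the start of s, or -1 on error. */
static int count_components(const char *s)
{
    int count = 0;

    assert(s != NULL);
    if (*s == '>')
        s++;                       /* tolerate a FASTA definition line */
    for (;;) {
        size_t len = component_length(s);
        if (len == 0)
            return -1;
        count++;
        s += len;
        if (*s != '|')
            break;                 /* end of string or start of description */
        s++;
    }
    return count;
}
```

For gi|507128|gb|X72932|SPEMMFCR, count_components reports 2 components:
gi|507128 (9 characters) and gb|X72932|SPEMMFCR (18 characters).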

A full FASTA-format definition line, with a single space separating the
compound identifier from the sequence description, might look like this:

>gi|507128|gb|X72932|SPEMMFCR S.pyogenes genes for Fcr protein and M protein


AVAILABILITY OF gi IDENTIFIERS

For the past two major releases of GenBank, gi identifiers have been included
in the COMMENT and FEATURE fields of the flat files.  In Entrez and NetEntrez,
gi identifiers may be used as sequence retrieval keys, where they are referred
to as "NCBI Seq IDs".  Within the GenBank ASN.1 files, a gi identifier is one
of the choices of Seq-id.

For now, the BLAST server will report gi identifiers only for records derived
directly from the GenBank flat files.  Eventually, gi identifiers will be
reported for sequences from other sources as well.  Until then, some sequences
in BLAST output will appear without gi identifiers.


SUPPRESSION OF gi IDENTIFIERS

gi identifiers may be suppressed entirely from the human-readable BLAST server
output by using the "-gi" option of the BLAST application programs.

Example:

    blastp nr dxch.aa -gi > dxch.out
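
Programs that post-process reports produced without the "-gi" option may
instead want to recover the gi-less form of an identifier on the client side.
The following C sketch is one way to do that under stated assumptions:
skip_gi_prefix is a hypothetical helper, not part of BLAST or the NCBI
Toolbox, and it handles only a gi identifier that appears first in a compound
identifier, as in the examples in this message.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Return a pointer just past a leading "gi|<digits>|" prefix.
 * If `id` does not start with such a prefix, or if nothing follows
 * the gi identifier, return `id` unchanged.
 */
static const char *skip_gi_prefix(const char *id)
{
    const char *p = id;

    assert(id != NULL);

    if (strncmp(p, "gi|", 3) != 0)
        return id;                       /* no gi identifier in front */
    p += 3;
    if (!isdigit((unsigned char)*p))
        return id;                       /* "gi|" not followed by a number */
    while (isdigit((unsigned char)*p))
        p++;
    if (*p != '|')
        return id;                       /* bare gi identifier: keep it */
    return p + 1;                        /* start of the next identifier */
}
```

Given gi|495710|gb|S69295|S69295S1, the function returns a pointer to
gb|S69295|S69295S1; an identifier without a leading gi component is returned
as is.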


Warren Gish
NCBI/NLM


The database tags and identifier formats supported by the NCBI Toolbox
software are listed below.  The Toolbox is posted on ncbi.nlm.nih.gov beneath
the /toolbox directory; this source code is quoted from the file api/sequtil.c.

    static char * txtid [16] = {          /* FASTA_LONG formats */
        "???" ,     /* not-set = ??? */
        "lcl",      /* local = lcl|integer or string */
        "bbs",      /* gibbsq = bbs|integer */
        "bbm",      /* gibbmt = bbm|integer */
        "gim",      /* giim = gim|integer */
        "gb",       /* genbank = gb|accession|locus */
        "emb",      /* embl = emb|accession|locus */
        "pir",      /* pir = pir|accession|name */
        "sp",       /* swissprot = sp|accession|name */
        "pat",      /* patent = pat|country|patent number (string)|seq number (integer) */
        "oth",      /* other = oth|accession|name|release */
        "gnl",      /* general = gnl|database(string)|id (string or number) */
        "gi",       /* gi = gi|integer */
        "dbj",      /* ddbj = dbj|accession|locus */
        "prf",      /* prf = prf|accession|name */
        "pdb" };    /* pdb = pdb|entry name (string)|chain id (char) */

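For illustration, a tag can be looked up in a table like the one quoted above
with a simple linear search.  This is a sketch only: find_tag_index and
NUM_TAGS are hypothetical names, and this is not the lookup routine used in
api/sequtil.c.  The array is repeated here (tags only) so that the example is
self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_TAGS 16

/* Tags in the same order as the txtid array quoted from api/sequtil.c. */
static const char *txtid[NUM_TAGS] = {
    "???", "lcl", "bbs", "bbm", "gim", "gb",  "emb", "pir",
    "sp",  "pat", "oth", "gnl", "gi",  "dbj", "prf", "pdb"
};

/*
 * Return the position of `tag` in txtid, or -1 if it is not listed.
 * Index 0 ("???", not-set) is never returned as a match.
 */
static int find_tag_index(const char *tag)
{
    int i;

    assert(tag != NULL);

    for (i = 1; i < NUM_TAGS; i++) {
        if (strcmp(txtid[i], tag) == 0)
            return i;
    }
    return -1;
}
```

With this table, "gi" is found at index 12 and "pdb" at index 15, while "gp"
yields -1, consistent with the note that GenPept identifiers are not supported
by the Toolbox.
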
The following sample BLAST server output illustrates the report style when gi
identifiers are included.

BLASTP 1.3.11MP [29-Oct-93] [Build 12:24:06 Jul  1 1994]

Reference:  Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers,
and David J. Lipman (1990).  Basic local alignment search tool.  J. Mol. Biol.
215:403-410.

Query=  DXCH  232 Gene X protein - Chicken (fragment)
        (232 letters)

Database:  CDS Translations from GenBank(R) Release 83.0, June 15, 1994
           100,312 sequences; 30,038,452 total letters.
Searching..................................................done

                                                                     Smallest
                                                                     Poisson
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

gi|212897|gp|J00920|CHKX4_1 chicken x gene: exon7 and fla...  1191  1.7e-159  1
gi|212900|gp|J00922|CHKY_1 chicken y gene, including flan...   949  1.6e-126  1
gi|63060|gp|V00387|GGALB6_1 Part of the chicken ovalbumin...   690  1.5e-90   1
gi|212503|gp|M34352|CHKOVA8_1 Chicken ovalbumin gene, exo...   645  7.5e-85   1
gi|212505|gp|J00895|CHKOVAL_1 Chicken ovalbumin gene, com...   645  7.5e-85   1


WARNING:  Descriptions of 208 database sequences were not reported due to the
          limiting value of parameter V = 5.


>gi|212897|gp|J00920|CHKX4_1 chicken x gene: exon7 and flanking sequences.
            [Gallus gallus]
            Length = 232

 Score = 1191 (542.5 bits), Expect = 1.7e-159, P = 1.7e-159
 Identities = 232/232 (100%), Positives = 232/232 (100%)

Query:     1 QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNN 60
             QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNN
Sbjct:     1 QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNN 60

Query:    61 SFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEK 120
             SFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEK
Sbjct:    61 SFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEK 120

Query:   121 RRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELS 180
             RRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELS
Sbjct:   121 RRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELS 180

Query:   181 EDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP 232
             EDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP
Sbjct:   181 EDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP 232


>gi|212900|gp|J00922|CHKY_1 chicken y gene, including flanking sequences.
            [Gallus gallus]
            Length = 388

 Score = 949 (432.2 bits), Expect = 1.6e-126, P = 1.6e-126
 Identities = 180/232 (77%), Positives = 203/232 (87%)

Query:     1 QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNN 60
             QIKDLLVSSS D  TT+V +N IYFKG+WK AFN EDTREMPF +TK+ESKPVQMMCMNN
Sbjct:   157 QIKDLLVSSSIDFGTTMVFINTIYFKGIWKIAFNTEDTREMPFSMTKEESKPVQMMCMNN 216

Query:    61 SFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEK 120
             SFNVATLPAEKMKILELP+ASGDLSMLVLLPDEVS LERIEKTINF+KL EWT+ N M K
Sbjct:   217 SFNVATLPAEKMKILELPYASGDLSMLVLLPDEVSGLERIEKTINFDKLREWTSTNAMAK 276

Query:   121 RRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELS 180
             + +KVYLP+MKIEEKYNLTS+LMALGMTDLF  SANLTGISS ++L IS AVHG FME++
Sbjct:   277 KSMKVYLPRMKIEEKYNLTSILMALGMTDLFSRSANLTGISSVDNLMISDAVHGVFMEVN 336

Query:   181 EDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP 232
             E+G E  GSTG I +IKHS E E+FRADHPFLF I++NPTN I++FGRYWSP
Sbjct:   337 EEGTEATGSTGAIGNIKHSLELEEFRADHPFLFFIRYNPTNAILFFGRYWSP 388


>gi|63060|gp|V00387|GGALB6_1 Part of the chicken ovalbumin X gene (codes for
            exon 7). [Gallus gallus]
            Length = 133

 Score = 690 (314.3 bits), Expect = 1.5e-90, P = 1.5e-90
 Identities = 133/133 (100%), Positives = 133/133 (100%)

Query:   100 IEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTG 159
             IEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTG
Sbjct:     1 IEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTG 60

Query:   160 ISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNP 219
             ISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNP
Sbjct:    61 ISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNP 120

Query:   220 TNTIVYFGRYWSP 232
             TNTIVYFGRYWSP
Sbjct:   121 TNTIVYFGRYWSP 133


WARNING:  HSPs involving 210 database sequences were not reported due to the
          limiting value of parameter B = 3.


Parameters:
  E = 10., S = 60 (27.3 bits),  E2 = 0.11, S2 = 36
  W = 3, T = 20 (9.1 bits), X = 22 (10.0 bits)
  M = BLOSUM62
  H = 0, V = 5, B = 3
  -gapdecayrate 0.5 (the default)

Statistics:
  Lambda = 0.316 nats/unit score, Lambda/ln2 = 0.455 bits/unit score
  K = 0.132, H = 0.534 bits/position
  Expected/Observed high score = 64 (29.1 bits) / 1191 (542.5 bits)
  # of letters in query:  232
  # of neighborhood words in query:  17
  # of exact words scoring below T:  216
  Database:  CDS Translations from GenBank(R) Release 83.0, June 15, 1994
  # of letters in database:  30,038,452
  # of word hits against database:  1,402,734
  # of failed hit extensions:  1,308,427
  # of excluded hits:  93,703
  # of successful extensions:  604
  # of overlapping HSPs discarded:  30
  # of HSPs reportable:  574
  # of sequences in database:  100,312
  # of database sequences with at least one HSP:  213
No. of states in DFA:  185 (19 KB)
Total size of DFA:  21 KB (64 KB)
Time to generate neighborhood:  0.00u 0.00s 0.00t  Real: 00:00:00
No. of processors used:  8
Time to search database:  27.69u 17.71s 45.40t  Real: 00:00:15
Total cpu time:  27.71u 17.81s 45.52t  Real: 00:00:15

WARNINGS ISSUED:  2