PlantsP-Functional Genomics of Plant Phosphorylation

FASTA Sequence Format

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.
  • The description line starts with a greater than symbol (">").
  • The word immediately following the greater than symbol (">") is the "ID" (name) of the sequence; the rest of the line is the description.
  • The "ID" and the description are optional.
  • All lines of text should be shorter than 80 characters.
  • A sequence ends where another greater than symbol (">") appears at the beginning of a line, which starts the next sequence.
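
The rules above can be sketched in Python as a small reader and writer. This is only an illustrative sketch, not the parser used by any particular program: the function names parse_fasta and format_fasta are hypothetical, blank lines are skipped, and the 60-character wrap width is an assumption chosen to stay under the 80-character limit.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str, str]  # (id, description, sequence)


def _split_header(header: str) -> Tuple[str, str]:
    """Split the text after ">" into (id, description); both may be empty."""
    if not header or header[0].isspace():
        # No word directly after ">": treat the ID as missing.
        return "", header.strip()
    parts = header.split(None, 1)
    return parts[0], (parts[1].strip() if len(parts) > 1 else "")


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (id, description, sequence) for each record in FASTA text."""
    header: Optional[str] = None   # description line of the current record
    chunks: List[str] = []         # sequence lines collected so far
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        if line.startswith(">"):
            # A new ">" line ends the previous sequence and starts another.
            if header is not None:
                yield (*_split_header(header), "".join(chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
        # Assumption: sequence data before any ">" line is ignored.
    if header is not None:
        yield (*_split_header(header), "".join(chunks))


def format_fasta(seq_id: str, description: str, sequence: str,
                 width: int = 60) -> str:
    """Return one FASTA record with sequence lines wrapped at `width`."""
    header = ">" + " ".join(part for part in (seq_id, description) if part)
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + body) + "\n"
```

Calling list(parse_fasta(open(path))) on a file, where path is a placeholder for a FASTA file name, would give one tuple per sequence.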

The following example contains two sequences (Example1, Example2):

>Example1 envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
>Example2 synthetic peptide
HITREPLKHIPKERYRGTNDTLSPQIESIWAAELDRYKLVKTNCSNVS

Sequences should use the standard single-letter IUB/IUPAC amino acid and nucleic acid codes. There is some variability among programs in support for the following:
  • Lower case letters are generally acceptable, but in some programs have special meanings.
  • Gaps introduced by alignment programs are frequently represented by the hyphen or period characters.
  • The end of a sequence is sometimes represented by an asterisk (*). Less frequently, sequences for which the reported region begins or ends within the known sequence may begin or end with a slash (/).
  • Unknown bases are generally represented by N, and sometimes by X.
  • Unknown amino acid residues are represented by X, and the ambiguous acid/amide residues by B (D or N) and Z (E or Q). Selenomethionine is sometimes represented as "O", and selenocysteine as "U", but many programs will treat these as errors.
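
Because support for these conventions varies, it can help to normalize a sequence before passing it to another program. The following is a minimal sketch under stated assumptions: the function name clean_sequence is hypothetical, and upper-casing deliberately discards any special meaning a program may attach to lower case letters.

```python
def clean_sequence(raw: str, keep_gaps: bool = False) -> str:
    """Normalize a raw sequence string using the conventions listed above."""
    seq = "".join(raw.split())   # drop whitespace and line breaks
    seq = seq.strip("/")         # optional leading/trailing slash (partial region)
    if seq.endswith("*"):
        seq = seq[:-1]           # optional end-of-sequence marker
    if not keep_gaps:
        # Alignment gaps are usually written as hyphens or periods.
        seq = seq.replace("-", "").replace(".", "")
    return seq.upper()           # assumption: case carries no meaning here
```

Only a trailing asterisk is removed; an asterisk inside a protein sequence is kept, since it may mark a translation stop.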

The accepted nucleic acid codes are:

        A --> adenosine           M --> A C (amino)
        C --> cytidine            S --> G C (strong)
        G --> guanosine           W --> A T (weak)
        T --> thymidine           B --> G T C
        U --> uridine             D --> G A T
        R --> G A (purine)        H --> A C T
        Y --> T C (pyrimidine)    V --> G C A
        K --> G T (keto)          N --> A G C T (any)
                                  X --> A G C T (any)
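
The table above can be written as a lookup from each code to the bases it may stand for. This is an illustrative sketch only: the names IUPAC_NUCLEOTIDES, is_valid_nucleotide_sequence and codes_compatible are hypothetical, and treating U as equivalent to T when comparing codes is an assumption.

```python
# Each code maps to the set of bases it can represent (from the table above).
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
    "N": "AGCT", "X": "AGCT",
}


def is_valid_nucleotide_sequence(seq: str) -> bool:
    """True if every character is one of the accepted nucleic acid codes."""
    return all(ch in IUPAC_NUCLEOTIDES for ch in seq.upper())


def codes_compatible(a: str, b: str) -> bool:
    """True if two codes can stand for at least one common base.

    Assumption: U is treated as T for the purpose of comparison.
    """
    bases_a = set(IUPAC_NUCLEOTIDES[a.upper()].replace("U", "T"))
    bases_b = set(IUPAC_NUCLEOTIDES[b.upper()].replace("U", "T"))
    return bool(bases_a & bases_b)
```

For example, R (G or A) is compatible with A but not with Y (T or C).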

The accepted amino acid codes are:

    A  alanine                         P  proline
    B  aspartate or asparagine         Q  glutamine
    C  cysteine                        R  arginine
    D  aspartate                       S  serine
    E  glutamate                       T  threonine
    F  phenylalanine                   U  selenocysteine
    G  glycine                         V  valine
    H  histidine                       W  tryptophan
    I  isoleucine                      Y  tyrosine
    K  lysine                          Z  glutamate or glutamine
    L  leucine                         X  any
    M  methionine                      *  translation stop
    N  asparagine                      
    O  selenomethionine
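
A protein sequence can be checked against the table above with a short helper. The sketch below is hedged: the names ACCEPTED_AMINO_ACID_CODES, find_unaccepted_codes and poorly_supported_codes are hypothetical, and the set of letters is taken directly from this table (every letter except J, plus the asterisk).

```python
from typing import List, Tuple

# Letters listed in the amino acid table above, plus "*" for translation stop.
ACCEPTED_AMINO_ACID_CODES = frozenset("ABCDEFGHIKLMNOPQRSTUVWXYZ*")

# Codes the text above says many programs will treat as errors.
POORLY_SUPPORTED = frozenset("OU")


def find_unaccepted_codes(seq: str) -> List[Tuple[int, str]]:
    """Return (position, character) pairs for characters not in the table.

    Positions are 0-based. Gap characters are reported too, so strip them
    first if the input comes from an alignment.
    """
    return [(i, ch) for i, ch in enumerate(seq)
            if ch.upper() not in ACCEPTED_AMINO_ACID_CODES]


def poorly_supported_codes(seq: str) -> List[str]:
    """Return the sorted set of O/U codes present in the sequence."""
    return sorted(set(seq.upper()) & POORLY_SUPPORTED)
```

An empty list from find_unaccepted_codes means every character is in the table; it does not guarantee that a given program accepts the sequence.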