Sequence representation
After the header line and comments, one or more lines may follow describing the sequence; each sequence line should have fewer than 80 characters. Sequences may be protein or nucleic acid sequences, and they can contain gaps or alignment characters (see sequence alignment). Sequences are expected to use the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped to upper-case; a single hyphen or dash can be used to represent a gap character; and in amino acid sequences, U and * are acceptable letters (see below). Numerical digits are not allowed in the sequence itself, although some databases use them to indicate the position in the sequence.
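The rules above (case folding, the ban on digits, and the line-length limit) can be sketched in a few lines of Python. This is a minimal illustration, not a reference parser: the function names `normalize_sequence_line` and `wrap_sequence`, the `strip_positions` flag, and the default width of 70 are all assumptions made for this example, as is the choice to discard whitespace inside a sequence line.

```python
def normalize_sequence_line(line: str, strip_positions: bool = False) -> str:
    """Upper-case one sequence line; optionally drop position digits."""
    # Assumption: whitespace inside a sequence line carries no meaning.
    residues = "".join(line.split())
    if any(ch.isdigit() for ch in residues):
        if not strip_positions:
            raise ValueError("digits are not allowed in sequence data")
        # Some databases print position numbers; remove them on request.
        residues = "".join(ch for ch in residues if not ch.isdigit())
    # Lower-case letters are accepted and mapped to upper-case.
    return residues.upper()


def wrap_sequence(sequence: str, width: int = 70) -> list:
    """Split a sequence into lines of fewer than 80 characters."""
    if not 0 < width < 80:
        raise ValueError("width must be between 1 and 79")
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]
```

For instance, `normalize_sequence_line("1 acgt 5", strip_positions=True)` yields `"ACGT"`, while the same call without the flag raises `ValueError`.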
The nucleic acid codes supported are:
| Nucleic Acid Code | Meaning |
| --- | --- |
| A | Adenine |
| C | Cytosine |
| G | Guanine |
| T | Thymine |
| U | Uracil |
| R | G A (puRine) |
| Y | T C (pYrimidine) |
| K | G T (Ketone) |
| M | A C (aMino group) |
| S | G C (Strong interaction) |
| W | A T (Weak interaction) |
| B | G T C (not A) (B comes after A) |
| D | G A T (not C) (D comes after C) |
| H | A C T (not G) (H comes after G) |
| V | G C A (not T, not U) (V comes after U) |
| N | A G C T (aNy) |
| X | masked |
| - | gap of indeterminate length |
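The nucleic acid codes lend themselves to a simple lookup table. The sketch below encodes each code as the set of bases it may stand for and uses that table to check a sequence. The names `IUPAC_NUCLEOTIDES`, `possible_bases`, and `is_valid_nucleic_acid` are hypothetical, and mapping `X` (masked) and `-` (gap) to an empty set is an assumption of this example, since no base set is listed for them.

```python
# Each code maps to the bases it can represent (DNA alphabet, with U for RNA).
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC",
    "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
    # Assumption: masked positions and gaps resolve to no specific base.
    "X": "", "-": "",
}


def possible_bases(code: str) -> set:
    """Return the set of bases a single nucleic acid code may stand for."""
    return set(IUPAC_NUCLEOTIDES[code.upper()])


def is_valid_nucleic_acid(sequence: str) -> bool:
    """True if every character, after upper-casing, is a listed code."""
    return all(ch in IUPAC_NUCLEOTIDES for ch in sequence.upper())
```

With this table, `possible_bases("r")` gives `{"A", "G"}`, and `is_valid_nucleic_acid("ACGTE")` is `False` because `E` is not a nucleic acid code.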
The amino acid codes supported are:
| Amino Acid Code | Meaning |
| --- | --- |
| A | Alanine |
| B | Aspartic acid or Asparagine |
| C | Cysteine |
| D | Aspartic acid |
| E | Glutamic acid |
| F | Phenylalanine |
| G | Glycine |
| H | Histidine |
| I | Isoleucine |
| K | Lysine |
| L | Leucine |
| M | Methionine |
| N | Asparagine |
| P | Proline |
| Q | Glutamine |
| R | Arginine |
| S | Serine |
| T | Threonine |
| U | Selenocysteine |
| V | Valine |
| W | Tryptophan |
| Y | Tyrosine |
| Z | Glutamic acid or Glutamine |
| X | any |
| * | translation stop |
| - | gap of indeterminate length |