FASTA

FASTA. This program compares a protein sequence to another protein sequence or to a protein database, or a DNA sequence to another DNA sequence or a DNA library. It can be used for protein comparison or nucleotide comparison. This program achieves a high level of sensitivity for similarity searching at high speed by performing optimised searches for local alignments using a substitution matrix. The high speed of this program is achieved by using the observed pattern of word hits to identify potential matches before attempting the more time consuming optimised search. The trade-off between speed and sensitivity is controlled by the ktup parameter, which specifies the size of the word. Increasing the ktup decreases the number of background hits. Not every word hit is investigated but instead initially looks for segment's containing several nearby hits.

FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST, which also searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds.

See: Pearson WR, Lipman DJ (1988) "Improved tools for biological sequence comparison." PNAS USA. 85(8):2444-8.

Manual: http://www.med.nyu.edu/rcr/rcr/fastaman.html

INPUT = Nucleic Acid or Protein sequence files

Tools in the FASTA family: ALIGN, CHOFAS • FASTS • FASTXFASTY • GGSEARCH • GLSEARCH • GREASE • LALIGN • LFASTA • PLALIGN • PRSS • PRFX • SSEARCH • TFASTS • TFASTXTFASTY

Format

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters (punch cards had 80 columns). The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence. A simple example of one sequence in FASTA format:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]  
    LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV  
    EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG  
    LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL  
    GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX  
    IENY  

Header line

The header line, which begins with '>', gives a name and/or a unique identifier for the sequence, and often lots of other information too. Many different sequence databases use standardized headers, which helps when automatically extracting information from the header. The header line may contain more than one header, separated by a ^A (Control-A) character (as in [1]).

In the original Pearson FASTA format, one or more comments, distinguished by a semi-colon at the beginning of the line, may occur after the header. Effectively no databases and no bioinformatics applications recognize these comments so don't use them. Instead, everyone follows the NCBI FASTA specification. An example of a multiple sequence FASTA file follows:

>SEQUENCE_1  
    MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
    LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK  
    IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL  
    MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL  

    >SEQUENCE_2  
    SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI  
    ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH  

Sequence representation

After the header line and comments, one or more lines may follow describing the sequence: each line of a sequence should have fewer than 80 characters. Sequences may be protein sequences or nucleic acid sequences, and they can contain gaps or alignment characters (see sequence alignment). Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap character; and in amino acid sequences, U and * are acceptable letters (see below). Numerical digits are not allowed but are used in some databases to indicate the position in the sequence.

The nucleic acid codes supported are:

Nucleic Acid Code Meaning
A Adenosine
C Cytidine
G Guanine
T Thymidine
U Uracil
R G A (puRine)
Y T C (pYrimidine)
K G T (Ketone)
M A C (aMino group)
S G C (Strong interaction)
W A T (Weak interaction)
B G T C (not A) (B comes after A)
D G A T (not C) (D comes after C)
H A C T (not G) (H comes after G)
V G C A (not T, not U) (V comes after U)
N A G C T (aNy)
X masked
- gap of indeterminate length

The amino acid codes supported are:

↓Amino Acid Code ↓Meaning
A Alanine
B Aspartic acid or Asparagine
C Cysteine
D Aspartic acid
E Glutamic acid
F Phenylalanine
G Glycine
H Histidine
I Isoleucine
K Lysine
L Leucine
M Methionine
N Asparagine
P Proline
Q Glutamine
R Arginine
S Serine
T Threonine
U Selenocysteine
V Valine
W Tryptophan
Y Tyrosine
Z Glutamic acid or Glutamine
X any
* translation stop
- gap of indeterminate length