FASTA
FASTA. This program compares a protein sequence to another protein sequence or to a protein database, or a DNA sequence to another DNA sequence or a DNA library. It can be used for protein comparison or nucleotide comparison. This program achieves a high level of sensitivity for similarity searching at high speed by performing optimised searches for local alignments using a substitution matrix. The high speed of this program is achieved by using the observed pattern of word hits to identify potential matches before attempting the more time consuming optimised search. The trade-off between speed and sensitivity is controlled by the ktup parameter, which specifies the size of the word. Increasing the ktup decreases the number of background hits. Not every word hit is investigated but instead initially looks for segment's containing several nearby hits.
FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST, which also searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds.
See: Pearson WR, Lipman DJ (1988) "Improved tools for biological sequence comparison." PNAS USA. 85(8):2444-8.
Manual: http://www.med.nyu.edu/rcr/rcr/fastaman.html
INPUT = Nucleic Acid or Protein sequence files
Tools in the FASTA family: ALIGN, CHOFAS • FASTS • FASTX • FASTY • GGSEARCH • GLSEARCH • GREASE • LALIGN • LFASTA • PLALIGN • PRSS • PRFX • SSEARCH • TFASTS • TFASTX • TFASTY
Format
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters (punch cards had 80 columns). The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence. A simple example of one sequence in FASTA format:
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
Header line
The header line, which begins with '>', gives a name and/or a unique identifier for the sequence, and often lots of other information too. Many different sequence databases use standardized headers, which helps when automatically extracting information from the header. The header line may contain more than one header, separated by a ^A (Control-A) character (as in [1]).
In the original Pearson FASTA format, one or more comments, distinguished by a semi-colon at the beginning of the line, may occur after the header. Effectively no databases and no bioinformatics applications recognize these comments so don't use them. Instead, everyone follows the NCBI FASTA specification. An example of a multiple sequence FASTA file follows:
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
Sequence representation
After the header line and comments, one or more lines may follow describing the sequence: each line of a sequence should have fewer than 80 characters. Sequences may be protein sequences or nucleic acid sequences, and they can contain gaps or alignment characters (see sequence alignment). Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap character; and in amino acid sequences, U and * are acceptable letters (see below). Numerical digits are not allowed but are used in some databases to indicate the position in the sequence.
The nucleic acid codes supported are:
| Nucleic Acid Code | Meaning |
|---|---|
| A | Adenosine |
| C | Cytidine |
| G | Guanine |
| T | Thymidine |
| U | Uracil |
| R | G A (puRine) |
| Y | T C (pYrimidine) |
| K | G T (Ketone) |
| M | A C (aMino group) |
| S | G C (Strong interaction) |
| W | A T (Weak interaction) |
| B | G T C (not A) (B comes after A) |
| D | G A T (not C) (D comes after C) |
| H | A C T (not G) (H comes after G) |
| V | G C A (not T, not U) (V comes after U) |
| N | A G C T (aNy) |
| X | masked |
| - | gap of indeterminate length |
