TFASTA

TFastA does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"
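The six-frame translation described above can be illustrated with a minimal Python sketch. This is not TFastA's own code: the function names (`reverse_complement`, `translate`, `six_frames`) and the frame labels (`+1` to `-3`) are hypothetical, and the sketch assumes the standard genetic code, with any codon containing an ambiguity symbol translated as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the search uses the standard table; '*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return a dict mapping frame label ('+1'..'-3') to its translation."""
    dna = dna.upper().replace("U", "T")  # treat RNA input as DNA
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frames("ATGGCCTAA").items():
        print(label, protein)
```

Each of the six strings returned by `six_frames` corresponds to one "implied protein sequence" that would then be compared with the protein query.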

Each translated reading frame is treated as a separate sequence to be searched. In the first step of the search, the comparison can be viewed as a set of dot plots, with the query as the vertical sequence and each sequence in the search set as a different horizontal sequence. This first step finds, for each comparison, the registers of comparison (diagonals) that have the largest number of short perfect matches (words). In the second step, these "best" regions are rescored using a scoring matrix that allows conservative replacements, ambiguity symbols, and runs of identities shorter than the size of a word. In the third step, the program checks whether some of these initial highest-scoring diagonals can be joined together. Finally, the search set sequences with the highest scores are aligned to the query sequence for display.
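The first step can be sketched in Python as follows. This is only an illustration of counting word matches per diagonal, not the program's actual implementation: the function name `best_diagonals` and the parameters `ktup` and `top` are hypothetical, and the later rescoring, joining, and alignment steps are not shown.

```python
from collections import Counter, defaultdict


def best_diagonals(query, subject, ktup=2, top=3):
    """Count perfect word matches on each diagonal of the dot plot.

    A diagonal is identified by (subject position - query position).
    Returns the `top` diagonals as (diagonal, match_count) pairs,
    highest count first.
    """
    # Index every word of length ktup in the query by its start position.
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)

    # Scan the subject; each shared word adds one hit to its diagonal.
    diag_hits = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], ()):
            diag_hits[j - i] += 1
    return diag_hits.most_common(top)


if __name__ == "__main__":
    # The query occurs in the subject shifted right by two residues,
    # so all seven two-letter words fall on diagonal 2.
    print(best_diagonals("MKTAYIAK", "GGMKTAYIAKQQ", ktup=2, top=1))
```

In a TFastA-style search, `subject` would be one translated reading frame, and the function would be called once per frame and per search set sequence.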

What is a Word? A word is any short sequence (n-mer or k-tuple) of length n, where n is a small integer that you set to a value less than or equal to six. The word GGATGG is one of the 4,096 possible words of length six that can be created from an alphabet consisting of the four letters G, A, T, and C. The word QL is one of the 400 possible words of length two that you can make with the 20 letters of the amino acid alphabet.

TFastA is part of the FastA family of programs.
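The word counts quoted in the definition of a word follow from raising the alphabet size to the power of the word length (4^6 = 4,096 and 20^2 = 400). A short Python check, using hypothetical helper names `count_words` and `all_words`:

```python
from itertools import product

NUCLEOTIDES = "GATC"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def count_words(alphabet, n):
    """Number of distinct words of length n over the given alphabet."""
    return len(alphabet) ** n


def all_words(alphabet, n):
    """Generate every word of length n over the given alphabet."""
    return ("".join(letters) for letters in product(alphabet, repeat=n))


if __name__ == "__main__":
    print(count_words(NUCLEOTIDES, 6))  # nucleotide words of length six
    print(count_words(AMINO_ACIDS, 2))  # amino acid words of length two
    print("GGATGG" in set(all_words(NUCLEOTIDES, 6)))
    print("QL" in set(all_words(AMINO_ACIDS, 2)))
```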

See: Pearson WR, Lipman DJ (1988) "Improved tools for biological sequence comparison." PNAS USA. 85(8):2444-8.

Manual: http://www.med.nyu.edu/rcr/rcr/fastaman.html

INPUT = Protein sequence files in FASTA format. You can specify multiple probe (query) sequences. The search set is either a single nucleic acid (DNA/RNA) sequence or multiple nucleic acid sequences.

TEST INPUT FILES

Input file: tfasta_in.txt

TEST OUTPUT FILES

Output file: tfasta_out1.txt