TFASTX
Compares a protein sequence to a DNA sequence or DNA sequence library. TFastX does a Pearson and Lipman search for similarity between a protein query sequence and a user-specified group of nucleotide sequences, taking frameshifts into account. Like TFastA, it is designed to answer the question, "What protein sequences encoded in a nucleotide sequence database are similar to my protein sequence?"
TFastX can be considered an enhanced version of TFastA. While TFastA treats each of the six reading frames of a nucleotide sequence as a separate sequence, resulting in three separate alignments for each strand, TFastX compares the protein query sequence to only one translated protein per strand of the nucleotide sequence, resulting in one alignment per strand.
It calculates a similarity score for alignments, taking frameshifts into account. It can "join" short regions separated by frameshifts into a single long alignment. TFastX may alert you to more meaningful hits than TFastA does when the nucleotide sequences contain frameshift errors.
TFastX can also be used in situations where FrameSearch is used. TFastX is faster, but FrameSearch is more sensitive.
In the first step of this search, the comparison can be viewed as a set of dot plots, with the query as the vertical sequence and the group of sequences to which the query is being compared as the different horizontal sequences. This first step finds the registers of comparison (diagonals) having the largest number of short perfect matches (words) for each comparison.
In the second step, these "best" regions are rescored using a scoring matrix that allows conservative replacements, ambiguity symbols, and runs of identities shorter than the size of a word.
In the third step, the program checks to see if some of these initial highest-scoring diagonals can be joined together.
Finally, the search set sequences with the highest scores are aligned to the query sequence for display.
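The first step above (finding the diagonals with the most perfect word matches) can be illustrated with a minimal sketch. This is not TFastX's actual code: the function names, the example sequences, and the word size of 2 are assumptions chosen for illustration, and the rescoring, joining, and frameshift handling of the later steps are left out.

```python
from collections import Counter, defaultdict


def word_positions(seq, k):
    """Map each k-letter word to the positions where it starts in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_word_counts(query, subject, k=2):
    """Count perfect word matches per diagonal.

    A diagonal is identified by (subject position - query position), so all
    matches that lie on the same line of the dot plot share one key.
    """
    table = word_positions(query, k)
    counts = Counter()
    for j in range(len(subject) - k + 1):
        for i in table.get(subject[j:j + k], ()):
            counts[j - i] += 1
    return counts


# Hypothetical protein query and one already-translated library sequence.
query = "MKVLAAGIV"
subject = "GGSMKVLAAGIVPP"
counts = diagonal_word_counts(query, subject, k=2)
best_diagonal, hits = counts.most_common(1)[0]
print(best_diagonal, hits)  # 3 8
```

Here the query occurs in the subject starting at position 3, so all eight of its two-letter words fall on diagonal 3. A real search would keep several top diagonals per library sequence and pass them to the rescoring step.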
What is a Word?
A word is any short sequence (n-mer or k-tuple), where n is a small integer that you set to a value less than or equal to six. The word GGATGG is one of the 4,096 possible words of length six that can be created from an alphabet consisting of the four letters G, A, T, and C. The word QL is one of the 400 possible words of length two that you can make with the 20 letters of the amino acid alphabet.
The DNA sequence is translated in three forward and three reverse frames, and the protein query sequence is compared to each of the six derived protein sequences. The DNA sequence is translated from one end to the other; no attempt is made to edit out intervening sequences.
Part of the FastA family.
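The translation in three forward and three reverse frames described above can be sketched as follows. This is an illustrative sketch, not the program's own implementation: it assumes the standard genetic code, writes unknown or ambiguous codons as X, and the function names and example sequence are made up for the example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate end to end, codon by codon; stops are written as '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the six translations keyed by strand and frame, e.g. '+1', '-3'."""
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frames("ATGGCCAAATAG")  # hypothetical 12-base sequence
print(frames["+1"])  # MAK*
print(frames["-1"])  # LFGH
```

Because translation runs from one end to the other, stop codons appear inline as `*` rather than ending the translation, and introns are not removed.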
For more about fastx, fasty, tfastx, and tfasty, see: Pearson WR, Wood T, Zhang Z, Miller W. (1997) "Comparison of DNA sequences with protein sequences." Genomics. Nov 15;46(1):24-36.
Manual: http://www.med.nyu.edu/rcr/rcr/fastaman.html
INPUT = Protein sequence files in FASTA format. The search set is either a single DNA/RNA sequence or multiple DNA/RNA sequences. You can specify multiple probe sequences.
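As a rough illustration of what a multi-probe FASTA input looks like, the sketch below reads FASTA text into (header, sequence) pairs. The reader function and the two probe records are hypothetical and are not part of TFastX; the program does its own parsing.

```python
def read_fasta(lines):
    """Collect (header, sequence) pairs from an iterable of FASTA text lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Two made-up protein probes; sequence data may wrap over several lines.
sample = """>probe1 hypothetical first probe
MKVLAAGIV
GGSPP
>probe2
MAK
""".splitlines()

records = read_fasta(sample)
for name, seq in records:
    print(name, len(seq))
```

Each record starts with a `>` header line, and the sequence lines that follow are concatenated until the next header.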
TEST INPUT FILES
Input file: tfastx_in.txt
TEST OUTPUT FILES
Output file: tfastx_out1.txt