FASTX
FastX does a Pearson/Lipman search for similarity between a nucleotide query sequence and a group of protein sequences, taking frameshifts into account. It is designed to answer the question, "What protein sequences encoded by my nucleic acid sequence are homologous to sequences in the specified protein database(s)?"
FastX translates both strands of the nucleic acid sequence before performing the comparison. It searches for similarities between a nucleic acid sequence (the query) and any group of protein sequences. It is useful when the nucleotide query contains sequencing errors that cause frameshifts, which would interfere with protein-protein matching methods such as FastA.
In the first step of this search, the comparison can be viewed as a set of dot plots, with the query as the vertical sequence and each sequence in the search set as a horizontal sequence. This first step finds, for each comparison, the registers of comparison (diagonals) having the largest number of short perfect matches (words). In the second step, these "best" regions are rescored using a scoring matrix that allows conservative replacements, ambiguity symbols, and runs of identities shorter than the size of a word. In the third step, the program checks whether some of these initial highest-scoring diagonals can be joined together. Finally, the search set sequences with the highest scores are aligned to the query sequence for display.
What is a Word?
A word is any short sequence (n-mer or k-tuple) where you have set n to some small integer less than or equal to six. The word GGATGG is one of the 4,096 possible words of length six that can be created from an alphabet consisting of the four letters G, A, T, and C. The word QL is one of the 400 possible words of length two that you can make with the 20 letters of the amino acid alphabet.
FastX is part of the FastA family. It is faster than FastY and FrameSearch, but less sensitive.
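The word counts and the first search step described above can be illustrated with a short Python sketch. This is not FastX's actual code: the function names (count_words, six_frames, diagonal_hits, best_diagonal_per_frame) are hypothetical, the standard genetic code is assumed, and frameshift handling, rescoring with a scoring matrix, and joining of diagonals are all left out. It only translates the query in six frames and counts perfect word matches per diagonal against one protein sequence.

```python
from collections import defaultdict

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the query uses the standard code; '*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_words(alphabet_size, word_length):
    """Number of distinct words of the given length over an alphabet."""
    return alphabet_size ** word_length


def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, frame):
    """Translate from offset `frame` (0, 1, or 2); unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Translations keyed by frame label: +1..+3 forward, -1..-3 reverse strand."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for f in range(3):
        frames[f + 1] = translate(dna, f)
        frames[-(f + 1)] = translate(rc, f)
    return frames


def diagonal_hits(query, target, k):
    """Count perfect k-letter word matches on each diagonal.

    A diagonal is identified by (query position - target position).
    """
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            hits[i - j] += 1
    return dict(hits)


def best_diagonal_per_frame(dna, protein, k=2):
    """For each frame, return (diagonal, word_hits) of the best diagonal, or None."""
    result = {}
    for frame, peptide in six_frames(dna).items():
        hits = diagonal_hits(peptide, protein, k)
        result[frame] = max(hits.items(), key=lambda kv: kv[1]) if hits else None
    return result


if __name__ == "__main__":
    print(count_words(4, 6), count_words(20, 2))
    print(best_diagonal_per_frame("ATGGCCCAGCTG", "MAQL"))
```

Because each frame is compared independently here, a single frameshift in the query would split a match across two frames. Allowing an alignment to move between frames is the part of FastX that this sketch does not attempt.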
Manual: http://www.med.nyu.edu/rcr/rcr/fastaman.html
INPUT = Nucleic acid sequence files in FASTA format. The search set is either a single protein sequence or multiple protein sequences.
TEST INPUT FILES
Input file: fastx_in.txt
TEST OUTPUT FILES
Output file: fastx_out.txt