How to find the Largest ORF in the Forward or Reverse Complement Strand

Perl must be installed to run the script.

The input required for this program is a FASTA file with one or more nucleotide sequences. The output is two files: a text file with both the nucleotide and amino acid sequences of the ORF, and a FASTA file that by default contains the nucleotide sequence of the ORF.
1)    The Perl program to do this can be found in the CVS directory in scriptsproject.
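As a rough illustration of the expected input, the sketch below writes a small two-record FASTA file and counts its records. The file name example_input.fasta, the record names and the sequences are all made up for this example.

```shell
# Work in a scratch directory so nothing on your machine is overwritten.
workdir=$(mktemp -d)
fasta="$workdir/example_input.fasta"   # hypothetical file name

# Each record is a header line starting with ">" followed by sequence lines.
# The names and sequences below are invented for illustration.
cat > "$fasta" <<'EOF'
>seq1
ATGAAATTTGGGCCCTAA
>seq2
ATGCCCGGGAAATTTCATTAG
EOF

# Count the records by counting the header lines.
record_count=$(grep -c '^>' "$fasta")
echo "records: $record_count"
```

Counting the ">" lines is a quick way to confirm how many sequences the program will be given.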

http://virology.uvic.ca/virology-ca-tools/scripts/

It is called ‘FindLongestOrf.pl’.  This program requires the sequences to be in FASTA format. Many sequences may be run at the same time with this program.
2)    Open Terminal (command prompt) and change the directory to where your script is saved.

The terminal shows which directory you are currently in. You can change directory with commands such as:

cd                  change directory                        ex: cd C:/Users/Desktop/Bioinformatics/

This command takes you to a ‘Bioinformatics’ folder on your desktop. To find the directory path of a folder, open the folder; the path is shown near the top of its window.

.                      this directory                            ex: cd ./Bioinformatics/

If you are already in your Desktop folder, this command is a shorter way to reach the Bioinformatics folder.

..                     one directory above                  ex: cd ..

This command takes you one directory higher: if you are currently in the ‘Bioinformatics’ folder, it takes you back to Desktop.
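The three forms of cd above can be tried safely in a throwaway folder. This is a minimal sketch for a UNIX-style shell; the Desktop/Bioinformatics layout is recreated inside a temporary directory, so the paths are stand-ins and not your real desktop.

```shell
# Build a throwaway folder layout that stands in for Desktop/Bioinformatics.
base=$(mktemp -d)
mkdir -p "$base/Desktop/Bioinformatics"

cd "$base/Desktop"          # go to a folder by its full path
cd ./Bioinformatics/        # "." means the current directory
in_subfolder=$(basename "$PWD")

cd ..                       # ".." means one directory above
after_up=$(basename "$PWD")

echo "$in_subfolder -> $after_up"
```

The final line prints the folder name before and after the "cd ..", which makes it easy to see that the command moved up one level.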
3)    Once you are in the correct directory, type:  perl ./nameofscript inputFile.fasta outputFile [+/-] (whole)
The first argument is the path to the input file (must be in FASTA format).
The second argument is the path to the output file; if it does not exist, it will be created.
The third argument specifies the strand: + for the positive strand or - for the negative strand.

The fourth argument is optional. By default, the FASTA file will contain the nucleotide sequence of the ORF. If you would like it to contain the whole nucleotide sequence of the strand the ORF is on instead, type “whole” as the fourth argument.
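One way to catch argument mistakes before the script runs is a small pre-flight check. The sketch below is an assumption about what is worth checking, based only on the argument descriptions above; check_orf_args is a hypothetical helper and is not part of FindLongestOrf.pl.

```shell
# Hypothetical helper: returns 0 when the four arguments look usable, 1 otherwise.
# Argument order mirrors the script: input, output, strand, optional "whole".
check_orf_args() {
    input=${1:-}
    output=${2:-}
    strand=${3:-}
    mode=${4:-}

    # The input FASTA file must already exist.
    [ -f "$input" ] || { echo "input file not found: $input" >&2; return 1; }

    # An output path must be given (the file itself may not exist yet).
    [ -n "$output" ] || { echo "no output path given" >&2; return 1; }

    # The strand must be a plain ASCII "+" or "-".
    case "$strand" in
        +|-) ;;
        *) echo "strand must be + or -" >&2; return 1 ;;
    esac

    # The optional last argument, if given, must be the word "whole".
    if [ -n "$mode" ] && [ "$mode" != "whole" ]; then
        echo "optional fourth argument must be 'whole'" >&2
        return 1
    fi
    return 0
}
```

A typical use would be `check_orf_args input.fasta output - whole && perl ./FindLongestOrf.pl input.fasta output - whole`. The strand check also rejects typographic dashes, which word processors sometimes substitute for a plain hyphen when commands are copied from a document.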

e.g.

./FindLongestOrf.pl input.fasta output -

or

./FindLongestOrf.pl input.fasta output - whole
The script is picky about extra blank lines, so check the input file for any before running it.
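If blank lines are the problem, one option is to write a cleaned copy of the input with them removed. This is a minimal sketch using grep; the file names and sequences are made up, and it assumes the blank lines in question are in the input FASTA file.

```shell
# Create a sample input with stray blank and whitespace-only lines.
workdir=$(mktemp -d)
printf '>seq1\nATGAAA\n\n>seq2\n   \nATGCCC\n\n' > "$workdir/input.fasta"

# Keep only lines that contain something other than whitespace.
grep -v '^[[:space:]]*$' "$workdir/input.fasta" > "$workdir/input.clean.fasta"

# Count the remaining lines (tr strips padding that some wc versions add).
line_count=$(wc -l < "$workdir/input.clean.fasta" | tr -d ' ')
echo "lines after cleaning: $line_count"
```

Writing to a new file rather than editing in place keeps the original input untouched in case the cleaned copy is not what you wanted.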

If the command prompt shows the error “perl is not recognized as an internal or external command”, make sure that Perl is installed on your computer; if it is, add Perl to your PATH.
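The quoted error message comes from the Windows Command Prompt, where `where perl` reports whether Perl is on the PATH. For a UNIX-style shell, a comparable check might look like the sketch below; have_command is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds when the named command can be found on the PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

if have_command perl; then
    # Print the first lines of Perl's version banner.
    perl -v | head -n 2
else
    echo "perl was not found on the PATH; install it or add it to the PATH" >&2
fi
```

If the check fails even though Perl is installed, the folder containing the perl executable is probably missing from the PATH.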

Changing the output of FindLongestOrf.pl:
The formatting and information in the text file and the FASTA file created by running FindLongestOrf can be modified by editing the Perl script. This is done in the last section of the Perl script, in the series of if/elsif statements:

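if (length($m_aaseq) < 10) {
    # One tab-separated line per sequence in the text file
    print OUT "$m_def\t$m_size bp\t000" . length($m_aaseq) . " aa\t$m_start..$m_stop\t$seq\t$m_aaseq\n";
}

# FASTA record: a header line with fields separated by "|", then the sequence
print OUTFASTA ">$m_def|$m_size bp|" . length($m_aaseq) . " aa|$m_start..$m_stop\n$seq\n";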

OUT is the output for the regular text file, and OUTFASTA is the output for the FASTA file.

Variables:
$m_def                                     accession number
$m_size                                    number of base pairs in the ORF
$m_ntseq                                 the nucleotide sequence of the ORF
$m_aaseq                                the amino acid sequence of the ORF
$seq                                           the whole nucleotide sequence that contains the ORF

\t              means tab (creates space)

\n             means new line

Other symbols such as “>” or “|” are printed exactly as shown.

As you can see in the example above, the FASTA file will have the whole nucleotide sequence because of …\n$seq\… , instead of just the sequence within the ORF. This was done to create phylogenetic trees out of the whole sequence, but it can be changed to just the ORF by substituting …\n$m_ntseq\… for …\n$seq\…
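Because the OUTFASTA statement writes the header fields separated by “|”, they can be pulled apart again on the command line. The sketch below uses awk on a made-up header; the accession and numbers are invented, and it assumes the accession itself contains no “|” character.

```shell
# A made-up header in the layout written by the OUTFASTA print statement:
#   >accession|size bp|length aa|start..stop
header='>ABC123|300 bp|99 aa|101..400'

# Split on "|" and strip the leading ">" and the unit labels.
accession=$(printf '%s\n' "$header" | awk -F'|' '{ sub(/^>/, "", $1); print $1 }')
orf_bp=$(printf '%s\n' "$header" | awk -F'|' '{ split($2, a, " "); print a[1] }')
orf_aa=$(printf '%s\n' "$header" | awk -F'|' '{ split($3, a, " "); print a[1] }')
coords=$(printf '%s\n' "$header" | awk -F'|' '{ print $4 }')

echo "$accession $orf_bp $orf_aa $coords"
```

If you change the separators or the field order in the print statement, any downstream parsing like this needs to change with it.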

 

Troubleshooting tips: Sometimes access to the program file is restricted. To solve this, look up how to change file permissions for the command-line environment you are using. On UNIX command lines, navigate to the directory that contains the file and type “chmod”, followed by the permission settings and then the name of the program file:

chmod <permission commands> <filename>
e.g. chmod ug+rwx FindLongestOrf.pl
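To see what the chmod command changes, it can be tried on a placeholder file first. In this sketch the file only borrows the script's name and lives in a temporary directory; it is a stand-in, not the real FindLongestOrf.pl.

```shell
# Create a stand-in file in a scratch directory (not the real script).
workdir=$(mktemp -d)
script="$workdir/FindLongestOrf.pl"
cat > "$script" <<'EOF'
#!/usr/bin/perl
print "placeholder\n";
EOF

# Give the owning user (u) and group (g) read, write and execute permission.
chmod ug+rwx "$script"

# The permission string at the start of the listing shows the result.
ls -l "$script"
```

In "ug+rwx", "u" and "g" select the file's owner and group, "+" adds permissions, and "rwx" stands for read, write and execute.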
 
