HOWTO: Generating Base Data
This HOWTO gives an overview of generating base amino acid preference data from genomes. Later processes, such as training and testing, use this data. The toolkit currently has tools to generate three types of base data: single amino acid preference, pair amino acid preference, and triple amino acid preference. Each type has two tools: one that works with single-segment genomes and one that works with multi-segment genomes.
Generating Single Segment Genome, Single Amino Acid Preference Data:
- Copy the file SinglePreference.java from the baseData/ directory of the toolkit to your current working directory. Also, copy the files SequenceReader.java, SequenceUtilities.java, ExtractGenes.java, and AminoAcidSinglePreference.java from the utils/ directory of the toolkit to your current working directory.
- Compile this suite of java files by issuing the command:
javac SinglePreference.java
Because SinglePreference.java uses all of the other Java files, javac compiles them automatically as part of this command.
- For each genome for which you want to generate amino preference data, make sure that both the fasta formatted sequence file and a gene file (formatted like dataFiles/NC_004812.genes) are available on your system.
- To generate the base data, call the SinglePreference program with the following parameters:
java SinglePreference pathToFastaFile pathToGeneFile outputFileName
- After execution, the file specified as outputFileName should contain a vector of 20 datapoints which represent the usage of each of 20 amino acids within the genome, summed across all genes within the genome.
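To make the shape of the output concrete, the sketch below counts amino acid usage across a list of protein sequences and produces a 20-element vector. It is an illustration only, not the toolkit's AminoAcidSinglePreference class: the class name SingleUsageSketch, the alphabetical one-letter ordering of the amino acids, and the choice of raw counts (rather than frequencies) are all assumptions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of the toolkit.
class SingleUsageSketch {
    // The 20 standard amino acids in one-letter code.
    // Assumption: the toolkit's actual ordering may differ.
    static final String AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY";

    // Sum the usage of each amino acid across all translated genes.
    static long[] countUsage(List<String> proteins) {
        long[] counts = new long[AMINO_ACIDS.length()];
        for (String protein : proteins) {
            for (char residue : protein.toUpperCase().toCharArray()) {
                int index = AMINO_ACIDS.indexOf(residue);
                if (index >= 0) { // skip stop symbols and ambiguous codes
                    counts[index]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two made-up protein sequences standing in for a genome's genes.
        long[] counts = countUsage(Arrays.asList("MKV", "MAA*"));
        System.out.println(Arrays.toString(counts));
    }
}
```

Each position in the returned array corresponds to one amino acid, so the array length is always 20 regardless of how many genes are processed.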
Generating Single Segment Genome, Pair and Triple Amino Acid Preference Data:
- Follow the procedure described above for Single Amino Acid Preference, except replace SinglePreference.java with PairPreference.java or TriplePreference.java as needed. Similarly, replace AminoAcidSinglePreference.java where it is called for with AminoAcidPairPreference.java or AminoAcidTriplePreference.java.
- Make sure you compile the pair preference or triple preference files you have just copied into your working directory (for example, javac PairPreference.java) before running them.
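This document does not define exactly how pairs and triples are counted. As a rough sketch of one plausible interpretation, the code below counts overlapping runs of k adjacent residues (k = 2 for pairs, k = 3 for triples), which gives vectors of 20^2 = 400 and 20^3 = 8000 entries. The class name KmerUsageSketch, the overlapping-window rule, and the vector layout are assumptions and may not match AminoAcidPairPreference or AminoAcidTriplePreference.

```java
// Hypothetical illustration; not part of the toolkit.
class KmerUsageSketch {
    // Assumption: alphabetical one-letter ordering of the 20 amino acids.
    static final String AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY";

    // Count overlapping runs of k adjacent residues in one protein.
    static long[] countKmers(String protein, int k) {
        int size = 1;
        for (int i = 0; i < k; i++) {
            size *= AMINO_ACIDS.length(); // 20^k slots in total
        }
        long[] counts = new long[size];
        for (int start = 0; start + k <= protein.length(); start++) {
            int slot = 0;
            boolean valid = true;
            for (int offset = 0; offset < k; offset++) {
                int index = AMINO_ACIDS.indexOf(protein.charAt(start + offset));
                if (index < 0) { // window contains a non-standard symbol
                    valid = false;
                    break;
                }
                slot = slot * AMINO_ACIDS.length() + index;
            }
            if (valid) {
                counts[slot]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] pairs = countKmers("MAMA", 2);
        System.out.println("pair vector length: " + pairs.length);
    }
}
```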
Generating Multiple Segment Genome, Single Amino Acid Preference Data:
- Copy the file SinglePreferenceMultiSegment.java from the baseData/ directory of the toolkit to your current working directory. Also, copy the files SequenceReader.java, SequenceUtilities.java, ExtractGenes.java, and AminoAcidSinglePreference.java from the utils/ directory of the toolkit to your current working directory.
- Compile this suite of java files by issuing the command:
javac SinglePreferenceMultiSegment.java
Because SinglePreferenceMultiSegment.java uses all of the other Java files, javac compiles them automatically as part of this command.
- For each genome for which you want to generate amino preference data, make sure that two specially formatted files, one for the multi-segmented sequences and one for the genes, are available on your system. The fasta file fed to the program should not contain sequence data. Instead, it contains a set of strings that say where to find the fasta file for each segment of the genome. An example of this type of file can be found at dataFiles/MULTISEGMENT_GENOME0.fasta. Similarly, the genes file should contain strings that point to the true gene file for each segment of the genome. An example multisegment gene file can be found at dataFiles/MULTISEGMENT_GENOME0.genes. The true fasta formatted sequence file and gene file for each segment must be available on your system at the locations given in the special multisegmented input fasta file and input gene file.
- To generate the base data, call the SinglePreferenceMultiSegment program with the following parameters:
java SinglePreferenceMultiSegment pathToSpecialMultiSegmentFastaFile pathToSpecialMultiSegmentGeneFile outputFileName
- After execution, the file specified as outputFileName should contain a vector of 20 datapoints which represent the usage of each of 20 amino acids within the genome, summed across all genome segments, incorporating all genes within the genome.
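The multi-segment flow described above can be sketched in two steps: read the list of per-segment paths from the special input file, then add the per-segment usage vectors element by element. The sketch below assumes the special file holds one path per line; the real layout should be checked against dataFiles/MULTISEGMENT_GENOME0.fasta. The class name MultiSegmentSketch and the file names in main are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of the toolkit.
class MultiSegmentSketch {
    // Extract segment paths from the lines of a special multi-segment file.
    // Assumption: one path per line, blank lines ignored.
    static List<String> parseSegmentPaths(List<String> lines) {
        List<String> paths = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                paths.add(trimmed);
            }
        }
        return paths;
    }

    // Element-wise sum of the 20-entry usage vectors, one per segment.
    static long[] sumSegments(List<long[]> perSegment) {
        long[] total = new long[20];
        for (long[] segment : perSegment) {
            for (int i = 0; i < total.length; i++) {
                total[i] += segment[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up file names standing in for real segment files.
        List<String> paths = parseSegmentPaths(
                Arrays.asList("segment1.fasta", "", "segment2.fasta"));
        System.out.println(paths.size() + " segments listed");
    }
}
```

Because the sum is element-wise, the output keeps the same 20-entry shape as in the single-segment case.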
Generating Multiple Segment Genome, Pair and Triple Amino Acid Preference Data:
- Follow the procedure described above for Multiple Segment Single Amino Acid Preference, except replace SinglePreferenceMultiSegment.java with PairPreferenceMultiSegment.java or TriplePreferenceMultiSegment.java as needed. Similarly, replace AminoAcidSinglePreference.java where it is called for with AminoAcidPairPreference.java or AminoAcidTriplePreference.java.
- Make sure you compile the pair preference or triple preference files you have just copied into your working directory (for example, javac PairPreferenceMultiSegment.java) before running them.