HOWTO: Generating Distribution Bootstrapping Data


This HOWTO document provides an overview of the process for generating (bootstrapping) random samples of amino acid data based on genome distribution sampling. Tools currently exist to generate these types of data in three formats: single amino acid preference, pair amino acid preference, and triple amino acid preference. These programs handle single-segment genomes as well as multi-segment genomes, provided the multi-segment genome FASTA and gene files are formatted as described in the document Generating Base Data [Instructions for Multisegment Genome Data, Instruction #3].

Generating Distribution Bootstrapped Single Amino Acid Preference Data:

  1. Copy the file NewSelectRandomSingles.java from the bootstrap/distribution/ directory of the toolkit to your current working directory. Also, copy the files SequenceReader.java, SequenceUtilities.java, ExtractGenes.java, and AminoAcidSinglePreference.java from the utils/ directory of the toolkit to your current working directory.

  2. Compile this suite of Java files by issuing the command:

    javac NewSelectRandomSingles.java

    Because this file depends on all of the other Java files, compiling it also compiles them.

  3. Create three files that together form a database of the genomes to sample from.

    1. The first file should contain full pathnames to the FASTA files to use as the genome sampling set.
    2. The second file should contain full pathnames to the gene files to use as the genome sampling set. This file should be in the same order as the first file.
    3. The third file should contain full pathnames to the FASTA files to use as the genome sampling set, with each FASTA filename followed by a colon (":") and the type of genome for that file (SSDNA, DSDNA, RETROID). This file should also be in the same order as the database of FASTA files.

    Examples of these three files can be found in dataFiles/virusFASTA.db, dataFiles/virusGenes.db, and dataFiles/virusTypes.db respectively.
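
    Because the three database files must list the same genomes in the same order, a quick consistency check before sampling can save a failed run. The following is a minimal sketch of such a check, not part of the toolkit. The class and method names are hypothetical, the paths in main are made up, and it assumes each types entry is exactly the FASTA path, a colon, and one of the three genome types named above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the toolkit: checks that the three
// database files describe the same genomes in the same order.
class GenomeDatabaseCheckSketch {
    // Genome types named in this HOWTO; the toolkit may accept others.
    static final List<String> KNOWN_TYPES = Arrays.asList("SSDNA", "DSDNA", "RETROID");

    // Returns null when the entries are consistent, otherwise a message
    // describing the first problem found.
    static String firstProblem(List<String> fasta, List<String> genes, List<String> types) {
        if (fasta.size() != genes.size() || fasta.size() != types.size()) {
            return "the three files list different numbers of genomes";
        }
        for (int i = 0; i < fasta.size(); i++) {
            String entry = types.get(i).trim();
            int colon = entry.lastIndexOf(':');
            if (colon < 0) {
                return "entry " + (i + 1) + ": no ':' separator in types file";
            }
            String path = entry.substring(0, colon).trim();
            String type = entry.substring(colon + 1).trim();
            if (!path.equals(fasta.get(i).trim())) {
                return "entry " + (i + 1) + ": types file and FASTA file are out of order";
            }
            if (!KNOWN_TYPES.contains(type)) {
                return "entry " + (i + 1) + ": unknown genome type " + type;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up entries standing in for the lines of the three files.
        List<String> fasta = Arrays.asList("/data/g1.fasta", "/data/g2.fasta");
        List<String> genes = Arrays.asList("/data/g1.genes", "/data/g2.genes");
        List<String> types = Arrays.asList("/data/g1.fasta:SSDNA", "/data/g2.fasta:RETROID");
        String problem = firstProblem(fasta, genes, types);
        System.out.println(problem == null ? fasta.size() + " genomes, consistent" : problem);
    }
}
```

    In practice the three lists would be read from the database files, for example with java.nio.file.Files.readAllLines.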

  4. The underlying algorithm for distribution bootstrapping is:

    1. Select a random genome from the database.
    2. Initialize a counter for all twenty amino acids of interest to zero.
    3. Generate a list containing all single amino acids used in the genome. The size of this list is equal to the total number of amino acids in the genome.
    4. Randomly select single amino acids from the list, updating the counter for that amino acid as appropriate.
    5. Continue randomly selecting, with replacement, until the number of amino acids selected equals the sample size requested by the user.
    6. Once enough random single amino acids have been selected, normalize the total counts of amino acids seen.
    7. Output the single amino acid preference profile created by this process.
    8. Repeat, selecting random genomes with replacement until the specified number (set by the user) of random profiles is generated.
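
    The steps above can be sketched in a few lines of Java. This is an illustration only, not the code in NewSelectRandomSingles.java. The class name, the method names, and the ordering of the twenty amino acids are assumptions, and each genome is represented here simply as a string of one-letter amino acid codes rather than being read from FASTA and gene files.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch of distribution bootstrapping; names are hypothetical.
class DistributionBootstrapSketch {
    // One-letter codes of the twenty amino acids (ordering is an assumption).
    static final String AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY";

    // Steps 2-6: sample residues with replacement from one genome and
    // return the normalized single amino acid preference profile.
    static double[] sampleProfile(String proteinSequence, int sampleSize, Random rng) {
        // Step 3: list every amino acid of interest found in the genome.
        StringBuilder pool = new StringBuilder();
        for (char c : proteinSequence.toCharArray()) {
            if (AMINO_ACIDS.indexOf(c) >= 0) {
                pool.append(c);
            }
        }
        if (pool.length() == 0 || sampleSize <= 0) {
            throw new IllegalArgumentException("nothing to sample");
        }
        int[] counts = new int[AMINO_ACIDS.length()];  // step 2: counters start at zero
        for (int i = 0; i < sampleSize; i++) {         // steps 4-5: draw with replacement
            char picked = pool.charAt(rng.nextInt(pool.length()));
            counts[AMINO_ACIDS.indexOf(picked)]++;
        }
        double[] profile = new double[counts.length];  // step 6: normalize the counts
        for (int i = 0; i < counts.length; i++) {
            profile[i] = (double) counts[i] / sampleSize;
        }
        return profile;
    }

    // Steps 1 and 8: pick genomes with replacement, one profile per pick.
    static double[][] bootstrap(String[] genomes, int numSamples, int sampleSize, Random rng) {
        double[][] profiles = new double[numSamples][];
        for (int s = 0; s < numSamples; s++) {
            String genome = genomes[rng.nextInt(genomes.length)];
            profiles[s] = sampleProfile(genome, sampleSize, rng);
        }
        return profiles;
    }

    public static void main(String[] args) {
        // Made-up sequences standing in for translated genomes.
        String[] genomes = { "MKVLAAGIVGLLLAQ", "MSTNPKPQRKTKRNT" };
        double[][] profiles = bootstrap(genomes, 3, 660, new Random(42));
        for (double[] profile : profiles) {            // step 7: output each profile
            System.out.println(Arrays.toString(profile));
        }
    }
}
```

    Because selection is with replacement, the requested sample size may exceed the number of amino acids in the chosen genome.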

  5. To generate random samples of amino acid data based on the underlying database of genomes referenced by the files above, execute the following command:

    java NewSelectRandomSingles pathToFASTADatabase pathToGenesDatabase pathToTypesDatabase numberOfInputFiles numberOfRandomSamplesToGenerate sizeInAminosOfRandomSamplesToGenerate outputDirectoryPathName

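    Each parameter is described below:

    1. pathToFASTADatabase - The file listing the FASTA files (the first file from step 3).
    2. pathToGenesDatabase - The file listing the gene files (the second file from step 3).
    3. pathToTypesDatabase - The file listing each FASTA file with its genome type (the third file from step 3).
    4. numberOfInputFiles - The number of genomes listed in the database files.
    5. numberOfRandomSamplesToGenerate - The number of random profiles to generate.
    6. sizeInAminosOfRandomSamplesToGenerate - The number of amino acids to select for each random sample.
    7. outputDirectoryPathName - The directory in which to store the output files.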

    For example, to generate 1000 random samples of 660 amino acids each from the virusFASTA.db database described above (which lists 236 genomes), storing the output in the directory randomSamples/, use the following command:

    java NewSelectRandomSingles virusFASTA.db virusGenes.db virusTypes.db 236 1000 660 randomSamples/

  6. After execution, there should be five files in the output directory:

    1. randomClassificationTestGroup - This file contains all of the single amino acid preference profiles generated by the program, formatted to be used with SVMLight for testing.
    2. randomClassificationTest.geneUsage - This file contains entries for each sample, referencing which part of each genome and gene was used in generating the sample. It is useful for verifying the sampling of genes.
    3. randomClassificationTest.testToTrue - This file contains entries for each sample, referencing the actual FASTA file that was used as the basis for the random samples. It is useful for verifying the sampling of genomes.
    4. randomClassificationTest.types - This file contains entries for each sample, referencing the true type of the genome from which the sample was generated. This file can be used as a basis for comparing against the outputs from an SVMLight classification run.
    5. classificationTestSequenceNames - This file can be ignored, as it is no longer required by other parts of the software.

    The randomClassificationTestGroup file is formatted to be used with the SVMLight classification tool as a testing input. Each sample is on a separate line, and each feature of a sample is written as its feature number, followed by a colon, followed by the feature value. For example, feature 1 with value 0.05 is written as 1:0.05. Since this program selects singles, there should be twenty features on each line of the output file. Each line also begins with a 0, indicating to SVMLight that no classification is currently known for the sample. All randomly generated samples are placed into the same output file so that they can be tested at one time by the SVMLight software.
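
    As a rough illustration of this line format, the sketch below turns one profile into a single line of text. The class and method names are hypothetical, and the number formatting the toolkit actually uses is not specified here, so this version simply relies on Java's default conversion of double values.

```java
// Hypothetical sketch of the line format described above; not toolkit code.
class SvmLightLineSketch {
    // Builds one line: a leading 0 (classification unknown), then
    // featureNumber:value pairs with feature numbers starting at 1.
    static String formatLine(double[] profile) {
        StringBuilder line = new StringBuilder("0");
        for (int i = 0; i < profile.length; i++) {
            line.append(' ').append(i + 1).append(':').append(profile[i]);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // A made-up, shortened profile; real single-preference lines carry twenty features.
        double[] profile = { 0.05, 0.25, 0.7 };
        System.out.println(formatLine(profile));
    }
}
```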

Generating Distribution Bootstrapped Pair and Triple Amino Acid Preference Data:

  1. Follow the procedure described above for Single Amino Acid Preference, except replace NewSelectRandomSingles.java with NewSelectRandomPairs.java or NewSelectRandomTriples.java as needed. Similarly, replace AminoAcidSinglePreference.java where it is called for with AminoAcidPairPreference.java or AminoAcidTriplePreference.java.

  2. Make sure you compile the new pair preference or triple preference files you have just moved into your working directory.

  3. Output files for pair preference data will have 400 features, and output files for triple preference data will have 8000 features. The triple preference data files can grow large (50 MB or more), so ensure an appropriate amount of disk space is available before executing this process.
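
    The feature counts follow from the twenty amino acids: 20 x 20 = 400 possible pairs and 20 x 20 x 20 = 8000 possible triples. The sketch below computes these counts and shows one way a pair or triple could be mapped to a feature number. The class and method names are hypothetical, and the lexicographic numbering is an assumption for illustration; the ordering used by AminoAcidPairPreference.java and AminoAcidTriplePreference.java may differ.

```java
// Hypothetical sketch of feature counting and numbering; not toolkit code.
class FeatureCountSketch {
    // One-letter codes of the twenty amino acids (ordering is an assumption).
    static final String AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY";

    // Number of distinct tuples of the given length: alphabetSize^tupleLength.
    static int featureCount(int alphabetSize, int tupleLength) {
        int count = 1;
        for (int i = 0; i < tupleLength; i++) {
            count = Math.multiplyExact(count, alphabetSize);
        }
        return count;
    }

    // 1-based feature number of a tuple under lexicographic ordering
    // (an assumed ordering, chosen only for illustration).
    static int featureIndex(String tuple) {
        int index = 0;
        for (char c : tuple.toCharArray()) {
            int position = AMINO_ACIDS.indexOf(c);
            if (position < 0) {
                throw new IllegalArgumentException("not an amino acid code: " + c);
            }
            index = index * AMINO_ACIDS.length() + position;
        }
        return index + 1;
    }

    public static void main(String[] args) {
        System.out.println("pairs:   " + featureCount(20, 2));
        System.out.println("triples: " + featureCount(20, 3));
        System.out.println("feature number of YY: " + featureIndex("YY"));
    }
}
```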