HOWTO: Generating Sequence Bootstrapping Data


This HOWTO provides an overview of the process for generating (bootstrapping) random samples of amino acid data based on real gene sampling. Tools currently exist to generate these types of data in three formats: single amino acid preference, pair amino acid preference, and triple amino acid preference. These programs handle single-segmented genomes as well as multi-segmented genomes, as long as the multi-segmented genome FASTA and gene files are formatted as described in the document Generating Base Data [Instructions for Multisegment Genome Data, Instruction #3].

Generating Sequence Bootstrapped Single Amino Acid Preference Data:

  1. Copy the file NewSelectRandomGenesSingles.java from the bootstrap/sequence/ directory of the toolkit to your current working directory. Also, copy the files SequenceReader.java, SequenceUtilities.java, ExtractGenes.java, and AminoAcidSinglePreference.java from the utils/ directory of the toolkit to your current working directory.

  2. Compile this suite of Java files by issuing the command:

    javac NewSelectRandomGenesSingles.java

    Because this file depends on all of the other Java files, javac compiles them automatically as part of the same command.

  3. Create three files that together represent the database of genomes you want to sample from.

    1. The first file should contain full pathnames to the FASTA files to use as the genome sampling set.
    2. The second file should contain full pathnames to the gene files to use as the genome sampling set. This file should be in the same order as the first file.
    3. The third file should contain full pathnames to the FASTA files to use as the genome sampling set, with each FASTA filename followed by a colon (":") and the type of genome for that file (SSDNA, DSDNA, RETROID). This file should also be in the same order as the database of FASTA files.

    Examples of these three files can be found in dataFiles/virusFASTA.db, dataFiles/virusGenes.db, and dataFiles/virusTypes.db respectively.
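
    Because the three files must list the genomes in the same order, a quick consistency check before sampling can save a failed run. The following is a minimal sketch and is not part of the toolkit: the class DatabaseCheck and its method firstProblem are hypothetical names, and the sketch assumes one entry per line, the path:TYPE layout described above, and that SSDNA, DSDNA, and RETROID are the only valid types.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical pre-flight check; not part of the toolkit.
public class DatabaseCheck {
    // Assumption: these are the only genome types, as listed in this HOWTO.
    static final List<String> KNOWN_TYPES = Arrays.asList("SSDNA", "DSDNA", "RETROID");

    // Returns null when the three lists line up, otherwise a description of the first problem.
    static String firstProblem(List<String> fasta, List<String> genes, List<String> types) {
        if (fasta.size() != genes.size() || fasta.size() != types.size()) {
            return "entry counts differ: " + fasta.size() + " FASTA, "
                    + genes.size() + " gene, " + types.size() + " type";
        }
        for (int i = 0; i < fasta.size(); i++) {
            String entry = types.get(i).trim();
            // Split on the last colon so that colons inside the path are tolerated.
            int colon = entry.lastIndexOf(':');
            if (colon < 0) {
                return "entry " + (i + 1) + ": no ':' in types file";
            }
            String path = entry.substring(0, colon).trim();
            String type = entry.substring(colon + 1).trim();
            if (!path.equals(fasta.get(i).trim())) {
                return "entry " + (i + 1) + ": types file path differs from FASTA file path";
            }
            if (!KNOWN_TYPES.contains(type)) {
                return "entry " + (i + 1) + ": unknown genome type '" + type + "'";
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: java DatabaseCheck fastaDb genesDb typesDb");
            return;
        }
        String problem = firstProblem(
                Files.readAllLines(Paths.get(args[0])),
                Files.readAllLines(Paths.get(args[1])),
                Files.readAllLines(Paths.get(args[2])));
        System.out.println(problem == null ? "databases are consistent" : problem);
    }
}
```

    This check can only compare entry counts for the gene file; whether each gene file actually belongs to the FASTA file on the same line still has to be verified by hand.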

  4. The underlying algorithm for sequence bootstrapping is:

    1. Select a random genome from the database.
    2. Select a random gene from the chosen genome.
    3. Select a random subsection of the chosen gene that is of a specific size (set by the user) and count the amino acids seen in that subsection.
    4. Repeat selecting random genes and subsections of genes with replacement, adding the new count of each amino acid to the previous count.
    5. Once enough random subsections have been selected to generate a profile of size set by the user, normalize the total counts of amino acids seen.
    6. Output the single amino acid preference profile created by this process.
    7. Repeat selecting random genomes with replacement until the specified number (set by the user) of random profiles is generated.
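
    The steps above can be sketched in a few lines of Java. This is an illustration of the algorithm as described here, not the code in NewSelectRandomGenesSingles.java. The class BootstrapSketch, the method sampleProfile, and the in-memory representation (each genome as a list of amino acid strings) are all assumptions made for the example, as is the handling of genes shorter than the chunk size, which this document does not specify.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; not the toolkit's implementation.
public class BootstrapSketch {
    // The twenty standard amino acids; this ordering is an assumption.
    static final String AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY";

    // Builds one normalized single amino acid preference profile.
    // Assumes chunkSize > 0 and that every gene string is non-empty.
    static double[] sampleProfile(List<List<String>> genomes, int profileSize,
                                  int chunkSize, Random rng) {
        // Step 1: select a random genome.
        List<String> genes = genomes.get(rng.nextInt(genomes.size()));
        long[] counts = new long[AMINO_ACIDS.length()];
        int selected = 0;
        // Steps 2-4: keep drawing genes and chunks, with replacement.
        while (selected < profileSize) {
            String gene = genes.get(rng.nextInt(genes.size()));
            // Assumption: a short gene or a nearly full profile yields a shorter chunk.
            int length = Math.min(chunkSize, Math.min(gene.length(), profileSize - selected));
            int start = rng.nextInt(gene.length() - length + 1);
            for (int i = start; i < start + length; i++) {
                int index = AMINO_ACIDS.indexOf(gene.charAt(i));
                if (index >= 0) {
                    counts[index]++; // symbols outside the twenty are skipped
                }
            }
            selected += length;
        }
        // Step 5: normalize the counts so that the profile sums to 1.
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        double[] profile = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            profile[i] = total == 0 ? 0.0 : (double) counts[i] / total;
        }
        return profile;
    }

    public static void main(String[] args) {
        // Toy data invented for the example; real input comes from the database files.
        List<List<String>> genomes = Arrays.asList(
                Arrays.asList("MKTAYIAKQRQISFVKSHFSRQ", "MSDNELLKQ"),
                Arrays.asList("MAHHHHHHVGT"));
        Random rng = new Random(42);
        // Steps 6-7: output one profile per sample, choosing a new genome each time.
        for (int sample = 0; sample < 3; sample++) {
            System.out.println(Arrays.toString(sampleProfile(genomes, 30, 5, rng)));
        }
    }
}
```

    Passing the Random instance in, rather than creating it inside the method, makes a run repeatable when a fixed seed is used.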

  5. To generate random samples of amino acid data based on the underlying database of genomes referenced by the files above, execute the following command:

    java NewSelectRandomGenesSingles pathToFASTADatabase pathToGenesDatabase pathToTypesDatabase numberOfInputFiles numberOfRandomSamplesToGenerate sizeInAminosOfRandomSamplesToGenerate chunkSizeToSelectFromEachGene outputDirectoryPathName

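    Each parameter is described below:

    1. pathToFASTADatabase - The file listing the FASTA pathnames (the first file from step 3).
    2. pathToGenesDatabase - The file listing the gene file pathnames (the second file from step 3).
    3. pathToTypesDatabase - The file listing each FASTA pathname with its genome type (the third file from step 3).
    4. numberOfInputFiles - The number of genomes listed in the database files.
    5. numberOfRandomSamplesToGenerate - The number of random profiles to generate.
    6. sizeInAminosOfRandomSamplesToGenerate - The total number of amino acids in each random sample.
    7. chunkSizeToSelectFromEachGene - The number of amino acids in each subsection selected from a gene.
    8. outputDirectoryPathName - The directory in which to store the output files.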

    As an example, to generate 1000 random samples, each having 660 amino acids selected in 60 amino acid chunks, from the virusFASTA.db database described above (which has 236 genomes), and to store the data in the directory randomSamples/, use the following command:

    java NewSelectRandomGenesSingles virusFASTA.db virusGenes.db virusTypes.db 236 1000 660 60 randomSamples/

  6. After execution, there should be five files in the output directory:

    1. randomClassificationTestGroup - This file contains all of the single amino acid preference profiles generated by the program, formatted to be used with SVMLight for testing.
    2. randomClassificationTest.geneUsage - This file contains entries for each sample, referencing which part of each genome and gene was used in generating the sample. It is useful for verifying the sampling of genes.
    3. randomClassificationTest.testToTrue - This file contains entries for each sample, referencing the actual FASTA file that was used as the basis for the random samples. It is useful for verifying the sampling of genomes.
    4. randomClassificationTest.types - This file contains entries for each sample, referencing the true type of the genome from which the sample was generated. This file can be used as a basis for comparing against the outputs from an SVMLight classification run.
    5. classificationTestSequenceNames - This file can be ignored, as it is no longer required by other parts of the software.

    The randomClassificationTestGroup file is formatted to be used with the SVMLight classification tool as a testing input. Every sample is on a separate line, and each feature of a sample is written as its feature number, followed by a colon, followed by the feature value. For example, feature 1 with value 0.05 is written as 1:0.05. Because this program generates single amino acid preferences, there should be twenty features on each line of the output file. Each line also begins with a 0, indicating to SVMLight that no classification is currently known for this sample. All randomly generated samples are placed into the same output file so that they can be tested at one time by the SVMLight software.
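
    To make the line layout concrete, the sketch below writes and reads one line in the format described above. It is an illustration only, not the toolkit's output code. The class SvmLightLine and its methods are hypothetical names, and the parser assumes well-formed number:value tokens separated by whitespace.

```java
// Illustrative sketch only; not part of the toolkit.
public class SvmLightLine {
    // Writes a label followed by 1-based "number:value" features.
    static String format(int label, double[] features) {
        StringBuilder line = new StringBuilder().append(label);
        for (int i = 0; i < features.length; i++) {
            line.append(' ').append(i + 1).append(':').append(features[i]);
        }
        return line.toString();
    }

    // Reads the features back into an array; features absent from the line stay 0.
    static double[] parse(String line, int featureCount) {
        double[] features = new double[featureCount];
        String[] tokens = line.trim().split("\\s+");
        // Token 0 is the label, so the features start at token 1.
        for (int t = 1; t < tokens.length; t++) {
            if (tokens[t].startsWith("#")) {
                break; // the rest of the line is a comment
            }
            int colon = tokens[t].indexOf(':');
            int number = Integer.parseInt(tokens[t].substring(0, colon));
            features[number - 1] = Double.parseDouble(tokens[t].substring(colon + 1));
        }
        return features;
    }

    public static void main(String[] args) {
        String line = format(0, new double[] {0.05, 0.1});
        System.out.println(line); // prints: 0 1:0.05 2:0.1
        System.out.println(parse(line, 20)[0]); // prints: 0.05
    }
}
```

    A parser like this can be used to load the generated profiles back into memory, for example to confirm that each line carries the expected number of features.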

Generating Sequence Bootstrapped Pair and Triple Amino Acid Preference Data:

  1. Follow the procedure described above for Single Amino Acid Preference, except replace NewSelectRandomGenesSingles.java with NewSelectRandomGenesPairs.java or NewSelectRandomGenesTriples.java as needed. Similarly, replace AminoAcidSinglePreference.java where it is called for with AminoAcidPairPreference.java or AminoAcidTriplePreference.java.

  2. Make sure you compile the new pair preference or triple preference files you have just moved into your working directory.

  3. Output files for pair preference data will have 400 features, and output files for triple preference data will have 8000 features. The triple preference data files can grow large (50 MB or more), so ensure an appropriate amount of disk space is available before executing this process.
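
    The feature counts follow from the twenty-letter amino acid alphabet: 20 x 20 = 400 possible pairs and 20 x 20 x 20 = 8000 possible triples. The sketch below shows one way a pair or triple could be mapped to a feature number. The class KmerFeatures is a hypothetical name, and both the alphabet ordering and the numbering scheme are assumptions made for illustration; the ordering used by AminoAcidPairPreference.java and AminoAcidTriplePreference.java may differ.

```java
// Illustrative sketch only; the toolkit's actual feature ordering may differ.
public class KmerFeatures {
    // Assumed ordering of the twenty standard amino acids.
    static final String AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY";

    // Number of distinct features for runs of k amino acids: 20^k.
    static int featureCount(int k) {
        int count = 1;
        for (int i = 0; i < k; i++) {
            count *= AMINO_ACIDS.length();
        }
        return count;
    }

    // Maps a run of amino acids to a 1-based feature number by reading it as a
    // base-20 number. Returns -1 if the run contains a non-standard symbol.
    static int featureNumber(String kmer) {
        int index = 0;
        for (int i = 0; i < kmer.length(); i++) {
            int digit = AMINO_ACIDS.indexOf(kmer.charAt(i));
            if (digit < 0) {
                return -1;
            }
            index = index * AMINO_ACIDS.length() + digit;
        }
        return index + 1;
    }

    public static void main(String[] args) {
        System.out.println(featureCount(2)); // prints: 400
        System.out.println(featureCount(3)); // prints: 8000
        System.out.println(featureNumber("YYY")); // prints: 8000
    }
}
```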