Supplementary material for:

"Correlation of Amino Acid preference and Mammalian Viral Genome Type"

The list of viruses included in the dataset
Detailed table data for bootstrapped sequences and distributions for Figures 1 and 3
Detailed table data for bootstrapped sequences and distributions for Figures 2 and 4
The complete listing of amino acid triples with p-values < 0.01

Amino Acid Preference Toolkit in Java

Pathogen Project
Department of Computer Science and Engineering
University of South Carolina
Columbia, SC 29208
Contact Email: rose@cse.sc.edu

Support Vector Machine Implementation

SVMLight toolkit - Developed by Thorsten Joachims, Department of Computer Science, Cornell University. Written in C. Not included in the tar and zip files below.

Amino Acid Preference Toolkit

This toolkit is designed to be used in a Unix environment and takes advantage of Unix features such as shell scripts and gzip. It can be migrated to a Windows environment, however, by replacing calls to gzip with the appropriate zip command and converting the shell scripts to batch files.

Download the entire toolkit in tarred, gzipped format
Download the entire toolkit in zip format

Notes on Running Tests

Notes on Using the Amino Acid Preference Toolkit in a Windows Environment.
Notes on Generating Base Data Used in Evaluating Amino Acid Preferences [1st step to perform]
Notes on Running Cross Validation with the Amino Acid Preference Toolkit
Notes on Parallelizing the Cross Validation Process
Notes on Bootstrapping Random Amino Preference Profiles from Gene Sequences (sequence bootstrapping)
Notes on Bootstrapping Random Amino Preference Profiles from Genome-Wide Amino Distributions (distribution bootstrapping)
Notes on Generating Decimated Training and Test Sets to Work With Hypothesized Important Amino Acids

File List

The directory structure does not correspond to a Java package structure; it is used only as a simple way of organizing the files. Proper use of the toolkit may require files from multiple directories. Compiling a given file may also require combining files from multiple directories (usually something from the utils/ directory).

baseData/ directory - Algorithms that generate the base data used for computing and comparing amino acid preference profiles.
SinglePreference.java - Compute the whole genome amino acid preference profile, given an input fasta file and input gene notation file.
SinglePreferenceMultiSegment.java - Compute the whole genome amino acid preference profile for a multi-segmented genome, given an input multi-segment fasta file and input multi-segment gene notation file.
PairPreference.java - Compute the whole genome amino acid pair preference profile, given an input fasta file and input gene notation file.
PairPreferenceMultiSegment.java - Compute the whole genome amino acid pair preference profile for a multi-segmented genome, given an input multi-segment fasta file and input multi-segment gene notation file.
TriplePreference.java - Compute the whole genome amino acid triple preference profile, given an input fasta file and input gene notation file.
TriplePreferenceMultiSegment.java - Compute the whole genome amino acid triple preference profile for a multi-segmented genome, given an input multi-segment fasta file and input multi-segment gene notation file.
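
The following is a minimal sketch of how a single, pair, or triple preference profile could be computed, not the toolkit's own code. It assumes that a "preference profile" is the relative frequency of each overlapping window of k residues (k = 1, 2, or 3) and that the input is an already translated amino acid string; gene extraction from the fasta and gene notation files is omitted. The class name PreferenceProfileSketch and the method profile are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not the toolkit's SinglePreference, PairPreference,
// or TriplePreference code.
class PreferenceProfileSketch {

    // Assumption: a preference profile is the relative frequency of each
    // overlapping window of k residues (k = 1 single, 2 pair, 3 triple).
    static Map<String, Double> profile(String aminoAcids, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        Map<String, Double> counts = new TreeMap<>();
        int windows = aminoAcids.length() - k + 1;
        // Count every overlapping window of length k.
        for (int i = 0; i < windows; i++) {
            counts.merge(aminoAcids.substring(i, i + k), 1.0, Double::sum);
        }
        // Normalize the counts so that the profile sums to 1.
        for (Map.Entry<String, Double> entry : counts.entrySet()) {
            entry.setValue(entry.getValue() / windows);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(profile("MKKLA", 1)); // single preference
        System.out.println(profile("MKKLA", 2)); // pair preference
    }
}
```

A string shorter than k yields an empty profile in this sketch.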

utils/ directory - Utility classes used by other programs.
SequenceReader.java - Parses a fasta file to extract the sequence information.
AminoAcidSinglePreference.java - Compute an amino acid preference profile for an input amino acid string.
AminoAcidPairPreference.java - Compute an amino acid pair preference profile for an input amino acid string.
AminoAcidTriplePreference.java - Compute an amino acid triple preference profile for an input amino acid string.
SequenceUtilities.java - A collection of various utilities for manipulating sequences. Includes commonly used functions for finding the reverse and complement of a sequence.
ExtractGenes.java - A tool that parses a genes file in the format described below (see the dataFiles/ directory) and provides a simple function interface for reading that format.
ParseGenbankFile.java - A tool to parse a Genbank genome data file, extracting the CDS (coding regions) into the format required by the toolkit. It is not completely fool-proof: it cannot handle an extraneous CDS keyword that appears in a description rather than marking a coding region. Such errors are easily detected when the output file is used, however, because ExtractGenes will complain and point out the line it cannot understand.
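
As an illustration of the reverse and complement operations mentioned for SequenceUtilities.java, here is a minimal sketch for DNA sequences. It is not the toolkit's code; the class and method names are hypothetical, and passing characters other than A, C, G, and T through unchanged is an assumption.

```java
// Hypothetical sketch; not the toolkit's SequenceUtilities code.
class ReverseComplementSketch {

    // Complement a single DNA base. The four standard bases are returned in
    // upper case; any other character is returned unchanged (an assumption).
    static char complement(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default:  return base;
        }
    }

    // Complement every base, keeping the original order.
    static String complement(String sequence) {
        StringBuilder result = new StringBuilder(sequence.length());
        for (int i = 0; i < sequence.length(); i++) {
            result.append(complement(sequence.charAt(i)));
        }
        return result.toString();
    }

    // Reverse the order of the bases without complementing them.
    static String reverse(String sequence) {
        return new StringBuilder(sequence).reverse().toString();
    }

    // Reverse complement, as needed to read a gene on the opposite strand.
    static String reverseComplement(String sequence) {
        return reverse(complement(sequence));
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("ATGC"));
    }
}
```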

crossValidation/ directory - Classes useful in generating training and testing data for performing cross validation.
ValidationSelectTrainingSets.java - This tool, given a list of data files belonging to each type under study and parameters describing how to select training sets (percentage to leave out, proportioned or uniform sets, and smallest set size), generates an appropriate set of training and testing data for cross validation. This tool also generates several scripts used for running the training and testing phases. This file must be edited to point to the local installation of the SVM toolkit.
ValidationFormatFileForSVMTesting.java - Taking as input a file with a list of amino preference data filenames, this tool generates a single output file containing the amino preference data from all of the input list files, formatted as testing data for the SVMLight tool.
ValidationFormatForSVM.java - Given as input a list of the types of genomes under study and a list of filename and type combinations, this tool generates an SVM-formatted training file for each type, with examples belonging to a given type labeled as positive examples and examples not of the given type labeled as negative examples as appropriate.
ValidationCombineSVMResults.java - Taking as input a file containing pointers to the filenames of the cross validation SVM outputs, the number of testing samples used, and an output filename, this program returns a file containing the classification predicted by the SVM for each sample. Essentially, this program determines the most likely class for a sample by finding the maximum score among the scores returned by the different SVM models. The output of this program is fed into AnalyzeClassificationResults.
AnalyzeClassificationResults.java - This tool, taking as input a file listing the set of types under study, the true classifications of a set of testing examples, the classification tool (such as SVMLight) output classifications for the same set of examples, and the number of types under study, returns a confusion matrix representing the performance of the classifier.
CumulateScores.java - This tool, taking as input a file listing the set of types under study and a file pointing to analysis results generated by AnalyzeClassificationResults, returns a global confusion matrix for the entire 10-fold cross validation process.
makeDirectories.sh - This tool is a Unix shell script which automatically creates the ten directories required for storing training and testing data in the 10-fold cross validation process.
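
The classification and scoring steps described above (pick the type whose SVM model gives the maximum score, then tabulate a confusion matrix) can be sketched as follows. This is an illustration only, not the toolkit's code: the class and method names are hypothetical, types are represented by integer indexes, ties are assumed to go to the lowest index, and matrix rows are assumed to be true types with columns as predicted types.

```java
import java.util.Arrays;

// Hypothetical sketch; not the toolkit's ValidationCombineSVMResults or
// AnalyzeClassificationResults code.
class OneVsRestSketch {

    // scores[m][s] is the score that the model trained for type m assigned
    // to test sample s. Each sample receives the type with the maximum
    // score; ties go to the lowest type index (an assumption).
    static int[] predict(double[][] scores) {
        int samples = scores[0].length;
        int[] predicted = new int[samples];
        for (int s = 0; s < samples; s++) {
            int best = 0;
            for (int m = 1; m < scores.length; m++) {
                if (scores[m][s] > scores[best][s]) {
                    best = m;
                }
            }
            predicted[s] = best;
        }
        return predicted;
    }

    // Rows are true types, columns are predicted types (an assumption
    // about orientation).
    static int[][] confusionMatrix(int[] truth, int[] predicted, int typeCount) {
        int[][] matrix = new int[typeCount][typeCount];
        for (int s = 0; s < truth.length; s++) {
            matrix[truth[s]][predicted[s]]++;
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Made-up scores for three types and three test samples.
        double[][] scores = {
            { 0.9, -1.2, -0.3},
            {-0.5,  0.4, -0.1},
            {-0.8, -0.6,  0.7}
        };
        int[] truth = {0, 1, 1};
        for (int[] row : confusionMatrix(truth, predict(scores), 3)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Summing the per-fold matrices element by element would give a global matrix of the kind CumulateScores is described as producing.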

bootstrap/ directory - Classes useful in generating testing sets that are bootstrapped from gene and genome data.
NewSelectRandomSingles.java - Given a database (flat files) of fasta files, genes files, and type files, bootstraps a set of sequences from genome-wide amino preference distributions according to the specified size constraints.
NewSelectRandomPairs.java - Same as NewSelectRandomSingles.java, but bootstraps amino pair preference profiles.
NewSelectRandomTriples.java - Same as NewSelectRandomSingles.java, but bootstraps amino triple preference profiles.
NewSelectRandomGenesSingles.java - Same as NewSelectRandomSingles.java, but bootstraps by randomly sampling genes instead of the whole genome distribution.
NewSelectRandomGenesPairs.java - Same as NewSelectRandomGenesSingles.java, but for amino pair preference profiles.
NewSelectRandomGenesTriples.java - Same as NewSelectRandomGenesSingles.java, but for amino triple preference profiles.
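
The two bootstrapping modes can be illustrated with a minimal sketch. It is not the toolkit's code, and the details are assumptions: distribution bootstrapping is shown as drawing residues independently from a genome-wide single amino acid frequency table, and sequence bootstrapping as drawing whole genes with replacement. The class name, method names, and parameters are hypothetical, and the size constraints used by the toolkit are reduced here to a simple length or count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch; not the toolkit's NewSelectRandom* code.
class BootstrapSketch {

    // Distribution bootstrapping (assumed form): draw each residue
    // independently, with probability proportional to its frequency.
    static String sampleFromDistribution(Map<Character, Double> frequencies,
                                         int length, Random rng) {
        double total = 0.0;
        for (double f : frequencies.values()) {
            total += f;
        }
        if (frequencies.isEmpty() || total <= 0.0) {
            throw new IllegalArgumentException("Need at least one positive frequency");
        }
        List<Character> residues = new ArrayList<>(frequencies.keySet());
        StringBuilder sequence = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            double r = rng.nextDouble() * total;
            double cumulative = 0.0;
            // Fall back to the last residue if rounding leaves r uncovered.
            char chosen = residues.get(residues.size() - 1);
            for (char residue : residues) {
                cumulative += frequencies.get(residue);
                if (r < cumulative) {
                    chosen = residue;
                    break;
                }
            }
            sequence.append(chosen);
        }
        return sequence.toString();
    }

    // Sequence bootstrapping (assumed form): draw genes with replacement.
    static List<String> sampleGenes(List<String> genes, int count, Random rng) {
        if (genes.isEmpty()) {
            throw new IllegalArgumentException("Need at least one gene");
        }
        List<String> sample = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            sample.add(genes.get(rng.nextInt(genes.size())));
        }
        return sample;
    }
}
```

Passing a seeded Random makes a bootstrap run repeatable, which helps when comparing profiles across runs.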

subset/ directory - Classes useful in generating training and testing data sets that use a decimated group of amino preferences for separating types.
GenerateSVMSubsetRenormalize.java - Given a standard SVM-formatted file and a list of indexes to use as a decimated set of features for separation, this tool generates a new SVM-formatted file representing the values of just those features, normalized relative to each other.
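
A minimal sketch of the subset-and-renormalize idea for one line of an SVMLight-style file (a target label followed by index:value pairs) is shown below. It is not the toolkit's code. The class and method names are hypothetical, and two details are assumptions: "normalized relative to each other" is taken to mean that the kept values are divided by their sum, and the kept features retain their original indexes.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; not the toolkit's GenerateSVMSubsetRenormalize code.
class SubsetRenormalizeSketch {

    // Keep only the features whose index is in 'keep' and rescale them so
    // that the kept values sum to 1 (an assumed meaning of "renormalize").
    static String subsetLine(String svmLine, Set<Integer> keep) {
        // Drop a trailing "# comment", which the SVMLight format allows.
        int hash = svmLine.indexOf('#');
        String data = (hash >= 0 ? svmLine.substring(0, hash) : svmLine).trim();
        String[] tokens = data.split("\\s+");

        // First pass: sum the values of the kept features.
        double sum = 0.0;
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":");
            if (keep.contains(Integer.parseInt(pair[0]))) {
                sum += Double.parseDouble(pair[1]);
            }
        }

        // Second pass: write the target label and the rescaled features.
        StringBuilder out = new StringBuilder(tokens[0]);
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":");
            int index = Integer.parseInt(pair[0]);
            if (keep.contains(index)) {
                double value = Double.parseDouble(pair[1]);
                double scaled = (sum == 0.0) ? value : value / sum;
                out.append(' ').append(index).append(':')
                   .append(String.format(Locale.ROOT, "%.6f", scaled));
            }
        }
        return out.toString();
    }
}
```

For example, keeping features 1 and 2 of the line "1 1:0.2 2:0.3 5:0.5" rescales 0.2 and 0.3 to 0.4 and 0.6 and drops feature 5.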

parallel/ directory - Classes useful in parallelizing the various processes above, such as cross validation.
ReformatForParallel.java - This program takes as input a single-processor testing or training script for cross validation (generated by ValidationSelectTrainingSets) and the type of script being input (train or test), and returns the appropriate scripts to run the process on a multi-processor machine.

dataFiles/ directory - Various examples of text files used as inputs to programs in the toolkit.
svmTypeList - This file contains a list of the genome types under study and is commonly passed to tools in the crossValidation and bootstrap analysis directories.
DSDNA type file - This file contains a list of all of the files under study of the type DSDNA. It is used as input to the ValidationSelectTrainingSets class for cross validation (that program currently reads svmTypeList to see what types are under study and expects each type to be associated with a file of the same name that contains a list of the data files of that type).
NC_004812.1.fasta - An example FASTA formatted sequence.
NC_004812.genes - An example genes file that corresponds to the above example FASTA sequence file. Each line of this file has four fields, written without the commas shown here: gene start, gene stop, gene strand, continues. The continues field says whether or not the next line of start and stop information also belongs to the same gene [the same semantics as join in a Genbank description].
importantTriples.01 - An example input file for GenerateSVMSubsetRenormalize.java, containing the indexes of the subset of features one is interested in using for decimated tests.
analysisInput - An example input file for CumulateScores, a tool that generates an overall confusion matrix for 10-fold cross validation. This file lists the filenames of the individual AnalysisResults output files generated by running analyzeSVM.sh after cross validation.
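
To illustrate the four-field genes file format described above, here is a minimal sketch of a reader that groups consecutive lines into genes using the continues field. It is not the toolkit's ExtractGenes code. The class and method names are hypothetical, whitespace-separated fields are an assumption, and because the encodings of the strand and continues fields are not given here, both are kept as raw text and the caller supplies the test that decides whether a continues value means "more segments follow".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; not the toolkit's ExtractGenes code.
class GenesFileSketch {

    // One line of the genes file: start, stop, strand, continues.
    static final class Segment {
        final int start;
        final int stop;
        final String strand;        // raw token; encoding not assumed
        final String continuesFlag; // raw token; encoding not assumed

        Segment(int start, int stop, String strand, String continuesFlag) {
            this.start = start;
            this.stop = stop;
            this.strand = strand;
            this.continuesFlag = continuesFlag;
        }
    }

    // Assumption: the four fields are separated by whitespace.
    static Segment parseLine(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 4) {
            throw new IllegalArgumentException("Expected 4 fields: " + line);
        }
        return new Segment(Integer.parseInt(fields[0]),
                           Integer.parseInt(fields[1]),
                           fields[2], fields[3]);
    }

    // Group consecutive segments into genes. 'continues' decides whether a
    // continues token means that the next line belongs to the same gene.
    static List<List<Segment>> groupIntoGenes(List<String> lines,
                                              Predicate<String> continues) {
        List<List<Segment>> genes = new ArrayList<>();
        List<Segment> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            Segment segment = parseLine(line);
            current.add(segment);
            if (!continues.test(segment.continuesFlag)) {
                genes.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            genes.add(current); // file ended while a gene was still open
        }
        return genes;
    }
}
```

Keeping the continues test as a parameter avoids guessing at the file's actual encoding; consult NC_004812.genes for the real values.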