HOWTO: Cross Validation Using the Amino Acid Preference Toolkit

Ensure you have generated the whole-genome amino-preference data files for all genomes of interest, using a program such as TriplePreference or TriplePreferenceMultiSegment. Documentation for this process is located in the Howto For Generating Base Data.
Copy the needed cross validation programs from the amino preference crossValidation/ directory into your current working directory. The needed programs for cross validation are:
1. makeDirectories.sh - Set up the directory format to handle 10-fold cross validation.
2. ValidationSelectTrainingSets.java - Generates cross validation training and testing sets.
3. ValidationFormatFileForSVMTesting.java - Formats the testing sets for SVM input.
4. ValidationFormatForSVM.java - Formats the training sets for SVM input.
5. ValidationCombineSVMResults.java - Combines output scores of SVM.
6. AnalyzeClassificationResults.java - Compares output classifications to true classifications.
7. CumulateScores.java - Generates overall confusion matrix for cross validation.
Compile each of the Java files above so that it can be executed. To compile a file, type: javac filename.java, where filename is replaced by the name of the file you want to compile.
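The compile step can also be scripted as a loop over the six Java sources named in the program list. The following is only a minimal sketch: the compile_all function is a hypothetical helper, not part of the toolkit, and it assumes that javac is on your PATH and that the sources have been copied into the current directory.

```shell
# Java sources named in the program list above.
JAVA_SOURCES="ValidationSelectTrainingSets.java
ValidationFormatFileForSVMTesting.java
ValidationFormatForSVM.java
ValidationCombineSVMResults.java
AnalyzeClassificationResults.java
CumulateScores.java"

# compile_all is a hypothetical helper, not part of the toolkit.
# It compiles each source found in the current directory and stops
# at the first missing file or compiler error.
compile_all() {
    for src in $JAVA_SOURCES; do
        if [ ! -f "$src" ]; then
            echo "missing source file: $src" >&2
            return 1
        fi
        javac "$src" || return 1
    done
}
```

Run compile_all from the working directory once the files have been copied there.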
Generate a types meta-file which contains, on separate lines in the file, the names of the types of genomes that are being tested. For example, this file could be called svmTypeList and contain the following text:
DSDNA
SSDNA
RETROID
SSRNANegative
SSRNAPositive
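The example meta-file above can be created from the shell with a here-document. This sketch uses only the file name and type names given in the example; substitute your own types as needed.

```shell
# Write the example type list, one genome type per line.
cat > svmTypeList <<'EOF'
DSDNA
SSDNA
RETROID
SSRNANegative
SSRNAPositive
EOF
```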
For each type name used in the meta-file above, generate a file which lists all amino-preference data files belonging to that type. For example, if there are 20 genomes of type DSDNA, there should be a file called DSDNA which has 20 separate lines, each of which is the full path to one of the genome amino-preference data files generated in step 1.
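If the data files happen to be organized with one subdirectory per type, the per-type list files can be generated rather than typed by hand. This is a sketch under that assumption only: the DATA_ROOT/<type>/ layout and the write_type_lists function are hypothetical and are not something the toolkit requires.

```shell
# write_type_lists is a hypothetical helper. It ASSUMES the amino-preference
# data files for each type sit in their own subdirectory: DATA_ROOT/<type>/
# For every type named in svmTypeList it writes a file called <type> in the
# current directory, holding one full path per line.
write_type_lists() {
    data_root=$(cd "$1" && pwd) || return 1   # resolve to an absolute path
    while IFS= read -r type; do
        [ -n "$type" ] || continue            # skip blank lines
        find "$data_root/$type" -type f | sort > "$type"
    done < svmTypeList
}
```

For example, write_type_lists /path/to/data would produce the files DSDNA, SSDNA, and so on, next to svmTypeList.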
Generate the directories for the cross validation training and testing data sets by running the shell script makeDirectories.sh. This should generate ten directories, labeled testingGroup0 through testingGroup9.
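The contents of makeDirectories.sh are not reproduced in this document. If the script is unavailable, the layout described above can be created with a short loop. This is an assumed equivalent, not the toolkit's own script.

```shell
# Create testingGroup0 .. testingGroup9, the layout the text describes.
# This is an assumed equivalent, not the toolkit's makeDirectories.sh.
i=0
while [ "$i" -le 9 ]; do
    mkdir -p "testingGroup$i"
    i=$((i + 1))
done
```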
Generate the training and testing sets for 10-fold cross validation by running the following command:
java ValidationSelectTrainingSets svmTypeList 0.10 0 MINSIZE
where MINSIZE is replaced by the size of the smallest type set. This runs the ValidationSelectTrainingSets program, leaving out 10% of the data for testing in each group (thus performing 10-fold cross validation). The last two parameters tell ValidationSelectTrainingSets to use the default mechanism of producing training and testing examples, which allows for unbalanced data sets. Changing the third parameter (the 0 in the command above) from 0 to 1 makes all types use equal numbers of training examples, a number limited by the MINSIZE group. This step also generates several shell scripts used to control the rest of the process.
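MINSIZE can be computed instead of counted by hand. The sketch below assumes that the size of a type set is the number of lines in its per-type list file, and that every list file ends with a newline. The min_type_size function is a hypothetical helper, not part of the toolkit.

```shell
# min_type_size is a hypothetical helper: it prints the number of lines in
# the shortest per-type list file named in the given type meta-file.
# wc -l counts newline characters, so each list file is ASSUMED to end
# with a newline.
min_type_size() {
    min=""
    while IFS= read -r type; do
        [ -n "$type" ] || continue            # skip blank lines
        n=$(wc -l < "$type" | tr -d ' ')
        if [ -z "$min" ] || [ "$n" -lt "$min" ]; then
            min=$n
        fi
    done < "$1"
    echo "$min"
}
```

The result can then be passed to the command above, for example by setting MINSIZE=$(min_type_size svmTypeList) and running java ValidationSelectTrainingSets svmTypeList 0.10 0 "$MINSIZE".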
Format the testing data for svmLight by running the shell script generateTestingSVM.sh. Make the shell script executable first if necessary (on Unix: chmod u+x generateTestingSVM.sh). This should deposit a file called svmTestingGroupX in each testing directory, where X is replaced by the number of the testing group.
Format the training data for svmLight by running the shell script generateTrainingSVM.sh. This should deposit a file for each genome type of interest in each testing directory. Within each file, the samples that are of the appropriate type are labeled as positive examples (+1), while the rest are labeled as negative examples (-1).
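The one-versus-rest labeling described above can be illustrated with a small awk filter. This is not the toolkit's ValidationFormatForSVM program. It is a sketch that assumes a hypothetical intermediate file whose first field is the genome type and whose remaining fields are already svmLight feature:value pairs. The label_one_vs_rest name is likewise hypothetical.

```shell
# label_one_vs_rest is a hypothetical helper. It ASSUMES an input file whose
# first field is the genome type and whose remaining fields are svmLight
# feature:value pairs. It prints the same rows with the first field replaced
# by +1 for the chosen type and -1 for every other type.
label_one_vs_rest() {
    awk -v target="$1" '{ $1 = (($1 == target) ? "+1" : "-1"); print }' "$2"
}
```

Calling label_one_vs_rest DSDNA samples.txt, where samples.txt is a hypothetical input file, would mark the DSDNA rows as positive examples and all other rows as negative examples.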
The two scripts trainSVM.sh and testSVM.sh contain pointers to the home directory for the svmLight package tools. These directories are currently hard-coded to the directory for our specific Pathogen project and need to be updated to point to wherever the programs are in your local directory structure. In a Unix environment, this can be done fairly easily by performing a regular expression search and replace.
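One way to perform that search and replace is with sed. In the sketch below, OLD_HOME and NEW_HOME are placeholders: the actual hard-coded path is not given in this document, so inspect the generated scripts and substitute the real values. The repoint_svmlight function is a hypothetical helper.

```shell
# OLD_HOME and NEW_HOME are placeholders. Set OLD_HOME to the path that
# actually appears in the generated scripts and NEW_HOME to the directory
# where svmLight is installed on your system.
OLD_HOME="/path/hard-coded/in/scripts"
NEW_HOME="/path/to/your/svmLight"

# repoint_svmlight (hypothetical helper) rewrites every occurrence of
# OLD_HOME in the given script and keeps a .bak copy of the original.
# "|" is used as the sed delimiter so the slashes in the paths need no
# escaping; the paths themselves must not contain "|".
repoint_svmlight() {
    cp "$1" "$1.bak" &&
    sed "s|$OLD_HOME|$NEW_HOME|g" "$1.bak" > "$1"
}
```

After setting the two variables, run repoint_svmlight trainSVM.sh and repoint_svmlight testSVM.sh. Because sed treats OLD_HOME as a regular expression, a "." in the path matches any character, which is normally harmless here.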
Perform training with each training set by running the shell script trainSVM.sh. This will generate a model file in each directory for each genome type of interest.
Perform testing with each testing set by running the shell script testSVM.sh. This will generate output score files in each directory for each genome type of interest.
Generate a confusion matrix for each individual test set by running the script analyzeSVM.sh. This generates an output file called AnalysisResults in each testing directory. The file lists which files the SVM classified correctly and which it misclassified, and also contains a confusion matrix for that testing set.
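The five generated scripts can be chained so that a failure in one step stops the run before later steps consume incomplete output. The run_pipeline function below is a hypothetical driver, not part of the toolkit. It assumes the scripts are in the current directory and that the svmLight paths in trainSVM.sh and testSVM.sh have already been updated.

```shell
# run_pipeline is a hypothetical driver. It runs the generated scripts from
# the current directory in the order the steps describe and stops at the
# first script that is missing or exits with a non-zero status.
run_pipeline() {
    for script in generateTestingSVM.sh generateTrainingSVM.sh \
                  trainSVM.sh testSVM.sh analyzeSVM.sh; do
        if [ ! -f "$script" ]; then
            echo "missing script: $script" >&2
            return 1
        fi
        chmod u+x "$script"          # make executable, as noted earlier
        echo "running $script"
        "./$script" || return 1
    done
}
```

Running the steps one at a time, as described above, remains the safer choice the first time through, since it lets you inspect the output of each step.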
Finally, a global confusion matrix over all ten test sets can be generated by running the CumulateScores program as follows:
java CumulateScores svmTypeList analysisInput
where analysisInput is a file that contains the filenames of each of the AnalysisResults outputs generated by the previous step. This file needs to be created by the end user. A sample analysisInput file which should work with the defaults generated by the program can be found here.
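With the default directory names, the analysisInput file can be generated with a short loop. This sketch assumes that CumulateScores expects one filename per line and that paths relative to the working directory are acceptable. Neither detail is stated above, so adjust the paths if the program requires full paths.

```shell
# List the AnalysisResults file of every testing directory, one per line.
# The one-path-per-line layout and the relative paths are assumptions about
# what CumulateScores expects.
: > analysisInput                    # start with an empty file
i=0
while [ "$i" -le 9 ]; do
    echo "testingGroup$i/AnalysisResults" >> analysisInput
    i=$((i + 1))
done
```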