Exploring Machine Learning Techniques to Improve Peptide Identification

Wednesday, November 14, 2018 - 3:00pm to 4:00pm
Meeting room 2265, Innovation Center

Department of Computer Science and Engineering
University of South Carolina

Author : Fawad Kirmani
Advisor : Dr. John Rose


The goal of this work is to improve proteotypic peptide prediction with lower processing time and better efficiency. Proteotypic peptides are the peptides in a protein sequence that can be confidently observed by mass spectrometry-based proteomics. One of the most widely used methods for identifying peptides is tandem mass spectrometry (MS/MS). Peptides to be identified are compared against an accurate mass and elution time (AMT) tag database. The AMT tag database helps reduce processing time and increase the accuracy of the identified peptides. Prediction of proteotypic peptides for AMT studies has improved rapidly in recent years, using amino acid properties such as charge, code, solubility and hydropathy.
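As a rough illustration of how amino acid properties can be turned into numeric features for a classifier, the sketch below computes three simple values for a peptide: its length, its mean Kyte-Doolittle hydropathy, and a crude net charge (K and R counted as +1, D and E as -1). The function name `peptide_features` and the choice of these three features are assumptions made for illustration only; they are not the feature set used in this work.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def peptide_features(peptide: str) -> dict:
    """Return a few illustrative sequence-derived features (hypothetical helper)."""
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("peptide must not be empty")
    # Raises KeyError for non-standard residue codes, which is intentional here.
    hydropathy = [KYTE_DOOLITTLE[aa] for aa in peptide]
    positive = sum(peptide.count(aa) for aa in "KR")  # basic residues
    negative = sum(peptide.count(aa) for aa in "DE")  # acidic residues
    return {
        "length": len(peptide),
        "mean_hydropathy": sum(hydropathy) / len(peptide),
        "net_charge": positive - negative,
    }


if __name__ == "__main__":
    print(peptide_features("LVNELTEFAK"))
```

A feature vector of this kind, computed per peptide, is the sort of input a classifier such as an SVM would be trained on.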

We describe an improved version of a support vector machine (SVM) classifier that achieves classification sensitivity, specificity and AUC on the Yersinia pestis, Saccharomyces cerevisiae and Bacillus subtilis str. 168 datasets similar to those reported by Webb-Robertson et al. [13] and Ahmed Alqurri [10]. The improved classifier uses a C++ SVM library instead of the MATLAB built-in library. We describe how we achieved these similar results with much less processing time.
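For readers unfamiliar with the three metrics named above, the following minimal sketch shows how sensitivity, specificity and AUC can be computed from true labels and classifier scores. The function names and the toy data are hypothetical; the AUC here uses the pairwise-ranking definition (the fraction of positive/negative pairs in which the positive example scores higher, with ties counted as half), which is one standard way to compute it and not necessarily the procedure used in this work.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data: 1 = proteotypic (observed), 0 = not observed.
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.6, 0.1]
    predictions = [1 if s >= 0.5 else 0 for s in scores]
    print(sensitivity_specificity(labels, predictions))
    print(auc(labels, scores))
```

Sensitivity and specificity depend on the decision threshold (0.5 in the toy example), while AUC summarizes ranking quality across all thresholds.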

Furthermore, we tested four machine learning classifiers on the Yersinia pestis, Saccharomyces cerevisiae and Bacillus subtilis str. 168 data. We performed feature selection from scratch, using four different algorithms, to get better results from each classifier. Some of these classifiers gave results similar to or better than the SVM classifiers while using fewer features. We describe the results of these four classifiers with different feature sets.