COLLOQUIUM
Department of Computer Science and Engineering
University of South Carolina

Comparing the Efficiency of Database Retrieval Methods

Eva Czabarka
National Center for Biotechnology Information
U.S. National Library of Medicine

Date: January 23, 2004 (Friday)
Time: 3:30-4:30 PM
Place: Swearingen 1A03 (Faculty Lounge)

Abstract

The science of biology used to be fueled by kilobit quantities of information, mostly processed by humans. Now, public databases deliver biological information in gigabit quantities, and entire computers are devoted to the collection, storage, retrieval, and analysis of that information. Much of bioinformatics is therefore preoccupied with the construction of information retrieval algorithms. In any application, however, several different algorithms might be constructed. To compare them, a scientist must be prepared to measure the amount of information they retrieve.

The ROC[n] score is a popular measurement of information retrieval in bioinformatics. Of two competing algorithms, the one with the higher ROC[n] score might be deemed superior. On deeper reflection, however, chance alone could be causing one ROC[n] score to be higher than another. Moreover, the "superior" algorithm might incur extra costs in programming difficulty, running time, or code maintenance, so it might not be preferable unless statistical tests suggest that the higher ROC[n] score is real and not due to chance. Is a difference between ROC[n] scores statistically significant or not? A p-value based on bootstrap resampling of a database can answer the question.

We prove that in a large database, a small subset of records can be resampled using only independent Poisson random variates. Beyond its theoretical interest, Poisson resampling is an approximation relevant to many other practical measures of information retrieval, not just the ROC[n], so it provides a convenient programming economy in many cases of practical database resampling.

Our main results are central limit theorems for the ROC[n] score and for differences of ROC[n] scores. These theorems lead to simple statistical tests that assign p-values to differences between ROC[n] scores, based on the fact that such differences are approximately normally distributed (a brief illustrative sketch appears after the speaker's biography below). The means and standard deviations can be computed easily from the retrieval lists, and code already exists to compute them. In the data we examined, our central limit theorems always approximated the corresponding means and standard deviations to within 2%. This yields highly accurate p-values and eliminates the need for bootstrapping the database. The computed p-values allow selection of algorithms that are indeed significantly better than their competitors. Cruder versions of these results have already been used in deciding which of several new ideas should be incorporated into a version of BLAST.

Eva Czabarka, a Research Fellow at the National Library of Medicine, received a Ph.D. in Mathematics in 1998 from the University of South Carolina. Her original areas of study were combinatorics and graph theory and their applications to computer science. Her thesis work extended the shift technique used in the proof of the Erdős-Ko-Rado theorem. In collaboration with scientists at Los Alamos, she developed polynomial-time algorithms, based on simple models of DNA-sequencing procedures, that can substantially improve sequencing efficiency. She maintains an interest in problems of discrete mathematics and has started to work in the area of graph drawing, particularly crossing numbers.
She has designed graph algorithms with applications in bioinformatics. At NIH, where she has spent the last four years, her main area of research has shifted toward probability and statistics. Her current main interests are developing tools to evaluate the statistical significance of differences between the efficacies of database retrieval methods, and building models that describe the distribution of single nucleotide polymorphisms in order to understand the effect and magnitude of the forces that shaped that distribution. She is also involved with a project to introduce hierarchical clustering into NCBI's COG (Clusters of Orthologous Groups of proteins) database and to largely automate the process of finding such clusters, thereby allowing significant extension of the current database.
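The following is a minimal, hypothetical sketch (in Python) of the two ideas outlined in the abstract: independent Poisson replication counts as a stand-in for bootstrap resampling of a large database, and a normal-approximation p-value for a difference between ROC[n] scores. It is not the speaker's code; the function names, the exact form of the ROC[n] statistic, and the toy data are assumptions made purely for illustration.

from statistics import NormalDist   # standard normal CDF (Python 3.8+)
import numpy as np

def roc_n_score(ranked_hits, n, total_positives):
    """ROC[n] score of a ranked retrieval list.

    ranked_hits marks each retrieved record True (true positive) or False
    (false positive), in rank order.  The score averages, over the first n
    false positives, the number of true positives ranked above each one,
    normalized by n * total_positives so that perfect retrieval scores 1.
    """
    true_pos, false_pos, area = 0, 0, 0
    for is_hit in ranked_hits:
        if is_hit:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos
            if false_pos == n:
                break
    return area / (n * total_positives)

def poisson_bootstrap_weights(num_records, rng=None):
    """Independent Poisson(1) replication counts, one per database record.

    For a large database these approximate ordinary (multinomial) bootstrap
    resampling, but each record's count is drawn independently of the others.
    """
    rng = rng or np.random.default_rng()
    return rng.poisson(lam=1.0, size=num_records)

def roc_n_difference_p_value(observed_diff, diff_std):
    """Two-sided p-value for a difference of two ROC[n] scores under the
    normal (CLT) approximation, given an estimate of the standard deviation
    of the difference (e.g., from the retrieval lists or from Poisson
    resampling).  Null hypothesis: the two methods retrieve equally well.
    """
    z = observed_diff / diff_std
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

if __name__ == "__main__":
    # Toy retrieval lists for two hypothetical algorithms on the same queries.
    list_a = [True, True, False, True, False, False, True, False]
    list_b = [True, False, True, False, True, False, False, False]
    score_a = roc_n_score(list_a, n=3, total_positives=4)
    score_b = roc_n_score(list_b, n=3, total_positives=4)
    # The standard deviation below is made up purely for illustration.
    print(score_a, score_b, roc_n_difference_p_value(score_a - score_b, 0.05))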