COLLOQUIUM
Department of Computer Science and Engineering
University of South Carolina

Comparing the Efficiency of Database Retrieval Methods

Eva Czabarka
National Center for Biotechnology Information
U.S. National Library of Medicine

Date: January 23, 2004 (Friday)
Time: 3:30-4:30 PM
Place: Swearingen 1A03 (Faculty Lounge)

Abstract

The science of biology used to be fueled by kilobit quantities of information, mostly processed by humans. Now, public databases deliver biological information in gigabit quantities, and entire computers are devoted to the collection, storage, retrieval, and analysis of that information. Much of bioinformatics is therefore preoccupied with the construction of information retrieval algorithms. In any application, however, several different algorithms might be constructed. To compare them, a scientist must be prepared to measure the amount of information they retrieve.

The ROC[n] score is a popular measurement of information retrieval in bioinformatics. Of two competing algorithms, the one with the higher ROC[n] score might be deemed superior. On deeper reflection, however, chance alone could be causing one ROC[n] score to be higher than another. Moreover, the "superior" algorithm might incur extra costs in programming difficulty, running time, or code maintenance, so it might not be preferable unless statistical tests suggest that the higher ROC[n] score is real and not due to chance. Is a difference between ROC[n] scores statistically significant or not? A p-value based on bootstrap resampling of a database can answer the question.

We prove that in a large database, a small subset of records can be resampled using only independent Poisson random variates. Beyond its theoretical interest, Poisson resampling is an approximation relevant to many other practical measures of information retrieval, not just the ROC[n], so it provides a convenient programming economy in many cases of practical database resampling.

Our main results are central limit theorems for the ROC[n] score and for differences of ROC[n] scores. These theorems lead to simple statistical tests that assign p-values to differences between ROC[n] scores, based on the fact that such differences are approximately normally distributed (a brief illustrative sketch appears after the speaker's biography below). The means and standard deviations can be computed easily from the retrieval lists, and code already exists to compute them. In the data we examined, our central limit theorems always approximated the corresponding means and standard deviations to within 2%. This yields highly accurate p-values and eliminates the need for bootstrapping the database. The computed p-values allow selection of algorithms that are indeed significantly better than their competitors. Cruder versions of these results have already been used in deciding which of several new ideas should be incorporated into a version of BLAST.

Eva Czabarka, a Research Fellow at the National Library of Medicine, received a Ph.D. in Mathematics in 1998 from the University of South Carolina. Her original areas of study were combinatorics and graph theory and their applications to computer science. Her thesis work extended the shift technique used in the proof of the Erdős-Ko-Rado theorem. In collaboration with scientists at Los Alamos, she developed polynomial-time algorithms, based on simple models of DNA-sequencing procedures, that can substantially improve sequencing efficiency. She maintains an interest in problems of discrete mathematics and has started to work in the area of graph drawing, particularly crossing numbers.
She has designed graph algorithms with applications in bioinformatics. At NIH, where she has spent the last four years, her main area of research has shifted toward probability and statistics. Her current main interests are developing tools to evaluate the statistical significance of differences between the efficacies of database retrieval methods, and building models that describe the distribution of single nucleotide polymorphisms in order to understand the effect and magnitude of the forces that shaped that distribution. She is also involved with a project to introduce hierarchical clustering into NCBI's COG (Clusters of Orthologous Groups of proteins) database and to largely automate the process of finding such clusters, thereby allowing significant extension of the current database.
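The following is a minimal, hypothetical sketch (in Python) of the two ideas outlined in the abstract: independent Poisson replication counts as a stand-in for bootstrap resampling of a large database, and a normal-approximation p-value for a difference between ROC[n] scores. It is not the speaker's code; the function names, the exact form of the ROC[n] statistic, and the toy data are assumptions made purely for illustration.

from statistics import NormalDist   # standard normal CDF (Python 3.8+)
import numpy as np

def roc_n_score(ranked_hits, n, total_positives):
    """ROC[n] score of a ranked retrieval list.

    ranked_hits marks each retrieved record True (true positive) or False
    (false positive), in rank order.  The score averages, over the first n
    false positives, the number of true positives ranked above each one,
    normalized by n * total_positives so that perfect retrieval scores 1.
    """
    true_pos, false_pos, area = 0, 0, 0
    for is_hit in ranked_hits:
        if is_hit:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos
            if false_pos == n:
                break
    return area / (n * total_positives)

def poisson_bootstrap_weights(num_records, rng=None):
    """Independent Poisson(1) replication counts, one per database record.

    For a large database these approximate ordinary (multinomial) bootstrap
    resampling, but each record's count is drawn independently of the others.
    """
    rng = rng or np.random.default_rng()
    return rng.poisson(lam=1.0, size=num_records)

def roc_n_difference_p_value(observed_diff, diff_std):
    """Two-sided p-value for a difference of two ROC[n] scores under the
    normal (CLT) approximation, given an estimate of the standard deviation
    of the difference (e.g., from the retrieval lists or from Poisson
    resampling).  Null hypothesis: the two methods retrieve equally well.
    """
    z = observed_diff / diff_std
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

if __name__ == "__main__":
    # Toy retrieval lists for two hypothetical algorithms on the same queries.
    list_a = [True, True, False, True, False, False, True, False]
    list_b = [True, False, True, False, True, False, False, False]
    score_a = roc_n_score(list_a, n=3, total_positives=4)
    score_b = roc_n_score(list_b, n=3, total_positives=4)
    # The standard deviation below is made up purely for illustration.
    print(score_a, score_b, roc_n_difference_p_value(score_a - score_b, 0.05))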