How many samples are needed to build a classifier: a general sequential approach

Wenjiang J Fu; Edward R Dougherty; Bani Mallick; Raymond J Carroll

doi:10.1093/bioinformatics/bth461

Bioinformatics. 2005 Jan 1;21(1):63-70. doi: 10.1093/bioinformatics/bth461. Epub 2004 Aug 5.

Authors

Wenjiang J Fu¹, Edward R Dougherty, Bani Mallick, Raymond J Carroll

Affiliation

¹ Department of Statistics, Texas A&M University, 447 Blocker Building, College Station, TX 77843, USA. wfu@stat.tamu.edu

PMID: 15297303

Abstract

Motivation: The standard paradigm for a classifier design is to obtain a sample of feature-label pairs and then to apply a classification rule to derive a classifier from the sample data. Typically in laboratory situations the sample size is limited by cost, time or availability of sample material. Thus, an investigator may wish to consider a sequential approach in which there is a sufficient number of patients to train a classifier in order to make a sound decision for diagnosis while at the same time keeping the number of patients as small as possible to make the studies affordable.

Results: A sequential classification procedure is studied via the martingale central limit theorem. It updates the classification rule at each step and provides stopping criteria to ensure with a certain confidence that at stopping a future subject will have misclassification probability smaller than a predetermined threshold. Simulation studies and applications to microarray data analysis are provided. The procedure possesses several attractive properties: (1) it updates the classification rule sequentially and thus does not rely on distributions of primary measurements from other studies; (2) it assesses the stopping criteria at each sequential step and thus can substantially reduce cost via early stopping; and (3) it is not restricted to any particular classification rule and therefore applies to any parametric or non-parametric method, including feature selection or extraction.
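The idea of updating the rule at each step and stopping once the misclassification probability appears to be below a threshold can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not give the martingale-based criterion, so this sketch substitutes a one-sided Wilson upper bound on the error rate over a sliding window of predictions made before each sample is added to the training set. The names `NearestCentroid`, `sequential_train`, `simulated_stream`, and `wilson_upper`, the simulated Gaussian data, and all parameter values are assumptions chosen for illustration.

```python
import math
import random
from collections import deque


def wilson_upper(errors, n, z=1.645):
    """One-sided Wilson upper bound on an error rate (z=1.645 ~ 95%)."""
    p = errors / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + half) / denom


class NearestCentroid:
    """Toy 1-D two-class rule; any rule with update/predict could be used."""

    def __init__(self):
        self.sums = {0: 0.0, 1: 0.0}
        self.counts = {0: 0, 1: 0}

    def update(self, x, y):
        self.sums[y] += x
        self.counts[y] += 1

    def ready(self):
        # A prediction needs at least one sample from each class.
        return all(c > 0 for c in self.counts.values())

    def predict(self, x):
        means = {y: self.sums[y] / self.counts[y] for y in (0, 1)}
        return min(means, key=lambda y: abs(x - means[y]))


def simulated_stream(rng, shift=4.0):
    """Hypothetical data source: class 0 ~ N(0, 1), class 1 ~ N(shift, 1)."""
    while True:
        y = rng.randint(0, 1)
        yield rng.gauss(shift * y, 1.0), y


def sequential_train(stream, threshold=0.15, window=60, max_n=2000):
    """Add samples one at a time; stop when the windowed bound < threshold."""
    clf = NearestCentroid()
    recent = deque(maxlen=window)  # 1 = misclassified, 0 = correct
    n = 0
    for x, y in stream:
        if n >= max_n:
            break
        if clf.ready():
            # Score the new sample before it is used for training.
            recent.append(int(clf.predict(x) != y))
        clf.update(x, y)
        n += 1
        if len(recent) == window:
            if wilson_upper(sum(recent), window) < threshold:
                return clf, n, True
    return clf, n, False


clf, n_used, stopped = sequential_train(simulated_stream(random.Random(0)))
print(f"stopped early: {stopped}, samples used: {n_used}")
```

Scoring each sample before training on it keeps the error estimate from being optimistically biased by the sample itself, and the sliding window discards early mistakes made while the rule was still poorly trained. A fixed-sample confidence bound evaluated repeatedly does not control the overall error of the stopping decision, which is the kind of issue a sequential analysis such as the one in the paper is designed to address.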

Availability: R-code for the sequential stopping rule is available at http://stat.tamu.edu/~wfu/microarray/sequential/R-code.html

Publication types

Comparative Study
Evaluation Study
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, P.H.S.
Validation Study

MeSH terms

Algorithms*
Artificial Intelligence*
Breast Neoplasms / metabolism
Cluster Analysis
Humans
Models, Statistical*
Neoplasm Proteins / genetics
Oligonucleotide Array Sequence Analysis / methods*
Pattern Recognition, Automated / methods*
Reproducibility of Results
Sample Size*
Sensitivity and Specificity
Sequence Analysis, DNA / methods*

Substances

Neoplasm Proteins
