Integrative clustering of multi-level omics data for disease subtype discovery using sequential double regularization

Sunghwan Kim; Steffi Oesterreich; Seyoung Kim; Yongseok Park; George C Tseng

doi:10.1093/biostatistics/kxw039

Biostatistics. 2017 Jan;18(1):165-179. doi: 10.1093/biostatistics/kxw039. Epub 2016 Aug 22.

Authors

Sunghwan Kim¹, Steffi Oesterreich², Seyoung Kim³, Yongseok Park⁴, George C Tseng⁴

Affiliations

¹ Department of Biostatistics, University of Pittsburgh, 130 Desoto Street, Pittsburgh, PA 15261, USA and Department of Statistics, Korea University, Anamdong, Seoul 02841, South Korea.
² Magee-Women's Research Institute, 204 Craft Avenue, Pittsburgh, PA 15213, USA.
³ School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA.
⁴ Department of Biostatistics, University of Pittsburgh, 130 Desoto Street, Pittsburgh, PA 15261, USA. yongpark@pitt.edu; ctseng@pitt.edu

Abstract

With the rapid advances in technologies of microarray and massively parallel sequencing, data of multiple omics sources from a large patient cohort are now frequently seen in many consortium studies. Effective multi-level omics data integration has brought new statistical challenges. One important biological objective of such integrative analysis is to cluster patients in order to identify clinically relevant disease subtypes, which will form the basis for tailored treatment and personalized medicine. Several methods have been proposed in the literature for this purpose, including the popular iCluster method used in many cancer applications. When clustering high-dimensional omics data, effective feature selection is critical for better clustering accuracy and biological interpretation. It is also common that a portion of "scattered samples" has patterns distinct from all major clusters and should not be assigned to any cluster, as they may represent a rare disease subcategory or be in transition between disease subtypes. In this paper, we first propose to improve feature selection of the iCluster factor model by an overlapping sparse group lasso penalty on the omics features using prior knowledge of inter-omics regulatory flows. We then perform regularization over samples to allow clustering with scattered samples and generate tight clusters. The proposed group structured tight iCluster method is evaluated on two real breast cancer examples and simulations to demonstrate its improved clustering accuracy, biological interpretation, and ability to generate coherent tight clusters.

Keywords: Group structured lasso; Integrative clustering (iCluster); Penalized EM-algorithm; The Cancer Genome Atlas (TCGA).
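
As an illustration of the overlapping sparse group lasso penalty named in the abstract, the following minimal Python sketch evaluates a generic penalty of that family: an L1 term on all feature weights plus a size-weighted L2 norm for each group, where one feature may belong to several groups. The abstract does not give the exact weighting or group construction used in the paper, so the function name, the sqrt(group size) weights, the tuning parameters `lam_l1` and `lam_group`, and the toy groups are all assumptions made for illustration only.

```python
import math


def overlapping_sparse_group_lasso_penalty(weights, groups, lam_l1, lam_group):
    """Generic penalty: lam_l1 * ||w||_1 + lam_group * sum_g sqrt(|g|) * ||w_g||_2.

    `groups` is a list of index lists; an index may appear in more than one
    group, which is what makes the group structure "overlapping".
    The sqrt(|g|) weighting is a common convention, assumed here.
    """
    # Element-wise sparsity term over all feature weights.
    l1_term = sum(abs(w) for w in weights)

    # Group-wise term: Euclidean norm of the weights in each group,
    # scaled by the square root of the group size.
    group_term = 0.0
    for g in groups:
        norm_g = math.sqrt(sum(weights[j] ** 2 for j in g))
        group_term += math.sqrt(len(g)) * norm_g

    return lam_l1 * l1_term + lam_group * group_term


# Hypothetical toy example: four features, with feature 1 shared by two groups
# (for instance, a feature linked to two regulatory modules).
weights = [3.0, 4.0, 0.0, 0.0]
groups = [[0, 1], [1, 2], [3]]
penalty = overlapping_sparse_group_lasso_penalty(
    weights, groups, lam_l1=1.0, lam_group=1.0
)
```

In a sketch like this, the group term tends to zero out whole groups of features together, while the L1 term allows individual features inside a retained group to be dropped; how the groups are derived from inter-omics regulatory knowledge is described in the paper itself and is not reproduced here.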

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Breast Neoplasms* / classification
Breast Neoplasms* / diagnosis
Breast Neoplasms* / genetics
Cluster Analysis*
Genomics / methods*
Humans
Precision Medicine / methods*
