Objective: To describe the genetic structure of populations in different areas of China, and to explore and evaluate how well different analysis strategies control confounding by population genetic structure in cohort samples. Methods: Using genome-wide association study (GWAS) data on 4 500 samples from 10 areas of the China Kadoorie Biobank (CKB), we performed principal component analysis, extracted the first and second principal components of the samples, plotted them in two dimensions, and compared the plot with each sample's area of origin to characterize the genetic structure of samples from different areas of China. Based on the CKB cohort data, we generated a simulated data set with the characteristics of cohort samples, such as genetic structure differences and extensive kinship, and evaluated the effect of different analysis strategies, including traditional analysis schemes and a mixed linear model, on the inflation factor (λ). Results: There were significant differences in population genetic structure among the areas of China. The distribution of the principal components was basically consistent with the geographical distribution of the project areas: the first principal component corresponded to latitude and the second to longitude. In the simulated data set, direct association analysis gave a high false positive rate (λ=1.16); adjusting for the principal components of genetic structure or performing area-specific subgroup analysis still failed to control λ effectively (λ>1.05). With a mixed linear model that included the kinship matrix as a random effect, λ was effectively controlled (λ=0.99) regardless of whether the principal components were further adjusted for. Conclusions: Genetic structure differs considerably among populations in different areas of China. In molecular epidemiology studies, bias caused by population genetic structure needs to be handled carefully. For large cohort data with complex genetic structure and extensive kinship, a mixed linear model should be used for association analysis.
Keywords: Area differences; Linear mixed model; Molecular epidemiology; Population genetic structure.
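
The abstract's two central quantities, the principal components of genetic structure and the inflation factor λ, can be illustrated with a minimal sketch. It is not the authors' pipeline and does not use CKB data. It assumes a toy simulation of two unrelated subpopulations with hypothetical allele frequencies, and it takes λ as the ratio of the observed median association statistic to the median of a 1-df chi-square distribution (about 0.455), which is the usual definition but is not spelled out in the abstract. All function names are illustrative.

```python
import numpy as np

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def genomic_inflation(chi2_stats):
    """Inflation factor: observed median statistic / expected null median."""
    return float(np.median(chi2_stats) / CHI2_1DF_MEDIAN)

def top_pcs(genotypes, k=2):
    """Top-k principal component scores of a samples-by-SNPs genotype matrix."""
    g = genotypes[:, genotypes.std(axis=0) > 0]   # drop monomorphic SNPs
    z = (g - g.mean(axis=0)) / g.std(axis=0)      # standardize each SNP
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :k] * s[:k]

def residualize(x, covariates):
    """Residuals of x after least-squares regression on intercept + covariates."""
    design = np.column_stack([np.ones(len(covariates)), covariates])
    beta, *_ = np.linalg.lstsq(design, x, rcond=None)
    return x - design @ beta

def assoc_chi2(genotypes, phenotype, n_covariates=0):
    """Per-SNP squared t statistic from simple linear regression (about 1-df chi-square)."""
    g = genotypes - genotypes.mean(axis=0)
    y = phenotype - phenotype.mean()
    r = (g.T @ y) / np.sqrt((g ** 2).sum(axis=0) * (y ** 2).sum())
    dof = len(y) - 2 - n_covariates
    return dof * r ** 2 / (1.0 - r ** 2)

# Toy data (assumption): two unrelated subpopulations whose allele frequencies
# differ, and a phenotype that differs by subpopulation but has no genetic cause.
rng = np.random.default_rng(0)
n_per_group, n_snps = 100, 2000
p_a = rng.uniform(0.1, 0.9, n_snps)
p_b = np.clip(p_a + rng.normal(0.0, 0.1, n_snps), 0.05, 0.95)
geno = np.vstack([
    rng.binomial(2, p_a, (n_per_group, n_snps)),
    rng.binomial(2, p_b, (n_per_group, n_snps)),
]).astype(float)
group = np.repeat([0.0, 1.0], n_per_group)
pheno = 2.0 * group + rng.normal(0.0, 1.0, 2 * n_per_group)

pcs = top_pcs(geno, k=2)

# Naive analysis: every SNP is null, yet structure inflates the statistics.
lam_naive = genomic_inflation(assoc_chi2(geno, pheno))

# PC-adjusted analysis: remove the top two PCs from genotypes and phenotype.
lam_adj = genomic_inflation(
    assoc_chi2(residualize(geno, pcs), residualize(pheno, pcs), n_covariates=2)
)

print(f"lambda naive = {lam_naive:.2f}, lambda PC-adjusted = {lam_adj:.2f}")
```

In this toy setting the samples are unrelated, so adjusting for principal components is enough to pull λ back toward 1. The abstract reports that this was not sufficient in the simulated cohort data with extensive kinship, where a mixed linear model with the kinship matrix as a random effect was needed; that model is not implemented here.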