Population-Scale Clustering and Ethnicity Prediction

Understanding variations in genome sequences helps us identify people who are predisposed to common diseases, cure rare diseases, and find the population group to which an individual belongs within a larger population. Although classical machine learning techniques allow researchers to identify groups (that is, clusters) of related variables, the accuracy and effectiveness of these methods diminish for large, high-dimensional datasets such as the whole human genome.

Deep neural networks (DNNs), on the other hand, form the core of deep learning (DL) and provide algorithms for modeling complex, high-level abstractions in data. They can better exploit large-scale datasets to build complex models.

In this chapter, we apply the K-means algorithm to large-scale genomic data from the 1000 Genomes project, with the aim of clustering genotypic variants at the population scale. We then train an H2O-based DNN model and a Spark-based random forest model to predict geographic ethnicity. The theme of this chapter is "give me your genetic variant data and I will tell you your ethnicity."

Along the way, we will configure H2O so that the same setup can be reused in upcoming chapters. In short, this end-to-end project covers the following topics:

  • Population-scale clustering and geographic ethnicity prediction
  • The 1000 Genomes project, a deep catalog of human genetic variants
  • Algorithms and tools
  • Using K-means for population-scale clustering
  • Using H2O for ethnicity prediction
  • Using random forest for ethnicity prediction