
ADAM for large-scale genomics data processing

Analyzing DNA and RNA sequencing data requires large-scale data processing to interpret the data in its context. Excellent tools and solutions have been developed in academic labs, but they often fall short on scalability and interoperability. ADAM addresses this gap: it is a genomics analysis platform with specialized file formats, built using Apache Avro, Apache Spark, and Apache Parquet.

Large-scale data processing solutions such as ADAM-Spark can be applied directly to the output data of a sequencing pipeline, that is, to the data produced after quality control, mapping, read preprocessing, and variant quantification on single-sample data. Examples of such output are DNA variants for DNA sequencing and read counts for RNA sequencing.

See more at http://bdgenomics.org/ and the related publication: Massie, Matt and Nothaft, Frank et al., ADAM: Genomics Formats and Processing Patterns for Cloud Scale Computing, UCB/EECS-2013-207, EECS Department, University of California, Berkeley.

In our study, ADAM serves as the scalable genomics data analytics platform. Its support for the VCF file format lets us load genotype data as an RDD and then transform that genotype-based RDD into a Spark DataFrame.
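To make the idea of genotype-based records concrete, the following is a minimal sketch in plain Python. It does not use ADAM or Spark, and it is not the book's implementation. It only illustrates how the per-sample genotype columns of a VCF file can be flattened into one row per (sample, variant) pair, which is the tabular shape a DataFrame would hold. The function name `genotype_rows`, the output field names, and the tiny `SAMPLE_VCF` text are all hypothetical and made up for illustration.

```python
from typing import Dict, Iterable, Iterator, List

# Made-up VCF content for illustration only (not real data).
SAMPLE_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2
1\t100\trs1\tA\tG\t50\tPASS\t.\tGT\t0|1\t1|1
1\t200\trs2\tC\tT\t60\tPASS\t.\tGT\t0|0\t0|1
"""


def genotype_rows(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one flat record per (sample, variant) pair from VCF text lines."""
    samples: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, _variant_id, ref, alt = fields[:5]
        gt_index = fields[8].split(":").index("GT")  # locate GT in FORMAT
        for sample, call in zip(samples, fields[9:]):
            genotype = call.split(":")[gt_index]
            alleles = genotype.replace("|", "/").split("/")
            # Count alleles that are neither reference ("0") nor missing (".").
            alt_count = sum(1 for a in alleles if a not in ("0", "."))
            yield {
                "sample": sample,
                "contig": chrom,
                "position": int(pos),
                "ref": ref,
                "alt": alt,
                "genotype": genotype,
                "alt_allele_count": alt_count,
            }


if __name__ == "__main__":
    for row in genotype_rows(SAMPLE_VCF.splitlines()):
        print(row)
```

Each yielded dictionary corresponds to one row of a genotype table. In a Spark setting, records of this shape would be distributed across a cluster rather than generated in a single process.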
