
Getting ready

While NGS is all about big data, there is a limit to how much data I can ask you to download for this book; I believe that 2 to 20 GB for a tutorial is asking too much. The 1,000 Genomes VCF files with realistic annotations are in this order of magnitude, so we will work with much less data here. Fortunately, the bioinformatics community has developed tools that allow for the partial download of data. As part of the SAMtools/htslib package (http://www.htslib.org/), you can download tabix and bgzip, which will take care of data management. On the command line, perform the following:

tabix -fh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/vcf_with_sample_level_annotation/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.vcf.gz 22:1-17000000 | bgzip -c > genotypes.vcf.gz

tabix -p vcf genotypes.vcf.gz

If the preceding link does not work, be sure to check the dataset page at https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Second-Edition/blob/master/Datasets.ipynb for an update.

The first line will download only part of the VCF file for chromosome 22 (positions up to 17 Mbp) of the 1,000 Genomes Project and pipe it to bgzip, which will compress it into genotypes.vcf.gz.
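If you prefer to drive the download from Python rather than from the shell, the same pipeline can be sketched with the standard subprocess module. This is only a minimal sketch: the helper names build_commands and download_region are hypothetical (they are not part of htslib or of the book's code), and it assumes that tabix and bgzip are installed and on your PATH.

```python
import subprocess

# URL and region are copied from the command shown above.
URL = ("ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/"
       "supporting/vcf_with_sample_level_annotation/"
       "ALL.chr22.phase3_shapeit2_mvncall_integrated_v5_extra_anno."
       "20130502.genotypes.vcf.gz")
REGION = "22:1-17000000"


def build_commands(url, region):
    """Return the argument lists for the tabix and bgzip steps (hypothetical helper)."""
    tabix_cmd = ["tabix", "-fh", url, region]  # same flags as the shell command
    bgzip_cmd = ["bgzip", "-c"]                # compress stdin to stdout
    return tabix_cmd, bgzip_cmd


def download_region(url, region, out_path):
    """Run `tabix ... | bgzip -c > out_path` without going through a shell."""
    tabix_cmd, bgzip_cmd = build_commands(url, region)
    with open(out_path, "wb") as out:
        tabix = subprocess.Popen(tabix_cmd, stdout=subprocess.PIPE)
        bgzip = subprocess.Popen(bgzip_cmd, stdin=tabix.stdout, stdout=out)
        tabix.stdout.close()  # bgzip now holds the only read end of the pipe
        bgzip.wait()
        tabix.wait()
    if tabix.returncode != 0 or bgzip.returncode != 0:
        raise RuntimeError("tabix/bgzip pipeline failed")
```

Calling download_region(URL, REGION, "genotypes.vcf.gz") should then produce the same file as the shell pipeline. Passing argument lists instead of a single shell string avoids quoting problems with the long URL.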

The second line will create an index, which we will need for direct access to a section of the genome. As usual, you have the code to do this in a Notebook (the Chapter02/Working_with_VCF.ipynb file).
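To make the region string (22:1-17000000) concrete, here is a small pure-Python sketch of what selecting a region means. It is illustrative only: it scans every line, whereas tabix uses the index to jump straight to the relevant part of the file. The function names and the toy records are made up for this example, and the sketch assumes that region coordinates are 1-based and inclusive.

```python
def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end)."""
    chrom, span = region.split(":")
    start, end = span.replace(",", "").split("-")
    return chrom, int(start), int(end)


def records_in_region(vcf_lines, region):
    """Yield data lines whose CHROM and POS fall inside the region (linear scan)."""
    chrom, start, end = parse_region(region)
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            yield line


# Made-up records for illustration; not taken from the 1,000 Genomes data.
toy_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "22\t16050075\t.\tA\tG\t100\tPASS\t.\n",
    "22\t16999999\t.\tC\tT\t100\tPASS\t.\n",
    "22\t17000001\t.\tG\tA\t100\tPASS\t.\n",
]
hits = list(records_in_region(toy_vcf, "22:1-17000000"))
print(len(hits))  # 2: the record at position 17000001 lies outside the region
```

On a multi-gigabyte file, a scan like this would have to read everything, which is why the index created by the second command matters.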
