Getting ready

While NGS is all about big data, there is a limit to how much I can ask you to download as a dataset for this book. I believe that 2 to 20 GB of data for a tutorial is asking too much. The 1,000 Genomes VCF files with realistic annotations are in this order of magnitude, so we will want to work with much less data here. Fortunately, the bioinformatics community has developed tools that allow for the partial download of data. As part of the SAMtools/htslib package (http://www.htslib.org/), you can download tabix and bgzip, which will take care of data management. On the command line, perform the following:

tabix -fh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/vcf_with_sample_level_annotation/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.vcf.gz 22:1-17000000 | bgzip -c > genotypes.vcf.gz

tabix -p vcf genotypes.vcf.gz

If the preceding link does not work, be sure to check the dataset page at https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Second-Edition/blob/master/Datasets.ipynb for an update.

The first command downloads only part of the chromosome 22 VCF file of the 1,000 Genomes Project (positions up to 17 Mbp) and pipes the result to bgzip, which compresses it.
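If you would rather drive the download step from Python than from the shell, the same tabix-to-bgzip pipe can be sketched with the standard subprocess module. This is only a minimal sketch: it assumes that the tabix and bgzip executables are on your PATH, and the helper names build_tabix_command and download_region are hypothetical, introduced here for illustration rather than taken from the book's Notebook.

```python
import subprocess

# URL copied from the command above; it may change (see the dataset page).
VCF_URL = (
    "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/"
    "supporting/vcf_with_sample_level_annotation/"
    "ALL.chr22.phase3_shapeit2_mvncall_integrated_v5_extra_anno."
    "20130502.genotypes.vcf.gz"
)


def build_tabix_command(url, region):
    """Return the argument list for `tabix -fh <url> <region>` (hypothetical helper)."""
    return ["tabix", "-fh", url, region]


def download_region(url, region, out_path):
    """Stream one region from tabix through `bgzip -c` into out_path.

    Assumption: the tabix and bgzip executables are available on the PATH.
    """
    with open(out_path, "wb") as out:
        tabix = subprocess.Popen(build_tabix_command(url, region),
                                 stdout=subprocess.PIPE)
        bgzip = subprocess.Popen(["bgzip", "-c"],
                                 stdin=tabix.stdout, stdout=out)
        tabix.stdout.close()  # so tabix sees a closed pipe if bgzip exits early
        bgzip_rc = bgzip.wait()
        tabix_rc = tabix.wait()
    if tabix_rc != 0 or bgzip_rc != 0:
        raise RuntimeError(f"pipeline failed: tabix={tabix_rc}, bgzip={bgzip_rc}")


# Example call (not executed here, since it needs network access):
# download_region(VCF_URL, "22:1-17000000", "genotypes.vcf.gz")
```

Connecting the two processes with a pipe, instead of collecting the tabix output in memory first, keeps memory use low even if you later request a larger region.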

The second command creates an index, which we will need for direct access to a section of the genome. As usual, the code to do this is available in a Notebook (the Chapter02/Working_with_VCF.ipynb file).
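Once both commands have finished, a quick sanity check can confirm that the local file only covers the requested region. The following is a minimal sketch that relies on the fact that bgzip output is also a valid gzip stream, so Python's standard gzip module can read it; the function name summarize_vcf is hypothetical and is not part of the book's code.

```python
import gzip


def summarize_vcf(path):
    """Return (record_count, first_pos, last_pos) for a gzip/bgzip-compressed VCF."""
    count, first_pos, last_pos = 0, None, None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information lines and the column header
            pos = int(line.split("\t", 2)[1])  # POS is the second VCF column
            if first_pos is None:
                first_pos = pos
            last_pos = pos
            count += 1
    return count, first_pos, last_pos


# Example call, assuming genotypes.vcf.gz exists in the current directory:
# n_records, first_pos, last_pos = summarize_vcf("genotypes.vcf.gz")
```

For the file created previously, the last position reported should not exceed 17,000,000. This check scans the whole file sequentially and does not use the index.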
