Getting ready

While NGS is all about big data, there is a limit to how much I can ask you to download as a dataset for this book. I believe that 2 to 20 GB of data for a tutorial is asking too much. While the 1000 Genomes Project's VCF files with realistic annotations are of this order of magnitude, we will work with much less data here. Fortunately, the bioinformatics community has developed tools that allow for the partial download of data. As part of the SAMtools/htslib package (http://www.htslib.org/), you can download tabix and bgzip, which will take care of data management. On the command line, perform the following:

tabix -fh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/vcf_with_sample_level_annotation/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.vcf.gz 22:1-17000000 | bgzip -c > genotypes.vcf.gz

tabix -p vcf genotypes.vcf.gz

If the preceding link does not work, be sure to check the dataset page at https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Second-Edition/blob/master/Datasets.ipynb for an update.

The first command will partially download the VCF file for chromosome 22 (up to 17 Mbp) of the 1000 Genomes Project; the -h flag keeps the VCF header in the output. Then, bgzip will compress the result into genotypes.vcf.gz.

The second command will create an index, which we will need for direct access to a section of the genome. As usual, you have the code to do this in a Notebook (the Chapter02/Working_with_VCF.ipynb file).
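
Before moving on, it can be worth checking that the download produced what you expect. The following is a minimal sketch, not part of the recipe's Notebook, that uses only Python's standard gzip module (bgzip output is gzip-compatible). The function name summarize_vcf is hypothetical; the only assumption about the data is that the file follows the usual VCF layout, with header lines starting with # and tab-separated records whose first two columns are the chromosome and the position.

```python
import gzip


def summarize_vcf(path):
    """Count header and record lines and report the chromosomes and
    position range found in a gzip- or bgzip-compressed VCF file."""
    header_lines = 0
    records = 0
    chromosomes = set()
    min_pos = None
    max_pos = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            if line.startswith("#"):
                header_lines += 1  # meta-information or column header line
                continue
            # Only the first two tab-separated columns are needed here
            chrom, pos = line.split("\t", 2)[:2]
            pos = int(pos)
            chromosomes.add(chrom)
            min_pos = pos if min_pos is None else min(min_pos, pos)
            max_pos = pos if max_pos is None else max(max_pos, pos)
            records += 1
    return {
        "header_lines": header_lines,
        "records": records,
        "chromosomes": sorted(chromosomes),
        "min_pos": min_pos,
        "max_pos": max_pos,
    }
```

Calling summarize_vcf("genotypes.vcf.gz") after the download should, if everything went well, list only chromosome 22 and a maximum position no greater than 17,000,000. This reads the whole file sequentially, so it does not rely on the tabix index.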
