Getting ready

While NGS is all about big data, there is a limit to how much data I can ask you to download for this book; 2 to 20 GB for a single tutorial is asking too much. The 1000 Genomes Project VCF files with realistic annotations are in that order of magnitude, so we will work with much less data here. Fortunately, the bioinformatics community has developed tools that allow for the partial download of data. As part of the SAMtools/htslib package (http://www.htslib.org/), you can download tabix and bgzip, which will take care of data management. On the command line, perform the following:

tabix -fh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/vcf_with_sample_level_annotation/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.vcf.gz 22:1-17000000 | bgzip -c > genotypes.vcf.gz

tabix -p vcf genotypes.vcf.gz

If the preceding link does not work, be sure to check the dataset page at https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Second-Edition/blob/master/Datasets.ipynb for an update.

The first line downloads only part of the chromosome 22 VCF file of the 1000 Genomes Project (positions up to 17 Mbp); the -h flag keeps the VCF header in the output. The result is piped to bgzip, which compresses it into genotypes.vcf.gz.
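If you prefer to drive the download from Python, the same pipeline can be sketched with the standard subprocess module. This is only a minimal sketch: build_commands and partial_download are hypothetical helper names, not part of the book's Notebook, and the code assumes that tabix and bgzip are on your PATH. The URL and region are copied from the command line shown earlier.

```python
import shutil
import subprocess

# URL and region copied from the shell command; the URL is split only for readability.
VCF_URL = (
    "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/"
    "supporting/vcf_with_sample_level_annotation/"
    "ALL.chr22.phase3_shapeit2_mvncall_integrated_v5_extra_anno."
    "20130502.genotypes.vcf.gz"
)
REGION = "22:1-17000000"


def build_commands(url=VCF_URL, region=REGION):
    """Return the tabix and bgzip argument lists used by the shell pipeline."""
    return ["tabix", "-fh", url, region], ["bgzip", "-c"]


def partial_download(out_path="genotypes.vcf.gz", url=VCF_URL, region=REGION):
    """Hypothetical helper: run `tabix ... | bgzip -c > out_path`, then index it."""
    for tool in ("tabix", "bgzip"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    tabix_cmd, bgzip_cmd = build_commands(url, region)
    with open(out_path, "wb") as out:
        tabix = subprocess.Popen(tabix_cmd, stdout=subprocess.PIPE)
        bgzip = subprocess.Popen(bgzip_cmd, stdin=tabix.stdout, stdout=out)
        tabix.stdout.close()  # only bgzip should hold the read end of the pipe
        bgzip_rc = bgzip.wait()
        tabix_rc = tabix.wait()
    if tabix_rc != 0 or bgzip_rc != 0:
        raise RuntimeError("partial download failed; check the URL and region")
    # Equivalent of the second shell command: tabix -p vcf genotypes.vcf.gz
    subprocess.run(["tabix", "-p", "vcf", out_path], check=True)


# Usage (performs the actual download, so it is not run automatically):
# partial_download()
```

Nothing is downloaded until you call partial_download() yourself; build_commands only assembles the argument lists, which makes it easy to inspect them or swap in a different region.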

The second line creates a tabix index for genotypes.vcf.gz, which we will need for direct access to a region of the genome. As usual, you have the code to do this in a Notebook (the Chapter02/Working_with_VCF.ipynb file).
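Before moving on, you may want to confirm that the partial download looks sane. The following is a minimal sketch that uses only the standard library; it relies on bgzip output being readable as ordinary gzip. The summarize_vcf function name is hypothetical, and the code assumes that the file is called genotypes.vcf.gz, as in the preceding commands.

```python
import gzip


def summarize_vcf(path="genotypes.vcf.gz"):
    """Count header lines and records, and report the range of positions seen."""
    n_header = 0
    n_records = 0
    chromosomes = set()
    min_pos = None
    max_pos = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                n_header += 1  # meta-information lines and the column header
                continue
            # The first two tab-separated VCF columns are CHROM and POS.
            chrom, pos = line.split("\t", 2)[:2]
            pos = int(pos)
            chromosomes.add(chrom)
            min_pos = pos if min_pos is None else min(min_pos, pos)
            max_pos = pos if max_pos is None else max(max_pos, pos)
            n_records += 1
    return {
        "header_lines": n_header,
        "records": n_records,
        "chromosomes": chromosomes,
        "min_pos": min_pos,
        "max_pos": max_pos,
    }


# Usage, once the download has finished:
# print(summarize_vcf("genotypes.vcf.gz"))
```

For the file produced here, you would expect a single chromosome (22), a nonzero number of header lines, and a maximum position no larger than 17,000,000.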
