- Bioinformatics with Python Cookbook
- Tiago Antao
- 2021-06-10 19:01:48
Getting ready
While NGS is all about big data, there is a limit to how much I can ask you to download as a dataset for this book. I believe that 2 to 20 GB of data for a tutorial is asking too much. The 1,000 Genomes Project's VCF files with realistic annotations are of this order of magnitude, so we will work with much less data here. Fortunately, the bioinformatics community has developed tools that allow for the partial download of data. As part of the SAMtools/htslib package (http://www.htslib.org/), you can download tabix and bgzip, which will take care of data management. On the command line, run the following:
tabix -fh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/vcf_with_sample_level_annotation/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.vcf.gz 22:1-17000000 | bgzip -c > genotypes.vcf.gz
tabix -p vcf genotypes.vcf.gz
If the preceding link does not work, be sure to check the dataset page at https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Second-Edition/blob/master/Datasets.ipynb for an update.
The first line will download only part of the VCF file for chromosome 22 of the 1,000 Genomes Project (the region 22:1-17000000, that is, up to 17 Mbp). Then, bgzip will compress it into genotypes.vcf.gz.
The second line will create an index, which we will need for direct access to a section of the genome. As usual, you have the code to do this in a Notebook (Chapter02/Working_with_VCF.ipynb file).
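Before moving on, it can be worth checking that the partial download actually produced a usable file. The following is a minimal sketch, not the recipe's own code: the function name summarize_vcf is hypothetical, and it assumes that genotypes.vcf.gz sits in the current directory. It relies only on the fact that bgzip output is gzip-compatible, so the standard library can read it without any extra packages.

```python
import gzip
import os


def summarize_vcf(path):
    """Count header lines and variant records, and collect the contigs seen.

    Hypothetical helper for a quick sanity check; it is not part of tabix,
    bgzip, or the book's Notebook.
    """
    n_header = 0
    n_records = 0
    contigs = set()
    # bgzip writes gzip-compatible blocks, so gzip.open can stream the file.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                # Meta-information lines (##...) and the #CHROM column header
                n_header += 1
            elif line.strip():
                n_records += 1
                # The first tab-separated column of a record is the contig name
                contigs.add(line.split("\t", 1)[0])
    return n_header, n_records, contigs


if __name__ == "__main__":
    # Assumed file name, taken from the commands above
    vcf_path = "genotypes.vcf.gz"
    if os.path.exists(vcf_path):
        n_header, n_records, contigs = summarize_vcf(vcf_path)
        print(f"{n_header} header lines, {n_records} records, contigs: {sorted(contigs)}")
    else:
        print(f"{vcf_path} not found; run the tabix/bgzip commands first")
```

If the download worked as described, the only contig reported should be 22, and the header count should be nonzero because the -h flag asks tabix to include the header in its output.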