- Bioinformatics with Python Cookbook
- Tiago Antao
- 2021-06-10 19:01:48
Getting ready
While NGS is all about big data, there is a limit to how much I can ask you to download as a dataset for this book. I believe that 2 to 20 GB of data for a tutorial is asking too much. While the 1000 Genomes Project's VCF files with realistic annotations are of this order of magnitude, we will want to work with much less data here. Fortunately, the bioinformatics community has developed tools that allow for the partial download of data. As part of the SAMtools/htslib package (http://www.htslib.org/), you can download tabix and bgzip, which will take care of data management. On the command line, perform the following:
tabix -fh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/vcf_with_sample_level_annotation/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.vcf.gz 22:1-17000000 | bgzip -c > genotypes.vcf.gz
tabix -p vcf genotypes.vcf.gz
If the preceding link does not work, be sure to check the dataset page at https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Second-Edition/blob/master/Datasets.ipynb for an update.
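The first line will partially download the VCF file for chromosome 22 of the 1000 Genomes Project: only the region 22:1-17000000 (that is, up to 17 Mbp) is retrieved, and the -h option keeps the VCF header in the output. Then, bgzip will compress it into genotypes.vcf.gz.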
The second line will create an index, which we will need for direct access to a section of the genome. As usual, you have the code to do this in a Notebook (Chapter02/Working_with_VCF.ipynb file).
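Once the download has finished, a quick sanity check can confirm that the file contains what the commands above asked for. The following is a minimal sketch, not part of the recipe's Notebook: the helper name summarize_vcf and its max_records parameter are hypothetical, and it relies only on the fact that bgzip output is gzip-compatible, so Python's standard gzip module can read it.

```python
import gzip


def summarize_vcf(path, max_records=None):
    """Return (record_count, chromosomes, max_position) for a gzip/bgzip VCF.

    Hypothetical helper for illustration; max_records optionally stops early.
    """
    n_records = 0
    chromosomes = set()
    max_pos = 0
    # bgzip writes a series of gzip members, which gzip.open reads transparently
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Skip blank lines and header lines (## metadata and the #CHROM row)
            if not line.strip() or line.startswith("#"):
                continue
            # The first two tab-separated VCF columns are CHROM and POS
            chrom, pos = line.split("\t", 2)[:2]
            chromosomes.add(chrom)
            max_pos = max(max_pos, int(pos))
            n_records += 1
            if max_records is not None and n_records >= max_records:
                break
    return n_records, chromosomes, max_pos
```

Calling summarize_vcf("genotypes.vcf.gz") should, if the partial download went as described, report only chromosome 22 and a maximum position no greater than 17000000. Passing a small max_records value gives a faster check that looks only at the first few records.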