- Bioinformatics with Python Cookbook
- Tiago Antao
- 2021-06-10 19:01:48
Getting ready
While NGS is all about big data, there is a limit to how much I can ask you to download as a dataset for this book. I believe that 2 to 20 GB of data for a tutorial is asking too much. The 1000 Genomes Project's VCF files with realistic annotations are of this order of magnitude, so we will want to work with much less data here. Fortunately, the bioinformatics community has developed tools that allow for the partial download of data. The SAMtools/htslib package (http://www.htslib.org/) includes tabix and bgzip, which will take care of data management. On the command line, perform the following:
tabix -fh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/vcf_with_sample_level_annotation/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.vcf.gz 22:1-17000000 | bgzip -c > genotypes.vcf.gz
tabix -p vcf genotypes.vcf.gz
If the preceding link does not work, be sure to check the dataset page at https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Second-Edition/blob/master/Datasets.ipynb for an update.
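The first command partially downloads the VCF file for chromosome 22 of the 1000 Genomes Project (only positions up to 17 Mbp) and, because of the -h flag, keeps the VCF header in the output. That output is piped to bgzip, which compresses it into genotypes.vcf.gz.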
The second command creates a tabix index for the compressed file, which we will need for direct access to a section of the genome. As usual, you have the code to do this in a Notebook (the Chapter02/Working_with_VCF.ipynb file).
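Before moving on, it can be worth checking that the partial download looks sane. The following is a minimal sketch using only the Python standard library, not code from the book's Notebook. It relies on bgzip output being readable as ordinary gzip data, and the function name summarize_vcf and its max_pos parameter are hypothetical names chosen for this illustration. It counts header and record lines and checks that no record starts beyond the 17 Mbp limit requested above.

```python
import gzip


def summarize_vcf(path, max_pos=17_000_000):
    """Summarize a gzip/bgzip-compressed VCF file.

    Returns a tuple (n_header, n_records, chroms, last_pos) and raises
    ValueError if any record starts beyond max_pos.
    """
    n_header = 0
    n_records = 0
    chroms = set()
    last_pos = None
    # bgzip writes a series of standard gzip blocks, so gzip.open can read it
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                # Meta-information lines (##) and the column header line (#CHROM)
                n_header += 1
                continue
            # The first two tab-separated VCF columns are CHROM and POS
            chrom, pos = line.split("\t", 2)[:2]
            pos = int(pos)
            if pos > max_pos:
                raise ValueError(f"record at {chrom}:{pos} is beyond {max_pos}")
            chroms.add(chrom)
            last_pos = pos
            n_records += 1
    return n_header, n_records, chroms, last_pos
```

Calling summarize_vcf("genotypes.vcf.gz") on the file created above should report a single chromosome name and a last position no greater than 17,000,000. If it raises an error or reports zero records, the download probably did not complete as intended.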