官术网_书友最值得收藏!

Getting ready

In the best-case scenario, you have a VCF file with proper filters applied. If this is the case, you can just go ahead and use your file. Note that all VCF files will have a FILTER column, but this might not mean that all of the proper filters were applied. You have to be sure that your data is properly filtered.

In the second case, which is one of the most common, your file will have unfiltered data, but you'll have enough annotations and can apply hard filters (there is no need for programmatic filtering). If you have a GATK annotated file, refer to http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set.

In the third case, you have a VCF file that has all the annotations that you need, but you may want to apply more flexible filters (for example, "if read depth > 20, accept if mapping quality > 30; otherwise, accept if mapping quality > 40").

In the fourth case, your VCF file does not have all the necessary annotations and you have to revisit your BAM files (or even other sources of information). In this case, the best solution is to find whatever extra information you can find and create a new VCF file with the required annotations. Some genotype callers (such as GATK) allow you to specify which annotations you want; you may also want to use extra programs to provide more annotations. For example, SnpEff (http://snpeff.sourceforge.net/) will annotate your SNPs with predictions of their effect (for example, if they are in exons, are they coding or non-coding?).

It's impossible to provide a clear-cut recipe. It will vary with your type of sequencing data, your species of study, and your tolerance to errors, among other variables. What we can do is provide a set of typical analysis that is done for high-quality filtering.

In this recipe, we will not use data from the Human 1,000 Genomes Project. We want dirty unfiltered data that has a lot of common annotations that can be used to filter it. We will use data from the Anopheles gambiae 1,000 Genomes Project (Anopheles is the mosquito vector involved in the transmission of the parasite that causes malaria), which makes filtered and unfiltered data available. You can find more information on this project at http://www.malariagen.net/projects/vector/ag1000g.

We will get a part of the centromere of chromosome 3L for around 100 mosquitoes, which is followed by a part somewhere in the middle of this chromosome (and index both):

tabix -fh ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/preview/ag1000g.AC.phase1.AR1.vcf.gz 3L:1-200000 |bgzip -c > centro.vcf.gz
tabix -fh ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/preview/ag1000g.AC.phase1.AR1.vcf.gz 3L:21000001-21200000 |bgzip -c > standard.vcf.gz

tabix -p vcf centro.vcf.gz

tabix -p vcf standard.vcf.gz

If the links do not work, be sure to check https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Second-Edition/blob/master/Datasets.ipynb for updates. As usual, the code for downloading this data is available on the Chapter02/Filtering_SNPs.ipynb Notebook.

Finally, a word of warning on this recipe: the level of Python here will be slightly more complicated than usual. The more general code we will write will be easier to reuse for your specific case. We will use functional programming techniques (lambda functions) and the partial function application extensively.

主站蜘蛛池模板: 磐石市| 亚东县| 庆元县| 凌源市| 遂川县| 那坡县| 和静县| 江源县| 开化县| 蒙城县| 临沭县| 洛阳市| 庆元县| 肇东市| 涿鹿县| 图木舒克市| 德州市| 鄢陵县| 丽水市| 穆棱市| 海丰县| 梅河口市| 仪征市| 洞头县| 黄大仙区| 安多县| 威海市| 固原市| 东乌珠穆沁旗| 京山县| 马公市| 乌拉特后旗| 克什克腾旗| 磐石市| 托克逊县| 南皮县| 延安市| 温州市| 城口县| 阿勒泰市| 辰溪县|