官术网_书友最值得收藏!

How to do it...

Take a look at the following steps:

  1. Let's start by inspecting the information that we can get per record:
import vcf
v = vcf.Reader(filename='genotypes.vcf.gz')
print('Variant Level information')
infos = v.infos
for info in infos:
print(info)
print('Sample Level information')
fmts = v.formats
for fmt in fmts:
print(fmt)

We start by inspecting the annotations that are available for each record (remember that each record encodes a variant, such as SNP, CNV, INDELs, and so on, and the state of that variant per sample). At the variant (record) level, we find AC: the total number of ALT alleles in called genotypes, AF: the estimated allele frequency, NS: the number of samples with data, AN: the total number of alleles in called genotypes, and DP: the total read depth, among others. There are others, but they are mostly specific to the 1,000 Genomes Project (here, we will try to be as general as possible). Your own dataset may have more annotations (or none of these).

At the sample level, there are only two annotations in this file: GT—genotype, and DP—the per sample read depth. You have the per variant (total) read depth and the per sample read depth; be sure not to confuse both.

  1. Now that we know which information is available, let's inspect a single VCF record:
v = vcf.Reader(filename='genotypes.vcf.gz')
rec = next(v)
print(rec.CHROM, rec.POS, rec.ID, rec.REF, rec.ALT, rec.QUAL, rec.FILTER)
print(rec.INFO)
print(rec.FORMAT)
samples = rec.samples
print(len(samples))
sample = samples[0]
print(sample.called, sample.gt_alleles, sample.is_het, sample.phased)
print(int(sample['DP']))

We will start by retrieving the standard information: the chromosome, position, identifier, reference base, (typically just one), alternative bases (you can have more than one, but it's not uncommon as a first filtering approach to only accept a single ALT, for example, only accept biallelic SNPs), quality (as you might expect, Phred-scaled), and filter status. Regarding the filter status, remember that whatever the VCF file says, you may still want to apply extra filters (as in the next recipe).

We then print the additional variant-level information (AC, AS, AF, AN, DP, and so on), followed by the sample format (in this case, DP and GT). Finally, we count the number of samples and inspect a single sample to check whether it was called for this variant. Also, the reported alleles, heterozygosity, and phasing status (this dataset happens to be phased, which is not that common) are included.

  1. Let's check the type of variant and the number of nonbiallelic SNPs in a single pass:
from collections import defaultdict
f = vcf.Reader(filename='genotypes.vcf.gz')
my_type = defaultdict(int)
num_alts = defaultdict(int)
for rec in f:
my_type[rec.var_type, rec.var_subtype] += 1
if rec.is_snp:
num_alts[len(rec.ALT)] += 1
print(num_alts)
print(my_type)

We will now use the now common Python default dictionary. We find that this dataset has INDELs (both insertions and deletions), CNVs, and, of course, SNPs (roughly two-thirds being transitions with one-third transversions). There is a residual number (79) of tri-allelic SNPs.

主站蜘蛛池模板: 灵寿县| 蒲江县| 达孜县| 双牌县| 体育| 潜山县| 施甸县| 凯里市| 理塘县| 沅陵县| 长乐市| 斗六市| 东平县| 长海县| 玛沁县| 六枝特区| 岳阳县| 道真| 兴义市| 河津市| 钟山县| 保德县| 新野县| 崇信县| 长岛县| 定襄县| 汶上县| 湘乡市| 讷河市| 平潭县| 剑河县| 木里| 阜城县| 永川市| 右玉县| 霍州市| 临江市| 江都市| 从化市| 永和县| 临猗县|