官术网_书友最值得收藏!

How to do it...

Take a look at the following steps:

  1. We will start by setting up a reader for our file. Remember that this file has already been supplied to you, and should be in your current work directory:
from collections import defaultdict
import re
import HTSeq

lct_bed = HTSeq.BED_Reader('LCT.bed')
  1. We are now going to extract all the types of features via its name:
feature_types = defaultdict(int)
for rec in lct_bed:
last_rec = rec
feature_types[re.search('([A-Z]+)', rec.name).group(0)] += 1
print(feature_types)

Remember that this code is specific to our example. You will have to adapt it to your case.

You will find that the preceding code uses a regular expression. Be careful with regular expressions, as they tend to generate read-only code that is difficult to maintain. You might have better alternatives. In any case, regular expressions exist and you will find them from time to time.

The output for our case is as follows:

defaultdict(<class 'int'>, {'ENSE': 27, 'NM': 17, 'CCDS': 17})
  1. We stored the last record so that we can inspect it:
print(last_rec)
print(last_rec.name)
print(type(last_rec))
interval = last_rec.iv
print(interval)
print(type(interval))

There are many fields available, most notably name and interval. For the preceding code, the output is as follows:

<GenomicFeature: BED line 'CCDS2178.11' at 2: 135788543 -> 135788322 (strand '-')>
CCDS2178.11
<class 'HTSeq.GenomicFeature'>
2:[135788323,135788544)/-
<class 'HTSeq._HTSeq.GenomicInterval'>
  1. Let's dig deeper into the interval:
print(interval.chrom, interval.start, interval.end)
print(interval.strand)
print(interval.length)
print(interval.start_d)
print(interval.start_as_pos)
print(type(interval.start_as_pos))

The output is as follows:

2 135788323 135788544
-
221
135788543
2:135788323/-
<class 'HTSeq._HTSeq.GenomicPosition'>

Note the genomic position (chromosome, start and end). The most complex issue is how to deal with the strand. If the feature is coded in the negative strand, you have to be careful with processing. HTSeq offers the start_d and end_d fields to help you with this (that is, they will be reversed with regards to the start and end if the strand is negative).

  1. Finally, let's extract some statistics from our coding regions (CCDS records). We will use CCDS, since it's probably than the better curated database here:
exon_start = None
exon_end = None
sizes = []
for rec in lct_bed:
if not rec.name.startswith('CCDS'):
continue
interval = rec.iv
exon_start = min(interval.start, exon_start or interval.start)
exon_end = max(interval.length, exon_end or interval.end)
sizes.append(interval.length)
sizes.sort()
print("Num exons: %d / Begin: %d / End %d" % (len(sizes), exon_start, exon_end))
print("Smaller exon: %d / Larger exon: %d / Mean size: %.1f" % (sizes[0], sizes[-1], sum(sizes)/len(sizes)))

The output should be self-explanatory:

Num exons: 17 / Begin: 135788323 / End 135837169
Smaller exon: 79 / Larger exon: 1551 / Mean size: 340.2
主站蜘蛛池模板: 芒康县| 洪江市| 叶城县| 乐昌市| 西吉县| 黄浦区| 乌兰浩特市| 绥化市| 卓尼县| 庐江县| 灵台县| 固阳县| 胶南市| 剑川县| 萝北县| 荥经县| 玛曲县| 武强县| 泸水县| 海晏县| 翁牛特旗| 齐齐哈尔市| 仪陇县| 花垣县| 嘉禾县| 永德县| 天津市| 涟源市| 诸暨市| 广西| 景谷| 新兴县| 铜鼓县| 嫩江县| 禄丰县| 垦利县| 安福县| 青铜峡市| 西华县| 七台河市| 昌江|