- Bioinformatics with Python Cookbook
- Tiago Antao
- 1044字
- 2021-06-10 19:01:51
How to do it...
Let's take a look at the following steps:
- We will start by inspecting the description of all of the the sequences on the reference genome FASTA file:
from Bio import SeqIO
genome_name = 'PlasmoDB-9.3_Pfalciparum3D7_Genome.fasta'
recs = SeqIO.parse(genome_name, 'fasta')
for rec in recs:
print(rec.description)
This code should look familiar from the previous chapter, Chapter 2, Next-Generation Sequencing. Let's take a look at part of the output:

Different genome references will have different description lines, but they will generally have important information. In this example, you can see that we have chromosomes, mitochondria, and apicoplast. We can also view chromosome sizes, but we will take the value from the sequence length instead.
- Let's parse the description line to extract the chromosome number. We will retrieve the chromosome size from the sequence, and compute the GC content across chromosomes on a window basis:
from Bio import SeqUtils
recs = SeqIO.parse(genome_name, 'fasta')
chrom_sizes = {}
chrom_GC = {}
block_size = 50000
min_GC = 100.0
max_GC = 0.0
for rec in recs:
if rec.description.find('SO=chromosome') == -1:
continue
chrom = int(rec.description.split('_')[1])
chrom_GC[chrom] = []
size = len(rec.seq)
chrom_sizes[chrom] = size
num_blocks = size // block_size + 1
for block in range(num_blocks):
start = block_size * block
if block == num_blocks - 1:
end = size
else:
end = block_size + start + 1
block_seq = rec.seq[start:end]
block_GC = SeqUtils.GC(block_seq)
if block_GC < min_GC:
min_GC = block_GC
if block_GC > max_GC:
max_GC = block_GC
chrom_GC[chrom].append(block_GC)
print(min_GC, max_GC)
We have performed a windowed analysis of all chromosomes, similar to what you have seen in the previous chapter, Chapter 2, Next-Generation Sequencing. We started by defining a window size of 50 kbp. This is appropriate for P. falciparum (feel free to vary its size), but you will want to consider other values for genomes with chromosomes that are orders of magnitude different from this.
Note that we are re-reading the file. With such a small genome, it would have been feasible (in step one) to do an in-memory load of the whole genome. By all means, feel free to try this programming style for small genomes – it's faster! However, this code is more generalized for larger genomes.
- Note that in the for loop, we ignore the mitochondrion and apicoplast by parsing the SO entry to the description. The chrom_sizes dictionary will maintain the size of chromosomes.
The chrom_GC dictionary is our most interesting data structure and will have a list of a fraction of the GC content for each 50 kbp window. So, for chromosome 1, which has a size of 640,851 bp, there will be 14 entries because this chromosome size has 14 blocks of 50 kbp.
- Now, let's perform a genome plot of the GC distribution. We will use shades of blue for the GC content. However, for high outliers, we will use shades of red. For low outliers, we will use shades of yellow:
from reportlab.lib import colors
from reportlab.lib.units import cm
from Bio.Graphics import BasicChromosome
chroms = list(chrom_sizes.keys())
chroms.sort()
biggest_chrom = max(chrom_sizes.values())
my_genome = BasicChromosome.Organism(output_format="png")
my_genome.page_size = (29.7*cm, 21*cm)
telomere_length = 10
bottom_GC = 17.5
top_GC = 22.0
for chrom in chroms:
chrom_size = chrom_sizes[chrom]
chrom_representation = BasicChromosome.Chromosome ('Cr %d' % chrom)
chrom_representation.scale_num = biggest_chrom
tel = BasicChromosome.TelomereSegment()
tel.scale = telomere_length
chrom_representation.add(tel)
num_blocks = len(chrom_GC[chrom])
for block, gc in enumerate(chrom_GC[chrom]):
my_GC = chrom_GC[chrom][block]
body = BasicChromosome.ChromosomeSegment()
if my_GC > top_GC:
body.fill_color = colors.Color(1, 0, 0)
elif my_GC < bottom_GC:
body.fill_color = colors.Color(1, 1, 0)
else:
my_color = (my_GC - bottom_GC) / (top_GC -bottom_GC)
body.fill_color = colors.Color(my_color,my_color, 1)
if block < num_blocks - 1:
body.scale = block_size
else:
body.scale = chrom_size % block_size
chrom_representation.add(body)
tel = BasicChromosome.TelomereSegment(inverted=True)
tel.scale = telomere_length
chrom_representation.add(tel)
my_genome.add(chrom_representation)
my_genome.draw('falciparum.png', 'Plasmodium falciparum')
The first line converts the return of the keys method to a list. This was redundant in Python 2, but not in Python 3, where the keys method has a specific dict_keys return type.
We will draw the chromosomes in order (hence the sort). We will need the size of the biggest chromosome (14, in P. falciparum) to make sure that the size of chromosomes is printed with the correct scale (the biggest_chrom variable).
We then create an A4-sized representation of an organism with a PNG output. Note that we will draw very small telomeres of 10 bp. This will produce a rectangular-like chromosome. You can make the telomeres bigger, giving it a roundish representation, or you may have the arguably better idea of using the correct telomere size for your species.
We declare that anything with a GC content below 17.5 percent or above 22.0 percent will be considered an outlier. Remember that for most other species, this will be much higher.
We then print these chromosomes: they are bounded by telomeres and composed of 50 kbp chromosome segments (the last segment is sized with the remainder). Each segment will be colored in blue, with a red-green component based on the linear normalization between two outlier values. Each chromosome segment will either be 50 kbp, or potentially smaller if it's the last one of the chromosome. The output is shown in the following diagram:

Figure 1: The 14 chromosomes of Plasmodium falciparum, color-coded with the GC content (red is more than 22 percent, yellow less than 17 percent, and the blue shades represent a linear gradient between both numbers)
- Finally, you can print the image inline in the Notebook:
from IPython.core.display import Image
Image("falciparum.png")
- 黑客攻防從入門到精通(實戰秘笈版)
- Mastering Adobe Captivate 2017(Fourth Edition)
- PHP 7底層設計與源碼實現
- Practical Windows Forensics
- 人人都懂設計模式:從生活中領悟設計模式(Python實現)
- Access 2010數據庫應用技術(第2版)
- 深度學習:Java語言實現
- 計算機應用基礎案例教程
- 微信小程序開發與實戰(微課版)
- Regression Analysis with Python
- HTML5權威指南
- Java圖像處理:基于OpenCV與JVM
- 從零開始學Android開發
- MyBatis 3源碼深度解析
- HTML+CSS+JavaScript網頁制作:從入門到精通(第4版)