Introduction

Next-generation sequencing (NGS) is one of the fundamental technological developments of the decade in the life sciences. Whole genome sequencing (WGS), RAD-Seq, RNA-Seq, ChIP-Seq, and several other technologies are routinely used to investigate important biological problems. These are also called high-throughput sequencing technologies, and with good reason: they generate vast amounts of data that need to be processed. NGS is the main reason that computational biology has become a big-data discipline. More than anything else, this is a field that requires strong bioinformatics techniques.

Here, we will not discuss each individual NGS technique per se (that would require a whole book of its own). We will use an existing WGS dataset and the 1000 Genomes Project to illustrate the most common steps necessary to analyze genomic data. The recipes presented here are easily applicable to other genomic sequencing approaches, and some of them can also be used for transcriptomic analysis (for example, RNA-Seq). The recipes are also species-independent, so you will be able to apply them to any other species for which you have sequence data. The biggest differences in processing data from different species relate to genome size, diversity, and the quality of the assembled genome (if one exists for your species). These will not affect the automated Python part of NGS processing much. In any case, we will discuss different genomes in the next chapter, Chapter 3, Working with Genomes.

As this is not an introductory book, you are expected to know at least what FASTA, FASTQ, Binary Alignment Map (BAM), and Variant Call Format (VCF) files are. I will also use basic genomic terminology without introducing it (such as exomes, nonsynonymous mutations, and so on). You are required to be familiar with basic Python. We will build on this knowledge to introduce the fundamental Python libraries for NGS analysis. Here, we will follow the flow of a standard bioinformatics pipeline.

However, before we delve into real data from a real project, let's get comfortable with accessing existing genomic databases and basic sequence processing: a simple start before the storm.
