Dealing with the GC-content bias in second-generation DNA sequence data
主 题: Dealing with the GC-content bias in second-generation DNA sequence data
报告人: Prof. Terry Speed ( UC Berkeley)
时 间: 2012-05-24 14:00-15:00
地 点: 理科一号楼1114
GC-content bias describes the dependence between fragment count (read coverage) and GC content found in high-throughput sequencing assays, particularly the Illumina Genome Analyzer technology. For analyses that focus on measuring fragment abundance within a genome, this bias can dominate the signal of interest. There is no consensus as to the source or shape of the bias; current methods to remove it do not assume a knowledge of the curve shape or scale. In this work we analyze regularities in the GC-bias patterns, and find a compact description for this curve family. It is the GC content of the full DNA fragment, not only the sequenced read, that influences fragment counts. This GC effect is unimodal: both GC rich fragments and AT rich fragments are under-represented in the sequencing results. Moreover, the size of the fragment may interact with the shape and peak of the GC curve. Based on these findings, we propose a new method to calculate expected coverage. This is the first single-bp GC correction and accommodates library, strand, and fragment lengths information, as well as non-uniform bin sizes. We show that it outperforms current approaches in copy-number estimation tasks. These GC-modeling considerations can inform other high-throughput sequencing analyses, such as ChIP-seq and RNA-seq, and illuminate possible causes for the GC-content bias.