Quality Scores for Next Generation Sequencing

Introduction

A next-generation sequencing experiment consists of a series of discrete steps, each of which contributes to the overall quality of a data set. Sequencing quality metrics can provide important information about the accuracy of each step in this process, including library preparation, base calling, read alignment, and variant calling. Base calling accuracy, measured by the Phred quality score (Q score), is the most common metric used to assess the accuracy of a sequencing platform. It indicates the probability that a given base is called incorrectly by the sequencer.

Phred was originally developed to assess Sanger sequencing accuracy. It began as an algorithmic approach that considered Sanger sequencing metrics, such as peak resolution and shape, and linked them to known sequence accuracy through large multivariate lookup tables. This method proved to be highly accurate across a range of sequencing chemistries and instruments, making it the quality scoring standard for commercial sequencing technologies. While next-generation sequencing metrics differ from those of Sanger sequencing (e.g., there are no electropherogram peak heights), the process of generating a Phred quality scoring scheme is largely the same. Parameters relevant to a particular sequencing chemistry are analyzed for a large empirical data set of known accuracy. The resulting quality score lookup tables are then used to calculate quality scores for de novo next-generation sequencing data (in real time on Illumina platforms). These scores have the same meaning as the historical metrics familiar to most Sanger sequencing users.

Calculating Phred Quality Scores

A Q score is defined as a quantity that is logarithmically related to the base calling error probability (P):

Q = −10 log10(P)
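
The relationship above can be sketched in a few lines of Python. This is a minimal illustration of the formula only, not Illumina's scoring software; the function names error_prob_to_q and q_to_error_prob are hypothetical and chosen for this example.

```python
import math


def error_prob_to_q(p: float) -> float:
    """Return the Phred quality score Q = -10 * log10(P) for error probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def q_to_error_prob(q: float) -> float:
    """Invert the formula: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # Print the error probability and base call accuracy for a few common Q scores.
    for q in (10, 20, 30, 40):
        p = q_to_error_prob(q)
        print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.2f}%")
```

Because the scale is logarithmic, each increase of 10 in Q corresponds to a tenfold reduction in the error probability.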

For example, if Phred assigns a Q score of 30 (Q30) to a base, the probability of an incorrect base call is 1 in 1,000 (Table 1). This means that the base call accuracy (i.e., the probability of a correct base call) is 99.9%. A lower base call accuracy of 99% (Q20) corresponds to an incorrect base call probability of 1 in 100, meaning that a 100 bp sequencing read will contain one error on average. When sequencing quality reaches Q30, the large majority of reads will be perfect, with no errors or ambiguities. This is why Q30 is considered a benchmark for quality in next-generation sequencing. By comparison, Sanger sequencing systems generally produce base call accuracy of ~99.4%, or ~Q22. Low Q scores can increase false-positive variant calls, which can result in inaccurate conclusions and higher costs for validation experiments.
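
The read-level consequences of a given Q score can be estimated with a short calculation. The sketch below assumes, for simplicity, that every base in a read has the same Q score and that errors at different positions are independent; real reads have per-base scores that vary along the read, so this is only a rough guide. The function names are hypothetical.

```python
def expected_errors(q: float, read_length: int) -> float:
    """Expected number of miscalled bases in a read, assuming a uniform Q score."""
    p = 10.0 ** (-q / 10.0)  # per-base error probability from Q = -10 log10(P)
    return p * read_length


def prob_error_free(q: float, read_length: int) -> float:
    """Probability that a read contains no errors, assuming independent errors."""
    p = 10.0 ** (-q / 10.0)
    return (1.0 - p) ** read_length


if __name__ == "__main__":
    # Compare Q20 and Q30 for a 100 bp read.
    for q in (20, 30):
        print(
            f"Q{q}, 100 bp: expected errors {expected_errors(q, 100):.2f}, "
            f"P(error-free read) {prob_error_free(q, 100):.3f}"
        )
```

Under these simplifying assumptions, a 100 bp read at Q20 is error-free roughly 37% of the time, while at Q30 the figure rises to roughly 90%.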

Illumina Data Quality

Illumina Q score predictions have been shown to closely match the actual data quality observed in human genome sequencing. Figure 1 shows that the predicted and empirical quality scores from a HiSeq 2000 run are well correlated. Q scores can reveal how much of the data from a given run is usable in a resequencing or assembly experiment. Sequencing data with lower quality scores can leave a significant portion of the reads unusable, resulting in wasted time and expense. PhiX quality scores for the MiSeq and HiSeq systems show that nearly all bases have scores > Q30 for single and paired-end reads (Figure 2). Comparison of E. coli whole-genome sequencing data shows that this high data quality is consistent across both platforms (Table 2).
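
One common way to summarize how much of a run is usable is the fraction of bases at or above Q30. The sketch below computes that fraction from per-read quality strings. It assumes the Phred+33 ASCII encoding used in standard FASTQ files, which this note does not itself specify; the function names and the example quality strings are made up for illustration.

```python
from typing import Iterable, List

PHRED_OFFSET = 33  # assumption: Phred+33 ASCII encoding of quality characters


def decode_quality(quality_string: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a string of ASCII quality characters into integer Q scores."""
    return [ord(ch) - offset for ch in quality_string]


def fraction_at_or_above(
    quality_strings: Iterable[str],
    threshold: int = 30,
    offset: int = PHRED_OFFSET,
) -> float:
    """Return the fraction of all bases whose Q score is >= threshold."""
    total = 0
    passing = 0
    for quality_string in quality_strings:
        scores = decode_quality(quality_string, offset)
        total += len(scores)
        passing += sum(1 for q in scores if q >= threshold)
    if total == 0:
        raise ValueError("no bases supplied")
    return passing / total


if __name__ == "__main__":
    # Hypothetical quality strings: 'I' decodes to Q40, '?' to Q30, '5' to Q20.
    example_reads = ["IIII", "??55"]
    print(f"Fraction of bases >= Q30: {fraction_at_or_above(example_reads):.2f}")
```

The threshold is a parameter so the same summary can be produced for other cutoffs, such as Q20.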

Accurate Sequencing Chemistry

Illumina sequencing by synthesis (SBS) technology delivers the highest percentage of error-free reads, with a vast majority of bases having quality scores above Q30. In many cases, even higher quality scores of Q35-Q40 are achieved. The latest version of the chemistry, TruSeq™ SBS and Cluster Generation v3 reagents, has been optimized for accurate base calling even within difficult-to-sequence regions of the genome, such as repeats, homopolymers, and high GC regions. TruSeq v3 chemistry is available for the HiSeq and MiSeq systems. The unparalleled TruSeq accuracy is ideal for next-generation sequencing in clinical environments that demand the highest standard of quality. Since the release of the original Illumina Genome Analyzer™ system, SBS technology has been used in the widest range of sequencing applications, resulting in more than 2,000 peer-reviewed publications in just five years, a feat unmatched by any other life science technology.

SBS chemistry uses four fluorescently labeled nucleotides to sequence up to billions of clusters on the flow cell surface in parallel. During each sequencing cycle, a single labeled deoxynucleoside triphosphate (dNTP) is added to the nucleic acid chain. The dNTPs contain a reversible blocking group that serves as a terminator for polymerization, so after each dNTP incorporation, the fluorescent dye is imaged to identify the base and then enzymatically cleaved to allow incorporation of the next nucleotide. Since all four reversible terminator-bound dNTPs (A, C, T, G) are present as single, separate molecules, natural competition minimizes incorporation bias, which can be problematic with chemistries that add one nucleotide type at a time in series. Base calls are made directly from signal intensity measurements during each cycle, greatly reducing raw error rates compared to other technologies. The result is highly accurate base-by-base sequencing that virtually eliminates sequence context-specific errors, enabling robust base calling across the genome, including repetitive sequence regions and homopolymers.

Summary

See the full Quality Scores for Next-Generation Sequencing tech note here.

© 2023 Illumina, Inc. All rights reserved. All trademarks are the property of Illumina, Inc. or their respective owners. Trademark information: illumina.com/company/legal.html. Privacy policy: illumina.com/company/legal/privacy.html