What is nucleotide diversity and why is it important?
What is nucleotide diversity?
Nucleotide diversity refers to the relative proportion of nucleotides A, C, G and T present in every cycle of the run. Well-balanced or high diversity libraries have roughly equal proportions of all four nucleotides in each cycle throughout the sequencing run. Low diversity libraries have a high proportion of certain nucleotides and a low proportion of other nucleotides in a cycle. Finally, libraries that have in-line barcodes or other molecular identifiers can have regions of low diversity in otherwise high diversity libraries.
The thumbnail images below show the relative proportion of nucleotides for a well-balanced library and a low diversity library for one tile of a MiSeq sequencing cycle. The thumbnail images for the balanced library show a roughly equal signal in each channel. The low diversity library shows more signal in the C channel and much less signal in the A, G, and T channels. When all clusters provide signal primarily in one channel, the instrument can have trouble identifying the location of the clusters and making high-quality base calls.
The diagram below shows the %Base by cycle plot from Sequencing Analysis Viewer (SAV) of a well-balanced and a low diversity library. For the well-balanced library, all four channels contribute roughly 25% to the total signal. For the low diversity library, the relative proportion of each nucleotide varies between cycles. In some cycles, up to 90% of the total signal is produced by one channel.
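As a rough illustration of what a %Base by cycle plot summarizes, the sketch below computes the per-cycle base composition from a list of read sequences and flags cycles dominated by a single base. It is a minimal sketch, not how SAV or RTA computes its metrics: the function names are hypothetical, base counts stand in for channel signal, and the 90% cutoff is only borrowed from the example above as an illustrative threshold.

```python
from collections import Counter

BASES = "ACGT"


def percent_base_by_cycle(reads):
    """Return one dict per cycle mapping each base to its percentage of calls."""
    if not reads:
        return []
    # Only cycles covered by every read are compared.
    n_cycles = min(len(read) for read in reads)
    composition = []
    for cycle in range(n_cycles):
        counts = Counter(read[cycle] for read in reads)
        # Ignore no-calls (N) and any other non-ACGT symbols.
        total = sum(counts[base] for base in BASES)
        composition.append(
            {base: (100.0 * counts[base] / total if total else 0.0) for base in BASES}
        )
    return composition


def low_diversity_cycles(composition, max_percent=90.0):
    """Return 1-based cycle numbers where one base reaches max_percent or more.

    max_percent is an illustrative cutoff, not an Illumina specification.
    """
    return [
        index + 1
        for index, cycle in enumerate(composition)
        if max(cycle.values()) >= max_percent
    ]


# Toy input: cycle 2 is C in every read, the other cycles are mixed.
example_reads = ["ACGT", "CCGA", "GCTT", "TCAG"]
example_composition = percent_base_by_cycle(example_reads)
print(low_diversity_cycles(example_composition))  # [2]
```

In this toy input, cycle 1 is perfectly balanced at 25% per base, while cycle 2 is 100% C and is the only cycle flagged.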
Why is nucleotide diversity important?
Nucleotide diversity is critical for optimal run performance and high-quality data generation. It is particularly important in the first 25 cycles of a sequencing run because this is when the clusters passing filter, phasing/pre-phasing, and color matrix corrections are calculated. Phasing/pre-phasing is the percentage of molecules in a cluster for which sequencing falls behind (phasing) or jumps ahead of (pre-phasing) the current cycle. Color matrix correction refers to a template created in the first few cycles that includes intensities from each channel and is then used in all subsequent reads as well as for calculating phasing/pre-phasing rates. These metrics are then used in base calling and quality score calculations for all cycles in the run. Balanced fluorescent signal from all imaging channels provides the most accurate empirical models and improves overall data quality.
On a non-patterned flow cell, the number and location of clusters are empirically determined in the first 4 to 7 cycles (depending on the instrument and reagent kit used) through a process called template generation. For platforms with non-patterned flow cells such as the MiniSeq, MiSeq, NextSeq 500/550, and HiSeq 1000/2500, nucleotide diversity is important during template generation. Signal from all four bases must be present in the first 4 to 7 cycles to best generate the template.
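The requirement that all four bases appear in the template generation cycles can be checked on expected read sequences before a run. The following is a minimal sketch under stated assumptions: the function name and the default of 7 cycles are illustrative choices (the text gives a range of 4 to 7 cycles), and the check looks only at base presence, not at signal intensity or cluster positions.

```python
def missing_bases_in_first_cycles(reads, n_cycles=7):
    """Map each 1-based cycle to the sorted list of bases absent from it.

    Cycles in which A, C, G, and T are all present are omitted from the result.
    n_cycles is an assumed window; choose 4 to 7 to match the platform.
    """
    missing = {}
    for cycle in range(n_cycles):
        # Collect the bases observed at this cycle across all long-enough reads.
        seen = {read[cycle] for read in reads if len(read) > cycle}
        absent = set("ACGT") - seen
        if absent:
            missing[cycle + 1] = sorted(absent)
    return missing


# Toy amplicon-like reads sharing a fixed "GA" prefix, diverse afterwards.
example_reads = ["GAACGTA", "GACGTAC", "GAGTACG", "GATACGT"]
print(missing_bases_in_first_cycles(example_reads))
# {1: ['A', 'C', 'T'], 2: ['C', 'G', 'T']}
```

Here the shared two-base prefix leaves cycles 1 and 2 with a single base each, which is the kind of pattern that in-line barcodes or amplicon primers can produce at the start of a read.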
Illumina sequencing systems use three different types of SBS technology
The HiSeq and the MiSeq systems use 4-channel chemistry. In this technology, each of the four nucleotides emits a unique wavelength and four images are taken per cycle. The Real-Time Analysis (RTA) software then empirically determines the color normalization matrix and calculates phasing/pre-phasing rates, both of which are used in base calling and assigning quality scores. The MiSeq and HiSeq 2500 use RTA1 while the HiSeq 3000/4000 and HiSeq X use RTA2. When sequencing low diversity libraries on the MiSeq or the HiSeq systems, a minimum of 5% to 10% PhiX spike-in is recommended depending on the platform and control software version.
The MiniSeq, NextSeq 500/550, NextSeq 1000/2000, and the NovaSeq 6000 platforms use 2-channel chemistry. With this technology, two fluorescent dyes and two images are used to determine the incorporation of all four nucleotides per cycle. This enables faster sequencing and more efficient data processing. The NextSeq 500/550 and the MiniSeq use RTA2 for image processing, base calling, and quality score calculations. The NextSeq 1000/2000 and NovaSeq 6000 use RTA3, which has been optimized for data processing time. For platforms with 2-channel chemistry, nucleotide balance is especially important for color matrix correction and intensity normalization. When sequencing low diversity libraries, a minimum of 10% PhiX spike-in is recommended for the MiniSeq and NextSeq and 5% PhiX for the NovaSeq.
The iSeq 100 platform is the only Illumina instrument to use 1-channel chemistry. Each sequencing cycle uses a single fluorescent dye, two images, and two chemistry steps, one prior to each of the imaging steps. The iSeq 100 uses RTA2 for intensity extraction, base calling, and quality score calculation. A minimum of 5% PhiX spike-in is recommended when sequencing low diversity libraries on the iSeq 100.
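The PhiX recommendations in this section can be gathered into a small lookup table for use in run-planning scripts. This is a sketch that only restates the minimums given above: the dictionary name, key spellings, and function name are hypothetical, the MiSeq and HiSeq entries are kept as a 5% to 10% range because the exact value depends on the platform and control software version, and "NextSeq" is used as a single key because the text does not distinguish NextSeq models for this recommendation.

```python
# Minimum PhiX spike-in (percent) for low diversity libraries, as (low, high).
# Values restate the text; a range means the minimum depends on the platform
# and control software version.
PHIX_MIN_PERCENT = {
    "MiSeq": (5, 10),         # 4-channel
    "HiSeq": (5, 10),         # 4-channel
    "MiniSeq": (10, 10),      # 2-channel
    "NextSeq": (10, 10),      # 2-channel
    "NovaSeq 6000": (5, 5),   # 2-channel
    "iSeq 100": (5, 5),       # 1-channel
}


def recommended_phix(platform):
    """Return the (low, high) minimum PhiX percentage for a platform name.

    Raises KeyError for platforms not covered by the text.
    """
    return PHIX_MIN_PERCENT[platform]


low, high = recommended_phix("MiniSeq")
print(f"MiniSeq: at least {low}% PhiX")  # MiniSeq: at least 10% PhiX
```

Keeping the range rather than a single number for the MiSeq and HiSeq makes it explicit that the platform documentation still needs to be consulted for the exact value.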