Variant and Methylation calling FAQ for the Illumina 5 Base WGS and Enrichment kits

How best to check for conversion efficiency?

The methyl_metrics.csv file generated by DRAGEN reports the conversion of the methylated (pUC19) and unmethylated (lambda) control DNA. The metric to focus on is “% of Cs methylated in a CpG context”. Check this value for both pUC19 and lambda, rather than looking only at the false positive rate.

For a value reported in *.methyl_metrics.csv, for example “METHYL CALLING % of C’s methylated in CpG context, 49.70”, does this mean that 49.7% of the CpGs in the genome are methylated, or that 49.7% of the Cs across all CpGs are methylated?

49.7% of the sequenced Cs in a CpG context are methylated. This value is computed from the sequencing reads (deduplicated and overlap-trimmed), not from the whole genome, so it serves as a technical readout of assay methylation conversion. It will generally correspond to the genomic methylation level too, but because methylation follows a continuous distribution at each position in bulk sequencing, the two metrics are not interchangeable.
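One way to see why a read-level percentage and a per-position methylation level can differ is the small Python sketch below. It is only an illustration, not how DRAGEN computes its metrics: the per-site counts and both function names are made up for the example, and it assumes the read-level value pools all C observations while a site-level value averages per-position fractions.

```python
# Hypothetical per-CpG counts: (methylated_reads, total_reads) at each position.
sites = [(90, 100), (1, 10), (5, 10)]


def pooled_methylation_pct(sites):
    """Read-level value: methylated C observations over all C observations."""
    methylated = sum(m for m, _ in sites)
    total = sum(t for _, t in sites)
    return 100.0 * methylated / total


def mean_site_methylation_pct(sites):
    """Site-level value: average of the per-position methylation fractions."""
    fractions = [m / t for m, t in sites if t > 0]
    return 100.0 * sum(fractions) / len(fractions)


print(round(pooled_methylation_pct(sites), 1))     # pooled over reads
print(round(mean_site_methylation_pct(sites), 1))  # averaged over positions
```

With these toy counts the pooled value is 80.0 and the per-position average is 50.0, because the highly covered site dominates the pooled number.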

How would one distinguish between a somatic variant and a methylation call at, say, 10%?

At a high level, they can be distinguished because methylation is strand-specific: it produces a signal in the data that is distinct from that of DNA variants, whether those variants are germline or somatic. The methylation and variant calling outputs are written to separate files.
If per-position methylation reporting at specific variant positions is desired, Illumina recommends using the methylation reporting in the VCF output (M5mC field) for an accurate result.

How can one distinguish a C>T variant from a mC>T conversion?

Because of complementary base pairing, a C on one strand is paired with a G on the opposite strand. This makes it possible to distinguish methylation from C>T mutations: only one strand is converted if the base was methylated, but both strands change if there is a variant. For C>T variants, the variant pattern (C>T) is present on both strands (Watson (+) and Crick (-)) of the DNA. For a methylated locus, the reference mismatch is found on only one strand. If the methylated C is on the Watson (+) strand, sequencing shows Watson strand T and Crick strand G. If the methylated C is on the Crick (-) strand, sequencing shows Watson strand G and Crick strand A. DRAGEN leverages this strand asymmetry to distinguish methylation from variants. As an example, instead of asking "is this base a C or T?" it is asking "is this base a C that’s methylated 90% of the time or is it a T?"
Because the signal sources can be deconvoluted, SNVs can be distinguished from methylation with statistical models, and the DRAGEN algorithms are able to call SNV events with >99.5% accuracy.
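The strand asymmetry described above can be illustrated with a toy Python sketch. This is not the DRAGEN algorithm, which uses statistical models; the function name, the labels it returns, and the min_fraction cutoff are all hypothetical, and it only covers a reference C on the Watson strand paired with a G on the Crick strand.

```python
def classify_reference_c(watson_bases, crick_bases, min_fraction=0.05):
    """Toy call at a reference C (Watson strand) paired with G (Crick strand).

    watson_bases: bases seen in Watson-strand reads (C expected, T if changed).
    crick_bases: bases seen in Crick-strand reads (G expected, A if changed).
    min_fraction: hypothetical cutoff for treating a mismatch as real signal.
    """
    if not watson_bases or not crick_bases:
        return "no call"  # both strands are needed to apply the asymmetry
    watson_t = watson_bases.count("T") / len(watson_bases)
    crick_a = crick_bases.count("A") / len(crick_bases)
    if watson_t >= min_fraction and crick_a >= min_fraction:
        return "C>T variant"   # mismatch present on both strands
    if watson_t >= min_fraction:
        return "methylated C"  # mismatch on the Watson strand only
    if crick_a >= min_fraction:
        return "unresolved"    # Crick-only mismatch fits neither pattern
    return "unmethylated reference C"


print(classify_reference_c(list("TTTTTTTTTC"), list("GGGGGGGGGG")))
print(classify_reference_c(list("TTTTTCCCCC"), list("AAAAAGGGGG")))
```

The first call has the mismatch on one strand only and is labeled as methylation; the second has it on both strands and is labeled as a variant. A fixed cutoff is used here only to keep the example short.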

Is methylation detection less accurate when there is an SNV at the same locus, since only one of the strands will carry the methylated C?

For C>T and G>A variants, methylation estimates in the CX_report file may be inaccurate because the methylation and SNV signals are not handled separately there. This is why Illumina now provides the methylation at variant positions in the VCF output, which separates the variant information from the methylation signal to provide more accurate reporting. The same is true of the gVCF, which reports methylation at all reference and non-reference CpGs by default.

Are the analyses of true variants and methylation based on the premise of simultaneous methylation on both strands? Can variants with non-CG methylation be distinguished?

The algorithm accounts for many signal-impacting factors, and is not limited to CG methylation. Variants and non-CG methylation can be distinguished with the same high accuracy, as can variants and hemi-methylation.

When Read1 and Read2 overlap and both reads report a C-to-T conversion at the same reference position, this counts as 1 coverage on the mC side in the CX_report file. Is this correct? What if Read1 and Read2 report differently?

Correct. Overlapping Read1 and Read2 positions that agree on methylation state contribute a single coverage count.
The Read1 sequence has priority, so if Read1 and Read2 report different bases, the Read1 sequence is used. In short, all Read2 bases that overlap with Read1 are ignored.
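The Read1-priority rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not DRAGEN code: the function names, the dictionary representation of a read (reference position mapped to a call), and the "mC"/"C" labels are all hypothetical.

```python
from collections import Counter


def merge_pair_calls(read1_calls, read2_calls):
    """Combine per-position calls from one read pair, giving Read1 priority."""
    merged = dict(read2_calls)   # start from Read2 ...
    merged.update(read1_calls)   # ... then Read1 overwrites every overlap
    return merged


def tally_coverage(pairs):
    """Count calls per position; each read pair adds at most one count."""
    counts = {}
    for read1_calls, read2_calls in pairs:
        merged = merge_pair_calls(read1_calls, read2_calls)
        for position, state in merged.items():
            counts.setdefault(position, Counter())[state] += 1
    return counts


# Hypothetical read pairs: {reference position: call}.
pairs = [
    ({100: "mC", 101: "C"}, {101: "mC", 102: "C"}),  # disagree at 101
    ({101: "mC"}, {101: "mC"}),                      # agree at 101
]
counts = tally_coverage(pairs)
print(dict(counts[101]))
```

At position 101 the first pair contributes Read1's "C" call and the second pair contributes a single "mC" count, so the position has a coverage of 2 rather than 4.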

For any feedback or questions regarding this article (Illumina Knowledge Article #9952), contact Illumina Technical Support [email protected].

Last updated 1 month ago
