Why is allowing mismatches when demultiplexing desirable?
Both bcl2fastq2 and BCL Convert let you select the number of mismatches tolerated when identifying indexes during demultiplexing, that is, when attributing reads from a sequencing run to their respective samples. Setting the number of mismatches to 0 requires the index read to be identical to the index in the sample sheet.
Alternatively, the number of mismatches can be set to 1 or higher, so that an index read differing from the sample sheet index at up to that many positions is still accepted. Generally, increasing the mismatch tolerance also increases the number of reads whose indexes can be identified. Most index reads have 0 mismatches, but setting the mismatch tolerance to 1 typically increases the number of identified reads by 2-10%.
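The effect of the mismatch setting can be illustrated with a minimal Python sketch. This is not how bcl2fastq2 or BCL Convert are implemented. The function names, the sample names, and the index sequences are hypothetical, and leaving ambiguous reads unassigned is an assumption made for this example.

```python
def count_mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length index sequences differ."""
    if len(a) != len(b):
        raise ValueError("index sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read: str, sample_indexes: dict, max_mismatches: int = 1):
    """Return the sample whose index is within max_mismatches of the read.

    Returns None when no sample matches, or (an assumption of this sketch)
    when more than one sample matches.
    """
    hits = [
        name
        for name, index in sample_indexes.items()
        if count_mismatches(index_read, index) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical sample sheet entries, for illustration only.
SAMPLES = {"sample_A": "ATCACGTT", "sample_B": "CGATGTAC"}

# An index read with one sequencing error in its last base.
read_with_error = "ATCACGTA"
strict = assign_sample(read_with_error, SAMPLES, max_mismatches=0)    # unassigned
tolerant = assign_sample(read_with_error, SAMPLES, max_mismatches=1)  # sample_A
```

With a tolerance of 0 the read with a single error is left undetermined, while a tolerance of 1 recovers it for its sample.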
The likelihood of misassigning one index to another depends on the Hamming distance between the two sequences and on the probability that a sequencing error converts an original base into the exact base of the second index. The Hamming distance is the number of substitutions needed to transform one string into another string of equal length. Illumina index sets are typically designed with a Hamming distance (n), also called the mismatch number (mm), of greater than or equal to 4 between any two indexes. This Hamming distance (n) allows considerable tolerance to substitution errors, especially with paired indexes, which must both be identified for the insert to be assigned to a sample.
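As a small illustration of the definition, the following Python sketch computes the Hamming distance and the minimum pairwise distance of an index set. The index sequences are toy values chosen for this example, not a real Illumina index set, and the helper names are hypothetical. The last helper uses a standard coding-theory condition: a read within MM mismatches of one index cannot also be within MM mismatches of another when 2 × MM is less than the minimum distance.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Number of substitutions needed to turn a into b (equal length only)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(indexes) -> int:
    """Smallest Hamming distance between any two indexes in the set."""
    return min(hamming_distance(a, b) for a, b in combinations(indexes, 2))


def tolerance_is_unambiguous(min_distance: int, max_mismatches: int) -> bool:
    """True when no read can fall within max_mismatches of two indexes at once."""
    return 2 * max_mismatches < min_distance


# Toy index set (illustrative only); every pair differs at 4 or more positions.
INDEXES = ["AAAAAAAA", "AAAACCCC", "CCCCAAAA", "CCCCCCCC"]

n = min_pairwise_distance(INDEXES)
ok_with_one_mismatch = tolerance_is_unambiguous(n, max_mismatches=1)
```

For a set with a minimum distance of 4, allowing 1 mismatch still leaves every read closest to at most one index.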
The following table estimates potential index misassignments due to sequencing error for NovaSeq runs with various flow cell types, numbers of mismatches allowed (MM), Hamming distances (n), and single or dual index strategies. Overall, decreasing the number of mismatches allowed (MM) achieves stringency similar to increasing the Hamming distance (n):
The least conservative case is a Hamming distance of n=4 with MM=1 allowed mismatch. Even in this scenario, and even with a high error rate per position, the rate of misassignment is very low, on the order of 0.000013% of reads. This rate is much lower than that of index hopping or other sources of index misassignment. Usually, adopting a dual indexing strategy with 1 allowed mismatch is a less costly and more effective way to eliminate misassignments than setting a stricter policy of 0 allowed mismatches.
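To see why the misassignment rate falls so quickly with Hamming distance, consider a simplified Python model. It is a sketch under stated assumptions, not the calculation behind the figures quoted above. It assumes independent substitution errors at a single per-base rate, with each error equally likely to produce any of the three other bases, and it ignores errors at positions where the two indexes agree, so it gives an upper-bound estimate. The function names and the error rate of 0.01 are illustrative choices.

```python
from math import comb


def misassign_probability(n: int, mm: int, error_rate: float) -> float:
    """Upper-bound chance that a read of one index falls within mm mismatches
    of another index at Hamming distance n.

    At least n - mm of the n differing positions must be converted to the
    exact base of the second index; each conversion has chance error_rate / 3.
    """
    p = error_rate / 3.0
    k = max(n - mm, 0)
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def dual_index_misassign_probability(n: int, mm: int, error_rate: float) -> float:
    """Both indexes must be misread independently (same n assumed for each)."""
    return misassign_probability(n, mm, error_rate) ** 2


ASSUMED_ERROR_RATE = 0.01  # illustrative per-base error rate, not a measured value

for n, mm in [(4, 1), (4, 0), (5, 1)]:
    single = misassign_probability(n, mm, ASSUMED_ERROR_RATE)
    dual = dual_index_misassign_probability(n, mm, ASSUMED_ERROR_RATE)
    print(f"n={n} MM={mm} single={single:.3e} dual={dual:.3e}")
```

Under this model, lowering MM by 1 and raising n by 1 both require one more exact base conversion, and requiring two independent index matches squares the probability.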