How to parse the Counts sparse matrix file output by the DRAGEN scRNA and scATAC pipelines
The DRAGEN Single Cell pipelines generate a count matrix of unique UMIs/genes (scRNA) and peaks (scATAC) per cell and output it in Matrix Market format (matrix.mtx.gz), a format typically used for storing sparse matrices. To explore the output matrix in a human-readable form, a user can load it into a "dense" dataframe in Python or another programming language. It is important to remember, however, that a "sparse" representation of the matrix is preferable whenever possible, because "dense" matrices use significantly more memory and disk space. Several tools are available to work efficiently with "sparse" representations of single cell matrices (eg, scanpy in Python).
The row names for this matrix are stored in the barcodes.tsv.gz file, while the column names are stored in a genes.tsv.gz (scRNA) or a peaks.tsv.gz (scATAC) file.
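Before loading anything, it can help to look at the size line of the Matrix Market file and compare it with the number of labels in the two TSV files. The following is a minimal sketch using only the Python standard library; the function names (`read_mtx_header`, `count_labels`, `dense_megabytes`) are hypothetical helpers introduced for illustration, and the 4 bytes per value is an assumption that corresponds to 32-bit numbers.

```python
import gzip


def read_mtx_header(path):
    """Return (n_rows, n_cols, n_nonzero) from a gzipped Matrix Market file."""
    with gzip.open(path, "rt") as handle:
        banner = handle.readline()
        if not banner.startswith("%%MatrixMarket"):
            raise ValueError(f"{path} does not look like a Matrix Market file")
        for line in handle:
            # Comment lines start with '%'; skip them and any blank lines.
            if line.startswith("%") or not line.strip():
                continue
            # The first remaining line is the size line: rows, columns, entries.
            n_rows, n_cols, n_nonzero = (int(value) for value in line.split())
            return n_rows, n_cols, n_nonzero
    raise ValueError(f"{path} has no size line")


def count_labels(path):
    """Count the non-empty lines of a gzipped TSV label file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.strip())


def dense_megabytes(n_rows, n_cols, bytes_per_value=4):
    """Estimate the size of a dense matrix (assumes 4 bytes per value)."""
    return n_rows * n_cols * bytes_per_value / 1e6
```

For example, a hypothetical matrix of 10,000 cells by 30,000 genes would need roughly 1,200 MB as a "dense" array at 4 bytes per value, regardless of how few entries are non-zero. Comparing the row and column counts from `read_mtx_header` with `count_labels` for each label file is a quick way to confirm which file labels which axis.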
The matrix can be converted into a "dense" representation using two Python modules: scanpy and pandas. This has been tested with Python 3.10.0, scanpy 1.9.3, and pandas 1.5.3.
First, install the required libraries (scanpy and pandas).
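One common way to install them, assuming pip is available, is `pip install scanpy==1.9.3 pandas==1.5.3`. The sketch below then checks from within Python which versions are present and compares them with the versions listed above. `report_versions` and `TESTED_VERSIONS` are hypothetical names introduced for this example.

```python
from importlib import metadata

# Versions this article was tested with.
TESTED_VERSIONS = {"scanpy": "1.9.3", "pandas": "1.5.3"}


def report_versions(tested=TESTED_VERSIONS):
    """Print and return the installed version of each package (None if absent)."""
    installed = {}
    for package, tested_version in tested.items():
        try:
            installed[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed[package] = None
        status = installed[package] or "not installed"
        print(f"{package}: {status} (tested with {tested_version})")
    return installed
```

Calling `report_versions()` before loading the matrix makes it easier to tell whether a problem comes from a missing package or from a version that differs from the tested one.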
Within Python, the matrix can then be loaded in a "dense" representation using scanpy and pandas.
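A minimal sketch of such a loading step is shown here. It is not the exact set of commands from the original article. `load_dense_counts` is a hypothetical helper, the file paths are parameters to be adjusted to the actual output files, and it assumes that the first tab-separated column of each label file holds the label. The transpose check is a defensive assumption for files that turn out to be stored with features as rows.

```python
def load_dense_counts(matrix_path, barcodes_path, features_path):
    """Load a Matrix Market count matrix and return it as a dense DataFrame."""
    # Imported inside the function because scanpy can be slow to import.
    import pandas as pd
    import scanpy as sc

    # read_mtx keeps the counts sparse; Matrix Market rows become observations.
    adata = sc.read_mtx(matrix_path)

    # Assumption: the first column of each label file is the label to use.
    def read_labels(path):
        table = pd.read_csv(path, sep="\t", header=None, dtype=str,
                            keep_default_na=False)
        return table[0].to_numpy()

    barcodes = read_labels(barcodes_path)
    features = read_labels(features_path)

    # Defensive check (an assumption): transpose if the file is stored
    # as features x barcodes instead of barcodes x features.
    if (adata.shape == (len(features), len(barcodes))
            and len(features) != len(barcodes)):
        adata = adata.T
    if adata.shape != (len(barcodes), len(features)):
        raise ValueError(
            f"matrix shape {adata.shape} does not match "
            f"{len(barcodes)} barcodes and {len(features)} features"
        )

    adata.obs_names = barcodes
    adata.var_names = features

    # to_df() converts the sparse matrix into a dense pandas DataFrame.
    return adata.to_df()
```

A call such as `load_dense_counts("matrix.mtx.gz", "barcodes.tsv.gz", "genes.tsv.gz")` returns a dataframe with barcodes as the index and genes (or peaks) as the columns. Because the conversion to "dense" happens only in the final `to_df()` call, the same function can be cut short to keep the sparse object when memory is limited.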
The matrix can be saved in different output formats (eg, CSV), although this might not be recommended because of the large disk usage.
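As an illustration, the sketch below writes a "dense" dataframe to CSV and reports the resulting file size, so the disk cost is visible immediately. `save_dense_csv` and the default file name are hypothetical; the function only assumes that its argument is a pandas DataFrame, whose `to_csv` method compresses the output when the file name ends in `.gz`.

```python
import os


def save_dense_csv(dense_df, path="counts_dense.csv.gz"):
    """Write a dense DataFrame to CSV and return the file size in bytes."""
    # pandas infers gzip compression from the '.gz' extension.
    dense_df.to_csv(path)
    return os.path.getsize(path)
```

If the matrix is kept as a scanpy AnnData object instead of a dataframe, its `write_h5ad` method stores the counts without converting them to a "dense" form, which is usually the more economical choice.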
For any feedback or questions regarding this article (Illumina Knowledge Article #7911), contact Illumina Technical Support at techsupport@illumina.com.