How to parse the Counts sparse matrix file output by the DRAGEN scRNA and scATAC pipelines

The DRAGEN Single Cell pipelines generate a count matrix of unique UMIs/genes (scRNA) and peaks (scATAC) per cell and output it in Matrix Market format (matrix.mtx.gz), a format typically used for storing sparse matrices. Users who want to explore the output matrix in a human-readable form can load it into a "dense" dataframe in Python or another programming language. Where possible, however, a "sparse" representation of the matrix is preferable, because "dense" matrices store every zero explicitly and therefore use far more memory and disk space. Several tools work efficiently with "sparse" representations of single cell matrices (e.g., scanpy in Python).
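To give a rough sense of the difference in memory use, the sketch below compares the footprint of a sparse matrix with the size its dense equivalent would need. The dimensions and the 2% fill rate are made-up values chosen for illustration, not figures from a DRAGEN run, and the matrix is filled with random numbers rather than real counts.

```python
import numpy as np
from scipy import sparse

# Hypothetical dimensions and fill rate, for illustration only
n_cells, n_features, density = 2_000, 5_000, 0.02

# Random sparse matrix in CSR format standing in for a count matrix
counts = sparse.random(
    n_cells, n_features, density=density,
    format="csr", dtype=np.float32, random_state=0,
)

# CSR stores only the non-zero values plus two index arrays
sparse_bytes = counts.data.nbytes + counts.indices.nbytes + counts.indptr.nbytes

# A dense array stores every entry, including all the zeros
# (computed here without actually allocating the array)
dense_bytes = n_cells * n_features * counts.dtype.itemsize

print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
print(f"dense:  {dense_bytes / 1e6:.1f} MB")
```

The gap widens as the matrix gets larger and sparser, which is the usual situation for single cell data.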

The row names for this matrix are stored in the barcodes.tsv.gz file, while the column names are stored in a genes.tsv.gz (scRNA) or a peaks.tsv.gz (scATAC) file.
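Before attaching labels, it can help to confirm that the matrix dimensions agree with the number of entries in each label file. The following is a minimal sketch that uses scipy rather than scanpy. It writes a tiny, made-up data set to a temporary directory so that it runs on its own; the file contents and the helper `to_cells_by_features` are hypothetical and are not part of the DRAGEN output.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd
from scipy.io import mmread

# Write a tiny, made-up data set (3 features, 2 barcodes) to a temp directory
tmp = Path(tempfile.mkdtemp())
mtx_text = (
    "%%MatrixMarket matrix coordinate integer general\n"
    "3 2 4\n"
    "1 1 5\n"
    "2 1 1\n"
    "3 2 2\n"
    "1 2 7\n"
)
with gzip.open(tmp / "matrix.mtx.gz", "wt") as fh:
    fh.write(mtx_text)
with gzip.open(tmp / "genes.tsv.gz", "wt") as fh:
    fh.write("ID1\tGeneA\nID2\tGeneB\nID3\tGeneC\n")
with gzip.open(tmp / "barcodes.tsv.gz", "wt") as fh:
    fh.write("AAAC-1\nAAAG-1\n")

# Load the matrix (kept sparse) and the two label files
matrix = mmread(str(tmp / "matrix.mtx.gz")).tocsr()
genes = pd.read_csv(tmp / "genes.tsv.gz", sep="\t", header=None)
barcodes = pd.read_csv(tmp / "barcodes.tsv.gz", sep="\t", header=None)


def to_cells_by_features(matrix, n_barcodes, n_features):
    """Return the matrix with barcodes as rows, transposing if needed."""
    if matrix.shape == (n_barcodes, n_features):
        return matrix
    if matrix.shape == (n_features, n_barcodes):
        return matrix.T.tocsr()
    raise ValueError(
        f"matrix shape {matrix.shape} does not match "
        f"{n_barcodes} barcodes and {n_features} features"
    )


cells_by_features = to_cells_by_features(matrix, len(barcodes), len(genes))
print(cells_by_features.shape)
```

Note that the shape check cannot tell the two orientations apart when the number of barcodes equals the number of features; in that case the orientation has to be established another way.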

The matrix can be converted into a "dense" representation using two Python modules: scanpy and pandas. This has been tested with Python 3.10.0, scanpy 1.9.3, and pandas 1.5.3.

First, install the required libraries:

> pip install -U scanpy pandas

Within Python, the matrix can be loaded in "dense" representation using the following commands:

    # import libraries
    import pandas as pd
    import scanpy as sc

    # define path to input files
    matrix_path = "path/to/matrix.mtx.gz"
    genes_path = "path/to/genes.tsv.gz"  # path/to/peaks.tsv.gz for scATAC data
    barcodes_path = "path/to/barcodes.tsv.gz"

    # load matrix through scanpy
    adata = sc.read_mtx(matrix_path).T
    adata.var_names = pd.read_csv(genes_path, sep="\t", header=None)[1]
    adata.obs_names = pd.read_csv(barcodes_path, sep="\t", header=None)[0]

    # convert scanpy internal format (AnnData) to dense pandas DataFrame
    df = pd.DataFrame(adata.X.todense(), index=adata.obs_names, columns=adata.var_names)

    # save it as CSV file
    df.to_csv("output_matrix.csv")
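When the goal is only to inspect part of the matrix, one option is to convert just a block of it to dense form and leave the rest sparse. The sketch below illustrates the idea on a small, made-up matrix standing in for the loaded counts. The helper `dense_block`, the barcodes, and the feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Made-up stand-in for the loaded sparse count matrix (cells x features)
counts = sparse.csr_matrix(np.array([
    [0, 3, 0, 0],
    [1, 0, 0, 2],
    [0, 0, 0, 0],
]))
barcodes = ["AAAC-1", "AAAG-1", "AACT-1"]
features = ["GeneA", "GeneB", "GeneC", "GeneD"]


def dense_block(counts, barcodes, features, rows, cols):
    """Densify only the requested block and label it with names."""
    block = counts[rows, cols].toarray()
    return pd.DataFrame(block, index=barcodes[rows], columns=features[cols])


# Preview the first two cells and the first three features
preview = dense_block(counts, barcodes, features, slice(0, 2), slice(0, 3))
print(preview)
```

Only the selected block is ever held in dense form, so memory use scales with the size of the preview rather than with the full matrix.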

The matrix can be saved in different output formats (e.g., CSV), although this might not be recommended for large matrices because of the disk space a "dense" file requires.
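As one alternative to a dense CSV, the matrix can be written in a compressed sparse format with its labels stored alongside it. The sketch below uses `scipy.sparse.save_npz` and `load_npz` on a small, made-up matrix; the file names are hypothetical. If the data is already held in a scanpy AnnData object, `adata.write_h5ad` is another option that keeps the matrix sparse.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from scipy import sparse

# Made-up stand-in for the sparse count matrix and its labels
counts = sparse.csr_matrix(np.array([
    [0, 3, 0, 0],
    [1, 0, 0, 2],
    [0, 0, 0, 0],
]))
barcodes = pd.Series(["AAAC-1", "AAAG-1", "AACT-1"])
features = pd.Series(["GeneA", "GeneB", "GeneC", "GeneD"])

# Hypothetical output location
out_dir = Path(tempfile.mkdtemp())

# Save the matrix in compressed sparse form, and the labels as plain text
sparse.save_npz(str(out_dir / "counts.npz"), counts)
barcodes.to_csv(out_dir / "barcodes.txt", index=False, header=False)
features.to_csv(out_dir / "features.txt", index=False, header=False)

# Reload everything and confirm the round trip preserved the data
restored = sparse.load_npz(str(out_dir / "counts.npz"))
restored_barcodes = pd.read_csv(out_dir / "barcodes.txt", header=None)[0]
restored_features = pd.read_csv(out_dir / "features.txt", header=None)[0]
print(restored.shape, len(restored_barcodes), len(restored_features))
```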

For any feedback or questions regarding this article (Illumina Knowledge Article #7911), contact Illumina Technical Support.

Last updated 1 year ago
