Cell Ranger ARC 2.0, printed on 11/24/2024
The cellranger-arc count pipeline outputs two types of feature-barcode matrices, described in the table below. Each matrix contains two types of features, Gene Expression and Peaks. For a Gene Expression feature, each element in the matrix is the number of UMIs associated with the corresponding feature (row) and barcode (column). For a Peaks feature, each element in the matrix is the number of cut sites associated with the corresponding feature (row) and barcode (column).
Type | Description |
---|---|
Raw feature-barcode matrix | The rows represent features (the genes and peaks detected), while the columns consist of all barcodes that were detected in the experiment with non-zero signal in either modality. For GEX, a barcode has a non-zero signal if it has at least one read. For ATAC, a non-zero signal means that there are reads mapped to at least one peak. Both cell-associated and background barcodes are included in the raw feature-barcode matrix. |
Filtered feature-barcode matrix | The rows represent features (the genes and peaks detected), while the columns consist of all cell-associated barcodes. |
Both matrices described above are sparse; in other words, a large number of their entries are zero. Each matrix is stored in two sparse-matrix formats: the text-based Market Exchange Format (MEX), which is described below, and the HDF5 format, which is described separately. The MEX output directory also contains gzipped TSV files with the features and barcode sequences corresponding to the row and column indices, respectively. For example, the MEX output may look like:

    $ cd /home/jdoe/runs/sample345/outs
    $ tree filtered_feature_bc_matrix
    filtered_feature_bc_matrix
    ├── barcodes.tsv.gz
    ├── features.tsv.gz
    └── matrix.mtx.gz

    0 directories, 3 files

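To make the sparse storage idea concrete, the following minimal sketch writes a toy count matrix in Matrix Market coordinate format with `scipy.io.mmwrite` and reads it back. The toy values, the temporary directory, and the variable names are illustrative assumptions and do not come from a real cellranger-arc run; the file written here is uncompressed, whereas the pipeline output is gzipped.

```python
import os
import tempfile

import numpy as np
import scipy.io
import scipy.sparse

# Toy 3-feature x 4-barcode count matrix (made-up values); most entries are zero.
dense = np.array([
    [0, 2, 0, 0],
    [0, 0, 0, 5],
    [1, 0, 0, 0],
])
toy = scipy.sparse.coo_matrix(dense)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "matrix.mtx")
    # Write the matrix in Matrix Market coordinate (sparse) format.
    scipy.io.mmwrite(path, toy)
    with open(path) as handle:
        mex_lines = handle.read().splitlines()
    # Read it back to confirm that nothing was lost.
    roundtrip = scipy.io.mmread(path)

# The file holds a header, a size line (rows, columns, non-zero entries),
# and then one "row column value" line per non-zero entry (1-based indices).
print("\n".join(mex_lines))
size_line = next(line for line in mex_lines if not line.startswith("%"))
```

Only the three non-zero entries are written, so the file size scales with the number of non-zero counts rather than with the full features-by-barcodes grid.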
For a Gene Expression feature: The first and second columns of the features.tsv.gz file store the gene ID and gene name, respectively, as defined in the reference GTF. If no gene_name field is present in the reference GTF, the gene name is set to the gene ID. The third column identifies the type of feature, which is Gene Expression for these rows. The fourth, fifth, and sixth columns store the chromosome, start, and end positions of the cellranger-arc determined TSS for this gene in 0-based BED format.
TSS for each gene: The TSS is defined as the 0-based coordinate of the 5'-most position of a transcript. For each gene, transcripts are restricted to those that have the GENCODE basic tag. If a gene does not have a transcript with this tag, then all associated transcripts are selected. For each gene, we define the TSS as the minimum region that spans the TSSs of all selected transcripts.
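The selection rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the cellranger-arc implementation: the `Transcript` record, the function names, and the example coordinates are hypothetical, transcripts are assumed to use 0-based half-open coordinates, and the returned region is assumed to be a half-open BED-style interval.

```python
from typing import NamedTuple


class Transcript(NamedTuple):
    """Hypothetical transcript record with 0-based, half-open coordinates."""
    start: int
    end: int
    strand: str       # "+" or "-"
    tags: frozenset   # GTF tag values, e.g. frozenset({"basic"})


def transcript_tss(tx):
    # The 5'-most base is the first base on "+" and the last base on "-".
    return tx.start if tx.strand == "+" else tx.end - 1


def gene_tss_interval(transcripts):
    # Prefer transcripts carrying the "basic" tag; fall back to all of them.
    basic = [tx for tx in transcripts if "basic" in tx.tags]
    selected = basic if basic else list(transcripts)
    sites = [transcript_tss(tx) for tx in selected]
    # Smallest interval covering every selected TSS (assumed half-open end).
    return min(sites), max(sites) + 1


# Made-up transcripts for one gene on the "+" strand.
example = [
    Transcript(1000, 5000, "+", frozenset({"basic"})),
    Transcript(1020, 5000, "+", frozenset({"basic"})),
    Transcript(900, 5000, "+", frozenset()),
]
tss_region = gene_tss_interval(example)
print(tss_region)  # (1000, 1021)
```

The untagged transcript starting at 900 is ignored because two tagged transcripts exist; with no tagged transcripts, it would be included in the span.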
For a Peaks feature: The first two columns of the features.tsv.gz file both store the peak ID, which is the location of the peak written as "contig:start-end". The third column identifies the type of feature, which is Peaks for these rows. The fourth, fifth, and sixth columns store the chromosome, start, and end positions of the peak in 0-based BED format.
Below is a minimal example features.tsv.gz file showing data collected for 3 genes and 2 peaks.

    $ zcat filtered_feature_bc_matrix/features.tsv.gz | head -n 5
    ENSG00000139687           RB1                       Gene Expression  chr13  48303725  48303747
    ENSG00000141510           TP53                      Gene Expression  chr17  7675492   7687550
    ENSG00000012048           BRCA1                     Gene Expression  chr17  43125314  43125483
    chr13:48301972-48306754   chr13:48301972-48306754   Peaks            chr13  48301972  48306754
    chr17:76751362-76755140   chr17:76751362-76755140   Peaks            chr17  76751362  76755140

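As a small illustration of the six-column layout, the sketch below parses two of the example rows and checks that a peak ID agrees with the coordinate columns. The helper `parse_peak_id` is a hypothetical name introduced here, and the rows are embedded as an in-memory string rather than read from a real features.tsv.gz file.

```python
import csv
import io

# Two rows copied from the example above, written with explicit tab separators.
features_tsv = (
    "ENSG00000139687\tRB1\tGene Expression\tchr13\t48303725\t48303747\n"
    "chr13:48301972-48306754\tchr13:48301972-48306754\tPeaks\tchr13\t48301972\t48306754\n"
)


def parse_peak_id(peak_id):
    """Split a "contig:start-end" peak ID into (contig, start, end)."""
    # rpartition splits on the last colon, in case the contig name contains one.
    contig, _, span = peak_id.rpartition(":")
    start, end = span.split("-")
    return contig, int(start), int(end)


rows = list(csv.reader(io.StringIO(features_tsv), delimiter="\t"))
genes = [row for row in rows if row[2] == "Gene Expression"]
peaks = [row for row in rows if row[2] == "Peaks"]

for feature_id, name, feature_type, chrom, start, end in peaks:
    # For peaks, the ID repeats the information in columns four to six.
    assert parse_peak_id(feature_id) == (chrom, int(start), int(end))
```

Splitting on tabs rather than on whitespace matters here because the feature type "Gene Expression" contains a space.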
Barcode sequences correspond to column indices.

    $ zcat filtered_feature_bc_matrix/barcodes.tsv.gz | head -n 5
    AAACAGCCAAATATCC-1
    AAACAGCCAGGAACTG-1
    AAACAGCCAGGCTTCG-1
    AAACCAACACCTGCTC-1
    AAACCAACAGATTCAT-1

Each barcode sequence includes a suffix with a dash separator followed by a number:
AAACAGCCAAATATCC-1
More details on the barcode sequence format are available in the GEX BAM section. Note that the barcodes correspond to the 10x Barcode sequence of the Gene Expression library. To learn more about the pairing between ATAC and GEX barcodes, see Barcode Translation.
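If the sequence and the numeric suffix are needed separately, a barcode string can be split on its dash. The function name `split_barcode` is a hypothetical helper used only for this sketch.

```python
def split_barcode(barcode):
    """Split a barcode such as "AAACAGCCAAATATCC-1" into (sequence, suffix)."""
    # rpartition splits on the last dash, which separates the numeric suffix.
    sequence, _, suffix = barcode.rpartition("-")
    return sequence, int(suffix)


sequence, suffix = split_barcode("AAACAGCCAAATATCC-1")
print(sequence, suffix)  # AAACAGCCAAATATCC 1
```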
R and Python support the MEX format, and sparse matrices can be used for more efficient manipulation.
The R package Matrix supports loading MEX format data, and can be easily used to load the sparse feature-barcode matrix, as shown in the example code below.

    library(Matrix)

    matrix_dir <- "/opt/sample345/outs/filtered_feature_bc_matrix/"
    barcode.path <- paste0(matrix_dir, "barcodes.tsv.gz")
    features.path <- paste0(matrix_dir, "features.tsv.gz")
    matrix.path <- paste0(matrix_dir, "matrix.mtx.gz")

    # Read the sparse matrix (rows = features, columns = barcodes)
    mat <- readMM(file = matrix.path)

    # Read the feature and barcode tables; neither file has a header row
    feature.names <- read.delim(features.path, header = FALSE, stringsAsFactors = FALSE)
    barcode.names <- read.delim(barcode.path, header = FALSE, stringsAsFactors = FALSE)

    # Label columns with barcodes and rows with feature IDs
    colnames(mat) <- barcode.names$V1
    rownames(mat) <- feature.names$V1

The csv, os, gzip and scipy.io modules can be used to load a feature-barcode matrix into Python as shown below.

    import csv
    import gzip
    import os
    import scipy.io

    matrix_dir = "/opt/sample345/outs/filtered_feature_bc_matrix"

    # Load the sparse matrix (rows = features, columns = barcodes)
    mat = scipy.io.mmread(os.path.join(matrix_dir, "matrix.mtx.gz"))

    # Open in text mode ("rt"): csv.reader expects str, not bytes, in Python 3
    features_path = os.path.join(matrix_dir, "features.tsv.gz")
    with gzip.open(features_path, "rt") as f:
        feature_rows = list(csv.reader(f, delimiter="\t"))
    feature_ids = [row[0] for row in feature_rows]
    gene_names = [row[1] for row in feature_rows]
    feature_types = [row[2] for row in feature_rows]

    barcodes_path = os.path.join(matrix_dir, "barcodes.tsv.gz")
    with gzip.open(barcodes_path, "rt") as f:
        barcodes = [row[0] for row in csv.reader(f, delimiter="\t")]

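Because Gene Expression and Peaks features share one matrix, a common next step is to separate the two modalities by feature type. The sketch below is one possible approach; it uses small made-up stand-ins for `mat`, `feature_types`, and `barcodes` so that it runs on its own, and the names `gex`, `peaks`, `umis_per_barcode`, and `cut_sites_per_barcode` are assumptions introduced for illustration.

```python
import numpy as np
import scipy.sparse

# Toy stand-ins for the objects loaded from the MEX files (made-up counts).
feature_types = ["Gene Expression", "Gene Expression", "Peaks"]
barcodes = ["AAACAGCCAAATATCC-1", "AAACAGCCAGGAACTG-1"]
mat = scipy.sparse.coo_matrix(np.array([
    [3, 0],
    [0, 1],
    [2, 4],
]))

# COO matrices do not support indexing, so convert to CSR before slicing rows.
csr = mat.tocsr()
types = np.array(feature_types)
gex = csr[np.flatnonzero(types == "Gene Expression"), :]
peaks = csr[np.flatnonzero(types == "Peaks"), :]

# Column sums give per-barcode totals: UMIs for GEX, cut sites for peaks.
umis_per_barcode = np.asarray(gex.sum(axis=0)).ravel()
cut_sites_per_barcode = np.asarray(peaks.sum(axis=0)).ravel()

for barcode, umis, cuts in zip(barcodes, umis_per_barcode, cut_sites_per_barcode):
    print(barcode, umis, cuts)
```

Both submatrices keep the original column order, so the `barcodes` list still labels their columns.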