Cell Ranger ARC 2.0, printed on 11/21/2024
The cellranger-arc pipeline outputs an HDF5 file containing per-molecule information for all molecules that contain a valid barcode and valid UMI and were assigned with high confidence to a gene. This HDF5 file contains data corresponding to the observed molecules, as well as data about the libraries, features, and barcode lists used for the analysis.
(root)
├─ barcode_idx
├─ barcode_info [HDF5 group]
│  ├─ genomes
│  └─ pass_filter
├─ barcodes
├─ count
├─ feature_idx
├─ features [HDF5 group]
│  ├─ _all_tag_keys
│  ├─ feature_type
│  ├─ genome
│  ├─ id
│  └─ name
├─ gem_group
├─ library_idx
├─ library_info
├─ metrics_json
├─ umi
└─ umi_type
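
To check that a given file matches the layout above, the hierarchy can be listed programmatically. The following is a minimal sketch, assuming the h5py package is installed; the helper names `list_paths` and `format_tree` and the file path are hypothetical and not part of Cell Ranger ARC.

```python
def list_paths(path):
    """Collect the name of every group and dataset in an HDF5 file."""
    import h5py  # assumption: h5py is installed; imported lazily on purpose

    names = []
    with h5py.File(path, "r") as f:
        # visit() calls the callback once per object with its path relative to the root
        f.visit(names.append)
    return names


def format_tree(names):
    """Indent each path by its depth so the hierarchy is visible."""
    lines = []
    for name in sorted(names):
        depth = name.count("/")
        lines.append("  " * depth + name.rsplit("/", 1)[-1])
    return "\n".join(lines)


# Hypothetical usage (placeholder path):
# print(format_tree(list_paths("path/to/molecule_info.h5")))
```
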
The following HDF5 datasets in the molecule information file correspond to columns of a table. Each row of that table corresponds to a unique molecule specified by a (UMI, cell-barcode, feature) tuple. This tuple indicates the feature best supported by the reads (including PCR duplicates) assigned to that unique pairing of UMI and 10x Barcode.
Column | Type | Description |
---|---|---|
barcode_idx | uint64 | A zero-based index into the barcodes dataset (see next section), indicating the 10x Barcode sequence assigned to this putative molecule. |
count | uint32 | Number of reads associated with this putative molecule that were confidently mapped to the assigned feature. |
feature_idx | uint32 | A zero-based index into the features HDF5 group (see next section), indicating the feature to which this putative molecule was assigned. |
gem_group | uint16 | Integer label that distinguishes data derived from distinct 10x Genomics GEM reactions (such as different chips or chip channels). |
library_idx | uint16 | A zero-based index into the library_info array (see next section) that distinguishes data coming from distinct 10x Genomics libraries. For the Chromium Single Cell Multiome ATAC + Gene Expression assay only one library can be associated with a single GEM well. |
umi | uint32 | 2-bit encoded (see note below) processed (i.e. corrected) UMI sequence. |
umi_type | uint32 | A boolean array specifying whether the molecule aligned to an exonic (1) or intronic (0) region of the associated feature. |
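
Because each row of this table is one unique molecule, counting rows per (barcode_idx, feature_idx) pair gives a UMI count, and summing the count column gives the supporting reads. The sketch below illustrates this on plain sequences; the function name `summarize` and the toy values are made up for illustration, and the commented h5py calls assume that package is available.

```python
from collections import Counter


def summarize(barcode_idx, feature_idx, count):
    """Return (UMI counts, read counts) keyed by (barcode_idx, feature_idx)."""
    umis = Counter()
    reads = Counter()
    for b, f, c in zip(barcode_idx, feature_idx, count):
        key = (int(b), int(f))
        umis[key] += 1        # one row is one unique molecule
        reads[key] += int(c)  # reads confidently mapped for that molecule
    return umis, reads


# Toy columns (made-up values): four molecules across two barcodes.
umis, reads = summarize([0, 0, 0, 2], [5, 5, 7, 5], [3, 1, 2, 4])

# With a real file the columns could be read with h5py, for example:
#   with h5py.File(path, "r") as f:
#       umis, reads = summarize(f["barcode_idx"][:], f["feature_idx"][:], f["count"][:])
```

For large files a vectorized approach would be faster; the loop is kept here for clarity.
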
The barcodes and library_info datasets provide information about the experiments contained in this analysis.
Dataset | Type | Description |
---|---|---|
barcodes | string | A list of all 10x Barcodes associated with this experiment (including those that were not observed). The barcode_idx column described in the previous section contains indices into this list of barcodes. |
library_info | string | A JSON-formatted array of objects, where each object contains metadata for a single library. Each library will at a minimum contain the metadata library_id, library_type, and gem_group. |
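
Since library_info holds JSON text, it has to be decoded before library_idx values can be resolved to library metadata. The sketch below shows one way to do that. Whether the dataset is read back as bytes, as str, or as a one-element array depends on how it was written and on the HDF5 reader, so the helper accepts all three; that handling, the helper names, and the toy JSON are assumptions for illustration.

```python
import json


def as_text(value):
    """Return str for bytes, str, or a one-element sequence of either."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode("utf-8")
    if isinstance(value, str):
        return value
    return as_text(value[0])  # e.g. a one-element array holding the JSON text


def parse_library_info(raw):
    """Decode the library_info dataset into a list of dicts."""
    return json.loads(as_text(raw))


# Toy payload (made-up values) with the three minimum keys named above.
raw = b'[{"library_id": 0, "library_type": "Gene Expression", "gem_group": 1}]'
libraries = parse_library_info(raw)

# A library_idx value from the molecule table indexes into this list.
library_of_molecule = libraries[0]

# With h5py, the raw value could be read as: raw = f["library_info"][()]
```
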
The HDF5 group barcode_info provides information regarding the barcodes that were called as cells during the analysis. This HDF5 group contains two datasets.
Dataset | Type | Description |
---|---|---|
genomes | string | A list of all genome references used in this analysis. In most cases, this will be a single genome. |
pass_filter | uint64 | A matrix with three columns that contains one row per cell-barcode. Each row is a tuple (barcode_idx, library_idx, genome_idx), where genome_idx is an index into the genomes dataset. |
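
A common use of pass_filter is to restrict the molecule table to barcodes that were called as cells. The following sketch takes the first column of pass_filter as a set of barcode indices and selects the matching molecule rows; the function names and toy values are hypothetical.

```python
def cell_barcode_indices(pass_filter):
    """Set of barcode_idx values taken from the first column of pass_filter."""
    return {int(row[0]) for row in pass_filter}


def keep_cell_molecules(barcode_idx, cells):
    """Row numbers of molecules whose barcode was called as a cell."""
    return [i for i, b in enumerate(barcode_idx) if int(b) in cells]


# Toy data (made-up values): barcodes 0 and 2 were called as cells.
pass_filter = [(0, 0, 0), (2, 0, 0)]
cells = cell_barcode_indices(pass_filter)
rows = keep_cell_molecules([0, 1, 2, 2, 3], cells)

# With h5py, the matrix could be read as: f["barcode_info/pass_filter"][:]
```
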
The HDF5 group features contains information regarding the feature reference used for the analysis. The datasets within the features group represent columns of a table containing one row per feature. Values in the feature_idx column described in the previous section provide indices into the rows of this table.
In addition to the columns described below, _all_tag_keys contains a list of built-in tags (genome).
Column | Type | Description |
---|---|---|
feature_type | string | The type of feature reference to which this feature belongs (Gene Expression). |
genome | string | The genome reference for a given feature (e.g., "GRCh38" or "mm10"). |
id | string | The Ensembl gene ID corresponding to this feature. |
name | string | The common gene symbol associated with each of the above ids. |
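
Because the datasets in the features group are parallel columns, a feature_idx value selects the same position in each of them. The sketch below assembles one table row from those columns. It works on any mapping of column name to sequence, which is an assumption chosen so that both a plain dict and an h5py group can be passed in; the function name and the toy record are illustrative only.

```python
def feature_row(features, idx):
    """Build one row of the feature table from the parallel column datasets."""

    def text(value):
        # HDF5 string datasets are often returned as bytes
        return value.decode("utf-8") if isinstance(value, bytes) else str(value)

    columns = ("id", "name", "feature_type", "genome")
    return {col: text(features[col][idx]) for col in columns}


# Toy feature table with a single row (illustrative values).
features = {
    "id": [b"ENSG00000141510"],
    "name": [b"TP53"],
    "feature_type": [b"Gene Expression"],
    "genome": [b"GRCh38"],
}
row = feature_row(features, 0)

# With h5py, the group itself could be passed: feature_row(f["features"], idx)
```
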
The UMI sequences are 2-bit encoded as follows:
Note that the cell-barcode sequences do not have this encoding. Instead, they are stored as plain strings in the barcodes HDF5 dataset.
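
A 2-bit encoded UMI can be unpacked with a few bit operations. The sketch below assumes a mapping of A=0, C=1, G=2, T=3 with the 3'-most nucleotide stored in the least significant bits. Neither that mapping nor the UMI length is specified in this section, so both should be verified against the chemistry in use before relying on the result. The function names are hypothetical.

```python
NUCLEOTIDES = "ACGT"  # assumption: A=0, C=1, G=2, T=3


def decode_umi(value, length):
    """Unpack a 2-bit encoded integer into a nucleotide string of `length` bases."""
    value = int(value)  # also accepts NumPy integer scalars
    bases = []
    for _ in range(length):
        bases.append(NUCLEOTIDES[value & 0b11])  # lowest two bits = 3'-most base
        value >>= 2
    return "".join(reversed(bases))  # restore 5' to 3' order


def encode_umi(seq):
    """Inverse of decode_umi, useful for round-trip checks."""
    value = 0
    for base in seq:
        value = (value << 2) | NUCLEOTIDES.index(base)
    return value


# Round trip on a toy sequence; the length must be supplied by the caller.
encoded = encode_umi("ACGT")
decoded = decode_umi(encoded, 4)
```

The length argument matters because leading A bases encode as zero bits and cannot be recovered from the integer alone.
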
The metrics_json dataset contains pipeline metrics in JSON format that are used internally by Cell Ranger. Users should view metrics using the Cell Ranger ARC metrics outputs.