Space Ranger 1.2, printed on 11/22/2024
The spaceranger pipeline outputs an HDF5 file containing per-molecule information for all molecules that contain a valid barcode and valid UMI and were assigned with high confidence to a gene. This HDF5 file contains data corresponding to the observed molecules, as well as data about the libraries, features (genes), and barcode lists used for the analysis.
The following HDF5 datasets in the molecule info file correspond to columns of a table. Each row of that table corresponds to a unique (UMI, spot-barcode, feature) tuple indicating the feature best supported by the reads (i.e., including PCR duplicates) assigned to that UMI and spot-barcode.
Column | Type | Description |
---|---|---|
barcode_idx | uint64 | A zero-based index into the barcodes dataset (see next section), indicating the spot-barcode assigned to this putative molecule. |
count | uint32 | Number of reads associated with this putative molecule that were confidently mapped to the assigned feature. |
feature_idx | uint32 | A zero-based index into the feature list (see next section), indicating the feature to which this putative molecule was assigned. |
gem_group | uint16 | Integer label that is currently one (1) for all Space Ranger output. |
library_idx | uint16 | Integer label that is currently one (1) for all Space Ranger output. |
umi | uint32 | 2-bit encoded (see note below) processed (i.e. corrected) UMI sequence. |
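As a minimal sketch of how these parallel columns fit together, the following Python function walks the columns row by row and tallies putative molecules (UMIs) and supporting reads per (barcode_idx, feature_idx) pair. The function name and the toy input are hypothetical; the only thing taken from the table above is the column names.

```python
from collections import defaultdict

def summarize_molecules(columns):
    """Tally UMIs and reads per (barcode_idx, feature_idx) pair.

    `columns` is any mapping from column name to equal-length sequences
    (here a plain dict of lists used as hypothetical toy input).
    """
    umi_counts = defaultdict(int)
    read_counts = defaultdict(int)
    rows = zip(columns["barcode_idx"], columns["feature_idx"], columns["count"])
    for barcode_idx, feature_idx, count in rows:
        key = (int(barcode_idx), int(feature_idx))
        umi_counts[key] += 1             # each row is one putative molecule
        read_counts[key] += int(count)   # reads supporting that molecule
    return dict(umi_counts), dict(read_counts)

# Toy data, not taken from a real run.
toy = {
    "barcode_idx": [0, 0, 2],
    "feature_idx": [5, 5, 1],
    "count": [3, 1, 7],
}
umis, reads = summarize_molecules(toy)
print(umis)   # UMIs per (barcode_idx, feature_idx)
print(reads)  # reads per (barcode_idx, feature_idx)
```

With an HDF5 reader such as h5py, an opened file (`h5py.File(path, "r")`) can be indexed by the same column names, and `f["count"][:]` reads a whole column into memory.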
In addition, the molecule info file has datasets corresponding to information about the libraries, barcode list(s), and feature set(s) that were used in the analysis.
At the top level of the HDF5 file hierarchy, the barcodes and library_info datasets provide information about the experiments contained in this analysis:
Dataset | Type | Description |
---|---|---|
barcodes | string | A list of all spot-barcodes associated with this experiment (including those that were not observed). The barcode_idx column described in the previous section contains indices into this list of barcodes. Each spot-barcode sequence has a trailing digit that is currently one (1) in output generated from Space Ranger (e.g., AGAATGGTCTGCAT-1). |
library_info | string | A JSON-formatted array of objects, where each object contains metadata for a single library. Each library will at a minimum contain the metadata library_id, library_type, and gem_group. |
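The two datasets above can be handled with the standard library alone. The sketch below parses a library_info JSON array and splits a spot-barcode into its sequence and trailing digit. Both helper names are hypothetical, and the example JSON values are invented for illustration; only the three key names and the barcode example come from the table. Strings read from HDF5 may arrive as bytes, depending on the library used, so both helpers accept either type.

```python
import json

def parse_library_info(raw):
    """Decode the library_info JSON array into a list of dicts."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return json.loads(raw)

def split_barcode(barcode):
    """Split 'SEQUENCE-1' into ('SEQUENCE', 1)."""
    if isinstance(barcode, bytes):
        barcode = barcode.decode("ascii")
    sequence, _, suffix = barcode.rpartition("-")
    return sequence, int(suffix)

# Invented example content; real files may carry additional keys.
example_raw = b'[{"library_id": 0, "library_type": "Gene Expression", "gem_group": 1}]'
libraries = parse_library_info(example_raw)
sequence, suffix = split_barcode("AGAATGGTCTGCAT-1")
print(libraries[0]["library_type"], sequence, suffix)
```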
The HDF5 group barcode_info gives information regarding the barcodes determined to be underneath the tissue. This HDF5 group contains two datasets:
Dataset | Type | Description |
---|---|---|
genomes | string | A list of all genome references used for gene expression libraries in this analysis. |
pass_filter | uint64 | A matrix with three columns that contains one row per passing spot-barcode. Each row is a tuple (barcode_idx, library_idx, genome_idx), where genome_idx is an index into the genomes dataset. |
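To illustrate how pass_filter ties the other datasets together, the following sketch resolves each row to a spot-barcode string and its genome. The function name and toy values are hypothetical; it assumes only the (barcode_idx, library_idx, genome_idx) row layout described above.

```python
def barcodes_under_tissue(barcodes, pass_filter, genomes):
    """Return {spot_barcode: genome} for every passing spot-barcode."""
    result = {}
    for barcode_idx, _library_idx, genome_idx in pass_filter:
        # Both indices are zero-based positions in their respective lists.
        result[barcodes[int(barcode_idx)]] = genomes[int(genome_idx)]
    return result

# Toy data, not taken from a real run.
toy_barcodes = ["AAAC-1", "AAAG-1", "AAAT-1"]
toy_genomes = ["GRCh38"]
toy_pass_filter = [(0, 0, 0), (2, 0, 0)]

under_tissue = barcodes_under_tissue(toy_barcodes, toy_pass_filter, toy_genomes)
print(under_tissue)
```

Barcodes absent from the result (here the second one) were in the barcode list but were not determined to be under the tissue.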
The HDF5 group features contains information regarding the feature reference(s) used for the analysis. The datasets within the features group represent columns in a table containing one row per feature (gene). Values in the feature_idx column described in the previous section provide indices into the rows of this hypothetical table.
Column | Type | Description |
---|---|---|
feature_type | string | The type of feature reference to which this feature belongs. Currently Visium only supports Gene Expression features. |
genome | string | The genome reference for a given feature (e.g., "GRCh38" or "mm10"). |
id | string | The unique id corresponding to this feature (for example, an Ensembl gene ID). |
name | string | A human-readable name associated with this feature (for example, the common name associated with a gene). |
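Because the features group is a set of parallel columns, a feature_idx value selects the same position in each of them. The sketch below assembles one row of that table. The function name is hypothetical and the two example genes are toy input chosen for illustration.

```python
FEATURE_COLUMNS = ("id", "name", "feature_type", "genome")

def feature_record(features, feature_idx):
    """Assemble one row of the feature table from its parallel columns."""
    return {column: features[column][feature_idx] for column in FEATURE_COLUMNS}

# Toy feature table with two genes.
toy_features = {
    "id": ["ENSG00000141510", "ENSG00000012048"],
    "name": ["TP53", "BRCA1"],
    "feature_type": ["Gene Expression", "Gene Expression"],
    "genome": ["GRCh38", "GRCh38"],
}

record = feature_record(toy_features, 1)
print(record)
```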
The features group also contains an HDF5 group target_sets used for Targeted Gene Expression samples. When a target gene panel is present, indices of the target genes are stored inside target_sets, in an HDF5 dataset named after the target gene panel (e.g., "Human Gene Signature").
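One possible use of a target set is to restrict the molecule table to targeted genes. The sketch below assumes that the stored indices are zero-based positions in the feature table, the same indexing used by feature_idx; that assumption, the function name, and the toy values are not taken from the text above.

```python
def rows_in_target_set(feature_idx_column, target_indices):
    """Return row numbers of molecules assigned to a targeted feature."""
    targeted = {int(i) for i in target_indices}
    return [
        row
        for row, feature_idx in enumerate(feature_idx_column)
        if int(feature_idx) in targeted
    ]

# Toy data: four molecules, and a panel that targets features 1 and 3.
toy_feature_idx = [5, 1, 5, 3]
toy_panel = [1, 3]

targeted_rows = rows_in_target_set(toy_feature_idx, toy_panel)
print(targeted_rows)
```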
(root)
├─ barcode_idx
├─ barcode_info [HDF5 group]
│   ├─ genomes
│   └─ pass_filter
├─ barcodes
├─ count
├─ feature_idx
├─ features [HDF5 group]
│   ├─ _all_tag_keys
│   ├─ target_sets [for Targeted Gene Expression]
│   │   └─ [target set name]
│   ├─ feature_type
│   ├─ genome
│   ├─ id
│   └─ name
├─ gem_group
├─ library_idx
├─ library_info
├─ metrics [HDF5 group; see below]
└─ umi
The UMI sequences are 2-bit encoded as follows:
Note that the spot-barcode sequences do not have this encoding. Instead, they are stored as plain strings in the barcodes dataset.
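A minimal decoding sketch follows. It assumes the convention commonly documented for 10x Genomics molecule info files: two bits per nucleotide with A=0, C=1, G=2, T=3, and the least significant two bits holding the 3'-most nucleotide. Because the sequence length is not stored in the integer, it must be supplied by the caller. The default of 12 (the Visium UMI length) and both function names are assumptions made for this example.

```python
NUCLEOTIDES = "ACGT"  # assumed mapping: A=0, C=1, G=2, T=3

def decode_umi(value, length=12):
    """Decode a 2-bit packed UMI integer into a nucleotide string."""
    value = int(value)  # also accepts NumPy integer scalars
    bases = []
    for _ in range(length):
        bases.append(NUCLEOTIDES[value & 0b11])  # lowest 2 bits = 3'-most base
        value >>= 2
    return "".join(reversed(bases))  # restore 5'-to-3' order

def encode_umi(sequence):
    """Inverse of decode_umi, included to check the round trip."""
    value = 0
    for base in sequence:
        value = (value << 2) | NUCLEOTIDES.index(base)
    return value

packed = encode_umi("ACGTACGTACGT")
unpacked = decode_umi(packed)
print(packed, unpacked)
```

A 12-nucleotide UMI occupies 24 bits, so it fits in the uint32 type given for the umi column.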
The metrics_json dataset contains pipeline metrics in JSON format that are used internally by Space Ranger. Users should view metrics using the Space Ranger metrics outputs.