Cell Ranger ARC 2.0, printed on 11/22/2024
In addition to the MEX format, we also provide matrices in the Hierarchical Data Format (HDF5 or H5). H5 is a binary format that can compress and access data much more efficiently than text formats such as MEX, which is especially useful when dealing with large datasets. H5 files are supported in both Python and R.
For more information on the format, see the Introduction to HDF5.
The top level of the file contains a single HDF5 group, called matrix, and metadata stored as HDF5 attributes. Within the matrix group are datasets containing the dimensions of the matrix, the matrix entries, and the features and cell barcodes associated with the matrix rows and columns, respectively.
Column | Type | Description |
---|---|---|
barcodes | string | Barcode sequences and their corresponding GEM wells (e.g. AAACGGGCAGCTCGAC-1 ) |
data | uint32 | Nonzero UMI counts in column-major order |
indices | uint32 | Zero-based row index of corresponding element in data |
indptr | uint32 | Zero-based index into data / indices of the start of each column, i.e., the data corresponding to each barcode sequence |
shape | uint64 | Tuple of (# rows, # columns) indicating the matrix dimensions |
The matrix entries are stored in Compressed Sparse Column (CSC) format. For more details on the format, see this SciPy introduction. CSC represents the matrix in column-major order, such that each barcode is represented by a contiguous chunk of data values.
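To make the roles of data, indices, and indptr concrete, the following pure-Python sketch expands a small, made-up CSC triple into a dense matrix and sums the counts in each column. The helper names (csc_to_dense, column_totals) and the toy values are illustrative assumptions, not part of Cell Ranger ARC; in practice SciPy's csc_matrix does this work.

```python
def csc_to_dense(data, indices, indptr, shape):
    """Expand CSC arrays into a dense list-of-rows matrix."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        # indptr[col]:indptr[col + 1] is the slice of data/indices for this column
        for k in range(indptr[col], indptr[col + 1]):
            dense[indices[k]][col] = data[k]
    return dense


def column_totals(data, indptr):
    """Sum the stored values of each column (total count per barcode)."""
    return [sum(data[indptr[col]:indptr[col + 1]])
            for col in range(len(indptr) - 1)]


# Toy matrix: 3 features (rows) x 4 barcodes (columns); values are made up.
data = [1, 4, 3, 2, 5]        # nonzero counts, column by column
indices = [0, 2, 1, 0, 2]     # row index of each entry in data
indptr = [0, 2, 2, 3, 5]      # start of each column, plus one closing entry
shape = (3, 4)

dense = csc_to_dense(data, indices, indptr, shape)
totals = column_totals(data, indptr)
for row in dense:
    print(row)
print(totals)
```

In this toy example the second barcode has no counts, so its start and end offsets in indptr are equal. By the usual CSC convention, indptr holds one more entry than there are columns, so the last column also has an end offset.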
The feature reference is stored as an HDF5 group called features, within the matrix group. See the documentation for the Molecule Info HDF5 file for details.

    (root)
    └── matrix [HDF5 group]
        ├── barcodes
        ├── data
        ├── indices
        ├── indptr
        ├── shape
        └── features [HDF5 group]
            ├── _all_tag_keys
            ├── feature_type
            ├── genome
            ├── id
            ├── interval [cellranger-arc only]
            └── name

There are two ways to load the H5 matrix into Python:
The first method uses the Python library bundled with Cell Ranger ARC, and requires that you add cellranger-arc-2.0.2/lib/python to your $PYTHONPATH. For example, if you installed Cell Ranger ARC into /opt/cellranger-arc-2.0.2, then you can source the following script to set your PYTHONPATH:

    $ source /opt/cellranger-arc-2.0.2/sourceme.bash

Then in Python, the matrix can be loaded as follows:

    import cellranger.matrix as cr_matrix

    filtered_matrix_h5 = "/opt/sample345/outs/filtered_feature_bc_matrix.h5"
    filtered_feature_bc_matrix = cr_matrix.CountMatrix.load_h5_file(filtered_matrix_h5)

The second method is a bit more involved and requires the SciPy and PyTables libraries.

    import collections
    import scipy.sparse as sp_sparse
    import tables

    CountMatrix = collections.namedtuple('CountMatrix', ['feature_ref', 'barcodes', 'matrix'])

    def get_matrix_from_h5(filename):
        with tables.open_file(filename, 'r') as f:
            mat_group = f.get_node(f.root, 'matrix')
            barcodes = f.get_node(mat_group, 'barcodes').read()
            data = getattr(mat_group, 'data').read()
            indices = getattr(mat_group, 'indices').read()
            indptr = getattr(mat_group, 'indptr').read()
            shape = getattr(mat_group, 'shape').read()
            matrix = sp_sparse.csc_matrix((data, indices, indptr), shape=shape)

            feature_ref = {}
            feature_group = f.get_node(mat_group, 'features')
            feature_ids = getattr(feature_group, 'id').read()
            feature_names = getattr(feature_group, 'name').read()
            feature_types = getattr(feature_group, 'feature_type').read()
            feature_ref['id'] = feature_ids
            feature_ref['name'] = feature_names
            feature_ref['feature_type'] = feature_types
            tag_keys = [key.decode() for key in getattr(feature_group, '_all_tag_keys').read()]
            for key in tag_keys:
                feature_ref[key] = getattr(feature_group, key).read()

            return CountMatrix(feature_ref, barcodes, matrix)

    filtered_matrix_h5 = "/opt/sample345/outs/filtered_feature_bc_matrix.h5"
    filtered_feature_bc_matrix = get_matrix_from_h5(filtered_matrix_h5)
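
Strings read from the H5 file with PyTables come back as byte strings, which is why the tag keys are decoded in the loader. As a rough sketch of one follow-up step, the helpers below decode such labels and pick out the row indices that belong to one feature type. The function names and the toy values are hypothetical, and the feature type labels "Gene Expression" and "Peaks" are assumed here; check the feature_type dataset of your own file.

```python
def decode_labels(raw_labels):
    """Turn byte strings (as read from HDF5) into ordinary Python strings."""
    return [label.decode() if isinstance(label, bytes) else str(label)
            for label in raw_labels]


def rows_with_feature_type(feature_types, wanted):
    """Return the zero-based row indices whose feature type equals `wanted`."""
    return [row for row, ftype in enumerate(decode_labels(feature_types))
            if ftype == wanted]


# Toy stand-ins for feature_ref['feature_type'] and feature_ref['name'];
# the values are illustrative, not taken from a real run.
toy_feature_types = [b"Gene Expression", b"Gene Expression", b"Peaks"]
toy_feature_names = [b"GENE_A", b"GENE_B", b"chr1:100-200"]

gex_rows = rows_with_feature_type(toy_feature_types, "Gene Expression")
peak_rows = rows_with_feature_type(toy_feature_types, "Peaks")

names = decode_labels(toy_feature_names)
gex_names = [names[row] for row in gex_rows]
print(gex_rows, peak_rows, gex_names)
```

With a matrix loaded as above, the resulting index list could be used to subset the sparse matrix by row, for example filtered_feature_bc_matrix.matrix[gex_rows, :], which relies on SciPy's sparse row indexing.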