10x Genomics
Chromium Single Cell Gene Expression

Cell Ranger 7.0, printed on 03/03/2025

HDF5 Feature-Barcode Matrix Format

In addition to the MEX format, we also provide matrices in the Hierarchical Data Format (HDF5 or H5). H5 is a binary format that can compress and access data much more efficiently than text formats such as MEX, which is especially useful when dealing with large datasets. H5 files are supported in both R and Python.

For more information on the format, see the Introduction to HDF5.

Note: Cell Ranger generates an output file with per-molecule information in HDF5 format. General information about the HDF5 file format here applies to the molecule_info.h5 or sample_molecule_info.h5 file, but see the documentation for specific details about the Molecule Info HDF5 file.


Data format

The top level of the file contains a single HDF5 group, called matrix, and metadata stored as HDF5 attributes. Within the matrix group are datasets containing the dimensions of the matrix, the matrix entries, as well as the features and cell-barcodes associated with the matrix rows and columns, respectively.

Dataset	Type	Description
`barcodes`	string	Barcode sequences and their corresponding GEM wells (e.g. `AAACGGGCAGCTCGAC-1`)
`data`	uint32	Nonzero UMI counts in column-major order
`indices`	uint32	Zero-based row index of corresponding element in `data`
`indptr`	uint32	Zero-based index into `data` / `indices` of the start of each column, i.e., the data corresponding to each barcode sequence
`shape`	uint64	Tuple of (# rows, # columns) indicating the matrix dimensions

The matrix entries are stored in Compressed Sparse Column (CSC) format. For more details on the format, see this SciPy introduction. CSC represents the matrix in column-major order, such that each barcode is represented by a contiguous chunk of data values.
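
To make the CSC layout concrete, here is a minimal pure-Python sketch that decodes the `data`, `indices`, and `indptr` arrays by hand. The toy values are invented for illustration and are not taken from a real Cell Ranger file. The helper names (`column_entries`, `to_dense`, `umis_per_barcode`) are hypothetical. The sketch assumes the standard CSC convention in which `indptr` holds one extra trailing entry marking the end of the last column.

```python
# Toy CSC arrays for a 3-feature x 4-barcode matrix (invented values).
data = [5, 1, 2, 7, 3]      # nonzero UMI counts, stored column by column
indices = [0, 2, 1, 0, 2]   # row (feature) index of each entry in data
indptr = [0, 2, 3, 3, 5]    # start offset of each column, plus a final end offset
shape = (3, 4)              # (# rows, # columns)


def column_entries(data, indices, indptr, col):
    """Return (row, count) pairs for one column, i.e. one barcode."""
    start, end = indptr[col], indptr[col + 1]
    return list(zip(indices[start:end], data[start:end]))


def to_dense(data, indices, indptr, shape):
    """Expand the CSC arrays into a dense list-of-rows matrix."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        for row, count in column_entries(data, indices, indptr, col):
            dense[row][col] = count
    return dense


def umis_per_barcode(data, indptr):
    """Sum the contiguous slice of data that belongs to each column."""
    return [sum(data[indptr[c]:indptr[c + 1]]) for c in range(len(indptr) - 1)]


if __name__ == "__main__":
    for row in to_dense(data, indices, indptr, shape):
        print(row)
    print(umis_per_barcode(data, indptr))
```

Because each barcode occupies one contiguous slice of `data`, per-barcode summaries such as total UMI counts only need `indptr` and `data`. The third toy column is empty, which shows up as two equal consecutive `indptr` values.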

The feature reference is stored as an HDF5 group called features, within the matrix group. Note that for Targeted Gene Expression samples, the features dataset in the filtered matrix H5 file will not contain non-targeted genes, and the feature indices in target_sets are updated accordingly.

HDF5 file hierarchy

(root)
└── matrix [HDF5 group]
    ├── barcodes
    ├── data
    ├── indices
    ├── indptr
    ├── shape
    └── features [HDF5 group]
        ├── _all_tag_keys
        ├── target_sets [for Targeted Gene Expression or Fixed RNA Profiling]
        │   └── [target set name]
        ├── feature_type
        ├── genome
        ├── id
        ├── name
        ├── pattern [Feature Barcode only]
        ├── read [Feature Barcode only]
        └── sequence [Feature Barcode only]

You can examine the contents of the H5 file using software such as HDFView or the h5dump command, as demonstrated below.

Show the file contents of the entire H5 object:

h5dump -n ./filtered_feature_bc_matrix.h5

HDF5 "filtered_feature_bc_matrix.h5" {
FILE_CONTENTS {
 group      /
 group      /matrix
 dataset    /matrix/barcodes
 dataset    /matrix/data
 group      /matrix/features
 dataset    /matrix/features/_all_tag_keys
 dataset    /matrix/features/feature_type
 dataset    /matrix/features/genome
 dataset    /matrix/features/id
 dataset    /matrix/features/name
 dataset    /matrix/indices
 dataset    /matrix/indptr
 dataset    /matrix/shape
 }
}
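
If you want to check such a listing programmatically, the sketch below parses the text printed by `h5dump -n` and reports which of the core matrix datasets are missing. It is only an illustration: the function names and the `EXPECTED_DATASETS` set are hypothetical, and the parser assumes the two-column "kind, path" layout shown in the output above.

```python
# Core datasets listed in the data format table (assumed to be the required ones).
EXPECTED_DATASETS = {
    "/matrix/barcodes",
    "/matrix/data",
    "/matrix/indices",
    "/matrix/indptr",
    "/matrix/shape",
}


def parse_h5dump_listing(text):
    """Collect group and dataset paths from `h5dump -n` output."""
    found = {"group": [], "dataset": []}
    for line in text.splitlines():
        parts = line.split()
        # Content lines look like "dataset    /matrix/data"; skip everything else.
        if len(parts) == 2 and parts[0] in found:
            found[parts[0]].append(parts[1])
    return found


def missing_datasets(text):
    """Return the expected dataset paths that do not appear in the listing."""
    present = set(parse_h5dump_listing(text)["dataset"])
    return sorted(EXPECTED_DATASETS - present)
```

An empty list from `missing_datasets` means all five core datasets were found in the listing.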

Show the top few lines of a specific part of the object (e.g., the matrix/barcodes dataset):

h5dump -d matrix/barcodes ./filtered_feature_bc_matrix.h5 | head -n 15

HDF5 "./filtered_feature_bc_matrix.h5" {
DATASET "matrix/barcodes" {
   DATATYPE  H5T_STRING {
      STRSIZE 18;
      STRPAD H5T_STR_NULLPAD;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   }
   DATASPACE  SIMPLE { ( 1225 ) / ( 1225 ) }
   DATA {
   (0): "AAACCCAAGGAGAGTA-1", "AAACGCTTCAGCCCAG-1", "AAAGAACAGACGACTG-1",
   (3): "AAAGAACCAATGGCAG-1", "AAAGAACGTCTGCAAT-1", "AAAGGATAGTAGACAT-1",
   (6): "AAAGGATCACCGGCTA-1", "AAAGGATTCAGCTTGA-1", "AAAGGATTCCGTTTCG-1",
   (9): "AAAGGGCTCATGCCCT-1", "AAAGGGCTCCGTAGGC-1", "AAAGGTACAACTGCTA-1",
   (12): "AAAGTCCAGCGGGTTA-1", "AAAGTCCAGTCAACAA-1", "AAAGTCCCACCAGCCA-1",
...
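
The dump above shows that barcodes are stored as fixed-length, null-padded ASCII strings, so Python readers typically return them as bytes. The following sketch separates a barcode into its sequence and GEM well suffix. The function name `split_barcode` is hypothetical, and the code assumes the `SEQUENCE-N` form shown in the data format table.

```python
def split_barcode(barcode):
    """Split a barcode such as b'AAACCCAAGGAGAGTA-1' into (sequence, gem_well)."""
    if isinstance(barcode, bytes):
        # Drop any null padding before decoding the ASCII bytes.
        barcode = barcode.rstrip(b"\x00").decode("ascii")
    sequence, sep, gem_well = barcode.rpartition("-")
    if not sep:
        raise ValueError("barcode has no GEM well suffix: %r" % barcode)
    return sequence, int(gem_well)


if __name__ == "__main__":
    print(split_barcode(b"AAACCCAAGGAGAGTA-1"))
```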

Downstream analysis with R and Python

For suggestions on downstream analysis with 3rd party R and Python tools, see the 10x Genomics Analysis Guides resource.

Loading matrices into Python

There are two ways to load the H5 matrix into Python:

Method 1: using Cell Ranger

This method requires that you add cellranger/lib/python to your $PYTHONPATH (note: this method only works on Linux machines). For example, if you installed Cell Ranger into /opt/cellranger-7.0.1, you can set your PYTHONPATH by sourcing the following script from the directory that contains the installation (/opt in this example):

source cellranger-7.0.1/sourceme.bash

Then in Python, the matrix can be loaded as follows (edit file path to your H5 file):

import cellranger.matrix as cr_matrix
filtered_matrix_h5 = "/opt/sample345/outs/filtered_feature_bc_matrix.h5"
filtered_feature_bc_matrix = cr_matrix.CountMatrix.load_h5_file(filtered_matrix_h5)

Method 2: using PyTables

This method is a bit more involved, and requires the SciPy and PyTables libraries.

import collections
import scipy.sparse as sp_sparse
import tables

# Lightweight container for the feature reference, barcodes, and sparse matrix.
CountMatrix = collections.namedtuple('CountMatrix', ['feature_ref', 'barcodes', 'matrix'])

def get_matrix_from_h5(filename):
    with tables.open_file(filename, 'r') as f:
        # All datasets live under the top-level "matrix" group.
        mat_group = f.get_node(f.root, 'matrix')
        # Barcodes are returned as byte strings, e.g. b'AAACCCAAGGAGAGTA-1'.
        barcodes = f.get_node(mat_group, 'barcodes').read()
        data = getattr(mat_group, 'data').read()
        indices = getattr(mat_group, 'indices').read()
        indptr = getattr(mat_group, 'indptr').read()
        shape = getattr(mat_group, 'shape').read()
        # Rebuild the features x barcodes matrix from its CSC components.
        matrix = sp_sparse.csc_matrix((data, indices, indptr), shape=shape)

        # Collect the per-feature annotations from the "features" subgroup.
        feature_ref = {}
        feature_group = f.get_node(mat_group, 'features')
        feature_ids = getattr(feature_group, 'id').read()
        feature_names = getattr(feature_group, 'name').read()
        feature_types = getattr(feature_group, 'feature_type').read()
        feature_ref['id'] = feature_ids
        feature_ref['name'] = feature_names
        feature_ref['feature_type'] = feature_types
        # _all_tag_keys names any additional per-feature datasets (e.g. genome).
        tag_keys = getattr(feature_group, '_all_tag_keys').read()
        for key in tag_keys:
            key = key.decode("utf-8")
            feature_ref[key] = getattr(feature_group, key).read()

        return CountMatrix(feature_ref, barcodes, matrix)

# Edit the file path to point to your H5 file.
filtered_matrix_h5 = "/opt/sample345/outs/filtered_feature_bc_matrix.h5"
filtered_feature_bc_matrix = get_matrix_from_h5(filtered_matrix_h5)
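
Once the matrix is loaded, the entries of `feature_ref` are arrays of byte strings. As a rough sketch of a next step, the helpers below decode those values, count features per feature type, and build a lookup from feature ID to row index. The helper names (`decode_all`, `feature_type_counts`, `feature_index`) are hypothetical and not part of Cell Ranger. They only assume that `feature_ref` is a dict with the `id`, `name`, and `feature_type` keys filled in by `get_matrix_from_h5`.

```python
import collections


def decode_all(values):
    """Decode a sequence of byte strings (as read from the H5 file) to str."""
    return [v.decode("utf-8") if isinstance(v, bytes) else str(v) for v in values]


def feature_type_counts(feature_ref):
    """Count how many features belong to each feature type."""
    return collections.Counter(decode_all(feature_ref["feature_type"]))


def feature_index(feature_ref, key="id"):
    """Map each feature ID (or name, with key='name') to its matrix row index."""
    return {value: row for row, value in enumerate(decode_all(feature_ref[key]))}
```

For example, `feature_index(filtered_feature_bc_matrix.feature_ref)` gives a row number that can be used to index `filtered_feature_bc_matrix.matrix`. Note that feature names are not guaranteed to be unique, so with `key="name"` a later duplicate overwrites an earlier one in this sketch.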
