
10x Genomics
Visium Spatial Gene Expression

HDF5 Feature-Barcode Matrix Format


In addition to the MEX format, 10x Genomics also provides matrices in the Hierarchical Data Format (HDF5 or H5). H5 is a binary format that supports compression and allows more efficient data access than text formats such as MEX, which is useful when dealing with large datasets. H5 files can be read in both Python and R.

For more information on the format, see the Introduction to HDF5.

Note: Space Ranger generates an output file with per-molecule information in HDF5 format. General information about the HDF5 file format described here applies to the molecule_info.h5 file. Refer to the Molecule Info (H5) documentation for more specific details.

HDF5 file hierarchy

filtered_feature_bc_matrix.h5
└── matrix [HDF5 group]
    ├── barcodes
    ├── data
    ├── features [HDF5 group]
    │   ├── _all_tag_keys
    │   ├── feature_type
    │   ├── genome
    │   ├── id
    │   ├── name
    │   └── target_sets [For Targeted GEX, HDF5 group]
    │       └── [target set name]
    ├── indices
    ├── indptr
    └── shape

The hierarchy is the same for the raw_feature_bc_matrix.h5 file.

The contents of the .h5 file can be examined using HDFView software or the h5dump command.

h5dump -n filtered_feature_bc_matrix.h5
 
HDF5 "filtered_feature_bc_matrix.h5" {
FILE_CONTENTS {
 group      /
 group      /matrix
 dataset    /matrix/barcodes
 dataset    /matrix/data
 group      /matrix/features
 dataset    /matrix/features/_all_tag_keys
 dataset    /matrix/features/feature_type
 dataset    /matrix/features/genome
 dataset    /matrix/features/id
 dataset    /matrix/features/name
 group      /matrix/features/target_sets
 dataset    /matrix/features/target_sets/[target set name]
 dataset    /matrix/indices
 dataset    /matrix/indptr
 dataset    /matrix/shape
 }
}
h5dump -d matrix/barcodes filtered_feature_bc_matrix.h5 | head -n 15
 
HDF5 "filtered_feature_bc_matrix.h5" {
DATASET "matrix/barcodes" {
   DATATYPE  H5T_STRING {
      STRSIZE 18;
      STRPAD H5T_STR_NULLPAD;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   }
   DATASPACE  SIMPLE { ( 3468 ) / ( 3468 ) }
   DATA {
   (0): "AAACAAGTATCTCCCA-1", "AAACAATCTACTAGCA-1", "AAACACCAATAACTGC-1",
   (3): "AAACAGAGCGACTCCT-1", "AAACAGTGTTCCTGGG-1", "AAACATTTCCCGGATT-1",
   (6): "AAACCGGGTAGGTACC-1", "AAACCGTTCGTCCAGG-1", "AAACCTAAGCAGCCGG-1",
   (9): "AAACCTCATGAAGTTG-1", "AAACGAAGAACATACC-1", "AAACGAGACGGTTGAT-1",
   (12): "AAACGCCCGAGATCGG-1", "AAACGGGCGTACGGGT-1", "AAACGGTTGCGAACTG-1",

Data format

The top level of the file contains a single HDF5 group, called matrix, and metadata stored as HDF5 attributes. Within the matrix group are datasets containing the dimensions of the matrix, the matrix entries, as well as the features and spot-barcodes associated with the matrix rows and columns, respectively.

| Column | Type | Description |
| --- | --- | --- |
| barcodes | string | Barcode sequences and their corresponding library identifiers (for example, AAACGGGCAGCTCGAC-1). The library identifier is always -1 for spaceranger count runs of individual capture areas, and a small integer that identifies distinct capture areas in the output of spaceranger aggr. |
| data | uint32 | Nonzero UMI counts in column-major order |
| indices | uint32 | Zero-based row index of the corresponding element in data |
| indptr | uint32 | Zero-based index into data / indices of the start of each column, that is, the data corresponding to each barcode sequence |
| shape | uint64 | Tuple of (# rows, # columns) indicating the matrix dimensions |
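
As a small illustration of the barcode layout described above, the following sketch separates a barcode into its sequence and its library identifier. The helper name split_barcode is hypothetical (it is not part of Space Ranger), and the sketch assumes the identifier is the integer after the last hyphen, as in the example barcode.

```python
def split_barcode(barcode):
    """Split '<sequence>-<library id>' into (sequence, library_id).

    Accepts str or bytes, since HDF5 readers often return byte strings.
    Hypothetical helper for illustration only.
    """
    if isinstance(barcode, bytes):
        barcode = barcode.decode("ascii")
    # Split on the last hyphen; the suffix is the library identifier.
    sequence, sep, suffix = barcode.rpartition("-")
    if not sep:
        raise ValueError(f"barcode has no library identifier: {barcode!r}")
    return sequence, int(suffix)


print(split_barcode("AAACGGGCAGCTCGAC-1"))   # ('AAACGGGCAGCTCGAC', 1)
print(split_barcode(b"AAACAAGTATCTCCCA-2"))  # ('AAACAAGTATCTCCCA', 2)
```

For spaceranger aggr output, grouping barcodes by the second element of the returned tuple would separate the spots by capture area.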

The matrix entries are stored in Compressed Sparse Column (CSC) format. For more details on the format, see this SciPy introduction. CSC represents the matrix in column-major order, so that each barcode is represented by a contiguous chunk of data values.
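
To make the CSC layout concrete, here is a minimal pure-Python sketch on a toy matrix of 3 features by 4 barcodes. The values are invented for illustration and the function names are hypothetical; real files would normally be handled with SciPy rather than Python lists.

```python
def column_entries(data, indices, indptr, col):
    """Return (row, count) pairs for the nonzero entries of one column (barcode)."""
    start, end = indptr[col], indptr[col + 1]
    return list(zip(indices[start:end], data[start:end]))


def to_dense(data, indices, indptr, shape):
    """Expand CSC arrays into a dense list-of-rows matrix (toy sizes only)."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        for row, count in column_entries(data, indices, indptr, col):
            dense[row][col] = count
    return dense


# Toy matrix (rows = features, columns = barcodes):
#   [[1, 0, 0, 2],
#    [0, 0, 3, 0],
#    [4, 0, 5, 6]]
data = [1, 4, 3, 5, 2, 6]     # nonzero counts, column by column
indices = [0, 2, 1, 2, 0, 2]  # row index of each count
indptr = [0, 2, 2, 4, 6]      # column j occupies data[indptr[j]:indptr[j + 1]]
shape = (3, 4)

print(column_entries(data, indices, indptr, 2))  # [(1, 3), (2, 5)]
print(to_dense(data, indices, indptr, shape))

# Total UMI count per barcode is a sum over each contiguous column slice.
umi_per_barcode = [sum(data[indptr[j]:indptr[j + 1]]) for j in range(shape[1])]
print(umi_per_barcode)  # [5, 0, 8, 8]
```

Note that the second barcode has no counts, which appears as two equal consecutive values in indptr. Because each barcode's values are contiguous, per-barcode summaries need only a slice of data, without touching the rest of the matrix.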

The feature reference is stored as an HDF5 group called features, within the matrix group. Note that for Targeted Gene Expression samples, the features dataset in the filtered matrix H5 file will not contain non-targeted genes, and the feature indices in target_sets are updated accordingly.

Loading HDF5 into R

Multiple packages can import an HDF5 file into R, as shown in the example code below (edit the path to the H5 file to match your own output directory).

# set path to the h5 file
h5_path = "/opt/sample345/outs/filtered_feature_bc_matrix.h5"
 
# Method 1
# load the Bioconductor rhdf5 package
library(rhdf5)
 
# read in the file and examine its contents
filtered_hs <- H5Fopen(h5_path)
h5ls(filtered_hs)
 
# Method 2
# load package
library(Seurat)
 
# read in the file and examine its contents
filtered_hs <- Read10X_h5(h5_path)
head(filtered_hs, 10)

Loading HDF5 into Python

There are two ways to load the H5 matrix into Python:

Method 1: Using cellranger.matrix module

This method requires adding spaceranger/lib/python to your $PYTHONPATH. For example, if you installed Space Ranger into /opt/spaceranger-1.3.1, then you can call the following script to set your PYTHONPATH:

$ source /opt/spaceranger-1.3.1/sourceme.bash

Then in Python, the matrix can be loaded using the cellranger.matrix module as follows (edit the path to the H5 file to match your own output directory):

import cellranger.matrix as cr_matrix
filtered_h5 = "/opt/sample345/outs/filtered_feature_bc_matrix.h5"
filtered_matrix_h5 = cr_matrix.CountMatrix.load_h5_file(filtered_h5)

Method 2: Using PyTables

This method is more involved, and requires the SciPy and PyTables libraries. Edit the path to the H5 file to match your own output directory.

import collections
import scipy.sparse as sp_sparse
import tables
 
CountMatrix = collections.namedtuple('CountMatrix', ['feature_ref', 'barcodes', 'matrix'])
 
def get_matrix_from_h5(filename):
    with tables.open_file(filename, 'r') as f:
        mat_group = f.get_node(f.root, 'matrix')
        barcodes = f.get_node(mat_group, 'barcodes').read()
        data = getattr(mat_group, 'data').read()
        indices = getattr(mat_group, 'indices').read()
        indptr = getattr(mat_group, 'indptr').read()
        shape = getattr(mat_group, 'shape').read()
        matrix = sp_sparse.csc_matrix((data, indices, indptr), shape=shape)
         
        feature_ref = {}
        feature_group = f.get_node(mat_group, 'features')
        feature_ids = getattr(feature_group, 'id').read()
        feature_names = getattr(feature_group, 'name').read()
        feature_types = getattr(feature_group, 'feature_type').read()
        feature_ref['id'] = feature_ids
        feature_ref['name'] = feature_names
        feature_ref['feature_type'] = feature_types
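        tag_keys = getattr(feature_group, '_all_tag_keys').read()
        for key in tag_keys:
            # PyTables returns byte strings; decode before using the key as a node name
            key = key.decode('utf-8')
            feature_ref[key] = getattr(feature_group, key).read()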
         
        return CountMatrix(feature_ref, barcodes, matrix)
 
filtered_h5 = "/opt/sample345/outs/filtered_feature_bc_matrix.h5"
filtered_matrix_h5 = get_matrix_from_h5(filtered_h5)
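
The arrays in feature_ref returned by the function above are typically byte strings under Python 3. The sketch below shows one way to decode them and build a lookup from gene name to matrix row. The helper names and the toy feature_ref are hypothetical, and the sketch assumes that gene expression features carry the feature_type value "Gene Expression".

```python
def decode_values(values):
    """Decode byte strings to str; leave other values unchanged."""
    return [v.decode("utf-8") if isinstance(v, bytes) else v for v in values]


def gene_expression_rows(feature_ref):
    """Map feature name -> row index for features of type 'Gene Expression'.

    If a name occurs more than once, the last occurrence wins.
    """
    names = decode_values(feature_ref["name"])
    types = decode_values(feature_ref["feature_type"])
    return {
        name: row
        for row, (name, ftype) in enumerate(zip(names, types))
        if ftype == "Gene Expression"
    }


# Toy stand-in for the feature_ref dictionary (invented values).
toy_feature_ref = {
    "id": [b"ID_A", b"ID_B", b"ID_C"],
    "name": [b"GENE_A", b"GENE_B", b"OTHER_C"],
    "feature_type": [b"Gene Expression", b"Gene Expression", b"Other Type"],
}

rows = gene_expression_rows(toy_feature_ref)
print(rows)  # {'GENE_A': 0, 'GENE_B': 1}
```

With a real file, rows["<gene name>"] would give the row of the sparse matrix holding that gene's counts across all barcodes.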