Cell Ranger 2.0, printed on 11/05/2024
The cellranger pipeline outputs an HDF5 file containing per-molecule information for all molecules that contain a valid cell-barcode and a valid UMI. The R kit requires this file to produce read-subsampled gene-barcode matrices. The file contains data on the observed molecules as well as data on the reference transcriptome that was used.
The following datasets in the molecule info file correspond to columns of a table. Each row of that table corresponds to a unique (cell-barcode, UMI, gene) tuple. There is an additional row per (cell-barcode, UMI) tuple that aggregates information about reads that could not be confidently mapped to a gene.
Column | Type | Description |
---|---|---|
barcode | uint64 | 2-bit encoded processed cell-barcode sequence. |
barcode_corrected_reads | uint32 | Number of reads within this putative molecule that had their cell-barcode corrected. |
conf_mapped_uniq_read_pos | uint32 | Number of unique read mapping positions associated with this putative molecule. |
gem_group | uint8 | Integer label that distinguishes data coming from distinct 10x GEM reactions (such as different channels or chips). |
gene | uint32 | A zero-based index into the gene_ids field (see next section), indicating the gene to which this putative molecule was mapped. When set to the maximum gene index + 1, this row describes reads that did not map confidently to any gene. |
genome | uint32 | A zero-based index into the genome_ids field (see next section), indicating the genome to which this putative molecule was mapped. When set to the maximum genome index + 1, this row describes reads that did not map confidently to any genome. |
nonconf_mapped_reads | uint32 | The number of reads with this cell-barcode and UMI that mapped to the genome but did not map confidently to any gene. |
reads | uint32 | Number of reads that confidently mapped to this putative molecule. |
umi | uint32 | 2-bit encoded processed UMI sequence. |
umi_corrected_reads | uint32 | Number of reads within this putative molecule that had their UMI corrected. |
unmapped_reads | uint32 | The number of reads with this cell-barcode and UMI that did not map to the genome. |
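As an illustration of how the column datasets line up row by row, the sketch below counts putative molecules per (cell-barcode, GEM group, gene) and skips the aggregate rows whose gene value equals the maximum gene index + 1. It is a minimal example rather than part of Cell Ranger: the helper names `load_columns` and `molecules_per_cell_gene` are hypothetical, the loader assumes the third-party `h5py` package and assumes the column datasets sit at the root of the file, and `n_genes` is taken to be the length of the gene_ids array.

```python
from collections import Counter


def load_columns(path, names=("barcode", "gem_group", "gene", "umi", "reads")):
    """Read the named column datasets into memory.

    Assumption: the datasets are stored at the root of the HDF5 file.
    """
    import h5py  # third-party; imported lazily so the helper below works without it

    with h5py.File(path, "r") as f:
        return {name: f[name][:] for name in names}


def molecules_per_cell_gene(barcodes, gem_groups, genes, n_genes):
    """Count table rows per (barcode, gem_group, gene).

    Each row is a unique (cell-barcode, UMI, gene) tuple, so the number of rows
    sharing a barcode and gene is the number of distinct UMIs observed for them.
    Rows with gene == n_genes (maximum gene index + 1) aggregate reads that were
    not confidently mapped to any gene and are skipped.
    """
    counts = Counter()
    for bc, gg, g in zip(barcodes, gem_groups, genes):
        if g == n_genes:
            continue
        counts[(int(bc), int(gg), int(g))] += 1
    return counts


# Toy input with made-up values: a reference with 2 genes, so gene == 2 marks an aggregate row.
counts = molecules_per_cell_gene(
    barcodes=[7, 7, 7, 9],
    gem_groups=[1, 1, 1, 1],
    genes=[0, 0, 2, 1],
    n_genes=2,
)
print(dict(counts))  # {(7, 1, 0): 2, (9, 1, 1): 1}
```

With a real file, the same function could be fed from `load_columns`, for example `cols = load_columns(path)` followed by `molecules_per_cell_gene(cols["barcode"], cols["gem_group"], cols["gene"], n_genes)`.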
In addition, the molecule info file contains a few datasets describing the reference transcriptome(s) associated with this analysis.
Dataset | Type | Description |
---|---|---|
gene_ids | string | The Ensembl gene IDs contained in this reference. The gene column defined in the previous section is an index into this array. |
gene_names | string | The common gene symbol associated with each of the above gene_ids. |
genome_ids | string | The list of genomes represented in this reference. In most cases, this will be a single genome. The genome column defined in the previous section is an index into this array. |
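The indexing relationship between the gene column and the gene_ids and gene_names arrays can be sketched as follows. This is a hedged example: `describe_gene` and `_as_text` are hypothetical helpers, the IDs and symbols in the usage lines are placeholders rather than real reference entries, and the byte-string handling assumes the string arrays may come back from an HDF5 reader as bytes.

```python
def _as_text(value):
    """Return a str whether the stored value is bytes or already text."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def describe_gene(gene_index, gene_ids, gene_names):
    """Map a value from the gene column to (gene ID, gene symbol).

    Returns None for the sentinel value len(gene_ids), i.e. the maximum gene
    index + 1, which marks reads not confidently mapped to any gene.
    """
    n_genes = len(gene_ids)
    if gene_index == n_genes:
        return None
    if not 0 <= gene_index < n_genes:
        raise IndexError(f"gene index {gene_index} is outside the reference")
    return _as_text(gene_ids[gene_index]), _as_text(gene_names[gene_index])


# Placeholder reference arrays, for illustration only.
example_ids = [b"GENE_ID_A", b"GENE_ID_B"]
example_names = [b"SYMBOL_A", b"SYMBOL_B"]

print(describe_gene(1, example_ids, example_names))  # ('GENE_ID_B', 'SYMBOL_B')
print(describe_gene(2, example_ids, example_names))  # None
```

The same pattern applies to the genome column and genome_ids, where a value equal to the number of genomes marks reads that did not map confidently to any genome.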
The cell-barcode and UMI sequences are 2-bit encoded as follows: