10x Genomics
Chromium Single Cell Gene Expression

Cell Ranger 3.1, printed on 03/03/2025

Creating a Reference Package with cellranger mkref

Cell Ranger provides pre-built human (hg19, GRCh38), mouse (mm10), and ercc92 reference packages for read alignment and gene expression quantification in cellranger count.
To create and use a custom reference package, Cell Ranger requires a reference genome sequence (FASTA file) and gene annotations (GTF file).

A tutorial 'Build a Custom Reference With cellranger mkref' is available to walk you through the steps.

Compatible Use Cases

Cell Ranger supports the use of customer-generated references under the following conditions:

Your reference should have only a small number of overlapping gene annotations.
- Reads aligning non-uniquely to multiple genes cause the pipeline to detect fewer molecules.
Your FASTA and GTF files must be compatible with the open source splicing-aware RNA-seq aligner, STAR.
- To be considered for transcriptome alignment, genes must have annotations with feature type 'exon' (column 3) in the GTF file.
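As a rough pre-check of the first condition, the sketch below counts pairs of genes whose exon spans overlap. It is not part of Cell Ranger: the function names are hypothetical, comparing only genes on the same chromosome and strand is a choice made for this sketch, and a span (first exon start to last exon end) includes introns, so it overestimates true exon overlap. The attribute parser is simplified and assumes no semicolons inside quoted values.

```python
from collections import defaultdict


def parse_attributes(field):
    """Parse a GTF attribute column (key "value"; pairs) into a dict."""
    attrs = {}
    for part in field.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def gene_spans(gtf_lines):
    """Return {gene_id: (chrom, strand, start, end)} built from exon rows only."""
    spans = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue  # only 'exon' rows are considered, as in the text above
        gene_id = parse_attributes(cols[8]).get("gene_id")
        if gene_id is None:
            continue
        start, end = int(cols[3]), int(cols[4])
        if gene_id in spans:
            chrom, strand, s, e = spans[gene_id]
            spans[gene_id] = (chrom, strand, min(s, start), max(e, end))
        else:
            spans[gene_id] = (cols[0], cols[6], start, end)
    return spans


def overlapping_gene_pairs(spans):
    """List gene pairs whose spans overlap on the same chromosome and strand."""
    by_locus = defaultdict(list)
    for gene_id, (chrom, strand, start, end) in spans.items():
        by_locus[(chrom, strand)].append((start, end, gene_id))
    pairs = []
    for genes in by_locus.values():
        genes.sort()
        for i, (_start_i, end_i, id_i) in enumerate(genes):
            for start_j, _end_j, id_j in genes[i + 1:]:
                if start_j > end_i:
                    break  # sorted by start, so no later gene overlaps gene i
                pairs.append((id_i, id_j))
    return pairs
```

Calling `overlapping_gene_pairs(gene_spans(open("input.gtf")))` on your own annotation gives a list whose length is a coarse indicator of how much overlap the reference contains.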

Making a Reference Package

To create a custom reference:

1. Filter the GTF file with mkgtf so that it contains only the genes of interest.
2. Index the FASTA and GTF files with mkref.

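Example use cases:

- Single Species
- Multiple Species
- Adding One or More Genes to Your Reference
- Generating a Cell Ranger compatible "pre-mRNA" Reference Package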

Filter with mkgtf

GTF files downloaded from sites like Ensembl and UCSC often contain transcripts and genes that need to be filtered out of your final annotation. Cell Ranger provides mkgtf, a simple utility that filters genes based on their key-value pairs in the GTF attribute column:

$ cellranger mkgtf input.gtf output.gtf --attribute=key:allowable_value

For example, the following filtering was applied to generate the GTF file for the GRCh38 Cell Ranger reference package:

$ cellranger mkgtf Homo_sapiens.GRCh38.ensembl.gtf Homo_sapiens.GRCh38.ensembl.filtered.gtf \
                   --attribute=gene_biotype:protein_coding \
                   --attribute=gene_biotype:lincRNA \
                   --attribute=gene_biotype:antisense \
                   --attribute=gene_biotype:IG_LV_gene \
                   --attribute=gene_biotype:IG_V_gene \
                   --attribute=gene_biotype:IG_V_pseudogene \
                   --attribute=gene_biotype:IG_D_gene \
                   --attribute=gene_biotype:IG_J_gene \
                   --attribute=gene_biotype:IG_J_pseudogene \
                   --attribute=gene_biotype:IG_C_gene \
                   --attribute=gene_biotype:IG_C_pseudogene \
                   --attribute=gene_biotype:TR_V_gene \
                   --attribute=gene_biotype:TR_V_pseudogene \
                   --attribute=gene_biotype:TR_D_gene \
                   --attribute=gene_biotype:TR_J_gene \
                   --attribute=gene_biotype:TR_J_pseudogene \
                   --attribute=gene_biotype:TR_C_gene

This generated a filtered GTF file Homo_sapiens.GRCh38.ensembl.filtered.gtf from the original unfiltered GTF file Homo_sapiens.GRCh38.ensembl.gtf. In the output file, other biotypes such as gene_biotype:pseudogene are excluded from the GTF annotation.
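To make the effect of attribute filtering concrete, here is a minimal Python sketch of the idea. It is an illustration, not mkgtf's implementation: the function name `filter_gtf` is hypothetical, and the sketch assumes that records lacking the filtered key are dropped, which may differ from what mkgtf does.

```python
import re

# Matches attribute pairs of the form: key "value"
ATTR_PAIR = re.compile(r'(\S+)\s+"([^"]*)"')


def filter_gtf(lines, allowed):
    """Yield comment lines plus records matching an allowed key:value pair.

    `allowed` maps an attribute key to the set of values to keep,
    e.g. {"gene_biotype": {"protein_coding", "lincRNA"}}.
    """
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header comments unchanged
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip malformed records
        attrs = dict(ATTR_PAIR.findall(cols[8]))
        # Assumption of this sketch: a record without the key is dropped.
        if any(attrs.get(key) in values for key, values in allowed.items()):
            yield line
```

With `allowed = {"gene_biotype": {"protein_coding", "lincRNA"}}`, a record tagged `gene_biotype "pseudogene"` is not yielded, which mirrors the exclusion described above.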

Index with cellranger mkref

To create custom references, use the cellranger mkref command, passing it one or more matching sets of FASTA and GTF files. This utility copies your FASTA and GTF, indexes these in several formats, and outputs a folder with the name you pass to --genome. Input GTF files are typically filtered with mkgtf prior to mkref.

Argument	Description
`--genome`	Unique genome name(s), used to name the output folder. Should contain only alphanumeric characters and optionally period, hyphen, and underscore characters [a-zA-Z0-9_-]+. Specify multiple genomes by repeating the --genome argument.
`--fasta`	Path(s) to FASTA file(s) containing your genome reference. Specify multiple genomes by repeating the --fasta argument.
`--genes`	Path(s) to genes GTF file(s) containing annotated genes for your genome reference. Specify multiple genomes by repeating the --genes argument.
`--nthreads`	(Optional) Number of threads used during STAR genome index generation. Defaults to 1.
`--memgb`	(Optional) Maximum memory (GB) used during STAR genome index generation. Defaults to 16. Please note, the amount of memory specified must be greater than the number of gigabases in the input reference FASTA file.
`--ref-version`	(Optional) Reference version string to include with reference.

Basic usage

$ cellranger mkref --genome=output_genome --fasta=input.fa --genes=input.gtf

Outputs

A successful mkref run should conclude with a message similar to this:

Creating new reference folder at output_genome
...done

Writing genome FASTA file into reference folder...
...done

Computing hash of genome FASTA file...
...done

Writing genes GTF file into reference folder...
WARNING: The following transcripts appear on multiple chromosomes in the GTF:


This can indicate a problem with the reference or annotations. Only the first chromosome will be counted.
...done

Computing hash of genes GTF file...
...done

Writing genes index file into reference folder (may take over 10 minutes for a 3Gb genome)...
...done

Writing genome metadata JSON file into reference folder...
...done

Generating STAR genome index (may take over 8 core hours for a 3Gb genome)...
...done.

>>> Reference successfully created! <<<

Output listing:

output_genome/
├── fasta
│   └── genome.fa
├── genes
│   └── genes.gtf
├── pickle
│   └── genes.pickle
├── reference.json
└── star # STAR genome index folder

System Requirements

Indexing a typical human 3Gb FASTA file often takes up to 8 core hours and requires 32 GB of memory. We recommend you run the mkref command with --nthreads equal to the number of cores available on your system.
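Before launching a long indexing job, you may want to sanity-check the settings. The sketch below is a hypothetical helper, not a Cell Ranger feature: it counts the bases in a FASTA file, checks the rule stated for --memgb (the value must exceed the number of gigabases in the FASTA), and suggests a thread count. Note that `os.cpu_count()` reports the cores visible to Python, which may be more than your job is allowed to use on a shared cluster.

```python
import os


def count_gigabases(fasta_lines):
    """Count sequence characters in FASTA lines, in gigabases (1e9 bases)."""
    bases = 0
    for line in fasta_lines:
        if not line.startswith(">"):  # header lines carry no sequence
            bases += len(line.strip())
    return bases / 1e9


def suggest_mkref_resources(fasta_lines, memgb):
    """Return (nthreads, memory_ok) for a planned mkref run.

    nthreads is the number of cores visible to Python; memory_ok is True
    when memgb is greater than the genome size in gigabases.
    """
    nthreads = os.cpu_count() or 1
    memory_ok = memgb > count_gigabases(fasta_lines)
    return nthreads, memory_ok
```

For example, `suggest_mkref_resources(open("input.fa"), memgb=32)` streams the file line by line, so it does not need to hold the genome in memory.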

Single Species

The most common use case is to create a reference for only one species. In this case, there is one set of matched FASTA and GTF files typically obtained from Ensembl, NCBI, or UCSC.

$ cellranger mkref --genome=hg19 --fasta=hg19.fa --genes=hg19-filtered-ensembl.gtf

When possible, obtain the genome sequence (FASTA) and gene annotations (GTF) from the same source; for example, use Ensembl FASTA files with Ensembl GTF files. Chromosome or sequence names in the FASTA file must match the chromosome or sequence names in the GTF file.

As noted in the STAR manual, the most comprehensive genome sequence and annotations are recommended:

For the genome sequence, include all major chromosomes, unplaced and unlocalized scaffolds, but do not include patches and alternative haplotypes.
- In Ensembl, the recommended genome file to download is annotated as "primary assembly."
- In NCBI, it is "no alternative - analysis set."
For the GTF file, genes must be annotated with feature type 'exon' (column 3).
- Prior to mkref, GTF annotation files from Ensembl and NCBI are typically filtered with mkgtf to include only a subset of the annotated gene biotypes.
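The requirement that sequence names match between the FASTA and GTF files is easy to violate when mixing sources (for instance `chr1` versus `1`). A minimal sketch of a check follows; the function names are hypothetical, and it assumes the sequence name is the first whitespace-delimited token of each FASTA header line.

```python
def fasta_names(fasta_lines):
    """Collect sequence names: the first token after '>' on header lines."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.add(tokens[0])
    return names


def gtf_names(gtf_lines):
    """Collect the distinct values of GTF column 1, skipping comments."""
    names = set()
    for line in gtf_lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t", 1)[0])
    return names


def names_missing_from_fasta(fasta_lines, gtf_lines):
    """Return GTF sequence names that have no matching FASTA record."""
    return sorted(gtf_names(gtf_lines) - fasta_names(fasta_lines))
```

An empty result means every annotated sequence has a FASTA record. Any names returned point to records that should be renamed or removed before running mkref.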

Multiple Species

To create a reference for multiple species, run the mkref command with multiple FASTA and GTF files. This is similar to the single species case above, but note that the order of the arguments matters. The arguments are grouped by the order they appear; for instance, the first --genome option listed corresponds to the first --fasta and --genes options listed. Please use or create this type of reference when analyzing barnyard validation experiments for estimating multiplet rates.

$ cellranger mkref --genome=hg19 --fasta=hg19.fa --genes=hg19-filtered-ensembl.gtf \
                   --genome=mm10 --fasta=mm10.fa --genes=mm10-filtered-ensembl.gtf
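Because the arguments are matched by position, one way to avoid mispairing them is to generate the command from explicit (genome, FASTA, GTF) triples. The following is a sketch of that idea; `build_mkref_command` is a hypothetical helper, and it uses only the options documented in the argument table above.

```python
import shlex


def build_mkref_command(references, nthreads=None):
    """Build a cellranger mkref argument list from (genome, fasta, gtf) triples.

    Each triple is emitted together, so the n-th --genome always lines up
    with the n-th --fasta and the n-th --genes.
    """
    if not references:
        raise ValueError("at least one (genome, fasta, gtf) triple is required")
    command = ["cellranger", "mkref"]
    for genome, fasta, gtf in references:
        command += [f"--genome={genome}", f"--fasta={fasta}", f"--genes={gtf}"]
    if nthreads is not None:
        command.append(f"--nthreads={nthreads}")
    return command


# Example: print the two-species command shown above for review.
print(shlex.join(build_mkref_command([
    ("hg19", "hg19.fa", "hg19-filtered-ensembl.gtf"),
    ("mm10", "mm10.fa", "mm10-filtered-ensembl.gtf"),
])))
```

The returned list can be passed directly to `subprocess.run` once you have reviewed the printed command.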

Adding One or More Genes to Your Reference

Provided that you follow the format described above, it is fairly simple to add custom gene definitions to an existing reference. First, add the additional FASTA sequence records to the fasta/genome.fa file. Next, update the GTF file, genes/genes.gtf, with the gene annotation record(s).

The GTF file format is essentially a list of records, one per line, each comprising nine tab-delimited non-empty fields.

Column	Name	Description
1	Chromosome	Must refer to a chromosome/contig in the genome fasta.
2	Source	Unused.
3	Feature	`cellranger count` only uses rows where this column is `exon`.
4	Start	Start position on the reference (1-based inclusive).
5	End	End position on the reference (1-based inclusive).
6	Score	Unused.
7	Strand	Strandedness of this feature on the reference: `+` or `-`.
8	Frame	Unused.
9	Attributes	A semicolon-delimited list of key-value pairs of the form `key "value"`. The attribute keys `transcript_id` and `gene_id` are required; `gene_name` is optional and may be non-unique, but if present will be preferentially displayed in reports.

After adding the necessary records to your FASTA file and the additional lines to your GTF file, run cellranger mkref as normal.
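As an illustration of the record formats described above, the sketch below formats one FASTA record and one matching GTF exon line for a custom gene added as its own contig. All names here are hypothetical (the helper functions, the `custom` source label, and the `GFP` example), and the unused Score and Frame columns are filled with `.` so that every field is non-empty.

```python
def format_fasta_record(name, sequence, width=60):
    """Return a FASTA record with the sequence wrapped at `width` characters."""
    lines = [f">{name}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


def format_exon_record(chrom, start, end, strand, gene_id, transcript_id,
                       gene_name=None, source="custom"):
    """Return one nine-column, tab-delimited GTF exon line.

    Coordinates are 1-based and inclusive, matching columns 4 and 5.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if not 1 <= start <= end:
        raise ValueError("coordinates must satisfy 1 <= start <= end")
    attributes = f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'
    if gene_name is not None:
        attributes += f' gene_name "{gene_name}";'
    columns = [chrom, source, "exon", str(start), str(end),
               ".", strand, ".", attributes]
    return "\t".join(columns) + "\n"
```

For a transgene whose sequence is held in a string `seq`, `format_fasta_record("GFP", seq)` and `format_exon_record("GFP", 1, len(seq), "+", "GFP", "GFP", gene_name="GFP")` produce text that can be appended to copies of genome.fa and genes.gtf respectively.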

Generating a Cell Ranger compatible "pre-mRNA" Reference Package

The single-nuclei RNA-seq assay captures unspliced pre-mRNA as well as mature mRNA. However, after alignment, cellranger count only counts reads aligned to exons. Since the pre-mRNA will generate intronic reads, it may be useful to create a custom “pre-mRNA” reference package, listing each gene transcript locus as an exon. Thus, these intronic reads will be included in the UMI counts for each gene and barcode.

A custom pre-mRNA reference package can be created from an existing Cell Ranger reference package in two steps. The example below starts from the pre-built GRCh38 reference package:

1. Create a "pre-mRNA" GTF

Extract the GTF annotation rows whose feature type (column 3) is transcript from the original tab-delimited GTF, and change their feature type from transcript to exon. The following command does this with the Linux utility awk:

$ awk 'BEGIN{FS="\t"; OFS="\t"} $3 == "transcript"{ $3="exon"; print}' \
       refdata-cellranger-GRCh38-1.2.0/genes/genes.gtf > GRCh38-1.2.0.premrna.gtf
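If awk is not available, the same transformation can be sketched in Python. The function name `transcripts_to_exons` is hypothetical; like the awk command, it keeps only rows whose third column is `transcript`, relabels them as `exon`, and drops everything else, including comment lines.

```python
def transcripts_to_exons(gtf_lines):
    """Yield transcript rows relabeled as exon rows; drop all other rows."""
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3 and cols[2] == "transcript":
            cols[2] = "exon"
            yield "\t".join(cols) + "\n"
```

To write the result to a file, open the input and output GTF files and pass the generator to `writelines`, for example `out.writelines(transcripts_to_exons(src))`.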

2. Run `cellranger mkref`

Use the unmodified genome.fa file and the new GTF file as inputs to cellranger mkref.

$ cellranger mkref --genome=GRCh38-1.2.0_premrna \
                   --fasta=refdata-cellranger-GRCh38-1.2.0/fasta/genome.fa \
                   --genes=GRCh38-1.2.0.premrna.gtf
