Cell Ranger 6.0, printed on 11/21/2024
Cell Ranger provides pre-built human (hg19, GRCh38), mouse (mm10), and ercc92 reference packages for read alignment and gene expression quantification in cellranger count.
To create and use a custom reference package, Cell Ranger requires a reference genome sequence (FASTA file) and gene annotations (GTF file). A tutorial 'Build a Custom Reference With cellranger mkref' is available to walk you through the steps.
Cell Ranger supports the use of customer-generated references under the following conditions:
To create a custom reference:
Example use cases:
GTF files downloaded from sources like Ensembl or the UCSC Genome Browser often contain transcripts and genes that need to be filtered out of your final annotation. Cell Ranger provides mkgtf, a simple utility that filters genes based on the key-value pairs in the GTF attribute column:
$ cellranger mkgtf input.gtf output.gtf --attribute=key:allowable_value
The attributes contained in the GTF file will vary based on the source of the GTF and sometimes the release version. As an example, the following filtering could be used to filter a human or mouse GTF from Ensembl (release 97 or later):
$ cellranger mkgtf Homo_sapiens.GRCh38.ensembl.gtf Homo_sapiens.GRCh38.ensembl.filtered.gtf \
    --attribute=gene_biotype:protein_coding \
    --attribute=gene_biotype:lncRNA \
    --attribute=gene_biotype:IG_LV_gene \
    --attribute=gene_biotype:IG_V_gene \
    --attribute=gene_biotype:IG_V_pseudogene \
    --attribute=gene_biotype:IG_D_gene \
    --attribute=gene_biotype:IG_J_gene \
    --attribute=gene_biotype:IG_J_pseudogene \
    --attribute=gene_biotype:IG_C_gene \
    --attribute=gene_biotype:IG_C_pseudogene \
    --attribute=gene_biotype:TR_V_gene \
    --attribute=gene_biotype:TR_V_pseudogene \
    --attribute=gene_biotype:TR_D_gene \
    --attribute=gene_biotype:TR_J_gene \
    --attribute=gene_biotype:TR_J_pseudogene \
    --attribute=gene_biotype:TR_C_gene
Note: prior to Ensembl release 97, the lncRNA biotype was split into several other biotypes. See the Ensembl and GENCODE documentation for more details on the biotypes used in their annotations.
This command generates a filtered GTF file Homo_sapiens.GRCh38.ensembl.filtered.gtf from the original unfiltered GTF file Homo_sapiens.GRCh38.ensembl.gtf. In the output file, other biotypes such as gene_biotype:pseudogene are excluded from the GTF annotation.
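Before choosing --attribute filters, it can help to see which biotype values a given GTF actually contains, since they vary by source and release. The following is a minimal sketch using standard awk and sort. The function name list_biotypes is hypothetical and not part of Cell Ranger, and the default key gene_biotype assumes Ensembl-style attributes (GENCODE files use gene_type, which can be passed as the second argument).

```shell
# list_biotypes GTF [KEY]
# Hypothetical helper (not a Cell Ranger command): counts how many "gene"
# records carry each value of KEY (default: gene_biotype) in column 9.
list_biotypes() {
    lb_gtf="$1"
    lb_key="${2:-gene_biotype}"
    awk -F '\t' -v key="$lb_key" '
        /^#/ { next }                          # skip header comments
        $3 == "gene" {
            # Column 9 holds attributes separated by semicolons
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                a = attrs[i]
                sub(/^ +/, "", a)              # trim leading spaces
                if (index(a, key " \"") == 1) {
                    v = substr(a, length(key) + 3)   # text after: key, space, quote
                    sub(/".*$/, "", v)               # drop the closing quote
                    counts[v]++
                }
            }
        }
        END { for (b in counts) print counts[b] "\t" b }
    ' "$lb_gtf" | sort -k1,1nr -k2,2
}
```

Calling list_biotypes Homo_sapiens.GRCh38.ensembl.gtf prints one line per biotype, most frequent first, which can then be compared against the --attribute list passed to mkgtf.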
To create custom references, use the cellranger mkref command, passing it one or more matching sets of FASTA and GTF files. This utility copies your FASTA and GTF, indexes these in several formats, and outputs a folder with the name you pass to --genome. Input GTF files are typically filtered with mkgtf prior to mkref.
Argument | Description |
---|---|
--genome | Unique genome name(s), used to name output folder. Should contain only alphanumeric characters and optionally period, hyphen, and underscore characters [a-zA-Z0-9_-]+. Specify multiple genomes by specifying the --genome argument multiple times. |
--fasta | Path(s) to FASTA file containing your genome reference. Specify multiple genomes by specifying the --fasta argument multiple times. |
--genes | Path(s) to genes GTF file(s) containing annotated genes for your genome reference. Specify multiple genomes by specifying the --genes argument multiple times. |
--nthreads | (Optional) Number of threads used during STAR genome index generation. Defaults to 1. |
--memgb | (Optional) Maximum memory (GB) used during STAR genome index generation. Defaults to 16. Please note, the amount of memory specified must be greater than the number of gigabases in the input reference FASTA file. |
--ref-version | (Optional) Reference version string to include with reference. |
$ cellranger mkref --genome=output_genome --fasta=input.fa --genes=input.gtf
A successful mkref run should conclude with a message similar to this:
Creating new reference folder at output_genome
...done
Writing genome FASTA file into reference folder...
...done
Computing hash of genome FASTA file...
...done
Writing genes GTF file into reference folder...
WARNING: The following transcripts appear on multiple chromosomes in the GTF:
This can indicate a problem with the reference or annotations. Only the first chromosome will be counted.
...done
Computing hash of genes GTF file...
...done
Writing genes index file into reference folder (may take over 10 minutes for a 3Gb genome)...
...done
Writing genome metadata JSON file into reference folder...
...done
Generating STAR genome index (may take over 8 core hours for a 3Gb genome)...
...done.
>>> Reference successfully created! <<<
Output listing:
output_genome/
├── fasta
│   └── genome.fa
├── genes
│   └── genes.gtf
├── reference.json
└── star    # STAR genome index folder
Indexing a typical human 3Gb FASTA file often takes up to 8 core hours and requires 32 GB of memory. We recommend you run the mkref command with --nthreads equal to the number of cores available on your system.
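The --memgb constraint (memory must exceed the number of gigabases in the FASTA) can be checked before starting a long run. The sketch below is one way to do that with awk; fasta_bases and min_memgb are hypothetical helper names, not Cell Ranger commands, and the result is only the lower bound implied by that constraint, not a recommended setting.

```shell
# fasta_bases FASTA
# Hypothetical helper: counts sequence characters, skipping ">" header lines.
fasta_bases() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) }
         END   { printf "%.0f\n", n }' "$1"
}

# min_memgb FASTA
# Smallest whole number of GB strictly greater than the genome size in
# gigabases. This is only a floor for --memgb; indexing a human-sized genome
# is described above as needing about 32 GB.
min_memgb() {
    mm_bases=$(fasta_bases "$1")
    echo $(( mm_bases / 1000000000 + 1 ))
}
```

For the thread count, getconf _NPROCESSORS_ONLN reports the number of online cores on most Linux and macOS systems, so an invocation might look like cellranger mkref --genome=output_genome --fasta=input.fa --genes=input.gtf --nthreads="$(getconf _NPROCESSORS_ONLN)" --memgb=32.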
The most common use case is to create a reference for only one species. In this case, there is one set of matched FASTA and GTF files typically obtained from Ensembl, NCBI, or UCSC.
$ cellranger mkref --genome=hg19 --fasta=hg19.fa --genes=hg19-filtered-ensembl.gtf
When possible, please obtain genome sequence (FASTA) and gene annotations (GTF) from the same source: Use Ensembl FASTA files with Ensembl GTF files. Chromosome or sequence names in the FASTA file must match the chromosome or sequence names in the GTF file.
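A mismatch in sequence names (for example, 1 in the GTF versus chr1 in the FASTA) is easy to detect ahead of time. The following sketch lists names that appear in the GTF but not in the FASTA, using awk, sort, and comm. The function names are hypothetical, and it assumes the sequence name is the first whitespace-delimited token of each FASTA header.

```shell
# fasta_contigs FASTA: sequence names, i.e. the text after ">" up to the
# first whitespace (assumed naming convention).
fasta_contigs() {
    awk '/^>/ { sub(/^>/, ""); print $1 }' "$1" | sort -u
}

# gtf_contigs GTF: distinct values of column 1, ignoring comments and blanks.
gtf_contigs() {
    awk -F '\t' '!/^#/ && NF > 0 { print $1 }' "$1" | sort -u
}

# missing_contigs FASTA GTF
# Hypothetical helper: prints names used in the GTF that the FASTA lacks.
# Empty output means every GTF sequence name has a matching FASTA record.
missing_contigs() {
    mc_fa=$(mktemp)
    mc_gtf=$(mktemp)
    fasta_contigs "$1" > "$mc_fa"
    gtf_contigs "$2" > "$mc_gtf"
    comm -13 "$mc_fa" "$mc_gtf"     # lines unique to the second (GTF) list
    rm -f "$mc_fa" "$mc_gtf"
}
```

Running missing_contigs input.fa input.gtf before mkref gives a quick indication of whether the two files come from compatible sources.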
As noted in the STAR manual, the most comprehensive genome sequence and annotations are recommended:
To create a reference for multiple species, run the mkref command with multiple FASTA and GTF files. This is similar to the single species case above, but note that the order of the arguments matters. The arguments are grouped by the order they appear; for instance, the first --genome option listed corresponds to the first --fasta and --genes options listed. Please use or create this type of reference when analyzing barnyard validation experiments for estimating multiplet rates.
$ cellranger mkref --genome=hg19 --fasta=hg19.fa --genes=hg19-filtered-ensembl.gtf \
    --genome=mm10 --fasta=mm10.fa --genes=mm10-filtered-ensembl.gtf
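Because the grouping depends purely on argument order, one way to avoid mismatched pairs is to keep each genome's name, FASTA, and GTF together and generate the command line from those triples. The sketch below only prints the command for review rather than running it. The function build_mkref_cmd and its name:fasta:gtf input format are assumptions made for illustration, and it does not handle paths containing colons or spaces.

```shell
# build_mkref_cmd NAME:FASTA:GTF [NAME:FASTA:GTF ...]
# Hypothetical helper: prints a cellranger mkref command in which each
# --genome is immediately followed by its own --fasta and --genes.
build_mkref_cmd() {
    bm_args=""
    for bm_spec in "$@"; do
        bm_name=${bm_spec%%:*}      # text before the first colon
        bm_rest=${bm_spec#*:}       # text after the first colon
        bm_fasta=${bm_rest%%:*}
        bm_gtf=${bm_rest#*:}
        bm_args="$bm_args --genome=$bm_name --fasta=$bm_fasta --genes=$bm_gtf"
    done
    echo "cellranger mkref${bm_args}"
}
```

For example, build_mkref_cmd hg19:hg19.fa:hg19-filtered-ensembl.gtf mm10:mm10.fa:mm10-filtered-ensembl.gtf prints a command equivalent to the one shown above, on a single line.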
Provided that you follow the format described above, it is fairly simple to add custom gene definitions to an existing reference. First, add the additional FASTA sequence records to the fasta/genome.fa file. Next, update the GTF file, genes/genes.gtf, with the gene annotation record(s).
The GTF file format is essentially a list of records, one per line, each comprising nine tab-delimited non-empty fields.
Column | Name | Description |
---|---|---|
1 | Chromosome | Must refer to a chromosome/contig in the genome fasta. |
2 | Source | Unused. |
3 | Feature | cellranger count only uses rows where this field is exon. |
4 | Start | Start position on the reference (1-based inclusive). |
5 | End | End position on the reference (1-based inclusive). |
6 | Score | Unused. |
7 | Strand | Strandedness of this feature on the reference: + or -. |
8 | Frame | Unused. |
9 | Attributes | A semicolon-delimited list of key-value pairs of the form key "value". The attribute keys transcript_id and gene_id are required; gene_name is optional and may be non-unique, but if present will be preferentially displayed in reports. |
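The column rules in the table can be turned into a quick sanity check for hand-edited GTF files. The following awk sketch is not how Cell Ranger validates its input. The function name check_gtf and the specific messages are hypothetical, and the rule that Start must not exceed End is an assumption based on the usual GTF convention rather than something stated in the table.

```shell
# check_gtf GTF
# Hypothetical helper: reports records that break the column rules above.
# Prints one message per problem and returns non-zero if any were found.
check_gtf() {
    awk -F '\t' '
        /^#/ || NF == 0 { next }            # skip comments and blank lines
        NF != 9 {
            printf "line %d: expected 9 tab-delimited fields, found %d\n", NR, NF
            bad++
            next
        }
        $3 != "exon" { next }               # only exon rows are checked further
        $4 + 0 < 1 || $4 + 0 > $5 + 0 {
            printf "line %d: start/end must be 1-based with start <= end\n", NR
            bad++
        }
        $7 != "+" && $7 != "-" {
            printf "line %d: strand must be + or -\n", NR
            bad++
        }
        $9 !~ /gene_id "[^"]+"/ {
            printf "line %d: missing gene_id attribute\n", NR
            bad++
        }
        $9 !~ /transcript_id "[^"]+"/ {
            printf "line %d: missing transcript_id attribute\n", NR
            bad++
        }
        END { exit (bad > 0) }
    ' "$1"
}
```

A typical use would be check_gtf genes/genes.gtf && echo "GTF looks consistent" after editing the file by hand.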
After adding the necessary records to your FASTA file and the additional lines to your GTF file, run cellranger mkref as normal.
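As an illustration of the two edits described above, the sketch below generates a single exon record covering an entire added sequence, such as a reporter transgene supplied as its own one-record FASTA file. The function name marker_gtf_record, the source label custom, the choice of the + strand, and the file names in the trailing comments are all assumptions for this example. Real constructs may need several exons or a different strand.

```shell
# marker_gtf_record FASTA GENE
# Hypothetical helper: prints one GTF exon line spanning the whole first
# sequence in FASTA, using GENE for gene_id, transcript_id, and gene_name.
marker_gtf_record() {
    mg_fasta="$1"
    mg_gene="$2"
    # Sequence name: text after ">" up to the first whitespace
    mg_contig=$(awk '/^>/ { sub(/^>/, ""); print $1; exit }' "$mg_fasta")
    # Sequence length: total characters on non-header lines
    mg_len=$(awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) }
                  END   { printf "%.0f\n", n }' "$mg_fasta")
    printf '%s\tcustom\texon\t1\t%s\t.\t+\t.\tgene_id "%s"; transcript_id "%s"; gene_name "%s";\n' \
        "$mg_contig" "$mg_len" "$mg_gene" "$mg_gene" "$mg_gene"
}

# Example usage with hypothetical file names (work on copies of the
# reference files rather than editing them in place):
#   cat GFP.fa >> genome_with_gfp.fa
#   marker_gtf_record GFP.fa GFP >> genes_with_gfp.gtf
```

The resulting FASTA and GTF copies can then be passed to cellranger mkref through --fasta and --genes.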
The single-nuclei RNA-seq assay captures unspliced pre-mRNA as well as mature mRNA. However, after alignment, cellranger count only counts reads aligned to exons. Since pre-mRNA generates intronic reads, it may be useful to count these reads as well. Previously, the recommendation was to create a custom “pre-mRNA” reference package that lists each gene transcript locus as an exon in order to count intronic reads. Cell Ranger 5.0 introduced the include-introns option for counting intronic reads; use it instead, as pre-mRNA references are deprecated.