Cell Ranger 2.0, printed on 11/05/2024
Cell Ranger provides pre-built hg19, mm10, and ercc92 reference packages for use with the pipeline. If you would like to use your own genome FASTA or gene GTF annotations, you can build a customer-generated reference instead.
Cell Ranger supports the use of customer-generated references for the following scenarios:
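There are two steps to construct a Cell Ranger-compatible reference: filter the GTF annotations with cellranger mkgtf, then index the FASTA and GTF files with cellranger mkref.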
GTF files downloaded from sites like ENSEMBL and UCSC often contain transcripts and genes that need to be filtered out of your final annotation. It is usually easiest to filter genes by their key-value pairs in the GTF attribute column. For example, to keep only protein-coding genes, run the following command on your GTF.
$ cellranger mkgtf hg19-ensembl.gtf hg19-filtered-ensembl.gtf --attribute=gene_biotype:protein_coding
This will generate a filtered GTF file hg19-filtered-ensembl.gtf from the original unfiltered GTF file hg19-ensembl.gtf.
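Before choosing which --attribute filters to pass, it can help to see which gene_biotype values a GTF actually contains. The following is a minimal sketch using standard Unix tools rather than anything shipped with Cell Ranger; the function name count_biotypes is hypothetical, and it assumes ENSEMBL-style attributes written as gene_biotype "value";.

```shell
# List the gene_biotype values present in a GTF, with the number of
# lines carrying each value, most frequent first.
# Assumption: attributes look like  gene_biotype "protein_coding";
count_biotypes() {
    grep -v '^#' "$1" \
        | grep -o 'gene_biotype "[^"]*"' \
        | sort \
        | uniq -c \
        | sort -k1,1nr
}

# Example (not run here): count_biotypes hg19-ensembl.gtf
```

Each value reported this way can be used on the right-hand side of --attribute=gene_biotype:<value>.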
To create a reference for only one species, run the cellranger mkref command on your FASTA and GTF files. Your FASTA and GTF files must meet the compatibility requirements above.
$ cellranger mkref --genome=hg19 --fasta=hg19.fa --genes=hg19-filtered-ensembl.gtf
...
$ ls hg19
fasta/ genes/ pickle/ reference.json star/
This utility copies your FASTA and GTF, indexes them in several formats, and outputs a folder named <genome>.
To create a reference for multiple species, run the cellranger mkref command with your FASTA and GTF files, as in the single-species case above. The order of the --genome, --fasta, and --genes options is important, however: the first --genome option listed corresponds to the first --fasta and --genes options listed.
$ cellranger mkref --genome=hg19 --fasta=hg19.fa --genes=hg19-filtered-ensembl.gtf \
                   --genome=mm10 --fasta=mm10.fa --genes=mm10-filtered-ensembl.gtf
...
$ ls hg19_and_mm10
fasta/ genes/ pickle/ reference.json star/
Indexing a typical human 3 Gb FASTA file often takes up to 8 core hours and requires 32 GB of memory. We recommend running the cellranger mkref command with --nthreads equal to the number of cores available on your system.
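One way to follow the --nthreads recommendation is to detect the core count at run time. The sketch below assumes a system that provides either nproc or getconf; the helper name detect_cores is hypothetical. It only prints the resulting command so that it can be inspected before being run.

```shell
# Print the number of cores available on this system.
# Assumption: either nproc (GNU coreutils) or getconf is installed.
detect_cores() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        getconf _NPROCESSORS_ONLN
    fi
}

# Show the mkref command with --nthreads filled in (printed, not executed).
echo cellranger mkref --nthreads="$(detect_cores)" \
    --genome=hg19 --fasta=hg19.fa --genes=hg19-filtered-ensembl.gtf
```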
You can also specify the amount of memory (in GB) that cellranger should use during alignment via STAR. The default is 16 GB. Note that the amount of memory your reference uses during alignment must be greater than the number of gigabases in the input FASTA file.
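The memory rule above can be checked before indexing by counting the bases in the FASTA. This is a rough sketch under a literal reading of that rule (memory in GB strictly greater than bases divided by 10^9); the function names fasta_bases and memory_is_sufficient are hypothetical, and counting a full genome this way may take a few minutes.

```shell
# Count the bases in a FASTA file: every character on non-header lines,
# with line endings removed.
fasta_bases() {
    grep -v '^>' "$1" | tr -d '\n\r' | wc -c | tr -d ' '
}

# Succeed (exit 0) if the given memory in GB is strictly greater than
# the FASTA size in gigabases; fail (exit 1) otherwise.
memory_is_sufficient() {
    bases=$(fasta_bases "$1")
    awk -v b="$bases" -v m="$2" 'BEGIN { exit !(m > b / 1e9) }'
}

# Example (not run here):
#   memory_is_sufficient hg19.fa 16 && echo "16 GB satisfies the rule"
```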
The references in the Cell Ranger reference packages were generated with the steps described above. When creating the Cell Ranger hg19 reference, the GTF file downloaded from ENSEMBL was filtered using the following cellranger mkgtf command.
$ cellranger mkgtf hg19-ensembl.gtf hg19-filtered-ensembl.gtf \
                   --attribute=gene_biotype:protein_coding \
                   --attribute=gene_biotype:lincRNA \
                   --attribute=gene_biotype:antisense
Additionally, "chr" was prepended to the chromosome entries in the GTF.
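The exact command used for the "chr" renaming is not given here. One possible way to do it with awk is sketched below, where the function name add_chr_prefix is hypothetical. It only adds the prefix to column 1, so sequence names that differ between ENSEMBL and UCSC by more than the prefix would need separate handling.

```shell
# Prepend "chr" to the sequence name (column 1) of every feature line
# in a GTF. Header lines starting with "#" are passed through unchanged.
add_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/  { print; next }
               { $1 = "chr" $1; print }' "$1"
}

# Example (not run here): add_chr_prefix in.gtf > out.gtf
```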
The hg19 FASTA was then downloaded from UCSC. Once alternate haplotype chromosomes were removed (any chromosome whose name contains hap, e.g. chr4_ctg9_hap1), running cellranger mkref as described above produced the Cell Ranger hg19 reference.
$ cellranger mkref --genome=hg19 --fasta=hg19.fa --genes=hg19-filtered-ensembl.gtf
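The haplotype removal step that precedes this mkref call could be done in several ways, and the tool actually used is not stated. As an illustration only, the awk sketch below drops every FASTA record whose header line contains hap; the function name drop_hap_sequences is hypothetical.

```shell
# Remove FASTA records whose header line contains "hap"
# (e.g. >chr4_ctg9_hap1). All other records are printed unchanged.
drop_hap_sequences() {
    awk '/^>/ { keep = ($0 !~ /hap/) } keep' "$1"
}

# Example (not run here): drop_hap_sequences hg19-full.fa > hg19.fa
```

The keep flag is updated at each header line and then applies to all sequence lines that follow, so multi-line records are removed as a whole.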
The Cell Ranger mm10 reference was generated similarly using filtered ENSEMBL GTF and UCSC FASTA files.