Cell Ranger 2.0, printed on 03/03/2025
Cell Ranger provides a pre-built human reference package for use with the pipeline. If you would like to use your own genome FASTA or gene GTF annotations, Cell Ranger also supports custom references.
The cellranger mkvdjref tool can be used to generate a custom reference package.
$ cellranger mkvdjref --genome=my_vdj_ref \
                      --fasta=GRCh38_ensembl.fasta \
                      --genes=GRCh38_ensembl.gtf
A Cell Ranger V(D)J reference consists of germline gene segment sequences. The tool assumes that these sequences are contained within a genome reference FASTA, and that a gene annotation GTF points to the relevant gene segments. Currently it assumes the GTF is in an Ensembl-like format. If you are using a transcriptome-based or segment-based V(D)J reference rather than a genome-based reference, you can make the "chromosomes" be the transcripts and construct a GTF that annotates the transcripts appropriately.
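As an illustration of the transcript-as-chromosome approach, the sketch below turns a FASTA of segment sequences into a matching GTF. It rests on several assumptions that are not part of Cell Ranger: the function name `segments_to_gtf` is hypothetical, each FASTA header is assumed to have the form `>record_id gene_name biotype`, the source column value `custom` is arbitrary, and every segment is annotated as a single full-length `CDS` feature on the `+` strand. Adjust the features if your segments include 5' UTR sequence.

```shell
# Hypothetical helper (not part of Cell Ranger): emit one GTF row per FASTA record.
# Assumed header layout: >record_id gene_name biotype
segments_to_gtf() {
    awk -v OFS='\t' '
        # Print a CDS row spanning the whole record collected so far.
        function emit_row(    attrs) {
            if (id == "") return
            attrs = "transcript_id \"" id "\"; gene_name \"" gene "\"; transcript_biotype \"" biotype "\";"
            print id, "custom", "CDS", 1, len, ".", "+", "0", attrs
        }
        # Header line: flush the previous record and start a new one.
        /^>/ { emit_row(); id = substr($1, 2); gene = $2; biotype = $3; len = 0; next }
        # Sequence line: strip whitespace and accumulate the length.
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END { emit_row() }
    ' "$1"
}
```

For example, `segments_to_gtf my_segments.fa > my_segments.gtf` would produce a GTF whose first column matches the record names in `my_segments.fa` (both file names are placeholders).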
cellranger mkvdjref expects a FASTA file containing genomic reference sequences whose names are consistent with the names used in the GTF file.
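A quick way to catch naming mismatches before running the tool is to compare the first column of the GTF against the FASTA record names. The sketch below is only a sanity check, not something mkvdjref provides: the function name `missing_contigs` is hypothetical, and it assumes a record name is the header text after `>` up to the first whitespace.

```shell
# Hypothetical helper: list GTF sequence names (column 1) that have no FASTA record.
# Prints nothing when every name is present.
missing_contigs() {
    fasta="$1"
    gtf="$2"
    awk -F '\t' '
        # First file (FASTA): remember each record name, i.e. the header text
        # after ">" up to the first whitespace (an assumption of this sketch).
        NR == FNR {
            if (substr($0, 1, 1) == ">") {
                name = substr($0, 2)
                sub(/[ \t].*/, "", name)
                seen[name] = 1
            }
            next
        }
        # Second file (GTF): skip comments and blank lines.
        /^#/ || NF == 0 { next }
        # Report each unknown sequence name once.
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$fasta" "$gtf"
}
```

Running `missing_contigs GRCh38_ensembl.fasta GRCh38_ensembl.gtf` should print nothing if the two files agree; any name it prints is used in the GTF but absent from the FASTA.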
Cell Ranger V(D)J expects a GTF file in an Ensembl-like format that contains information about V(D)J gene segments.
GTF Column | Name | Description |
1 | Chromosome | Must refer to a chromosome/contig in the genome FASTA. |
2 | Source | Unused. |
3 | Feature | Cell Ranger only uses rows where this column is equal to either CDS or five_prime_utr. |
4 | Start | Start position on the reference (1-based inclusive). |
5 | End | End position on the reference (1-based inclusive). |
6 | Score | Unused. |
7 | Strand | Strandedness of this feature on the reference: + or -. |
8 | Frame | Unused. |
9 | Attributes | A semicolon-delimited list of key-value pairs of the form key "value". The attribute keys used by Cell Ranger V(D)J are detailed below. |
GTF Attribute | Description |
transcript_id | Becomes the record_id in the Cell Ranger V(D)J reference entry format. |
transcript_biotype | The value is used to infer the V(D)J segment type. Either transcript_biotype or gene_biotype must be a value in the "Accepted Biotypes" list below. If transcript_biotype is not on the accepted list, then gene_biotype is used. |
gene_biotype | See transcript_biotype. |
gene_name | Must be specified. Becomes the gene_name in the Cell Ranger V(D)J reference entry format. |
14 havana CDS 21621904 21621946 . + 0 transcript_id "ENST00000542354"; gene_name "TRAV1-1"; transcript_biotype "TR_V_gene";
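To preview which rows of a GTF carry the information described in the tables above, the following sketch prints the feature type, transcript_id, gene_name, and biotype for every CDS or five_prime_utr row. It is an approximation rather than the tool's own logic: the function name `vdj_rows` is hypothetical, the input is assumed to be tab-delimited as in a standard GTF, attribute values are assumed not to contain semicolons, and it falls back to gene_biotype only when transcript_biotype is absent, without checking against the accepted biotype list.

```shell
# Hypothetical helper: summarize the GTF rows that carry V(D)J-relevant features.
# Output columns (tab-separated): feature, transcript_id, gene_name, biotype
vdj_rows() {
    awk -F '\t' -v OFS='\t' '
        # Return the quoted value of an attribute key from column 9, or "" if absent.
        function attr(key,    n, parts, i, kv) {
            n = split($9, parts, ";")
            for (i = 1; i <= n; i++) {
                kv = parts[i]
                sub(/^[ \t]+/, "", kv)
                if (index(kv, key " ") == 1) {
                    sub(/^[^"]*"/, "", kv)   # drop everything up to the opening quote
                    sub(/".*$/, "", kv)      # drop the closing quote and anything after
                    return kv
                }
            }
            return ""
        }
        # Keep only the two feature types listed in the column table.
        $3 == "CDS" || $3 == "five_prime_utr" {
            biotype = attr("transcript_biotype")
            # Simplification: fall back only when transcript_biotype is missing.
            if (biotype == "") biotype = attr("gene_biotype")
            print $3, attr("transcript_id"), attr("gene_name"), biotype
        }
    ' "$1"
}
```

Applied to a file containing the example row above, `vdj_rows` would report the feature CDS, the transcript ENST00000542354, the gene TRAV1-1, and the biotype TR_V_gene. Rows with an empty gene_name in the output point to entries missing a required attribute.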
cellranger mkvdjref creates a directory whose name is specified by the --genome argument.
$ tree my_vdj_ref
my_vdj_ref
├── fasta
│   └── regions.fa
└── reference.json
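If you want to confirm in a script that a run produced this layout, a small check like the one below may help. The function name `check_vdj_ref` is hypothetical and it only tests that the two files shown above exist and are non-empty; it does not validate their contents.

```shell
# Hypothetical helper: verify that a reference directory has the expected files.
# Returns 0 if both files exist and are non-empty, 1 otherwise.
check_vdj_ref() {
    ref_dir="$1"
    status=0
    for path in fasta/regions.fa reference.json; do
        if [ ! -s "$ref_dir/$path" ]; then
            echo "missing or empty: $ref_dir/$path" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_vdj_ref my_vdj_ref && echo "reference looks complete"` prints the message only when both files are present.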
The Cell Ranger V(D)J human reference package refdata-cellranger-vdj-GRCh38-alts-ensembl-2.0.0 was generated with the following steps.
This reference was constructed by adding some entries to, and removing others from, the Ensembl GTF. Entries from multiple GTFs are added by specifying the --genes argument multiple times. Entries are removed by providing a list of transcript IDs to the --rm-transcripts argument. For details, please see cellranger mkvdjref --help.
$ wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
$ gunzip Homo_sapiens.GRCh38.dna.toplevel.fa.gz
$ wget ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.chr_patch_hapl_scaff.gtf.gz
$ gunzip Homo_sapiens.GRCh38.87.chr_patch_hapl_scaff.gtf.gz
$ cellranger mkvdjref --genome vdj_GRCh38_alts_ensembl \
                      --fasta=Homo_sapiens.GRCh38.dna.toplevel.fa \
                      --genes=Homo_sapiens.GRCh38.87.chr_patch_hapl_scaff.gtf \
                      --genes=vdj_GRCh38_alts_ensembl_10x_genes-2.0.0.gtf \
                      --rm-transcripts=vdj_GRCh38_alts_ensembl_10x_ignore_transcripts-2.0.0.txt \
                      --ref-version=2.0.0