Cell Ranger 5.0, printed on 11/05/2024
10X Genomics provides pre-built references for human and mouse genomes to use with Cell Ranger. Researchers can make custom reference genomes for additional species or add custom marker genes of interest to the reference, e.g. GFP. The following tutorial outlines the steps to build a custom reference using the cellranger mkref pipeline.
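In this tutorial, you will learn how to download the genome FASTA and GTF files for a species, filter the GTF file with cellranger mkgtf, build a reference with cellranger mkref, and add a marker gene or custom panel sequences to a reference.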
This tutorial follows the same steps used to create the 10X Genomics pre-built references for human and mouse. These steps can be found on this page: Build Notes for Reference Packages.
First, locate the reference genome FASTA and GTF files for your species. If the species is available from the Ensembl database, we recommend using the files from there. The GTF files from Ensembl contain optional tags that make filtering easy. If your species of interest is not available from Ensembl, GTF and FASTA files from other sources can also work. Note that a GTF file is required, while a GFF file is not supported. (See GFF/GTF File Format - Definition and supported options)
This tutorial generates a custom reference for the zebrafish, Danio rerio.
The files needed are located here in Ensembl.
Navigate to the Gene annotation section of the Ensembl website and click on the Download GTF link. This takes you to an ftp site with a list of GTF files available. Select the file called Danio_rerio.GRCz11.99.chr.gtf.gz. This is the GTF annotation file for this species. All species in Ensembl have similar files available to download. For more information on the GTF files in Ensembl, click on the README file.
Right-click the link to copy the address, paste the URL into the command line, and download the file with the wget command:
wget ftp://ftp.ensembl.org/pub/release-99/gtf/danio_rerio/Danio_rerio.GRCz11.99.chr.gtf.gz
The file is approximately 20 MB and takes less than a minute to download, depending on your system.
Decompress the file with the gunzip command:
gunzip Danio_rerio.GRCz11.99.chr.gtf.gz
Next, navigate back to the Ensembl page for Danio rerio and click on Download FASTA to access the ftp site containing several types of FASTA files. Select dna to access the directory with genome files. Download the FASTA file that contains all the chromosomes of the genome together, which has primary_assembly in the filename. Right-click the link to copy the address, paste the URL into the command line, and download the file with the wget command:
wget ftp://ftp.ensembl.org/pub/release-98/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz
The file is approximately 400 MB and takes several minutes to download, depending on your system.
Decompress the file with the gunzip command:
gunzip Danio_rerio.GRCz11.dna.primary_assembly.fa.gz
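Because the FASTA and GTF files are downloaded separately, it can be worth confirming that every sequence name in column 1 of the GTF also appears as a FASTA header before building anything. The following is a minimal sketch of such a check; the function name missing_seqnames is hypothetical and is not part of Cell Ranger.

```shell
# Hypothetical helper: list GTF sequence names (column 1) that have no
# matching FASTA header. Prints nothing when the two files agree.
# $1 = FASTA file, $2 = GTF file
missing_seqnames() {
    awk -F'\t' '
        # First file (FASTA): remember the first word of each header line.
        NR == FNR {
            if (/^>/) {
                split(substr($0, 2), a, " ")
                seen[a[1]] = 1
            }
            next
        }
        # Second file (GTF): skip comment lines.
        /^#/ { next }
        # Report each unknown sequence name once.
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}
```

For the files in this tutorial, the call would be missing_seqnames Danio_rerio.GRCz11.dna.primary_assembly.fa Danio_rerio.GRCz11.99.chr.gtf. Any name it prints is used in the annotation but absent from the genome file.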
GTF files can contain entries for non-polyA transcripts that overlap with protein-coding gene models. Because of the overlapping annotations, reads from these regions can be flagged as mapped to multiple genes (multi-mapped), and multi-mapped reads are not counted. See the article Which reads are considered for UMI counting by Cell Ranger. To remove these entries from the GTF, add this filter argument to the mkgtf command: --attribute=gene_biotype:protein_coding. If you are interested in seeing all of the filters used to build the references available on our support site, click here. If your GTF file does not contain gene_biotype attributes or is missing other entries, there may still be enough information to build a reference. A minimal GTF file only needs to contain exon features for protein-coding genes.
Set up the command:
cellranger mkgtf \
Danio_rerio.GRCz11.99.chr.gtf \
Danio_rerio.GRCz11.98.chr.filtered.gtf \
--attribute=gene_biotype:protein_coding
This will output the file Danio_rerio.GRCz11.98.chr.filtered.gtf, which will be used in the next step.
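To see what the filter did, one option is to tally the gene_biotype values before and after running mkgtf. The sketch below assumes an Ensembl-style GTF that has gene records and a gene_biotype attribute; the function name count_gene_biotypes is hypothetical.

```shell
# Hypothetical helper: tally gene_biotype values over the gene records
# of a GTF file, printing "count<TAB>biotype", largest count first.
# $1 = GTF file
count_gene_biotypes() {
    awk -F'\t' '$3 == "gene"' "$1" |
        grep -o 'gene_biotype "[^"]*"' |
        cut -d'"' -f2 |
        sort | uniq -c |
        awk '{ print $1 "\t" $2 }' |
        sort -k1,1nr
}
```

Running count_gene_biotypes on Danio_rerio.GRCz11.99.chr.gtf and then on Danio_rerio.GRCz11.98.chr.filtered.gtf should show several biotypes in the first file and only protein_coding in the second, if the filter behaved as described.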
Now that you have the genome FASTA and filtered GTF files, set up the following command to run the cellranger mkref pipeline:
cellranger mkref \
--genome=Danio.rerio_genome \
--fasta=Danio_rerio.GRCz11.dna.primary_assembly.fa \
--genes=Danio_rerio.GRCz11.98.chr.filtered.gtf
Run the command. This can take several hours, depending on your system. If you are working on a shared computing environment such as an HPC cluster, submit this as a job to prevent competing with other users for resources.
The output looks similar to this:
['.../cellranger/bin/rna/mkref', '--genome=Danio.rerio_genome', '--fasta=Danio_rerio.GRCz11.dna.primary_assembly.fa', '--genes=Danio_rerio.GRCz11.98.chr.filtered.gtf']
Creating new reference folder at Danio.rerio_genome
...done
Writing genome FASTA file into reference folder...
...done
Computing hash of genome FASTA file...
...done
Indexing genome FASTA file...
...done
Writing genes GTF file into reference folder...
...done
Computing hash of genes GTF file...
...done
Writing genes index file into reference folder (may take over 10 minutes for a 3Gb genome)...
...done
Writing genome metadata JSON file into reference folder...
...done
Generating STAR genome index (may take over 8 core hours for a 3Gb genome)...
Apr 28 09:45:16 ..... Started STAR run
Apr 28 09:45:16 ... Starting to generate Genome files
Apr 28 09:45:52 ... starting to sort Suffix Array. This may take a long time...
Apr 28 09:45:56 ... sorting Suffix Array chunks and saving them to disk...
Apr 28 10:05:40 ... loading chunks from disk, packing SA...
Apr 28 10:05:58 ... Finished generating suffix array
Apr 28 10:05:58 ... Generating Suffix Array index
Apr 28 10:07:48 ... Completed Suffix Array index
Apr 28 10:07:48 ..... Processing annotations GTF
Apr 28 10:07:57 ..... Inserting junctions into the genome indices
Apr 28 10:14:32 ... writing Genome to disk ...
Apr 28 10:14:35 ... writing Suffix Array to disk ...
Apr 28 10:14:44 ... writing SAindex to disk
Apr 28 10:14:48 ..... Finished successfully
...done.
Reference successfully created! You can now specify this reference on the command line:
cellranger --transcriptome=Danio.rerio_genome ...
The reference was successfully created, as noted in the output above. If you do not see this message, an error probably occurred. Please copy the error message and send it by email to 10x Genomics support.
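When mkref runs as a batch job, nobody is watching the terminal, so one option is to save the output to a log file and test for the success message afterwards. The following sketch assumes the output was saved to a file; the function name mkref_succeeded and the file name mkref.log are hypothetical.

```shell
# Hypothetical helper: succeed only if a saved mkref log contains the
# success message shown in the output above.
# $1 = log file
mkref_succeeded() {
    grep -q 'Reference successfully created!' "$1"
}

# Example use (mkref.log is an assumed file name):
#   cellranger mkref ... 2>&1 | tee mkref.log
#   mkref_succeeded mkref.log && echo "reference ready" || echo "check mkref.log"
```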
There are cases where the publicly-available GTF and FASTA files will not contain information for some of the genes expressed in a given sample. A transgenic sample is a good example of when you would not expect a gene of interest to be in the reference. In this example, the common marker gene, Green Fluorescent Protein (GFP) (used as an in-vivo fluorescent reporter for gene expression) is added to the reference. This method of adding genes to a reference has been reported to work for detecting genes from viral infections provided the detected transcripts are polyadenylated.
Note: This is only one of many mRNA sequences available encoding for GFP. Make sure to use the sequence specific for your assay.
Next, get the GFP FASTA file from the European Nucleotide Archive:
wget -O GFP_orig.fa "https://www.ebi.ac.uk/ena/browser/api/fasta/AAA27722.1?download=true"
The header of this file looks like the following:
>ENA|AAA27722|AAA27722.1 Aequorea victoria green-fluorescent protein
There are special characters, such as | and spaces, in the header (all text after the >) of this FASTA sequence. These can be problematic for downstream applications, so it helps to replace the header with a short, informative name. The following command uses the stream editor sed to search for a pattern (the original header), replace it with new text (GFP), and redirect the output to a new file, GFP.fa. The single quotes stop the shell from interpreting the | and space characters.
sed 's/^>ENA|AAA27722|AAA27722\.1 Aequorea victoria green-fluorescent protein/>GFP/' GFP_orig.fa > GFP.fa
Note: Another option is to open the GFP_orig.fa file with a text editor, such as nano, then manually edit the header and save the file as GFP.fa. Choose whichever method of changing the header you feel most comfortable with.
Now the FASTA file GFP.fa looks like the following:
>GFP
ATGAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGT
GATGTTAATGGGCACAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACATACGGA
AAACTTACCCTTAAATTTATTTGCACTACTGGAAAGCTACCTGTTCCATGGCCAACACTT
GTCACTACTTTCTCTTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAG
CATGACTTTTTCAAGAGTGCCATGCCCGAAGGTTATGTACAGGAAAGAACTATATTTTAC
AAAGATGACGGGAACTACAAATCACGTGCTGAAGTCAAGTTTGAAGGTGATACCCTCGTT
AATAGAATTGAGTTAAAAGGTATTGATTTTAAAGAAGATGGAAACATTCTTGGACACAAA
ATGGAATACAACTATAACTCACACAATGTATACATCATGGCAGACAAACAAAAGAATGGA
ATCAAAGTTAACTTCAAAATTAGACACAACATTGAAGATGGAAGCGTTCAACTAGCAGAC
CATTATCAACAAAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTAC
CTGTCCACACAATCTGCCCTTTCCAAAGATCCCAACGAAAAGAGAGATCACATGATCCTT
CTTGAGTTTGTAACAGCTGCTGGGATTACACATGGCATGGATGAACTATACAAATAA
To find the number of bases in this sequence, use grep -v "^>" to select all lines that do not start with the > character, remove the line breaks with tr -d "\n" so they are not counted, and then count the remaining characters with wc -c. The pipe character | sends the output of each command to the next one.
cat GFP.fa | grep -v "^>" | tr -d "\n" | wc -c
The result of this command shows that there are 717 bases. This is important to know for the next step.
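The pipeline above works for a FASTA file with a single record. If you add several marker sequences from one file, a per-record count is more useful. The following is a minimal sketch using awk; the function name fasta_lengths is hypothetical.

```shell
# Hypothetical helper: print "name<TAB>length" for every record in a
# FASTA file, where name is the first word of the header line.
# $1 = FASTA file
fasta_lengths() {
    awk '
        /^>/ {
            # A new header: report the previous record, then reset.
            if (name != "") print name "\t" len
            name = substr($1, 2)
            len = 0
            next
        }
        # Sequence line: add its length to the current record.
        { len += length($0) }
        END { if (name != "") print name "\t" len }
    ' "$1"
}
```

For the file in this tutorial, fasta_lengths GFP.fa should report the same 717 bases as the pipeline above.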
Now, make a custom GTF for GFP with the following command. This command uses echo -e, which prints everything in quotes; the -e option enables interpretation of backslash escapes such as \t. Use \t to insert the tabs that separate the 9 columns of information required for GTF.
echo -e 'GFP\tunknown\texon\t1\t717\t.\t+\t.\tgene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";' > GFP.gtf
This is what the GFP.gtf file looks like with the cat GFP.gtf command:
GFP unknown exon 1 717 . + . gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";
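Typing the end coordinate by hand is easy to get wrong, so one alternative is to compute the length from the FASTA file and build the record with printf. The sketch below assumes a FASTA file with exactly one record and a single exon that spans it; the function name marker_gtf_line is hypothetical.

```shell
# Hypothetical helper: print a single-exon GTF record that spans a
# whole one-record FASTA file.
# $1 = sequence and gene name, $2 = FASTA file
marker_gtf_line() {
    # Count bases: drop the header line, strip newlines, count characters.
    len=$(grep -v '^>' "$2" | tr -d '\n' | wc -c | tr -d ' ')
    printf '%s\tunknown\texon\t1\t%s\t.\t+\t.\tgene_id "%s"; transcript_id "%s"; gene_name "%s"; gene_biotype "protein_coding";\n' \
        "$1" "$len" "$1" "$1" "$1"
}
```

For example, marker_gtf_line GFP GFP.fa > GFP.gtf writes the record, and awk -F'\t' 'NF != 9' GFP.gtf prints nothing when the line has the 9 tab-separated columns that GTF requires.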
Next, add the GFP.fa to the end of the D. rerio genome FASTA. But first, make a copy so that the original is unchanged.
cp Danio_rerio.GRCz11.dna.primary_assembly.fa Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa
Then, append GFP.fa to the end of the Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa file. The >> operator means append. Note: Do not use >, which overwrites the file.
cat GFP.fa >> Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa
To confirm that the GFP entry was added to the FASTA file, use the grep ">" command to search for lines with the > character:
grep ">" Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa
The output looks similar to the following:
>1 dna:chromosome chromosome:GRCz11:1:1:59578282:1 REF
>10 dna:chromosome chromosome:GRCz11:10:1:45420867:1 REF
>11 dna:chromosome chromosome:GRCz11:11:1:45484837:1 REF
>12 dna:chromosome chromosome:GRCz11:12:1:49182954:1 REF
>13 dna:chromosome chromosome:GRCz11:13:1:52186027:1 REF
>14 dna:chromosome chromosome:GRCz11:14:1:52660232:1 REF
>15 dna:chromosome chromosome:GRCz11:15:1:48040578:1 REF
>16 dna:chromosome chromosome:GRCz11:16:1:55266484:1 REF
>17 dna:chromosome chromosome:GRCz11:17:1:53461100:1 REF
>18 dna:chromosome chromosome:GRCz11:18:1:51023478:1 REF
>19 dna:chromosome chromosome:GRCz11:19:1:48449771:1 REF
>2 dna:chromosome chromosome:GRCz11:2:1:59640629:1 REF
>20 dna:chromosome chromosome:GRCz11:20:1:55201332:1 REF
>21 dna:chromosome chromosome:GRCz11:21:1:45934066:1 REF
>22 dna:chromosome chromosome:GRCz11:22:1:39133080:1 REF
>23 dna:chromosome chromosome:GRCz11:23:1:46223584:1 REF
>24 dna:chromosome chromosome:GRCz11:24:1:42172926:1 REF
>25 dna:chromosome chromosome:GRCz11:25:1:37502051:1 REF
>3 dna:chromosome chromosome:GRCz11:3:1:62628489:1 REF
>4 dna:chromosome chromosome:GRCz11:4:1:78093715:1 REF
>5 dna:chromosome chromosome:GRCz11:5:1:72500376:1 REF
>6 dna:chromosome chromosome:GRCz11:6:1:60270059:1 REF
>7 dna:chromosome chromosome:GRCz11:7:1:74282399:1 REF
>8 dna:chromosome chromosome:GRCz11:8:1:54304671:1 REF
>9 dna:chromosome chromosome:GRCz11:9:1:56459846:1 REF
>MT dna:chromosome chromosome:GRCz11:MT:1:16596:1 REF
>KN149696.2 dna:scaffold scaffold:GRCz11:KN149696.2:1:368252:1 REF
>KN147651.2 dna:scaffold scaffold:GRCz11:KN147651.2:1:351968:1 REF
>KN149690.1 dna:scaffold scaffold:GRCz11:KN149690.1:1:343018:1 REF
>KN149686.1 dna:scaffold scaffold:GRCz11:KN149686.1:1:260365:1 REF
>KN147652.2 dna:scaffold scaffold:GRCz11:KN147652.2:1:252640:1 REF
>KN149688.2 dna:scaffold scaffold:GRCz11:KN149688.2:1:252035:1 REF
>KN149691.1 dna:scaffold scaffold:GRCz11:KN149691.1:1:233193:1 REF
...
>GFP
You can also count the number of contigs in the FASTA. There should now be 994 contigs including the extra GFP.
grep -c "^>" Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa
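If the append command is run twice by accident, the contig count goes up by one more than expected and the >GFP header appears twice. A quick way to catch this is to look for repeated sequence names. The following is a minimal sketch; the function name duplicate_fasta_names is hypothetical.

```shell
# Hypothetical helper: print any sequence name (first word of a header)
# that occurs more than once in a FASTA file. Prints nothing when all
# names are unique.
# $1 = FASTA file
duplicate_fasta_names() {
    grep '^>' "$1" | cut -d' ' -f1 | sort | uniq -d | sed 's/^>//'
}
```

Running duplicate_fasta_names Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa should print nothing. If it prints GFP, recreate the copy of the genome FASTA and append GFP.fa once.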
Use the cp command to make a copy of the filtered GTF and modify the name to contain GFP. Then use the cat command to append the contents of GFP.gtf to the end of the renamed copy of the filtered D. rerio GTF.
cp Danio_rerio.GRCz11.98.chr.filtered.gtf Danio_rerio.GRCz11.98.chr.filtered.GFP.gtf
cat GFP.gtf >> Danio_rerio.GRCz11.98.chr.filtered.GFP.gtf
Check the file with the following command:
tail Danio_rerio.GRCz11.98.chr.filtered.GFP.gtf
The output looks similar to the following, with the GFP entry as the last line of the file:
MT RefSeq start_codon 15308 15310 . + 0 gene_id "ENSDARG00000063924"; gene_version "3"; transcript_id "ENSDART00000093625"; transcript_version "3"; exon_number "1"; gene_name "mt-cyb"; gene_source "RefSeq"; gene_biotype "protein_coding"; transcript_name "mt-cyb-201"; transcript_source "RefSeq"; transcript_biotype "protein_coding";
GFP unknown exon 1 717 . + . gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";
Now use the Danio_rerio.GRCz11.98.chr.filtered.GFP.gtf and Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa files as inputs to the cellranger mkref pipeline:
cellranger mkref --genome=Danio.rerio_genome_GFP --fasta=Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa --genes=Danio_rerio.GRCz11.98.chr.filtered.GFP.gtf
This outputs a custom reference directory called Danio.rerio_genome_GFP.
If you have used the Custom Panel Designer for the Targeted Gene Expression assay to design a custom panel with exogenous sequences, you will need to make a custom GRCh38-2020-A reference in a similar manner to the Add Your Marker Gene to the FASTA and GTF steps above. However, because the Custom Panel Designer already provides the files as output, fewer steps are needed.
You will need the custom sequences FASTA file (e.g. custompanel.fa) and the custom sequences GTF file (custompanel.gtf) output from the custom panel designer. These files are available on the last page of the custom design, where the links to the order are made.
First, make copies of the GRCh38-2020-A reference files in a separate directory:
mkdir custom-GRCh38-2020-A
cd custom-GRCh38-2020-A
cp ../refdata-gex-GRCh38-2020-A/genes/genes.gtf customref-GRCh38-2020-A.gtf
cp ../refdata-gex-GRCh38-2020-A/fasta/genome.fa customref-GRCh38-2020-A.fa
Then, append the custom panel files to the ends of the GRCh38-2020-A files.
cat custompanel.gtf >> customref-GRCh38-2020-A.gtf
cat custompanel.fa >> customref-GRCh38-2020-A.fa
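Appending with >> assumes that the target file ends with a newline. If it does not, the first appended line is joined to the last existing line, which would merge a FASTA header into sequence or fuse two GTF records. The following sketch guards against that case; the function name append_with_newline is hypothetical.

```shell
# Hypothetical helper: append file $1 to file $2, first adding a newline
# to $2 if it is non-empty and its last byte is not a newline.
append_with_newline() {
    # Command substitution strips a trailing newline, so the test below
    # is true only when the last byte is some other character.
    if [ -s "$2" ] && [ -n "$(tail -c 1 "$2")" ]; then
        printf '\n' >> "$2"
    fi
    cat "$1" >> "$2"
}
```

For example, append_with_newline custompanel.fa customref-GRCh38-2020-A.fa could be used in place of the plain cat command, and likewise for the GTF files.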
Check the files with the following commands to confirm that the process above worked:
tail customref-GRCh38-2020-A.gtf
tail customref-GRCh38-2020-A.fa
Now use these files as inputs to the cellranger mkref pipeline:
cellranger mkref \
--genome=customref-GRCh38-2020-A \
--fasta=customref-GRCh38-2020-A.fa \
--genes=customref-GRCh38-2020-A.gtf
This outputs a custom reference directory called customref-GRCh38-2020-A.