10x Genomics
Chromium Single Cell ATAC

Cell Ranger ATAC 1.2, printed on 03/03/2025

Cell Ranger ATAC Genome References

Overview
Pre-built Standard References
Arguments and Options for mkref
Building with mkref
System Requirements
Configuration file
Advanced: file structure in a reference

Overview

The reference data for Cell Ranger ATAC pipelines consists of the reference genome sequence and its associated genome annotation, which includes gene and transcript coordinates. The genome sequences and annotations can be obtained from reputable, well-established consortia such as NCBI, GENCODE, Ensembl and ENCODE. We provide pre-built single and mixed species references described in the next section, as well as a command-line tool mkref to build references that are not pre-built.

Single species pre-built references (hg19, b37, GRCh38, mm10) can be built using mkref with the recognized keyword as the input argument. However, there is no practical need to generate these references again, and we strongly recommend that you download the pre-built references directly (see the Advanced: file structure in a reference section for more details). We do not support building custom mixed species references via mkref in the Cell Ranger ATAC 1.2.0 pipelines.

Pre-built Standard References

We provide the following pre-built references on the downloads page.

Standard single species reference packages:

Human GRCh37 build in two variants:
- hg19/UCSC-style chromosome naming convention ("chr1", "chrM")
- b37/1000 Genomes-style chromosome naming convention ("1", "MT")
Human GRCh38 build.
Mouse mm10 build.

Note that we do not use the decoy and alternate contigs in any analysis steps in the pipeline.
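
The two GRCh37 naming conventions listed above differ only in how contigs are labelled. As a minimal sketch, the Python helpers below translate primary chromosome names between the two styles. The function names are hypothetical and not part of Cell Ranger ATAC, and the mapping is an assumption extrapolated from the two examples given ("chr1"/"1" and "chrM"/"MT"). It only renames contigs and makes no claim about sequence content or about unplaced contigs.

```python
def ucsc_to_b37(name: str) -> str:
    """Translate a UCSC-style primary contig name (hg19) to 1000 Genomes style (b37)."""
    if name == "chrM":
        return "MT"  # mitochondrial contig is labelled differently in the two styles
    if name.startswith("chr"):
        return name[len("chr"):]  # "chr1" -> "1", "chrX" -> "X"
    raise ValueError(f"not a UCSC-style contig name: {name!r}")


def b37_to_ucsc(name: str) -> str:
    """Translate a 1000 Genomes-style primary contig name (b37) to UCSC style (hg19)."""
    if name == "MT":
        return "chrM"
    return "chr" + name


print(ucsc_to_b37("chr1"), ucsc_to_b37("chrM"))
```

Contig names in your own BED or annotation files need to follow the convention of the reference you pick, so a helper like this is mainly useful when moving files between an hg19 and a b37 analysis.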

Standard multi-species reference packages:

hg19_and_mm10
GRCh38_and_mm10

These are made by taking the union of reference sequences and annotations from individual single species pre-built references.

Note that contig names are prefixed by the species build. For example, chr1 from hg19 is labelled hg19_chr1 inside the hg19_and_mm10 build.
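
The prefixing scheme can be sketched in a few lines of Python. The helper names and the default list of builds are illustrative assumptions, not part of the pipeline. The only behavior taken from the text above is that a contig such as chr1 from hg19 becomes hg19_chr1.

```python
from typing import List, Sequence, Tuple


def prefix_contigs(build: str, contigs: Sequence[str]) -> List[str]:
    """Prefix each contig name with its species build, e.g. chr1 -> hg19_chr1."""
    return [f"{build}_{contig}" for contig in contigs]


def split_prefixed(name: str, builds: Sequence[str] = ("hg19", "GRCh38", "mm10")) -> Tuple[str, str]:
    """Split a prefixed contig name back into (build, contig)."""
    for build in builds:
        if name.startswith(build + "_"):
            return build, name[len(build) + 1:]
    raise ValueError(f"no known build prefix in {name!r}")


print(prefix_contigs("hg19", ["chr1", "chrM"]))
print(split_prefixed("mm10_chr1"))
```

Splitting on the known build names rather than on the first underscore keeps the helper correct for contig names that themselves contain underscores.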

Arguments and Options

cellranger-atac 1.2.0 supports building single species references using mkref.

Parameter	Function
`GENOME`	(Required) Name of the genome reference. New reference will be built as a new directory named GENOME under the current working directory.
`--config`	(Optional for standard references) Configuration file to build a custom reference. Ignored when GENOME is one of the standard references: hg19, b37, GRCh38 or mm10.
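
The interaction between the two arguments can be illustrated with a small Python sketch that assembles the command line. The function name is hypothetical. It uses only the GENOME argument and the --config option described in the table, and it mirrors the documented rule that --config is ignored for the four standard keywords. It does not reproduce how mkref itself is implemented.

```python
from typing import List, Optional

# Keywords for which mkref builds a standard reference (from the table above).
STANDARD_GENOMES = {"hg19", "b37", "GRCh38", "mm10"}


def mkref_command(genome: str, config: Optional[str] = None) -> List[str]:
    """Return the argument list for a cellranger-atac mkref invocation."""
    cmd = ["cellranger-atac", "mkref", genome]
    if genome in STANDARD_GENOMES:
        return cmd  # --config would be ignored, so it is left out
    if config is None:
        raise ValueError(f"custom genome {genome!r} needs a configuration file")
    return cmd + ["--config", config]


print(" ".join(mkref_command("fly_BDGP6", "fly_BDGP6.config")))
```

The returned list can be passed to subprocess.run from the directory in which the new reference folder should be created.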

Building with mkref

To build a custom reference, mkref requires a configuration file that specifies the sources of the genome sequence and annotations and lists the contigs present in the genome (see the Configuration file section for the requirements). The following example config file, fly_BDGP6.config, builds a reference for Drosophila melanogaster.

{
	GENOME_FASTA_INPUT: "ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.25_FB2018_06/fasta/dmel-all-chromosome-r6.25.fasta.gz",
	GENE_ANNOTATION_INPUT: "ftp://ftp.ensembl.org/pub/release-95/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.95.gtf.gz",
	MOTIF_INPUT: "http://jaspar.genereg.net/download/CORE/JASPAR2020_CORE_insects_non-redundant_pfms_jaspar.txt",
	ORGANISM: "Drosophila melanogaster",
	PRIMARY_CONTIGS: ["2L", "2R", "3L", "3R", "4", "X", "Y"],
	NON_NUCLEAR_CONTIGS: ["mitochondrion_genome"]
}
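
The example above uses unquoted keys, so a strict JSON parser will reject it as written. If you want to inspect such a file in your own scripts, one possible approach is sketched below in Python: quote the bare keys and then parse the result as ordinary JSON. This is an assumption about how the relaxed syntax could be read, not a description of the parser mkref uses. The function name and the file paths in the sample are hypothetical, and the sketch does not handle other relaxations such as trailing commas or comments.

```python
import json
import re

# A bare (unquoted) key at the start of a line, followed by a colon.
_BARE_KEY = re.compile(r'^([ \t]*)([A-Za-z_][A-Za-z0-9_]*)[ \t]*:', re.MULTILINE)


def load_relaxed_config(text: str) -> dict:
    """Quote bare keys, then parse the text as strict JSON."""
    strict = _BARE_KEY.sub(r'\1"\2":', text)
    return json.loads(strict)


# Hypothetical local paths stand in for the URLs used in the example above.
sample = """{
    GENOME_FASTA_INPUT: "/path/to/genome.fa.gz",
    GENE_ANNOTATION_INPUT: "/path/to/genes.gtf.gz",
    MOTIF_INPUT: "",
    ORGANISM: "Drosophila melanogaster",
    PRIMARY_CONTIGS: ["2L", "2R", "3L", "3R", "4", "X", "Y"],
    NON_NUCLEAR_CONTIGS: ["mitochondrion_genome"]
}"""

config = load_relaxed_config(sample)
print(config["PRIMARY_CONTIGS"])
```

Because already-quoted keys start with a double quote, the pattern leaves strictly formatted JSON untouched, so the same helper reads both styles.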

To build the reference, run mkref:

$ cd /home/jdoe/ref
$ cellranger-atac mkref fly_BDGP6 --config fly_BDGP6.config 
 
Non-standard genome name detected, building custom reference...
 
>>> Creating reference for fly_BDGP6 <<<
 
Creating new reference folder at /home/jdoe/ref/fly_BDGP6
Downloading fasta files from source...
done
 
Generating samtools index...
done
 
Generating pyfasta indexes...
    Number of contigs: 1870
    Total genome size: 143726002
done
 
Downloading gene annotation files from source...
done
 
Writing TSS and transcripts bed file...
    Parsed 23541 unique TSS and 28827 unique transcripts.
done
 
Generating bwa index (may take over an hour for a 3Gb genome)...
[bwa_index] Pack FASTA... 1.23 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=287452004, availableWord=32225820
[BWTIncConstructFromPacked] 10 iterations done. 53158068 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 98205524 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 138239796 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 173818340 characters processed.
[BWTIncConstructFromPacked] 50 iterations done. 205436596 characters processed.
[BWTIncConstructFromPacked] 60 iterations done. 233534948 characters processed.
[BWTIncConstructFromPacked] 70 iterations done. 258504820 characters processed.
[BWTIncConstructFromPacked] 80 iterations done. 280694052 characters processed.
[bwt_gen] Finished constructing BWT in 84 iterations.
[bwa_index] 94.16 seconds elapse.
[bwa_index] Update BWT... 0.93 sec
[bwa_index] Pack forward-only FASTA... 0.74 sec
[bwa_index] Construct SA from BWT and Occ... 33.75 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index /home/jdoe/ref/fly_BDGP6/fasta/genome.fa
[main] Real time: 131.225 sec; CPU: 130.816 sec
done
 
Downloading pfm files from source...
done
 
Finishing up...
>>> Reference successfully created! <<<
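
The log above reports the number of contigs and the total genome size. If you want to cross-check these two figures against your own FASTA input before running mkref, a minimal Python sketch is shown below. The function name is hypothetical and this is not how mkref computes the values internally. It simply counts header lines and sequence characters.

```python
import io
from typing import IO, Tuple


def fasta_stats(handle: IO[str]) -> Tuple[int, int]:
    """Return (number of contigs, total sequence length) for an open FASTA file."""
    n_contigs = 0
    total_length = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_contigs += 1  # each header line starts a new contig
        else:
            total_length += len(line)
    return n_contigs, total_length


# Tiny in-memory FASTA used only for illustration.
demo = io.StringIO(">2L\nACGT\nAC\n>mitochondrion_genome\nGGG\n")
print(fasta_stats(demo))  # (2, 9)
```

For a compressed input, open the file with gzip.open(path, "rt") and pass the resulting handle to the same function.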

System Requirements

Indexing is the computational bottleneck in building references for Cell Ranger ATAC. Indexing a typical human 3Gb FASTA file often takes up to 8 core hours and requires 32 GB of memory.

Configuration file

For building custom references, you must supply a configuration file like the Drosophila example shown in the Building with mkref section. The example file is written in a "human readable" JSON format (note the unquoted keys), though strictly formatted JSON is also accepted. The table below lists the required input keys for the configuration file. Each key's value must satisfy the type constraint in the second column and the format requirements in the third column, for example when the value is a URL or a file path pointing to a file.

Required Input Keys	Type Requirements	Format Requirements
GENOME_FASTA_INPUT	valid url or file path	Must be in valid FASTA format. Must contain all contigs listed in PRIMARY_CONTIGS and NON_NUCLEAR_CONTIGS.
GENE_ANNOTATION_INPUT	valid url or file path	Must be in valid GTF or GFF3 format. Must contain all contigs listed in PRIMARY_CONTIGS and NON_NUCLEAR_CONTIGS. All entries must match the contig lengths defined in GENOME_FASTA_INPUT. Must contain "transcript" or "mRNA" in the third column. Each row with "transcript" or "mRNA" in the third column must have "gene_name" defined in the attribute column for GTF format, or "Name" for GFF3 format. This denotes the common name of the gene. Alternatives such as "gene_symbol" will not be accepted. Header lines starting with "#" are allowed in GTF or GFF3 input. However, comment lines starting with "#" in between GTF or GFF3 record rows will not be properly handled. (Optional) May contain a "gene_type" field, which is used for filtering in peak annotation: only "protein_coding" genes and, in pre-built genomes, genes coding for VDJ segments are annotated. When the field is not present, no filtering is applied.
MOTIF_INPUT	valid url or file path. Use "" to indicate it as not available.	Must be in valid JASPAR format. For example:
	JASPAR 2010 matrix_only format:
	>MA0001.1 AGL3
	A [ 0 3 79 40 66 48 65 11 65 0 ]
	C [94 75 4 3 1 2 5 2 3 3 ]
	G [ 1 0 3 4 1 0 5 3 28 88 ]
	T [ 2 19 11 50 29 47 22 81 1 6 ]
	JASPAR 2010-2014 PFMs format:
	>MA0001.1 AGL3
	0 3 79 40 66 48 65 11 65 0
	94 75 4 3 1 2 5 2 3 3
	1 0 3 4 1 0 5 3 28 88
	2 19 11 50 29 47 22 81 1 6
	The expected naming scheme of the motif is "motif ID" and "gene name" separated by a tab.
PRIMARY_CONTIGS	list	Must be enclosed in brackets `[]`, with each contig name in double quotes "". Note that PRIMARY_CONTIGS cannot be an empty list.
NON_NUCLEAR_CONTIGS	list	Must be enclosed in brackets `[]`, with each contig name in double quotes "". Use empty brackets `[]` to specify an empty list.
ORGANISM	string	Can be left empty as "". If provided, it will be displayed in the summary HTML file.
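
The type constraints in the table can be checked before invoking mkref. The Python sketch below works on an already parsed configuration (a dict) and returns a list of problems. The function name and messages are hypothetical, and the checks cover only the key, type, and emptiness rules stated in the table. It does not open the referenced files or verify their FASTA, GTF/GFF3, or JASPAR contents.

```python
from typing import Any, Dict, List

# Required keys and the Python type each value is expected to have.
REQUIRED_KEYS = {
    "GENOME_FASTA_INPUT": str,
    "GENE_ANNOTATION_INPUT": str,
    "MOTIF_INPUT": str,
    "PRIMARY_CONTIGS": list,
    "NON_NUCLEAR_CONTIGS": list,
    "ORGANISM": str,
}


def validate_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no problem found."""
    errors = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in cfg:
            errors.append(f"missing key: {key}")
        elif not isinstance(cfg[key], expected):
            errors.append(f"{key} must be of type {expected.__name__}")
    # Only MOTIF_INPUT and ORGANISM may be given as "".
    for key in ("GENOME_FASTA_INPUT", "GENE_ANNOTATION_INPUT"):
        if cfg.get(key) == "":
            errors.append(f"{key} must be a URL or file path")
    for key in ("PRIMARY_CONTIGS", "NON_NUCLEAR_CONTIGS"):
        value = cfg.get(key)
        if isinstance(value, list) and not all(isinstance(c, str) for c in value):
            errors.append(f"{key} must contain only quoted contig names")
    if cfg.get("PRIMARY_CONTIGS") == []:
        errors.append("PRIMARY_CONTIGS cannot be an empty list")
    return errors


# Hypothetical paths; MOTIF_INPUT is "" to mark motifs as not available.
example = {
    "GENOME_FASTA_INPUT": "/path/to/genome.fa.gz",
    "GENE_ANNOTATION_INPUT": "/path/to/genes.gtf.gz",
    "MOTIF_INPUT": "",
    "ORGANISM": "",
    "PRIMARY_CONTIGS": ["2L", "2R"],
    "NON_NUCLEAR_CONTIGS": [],
}
print(validate_config(example))  # []
```

Collecting all problems in a list, rather than stopping at the first one, lets you fix a configuration file in a single pass.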

Advanced: file structure in a reference

A single species reference compatible with the Cell Ranger ATAC pipelines has the following file structure:

$ tree /home/jdoe/ref
/home/jdoe/ref
├── fasta
│   ├── contig-defs.json    [required, input]
│   ├── genome.fa           [required, input, for pre-built references, sources: NCBI]
│   ├── genome.fa.amb       [required, derived from genome.fa using samtools faidx, bwa, pysam]
│   ├── genome.fa.ann       [required, derived from genome.fa using samtools faidx, bwa, pysam]
│   ├── genome.fa.bwt       [required, derived from genome.fa using samtools faidx, bwa, pysam]
│   ├── genome.fa.fai       [required, derived from genome.fa using samtools faidx, bwa, pysam]
│   ├── genome.fa.flat      [required, derived from genome.fa using samtools faidx, bwa, pysam]
│   ├── genome.fa.gdx       [required, derived from genome.fa using samtools faidx, bwa, pysam]
│   ├── genome.fa.pac       [required, derived from genome.fa using samtools faidx, bwa, pysam]
│   └── genome.fa.sa        [required, derived from genome.fa using samtools faidx, bwa, pysam]
├── genes
│   ├── genes.gtf           [required, input, GENCODE sources for pre-built references: hg19, b37, GRCh38 and mm10]
│   └── regulatory.gff      [pre-built references only, Ensembl sources: hg19, b37, GRCh38 and mm10]
├── genome                  [required, input]
├── metadata.json           [required, input]
└── regions
    ├── blacklist.bed       [pre-built references only, ENCODE sources: hg19, b37, GRCh38, mm10]
    ├── ctcf.bed            [pre-built references only]
    ├── dnase.bed           [pre-built references only, ENCODE sources: hg19, b37, mm10, Anshul Kundaje's pipeline: GRCh38]
    ├── enhancer.bed        [pre-built references only, source: Ensembl regulatory build release 95]
    ├── promoter.bed        [pre-built references only, source: Ensembl regulatory build release 95]
    ├── motifs.pfm          [optional, input, source for pre-built references: JASPAR vertebrate non-redundant collection] 
    ├── transcripts.bed     [required for 1.1 and later references, derived from transcript coordinates in genes.gtf]
    └── tss.bed             [required, derived from first nt position of each transcript in genes.gtf]
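
The tss.bed entry in the listing above is described as derived from the first nucleotide position of each transcript in genes.gtf. A rough Python sketch of that derivation for GTF input follows. Several details are assumptions rather than documented behavior: the output tuple layout (contig, 0-based start, end, gene name, strand), taking the transcript end as the first nucleotide on the minus strand, and de-duplicating identical positions. The function name and the sample rows are hypothetical, and GFF3 "mRNA" rows with a "Name" attribute are not handled.

```python
import re
from typing import Iterable, List, Tuple

_GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def tss_from_gtf(lines: Iterable[str]) -> List[Tuple[str, int, int, str, str]]:
    """Collect unique TSS records from the "transcript" rows of a GTF file."""
    seen = set()
    records = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        match = _GENE_NAME.search(fields[8])
        if match is None:
            raise ValueError("transcript row without a gene_name attribute")
        start, end, strand = int(fields[3]), int(fields[4]), fields[6]
        # GTF coordinates are 1-based and inclusive; the first transcribed
        # nucleotide is the start on "+" and the end on "-".
        first_nt = end if strand == "-" else start
        record = (fields[0], first_nt - 1, first_nt, match.group(1), strand)
        if record not in seen:
            seen.add(record)
            records.append(record)
    return records


# Hypothetical rows: two transcripts of geneA share one TSS.
rows = [
    "#!example header\n",
    '2L\texample\ttranscript\t100\t500\t.\t+\t.\tgene_id "g1"; gene_name "geneA";\n',
    '2L\texample\ttranscript\t100\t800\t.\t+\t.\tgene_id "g1"; gene_name "geneA";\n',
    'X\texample\ttranscript\t200\t900\t.\t-\t.\tgene_id "g2"; gene_name "geneB";\n',
    'X\texample\texon\t200\t300\t.\t-\t.\tgene_id "g2"; gene_name "geneB";\n',
]
for record in tss_from_gtf(rows):
    print(*record, sep="\t")
```

Transcripts that share a start position collapse into one record here, which is one plausible reason for the mkref log reporting fewer unique TSS than unique transcripts.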

The files marked as required above are the minimal set needed to create a directory structure compatible with the Cell Ranger ATAC pipelines. Some required files are specified as inputs in the config file described in the Configuration file section. Other required files are derived by processing a required input file. The regulatory and functional domain files, such as promoter.bed, are present only in the pre-built references. The transcripts.bed file is a derived file that is not present in 1.0 references, but the 1.2.0 pipelines are backwards compatible with 1.0 references. Note that mkref recognizes four keywords (hg19, b37, mm10, GRCh38); running cellranger-atac mkref with one of these keywords as GENOME creates the corresponding pre-built reference.
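
To confirm that a directory contains this minimal set, a small Python check along the following lines may help. The file list is transcribed from the entries marked "required" in the listing. The function name and the require_transcripts switch are hypothetical, and the check only tests that each path exists, not that its contents are valid.

```python
import tempfile
from pathlib import Path
from typing import List

# Paths marked "required" in the reference layout, relative to the reference root.
REQUIRED_FILES = [
    "fasta/contig-defs.json",
    "fasta/genome.fa",
    "fasta/genome.fa.amb",
    "fasta/genome.fa.ann",
    "fasta/genome.fa.bwt",
    "fasta/genome.fa.fai",
    "fasta/genome.fa.flat",
    "fasta/genome.fa.gdx",
    "fasta/genome.fa.pac",
    "fasta/genome.fa.sa",
    "genes/genes.gtf",
    "genome",
    "metadata.json",
    "regions/tss.bed",
]


def missing_reference_files(ref_dir: str, require_transcripts: bool = True) -> List[str]:
    """Return the required paths that do not exist under ref_dir."""
    required = list(REQUIRED_FILES)
    if require_transcripts:
        # transcripts.bed is required for 1.1 and later references only.
        required.append("regions/transcripts.bed")
    root = Path(ref_dir)
    return [rel for rel in required if not (root / rel).exists()]


# Demonstration on an empty temporary directory: every required path is missing.
with tempfile.TemporaryDirectory() as empty_dir:
    print(len(missing_reference_files(empty_dir)))  # 15
```

Passing require_transcripts=False corresponds to checking a 1.0 reference, which lacks transcripts.bed but remains usable with the 1.2.0 pipelines.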
