Cell Ranger 5.0, printed on 11/05/2024
The cellranger workflow starts by demultiplexing the Illumina sequencer's base call files (BCLs) for each flowcell directory into FASTQ files. 10x has developed cellranger mkfastq, a pipeline that wraps Illumina's bcl2fastq and provides a number of convenient features on top of it, including support for bcl2fastq arguments such as --use-bases-mask.
cellranger mkfastq supports single-indexed and dual-indexed flowcells. It selects the appropriate mode depending on the sample indexes used, and enables index-hopping filtering automatically for dual-indexed flowcells. For example, with the Dual Index Kit TT Set A, well A1 can be specified in the samplesheet as SI-TT-A1, and cellranger mkfastq will recognize the i7 and i5 indices as GTAACATGCG and AGTGTTACCT, respectively. Similarly, for Single Index Kit T Set A, well A1 can be specified in the samplesheet as SI-GA-A1, and cellranger mkfastq will recognize the four i7 indexes GGTTTACT, CTAAACGG, TCGGCGTC, and AACCGTAA and merge the resulting FASTQ files.
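The translation from a sample index name to its oligo sequences can be pictured as a table lookup. The sketch below is only an illustration: it holds just the two wells quoted above, and the names SAMPLE_INDEX_SETS and describe_index are hypothetical, not part of cellranger.

```python
# Illustrative lookup containing only the two wells quoted in the text.
# cellranger ships its own complete index tables; this is not one of them.
SAMPLE_INDEX_SETS = {
    "SI-TT-A1": {"i7": ["GTAACATGCG"], "i5": ["AGTGTTACCT"]},
    "SI-GA-A1": {"i7": ["GGTTTACT", "CTAAACGG", "TCGGCGTC", "AACCGTAA"], "i5": []},
}


def describe_index(name):
    """Return (mode, i7 oligos, i5 oligos) for a known sample index name."""
    entry = SAMPLE_INDEX_SETS[name]
    # A set with an i5 oligo is dual-indexed; otherwise it is single-indexed.
    mode = "dual" if entry["i5"] else "single"
    return mode, entry["i7"], entry["i5"]
```

For a single-index set, the FASTQ files from all four i7 oligos belong to the same sample, which is why they are merged.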
In this example, we have two 10x libraries (each processed through a separate Chromium chip channel) that are multiplexed on a single flowcell. Note that after running cellranger mkfastq, we run a separate instance of the pipeline on each library:
In this example, we have one 10x library sequenced on two flowcells. Note that after running cellranger mkfastq, we run a single instance of the pipeline on all the FASTQ files generated:
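The two workflows differ only in how many downstream runs are launched and how many FASTQ folders each run reads. The sketch below builds the downstream command lines without executing them. It assumes the downstream pipeline is cellranger count and uses its --id, --transcriptome, --fastqs, and --sample arguments; of these, only --sample is mentioned on this page. The helper name and all paths are hypothetical.

```python
def count_commands(samples, fastq_dirs, transcriptome):
    """Build one 'cellranger count' argument list per library.

    samples: sample names as given to mkfastq (one per 10x library).
    fastq_dirs: mkfastq output folders, one per flowcell.
    transcriptome: path to a reference (hypothetical placeholder).
    """
    # Several flowcells are passed to one run as a comma-separated list.
    fastqs = ",".join(fastq_dirs)
    return [
        [
            "cellranger", "count",
            f"--id={sample}",
            f"--transcriptome={transcriptome}",
            f"--fastqs={fastqs}",
            f"--sample={sample}",
        ]
        for sample in samples
    ]


# Two libraries on one flowcell: two separate runs.
two_libraries = count_commands(
    ["lib1", "lib2"], ["/path/to/flowcell1/outs/fastq_path"], "/path/to/refdata"
)

# One library on two flowcells: a single run over both FASTQ folders.
two_flowcells = count_commands(
    ["lib1"],
    ["/path/to/flowcell1/outs/fastq_path", "/path/to/flowcell2/outs/fastq_path"],
    "/path/to/refdata",
)
```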
cellranger mkfastq accepts additional options beyond those shown in the table below because it is a wrapper around bcl2fastq. Consult the User Guide for Illumina's bcl2fastq for more information.
Parameter | Function |
---|---|
--run | (Required) The path of Illumina BCL run folder. |
--id | (Optional; defaults to the name of the flowcell referred to by --run) Name of the folder created by mkfastq. |
--samplesheet | (Optional) Path to an Illumina Experiment Manager-compatible sample sheet which contains 10x sample index names (e.g., SI-GA-A1 or SI-TT-A12) in the sample index column. All other information, such as sample names and lanes, should be in the sample sheet. |
--sample-sheet | (Optional) Equivalent to --samplesheet above. |
--csv | (Optional) Path to a simple CSV with lane, sample, and index columns, which describe the way to demultiplex the flowcell. The index column should contain a 10x sample dual-index name (e.g., SI-TT-A12). This is an alternative to the Illumina IEM sample sheet, and will be ignored if --samplesheet is specified. |
--simple-csv | (Optional) Equivalent to --csv above. |
--filter-dual-index | (Optional) Only demultiplex samples identified by i7/i5 dual-indices (e.g., SI-TT-A6), ignoring single-index samples. Single-index samples will not be demultiplexed. |
--qc | (Optional) Calculate both sequencing and 10x-specific metrics, including per-sample barcode matching rate. Will not be performed unless this flag is specified. Not supported for NovaSeq flow cells. |
--lanes | (bcl2fastq option) Comma-delimited series of lanes to demultiplex (e.g., 1,3). Use this if you have a sample sheet for an entire flowcell but only want to generate a few lanes for further 10x analysis. |
--use-bases-mask | (bcl2fastq option) Same meaning as for bcl2fastq. Use to clip extra bases off a read if you ran extra cycles for QC. |
--delete-undetermined | (bcl2fastq option) Delete the Undetermined FASTQs generated by bcl2fastq. Useful if you are demultiplexing a small number of samples from a large flowcell. |
--output-dir | (bcl2fastq option) Generate FASTQ output in a path of your own choosing, instead of flowcell_id/outs/fastq_path. |
--project | (bcl2fastq option) Custom project name, to override the samplesheet or to use in conjunction with the --csv argument. |
--jobmode | (Martian option) Job manager to use. Valid options: local (default), sge, lsf, or a .template file. |
--localcores | (Martian option) Set max cores the pipeline may request at one time. Only applies when --jobmode=local. |
--localmem | (Martian option) Set max GB the pipeline may request at one time. Only applies when --jobmode=local. |
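When mkfastq is launched from a script, it can help to assemble the argument list in one place. The following is a minimal sketch using only options from the table above; the function name mkfastq_command is hypothetical and the resource values are placeholders. Because --csv is ignored when --samplesheet is also given, the sketch rejects that combination rather than letting one input be silently dropped.

```python
import shlex


def mkfastq_command(run, run_id=None, csv=None, samplesheet=None, lanes=None,
                    qc=False, delete_undetermined=False,
                    localcores=None, localmem=None):
    """Assemble a 'cellranger mkfastq' argument list (nothing is executed)."""
    if csv and samplesheet:
        # mkfastq would ignore --csv here; fail loudly instead.
        raise ValueError("pass either csv or samplesheet, not both")
    argv = ["cellranger", "mkfastq", f"--run={run}"]
    if run_id:
        argv.append(f"--id={run_id}")
    if samplesheet:
        argv.append(f"--samplesheet={samplesheet}")
    if csv:
        argv.append(f"--csv={csv}")
    if lanes:
        # --lanes takes a comma-delimited series, e.g. 1,3
        argv.append("--lanes=" + ",".join(str(lane) for lane in lanes))
    if qc:
        argv.append("--qc")
    if delete_undetermined:
        argv.append("--delete-undetermined")
    if localcores is not None:
        argv.append(f"--localcores={localcores}")
    if localmem is not None:
        argv.append(f"--localmem={localmem}")
    return argv


argv = mkfastq_command(
    run="/path/to/tiny_bcl",
    run_id="tiny-bcl",
    csv="cellranger-tiny-bcl-simple-1.2.0.csv",
    lanes=[1, 3],
    localcores=8,   # placeholder value
    localmem=32,    # placeholder value, in GB
)
print(shlex.join(argv))  # shlex.join requires Python 3.8 or later
```

The returned list can be handed to subprocess.run(argv, check=True) once the command looks right.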
cellranger mkfastq recognizes two file formats for describing samples: a simple, three-column CSV format, and the Illumina Experiment Manager (IEM) sample sheet format used by bcl2fastq. There is an example below for running mkfastq with each format.
To follow along, you need the example sequencing run unpacked into a tiny-bcl subdirectory, together with the example sample sheets referenced below.
A simple CSV samplesheet is recommended for most sequencing experiments. The simple CSV format has only three columns (Lane, Sample, Index), and is thus less prone to formatting errors. You can see an example of this in cellranger-tiny-bcl-simple-1.2.0.csv:

    Lane,Sample,Index
    1,test_sample,SI-TT-D9

Here are the options for each column:
Column | Description |
---|---|
Lane | Which lane(s) of the flowcell to process. Can be either a single lane, a range (e.g., 2-4) or '*' for all lanes in the flowcell. |
Sample | The name of the sample. This name is the prefix to all the generated FASTQs, and corresponds to the --sample argument in all downstream 10x pipelines. Sample names must conform to the Illumina bcl2fastq naming requirements. Only letters, numbers, underscores and hyphens are allowed; no other symbols, including dots ("."), are allowed. |
Index | The 10x sample index that was used in library construction, e.g., SI-TT-D9 or SI-GA-A1. |
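The column rules above are easy to check before handing a file to mkfastq. The sketch below writes a simple layout CSV and rejects lane specifications and sample names that break the stated rules. The function name simple_csv and the two regular expressions are this example's own and encode only what the table says; the Index value is not checked against any index list.

```python
import csv
import io
import re

# Letters, numbers, underscores and hyphens only (no dots or other symbols).
SAMPLE_NAME = re.compile(r"[A-Za-z0-9_-]+")
# A single lane (3), a range (2-4), or * for all lanes.
LANE_SPEC = re.compile(r"\*|\d+(-\d+)?")


def simple_csv(rows):
    """Return simple-layout CSV text for (lane, sample, index) rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Lane", "Sample", "Index"])
    for lane, sample, index in rows:
        if not LANE_SPEC.fullmatch(str(lane)):
            raise ValueError(f"invalid lane specification: {lane!r}")
        if not SAMPLE_NAME.fullmatch(sample):
            raise ValueError(f"invalid sample name: {sample!r}")
        writer.writerow([lane, sample, index])
    return buffer.getvalue()


text = simple_csv([(1, "test_sample", "SI-TT-D9")])
```

Writing text to a file and passing its path to --csv reproduces the layout of the example file shown earlier.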
To run mkfastq with a simple layout CSV, use the --csv argument.
Here's how to run mkfastq on the tiny-bcl sequencing run with the simple layout:

    $ cellranger mkfastq --id=tiny-bcl \
                         --run=/path/to/tiny_bcl \
                         --csv=cellranger-tiny-bcl-simple-1.2.0.csv

    cellranger mkfastq
    Copyright (c) 2019 10x Genomics, Inc. All rights reserved.
    -------------------------------------------------------------------------------

    Martian Runtime - 5.0.1-v4.0.2

    Running preflight checks (please wait)...
    2019-11-14 16:33:54 [runtime] (ready) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
    2019-11-14 16:33:57 [runtime] (split_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
    2019-11-14 16:33:57 [runtime] (run:local) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET.fork0.chnk0.main
    2019-11-14 16:34:00 [runtime] (chunks_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
    ...

The cellranger mkfastq pipeline can also be run with a samplesheet in the Illumina Experiment Manager (IEM) format. If you didn't sequence with sample indices, you'll need to use this format. Briefly look at cellranger-tiny-bcl-samplesheet-1.2.0.csv before running the pipeline. You will see a number of fields specific to running on Illumina platforms, and then a [Data] section. That section is where you put your sample, lane and index information. Here's a dual-indexing example:

    [Data]
    Lane,Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
    1,s1,test_sample,,,SI-TT-D9,SI-TT-D9,SI-TT-D9,SI-TT-D9,p1,

Here, SI-TT-D9 refers to a 10x dual-index sample index.
In this example, only reads from lane 1 will be used. To demultiplex the given sample index across all lanes, omit the Lane column entirely.
Here's a single-indexing example:

    [Data]
    Lane,Sample_ID,index,Sample_Project
    1,Sample1,SI-GA-A3,tiny-bcl

Here, SI-GA-A3 refers to a 10x single-index sample index, a set of four oligo sequences. cellranger mkfastq also supports listing oligo sequences explicitly.
Sample names must conform to the Illumina bcl2fastq naming requirements. Specifically, only letters, numbers, underscores and hyphens are allowed. No other symbols, including dots ("."), are allowed.
Also note that while an authentic IEM sample sheet will contain other sections above the [Data] section, these are optional for demultiplexing. For demultiplexing an existing run with cellranger mkfastq, only the [Data] section is required.
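Since only the [Data] section matters for demultiplexing, a quick way to inspect a sample sheet is to skip everything before that section and read the rest as CSV. The sketch below does this with the standard library. The function name read_data_section is hypothetical, and the parsing rules (a section starts at a line beginning with "[", blank lines are skipped) are assumptions about typical IEM files, not a description of how cellranger reads them.

```python
import csv


def read_data_section(text):
    """Return the rows of the [Data] section as a list of dicts."""
    lines = text.splitlines()
    start = None
    for i, line in enumerate(lines):
        # IEM files may pad section headers with trailing commas: "[Data],,,"
        if line.strip().startswith("[Data]"):
            start = i
            break
    if start is None:
        raise ValueError("no [Data] section found")
    body = []
    for line in lines[start + 1:]:
        if line.startswith("["):  # next section begins
            break
        if line.strip():
            body.append(line)
    # The first remaining line is the column header row.
    return list(csv.DictReader(body))


sheet = """[Data]
Lane,Sample_ID,index,Sample_Project
1,Sample1,SI-GA-A3,tiny-bcl
"""
rows = read_data_section(sheet)
```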
Next, run the cellranger mkfastq pipeline, using the --samplesheet argument:

    $ cellranger mkfastq --id=tiny-bcl \
                         --run=/path/to/tiny_bcl \
                         --samplesheet=cellranger-tiny-bcl-samplesheet-1.2.0.csv

    cellranger mkfastq
    Copyright (c) 2019 10x Genomics, Inc. All rights reserved.
    -------------------------------------------------------------------------------

    Martian Runtime - 5.0.1-v4.0.2

    Running preflight checks (please wait)...
    2019-11-14 16:35:49 [runtime] (ready) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
    2019-11-14 16:35:52 [runtime] (split_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
    2019-11-14 16:35:52 [runtime] (run:local) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET.fork0.chnk0.main
    2019-11-14 16:35:58 [runtime] (chunks_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
    ...

If you encounter any preflight errors, refer to the Troubleshooting page.
Once the cellranger mkfastq pipeline has successfully completed, the output can be found in a new folder named with the value you provided to cellranger mkfastq in the --id option (if not specified, defaults to the name of the flowcell):

    $ ls -l
    drwxr-xr-x 4 jdoe jdoe 4096 Nov 14 12:05 tiny-bcl

The key output files can be found in outs/fastq_path, and are organized in the same manner as a conventional bcl2fastq run:

    $ ls -l tiny-bcl/outs/fastq_path/
    drwxr-xr-x 3 jdoe jdoe         3 Nov 14 12:26 Reports
    drwxr-xr-x 2 jdoe jdoe         8 Nov 14 12:26 Stats
    drwxr-xr-x 3 jdoe jdoe         3 Nov 14 12:26 tiny-bcl
    -rw-r--r-- 1 jdoe jdoe  20615106 Nov 14 12:26 Undetermined_S0_L001_I1_001.fastq.gz
    -rw-r--r-- 1 jdoe jdoe  20615106 Nov 14 12:26 Undetermined_S0_L001_I2_001.fastq.gz
    -rw-r--r-- 1 jdoe jdoe  51499694 Nov 14 12:26 Undetermined_S0_L001_R1_001.fastq.gz
    -rw-r--r-- 1 jdoe jdoe 152692701 Nov 14 12:26 Undetermined_S0_L001_R2_001.fastq.gz

    $ tree tiny-bcl/outs/fastq_path/tiny-bcl/
    tiny-bcl/outs/fastq_path/tiny-bcl/
      Sample1
        Sample1_S1_L001_I1_001.fastq.gz
        Sample1_S1_L001_I2_001.fastq.gz
        Sample1_S1_L001_R1_001.fastq.gz
        Sample1_S1_L001_R2_001.fastq.gz

This example was produced with a sample sheet that included "tiny-bcl" as the Sample_Project, so the directory containing the sample folders is named tiny-bcl. If a Sample_Project wasn't specified, or if a simple layout CSV file was used (which does not have a Sample_Project column), the directory containing the sample folders would be named according to the flow cell ID instead.
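Because the output follows the bcl2fastq naming layout, FASTQ files can be grouped by sample from their file names alone. The sketch below is one way to do that, assuming names of the form seen in the listing above (sample, S number, L plus a three-digit lane, R1/R2/I1/I2, then 001.fastq.gz). The pattern FASTQ_NAME and the function group_fastqs are this example's own, not part of cellranger.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches names such as Sample1_S1_L001_R1_001.fastq.gz
FASTQ_NAME = re.compile(
    r"(?P<sample>.+)_S(?P<number>\d+)_L(?P<lane>\d{3})_(?P<read>[RI][12])_001\.fastq\.gz"
)


def group_fastqs(paths):
    """Map sample name -> {(lane, read): path} for recognizable FASTQ files."""
    grouped = defaultdict(dict)
    for path in paths:
        match = FASTQ_NAME.fullmatch(Path(path).name)
        if match is None:
            continue  # not a FASTQ in the expected naming scheme
        key = (int(match["lane"]), match["read"])
        grouped[match["sample"]][key] = str(path)
    return dict(grouped)


example = group_fastqs([
    "tiny-bcl/outs/fastq_path/tiny-bcl/Sample1/Sample1_S1_L001_R1_001.fastq.gz",
    "tiny-bcl/outs/fastq_path/tiny-bcl/Sample1/Sample1_S1_L001_R2_001.fastq.gz",
    "tiny-bcl/outs/fastq_path/Undetermined_S0_L001_R1_001.fastq.gz",
])
```

On a real run, the input list could come from Path("tiny-bcl/outs/fastq_path").rglob("*.fastq.gz").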
If you want to remove the Undetermined FASTQs from the output to save space, you can run mkfastq with the --delete-undetermined flag. To see all cellranger mkfastq options, run cellranger mkfastq --help.
When the --qc flag is specified, the cellranger mkfastq pipeline writes both sequencing and 10x-specific quality control metrics into a JSON file. The metrics are in the outs/qc_summary.json file.
The --qc flag is not supported on NovaSeq flow cells.
The qc_summary.json file contains a number of useful metrics. The sample_qc key is a good place to start exploring your data.

    "sample_qc": {
      "Sample1": {
        "5": {
          "barcode_exact_match_ratio": 0.9336158258904611,
          "barcode_q30_base_ratio": 0.9611993091728814,
          "bc_on_whitelist": 0.9447542078230667,
          "mean_barcode_qscore": 37.770630795934,
          "number_reads": 2748155,
          "read1_q30_base_ratio": 0.8947676653366835,
          "read2_q30_base_ratio": 0.7771883245304577
        },
        "all": {
          "barcode_exact_match_ratio": 0.9336158258904611,
          "barcode_q30_base_ratio": 0.9611993091728814,
          "bc_on_whitelist": 0.9447542078230667,
          "mean_barcode_qscore": 37.770630795934,
          "number_reads": 2748155,
          "read1_q30_base_ratio": 0.8947676653366835,
          "read2_q30_base_ratio": 0.7771883245304577
        }
      }
    }

The sample_qc value holds one entry for each sample in the sample sheet. Each entry contains one metrics structure per lane, keyed by lane, plus an 'all' structure that covers the case where a sample spans multiple lanes.
The metrics are as follows:
Key | Meaning |
---|---|
barcode_exact_match_ratio | The fraction of barcode sequences that exactly match a whitelisted 10x barcode. |
barcode_q30_base_ratio | The fraction of barcode bases at or above Q30. |
bc_on_whitelist | The fraction of barcode sequences that match a 10x barcode on the whitelist, post error-correction. Corresponds to the "Valid Barcodes" value in cellranger output metrics. |
mean_barcode_qscore | Mean quality score of barcode bases. |
number_reads | Reads per lane matching the sample's sample index (or overall in 'all'). |
read1_q30_base_ratio | The fraction of R1 bases at or above Q30. |
read2_q30_base_ratio | The fraction of R2 bases at or above Q30. |
By looking at this output, you can diagnose low barcode mapping rates and read quality before running a cellranger pipeline.
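This check can be scripted. The sketch below walks the sample_qc structure and reports every sample and lane whose chosen metric falls below a cutoff. The function names and the default threshold of 0.9 are illustrative assumptions; this page does not define pass or fail values for any metric.

```python
import json


def load_qc_summary(path):
    """Read outs/qc_summary.json into a dictionary."""
    with open(path) as handle:
        return json.load(handle)


def low_quality_lanes(qc_summary, key="bc_on_whitelist", threshold=0.9):
    """Return (sample, lane, value) for each entry below the threshold.

    The threshold is a placeholder chosen for this example, not a
    recommendation from the pipeline documentation.
    """
    flagged = []
    for sample, lanes in qc_summary["sample_qc"].items():
        for lane, metrics in lanes.items():  # lane keys include "all"
            if metrics[key] < threshold:
                flagged.append((sample, lane, metrics[key]))
    return flagged
```

For example, low_quality_lanes(load_qc_summary("tiny-bcl/outs/qc_summary.json"), key="read2_q30_base_ratio", threshold=0.8) would list the lanes whose R2 Q30 fraction is under 0.8.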
Additional metrics in outs/qc_summary.json include per-cycle quality metrics, yield, cluster density and percent passing filter, and both cellranger and bcl2fastq version information.
If you encounter a crash while running cellranger mkfastq, upload the tarball (with the extension .mri.tgz) in your output directory:
cellranger upload [email protected] jobid.mri.tgz
where jobid is what you input into the --id option of mkfastq (if not specified, defaults to the ID of the flowcell). This tarball contains numerous diagnostic logs that we can use for debugging.
You will receive an automated email from 10x Genomics. If not, email [email protected]. For the fastest service, respond with the following: