Cell Ranger ATAC 2.0, printed on 12/26/2024
The cellranger-atac pipeline performs cell calling, in which it
determines whether each barcode is a cell of any species included in the
reference. Based on mapping information, the pipeline also provides QC
information associated with the fragments per barcode. Additionally, the
pipeline computes the ATAC signal per barcode, captured by various targeting
metrics such as the number of fragments overlapping transcription start sites (TSS)
annotated in the reference package. All of this per-barcode information is
collated and produced in a single output table: singlecell.csv.
The structure and contents of singlecell.csv from a single species analysis are shown below:
    $ cd /home/jdoe/runs/sample345/outs
    $ head -5 singlecell.csv
    barcode,total,duplicate,chimeric,unmapped,lowmapq,mitochondrial,nonprimary,passed_filters,is_cell_barcode,excluded_reason,TSS_fragments,DNase_sensitive_region_fragments,enhancer_region_fragments,promoter_region_fragments,on_target_fragments,blacklist_region_fragments,peak_region_fragments,peak_region_cutsites
    NO_BARCODE,986507,102223,401,118334,63547,1882,324,699796,0,0,0,0,0,0,0,0,0,0
    AAACGAAAGAAAGGGT-1,8,0,0,5,0,0,0,3,0,0,1,0,0,0,1,0,1,2
    AAACGAAAGAAATACC-1,7,2,0,3,0,0,0,2,0,2,0,0,0,0,0,0,0,0
    AAACGAAAGAAATGGG-1,10,4,0,1,1,0,0,4,0,0,0,0,0,0,0,0,1,2
The table contains many columns, including the primary barcode column, which
lists all the barcodes in the dataset. The NO_BARCODE row contains a
summary of fragments that are not associated with any whitelisted barcodes. It
usually forms a small fraction of all reads.
Column | Type | Description | Pipeline specific changes | Reference specific changes |
---|---|---|---|---|
barcode | key | barcodes present in input data | | |
total | sequencing | total read-pairs | absent in aggr, reanalyze | |
duplicate | mapping | number of duplicate read-pairs | | |
chimeric | mapping | number of chimerically mapped read-pairs | absent in aggr, reanalyze | |
unmapped | mapping | number of read-pairs with at least one end not mapped | absent in aggr, reanalyze | |
lowmapq | mapping | number of read-pairs with <30 mapq on at least one end | absent in aggr, reanalyze | |
mitochondrial | mapping | number of read-pairs mapping to mitochondria and non-nuclear contigs | absent in aggr, reanalyze | |
nonprimary | mapping | number of reads that map to non-primary contigs | | |
passed_filters | mapping | number of non-duplicate, usable read-pairs, i.e. "fragments" | absent in aggr, reanalyze | for multi species, for example hg19 and mm10, expect additional columns: passed_filters_hg19 and passed_filters_mm10 |
is_cell_barcode | cell calling | binary indicator of whether barcode is associated with a cell | | for multi species, for example hg19 and mm10, expect columns is_hg19_cell_barcode and is_mm10_cell_barcode instead |
excluded_reason | cell calling | 0: barcode was not excluded; 1: barcode was excluded because it is a gel bead doublet; 2: barcode was excluded because it is low-targeting; 3: barcode was excluded because it is a barcode multiplet | | |
TSS_fragments | targeting | number of fragments overlapping with TSS regions | | |
DNase_sensitive_region_fragments | targeting | number of fragments overlapping with DNase sensitive regions | | for custom references or references missing the dnase.bed file, this count is 0 |
enhancer_region_fragments | targeting | number of fragments overlapping enhancer regions | | for custom references or references missing the enhancer.bed file, this count is 0 |
promoter_region_fragments | targeting | number of fragments overlapping promoter regions | | for custom references or references missing the promoter.bed file, this count is 0 |
on_target_fragments | targeting | number of fragments overlapping any of TSS, enhancer, promoter and DNase hypersensitivity sites (counted with multiplicity) | | for custom references or references having only the tss.bed file, this count is simply equal to TSS_fragments |
blacklist_region_fragments | targeting | number of fragments overlapping blacklisted regions | | |
peak_region_fragments | denovo targeting | number of fragments overlapping peaks | | for multi species, for example hg19 and mm10, expect additional columns: peak_region_fragments_hg19 and peak_region_fragments_mm10 |
peak_region_cutsites | denovo targeting | number of ends of fragments in peak regions | | |
Note that the number of columns and the column names themselves depend on which pipeline and which reference were used to generate the output file, as described in the last two columns of the table. The mapping type columns (whatever subset is present) sum to the total.
singlecell.csv can be loaded easily in Python as a pandas dataframe:
    import pandas as pd

    singlecell_file = "/home/jdoe/runs/sample345/outs/singlecell.csv"

    # load without index
    scdf = pd.read_csv(singlecell_file, sep=",")

    # load with barcode as index
    scdf2 = pd.read_csv(singlecell_file, sep=",", index_col="barcode")
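The relationship between the mapping type columns and total described above can be checked directly. The following is a minimal sketch, assuming pandas is installed; it embeds the sample rows shown earlier (trimmed to the relevant columns) instead of reading a file, and the names `MAPPING_COLUMNS`, `present`, and `mismatches` are illustrative rather than part of the pipeline.

```python
import io
import pandas as pd

# Rows from the example singlecell.csv shown earlier, trimmed to the
# barcode, total, and mapping type columns.
sample_csv = """barcode,total,duplicate,chimeric,unmapped,lowmapq,mitochondrial,nonprimary,passed_filters
NO_BARCODE,986507,102223,401,118334,63547,1882,324,699796
AAACGAAAGAAAGGGT-1,8,0,0,5,0,0,0,3
AAACGAAAGAAATACC-1,7,2,0,3,0,0,0,2
AAACGAAAGAAATGGG-1,10,4,0,1,1,0,0,4
"""
df = pd.read_csv(io.StringIO(sample_csv), index_col="barcode")

# Mapping type columns from the table; keep whichever subset is present.
MAPPING_COLUMNS = ["duplicate", "chimeric", "unmapped", "lowmapq",
                   "mitochondrial", "nonprimary", "passed_filters"]
present = [c for c in MAPPING_COLUMNS if c in df.columns]

# Sum the mapping columns per barcode and compare with the total column.
mapping_sum = df[present].sum(axis=1)
mismatches = df[mapping_sum != df["total"]]
print(f"{len(mismatches)} barcodes where mapping columns do not sum to total")
```

For the sample rows the check finds no mismatches. It only applies to outputs that still contain the total column, which the table lists as absent in aggr and reanalyze.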
You can use this file in many ways. Below are some examples:
Assume you are analyzing data from a single species library, such as hg19. To reproduce the targeting plot on the right side of the Targeting section of the web summary, you can do the following:
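    import matplotlib.pyplot as plt

    # cell barcodes vs. non-cell barcodes (excluding the NO_BARCODE summary row)
    cell_mask = scdf['is_cell_barcode'] == 1
    noncell_mask = (scdf['is_cell_barcode'] != 1) & (scdf['barcode'] != 'NO_BARCODE')

    # x: fragments per barcode, y: fraction of fragments overlapping peaks;
    # the '.' format draws points instead of connecting them with lines
    plt.plot(scdf[cell_mask]['passed_filters'], scdf[cell_mask]['peak_region_fragments'] / scdf[cell_mask]['passed_filters'], '.', c='b')
    plt.plot(scdf[noncell_mask]['passed_filters'], scdf[noncell_mask]['peak_region_fragments'] / scdf[noncell_mask]['passed_filters'], '.', c='r')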
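Beyond plotting, the same columns support quick tabular summaries. The sketch below, assuming pandas is installed, tallies the excluded_reason codes and computes the per-barcode fraction of fragments overlapping peaks and TSS regions. It embeds the sample rows shown earlier, trimmed to the columns it uses; the `EXCLUDED_REASONS` labels paraphrase the table descriptions and are not pipeline output.

```python
import io
import pandas as pd

# Rows from the example singlecell.csv shown earlier, trimmed to the columns used here.
sample_csv = """barcode,passed_filters,is_cell_barcode,excluded_reason,TSS_fragments,peak_region_fragments
NO_BARCODE,699796,0,0,0,0
AAACGAAAGAAAGGGT-1,3,0,0,1,1
AAACGAAAGAAATACC-1,2,0,2,0,0
AAACGAAAGAAATGGG-1,4,0,0,0,1
"""
df = pd.read_csv(io.StringIO(sample_csv), index_col="barcode")

# Drop the NO_BARCODE summary row so only real barcodes remain.
barcodes = df.drop(index="NO_BARCODE")

# Illustrative labels for the excluded_reason codes described in the table.
EXCLUDED_REASONS = {0: "not excluded", 1: "gel bead doublet",
                    2: "low-targeting", 3: "barcode multiplet"}
reason_counts = barcodes["excluded_reason"].map(EXCLUDED_REASONS).value_counts()
print(reason_counts)

# Fraction of each barcode's fragments that overlap peaks and TSS regions.
frac = barcodes[["peak_region_fragments", "TSS_fragments"]].div(
    barcodes["passed_filters"], axis=0)
print(frac.round(2))
```

On a full file, barcodes with zero passed_filters produce NaN fractions, so they may need to be filtered out first.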
The singlecell.csv file captures the cell calling information in the
is_{species}_cell_barcode field. The Cell Ranger ATAC aggr pipeline
requires you to specify the singlecell.csv as part of the aggr_csv
argument.
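Because the cell calling column is named is_cell_barcode for a single species and is_{species}_cell_barcode for multiple species, code that reads the called cells may need to handle both. The following is a rough sketch, assuming pandas is installed; the rows and the shortened barcodes are hypothetical values invented for illustration, and `cells_by_species` and the `single_species` key are names chosen here, not pipeline conventions.

```python
import io
import re
import pandas as pd

# Hypothetical rows for illustration only; not real pipeline output.
toy_csv = """barcode,is_hg19_cell_barcode,is_mm10_cell_barcode
NO_BARCODE,0,0
AAAC-1,1,0
AAAG-1,0,1
AAAT-1,0,0
"""
df = pd.read_csv(io.StringIO(toy_csv))

# Matches is_cell_barcode (single species) and is_{species}_cell_barcode.
pattern = re.compile(r"^is_(?:(?P<species>.+)_)?cell_barcode$")

cells_by_species = {}
for col in df.columns:
    m = pattern.match(col)
    if m:
        # Use a placeholder key when the column carries no species name.
        species = m.group("species") or "single_species"
        cells_by_species[species] = df.loc[df[col] == 1, "barcode"].tolist()

print(cells_by_species)
```

Matching on the column name pattern, rather than hard-coding a species, keeps the same code usable across references.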