Cell Ranger 2.0, printed on 11/05/2024
Each assembled contig in each cell is aligned against all of the germline segment reference sequences via Smith-Waterman.
First, the contig is aligned to all V reference sequences. The best match is found and the matching bases are masked from the contig. The same procedure is then repeated, one segment type at a time, for the D, J, C, and 5′ UTR reference sequences.
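The align-then-mask loop described above can be sketched in Python as follows. This is only an illustrative sketch, not Cell Ranger's implementation: the function names, the scoring values (match 2, mismatch -3, linear gap -4), the `min_score` cutoff, and the choice to mask the whole aligned span with `N` are all assumptions made for the example.

```python
def smith_waterman(query, ref, match=2, mismatch=-3, gap=-4):
    """Best local alignment of ref within query: (score, q_start, q_end)."""
    rows, cols = len(query) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    # start[i][j]: query index where the alignment ending at (i, j) begins.
    start = [[i] * cols for i in range(rows)]
    best = (0, 0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            q, r = query[i - 1], ref[j - 1]
            # Masked bases ("N") never count as a match.
            s = match if (q == r and q != "N") else mismatch
            cell = max(
                (score[i - 1][j - 1] + s, start[i - 1][j - 1]),  # (mis)match
                (score[i - 1][j] + gap, start[i - 1][j]),        # gap in ref
                (score[i][j - 1] + gap, start[i][j - 1]),        # gap in query
                key=lambda t: t[0],
            )
            if cell[0] > 0:
                score[i][j], start[i][j] = cell
            if score[i][j] > best[0]:
                best = (score[i][j], start[i][j], i)
    return best


def annotate_contig(contig, references, min_score=8,
                    order=("V", "D", "J", "C", "5UTR")):
    """Align segment types in order, masking each best hit before the next.

    references: {segment_type: {reference_name: sequence}} (hypothetical layout).
    Returns (hits, masked_contig); hits maps segment type to
    (reference_name, score, q_start, q_end).
    """
    masked = contig
    hits = {}
    for segment in order:
        best = None
        for name, ref in references.get(segment, {}).items():
            sc, q_start, q_end = smith_waterman(masked, ref)
            if sc >= min_score and (best is None or sc > best[1]):
                best = (name, sc, q_start, q_end)
        if best is not None:
            hits[segment] = best
            _, _, q_start, q_end = best
            # Mask the aligned span so later segments cannot reuse it.
            masked = masked[:q_start] + "N" * (q_end - q_start) + masked[q_end:]
    return hits, masked


contig = "TTTTACGTACGTGGGGCCATCCAA"
references = {
    "V": {"V1": "ACGTACGT", "V2": "ACGAACGT"},
    "J": {"J1": "CCATCC"},
}
hits, masked = annotate_contig(contig, references)
```

In this toy input, `V1` wins over `V2` for the V segment, its span is masked, and the J search then runs on the masked contig.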
Next, the CDR3 region is located in one of two ways. If the contig fully spans the L+V region, which contains the start codon, the CDR3 motif (Cys-FGXG/WGXG) is searched for in the reading frame set by that start codon. Otherwise, the CDR3 motif is searched for in all frames. A contig is labelled productive if it
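A minimal sketch of the frame-aware motif search described above, under several assumptions: only the forward strand is searched, the CDR3 is taken to run from the Cys through the F/W of the FGXG/WGXG motif, no length limits are applied, and no stop codon may fall inside the match. The names `translate`, `find_cdr3`, and `start_codon_pos` are hypothetical and chosen for this example only.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Cys, then any residues except a stop, then F or W followed by G-X-G.
# Group 1 is the CDR3 itself (Cys through F/W).
CDR3_MOTIF = re.compile(r"(C[^*]*?[FW])G[^*]G")


def translate(seq, frame):
    """Translate seq starting at offset frame; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def find_cdr3(contig, start_codon_pos=None):
    """Return the first CDR3 hit as a dict, or None.

    If start_codon_pos is given (contig spans L+V), only that reading
    frame is searched; otherwise all three forward frames are tried.
    """
    contig = contig.upper()
    frames = [start_codon_pos % 3] if start_codon_pos is not None else [0, 1, 2]
    for frame in frames:
        hit = CDR3_MOTIF.search(translate(contig, frame))
        if hit:
            nt_start = frame + 3 * hit.start(1)
            nt_end = frame + 3 * hit.end(1)
            return {
                "frame": frame,
                "cdr3_aa": hit.group(1),
                "cdr3_nt": contig[nt_start:nt_end],
            }
    return None


# Toy contig encoding M A C A S S F G T G.
seq = "ATGGCTTGTGCCAGCAGCTTTGGCACCGGC"
in_frame = find_cdr3(seq, start_codon_pos=0)
any_frame = find_cdr3("G" + seq)  # no start codon known: try all frames
```

Restricting the search to the start codon's frame avoids spurious motif hits in the other two frames when the frame is already known.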
Each cell-barcode is expected to contain one productive TRA and one productive TRB contig. Extra productive contigs produced by the assembler are less likely to be legitimate. For each chain, extra productive contigs with distinct CDR3s must have at least 2 UMIs to be considered confident. Extra productive contigs with only 1 UMI are considered low-confidence.
Additionally, extra productive contigs with the same CDR3 as an existing contig for that chain are considered low-confidence; these are likely induced by assembly artifacts.
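The confidence rules above can be written down as a small filter for the contigs of a single cell-barcode. The text does not say which productive contig counts as the expected one for a chain; this sketch assumes it is the contig with the most UMIs and always treats it as confident. The record layout (`chain`, `cdr3`, `umis`, `productive`) and the function name are also assumptions for illustration.

```python
from collections import defaultdict


def flag_confidence(contigs, min_umis=2):
    """Map contig index -> True (confident) / False (low-confidence).

    contigs: list of dicts with keys chain, cdr3, umis, productive,
    all from one cell-barcode. Non-productive contigs are not covered
    by the rules and are left out of the result.
    """
    by_chain = defaultdict(list)
    for idx, contig in enumerate(contigs):
        if contig["productive"]:
            by_chain[contig["chain"]].append(idx)

    flags = {}
    for indices in by_chain.values():
        # Assumption: the contig with the most UMIs is the expected one.
        indices.sort(key=lambda i: -contigs[i]["umis"])
        seen_cdr3s = set()
        for rank, idx in enumerate(indices):
            contig = contigs[idx]
            if rank == 0:
                confident = True
            elif contig["cdr3"] in seen_cdr3s:
                # Same CDR3 as an existing contig: likely assembly artifact.
                confident = False
            else:
                # Extra contig with a distinct CDR3 needs enough UMI support.
                confident = contig["umis"] >= min_umis
            seen_cdr3s.add(contig["cdr3"])
            flags[idx] = confident
    return flags


contigs = [
    {"chain": "TRA", "cdr3": "CAAA", "umis": 5, "productive": True},
    {"chain": "TRA", "cdr3": "CAAB", "umis": 2, "productive": True},
    {"chain": "TRA", "cdr3": "CAAC", "umis": 1, "productive": True},
    {"chain": "TRA", "cdr3": "CAAA", "umis": 3, "productive": True},
    {"chain": "TRB", "cdr3": "CBBB", "umis": 1, "productive": True},
    {"chain": "TRB", "cdr3": "CBBX", "umis": 4, "productive": False},
]
flags = flag_confidence(contigs)
```

Here the second TRA contig passes on UMI count, the third fails with a single UMI, and the fourth fails because its CDR3 duplicates the top contig's.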
Cell-barcodes are grouped together into clonotypes if they share a set of productive CDR3 nucleotide sequences by exact match.
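As a sketch of this grouping step, the example below keys each cell-barcode by the set of its productive CDR3 nucleotide sequences and collects barcodes with identical keys. It reads "share a set" as "have exactly the same set", pairs each CDR3 with its chain name, and skips barcodes with no productive CDR3; those choices, the input layout, and the function name are assumptions.

```python
from collections import defaultdict


def group_clonotypes(cells):
    """Group barcodes whose productive CDR3 nucleotide sets match exactly.

    cells: {barcode: iterable of (chain, cdr3_nt)} (hypothetical layout).
    Returns a list of (cdr3_set_as_sorted_tuple, barcodes), largest first.
    """
    groups = defaultdict(list)
    for barcode, cdr3s in cells.items():
        key = frozenset(cdr3s)  # order within a cell does not matter
        if key:                 # skip barcodes with no productive CDR3
            groups[key].append(barcode)
    clonotypes = [(tuple(sorted(key)), sorted(barcodes))
                  for key, barcodes in groups.items()]
    clonotypes.sort(key=lambda item: (-len(item[1]), item[0]))
    return clonotypes


cells = {
    "AAAC-1": [("TRA", "TGTGCC"), ("TRB", "TGTAGC")],
    "AAAG-1": [("TRB", "TGTAGC"), ("TRA", "TGTGCC")],
    "AAAT-1": [("TRB", "TGTAGC")],
    "AACC-1": [],
}
clonotypes = group_clonotypes(cells)
```

Because matching is exact, the barcode carrying only the TRB sequence forms its own group in this sketch rather than joining the two-chain group.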
For each clonotype and each CDR3, the contigs in all cells are assembled together to produce a clonotype consensus sequence.
Because this sequence is constructed using multiple cells, its accuracy is expected to be even higher than sequences constructed from a single cell.
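To illustrate why pooling cells improves accuracy, the sketch below uses a per-position majority vote over sequences that are assumed to be already aligned and of equal length. This is a deliberately simplified stand-in, not the assembly procedure described above; the function name is hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_seqs):
    """Per-position majority base across equal-length, pre-aligned sequences."""
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must have equal length")
    # zip(*seqs) yields one column of bases per position.
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*aligned_seqs))


# Two of the four per-cell sequences carry a single, different error.
per_cell = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAT"]
consensus = majority_consensus(per_cell)
```

An error present in one cell is outvoted at that position by the other cells, which is the intuition behind the higher expected accuracy of the multi-cell consensus.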