Statistical methods for adaptive immune receptor repertoire analysis and comparison
B and T cell receptors, also known as adaptive immune receptors, perform key roles in adaptive immunity.
These proteins recognize foreign invaders such as viruses and bacteria and help eliminate them, enabling robust and long-lasting immunological protection.
The DNA sequences coding for these receptors arise by a complex recombination process followed by a series of productivity-based filters, as well as affinity maturation for B cells, giving considerable diversity to the circulating pool of these sequences.
Thus, proper analysis of adaptive immune receptor repertoires (AIRR) as well as the immune context surrounding them presents a formidable but necessary challenge to computational biologists.
I will present work on several topics in AIRR sequence analysis with an emphasis on statistical methods for repertoire comparison.
BCR sequences diversify through mutations introduced by purpose-built cellular machinery.
A recent paper has concluded that templated mutagenesis, a hypothesized process in which mutations in the BCR locus are introduced by copying short segments from other BCR genes, is a major contributor to BCR diversification in mice and humans.
If true, this would overturn decades of research and methodology involving B cell diversification.
In joint work with Julia Fukuyama, I re-evaluate this hypothesis by applying the paper's method to potential template donor genes not present in B cell genomes to estimate the method's false positive rates (FPRs).
We find FPRs that are similar to or even higher than the rates originally inferred, leaving little to no evidence that templated mutagenesis occurs at a substantial rate.
As AIRR datasets are typically large and complex, it is non-trivial to characterize and compare them in precise yet interpretable ways.
I introduce a comprehensive summary statistic framework that efficiently performs a wide variety of biologically meaningful repertoire summaries and comparisons, and demonstrate how it can be used to perform general-purpose model validation.
We find that summaries vary in their ability to differentiate between datasets, although many can distinguish datasets that differ in certain covariates.
Further, we show that recombination-based statistics tend to be more discriminative characterizations of a repertoire than those describing the amino acid composition of the CDR3 region.
The framework also directly provides a convenient multidimensional scaling setup for visualizing dissimilarities between repertoires.
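As a rough illustration of the multidimensional scaling step, the sketch below applies classical (Torgerson) MDS to a matrix of pairwise repertoire dissimilarities. The summary vectors are made-up numbers, the Euclidean dissimilarity is an assumption, and `classical_mds` is a hypothetical helper name; none of this is taken from the framework itself.

```python
import numpy as np

def classical_mds(dissim, n_components=2):
    """Classical (Torgerson) MDS on a symmetric dissimilarity matrix."""
    d = np.asarray(dissim, dtype=float)
    n = d.shape[0]
    # Double-center the squared dissimilarities: B = -1/2 * J D^2 J.
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    # Eigendecompose the symmetric matrix and keep the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top_vals)

# Hypothetical per-repertoire summary vectors (one row per repertoire).
summaries = np.array([
    [0.0, 0.0],
    [3.0, 0.0],
    [0.0, 4.0],
    [3.0, 4.0],
])

# Assumed dissimilarity: Euclidean distance between summary vectors.
dissim = np.linalg.norm(summaries[:, None, :] - summaries[None, :, :], axis=-1)

# Two-dimensional coordinates, one row per repertoire, ready for plotting.
coords = classical_mds(dissim, n_components=2)
```

Each row of `coords` gives a point for one repertoire; repertoires with similar summaries land close together. Any other dissimilarity between repertoires could be substituted for the Euclidean one used here.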
Current methods of TCR repertoire comparison often incur a high loss of distributional information by considering overly simplistic sequence- or repertoire-level characteristics.
Optimal transport methods can be used to compare distributions given some distance or metric between values in the sample space, with appealing theoretical and computational properties.
For my final project, I aim to apply the Sinkhorn distance, a fast, contemporary optimal transport method, equipped with a recently created distance on the space of TCRs, to compare TCR repertoires.
I hope to establish a randomization test for estimating the significance of ``interesting'' regions of a repertoire, to be validated using a proxy null distribution based on biological replicates.
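To make the optimal transport comparison concrete, here is a minimal sketch of the entropy-regularized transport cost computed with Sinkhorn's scaling iterations. The Hamming distance is only a stand-in for the TCR distance mentioned above, the sequences are invented, and the function names and the `reg` and `n_iter` values are illustrative assumptions. The randomization test is not shown.

```python
import numpy as np

def hamming(a, b):
    """Stand-in sequence distance: mismatches between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def pairwise_cost(seqs_a, seqs_b):
    """Cost matrix of distances between every pair of sequences."""
    return np.array([[hamming(x, y) for y in seqs_b] for x in seqs_a], dtype=float)

def sinkhorn_cost(weights_a, weights_b, cost, reg=0.5, n_iter=500):
    """Entropy-regularized transport cost via Sinkhorn's matrix scaling."""
    a = np.asarray(weights_a, dtype=float)
    b = np.asarray(weights_b, dtype=float)
    kernel = np.exp(-cost / reg)  # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        # Alternately rescale so the plan matches each marginal.
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]
    return float(np.sum(plan * cost)), plan

# Invented toy repertoires with uniform weights over their sequences.
rep_a = ["CASSLG", "CASSPG", "CASRLG"]
rep_b = ["CASSLG", "CATTPG", "CQQRLG"]
w_a = np.full(len(rep_a), 1.0 / len(rep_a))
w_b = np.full(len(rep_b), 1.0 / len(rep_b))

cost_ab, plan_ab = sinkhorn_cost(w_a, w_b, pairwise_cost(rep_a, rep_b))
cost_aa, plan_aa = sinkhorn_cost(w_a, w_a, pairwise_cost(rep_a, rep_a))
```

The transport plan `plan_ab` records how much mass moves between each pair of sequences, which is the kind of distributional detail that scalar repertoire summaries discard. Because of the entropy term, the cost of a repertoire against itself is small but not exactly zero.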