

The development of next-generation sequencing technology has enabled systems immunology researchers to conduct detailed immune repertoire analysis at the molecular level. Large sequence datasets (e.g., millions of sequences) are being collected to comprehensively understand how a patient's immune system evolves over different stages of disease development. A recent study has shown that the hierarchical clustering (HC) algorithm gives the best results for B-cell clone analysis, an important type of immune repertoire sequencing (IR-Seq) analysis. However, due to its inherent complexity, the classical hierarchical clustering algorithm does not scale well to large sequence datasets. Surprisingly, no algorithms have been developed to address this scalability issue for immunology research.

In this thesis, we study two different strategies, aiming to find the best scalable methods that preserve the quality of the hierarchical clustering structure: (1) non-Euclidean indexing methods for speeding up classical hierarchical clustering, and (2) a new tree-based sequence summarization approach, SCT, that scans the large sequence dataset once and generates summaries for hierarchical clustering. For comparative analysis, we also experimented with a Spark-based minimum-spanning-tree algorithm (SparkMST) that generates a result equivalent to single-linkage hierarchical clustering (SLINK).

We have implemented all these algorithms and experimented with real sequence datasets for B-cell clone analysis. The results show that (1) the indexing-enhanced HC (e.g., using the Vantage-Point tree for indexing) preserves the clustering quality very well while also significantly reducing the time complexity of the original HC; (2) SCT with HC is the fastest approximate HC method, at a slight cost in quality; and (3) SparkMST scales out satisfactorily and gives a significant performance gain on a large Spark cluster.

A key function of the adaptive immune system is to mount protective memory responses to a given antigen. B cells recognize their specific antigens through immunoglobulins (Ig), surface antigen receptors that are unique to each cell and its progeny. A typical Ig repertoire is composed of one immunoglobulin heavy chain (IGH) and two light chains, κ (IGK) and λ (IGL). Igs are diversified through somatic recombination, a process that randomly combines variable (V), diversity (D), and joining (J) gene segments and inserts or deletes non-templated bases at the recombination junctions [1] (Fig. The resulting DNA sequences are then translated into antigen receptor proteins. This process enables the Ig repertoire of any given individual to develop an astonishing diversity of antigen receptors, with >10¹³ theoretically possible distinct Ig receptors [1]. Ig repertoire diversity is key for an individual's immune system to confer protection against a wide variety of potential pathogens [2].

In addition, upon activation of a B cell, somatic hypermutation further diversifies Igs in their variable region. These changes are mostly single-base substitutions occurring at extremely high rates: somatic hypermutation introduces 10⁻⁵ to 10⁻³ mutations per base pair per generation [3]. Isotype switching is another mechanism that contributes to B-cell functional diversity. Here, antigen specificity remains unchanged, while the heavy chain VDJ regions join with different constant (C) regions, such as IgG, IgA, or IgE isotypes, and alter the immunological properties of Igs. The pairing of heavy and light chains that occurs in polyclonally activated B cells is another mechanism that increases Ig diversity.

Figure caption: A schematic representation of the human Ig receptor repertoire.
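To give a feel for the somatic hypermutation rate quoted in the text (10⁻⁵ to 10⁻³ mutations per base pair per generation), the short sketch below converts that per-base rate into an expected number of mutations per cell generation. The region length of 350 bp is an illustrative assumption for a heavy-chain variable region and is not taken from the text; the function name `expected_mutations` is likewise hypothetical.

```python
def expected_mutations(rate_per_bp: float, region_length_bp: int, generations: int = 1) -> float:
    """Expected mutation count, assuming independent sites and a constant per-base rate."""
    return rate_per_bp * region_length_bp * generations


# ASSUMPTION: ~350 bp is used here only as an illustrative variable-region length.
REGION_BP = 350

low = expected_mutations(1e-5, REGION_BP)   # lower bound of the quoted rate range
high = expected_mutations(1e-3, REGION_BP)  # upper bound of the quoted rate range

print(f"expected mutations per generation: {low:.4f} to {high:.4f}")
```

Under these assumptions, the upper end of the range corresponds to roughly one mutation in the variable region every three generations, which is why a clone can diverge noticeably from its ancestor after a modest number of divisions.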

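The abstract notes that a minimum spanning tree (MST) yields a result equivalent to single-linkage hierarchical clustering. The following is a minimal single-machine sketch of that idea, not the thesis's SparkMST implementation: it builds an MST with Prim's algorithm and then keeps only edges at or below a distance threshold, so the remaining connected components are the single-linkage clusters at that cut height. The Hamming distance on equal-length sequences, the toy sequences, and the function names (`hamming`, `prim_mst`, `single_linkage_clusters`) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions (ASSUMPTION: equal-length sequences)."""
    if len(a) != len(b):
        raise ValueError("hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def prim_mst(items: Sequence[str], dist: Callable[[str, str], float]) -> List[Tuple[float, int, int]]:
    """Return MST edges (weight, u, v) of the complete distance graph, O(n^2) distance calls."""
    n = len(items)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known edge linking each node to the tree
    parent = [-1] * n
    best[0] = 0
    edges: List[Tuple[float, int, int]] = []
    for _ in range(n):
        # Pick the node outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = dist(items[u], items[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def single_linkage_clusters(items: Sequence[str], dist: Callable[[str, str], float],
                            threshold: float) -> List[List[str]]:
    """Flat single-linkage clusters: components left after dropping MST edges above threshold."""
    n = len(items)
    root = list(range(n))

    def find(i: int) -> int:
        while root[i] != i:
            root[i] = root[root[i]]  # path halving
            i = root[i]
        return i

    for w, u, v in prim_mst(items, dist):
        if w <= threshold:
            root[find(u)] = find(v)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(items[i])
    return sorted(sorted(g) for g in groups.values())


# Toy sequences (hypothetical, not real repertoire data).
seqs = ["ACGT", "ACGA", "ACCA", "TTTT", "TTTA"]
clusters = single_linkage_clusters(seqs, hamming, threshold=1)
print(clusters)
```

Because the sketch still computes all pairwise distances, it shares the quadratic cost that motivates the indexing and summarization strategies described in the abstract; it is meant only to show why cutting MST edges reproduces single-linkage clusters.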