Research

My research program has two synergistic components: a methodological component focused on developing novel statistical methods for genomics, and an applied component focused on using these methods in clinical and biological studies. To develop this program, I take a multidisciplinary approach that integrates methods from statistics, machine learning, bioinformatics, and computational biology.

Developing machine learning tools for single-cell RNA sequencing analysis

The advent and rapid development of single-cell technologies have made it possible to study cellular heterogeneity at unprecedented resolution and scale. Cellular heterogeneity underlies phenotypic differences among individuals, and studying it is an important step toward understanding the molecular mechanisms of disease. Single-cell technologies offer opportunities to characterize cellular heterogeneity from different angles, but linking cellular heterogeneity with disease phenotypes requires careful computational analysis. This project proposes to develop computational strategies for the analysis of single-cell data for various purposes, including batch effect removal, missing gene imputation, cell type identification, and cell differentiation trajectory inference. We also describe how single-cell data can be integrated with bulk tissue data and data generated from genome-wide association studies.

Statistical and Computational Methods
Deciphering tissue microenvironment from spatial transcriptomics

Recent technological advances in spatial transcriptomics (ST) have enabled gene expression profiling while preserving spatial location information in tissues. ST has been applied to study diverse tissues, and these applications have transformed our view of transcriptome complexity. A popular ST technology is based on spatial barcoding followed by next-generation sequencing, in which transcriptome-wide gene expression is measured in spatially barcoded spots. Data from ST technologies are typically complemented by high-resolution hematoxylin and eosin (H&E) stained histology images of the same tissue section, which are invaluable for examining cellular morphology and how it changes over embryonic development or disease progression. This project proposes to develop computational tools that combine gene expression and histological features within their spatial context to inform the origin, developmental trajectory, and progression of complex diseases.

Statistical and Computational Methods
Developing label-free analytic tools for digital pathology

The adoption of digital pathology has enabled the curation of large repositories of gigapixel whole-slide images (WSIs). WSIs are invaluable for examining cellular morphology and how it changes over embryonic development or disease progression. Many existing methods employ pre-trained deep neural networks to extract features from histology images and then use these features for downstream analysis. A drawback of deep neural networks, e.g., ResNet and Vision Transformer (ViT), is that they require a large number of images carefully annotated by pathologists for model training, which limits their usefulness. This project aims to develop label-free machine learning methods for medical imaging data analysis. The developed tools can be easily applied to studies where training samples are not available, bypassing cumbersome labeling steps.

Statistical and Computational Methods
Uncovering the mechanisms of Alzheimer’s disease from a multi-omics atlas

Alzheimer’s disease (AD), the leading cause of dementia in the elderly, is a progressive and fatal neurodegenerative disease that affects 40-50 million people worldwide [1]. Pathologically, AD is characterized by intracellular hyperphosphorylated tau aggregates and extracellular β-amyloid plaques, which coincide with the activation of innate immunity, gliosis induced by activated microglia and reactive astrocytes, white matter degeneration, dysfunction of oligodendrocytes, and neuronal loss [2-5]. Genome-wide association studies (GWAS) have identified more than 30 genetic risk loci for AD, many of which are related to innate immunity and microglial function, including APOE and TREM2 variants, which are associated with high genetic risk for sporadic AD [6-10]. Numerous studies have shown that AD pathology spreads from regions such as the medial temporal lobe to the cortex [11,12]. However, the molecular mechanisms underlying the cell- and region-specific distribution of AD pathology during disease progression are still not fully elucidated. The transcriptome of the AD brain can pinpoint key disease-associated differences that may be crucial for elucidating the pathogenesis of AD and for developing disease-modifying therapeutics for its prevention and treatment. To develop effective cell therapies for AD, it is necessary to know the spatial distribution of different immune and glial cells in AD brains and how they interact with neuronal cells as the disease progresses. Such knowledge is required for precision medicine and for the development of small-molecule therapeutics and their delivery to a specific tissue domain. This project addresses key computational challenges in the analysis of spatial transcriptomics, single-cell/single-nucleus RNA-seq, and single-cell ATAC-seq data generated from AD studies.

Systematic Reviews
Inferring cell-to-cell communications in tumor ecosystems from multi-omics data

Cell-to-cell communication reveals a dynamic cellular ecosystem that develops, evolves, and responds to environmental factors. The role of cell-to-cell communication has been extensively investigated, particularly in cancer, and breakthroughs arising from these discoveries have led to important clinical applications in cancer therapy. Most existing ligand-receptor interaction studies use only single-modality omics data, which offer a limited view of cellular communication. Combining multi-omics data provides information that is more than the sum of its parts and opens new opportunities to comprehensively characterize cell interactions in tumor ecosystems. This project proposes to address key computational challenges in analyzing multi-omics data to decipher cell-to-cell communication. By developing and applying innovative statistical and machine learning methods to multi-omics datasets, both newly generated and publicly available, we will discover novel ligand-receptor pairs and provide insights into their mechanisms that empower precision therapeutic targeting of a broad array of complex human diseases.

Human Disease Studies