Undergraduate research in statistics offers students a unique opportunity to delve deeply into advanced topics beyond standard coursework. By engaging in STA496/STA497/STA498/STA499: Readings in Statistics, high-achieving students can explore complex areas under the guidance of a faculty member. This hands-on experience not only enhances understanding and appreciation of statistical science but also provides a solid foundation for those considering graduate studies. Participation in these courses requires departmental approval and can significantly bolster a student's academic profile.
Fall 2025

By: Aditya Khan
Supervisor: Meredith Franklin
Spatial autoregressive error (SEM) models are widely used to characterise spatial dependence in areal data, where closed-form measures such as Moran’s I and the approximate profile likelihood estimator (APLE) are routinely used for diagnostics and exploratory spatial data analysis. Existing APLE-type estimators are derived from the full maximum likelihood profile for the spatial dependence parameter in SEM models, and therefore depend on the joint specification of all parameters, including regression coefficients, even though spatial autocorrelation is typically summarised from regression residuals after removing fixed effects. Motivated by the restricted maximum likelihood (REML) framework, we propose RAPLE, a REML-type one-step restricted profile likelihood estimator of the spatial dependence parameter in the Gaussian SEM. We study its properties theoretically and provide simulations and an applied case study to show its efficacy against existing alternatives. Attention is paid to the role of covariates and the choice of spatial weights.
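For reference, one common closed-form expression of Moran's I for regression residuals is sketched below. The notation is an assumption for illustration and is not taken from the project: $e$ denotes the $n$-vector of residuals (assumed to have mean zero), $W$ the spatial weights matrix with entries $w_{ij}$, and $S_0$ the sum of all weights.

```latex
I = \frac{n}{S_0} \cdot \frac{e^{\top} W e}{e^{\top} e},
\qquad
S_0 = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}
```

When $W$ is row-standardised, $S_0 = n$ and the expression reduces to $e^{\top} W e / e^{\top} e$.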
By: Afra Azad and Madelyne Zhang
Supervisor: Meredith Franklin
Bangladesh faces frequent extreme weather events such as monsoons, floods, and heatwaves, along with high levels of air pollution, yet infrastructure to measure air quality is extremely limited. This project applies a machine learning ensemble model to predict PM2.5 and a climate model to predict heat events at residential locations of over 900 pregnant participants in the Bangladesh Cookstove Pregnancy Cohort Study. We examine associations between prenatal and postnatal exposure to these environmental factors and infant acute lower respiratory infections (ALRI), which can recur over time—necessitating specialized recurrent event modeling.
By: Haocheng Ding
Supervisor: Jesse Gronsbell
This project examines fairness in machine learning models for time-to-event predictions. Using the framework proposed by Rahman and Purushotham (2022), we implement two fair survival models—FIDP and FIPNAM—and evaluate them alongside Cox regression using both real clinical data (FLChain) and simulated datasets. Performance is assessed using metrics such as C-index, AUC, and Brier score, while fairness is measured across individual, group, and censoring-based criteria.
By: Weixuan Jiang
Supervisor: Fanny Chevalier
This project presents a systematic literature review on psychological factors relevant to financial advising, with a focus on robo-advisors. It synthesizes research in two streams: (1) how robo-advisors affect client psychology, including trust and perceived social presence; and (2) communication, empathy, and stress in relationships between human advisors and financially vulnerable clients.
By: Yinuo Yang
Supervisor: Jesse Gronsbell
This project evaluates prediction-powered inference methods in settings where only a subset of data is labeled and machine learning predictions fill in unobserved outcomes. Through 12 simulation scenarios varying prediction quality, outcome-generating mechanisms, and missingness structures, we compare how well different methods recover unbiased estimates and calibrated uncertainty.
By: Chuxuan Ai
Supervisor: Joshua Speagle
Repelling-Attracting Hamiltonian Monte Carlo (RAHMC) is a variant of HMC designed to improve sampling efficiency in complex posterior landscapes. By alternating repelling and attracting phases within conformal Hamiltonian dynamics, RAHMC encourages movement across low-density regions while preserving the target distribution, enabling improved global mixing and exploration.
By: Shuxin Tan
Supervisor: Jesse Gronsbell
This project compares modern prediction-powered inference methods across a range of simulated scenarios. By varying missing-data mechanisms, outcome models, and prediction error structures, the study assesses how different approaches perform under realistic semi-supervised learning conditions.
By: Shijun Yu
Supervisors: Patrick Brown and James Stafford
The BOSS algorithm provides approximate Bayesian inference for models where evaluating the likelihood across values of a low-dimensional parameter is computationally expensive. By placing a Gaussian process surrogate on the log posterior and using Bayesian optimization (GP-UCB) to select informative evaluation points, BOSS efficiently identifies high posterior mass regions and constructs normalized posterior approximations with far fewer model evaluations than brute-force grid searches.
By: Leo Liu
Supervisor: Meredith Franklin
This project develops a machine learning ensemble to estimate ground-level particulate matter and dust pollution over the Middle East using satellite observations of aerosol optical depth (MODIS), airport visibility measurements, and climate model meteorology. Boosted models, neural networks, and decision trees are combined to generate daily 1 km spatial grids of predicted air pollution.
By: Rebecca Li
Supervisors: Shion Guha and Rohan Alexander
This project evaluates seven post-hoc algorithmic fairness interventions applied to datasets from a public college in Ontario. Performance is assessed using fairness metrics, accuracy trade-offs, and subgroup-specific effects.
By: Katrina Sha
Supervisor: Zhou Zhou
This project reviews Ding & Zhou (2025), who propose a sieve-based estimator for nonlinear nonstationary time series and develop uniform inference results for time-varying regression structures. The presentation discusses theoretical foundations and empirical performance across simulation scenarios.
By: John Zhang
Supervisor: Rohan Alexander
This project develops an Elo-based rating model to predict match outcomes (win–draw–loss) in Europe’s Big Five football leagues. Team ratings update after each match based on home advantage, goal difference, and other factors, and are converted into probabilistic forecasts. Predictive accuracy is compared with baselines such as league position and always-pick-the-home-team strategies.
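As a rough illustration of the kind of rating update described above, the following sketch applies the standard Elo formula with a fixed home-advantage offset. The function names, the K-factor of 20, and the 60-point home advantage are assumptions chosen for illustration; the project's actual model, including how it handles draws and goal difference, is not specified here.

```python
def expected_home_score(r_home, r_away, home_adv=60.0):
    """Expected score for the home team under the standard Elo logistic curve.

    home_adv is a hypothetical rating bonus added to the home side.
    """
    diff = r_home + home_adv - r_away
    return 1.0 / (1.0 + 10 ** (-diff / 400.0))


def update_ratings(r_home, r_away, result, k=20.0, home_adv=60.0):
    """Return updated (home, away) ratings after one match.

    result is the home team's score: 1.0 for a win, 0.5 for a draw,
    0.0 for a loss. k is the (assumed) update step size.
    """
    expected = expected_home_score(r_home, r_away, home_adv)
    delta = k * (result - expected)
    # The update is zero-sum: what the home team gains, the away team loses.
    return r_home + delta, r_away - delta
```

Plain Elo yields only an expected score, so turning it into separate win, draw, and loss probabilities requires an additional modelling step that this sketch omits.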
By: Jared Xu
Supervisor: Christopher Blier-Wong
This project implements and validates a copula-based collective risk model for insurance portfolios. Aggregate loss is modeled as a sum of random claim sizes (severity) over a random number of claims (frequency). Using geometric frequency and Pareto severity distributions, with dependence modeled via the FGM copula, the project develops a simulation algorithm and a constrained maximum likelihood estimation procedure. A Monte Carlo study confirms the estimator’s ability to recover true parameters.
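To make the aggregate-loss structure concrete, here is a minimal simulation sketch of a compound geometric-Pareto sum. It is deliberately simplified: frequency and severity are drawn independently, so the FGM copula dependence used in the project is not reproduced. The function name, the parameter names, and the choice of a geometric distribution supported on 0, 1, 2, ... are assumptions for illustration.

```python
import numpy as np


def simulate_aggregate_loss(n_sims, p, alpha, x_m, seed=0):
    """Simulate aggregate losses S = X_1 + ... + X_N under independence.

    N is geometric on {0, 1, 2, ...} with success probability p (assumed
    support), and each X_i is classical Pareto with shape alpha and
    scale x_m.
    """
    rng = np.random.default_rng(seed)
    # NumPy's geometric counts trials up to the first success (support
    # starts at 1), so subtract 1 to allow zero claims.
    counts = rng.geometric(p, size=n_sims) - 1
    # Generator.pareto draws from the Lomax (Pareto II) distribution;
    # adding 1 and scaling by x_m gives the classical Pareto.
    severities = (rng.pareto(alpha, size=counts.sum()) + 1.0) * x_m
    # Assign each claim to its simulation index and sum within each index.
    owner = np.repeat(np.arange(n_sims), counts)
    totals = np.bincount(owner, weights=severities, minlength=n_sims)
    return counts, totals
```

Introducing dependence between the claim count and the claim sizes would require replacing the two independent draws with a joint sampling step, which is the part the project's simulation algorithm addresses.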
Winter 2024
By: Xiao Wu
Supervisors: Clement Ma and Nathan Taback
Precision medicine aims to treat patients using targeted therapies based on the patient’s genetic or molecular profile. Biomarker-based trial designs can test the efficacy of an intervention within a subgroup of participants who are positive for a specific biomarker. These designs have been applied successfully to increase the efficiency of trials testing targeted cancer therapies. However, their application to interventions for mental health disorders has not been explored. This reading course will review existing biomarker-based trial designs published in the literature. Analytical power calculations will be used to evaluate and compare different biomarker-based trial designs in the context of mental health trials.
By: Hanlong Chen
Supervisor: Murari Singh
The inverse Gaussian distribution is well suited to modelling positive observations. Projects that run experiments regularly accumulate results that can form the basis of a prior distribution for use in analysing a future experiment. This project will review the standard Bayesian approach and the relevant distributions, and will identify the steps for analysing data from experiments laid out in a completely randomized design (CRD), with a view to applying them to real data.
By: Yihan Wang
Supervisor: Joshua Speagle
The DESI Milky Way Survey will observe approximately seven million stars over the next five years, enhancing our understanding of Galactic structure and stellar evolution. Efficient extraction of stellar parameters from DESI spectra is crucial. However, the original DESI pipeline labels exhibit significant discrepancies compared with those derived from high-resolution data from the SDSS APOGEE survey. This study used normalizing flows to predict metallicity and stellar parameters from DESI spectra.
Our SBI-driven approach substantially improved performance relative to APOGEE results, retained proper marginal uncertainty coverage, and successfully recovered bimodal patterns among stellar parameters. Future work will focus on improving bias and uncertainty quantification in the tails of the parameter distributions.
By: Qianyu Fan
Supervisor: Joshua Speagle
Many real-world datasets occupy low-dimensional manifolds in high-dimensional spaces (i.e., many observables are correlated in complex, non-linear ways). This project will apply several dimensionality reduction algorithms, including PCA, t-SNE, UMAP, autoencoders (AE), and variational autoencoders (VAE), to an astronomical dataset to uncover hidden patterns and structures, and will explore approaches for comparing their performance.
By: Zichun Xu
Supervisor: Meredith Franklin
With our changing climate, both the occurrence and magnitude of wildfires have been increasing, and an objective assessment of these trends including their impact on air quality is needed. This analysis-driven project will involve processing two decades of satellite observations of thermal detections and smoke plumes in order to examine spatial and temporal trends in wildfire occurrence and intensity in western North America. Further, these satellite data will be linked to air quality monitoring networks to better understand the scope of the impact of wildfires on air quality. We focus on characterizing the chemical signatures of wildfire pollution through non-negative matrix factorization, a dimension reduction technique that enables the interpretation of air pollution sources based on factor loadings.
By: Aidan Mann Rong Li, Liyan Wang, and Tianye Dou
Supervisor: Jeffrey Rosenthal
Through a combination of literature review, mathematical analysis, and computer simulations, the students will investigate Markov chain Monte Carlo (MCMC) algorithms such as the Metropolis algorithm, focusing on which factors make them run more efficiently and how they can be improved.
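For readers unfamiliar with the method, the sketch below shows a minimal random-walk Metropolis sampler in one dimension. It is a generic textbook version rather than the students' code, and the function name, arguments, and Gaussian proposal are assumptions for illustration. The proposal step size is one example of a tuning factor that affects efficiency: very small steps are almost always accepted but explore slowly, while very large steps are rarely accepted.

```python
import numpy as np


def metropolis(log_target, x0, n_steps, step_size, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target.

    log_target returns the log of the (possibly unnormalised) target
    density. Returns the chain and the fraction of accepted proposals.
    """
    rng = np.random.default_rng(seed)
    x = float(x0)
    log_p = log_target(x)
    samples = np.empty(n_steps)
    accepted = 0
    for i in range(n_steps):
        # Symmetric Gaussian proposal centred at the current state.
        proposal = x + step_size * rng.standard_normal()
        log_p_new = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(x)).
        if np.log(rng.uniform()) < log_p_new - log_p:
            x, log_p = proposal, log_p_new
            accepted += 1
        # The current state is recorded whether or not the move was accepted.
        samples[i] = x
    return samples, accepted / n_steps
```

For example, calling `metropolis(lambda x: -0.5 * x * x, 0.0, 20000, 2.4)` targets a standard normal distribution; rerunning with different `step_size` values shows how the acceptance rate changes.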
By: Naihe Xiao
Supervisor: Jesse Gronsbell
This project examines whether machine learning models are fair across different attributes, including gender, race, and age.
By: Yi Dai
Supervisor: Meredith Franklin
The student will conduct a literature review of the state-of-the-art data and methods used to estimate ground-level particulate matter air quality from satellite observations. Data from Sentinel-5 will then be acquired, processed, and linked to measurements that we have collected as part of a larger project in Bangladesh. Spatiotemporal statistical and machine learning approaches will be developed on the linked dataset to produce air quality predictions over the study region.