2024 DoSS Summer Research Projects: Applications now open

March 5, 2024 by Kal Romain

We’re excited to announce the opening of applications for the 2024 Summer Undergraduate Research Program. This year, we're offering a wide range of projects, from robust regression analysis to ground-breaking work in ecological statistics, population models, neurological imaging, financial algorithms, large-scale data training, cosmological studies, and innovative aerosol research. These projects reflect our commitment to addressing diverse, real-world problems through statistical methodologies. Students who pursue research experience will immerse themselves in a 16-week journey from May to August, guided by our faculty's expertise and driven by curiosity and ambition.


Applications are open to DoSS specialists or majors who have successfully conquered STA302H1 with at least a B+ grade. Your academic achievements, experience, and passion for research will be the keystones of your application.


Ready to make this summer transformative? Apply now for a venture into the depths of data and discovery.


Check out the list of research projects available:


Supervised by: Nancy Reid


This project will consider comparison of Bayesian and frequentist methods on robust regression.

It will involve a mix of theoretical analysis and simulations, initially in fixed-dimension

regression models, and then in high-dimensional regression with regularization.

Comparing Bayesian and frequentist approaches is of interest both for the foundations of

inference and for the practical assessment of the reliability of Bayesian approaches; the latter is

closely related to the asymptotic theory of likelihood-based inference. Robust regression

methods are an important technique for ensuring that statistical conclusions remain valid even

when the model used for inference differs from the model generating the data.

Supervised by: Vianey Leos Barajas


Ecological statistics is a rapidly growing interdisciplinary field. Statisticians work in

interdisciplinary settings with ecologists to develop novel statistical models for common

ecological data structures. Within this field lies the study of animal movement. Typical types of

data collected for the study of animal movement are GPS, i.e. positional data, and accelerometer

data. However, as technology evolves so does the type of data that can be collected. For the

study of shark aggregations, researchers are now collecting data using drones to capture aerial

footage. From this we can extract locations of sharks and also collect environmental variables

that may drive their aggregation patterns. This type of spatial data structure is relatively new and

spatial models have not yet been developed to model repeated aggregation observations.

Supervised by: Monica Alexander & Radu Craiu


Probabilistic projections for human populations are commonly obtained through the use of

cohort component models, where the components of population change (fertility, mortality, and

migration) are themselves estimated using time series models. Existing approaches assume the

components of population change are independent, however, this is not the case in most


This project will investigate the degree and nature of dependency in components of population

change, and sensitivities to population projections to different assumptions about dependence.

The student will use copulas to model dependency in UN population data across a wide range of

countries and carry out a simulation study to assess the impact of different assumptions and

models on resulting projections.

Supervised by: Jun Young Park


Technological advances in brain magnetic resonance imaging (MRI) allowed researchers to use

non-invasive methods to understand brain structure and functions and develop novel research

questions. Among them, individual differences in “coupling” across measures of brain structure

and function may underlie differential risk for neuropsychiatric disorders, and research in this

area has gained significant attention in neuroscience. While several approaches have emerged for

quantifying intermodal coupling at the individual level and testing its existence at the group

level, it has yet to be determined whether these intermodal coupling are regulated by genetic

factors (i.e., “heritable”). Understanding the genetic underpinnings of coupling, if they exist,

would provide invaluable biomarkers for brain-phenotype associations.

Several methodological issues must be considered to evaluate its possibility carefully, which is

the goal of the summer research. These include (i) high-dimensionality of brain MRI data, (ii)

low signal-to-noise ratio, and (iii) a relatively small number of samples, all of which would lead

to an underpowered study. Therefore, we will study how multivariate (spatial-extent) modelling

and inference would help overcome the limitations. During the summer project, students will

gain experience in (i) exploratory data analysis of real brain imaging data, (ii) methods

development, and (iii) software implementation. Depending on the research progress, the

resulting research outputs would be submitted to a peer-reviewed journal for publication

(although it is typically expected to take over three months).

It is expected that students meet 1-2 times each week with me to discuss progress and challenges.

Those are welcome to contact me (junjy.park@utoronto.ca) if they want to discuss more details

of the summer research project.

Supervised by: Xiaofei Shi


This project aims to develop novel deep reinforcement learning algorithms and compare with

existing ones for portfolio optimization problem and risk management. In particular, with risk

preferences such as expected shortfall and value-at-risk, closed-form solution are very limited

and efficient numerical algorithms are in need.

How to minimize risks and maximize profits in a financial market are essential tasks for financial

institutions such as investment banks, hedge funds, insurance and reinsurance companies. These

problems are usually formulated mathematically as portfolio optimization and/or risk

management problems. Since the financial market is usually a complicated system and the

optimization problems are intrinsically high-dimensional, we cannot expect simple closed-form

solution. Deep reinforcement learning algorithms, due to their capabilities to overcome curse-ofdimensionalities, may offer a class of numerical solutions to these portfolio optimization and risk

management problems.

The project will involve a mix of theory and development of numerical algorithms, to fully

utilize the power of deep reinforcement learning as a efficient numerical tool.

Supervised by: Christopher Maddison


Pretraining on very large-scale data is one of the key factors in the state-of-the-art (SOTA)

performance of large language models (LLMs) on many natural language processing tasks.

However, when it comes to their performance on biochemical tasks, these general purpose

models still lag far behind specialized predictors.

In this project, our aim is to develop a large-scale dataset of biochemical data, together with

science texts. There are a number of design decisions that need to be made, including the

sourcing of the data, the impact of tokenization, how to manage links, and data ordering. We will

evaluate the quality of our data, as well as the impact of these design consideration, by

evaluating the performance of models pre-trained on our data, compared to SOTA LLMs.

Supervised by: Joshua Speagle & Tanveer Karim


Modern astronomical surveys are collecting data on hundreds of millions of galaxies to measure

the properties of the Universe at the largest (i.e. cosmological) scales. One of the main goals of

these efforts is to finally uncover the true nature of the many mysterious components that make

up our Universe, including Dark Energy, Dark Matter, and neutrinos (the smallest particles found

in nature). To constrain various physics models, astronomers need to simultaneously infer the

properties of multiple parameters of interest as well as a large number of nuisance parameters.

This is often done under a Bayesian framework and relies on several (strong) assumptions to

make claims of discovery or to test which model of cosmology best explains the data. Although

many of these assumptions seem well-motivated, the extent to which these assumptions can be

safely trusted is unclear.

In this project, the student will explore two interconnected research areas. The first involves

developing methods to better understand how observed discrepancies between a few parameters

of interest from different datasets generalize to high-dimensional spaces. The second involves

exploring the robustness of various model comparison strategies when the desired parameters are

close to the edge of the parameter space; this latter problem is highly relevant to the problem of

estimating the sum of neutrino masses.

Supervised by: Joshua Speagle & Michael Walmsley


Euclid is a $500M USD space telescope that has just (Feb 14!) started operations and aims to

capture the first images of hundreds of millions of galaxies at a time when the Universe was only

a few billion years old. UofT researchers are providing deep learning models to measure the

appearance of these galaxies (e.g. counting spiral arms) based on images of galaxies from other

telescopes labelled by 100k+ volunteers. Given the massive increase in data volume in Euclid,

future volunteers will only be able to provide high-quality characterizations for a tiny fraction of

these galaxies. Which galaxies should these be?

This project will explore various active learning strategies to identify which will work best for

Euclid and under what conditions (supercomputer access and state-of-the-art models will be

provided). It will also explore the potential consequences of these strategies on expected model

performance, uncertainty quantification, and robustness to domain shifts and rare events. These

efforts may also involve collecting labels on new Euclid galaxies -- galaxies which the student

would likely be the first person to see.

Supervised by: Meredith Franklin


The spatial distributions of brain metastasis are hypothesized to vary according to primary cancer

subtype, but an understanding of these patterns remains poorly understood despite having major

implications for treatment. Through this project we hope to elucidate the topographic patterns of brain

metastases for 5 different primary cancers (melanoma, lung, breast, renal, and colorectal), which may be

indicative of the abilities of various cancers to adapt to regional neural microenvironments, facilitate

colonization, and establish metastasis. Our findings could be used as a predictive diagnostic tool and for

therapeutic treatments to disrupt growth of brain metastasis on the basis of anatomical region.

To test our hypothesis that brain metastases have different spatial patterns depending on the primary

cancer type, we will leverage 3D coordinates of brain metastases derived from stereostatic radiosurgery

procedures in over 2100 patients. With these data we will explore two types of spatial models: one where

the X, Y, Z spatial coordinates of the metastases are compared between the 5 different primary cancer

types, and another where we compare the spatial coordinates of the metastases from each cancer type

separately to spatially random processes on a sphere. Both approaches will use flexible generalized

additive models. However, in the latter approach, methods will be developed to generate random spatial

Poisson point processes in three dimensions.

Supervised by: Meredith Franklin


Exposure to particulate matter (PM) air pollution has been associated with a myriad of adverse

health outcomes, yet the relative toxicity of PM mixtures with different sizes, shapes, and

chemical compositions is poorly understood. This research will help future satellite missions to

be better equipped to understand aerosol particle type and its role on human health.

Using hourly data collected over the past 2 years by multiple co-located instruments at several

locations in California and New York, we will explore how to predict PM properties

differentiated by size and chemical composition from aerosol optical depth properties (as

measured through remote sensing). Given the high dimensionality of the measured aerosol

parameters, we will leverage machine learning techniques such as XGBoost with SHAP to

understand what variables are important in predicting PM. Furthermore, we will incorporate

temporal information to explicitly model autocorrelations in the time series data.

This work is in collaboration with the NASA and the Jet Propulsion Laboratory.


Please fill out and submit the following application form: Application for DoSS Summer Undergraduate Research Awards 2024. If you have any questions regarding these awards, please contact ug.statistics@utstat.utoronto.ca Completed applications are due by 11:59 PM EST Thursday, March 14, 2024

 Completed applications are due by 11:59PM EST Thursday, March 14, 2024