Student Publications

We offer our students opportunities to deeply engage in research under the guidance of our passionate and accomplished academic community. Our students continue to impress us with their outstanding quality of work and a high level of research productivity in the most prestigious journals. Meet the leading researchers of tomorrow!

Al-Aradi, Ali

Ali completed his PhD in mathematical finance in October 2020 under the supervision of Professor Sebastian Jaimungal. His main research interests are in stochastic portfolio theory, stochastic optimal control and portfolio selection problems.

By Ali Al-Aradi and Sebastian Jaimungal

Active and Passive Portfolio Management with Latent Factors

Quantitative Finance | 2021 (accepted)

Link to publication

Abstract: We address a portfolio selection problem that combines active (outperformance) and passive (tracking) objectives using techniques from convex analysis. We assume a general semimartingale market model where the assets' growth rate processes are driven by a latent factor. Using techniques from convex analysis we obtain a closed-form solution for the optimal portfolio and provide a theorem establishing its uniqueness. The motivation for incorporating latent factors is to achieve improved growth rate estimation, an otherwise notoriously difficult task. To this end, we focus on a model where growth rates are driven by an unobservable Markov chain. The solution in this case requires a filtering step to obtain posterior probabilities for the state of the Markov chain from asset price information, which are subsequently used to find the optimal allocation. We show the optimal strategy is the posterior average of the optimal strategies the investor would have held in each state assuming the Markov chain remains in that state. Finally, we implement a number of historical backtests to demonstrate the performance of the optimal portfolio.

Layman Summary: If you asked a prospective investor what they hope to achieve by investing, the most common (and obvious) answer would be "to make money!" The goal of making as much money as possible or enough money to pay off a mortgage within a certain timeframe, for example, can be sensible goals. However, they are goals that are absolute in nature - meaning they do not involve any external random factors that determine whether or not a certain outcome is deemed a success. An alternative (and also very common) goal would be to "beat the market." A goal of this nature would be considered a relative goal because, regardless of how well an investment strategy does, the investor would be disappointed if their strategy did not make them as much as a passive investment in the market. In this day and age, the average retail investor is savvy enough to be able to invest their money easily in a broad market index and gain whatever returns the market generates. As such, they would expect a professional money manager to be able to outperform that benchmark. Benchmarks also serve as an expression of the investor's risk tolerance, in that they do not wish to invest in a way that subjects them to a level of volatility that is vastly different than that of the benchmark. In other words, they may be opposed to deviating too much from the market - a notion that is commonly referred to as tracking. In short, the combination of all these facets of the problem illustrates that performance can be tied to an external process (the benchmark) that evolves randomly. The main goal of this paper is to be able to solve a wide range of problems that fall under both of these categories (absolute and relative) and may involve elements of outperformance (active investing) and tracking (passive investing).

Successfully solving this problem ultimately hinges on our ability to estimate the rate at which asset prices grow - which can itself vary through time as assets go through different phases of booms and busts. The approach that we take is to use so-called "latent factors" to model asset prices. Intuitively, this means that we assume that an unobservable variable - a "hidden truth", so to speak - dictates how assets grow through time. Often this variable is interpreted as the state of the economy. Under some states, all assets perform well; under others, all assets perform poorly; and there are certain states still that favour some assets over others. The mathematical complexity here lies in the fact that we must figure out which states are good/bad for which assets and that the state of the economy is not something we can observe; instead, it must be inferred from what we can observe, namely the asset prices themselves. Despite the added complexity, using latent factors provides us with the flexibility to better capture the behavior of asset prices and ultimately leads to better investment results.


Bilodeau, Blair

Blair is a third-year PhD candidate in statistical sciences at the University of Toronto, supervised by Professor Daniel Roy. His research is supported by an NSERC Doctoral Canada Graduate Scholarship and the Vector Institute. His research focuses on combining techniques from statistics and computer science to obtain theoretical performance guarantees for decision making. Read more.

By Blair Bilodeau, Dylan J. Foster, and Daniel M. Roy

In Proceedings of the 37th International Conference on Machine Learning | 2020

Link to publication

Abstract: We consider the classical problem of sequential probability assignment under logarithmic loss while competing against an arbitrary, potentially nonparametric class of experts. We obtain tight bounds on the minimax regret via a new approach that exploits the self-concordance property of the logarithmic loss. We show that for any expert class with (sequential) metric entropy $O(\gamma^{-p})$ at scale $\gamma$, the minimax regret is $O(n^{\frac{p}{p+1}})$, and that this rate cannot be improved without additional assumptions on the expert class under consideration. As an application of our techniques, we resolve the minimax regret for nonparametric Lipschitz classes of experts.

Layman Summary: We study the general problem of probabilistic forecasting: the forecaster is tasked with specifying probabilities over the possible outcomes of an upcoming event (e.g., the probability of rain; or of cardiac failure; or that an adjacent driver will change lanes). Probabilistic forecasts are scored based on how much probability they assign to the actual observed outcome. We suppose that, in addition to covariate information (e.g., current weather; or patient medical history; or the car's sensor readings), the forecaster has access to a class of benchmark forecasters, which we refer to as experts. These experts may exist in the real world in the form of domain-specific opinions, or may simply be tools for the forecaster, such as a collection of statistical forecasting models. Irrespective of their form, the experts' predictions are available to the forecaster to use when producing their own forecast.

The goal of the forecaster is to perform nearly as well as the expert who ends up performing best once all the outcomes have been recorded. The goal of our research is to quantify how well the best forecaster can do at this task. One way to understand the best forecaster’s performance is to assume that the data will follow some statistical relationship, but this exposes the forecaster to potentially poor performance if reality diverges from these assumptions. In contrast, we characterize the limits of forecasting performance without making any assumptions on the data. Specifically, we show that a particular geometric notion of the “complexity” of a class of experts characterizes the performance limits for very general and large classes.

Our work partially resolves a decades-long line of inquiry by showing that a specific measure of complexity can precisely characterize probabilistic forecasting performance for general expert classes simultaneously. Additionally, an important implication of our results is that probabilistic forecasting is fundamentally different from the related problem of regression (i.e., forecasting the average event), since different complexity measures must be used to characterize the limits of performance for these two tasks. Finally, our results are stated abstractly in terms of the experts, and consequently can be broadly applied across domains where forecasting future events must account for multiple sources of uncertainty.


Campbell, Steven

I am a PhD student at the University of Toronto co-supervised by Professors Leonard Wong and Yuchong Zhang. My current research focus is in mathematical finance and some related areas. In particular, I am working on problems in stochastic portfolio theory and mean field games. Read more.

By Steven Campbell and Ting-Kam Leonard Wong

SIAM J. Financial Mathematic | (accepted January 2022, publication forthcoming)

Link to publication

Abstract: In this paper we develop a concrete and fully implementable approach to the optimization of functionally generated portfolios in stochastic portfolio theory. The main idea is to optimize over a family of rank-based portfolios parameterized by an exponentially concave function on the unit interval. This choice can be motivated by the long term stability of the capital distribution observed in large equity markets and allows us to circumvent the curse of dimensionality. The resulting optimization problem, which is convex, allows for various regularizations and constraints to be imposed on the generating function. We prove an existence and uniqueness result for our optimization problem and provide a stability estimate in terms of a Wasserstein metric of the input measure. Then we formulate a discretization which can be implemented numerically using available software packages and analyze its approximation error. Finally, we present empirical examples using CRSP data from the U.S. stock market, including the performance of the portfolios allowing for dividends, defaults, and transaction costs.

Layman Summary: A natural question to study in Mathematical Finance is the optimization of portfolios. There are different flavours of portfolio optimization problems. Some problems look at maximizing returns, others consider optimizing performance relative to some benchmark, and yet others consider a modified formulation that incorporates additional features or constraints. In this paper we consider maximizing the performance of a portfolio relative to the market and we allow for various constraints to be imposed. For instance, a portfolio manager may want to maximize their performance while taking care to limit the aggressiveness of their positions. Our portfolio optimization problem is posed in the setting of Stochastic Portfolio Theory (SPT). This is a mathematical framework that was introduced by Robert Fernholz for analyzing the behaviours of equity markets and portfolio selection. The theory identifies macroscopic properties of large markets that can be exploited by carefully constructed portfolios. To this end, Fernholz constructed a family of portfolios known as functionally generated portfolios. As suggested by their name, these portfolios are determined by a function, and we require these functions to have some specific properties. The existing theory tells us that under some assumptions the portfolios generated by these functions can outperform relative to the market over the long run. The focus of this paper is on deciding how to best choose these functions. If all the functions in this family define a portfolio that should eventually outperform, can we pick the one that performs the best? What about picking the best performer subject to some constraints? Solving this problem is generally difficult without assuming a particular structure for the market or a parametric form for the functions. In this work we pose and solve a non-parametric problem that does not rely on a model of the market and can be implemented numerically. We prove several desirable properties of this problem including the existence and uniqueness of the solution, and a notion of stability of the solution to the behaviour of the market. We then provide an algorithm to solve the problem numerically and demonstrate that the resulting numerical approximation to the problem can be taken arbitrarily close to the true solution. Finally, we provide some real-world examples of our solution by using CRSP data from the US stock market to illustrate the results of the optimization.


Casgrain, Philippe

Philippe Casgrain is currently a Postdoctoral researcher appointed jointly between ETH Zürich and Princeton University. Prior to this position, Philippe was a quantitative researcher in algorithmic execution at Citadel Asset Management in New York City. He completed his PhD in the Department of Statistical Sciences at the University of Toronto in 2018, under the supervision of Sebastian Jaimungal, where his thesis was focused integration of Mean-Field Games and Machine Learning for Algorithmic Trading. His main areas of research include Algorithmic Trading, Stochastic Control, Optimization and their intersection.

By Philippe Casgrain and Sebastian Jaimungal

Mathematical Finance 30.3 | 2020

Link to publication

Abstract: Even when confronted with the same data, agents often disagree on a model of the real-world. Here, we address the question of how interacting heterogenous agents, who disagree on what model the real-world follows, optimize their trading actions. The market has latent factors that drive prices, and agents account for the permanent impact they have on prices. This leads to a large stochastic game, where each agents' performance criteria are computed under a different probability measure. We analyse the mean-field game (MFG) limit of the stochastic game and show that the Nash equilibrium is given by the solution to a non-standard vector-valued forward-backward stochastic differential equation. Under some mild assumptions, we construct the solution in terms of expectations of the filtered states. Furthermore, we prove the MFG strategy forms an ϵ-Nash equilibrium for the finite player game. Lastly, we present a least-squares Monte Carlo based algorithm for computing the equilibria and show through simulations that increasing disagreement may increase price volatility and trading activity.

Layman Summary: This paper considers the problem of optimally trading an asset in a competitive market involving many agents with imperfect information who may have disagreeing views on how the market evolves over time. In order to be robust to the actions of others agents we identify strategies which form a Nash Equilibrium in this setting, where agents attempt maximize their future total book value, all while remaining conscious of price impact, transaction costs and risk exposure. Since the finite-agent version of this problem proves to be intractable, we consider the infinite-agent or 'mean-field' limit of this game. We show that Nash equilibria can be fully characterized in this regime and that these solutions also form approximate Nash equilibria when the number of agents is finite. Lastly, we test the resulting strategies on a variety of market simulations and show that disagreement empirically increases price volatility.


Chen, Bo

Bo Chen graduated in 2019 with a PhD from the U of T Department of Statistical Sciences where he was jointly supervised by Professor Radu V. Craiu and Professor Lei Sun. His research focuses on statistical genetics, and his doctoral thesis title was "Statistical Methods for X-inclusive Genome-wide Association Study". He is currently a postdoctoral fellow at the Princess Margaret Cancer Centre.

By Bo Chen, Radu Craiu, and Lei Sun

Biostatistics | 2020

Link to publication

Abstract: X-chromosome is often excluded from the so-called “whole-genome” association studies due to the differences it exhibits between males and females. One particular analytical challenge is the unknown status of X-inactivation, where one of the two X-chromosome variants in females may be randomly selected to be silenced. In the absence of biological evidence in favor of one specific model, we consider a Bayesian model averaging framework that offers a principled way to account for the inherent model uncertainty, providing model averaging-based posterior density intervals and Bayes factors. We examine the inferential properties of the proposed methods via extensive simulation studies, and we apply the methods to a genetic association study of an intestinal disease occurring in about 20% of cystic fibrosis patients. Compared with the results previously reported assuming the presence of inactivation, we show that the proposed Bayesian methods provide more feature-rich quantities that are useful in practice.

Layman Summary: Genome-wide association studies have successfully identified many genes influencing heritable and complex traits (e.g., blood pressure, breast cancer) in the past decade. However, one major gap in the current ‘whole-genome’ association analysis is the routine omission of the X-chromosome, because the available statistical association methods, developed for analyzing genes from the autosomes, are not directly applicable to the X-chromosome.

In genetic association studies, genotype of a gene must be coded numerically so that its association with the trait of interest can be modelled statistically. For genes on the autosomes, there is only one correct coding scheme. However, a phenomenon called X-chromosome inactivation affects the genotype coding of a X-chromosomal gene for a female. Whether X-chromosome is inactivated or not is usually unknown, which leads to two possible statistical models, resulting in different association results.
In the presence of uncertainty about the correct model, we adopt a Bayesian approach that incorporates model uncertainty when building evidence for association. This leads to considering a “weighted average” model that combines the posterior distribution and Bayes factor estimate from each model, where the weight represents the support for each model by the data. We differentiate and adapt our procedure depending on whether the outcome of interest is continuous (e.g., blood pressure) or binary (e.g., breast cancer).
In order to evaluate the performance of our proposed model averaging method, we use numerical experiments involving genes with different XCI status and different levels of association strength.
Finally, we apply our method to an X-chromosome association study of meconium ileus, a binary intestinal disease occurring in about 20% of the individuals with Cystic Fibrosis. Method previously used in that study assumed all genes on the X-chromosome are inactivated. We show that the Bayesian Model Averaging method i) confirms the findings from the previous study, ii) pinpoints additional genes that merit follow-up studies, and iii) provides more feature-rich quantities that help investigate and interpret genetic association in practice.

By Bo Chen, Radu V Craiu, Lisa J Strug, and Lei Sun

Genetic Epidemiology | 2021

Link to publication

Abstract: The X-chromosome is often excluded from genome-wide association studies because of analytical challenges. Some of the problems, such as the random, skewed, or no X-inactivation model uncertainty, have been investigated. Other considerations have received little to no attention, such as the value in considering nonadditive and gene-sex interaction effects, and the inferential consequence of choosing different baseline alleles (i.e., the reference vs. the alternative allele). Here we propose a unified and flexible regression-based association test for X-chromosomal variants. We provide theoretical justifications for its robustness in the presence of various model uncertainties, as well as for its improved power when compared with the existing approaches under certain scenarios. For completeness, we also revisit the autosomes and show that the proposed framework leads to a more robust approach than the standard method. Finally, we provide supporting evidence by revisiting several published association studies. Supporting Information for this article are available online.

Layman Summary: The work was highlighted by the International Genetic Epidemiology Society (IGES): The paper offers a deep dive into how best to model X-inactivation uncertainty in association testing. The challenge of meaningful effect size estimation is highlighted, which is critical to future X-chromosome-inclusive studies and translation such as with polygenic risk score-based prediction.


Chen, Sigeng

Sigeng Chen is a student of statistics entering the PhD program in 2019, specializing in Markov chain Monte Carlo (MCMC) algorithms. He also has interest in Bayesian Inference, Machine Learning, and Statistical Computing. He received his bachelor degree and master degree from the University of Waterloo. Read more.

By J.S. Rosenthal, A. Dote, K. Dabiri, H. Tamura, S. Chen, and A. Sheikholeslami

Computational Statistics | 2021

Link to publication

Abstract: We consider versions of the Metropolis algorithm which avoid the inefficiency of rejections. We first illustrate that a natural Uniform Selection algorithm might not converge to the correct distribution. We then analyze the use of Markov jump chains which avoid successive repetitions of the same state. After exploring the properties of jump chains, we show how they can exploit parallelism in computer hardware to produce more efficient samples. We apply our results to the Metropolis algorithm, to Parallel Tempering, to a Bayesian model, to a two-dimensional ferromagnetic 4×4Ising model, and to a pseudo-marginal MCMC algorithm.

Layman Summary: The Metropolis algorithm is a method of designing a Markov chain which converges to a given target density π on a state space S. We consider versions of the Metropolis algorithm which avoid the inefficiency of rejections. We first illustrate that a natural Uniform Selection algorithm might not converge to the correct distribution. We then analyze the use of Markov jump chains which avoid successive repetitions of the same state. After exploring the properties of jump chains, we show how they can exploit parallelism in computer hardware to produce more efficient samples. We apply our results to the Metropolis algorithm, to Parallel Tempering, to a Bayesian model, to a two-dimensional ferromagnetic 4×4Ising model, and to a pseudo-marginal MCMC algorithm.

By Sigeng Chen, Jeffrey S. Rosenthal, Aki Dote, Hirotaka Tamura, and Ali Sheikholeslami

Computational Statistics | 2022

Link to publication

Abstract: Simulated Annealing using Metropolis steps at decreasing temperatures is widely used to solve complex combinatorial optimization problems.  To improve its efficiency, we can use the Rejection-Free version of the Metropolis algorithm which avoids the inefficiency of rejections by considering all the neighbors at each step. To avoid the algorithm getting stuck in a local extreme area, we propose an enhanced version of Rejection-Free called Partial Neighbor Search, which only considers random part of the neighbors while applying the Rejection-Free technique. In this paper, we apply these methods to several examples such as quadratic unconstrained binary optimization (QUBO) problems to demonstrate superior performance of the Rejection-Free Partial Neighbor Search algorithm.

Layman Summary: Simulated Annealing is widely used to solve many complex computational problems such as the quadratic unconstrained binary optimization, the knapsack problem, the ising models, etc. Simulated Annealing proposes one neighbor at a time, computes a ratio of target probabilities, and decides whether accept the proposal or not. To improve its efficiency, we can use the Rejection-Free version of the Metropolis algorithm which avoids the inefficiency of rejections by considering all the neighbors at each step. To avoid the algorithm getting stuck in a local extreme area, we propose an enhanced version of Rejection-Free called Partial Neighbor Search, which only considers a random subset of the neighbors when applying the Rejection-Free technique. Rejection-Free Partial Neighbor Search algorithm can increase the efficiency and shorten the time used to find the optimal solution greatly. With the help of parallelism hardware, such improvement can be 100x to 10000x.


Deng, Wei

My research focuses on statistical methods for large scale high-dimensional genomics data with a particular emphasis on variance. My current interest includes addition genetics and the integration of multi-omics data to address relevant health research questions. Read more. 

By Wei Q. Deng and Lei Sun

G3: Genes, Genomes, Genetics 12(4):1-6 | 2022

Link to publication

Abstract: A joint analysis of location and scale can be a powerful tool in genome-wide association studies to uncover previously overlooked markers that influence a quantitative trait through both mean and variance, as well as to prioritize candidates for gene–environment interactions. This approach has recently been generalized to handle related samples, dosage data, and the analytically challenging X-chromosome. We disseminate the latest advances in methodology through a user-friendly R software package with added functionalities to support genome-wide analysis on individual-level or summary-level data. The implemented R package can be called from PLINK or directly in a scripting environment, to enable a streamlined genome-wide analysis for biobank-scale data. Application results on individual-level and summary-level data highlight the advantage of the joint test to discover more genome-wide signals as compared to a location or scale test alone. We hope the availability of gJLS2 software package will encourage more scale and/or joint analyses in large-scale datasets, and promote the standardized reporting of their P-values to be shared with the scientific community.

Layman Summary: Facilitated by efficient computational packages that can read and process large genomic data, the statistical software R is increasingly becoming the tool of choice for both methods development and novel analyses of genetic data. This trend to customize analysis and disseminate the latest research advances through software tools is evitable and exciting. Our work marks a significant update in software capabilities and in methods advancement to include the previously overlooked X chromosome as an important predictor of complex disease and traits. The added functionalities simplify the analysis pipeline for large scale genomic datasets and reduce computational time needed to work with such data via parallel computing. It is our hope that by including the recent advances in X-chromosome methodologies, we eliminate some of the barriers to an X-inclusive approach to the discovery of genetic determinants of health and disease risks.


Jia, Tianyi

Tianyi is a PhD student supervised by Professor Sebastian Jaimungal. His research studies algorithmic trading strategies of foreign exchange and equities under model misspecifications. Besides his PhD program, he is currently a senior manager at Enterprise Model Risk Management, GRM with Royal Bank Of Canada.

By Álvaro Cartea, Sebastian Jaimungal, and Tianyi Jia

SIAM J. Financial Mathematic | 2020 (accepted)

Link to publication

Abstract: We develop the optimal trading strategy for a foreign exchange (FX) broker who must liquidate a large position in an illiquid currency pair. To maximize revenues, the broker considers trading in a currency triplet which consists of the illiquid pair and two other liquid currency pairs. The liquid pairs in the triplet are chosen so that one of the pairs is redundant. The broker is risk-neutral and accounts for model ambiguity in the FX rates to make her strategy robust to model misspecification. When the broker is ambiguity neutral (averse) the trading strategy in each pair is independent (dependent) of the inventory in the other two pairs in the triplet. We employ simulations to illustrate how the robust strategies perform. For a range of ambiguity aversion parameters, we find the mean Profit and Loss (P&L) of the strategy increases and the standard deviation of the P&L decreases as ambiguity aversion increases.

Layman Summary: In the foreign exchange (FX) market, there are many currency pairs. Some of them are traded with large volumes, like the ones involving U.S. dollar (USD). Others are illiquid, such as the Australian and New Zealand dollar (AUD-NZD) pair. A FX broker needs to trade with her clients for their FX needs. Hence she needs to manage her FX exposures. Indeed, due to low clients' trading activities, our FX broker will accumulate non-zero positions of illiquid currency pairs easily. In addition, unlike the equity market, given three currencies, there are so-called "triangle relations" among their exchange rates. For example, when someone wants to convert Canadian dollar (CAD) to Euro (EUR), he or she can directly convert CAD to EUR, or convert CAD to USD first and then USD to EUR. Ideally, ignoring the transaction fees, the cost of CAD to gain 1 unit of EUR should be the same for both approaches. In our work, we develop trading strategies utilizing the triangle relations to manage our FX broker's inventories in the illiquid currency pairs. To the best of our knowledge our work is the first to show how FX brokers manage large positions in currency pairs.


Lalancette, Michaël

Michaël is a PhD candidate in the Department of Statistical Sciences, University of Toronto since September 2017, where he is supervised by Professor Stanislav Volgushev. Prior to joining U of T, he completed a BSc in Mathematics and an MSc in Statistics in the Department of Mathematics and Statistics, Université de Montréal. Read more.

By Michaël Lalancette, Sebastian Engelke, Stanislav Volgushev

Annals of Statistics | 2021 (accepted)

Link to publication

Abstract: Multivariate extreme value theory is concerned with modelling the joint tail behavior of several random variables. Existing work mostly focuses on asymptotic dependence, where the probability of observing a large value in one of the variables is of the same order as observing a large value in all variables simultaneously. However, there is growing evidence that asymptotic independence is equally important in real-world applications. Available statistical methodology in the latter setting is scarce and not well understood theoretically. We revisit non-parametric estimation and introduce rank-based M-estimators for parametric models that simultaneously work under asymptotic dependence and asymptotic independence, without requiring prior knowledge on which of the two regimes applies. Asymptotic normality of the proposed estimators is established under weak regularity conditions. We further show how bivariate estimators can be leveraged to obtain parametric estimators in spatial tail models, and again provide a thorough theoretical justification for our approach.

Layman summary: Extreme value theory is a branch of statistics that seeks to extrapolate outside the range of available data in order to model extreme events and forecast their severity. For instance, in this paper we use a data set containing daily rainfall measurements at 92 different locations across southern Australia over a period of 50 years. One may wonder how severe the most extreme rainfall in the next 100 years will be at each location, but also how dependent those extremes are on each other, and whether they are likely to occur simultaneously.

To describe our main contributions, first consider a bivariate context where only two variables of interest (such as rainfall at two locations) are measured. Each data point now consists of two measurements. In classical extreme value theory, an observation is considered extreme if at least one of the two variables exceeds a high threshold. Only those observations are then used to model the extremal behavior. However, this approach can fail in many environmental applications, such as rainfall at two selected locations in the aforementioned example. The trouble is that some environmental data sets tend to exhibit a property termed tail independence: for most observations where one variable is extreme, the other is not. This poses difficulties for accurate modelling and can lead to a considerable underestimation of the probability that the two variables are extreme simultaneously. To remedy this issue, the usual solution when the variables of interest are believed to be tail independent relies on restricting the sample to observations where both variables are simultaneously large. In summary, available methods for modelling data with extremal dependence and independence differ considerably, and there exists no approach that works in both scenarios.

In this paper, we offer new insights on bivariate extreme value theory under tail dependence and independence. We introduce a way to model bivariate distributions based on the simultaneous behavior of their extremes without prior knowledge of whether tail dependence or independence is appropriate in a particular application. The main practical advantage of our approach over existing ones is that it is agnostic to the presence of tail independence, making it applicable in a wide range of settings.

We further demonstrate that the results obtained in the bivariate setting are also useful in the spatial context where many locations are considered simultaneously. Here our method can be used to fit parametric models for spatial processes (in the earlier example, the spatial distribution of extreme rainfall over the whole region of interest). The strategy consists of considering a sufficiently large number of pairs of locations and treat each one as a bivariate data set. Our method may then be used to learn the extremal behavior of each pair, and we demonstrate how this knowledge can be leveraged to infer the extremal behavior of the entire spatial process.

Along the way, we provide new theoretical results about certain mathematical objects called empirical tail processes that form a fundamental building block for the analysis of most estimation procedures dealing with multivariate extremes.


Levi, Evgeny

Dr. Evgeny Levi is a Model Risk Specialist at Bank of Montreal. He studied Mathematics and Statistics at The University of Toronto (BS 2009, MS 2013) and received a PhD from the Department of Statistical Sciences at The University of Toronto in 2019. His supervisor was Professor Radu Craiu. His main research interests are in computational methods in statistics, especially, Markov Chain Monte Carlo (MCMC) and Approximate Bayesian Computation algorithms (ABC), dependence models and model selection procedures for copula models.

During his PhD studies, he published four articles in Bayesian Analysis, Computational Statistics & Data Analysis and Journal of Computational and Graphical Statistics journals. He was a recipient of Andrews Academic Achievement Award, Teaching Assistant Award, Queen Elizabeth II Graduate Scholarship and Ontario Graduate Scholarship.

By Evgeny Levi and Radu Craiu

Bayesian Analysis | 2021 (accepted)

Link to publication

Abstract: With larger data at their disposal, scientists are emboldened to tackle complex questions that require sophisticated statistical models. It is not unusual for the latter to have likelihood functions that elude analytical formulations. Even under such adversity, when one can simulate from the sampling distribution, Bayesian analysis can be conducted using approximate methods such as Approximate Bayesian Computation (ABC) or Bayesian Synthetic Likelihood (BSL). A significant drawback of these methods is that the number of required simulations can be prohibitively large, thus severely limiting their scope. In this paper we design perturbed MCMC samplers that can be used within the ABC and BSL paradigms to significantly accelerate computation while maintaining control on computational efficiency. The proposed strategy relies on recycling samples from the chain’s past. The algorithmic design is supported by a theoretical analysis while practical performance is examined via a series of simulation examples and data analyses.

Layman Summary: Bayesian computation currently encounters two type of challenges. One concerns the issue of computationally expensive likelihoods that usually arise from large data. The second, which is relevant for this paper, occurs because modern statistical models are often complex enough to yield intractable likelihoods. Classical Bayesian inference relies entirely on the posterior distribution which, in turn, is explored using Markov chain Monte Carlo (MCMC) simulation methods. The MCMC sampling algorithms are iterative and require at each iteration the calculation of the likelihood. Clearly, when the likelihood is not available analytically, the usual MCMC samplers  required to study the posterior cannot be implemented. A remedy is proposed by the so-called approximate Bayesian computation (ABC) algorithms which rely on the ability to generate, given values for the model parameters, data from the model.  Such an assumption is satisfied, for instance, by model emulation experiments, and is met by models increasingly used in scientific domains such as Astronomy and Astrophysics, Hydrology and Genetics, to name a few. The algorithms are still iterative and each iteration requires to simulate data like the ones observed, so their success depends on the ability to do this hundreds or thousands of times. This paper considers new strategies to mitigate the cost incurred when the data generation is computationally expensive. The approach proposed relies on recycling samples, thus minimizing, at each iteration, the need to produce multiple new data replicates. We consider implementations in the case of ABC and another approximate Bayesian method, called Bayesian Synthetic Likelihood (BSL). By using past samples, we create another approximation of the algorithm that one would have used in the ABC/BSL sampling, This compels us to demonstrate theoretically that the approximation introduced does not prove too costly in terms of statistical efficiency for the resulting estimators. The numerical experiments show that this is indeed the case, while the computational costs are significantly  smaller.

By Evgeny Levi, and Radu V Craiu

In: La Rocca M., Liseo B., Salmaso L. (eds) Nonparametric Statistics. ISNPS 2018. Springer Proceedings in Mathematics & Statistics, vol 339. Springer, Cham.

Link to publication

Abstract: The paper considers the problem of establishing data support for the simplifying assumption (SA) in a bivariate conditional copula model. It is known that SA greatly simplifies the inference for a conditional copula model, but standard tools and methods for testing SA in a Bayesian setting tend to not provide reliable results. After splitting the observed data into training and test sets, the method proposed will use a flexible Bayesian model fit to the training data to define tests based on randomization and standard asymptotic theory. Its performance is studied using simulated data. The paper’s supplementary material also discusses theoretical justification for the method and implementations in alternative models of interest, e.g. Gaussian, Logistic and Quantile regressions.

Layman summary: Let us imagine a situation in which a couple of dependent response variables are measured simultaneously along with a number of covariates. Traditional regression models will specify functional connections between each response variable and the covariates. In addition to marginal effects, one may be interested in understanding the dependence structure between the variables and how it changes with covariates.

Copulas are distribution functions that bind, in a mathematically coherent way, continuous marginal distributions to form a multivariate distribution. Any copula will correspond to a specific dependence structure, independently from the marginal models, thus yielding rich classes of multivariate distributions that can capture a large variety of dependence patterns.  

In a regression setting, the effect of covariates on the dependence between response variables is captured by a conditional copula model which allows the copula to evolve dynamically as covariates change. Estimation for conditional copula models can rarely be done in closed form, thus requiring numerical procedures for optimization or integration. This can increase the computational cost, especially when there is a cascade of estimations to be done, as in the case of vine-type factorizations. However, if each copula is constant, a condition known as the simplifying assumption (SA), then estimation is a lot simpler to perform since the parameter space of the model is significantly reduced. Various diagnostics have been proposed to identify whether the SA holds or not, but many suffer from lack of power to identify SA when it holds. In this paper we propose a permutation based diagnostic procedure that outperforms the competing methods in terms of identifying the correct structure.


Li, Mufan

Mufan (Bill) Li is a fifth year PhD student. His research interests are primarily focused on the theoretical analysis of Langevin diffusion based sampling complexity and scaling limits of deep neural networks.

By Mihai Nica, and Daniel M. Roy

Advances in Neural Information Processing Systems 34 | 2021

Link to publication

Abstract: Theoretical results show that neural networks can be approximated by Gaussian processes in the infinite-width limit. However, for fully connected networks, it has been previously shown that for any fixed network width, n, the Gaussian approximation gets worse as the network depth, d, increases. Given that modern networks are deep, this raises the question of how well modern architectures, like ResNets, are captured by the infinite-width limit. To provide a better approximation, we study ReLU ResNets in the infinite-depth-and-width limit, where both depth and width tend to infinity as their ratio, d/n, remains constant. In contrast to the Gaussian infinite-width limit, we show theoretically that the network exhibits log-Gaussian behaviour at initialization in the infinite-depth-and-width limit, with parameters depending on the ratio d/n. Using Monte Carlo simulations, we demonstrate that even basic properties of standard ResNet architectures are poorly captured by the Gaussian limit, but remarkably well captured by our log-Gaussian limit. Moreover, our analysis reveals that ReLU ResNets at initialization are hypoactivated: fewer than half of the ReLUs are activated. Additionally, we calculate the interlayer correlations, which have the effect of exponentially increasing the variance of the network output. Based on our analysis, we introduce Balanced ResNets, a simple architecture modification, which eliminates hypoactivation and interlayer correlations and is more amenable to theoretical analysis.

Layman Summary: Theoretical understanding neural networks remain unsatisfactory today largely due to a lack of a tractable mathematical description. While the best available infinite-width theories have achieve mathematical success, they are a poor model of real networks in practice. Our work studies the infinite-depth-and-width limit of a specific network architecture called ResNets, and show that it is possible to give a very precise description of finite and practical neural networks at initialization.

By Sinho Chewi, Murat A. Erdogdu, Mufan (Bill) Li, Ruoqi Shen, and Matthew Zhang

Conference on Learning Theory | 2022

Link to publication

Abstract: Classically, the continuous-time Langevin diffusion converges exponentially fast to its stationary distribution π under the sole assumption that π satisfies a Poincaré inequality. Using this fact to provide guarantees for the discrete-time Langevin Monte Carlo (LMC) algorithm, however, is considerably more challenging due to the need for working with chi-squared or Rényi divergences, and prior works have largely focused on strongly log-concave targets. In this work, we provide the first convergence guarantees for LMC assuming that π satisfies either a Latała--Oleszkiewicz or modified log-Sobolev inequality, which interpolates between the Poincaré and log-Sobolev settings. Unlike prior works, our results allow for weak smoothness and do not require convexity or dissipativity conditions.

Layman Summary: Intuitively, functional inequalities such as the logarithmic Sobolev and Poincaré inequalities describe the "difficulty" of sampling from a distribution. In particular, a well known open problem is to provide a sampling guarantee for distribution that satisfy a Poincaré inequality. In this work, we provide first upper bounds to this open problem for a specific algorithm known as Langevin Monte Carlo.


Markes, Sonia

Sonia Markes is currently a third-year PhD student, supervised by Nancy Reid and Linbo Wang, supported by an Ontario Graduate Scholarship. She is interested in causal inference. Her recent research efforts have explored applications in a range of settings, including sports and quantum physics. Previously, she studied math and physics at the University of Waterloo, earning a B.Sc in Honours Mathematical Physics and an M.Math in Applied Mathematics, following which she worked in marketing analytics.

By Jiaqi Yin, Thomas S. Richardson, and Linbo Wang

Biometrika | 2021

Link to publication

Abstract: Generalized linear models, such as logistic regression, are widely used to model the association between a treatment and a binary outcome as a function of baseline covariates. However, the coefficients of a logistic regression model correspond to log odds ratios, while subject-matter scientists are often interested in relative risks. Although odds ratios are sometimes used to approximate relative risks, this approximation is appropriate only when the outcome of interest is rare for all levels of the covariates. Poisson regressions do measure multiplicative treatment effects including relative risks, but with a binary outcome not all combinations of parameters lead to fitted means that are between zero and one. Enforcing this constraint makes the parameters variation dependent, which is undesirable for modelling, estimation and computation. Focusing on the special case where the treatment is also binary, Richardson et al. (2017) proposed a novel binomial regression model that allows direct modelling of the relative risk. The model uses a log odds product nuisance model leading to variation-independent parameter spaces. Building on this we present general approaches to modelling the multiplicative effect of a continuous or categorical treatment on a binary outcome. Monte Carlo simulations demonstrate the desirable performance of our proposed methods. An analysis of the relationship between passenger class and survival for passengers on the Titanic further exemplifies our methods.

Layman Summary: A standard technique for testing the efficacy of a treatment is to hold out a control group, that is, a portion of a study population is not given the treatment in question. Analysts or scientists compare the probability of an outcome in the group that received the treatment relative to the probability of that outcome in the control group. Often, the quantity they want to estimate is called relative risk, defined as the ratio between the risk in the treated group and the risk in the control group. For example, a marketing analyst may want to say that customers who recieved an email were twice as likely to shop as those who did not receive the offer, or a medical researcher may want to report that the risk of death under treatment was lower by a factor by 1/9.   When the probability of the outcome is small, relative risk can be approximated by an odds ratio and estimated with logistic regression, a widely used method. However, modelling is difficult when this approximation is not valid. In Richardson et al. (2017), they introduced a novel method for modeling relative risk when both the treatment and outcome are binary. In this paper, we extend this approach to continuous and categorical treatments, presenting generalized methods for modelling the multiplicative effect of treatment on a binary outcome. Simulation results demonstrate the efficiency of our methods compared to other methods. An analysis of the relationship between passenger class and survival for passengers on the Titanic is given as an example of our method's applicability.


Shrivats, Arvind

Arvind completed his PhD in April 2021, under the supervision of Professor Sebastian Jaimungal. During his time in the Department of Statistical Sciences, Arvind was part of the Mathematical Finance Group, with his work supported by the Ontario Graduate Scholarship. His research uses tools from stochastic control and mean-field games to study emissions markets (such as cap-and-trade or renewable energy certificate markets).

By Arvind Shrivats and Sebastian Jaimungal

Applied Mathematical Finance | 2020

Link to publication

Abstract: SREC markets are a relatively novel market-based system to incentivize the production of energy from solar means. A regulator imposes a floor on the amount of energy each regulated firm must generate from solar power in a given period and provides them with certificates for each generated MWh. Firms offset these certificates against the floor and pay a penalty for any lacking certificates. Certificates are tradable assets, allowing firms to purchase/sell them freely. In this work, we formulate a stochastic control problem for generating and trading in SREC markets from a regulated firm’s perspective. We account for generation and trading costs, the impact both have on SREC prices, provide a characterization of the optimal strategy and develop a numerical algorithm to solve this control problem. Through numerical experiments, we explore how a firm who acts optimally behaves under various conditions. We find that an optimal firm’s generation and trading behaviour can be separated into various regimes, based on the marginal benefit of obtaining an additional SREC, and validate our theoretical characterization of the optimal strategy. We also conduct parameter sensitivity experiments.

Layman summary: In this work, we consider optimal behaviour within a sub-class of environmental markets known as Solar Renewable Energy Certificate (SREC) markets. SREC markets can be thought of as an inverse to the more common carbon cap-and-trade markets. They are designed to incentivize the production of energy from solar means. A regulator imposes a floor on the amount of energy each regulated firm must generate from solar power in a given period and provides them with certificates for each generated MWh. Firms offset these certificates against the floor and pay a penalty for any lacking certificates. Certificates are tradable assets, allowing firms to purchase/sell them freely.

Firms within these markets face obvious questions of how to navigate the market optimally. To that end, we propose a stochastic environment that the firms face that represents this market, and use techniques from stochastic control to find exactly that. In doing so, we account for numerous costs and real-life complexities that these firms face. In particular, this includes generation and trading costs, as well as possible price impacts that firms have within this market. This work was among the first to provide thorough and rigorous analysis of optimal  behaviour within SREC markets (and indeed, carbon cap-and-trade markets) on the firm level, while accounting for these complexities. Prior work often focused on social optimality from the perspective of a fictitious omniscient social planner who can control everyone's behaviour. Instead, this work is more realistic, as it focuses on how individual firms would behave when faced by an exogenous market, which is closer to what actually happens.

After characterizing the optimal strategy theoretically, we develop a numerical algorithm to solve for it, and conduct many experiments to explore the characteristics and market implications of optimal firm behaviour. We also conduct parameter sensitivity experiments.


Slater, Justin

Justin is a second-year PhD candidate in statistical sciences at the University of Toronto, supervised by Patrick Brown and Jeff Rosenthal. His work is currently focused on Bayesian applications to modelling challenges in COVID-19. He is currently working on spatio-temporal models for COVID-19 that utilize population movement data, and Bayesian infection-fatality rate estimation in Canada. Read more.

By Justin J. Slater, Patrick E. Brown, and Jeffrey S. Rosenthal

Stat, Volume 10, Issue 1 | 2021

Link to publication

Abstract: As of October 2020, the death toll from the COVID‐19 pandemic has risen over 1.1 million deaths worldwide. Reliable estimates of mortality due to COVID‐19 are important to guide intervention strategies such as lockdowns and social distancing measures. In this paper, we develop a data‐driven model that accurately and consistently estimates COVID‐19 mortality at the regional level early in the epidemic, using only daily mortality counts as the input. We use a Bayesian hierarchical skew‐normal model with day‐of‐the‐week parameters to provide accurate projections of COVID‐19 mortality. We validate our projections by comparing our model to the projections made by the Institute for Health Metrics and Evaluation and highlight the importance of hierarchicalization and day‐of‐the‐week effect estimation.

Layman summary: As of October 2020, the death toll from the COVID‐19 pandemic has risen over 1.1 million deaths worldwide. Reliable estimates of mortality due to COVID‐19 are important to guide intervention strategies such as lockdowns and social distancing measures. We focused on mortality as opposed to case counts because case counts were mostly a function of how much testing was being done, while COVID-19 related death counts were more reliable because it was unlikely that someone would die of COVID-19 without being diagnosed. At the time of publication, one of the more popular websites used to track the pandemic was that of the Institute for Health Metrics and Evaluation (IHME). The model they were employing at the time was just a Normal distribution, which doesn’t really fit the shape of epidemic curves. Typically, there is exponential growth at first, followed by a slow decline in death counts as people start to distance more, lockdowns are implemented etc. The models that we were using in this paper (so called skew-normal models) captured this sharp increase and slow decline, and do a lot of other neat statistical things that allow for more accurate forecasts at the small-area level. Particularly, our model is good when a region’s death counts are low because our model implicitly borrows information from nearby regions. Additionally, our model accounted for day-of-the-week, allowing for more accurate day-to-day forecasts. The projections made by the IHME would be more pessimistic/optimistic depending on what day of the week you checked them.
In short, we compared our model to the IHME’s in the first wave of the epidemic and showed that our model forecasted COVID-19 mortality substantially better than the IHME. We showed that our model is fantastic at predicting first-wave mortality. Nearing the end of the project, many countries were entering their 2nd wave, and both our model and the IHME’s started performing worse. The IHME have since changed their methodology. Our model has since been extended to account for multiple waves. Modelling the COVID-19 pandemic is a constantly evolving challenge and models will need to continue to adapt in order to accurately capture the rise and fall of this deadly disease.

By Justin J. Slater, Patrick E. Brown, Jeffrey S. Rosenthal, and Jorge Mateu

Spatial Statistics | 2021

Link to publication

Abstract: Spatial dependence is usually introduced into spatial models using some measure of physical proximity. When analysing COVID-19 case counts, this makes sense as regions that are close together are more likely to have more people moving between them, spreading the disease. However, using the actual number of trips between each region may explain COVID-19 case counts better than physical proximity. In this paper, we investigate the efficacy of using telecommunications-derived mobility data to induce spatial dependence in spatial models applied to two Spanish communities’ COVID-19 case counts. We do this by extending Besag York Mollié (BYM) models to include both a physical adjacency effect, alongside a mobility effect. The mobility effect is given a Gaussian Markov random field prior, with the number of trips between regions as edge weights. We leverage modern parametrizations of BYM models to conclude that the number of people moving between regions better explains variation in COVID-19 case counts than physical proximity data. We suggest that this data should be used in conjunction with physical proximity data when developing spatial models for COVID-19 case counts.

Layman Summary: Infectious diseases, like COVID-19, can only be spread through person-to-person contact. Thus, any infectious disease model should account for the number of contacts between people in some way. With the advent of GPS tracking data from cellular phones, mobility and commuting patterns have recently been employed to model the spread of COVID-19, as mobility can be seen as a proxy for contacts between people. However, these data are relatively new and unexplored. We set out to see if this mobility data can augment statistical models describing the spread of COVID-19.  This study was focussed on two autonomous communities (AC) in Spain, Comunidad de Madrid and Castilla Y León. Each AC was separated into roughly 200 subregions. Our data included the number of trips between each these subregions within each AC, alongside the number of cases in each region.  In the absence of mobility data, it is common to account for contacts between people using a measure of physical proximity. In other words, the further apart two subregions are, the less likely two people from those regions will come into contact with one another. To investigate whether the mobility data outperforms physical proximity in this regard, we developed a model that included both proximity and mobility and observed which term better explained COVID-19 case counts. Much of the challenge of this work was extending an existing model in such a way that the physical proximity term and the mobility term were directly comparable.  We found that our new model with cellphone mobility data was substantially better at explaining COVID-19 case counts than the one with just physical proximity. Although the mobility signal was much stronger than the proximity signal, we recommend using both simultaneously when modelling the spread of COVID-19.


Stringer, Alex

Alex is a PhD candidate in Statistics at the University of Toronto, supervised by Professors Patrick Brown and Jamie Stafford. He develops novel methods for fitting complex models to large datasets using Bayesian inference. In his doctoral work, he has introduced a novel class of models, Extended Latent Gaussian Models (ELGMs), and demonstrated their application in spatial epidemiology, astrophysics, and other areas. His work is funded by an NSERC Postgraduate Fellowship - Doctoral, and he has been awarded the Faculty of Arts and Science's Doctoral Excellence Scholarship.

By Alex Stringer, Patrick Brown, and Jamie Stafford

Biometrics | 2020

Link to publication

Abstract: A case‐crossover analysis is used as a simple but powerful tool for estimating the effect of short‐term environmental factors such as extreme temperatures or poor air quality on mortality. The environment on the day of each death is compared to the one or more “control days” in previous weeks, and higher levels of exposure on death days than control days provide evidence of an effect. Current state‐of‐the‐art methodology and software (integrated nested Laplace approximation [INLA]) cannot be used to fit the most flexible case‐crossover models to large datasets, because the likelihood for case‐crossover models cannot be expressed in a manner compatible with this methodology. In this paper, we develop a flexible and scalable modeling framework for case‐crossover models with linear and semiparametric effects which retains the flexibility and computational advantages of INLA. We apply our method to quantify nonlinear associations between mortality and extreme temperatures in India. An R package implementing our methods is publicly available.

Layman summary: We developed methods for estimating associations between mortality/morbidity (death/illness) risk and environmental factors like temperature or air pollution. Doing so involves comparing exposure levels for each subject on the date of death/illness and several previous days, and determining risk as a function of exposure from such data requires advanced statistical techniques. Our method accommodates very general types of exposure/risk associations and scales to large datasets, which are both of substantial practical concern in problems of this type. We applied the method to study association of temperature mortality in India using data from the Indian Million Deaths Study, a large and high-quality source of individual mortality data. 


Tang, Dingke

Dingke is a second-year PhD candidate in statistical sciences at the University of Toronto, supervised by Professor Linbo Wang and Dehan Kong. His main areas of research include High Dimensional Statistics, Causal inference, Statistical Genetics and their intersection. Read more.

By Dingke Tang, Dehan Kong, Wenliang Pan, and Linbo Wang

Biometrics | 2022

Link to publication

Abstract: Causal inference has been increasingly reliant on observational studies with rich covariate information. To build tractable causal procedures, such as the doubly robust estimators, it is imperative to first extract important features from high or even ultra-high dimensional data. In this paper, we propose causal ball screening for confounder selection from modern ultra-high dimensional data sets. Unlike the familiar task of variable selection for prediction modeling, our confounder selection procedure aims to control for confounding while improving efficiency in the resulting causal effect estimate. Previous empirical and theoretical studies suggest excluding causes of the treatment that are not confounders. Motivated by these results, our goal is to keep all the predictors of the outcome in both the propensity score and outcome regression models. A distinctive feature of our proposal is that we use an outcome model-free procedure for propensity score model selection, thereby maintaining double robustness in the resulting causal effect estimator. Our theoretical analyses show that the proposed procedure enjoys a number of properties, including model selection consistency and pointwise normality. Synthetic and real data analysis show that our proposal performs favorably with existing methods in a range of realistic settings. Data used in preparation of this paper were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database.

Layman Summary: We developed a novel method that allows for adjustment of millions of genetic covariates simultaneously when estimating the average causal effect . This method leads to novel understanding of how tau protein level in cerebrospinal fluid affects the development of Alzheimer’s disease.


Tang, Yanbo

Yanbo Tang is currently a fourth-year PhD student at the Department of Statistical Sciences under the joint supervision of Nancy Reid and Daniel Roy. His primary research interest lies in asymptotic theory and, more specifically in the performance of inferential and approximation procedures in high-dimensional problems.

By Yanbo Tang and Nancy Reid

Journal of the Royal Statistical Society: Series B | 2020

Link to publication

Abstract: We examine a higher order approximation to the significance function with increasing numbers of nuisance parameters, based on the normal approximation to an adjusted log‐likelihood root. We show that the rate of the correction for nuisance parameters is larger than the correction for non‐normality, when the parameter dimension p is O(n^α) for 𝛼<1/2 . We specialize the results to linear exponential families and location–scale families and illustrate these with simulations.

Layman summary: Although there are many predictive procedures in high dimensions whose performances are well characterized, such as LASSO and its many variants, the problem of inference in high dimensions is not as well studied. In this work we examine the use of the modified likelihood root for inference on a scalar parameter of interest when the number of nuisance parameters increases with the number of observations. The modified likelihood root can be expressed as the likelihood root statistic (a statistic obtained by taking the signed square root of likelihood ratio statistic) with an additional adjustment factor. The adjustment factor can be further broken down into an adjustment for the non-normality of the statistic and an adjustment for the influence of the nuisance parameters. We show that for general models, the adjustment for the nuisance parameters is larger than the non-normality adjustment. The above suggests that the main issue with the use of the unadjusted likelihood root in high dimensions is not a failure to converge to a normal limit, but rather the possible presence of a location or scale bias. We also present specialized results for the linear exponential and location scale families. Simulations studies are performed, demonstrating the normal approximation to the modified likelihood root is superior to the un-modified version in the logistic and Weibull regression models.


Tseung, Chau Lung Ngan Spark

Chau Lung Ngan Spark Tseung is a third-year PhD student in the Department of Statistical Sciences, advised by Professors Andrei Badescu and Sheldon Lin. His research focus is in actuarial science. His academic goal is to apply novel machine learning and data science methods to solving actuarial problems such as pricing and claims reserving. Read more.

By Spark C. Tseung, Andrei L. Badescu, Tsz Chai Fung, and X. Sheldon Lin

Annals of Actuarial Science | 2021

Link to publication

Abstract: This paper introduces a new julia package, LRMoE, a statistical software tailor-made for actuarial applications, which allows actuarial researchers and practitioners to model and analyse insurance loss frequencies and severities using the Logit-weighted Reduced Mixture-of-Experts (LRMoE) model. LRMoE offers several new distinctive features which are motivated by various actuarial applications and mostly cannot be achieved using existing packages for mixture models. Key features include a wider coverage on frequency and severity distributions and their zero inflation, the flexibility to vary classes of distributions across components, parameter estimation under data censoring and truncation and a collection of insurance ratemaking and reserving functions. The package also provides several model evaluation and visualisation functions to help users easily analyse the performance of the fitted model and interpret the model in insurance contexts.

Layman summary: We build a statistical software package that allows actuaries to better model insurance losses. The package implements a theoretically flexible modelling framework, which is shown to outperform some classical tools currently used by practitioners. The package is written in julia, a novel programming language which is considerably faster than traditional languages designed for statistical analysis, facilitating the analysis of large datasets commonly encountered in practice.


Wang, Fan

Fan Wang received her B.A. in Statistics and Actuarial Science from the University of Toronto in 2016, and continued her studies in Statistics at the University of Toronto as a direct entry into the PhD program. She is currently a PhD candidate under the supervision of Professors Lisa Strug and Lei Sun. Her thesis focuses on developing theoretical and computational models to integrate ‘omics’ data such as genetic association and expression quantitative trait locus summary statistics. Her main areas of research include statistical genetics and high-dimensional data analysis.

By Fan Wang, Naim Panjwani, Cheng Wang, Lei Sun, and Lisa J Strug

The American Journal of Human Genetics | 2022

Link to publication

Abstract: Mucus obstruction is a central feature in the cystic fibrosis (CF) airways. A genome-wide association study (GWAS) of lung disease by the CF Gene Modifier Consortium (CFGMC) identified a significant locus containing two mucin genes, MUC20 and MUC4. Expression quantitative trait locus (eQTL) analysis using human nasal epithelia (HNE) from 94 CF-affected Canadians in the CFGMC demonstrated MUC4 eQTLs that mirrored the lung association pattern in the region, suggesting that MUC4 expression may mediate CF lung disease. Complications arose, however, with colocalization testing using existing methods: the locus is complex and the associated SNPs span a 0.2 Mb region with high linkage disequilibrium (LD) and evidence of allelic heterogeneity. We previously developed the Simple Sum (SS), a powerful colocalization test in regions with allelic heterogeneity, but SS assumed eQTLs to be present to achieve type I error control. Here we propose a two-stage SS (SS2) colocalization test that avoids a priori eQTL assumptions, accounts for multiple hypothesis testing and the composite null hypothesis, and enables meta-analysis. We compare SS2 to published approaches through simulation and demonstrate type I error control for all settings with the greatest power in the presence of high LD and allelic heterogeneity. Applying SS2 to the MUC20/MUC4 CF lung disease locus with eQTLs from CF HNE revealed significant colocalization with MUC4 (p = 1.31 × 10−5) rather than with MUC20. The SS2 is a powerful method to inform the responsible gene(s) at a locus and guide future functional studies. SS2 has been implemented in the application LocusFocus.

Layman Summary: Genome-wide association studies (GWAS) have been very successful in identifying variants in the genome that associate with complex human traits.  But GWAS identify loci and to make progress and move forward with functional investigation one must determine the responsible gene and tissue of action represented by the association.  In Wang et al (2022) we provide a robust, easy to implement colocalization tool that integrates GWAS results with other omics data such as gene expression to uncover the mechanism responsible for the GWAS signal and guide future functional investigations.  The new colocalization tool is implemented in the application LocusFocus.


Yu, Lu

Lu Yu is a Ph.D. candidate at the Department of Statistical Sciences under the joint supervision of Stanislav Volgushev and Murat A. Erdogdu. Her primary research interest lies in the confluence of statistics, optimization, and machine learning. Currently, she is working on stochastic optimization and statistical learning theory.

By Lu Yu, Krishnakumar Balasubramanian, Stanislav Volgushev, and Murat A. Erdogdu

Advances in Neural Information Processing Systems (NeurIPS) | 2021

Link to publication

Abstract: Structured non-convex learning problems, for which critical points have favorable statistical properties, arise frequently in statistical machine learning. Algorithmic convergence and statistical estimation rates are well-understood for such problems. However, quantifying the uncertainty associated with the underlying training algorithm is not well-studied in the non-convex setting. In order to address this shortcoming, in this work, we establish an asymptotic normality result for the constant step size stochastic gradient descent (SGD) algorithm—a widely used algorithm in practice. Specifically, based on the relationship between SGD and Markov Chains [1], we show that the average of SGD iterates is asymptotically normally distributed around the expected value of their unique invariant distribution, as long as the non-convex and non-smooth objective function satisfies a dissipativity property. We also characterize the bias between this expected value and the critical points of the objective function under various local regularity conditions. Together, the above two results could be leveraged to construct confidence intervals for non-convex problems that are trained using the SGD algorithm.

Layman Summary: Recently, the stochastic gradient descent (SGD) algorithm has been a popular choice for training statistical models due to its simplicity and good performance in large-scale settings. Lots of research has been conducted to study the fluctuation of this algorithm when the objective function is strongly convex and smooth, with the step size satisfying a specific decreasing schedule. However, little is known in the context of non-convex optimization with the SGD algorithm. In this work, we focus on the widely used SGD algorithm with constant step size, and develop results for quantifying the uncertainty associated with this algorithm for a class of non-convex problems.

By Lu Yu, Tobias Kaufmann, and Johannes Lederer

AISTATS | 2021

Link to publication

Abstract: The increasing availability of data has generated unprecedented prospects for network analyses in many biological fields, such as neuroscience (e.g., brain networks), genomics (e.g., gene-gene interaction networks), and ecology (e.g., species interaction networks). A powerful statistical framework for estimating such networks is Gaussian graphical models, but standard estimators for the corresponding graphs are prone to large numbers of false discoveries. In this paper, we introduce a novel graph estimator based on knockoffs that imitate the partial correlation structures of unconnected nodes. We show that this new estimator guarantees accurate control of the false discovery rate in theory, simulations, and biological applications, and we provide easy-to-use R code.

Layman Summary: Biological processes can often be formulated as networks; examples include gene-gene regulation, functional brain networks, and microbiome networks. A common statistical framework for such networks are Gaussian graphical models. In this work, our objective is inference in terms of control over the false discovery rate (FDR), which is the expected proportion of falsely selected edges over all selected edges. To this end, we introduce a novel method for FDR control in graphical modeling and support our method both mathematically and numerically. We apply the proposed method to biological network data sets and provide new insights into biological data.

By Nuri Mert Vural, Lu Yu, Krishnakumar Balasubramanian, Stanislav Volgushev, and Murat A. Erdogdu

COLT | 2022

Link to publication

Abstract: We study stochastic convex optimization under infinite noise variance. Specifically, when the stochastic gradient is unbiased and has uniformly bounded (1+κ)-th moment, for some κ∈(0,1], we quantify the convergence rate of the Stochastic Mirror Descent algorithm with a particular class of uniformly convex mirror maps, in terms of the number of iterations, dimensionality and related geometric parameters of the optimization problem. Interestingly this algorithm does not require any explicit gradient clipping or normalization, which has been extensively used in several recent empirical and theoretical works. We complement our convergence results with information-theoretic lower bounds showing that no other algorithm using only stochastic first-order oracles can achieve improved rates. Our results have several interesting consequences for devising online/streaming stochastic approximation algorithms for problems arising in robust statistics and machine learning.

Layman Summary: In this work, we investigate stochastic convex optimization under infinite noise variance. To be more specific, we show that the stochastic mirror descent with an appropriate mirror map has inherent robustness to heavy-tailed gradient noise, and achieves the information-theoretic lower bound for stochastic convex optimization under infinite noise variance.


Zhang, Lin

Lin is a PhD candidate in the Department of Statistical Sciences at the University of Toronto, working with Professor Lei Sun. Her thesis focuses on developing statistical models for robust association testing with applications in genetics. She is also a trainee of the CANSSI-Ontario STAGE training program at the University of Toronto.

By Lin Zhang and Lei Sun

Biometrics | 2021

Link to publication

Abstract: The allele-based association test, comparing allele frequency difference between case and control groups, is locally most powerful. However, application of the classical allelic test is limited in practice, because the method is sensitive to the Hardy-Weinberg equilibrium (HWE) assumption, not applicable to continuous traits, and not easy to account for covariate effect or sample correlation. To develop a generalized robust allelic test, we propose a new allele-based regression model with individual allele as the response variable. We show that the score test statistic derived from this robust and unifying regression framework contains a correction factor that explicitly adjusts for potential departure from HWE, and encompasses the classical allelic test as a special case. When the trait of interest is continuous, the corresponding allelic test evaluates a weighted difference between individual-level allele frequency estimate and sample estimate where the weight is proportional to an individual's trait value, and the test remains valid under Y-dependent sampling. Finally, the proposed allele-based method can analyze multiple (continuous or binary) phenotypes simultaneously and multi-allelic genetic markers, while accounting for covariate effect, sample correlation and population heterogeneity. To support our analytical findings, we provide empirical evidence from both simulation and application studies.

Layman summary: Genetic association studies aim to identify genetic variations that influence heritable traits while accounting for environmental effects. A trait of interest can be continuous (e.g. height) or binary (e.g. the status of Alzheimer’s disease), and association analysis of a binary trait is also known as a case-control study.  
The most common type of genetic variations are single nucleotide polymorphisms (SNPs), with over a million SNPs across the human genome. The million SNPs allow us to interrogate the whole genome for important genetic variations but also lead to the multiple hypothesis testing issue; a SNP can appear to be important just by chance. Thus, a good association method needs to be powerful, flexible and computationally efficient.   
For a case-control study, there is a powerful method called allelic association test. However, the existing allelic association test is not flexible: It cannot account for environmental effects, it cannot analyze family members who are genetically related with each other, it cannot analyze a continuous trait, and it is sensitive to the assumption of Hardy-Weinberg equilibrium that may be violated in practice.  
This work resolves these issues by proposing a new association method via a novel allele-based regression, which includes the classical allelic test as a special case for simple data settings. This work also shows that “simple” regression remains an important method in the era of big and complex data. 

By Lin Zhang and Lei Sun

Frontiers in Genetics - Statistical Genetics and Methodology | 2022

Link to publication

Abstract: For genetic association studies with related individuals, the linear mixed-effect model is the most commonly used method. In this report, we show that contrary to the popular belief, this standard method can be sensitive to departure from Hardy–Weinberg equilibrium (i.e., Hardy–Weinberg disequilibrium) at the causal SNPs in two ways. First, when the trait heritability is treated as a nuisance parameter, although the association test has correct type I error control, the resulting heritability estimate can be biased, often upward, in the presence of Hardy–Weinberg disequilibrium. Second, if the true heritability is used in the linear mixed-effect model, then the corresponding association test can be biased in the presence of Hardy–Weinberg disequilibrium. We provide some analytical insights along with supporting empirical results from simulation and application studies.

Layman Summary: Genetic association tests aim to identify genetic markers associated with a phenotype trait. For a sample of unrelated individuals, linear regression model is the most commonly used method in modern genome-wide association studies (GWAS). When the sample contains related individuals, a common practice is to adopt the linear mixed-effect model (LMM) where the genetic heritability and phenotypic variation are reflected in the variance-covariance matrix. In this report, we show that contrary to the popular belief, LMM can be sensitive to departure from Hardy–Weinberg equilibrium (e.g. Hardy–Weinberg disequilibrium), at the causal SNPs. Hardy-Weinberg equilibrium is a fundamental principle in population genetics, stating that genotype frequencies in a population can be determined by allele frequencies when there are no other evolutionary influences. In this report, we demonstrate the influence of Hardy–Weinberg disequilibrium at the causal SNPs on the accuracy on the LMM-based association tests with analytical insights along with supporting empirical results from simulation and application studies.