Publications: Computational Statistics

2022-23


By Radu V. Craiu and Evgeny Levi

Annual Review of Statistics and Its Application | 2023 | 10, 379-400

Rich data generating mechanisms are ubiquitous in this age of information and require complex statistical models to draw meaningful inference. While Bayesian analysis has seen enormous development in the last 30 years, benefitting from the impetus given by the successful application of Markov chain Monte Carlo (MCMC) sampling, the combination of big data and complex models conspires to produce significant challenges for traditional MCMC algorithms. We review modern algorithmic developments that address these challenges and compare their performance using numerical experiments.

Read more

By Lindsay Katz and Rohan Alexander

Scientific Data | 2023 | 10, 567

Public knowledge of what is said in parliament is a tenet of democracy, and a critical resource for political science research. In Australia, following the British tradition, the written record of what is said in parliament is known as Hansard. While the Australian Hansard has always been publicly available, it has been difficult to use for the purpose of large-scale macro- and micro-level text analysis because it has only been available as PDFs or XMLs. Following the lead of the Linked Parliamentary Data project which achieved this for Canada, we provide a new, comprehensive, high-quality, rectangular database that captures proceedings of the Australian parliamentary debates from 1998 to 2022. The database is publicly available and can be linked to other datasets such as election results. The creation and accessibility of this database enables the exploration of new questions and serves as a valuable resource for both researchers and policymakers.

Read more

By Roberto Casarin, Radu V. Craiu, Lorenzo Frattarolo and Christian Robert

Statistical Science | 2023 | accepted

We identify recurrent ingredients in the antithetic sampling literature, leading to a unified sampling framework. We introduce a new class of antithetic schemes that includes the most commonly used antithetic proposals.
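
Illustrative sketch (not from the paper): the classical antithetic-variates construction that a unified framework of this kind generalizes, estimating E[h(U)] for U ~ Uniform(0, 1) by pairing each draw u with 1 - u. The integrand and sample size are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    h = np.exp                    # integrand; E[h(U)] = e - 1 for U ~ Uniform(0, 1)
    n = 10_000

    u = rng.uniform(size=n)
    plain = h(u).mean()                            # ordinary Monte Carlo
    antithetic = 0.5 * (h(u) + h(1.0 - u)).mean()  # pair u with 1 - u (negatively correlated)

    print(plain, antithetic, np.exp(1) - 1)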

Read more

By Annie Collins and Rohan Alexander

Scientometrics | 2022

To examine the reproducibility of COVID-19 research, we create a dataset of pre-prints posted to arXiv, bioRxiv, and medRxiv between 28 January 2020 and 30 June 2021 that are related to COVID-19. We extract the text from these pre-prints and parse them, looking for keyword markers signaling the availability of the data and code underpinning the pre-print. Among the pre-prints in our sample, we are unable to find markers of either open data or open code for 75% of those on arXiv, 67% of those on bioRxiv, and 79% of those on medRxiv.
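
Illustrative sketch (not the study's code): a keyword-marker scan of the kind described above. The specific phrases below are hypothetical stand-ins for the markers used in the paper.

    import re

    # Hypothetical marker phrases signalling open data / open code availability.
    DATA_MARKERS = [r"data (are|is) available", r"\bfigshare\b", r"\bdryad\b", r"\bzenodo\b"]
    CODE_MARKERS = [r"code (is|are) available", r"\bgithub\.com\b", r"\bgitlab\.com\b"]

    def has_marker(text, patterns):
        return any(re.search(p, text, flags=re.IGNORECASE) for p in patterns)

    def classify(preprint_text):
        return {"open_data": has_marker(preprint_text, DATA_MARKERS),
                "open_code": has_marker(preprint_text, CODE_MARKERS)}

    print(classify("All code is available at github.com/example/repo."))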

Read more


Previous Publications

Oops I Took A Gradient: Scalable Sampling for Discrete Distributions

by Will Grathwohl, Milad Hashemi, Kevin Swersky, David Duvenaud, and Chris Maddison

International Conference on Machine Learning | 2021 (accepted)

Short Summary: We propose a general and scalable approximate sampling strategy for probabilistic models with discrete variables. Our approach uses gradients of the likelihood function with respect to its discrete inputs to propose updates in a Metropolis-Hastings sampler. We show empirically that this approach outperforms generic samplers in a number of difficult settings, including Ising models, Potts models, restricted Boltzmann machines, and factorial hidden Markov models. We also demonstrate the use of our improved sampler for training deep energy-based models on high dimensional discrete image data. This approach outperforms variational auto-encoders and existing energy-based models. Finally, we give bounds showing that our approach is near-optimal in the class of samplers which propose local updates.
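
Illustrative sketch (not the authors' implementation): a gradient-informed flip proposal for a toy binary quadratic (Ising-like) model, in the spirit described above. The model, the temperature of 2 in the proposal, and all tuning constants are assumptions made here for illustration.

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy Ising-like model on d binary variables: log p(x) = x^T W x + b^T x + const.
    d = 8
    W = rng.normal(scale=0.3, size=(d, d)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
    b = rng.normal(scale=0.5, size=d)

    def logp(x):
        return x @ W @ x + b @ x

    def flip_gains(x):
        # Change in log p from flipping each bit; for this quadratic model the
        # "gradient" 2 W x + b gives these changes exactly.
        return (1 - 2 * x) * (2 * W @ x + b)

    def softmax(v):
        v = v - v.max()
        e = np.exp(v)
        return e / e.sum()

    x = rng.integers(0, 2, size=d).astype(float)
    for _ in range(5000):
        q_x = softmax(flip_gains(x) / 2.0)          # gradient-informed flip proposal
        i = rng.choice(d, p=q_x)
        y = x.copy(); y[i] = 1 - y[i]
        q_y = softmax(flip_gains(y) / 2.0)
        log_alpha = logp(y) - logp(x) + np.log(q_y[i]) - np.log(q_x[i])
        if np.log(rng.uniform()) < log_alpha:
            x = y
    print(x)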

Read more

Robust Risk-Aware Reinforcement Learning

by Sebastian Jaimungal, Silvana M. Pesenti, Ye Sheng Wang, and Hariom Tatsat

SIAM Journal on Financial Mathematics | 2021 | 13 (1), 213-226

Short Summary: We present a reinforcement learning (RL) approach for robust optimization of risk-aware performance criteria. To allow agents to express a wide variety of risk-reward profiles, we assess the value of a policy using rank dependent expected utility (RDEU). RDEU allows agents to seek gains, while simultaneously protecting themselves against downside risk. To robustify optimal policies against model uncertainty, we assess a policy not by its distribution but rather by the worst possible distribution that lies within a Wasserstein ball around it. Thus, our problem formulation may be viewed as an actor/agent choosing a policy (the outer problem) and the adversary then acting to worsen the performance of that strategy (the inner problem). We develop explicit policy gradient formulae for the inner and outer problems and show their efficacy on three prototypical financial problems: robust portfolio allocation, benchmark optimization, and statistical arbitrage.
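
Illustrative sketch (not from the paper): evaluating a rank-dependent expected utility on simulated terminal wealth. The utility and distortion functions below are assumed forms for illustration, and the robustification over a Wasserstein ball is not shown.

    import numpy as np

    def rdeu(samples, u, g):
        # Rank-dependent expected utility of an empirical distribution: sort outcomes
        # and weight the i-th order statistic by g(i/n) - g((i-1)/n).
        x = np.sort(np.asarray(samples))
        n = len(x)
        grid = np.arange(n + 1) / n
        weights = np.diff(g(grid))
        return np.sum(weights * u(x))

    u = lambda x: 1.0 - np.exp(-0.5 * x)   # assumed concave utility
    g = lambda p: p ** 0.7                 # assumed concave distortion: overweights the worst outcomes

    rng = np.random.default_rng(0)
    terminal_wealth = rng.normal(loc=1.0, scale=0.5, size=100_000)
    print(rdeu(terminal_wealth, u, g), u(terminal_wealth).mean())  # RDEU vs plain expected utility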

Read more

Computational Skills by Stealth in Introductory Data Science Teaching

by Wesley Burr, Fanny Chevalier, Christopher Collins, Alison L Gibbs, Raymond Ng, and Chris Wild

Teaching Statistics | 2021 (accepted) | DOI: 10.1111/test.12277

Short Summary: In 2010, Nolan and Temple Lang proposed the “integration of computing concepts into statistics curricula at all levels.” The unprecedented growth in data and the emphasis on data science have provided an impetus to finally realize full implementations of this in new statistics and data science programs and courses. We discuss a proposal for the stealth development of computational skills through careful, scaffolded exposure to computation and its power in students’ introduction to data science. Our intent is to support students, regardless of their interest and self-efficacy in coding, in becoming data-driven learners who are capable of asking complex questions about the world around them and then answering those questions through data-driven inquiry. Reference is made to the computer science and statistics consensus curriculum frameworks that the International Data Science in Schools Project (IDSSP) recently published for secondary school data science and introductory tertiary programs, designed to optimize data-science accessibility.

Read more

Double Happiness: Enhancing the Coupled Gains of L-lag Coupling via Control Variates

by Radu V. Craiu and Xiao-Li Meng

Statistica Sinica | 2021 (Accepted)

Short Summary: The paper adds two innovations to the general construction of unbiased MCMC estimators using L-lag coupling that has been developed, in a series of papers, by Pierre Jacob and his collaborators. One is to consider the use of control variates to increase the efficiency of the estimators. The control variates are easily available since they are provided by the coupling construction itself. An added bonus is that the new estimators lead to tighter bounds on the total variation distance between the chain's distribution after k iterations and its stationary distribution.
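
For context, the unbiased-estimation identity behind this construction, shown here in the familiar 1-lag form (the paper works with general lag L and adds control variates built from the same coupled chains):

    H_k(X, Y) \;=\; h(X_k) \;+\; \sum_{t=k+1}^{\tau - 1} \bigl( h(X_t) - h(Y_{t-1}) \bigr),
    \qquad \tau = \inf\{\, t \ge 1 : X_t = Y_{t-1} \,\},

where X and Y are chains with the same transition kernel, coupled so that Y lags X by one step and the two meet at the random time τ; the expectation of H_k equals the stationary expectation of h, and the control variates mentioned above are likewise supplied by the coupled chains themselves.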

Read more

Dual Space Preconditioning for Gradient Descent

by Chris J. Maddison, Daniel Paulin, Yee Whye Teh, and Arnaud Doucet

SIAM J. Optim. | 2021 | 31 (1), 991-1016

Short Summary: The conditions of relative smoothness and relative strong convexity were recently introduced for the analysis of Bregman gradient methods for convex optimization. We introduce a generalized left-preconditioning method for gradient descent and show that its convergence on an essentially smooth convex objective function can be guaranteed via an application of relative smoothness in the dual space. Our relative smoothness assumption is between the designed preconditioner and the convex conjugate of the objective, and it generalizes the typical Lipschitz gradient assumption. Under dual relative strong convexity, we obtain linear convergence with a generalized condition number that is invariant under horizontal translations, distinguishing it from Bregman gradient methods. Thus, in principle our method is capable of improving the conditioning of gradient descent on problems with a non-Lipschitz gradient or nonstrongly convex structure. We demonstrate our method on p-norm regression and exponential penalty function minimization.
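
Illustrative sketch (an assumption-laden reading of the scheme, not the authors' code): gradient descent in which a dual map ∇h* is applied to the gradient before stepping, shown on a toy quartic regression loss with a matching power-type preconditioner. The objective, preconditioner, and step size are all choices made here for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(20, 3)) / np.sqrt(20)
    b = rng.normal(size=20)

    def f(x):            # toy essentially smooth objective: quartic regression loss
        return 0.25 * np.sum((A @ x - b) ** 4)

    def grad_f(x):
        return A.T @ ((A @ x - b) ** 3)

    def grad_h_star(y):  # dual preconditioner paired with h(x) = ||x||_4^4 / 4 (elementwise cube root)
        return np.sign(y) * np.abs(y) ** (1.0 / 3.0)

    x, step = np.zeros(3), 0.1
    for _ in range(2000):
        x = x - step * grad_h_star(grad_f(x))   # assumed update: precondition the gradient in the dual
    print(f(np.zeros(3)), f(x))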

Read more

Finding Our Way in the Dark: Approximate MCMC for Approximate Bayesian Methods

by Evgeny Levi and Radu V. Craiu

Bayesian Analysis | 2021 (Accepted)

Short Summary: In this paper we design perturbed MCMC samplers that can be used within the Approximate Bayesian Computation (ABC) and Bayesian Synthetic Likelihood (BSL) paradigms to significantly accelerate computation while maintaining control on computational efficiency. The proposed strategy relies on recycling samples from the chain’s past.
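
For context, a textbook ABC-MCMC sampler of the kind the perturbed samplers above are designed to accelerate (this is the standard baseline, not the proposed recycling scheme); the toy model, summary statistic, and tolerance are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    y_obs = rng.normal(loc=2.0, scale=1.0, size=50)   # observed data (toy)
    s_obs = y_obs.mean()                              # summary statistic

    def simulate(theta, n=50):
        return rng.normal(loc=theta, scale=1.0, size=n)

    def log_prior(theta):                             # N(0, 10^2) prior
        return -0.5 * theta**2 / 100.0

    eps, theta, chain = 0.1, 0.0, []
    for _ in range(20_000):
        prop = theta + rng.normal(scale=0.5)          # random-walk proposal
        s_sim = simulate(prop).mean()                 # each proposal requires a fresh simulation
        if abs(s_sim - s_obs) <= eps and np.log(rng.uniform()) < log_prior(prop) - log_prior(theta):
            theta = prop
        chain.append(theta)
    print(np.mean(chain[5000:]))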

Read more

Gradient Estimation with Stochastic Softmax Tricks

by Max B. Paulus, Dami Choi, Daniel Tarlow, Andreas Krause, and Chris J. Maddison

Advances in Neural Information Processing Systems | 2020

Short Summary: The Gumbel-Max trick is the basis of many relaxed gradient estimators. These estimators are easy to implement and low variance, but the goal of scaling them comprehensively to large combinatorial distributions is still outstanding. Working within the perturbation model framework, we introduce stochastic softmax tricks, which generalize the Gumbel-Softmax trick to combinatorial spaces. Our framework is a unified perspective on existing relaxed estimators for perturbation models, and it contains many novel relaxations. We design structured relaxations for subset selection, spanning trees, arborescences, and others. When compared to less structured baselines, we find that stochastic softmax tricks can be used to train latent variable models that perform better and discover more latent structure.
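
For reference, a standard textbook sketch of the Gumbel-Max trick and its Gumbel-Softmax relaxation that the stochastic softmax tricks generalize (this is not one of the paper's structured relaxations); the logits and temperature are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    logits = np.array([1.0, 0.5, -1.0, 2.0])

    def gumbel(shape):
        return -np.log(-np.log(rng.uniform(size=shape)))

    # Gumbel-Max: adding Gumbel noise and taking the argmax samples exactly from softmax(logits).
    hard_sample = np.argmax(logits + gumbel(logits.shape))

    # Gumbel-Softmax: replace the argmax by a temperature-controlled softmax to get a relaxed,
    # differentiable sample on the probability simplex.
    def gumbel_softmax(logits, tau=0.5):
        z = (logits + gumbel(logits.shape)) / tau
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    print(hard_sample, gumbel_softmax(logits))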

Read more

In praise of small data

by Nancy Reid

Notices of the American Mathematical Society | 2021 | 68, 105-113

Short Summary: The over-promotion of “Big Data” has perhaps settled down, but the data are still there, and the rapid development of the new field of data science is a response to this. As more data become available, the questions asked become more complex, and big data can quickly turn into small data. Statistical science has developed an arsenal of methods and models for learning under uncertainty over its 200-year history. Some thoughts on the interplay between statistical and data science, their interactions with science, and the ongoing relevance of statistical theory are presented and illustrated.

Read more

LocusFocus: Web-based colocalization for the annotation and functional follow-up of GWAS

by Naim Panjwani, Fan Wang, Scott Mastromatteo, Allen Bao, Cheng Wang, Gengming He, Jiafen Gong, Johanna M. Rommens, Lei Sun, and Lisa J. Strug

PLOS Computational Biology | 2020 | 16(10):e1008336

Short Summary: Genome-wide association studies (GWAS) have primarily identified trait-associated loci in the non-coding genome. Colocalization analyses of SNP associations from GWAS with expression quantitative trait loci (eQTL) evidence enable the generation of hypotheses about the responsible mechanisms, genes, and tissues of origin to guide functional characterization. Here, we present a web-based colocalization browsing and testing tool named LocusFocus. LocusFocus formally tests colocalization using our established Simple Sum method to identify the most relevant genes and tissues for a particular GWAS locus in the presence of high linkage disequilibrium and/or allelic heterogeneity. We demonstrate the utility of LocusFocus, following up on a genome-wide significant locus from a GWAS of meconium ileus (an intestinal obstruction in cystic fibrosis). Using LocusFocus for colocalization analysis with eQTL data suggests that variation in ATP12A gene expression in the pancreas rather than the intestine is responsible for the GWAS locus. LocusFocus has no operating system dependencies and may be installed on a local web server. LocusFocus is available under the MIT license, with full documentation and source code accessible on GitHub.

Read more

Scalable Gradients for Stochastic Differential Equations

by Xuechen Li, Ting-Kam Leonard Wong, Ricky Tian Qi Chen, and David Duvenaud

International Conference on Artificial Intelligence and Statistics | 2020

Short Summary: We generalize the adjoint sensitivity method to stochastic differential equations, allowing time-efficient and constant-memory computation of gradients with high-order adaptive solvers. Specifically, we derive a stochastic differential equation whose solution is the gradient, a memory-efficient algorithm for caching noise, and conditions under which numerical solutions converge. In addition, we combine our method with gradient-based stochastic variational inference for latent stochastic differential equations. We use our method to fit stochastic dynamics defined by neural networks, achieving competitive performance on a 50-dimensional motion capture dataset.
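
Background sketch (not the paper's method): simulating the forward SDE with Euler-Maruyama for a simple parameterized drift and diffusion. The paper's contribution, a backward adjoint SDE that yields gradients with constant memory, is not reproduced here; the dynamics below are toy assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def drift(x, t, theta):
        return -theta * x                  # toy parameterized drift (Ornstein-Uhlenbeck-like)

    def diffusion(x, t, sigma=0.3):
        return sigma

    def euler_maruyama(x0, theta, t1=1.0, steps=200):
        dt = t1 / steps
        x = x0
        for k in range(steps):
            dw = rng.normal(scale=np.sqrt(dt))
            x = x + drift(x, k * dt, theta) * dt + diffusion(x, k * dt) * dw
        return x

    print(euler_maruyama(x0=1.0, theta=2.0))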

Read more

Statistical power in COVID-19 case-control host genomic study design

by Yu-Chung Lin, Jennifer D. Brooks, Shelley B. Bull, France Gagnon, Celia M. T. Greenwood, Rayjean J. Hung, Jerald Lawless, Andrew D. Paterson, Lei Sun, and Lisa J. Strug

Genome Medicine | 2020 | Volume 12, Article 115

Short Summary: The identification of genetic variation that directly impacts infection susceptibility to SARS-CoV-2 and disease severity of COVID-19 is an important step towards risk stratification, personalized treatment plans, and therapeutic and vaccine development and deployment. Given the importance of study design in infectious disease genetic epidemiology, we use simulation and draw on current estimates of exposure, infectivity, and test accuracy of COVID-19 to demonstrate the feasibility of detecting host genetic factors associated with susceptibility and severity in published COVID-19 study designs. We demonstrate that limited phenotypic data and exposure/infection information in the early stages of the pandemic significantly impact the ability to detect most genetic variants with moderate effect sizes, especially when studying susceptibility to SARS-CoV-2 infection. Our insights can aid in the interpretation of genetic findings emerging in the literature and guide the design of future host genetic studies.
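
Illustrative sketch (not the study's simulation code): estimating power by simulation for a single case-control variant under a logistic disease model. The allele frequency, odds ratio, prevalence, sample sizes, and significance threshold below are hypothetical.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    def simulate_power(maf=0.2, odds_ratio=1.3, n_cases=2000, n_controls=2000,
                       baseline_prev=0.05, reps=200, alpha=5e-8):
        beta = np.log(odds_ratio)
        beta0 = np.log(baseline_prev / (1 - baseline_prev))
        hits = 0
        for _ in range(reps):
            # Sample genotypes (0/1/2 copies of the risk allele) for a large population,
            # assign case/control status from the logistic model, then sample the study.
            g = rng.binomial(2, maf, size=100_000)
            p = 1 / (1 + np.exp(-(beta0 + beta * g)))
            y = rng.binomial(1, p)
            cases = rng.choice(g[y == 1], size=n_cases, replace=False)
            controls = rng.choice(g[y == 0], size=n_controls, replace=False)
            # Allelic chi-square test on the 2x2 table of allele counts.
            table = np.array([[cases.sum(), 2 * n_cases - cases.sum()],
                              [controls.sum(), 2 * n_controls - controls.sum()]])
            _, pval, _, _ = stats.chi2_contingency(table)
            hits += pval < alpha
        return hits / reps

    print(simulate_power())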

Read more

The Building Blocks of Statistical Education in the Data Science Ecosystem

by Alison L. Gibbs and Nathan Taback

Harvard Data Science Review | 2021 (accepted)

Read more

Adaptive Component-wise Multiple-Try Metropolis

by Jinyoung Yang, Evgeny Levi, R.V. Craiu, and J.S. Rosenthal

Journal of Computational and Graphical Statistics | 2018

Short Summary: An adaptive component-wise multiple-try Metropolis sampler for targets with irregular characteristics.
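
For context, a plain multiple-try Metropolis step with symmetric proposals and a fixed scale, the non-adaptive building block that the adaptive component-wise sampler extends; the target and tuning below are illustrative only.

    import numpy as np

    rng = np.random.default_rng(0)

    def logpi(x):                      # toy irregular 1-d target: unequal-width mixture
        return np.logaddexp(-0.5 * (x + 3) ** 2, -0.5 * ((x - 3) / 0.3) ** 2)

    def mtm_step(x, k=5, scale=2.0):
        ys = x + rng.normal(scale=scale, size=k)                   # k trial proposals
        wy = np.exp(logpi(ys) - logpi(ys).max())
        y = ys[rng.choice(k, p=wy / wy.sum())]                     # select one trial with weight pi(y)
        xs = np.append(y + rng.normal(scale=scale, size=k - 1), x) # reference set, includes current x
        log_alpha = np.logaddexp.reduce(logpi(ys)) - np.logaddexp.reduce(logpi(xs))
        return y if np.log(rng.uniform()) < log_alpha else x

    x, chain = 0.0, []
    for _ in range(20_000):
        x = mtm_step(x)
        chain.append(x)
    print(np.mean(chain), np.std(chain))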

Read more

Backpropagation through the Void: Optimizing control variates for black-box gradient estimation

by Will Grathwohl, Dami Choi, Yuhuai Wu, Geoffrey Roeder, and David Duvenaud

International Conference on Learning Representations | 2018

Short Summary: We learn low-variance, unbiased gradient estimators for any function of random variables. We backprop through a neural net surrogate of the original function, which is optimized to minimize gradient variance during the optimization of the original objective. We train discrete latent-variable models, and do continuous and discrete reinforcement learning with an adaptive, action-conditional baseline.
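
For context, a minimal score-function (REINFORCE) gradient estimator with a simple constant baseline as the control variate, the setting that the learned neural-surrogate control variates above improve on; the objective and baseline choice here are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    f = lambda b: (b - 0.45) ** 2            # black-box objective of a discrete sample
    theta = 0.3
    p = 1 / (1 + np.exp(-theta))             # Bernoulli(p) with p = sigmoid(theta)

    b = rng.binomial(1, p, size=100_000)
    score = b - p                            # d/dtheta log p(b | theta)

    g_reinforce = f(b) * score
    baseline = f(rng.binomial(1, p, size=10_000)).mean()  # baseline from an independent batch (keeps it unbiased)
    g_with_cv = (f(b) - baseline) * score

    true_grad = p * (1 - p) * (f(1) - f(0))
    print(true_grad, g_reinforce.mean(), g_with_cv.mean())
    print(g_reinforce.std(), g_with_cv.std())  # the control variate cuts the variance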

Read more

Global Non-convex Optimization with Discretized Diffusions

by Murat A. Erdogdu, Lester Mackey, and Ohad Shamir

Advances in Neural Information Processing Systems | 2018 (to appear)

Short Summary: An Euler discretization of the Langevin diffusion is known to converge to the global minimizers of certain convex and non-convex optimization problems. We show that this property holds for any suitably smooth diffusion and that different diffusions are suitable for optimizing different classes of convex and non-convex functions. This allows us to design diffusions suitable for globally optimizing convex and non-convex functions not covered by the existing Langevin theory. Our non-asymptotic analysis delivers computable optimization and integration error bounds based on easily accessed properties of the objective and chosen diffusion. Central to our approach are new explicit Stein factor bounds on the solutions of Poisson equations. We complement these results with improved optimization guarantees for targets other than the standard Gibbs measure.
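
For reference, the Euler (unadjusted Langevin) discretization that the result above generalizes beyond; the objective, step size, and temperature are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):                                # non-convex objective with two minima
        return (x ** 2 - 1.0) ** 2

    def grad_f(x):
        return 4.0 * x * (x ** 2 - 1.0)

    def langevin(x0, step=1e-3, beta=10.0, iters=50_000):
        x, best = x0, x0
        for _ in range(iters):
            x = x - step * grad_f(x) + np.sqrt(2.0 * step / beta) * rng.normal()
            if f(x) < f(best):
                best = x
        return best

    print(langevin(x0=3.0))                  # the injected noise lets the iterate escape poor regions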

Read more

Neural Ordinary Differential Equations

by Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud

Advances in Neural Information Processing Systems | 2018

Short Summary: We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a black-box differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.
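
Illustrative sketch (not the paper's implementation): a forward pass of a continuous-depth model, where a small fixed-weight network defines the hidden-state derivative and a black-box ODE solver computes the output; the weights here are random rather than trained, and no backpropagation through the solver is shown.

    import numpy as np
    from scipy.integrate import solve_ivp

    rng = np.random.default_rng(0)
    d = 4
    W1, b1 = rng.normal(scale=0.5, size=(d, d)), rng.normal(scale=0.1, size=d)
    W2, b2 = rng.normal(scale=0.5, size=(d, d)), rng.normal(scale=0.1, size=d)

    def dynamics(t, h):
        # dh/dt = MLP(h): discrete layers are replaced by a parameterized derivative.
        return W2 @ np.tanh(W1 @ h + b1) + b2

    h0 = rng.normal(size=d)                        # input embedding
    sol = solve_ivp(dynamics, t_span=(0.0, 1.0), y0=h0, rtol=1e-6, atol=1e-8)
    h1 = sol.y[:, -1]                              # "output layer" = state at t = 1
    print(h0, h1)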

Read more

Scalable Approximations for Generalized Linear Problems

by Murat A. Erdogdu, Mohsen Bayati, and Lee H. Dicker

Journal of Machine Learning Research | 2018 (to appear)

Short Summary: In stochastic optimization, the population risk is generally approximated by the empirical risk. However, in the large-scale setting, minimization of the empirical risk may be computationally restrictive. In this paper, we design an efficient algorithm to approximate the population risk minimizer in generalized linear problems such as binary classification with surrogate losses and generalized linear regression models. We focus on large-scale problems, where the iterative minimization of the empirical risk is computationally intractable because the number of observations $n$ is much larger than the dimension of the parameter $p$, i.e., $n \gg p \gg 1$. We show that under random sub-Gaussian design, the true minimizer of the population risk is approximately proportional to the corresponding ordinary least squares (OLS) estimator. Using this relation, we design an algorithm that achieves the same accuracy as the empirical risk minimizer through iterations that attain up to a quadratic convergence rate, and that are computationally cheaper than any batch optimization algorithm by at least a factor of $\mathcal{O}(p)$. We provide theoretical guarantees for our algorithm, and analyze the convergence behavior in terms of data dimensions.
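
A sketch in the spirit of the proportionality result described above, not the authors' algorithm verbatim: for logistic regression with Gaussian design, fit OLS once and then search over a single scalar multiplier of the OLS direction instead of iterating a full empirical-risk minimization. The model sizes and the one-dimensional search are assumptions made here.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    n, p = 50_000, 20
    X = rng.normal(size=(n, p))
    beta_true = rng.normal(size=p) / np.sqrt(p)
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # cheap one-shot OLS fit

    def logistic_loss(c):                              # empirical risk along the OLS direction
        z = X @ (c * beta_ols)
        return np.mean(np.log1p(np.exp(-z)) + (1 - y) * z)

    c_hat = minimize_scalar(logistic_loss, bounds=(0.1, 20.0), method="bounded").x
    beta_hat = c_hat * beta_ols
    print(c_hat, np.corrcoef(beta_hat, beta_true)[0, 1])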

Read more

Stability of Adversarial Markov Chains, with an Application to Adaptive MCMC Algorithms

by R.V. Craiu, L. Gray, K. Latuszynski, N. Madras, G.O. Roberts, and J.S. Rosenthal

Annals of Applied Probability | 2015 | Vol. 25(6), pp. 3592-3623

Short Summary: Provides a simple way to verify the correct convergence of adaptive MCMC algorithms, thus opening up new avenues for computational progress and accurate estimation.
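
For context, a small adaptive random-walk Metropolis sketch of the kind whose validity such results help verify: the proposal scale adapts with diminishing influence, so the sampler behaves like a fixed-kernel Metropolis algorithm in the limit. The target, the 0.44 acceptance goal, and the adaptation schedule are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)

    def logpi(x):                                    # toy target: standard normal
        return -0.5 * x ** 2

    x, log_scale, chain = 0.0, 0.0, []
    for n in range(1, 50_001):
        prop = x + np.exp(log_scale) * rng.normal()  # random-walk proposal with adapted scale
        accept = np.log(rng.uniform()) < logpi(prop) - logpi(x)
        if accept:
            x = prop
        # Diminishing adaptation: step sizes ~ n^(-0.6) shrink to zero, steering the
        # acceptance rate toward 0.44 while the adaptation eventually freezes.
        log_scale += (float(accept) - 0.44) / n ** 0.6
        chain.append(x)

    print(np.exp(log_scale), np.mean(chain[10_000:]), np.var(chain[10_000:]))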

Read more