# 2022-23

By *Zhong Wang, Andrew Paterson, and Lei Sun*

Annals of Applied Statistics | 2023 | Accepted

Sex difference in allele frequency is an emerging topic that is crucial to our understanding of data quality and features, particularly when it comes to the largely overlooked X chromosome. To detect sex differences in allele frequency for both X chromosomal and autosomal variants, the existing method is conservative when applied to samples from multiple ancestral populations. Additionally, it remains unexplored whether the sex difference in allele frequency varies between populations, which is important for trans-ancestral genetic studies. To answer these questions, we developed a novel, retrospective regression-based testing framework that led to interpretable and easy-to-implement solutions. We then applied the proposed methods to the high-coverage whole genome sequence data of the 1000 Genomes Project, robustly analyzing all samples available from the five super-populations. We obtained 97 novel findings by recognizing and modelling ancestral differences. Finally, we replicated the specific findings and overall conclusion using the gnomAD v3.1.2 data.

By *Zhenhua Lin, Dehan Kong, and Linbo Wang*

Journal of the Royal Statistical Society Series B: Statistical Methodology | 2023 | 85(2), 378-398

Understanding causal relationships is one of the most important goals of modern science. So far, the causal inference literature has focused almost exclusively on outcomes coming from the Euclidean space R^p. However, it is increasingly common that complex datasets are best summarized as data points in non-linear spaces. In this paper, we present a novel framework of causal effects for outcomes from the Wasserstein space of cumulative distribution functions, which in contrast to the Euclidean space, is non-linear. We develop doubly robust estimators and associated asymptotic theory for these causal effects. As an illustration, we use our framework to quantify the causal effect of marriage on physical activity patterns using wearable device data collected through the National Health and Nutrition Examination Survey.

By *Robert Zimmerman, Radu V. Craiu, and Vianey Leos-Barajas*

Journal of the American Statistical Association | 2023 | accepted

We propose a copula-based extension of the hidden Markov model (HMM) which applies when the observations recorded at each time in the sample are multivariate. The joint model produced by the copula extension allows decoding of the hidden states based on information from multiple observations. However, unlike the case of independent marginals, the copula dependence structure embedded into the likelihood poses additional computational challenges. We tackle the latter using a theoretically-justified variation of the EM algorithm developed within the framework of inference functions for margins. We illustrate the method using numerical experiments and an analysis of house occupancy.

By *Yichi Zhang, Weining Shen, and Dehan Kong*

Journal of the American Statistical Association | 2023 | accepted

Covariance estimation for matrix-valued data has received an increasing interest in applications. Unlike previous works that rely heavily on matrix normal distribution assumption and the requirement of fixed matrix size, we propose a class of distribution-free regularized covariance estimation methods for high-dimensional matrix data under a separability condition and a bandable covariance structure. Under these conditions, the original covariance matrix is decomposed into a Kronecker product of two bandable small covariance matrices representing the variability over row and column directions. We formulate a unified framework for estimating bandable covariance, and introduce an efficient algorithm based on rank one unconstrained Kronecker product approximation. The convergence rates of the proposed estimators are established, and the derived minimax lower bound shows our proposed estimator is rate-optimal under certain divergence regimes of matrix size. We further introduce a class of robust covariance estimators and provide theoretical guarantees to deal with heavy-tailed data. We demonstrate the superior finite-sample performance of our methods using simulations and real applications from a gridded temperature anomalies dataset and an S&P 500 stock data analysis.
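
The separability step can be sketched in a few lines. Below is a naive moment-based version with hard banding (illustrative only, not the authors' rank-one Kronecker approximation or their robust variants; the bandwidths and dimensions are arbitrary choices):

```python
import numpy as np

def banded(M, k):
    """Banding operator: zero out entries more than k off the diagonal."""
    offsets = np.abs(np.subtract.outer(np.arange(M.shape[0]), np.arange(M.shape[1])))
    return np.where(offsets <= k, M, 0.0)

def separable_banded_cov(samples, k_row=1, k_col=1):
    """Moment-based estimate of Cov(vec(X)) ~ Sigma_col (x) Sigma_row
    under separability, with each Kronecker factor banded.
    samples: array of shape (n, r, c) holding n matrix observations."""
    n, r, c = samples.shape
    Xc = samples - samples.mean(axis=0)            # center the sample
    S_row = sum(X @ X.T for X in Xc) / (n * c)     # row-direction factor
    S_col = sum(X.T @ X for X in Xc) / (n * r)     # column-direction factor
    return banded(S_row, k_row), banded(S_col, k_col)
```

For data with identity covariance, both factors come out close to identity matrices; `np.kron(S_col, S_row)` then approximates the full covariance of the column-stacked vectorization, up to the usual scalar ambiguity of Kronecker factorizations.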

By *Dehan Kong, Shu Yang, and Linbo Wang*

Biometrika | 2022 | 109(1), 265-272

Unobserved confounding presents a major threat to causal inference in observational studies. Recently, several authors have suggested that this problem could be overcome in a shared confounding setting where multiple treatments are independent given a common latent confounder. It has been shown that under a linear Gaussian model for the treatments, the causal effect is not identifiable without parametric assumptions on the outcome model. In this note, we show that the causal effect is indeed identifiable if we assume a general binary choice model for the outcome with a non-probit link. Our identification approach is based on the incongruence between Gaussianity of the treatments and latent confounder and non-Gaussianity of a latent outcome variable. We further develop a two-step likelihood-based estimation procedure.

By *Lin Zhang, Lisa Strug, and Lei Sun*

Annals of Applied Statistics | 2023 | 17(2):1764-1781

Modern genome-wide association studies (GWAS) remove single nucleotide polymorphisms (SNPs) that are in Hardy–Weinberg disequilibrium (HWD), despite limited rigor for this practice. In a case-control GWAS, although HWD in the control sample is evidence of genotyping error, a truly associated SNP may be in HWD in the case and/or control populations. We therefore develop a new case-control association test that: (i) leverages HWD attributed to true association to increase power, (ii) is robust to HWD caused by genotyping error, and (iii) is easy to implement at the genome-wide level. The proposed robust allele-based joint test incorporates the difference in HWD between the case and control samples into the traditional association measure to gain power. We provide the asymptotic distribution of the proposed test statistic under the null hypothesis. We evaluate its type 1 error control at the genome-wide significance level of 5×10^-8 in the presence of HWD attributed to factors unrelated to phenotype-genotype association, such as genotyping error. Finally, we demonstrate that the power of the proposed allele-based joint test is higher than that of the standard association test for a variety of genetic models, through derivations of the noncentrality parameters of the tests, as well as simulation and application studies.
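
For orientation, the classical allelic test that the robust joint test generalizes can be written down directly. The sketch below implements only that classical two-sample allele-frequency comparison; the paper's joint statistic additionally incorporates the case-control difference in HWD, which is not reproduced here:

```python
from math import sqrt

def allelic_test_z(case_geno, ctrl_geno):
    """Classical allele-based test statistic comparing allele frequencies
    between case and control samples. Genotypes are coded as 0/1/2 copies
    of the reference allele; each person contributes two alleles."""
    n1, n0 = len(case_geno), len(ctrl_geno)
    p1 = sum(case_geno) / (2 * n1)                           # case frequency
    p0 = sum(ctrl_geno) / (2 * n0)                           # control frequency
    p = (sum(case_geno) + sum(ctrl_geno)) / (2 * (n1 + n0))  # pooled frequency
    se = sqrt(p * (1 - p) * (1 / (2 * n1) + 1 / (2 * n0)))
    return (p1 - p0) / se
```

The validity of this classical statistic relies on the HWE assumption; making the allele-based approach robust to departures from HWE is precisely what the paper's correction achieves.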

By *Steffen Lauritzen, and Piotr Zwiernik*

Annals of Statistics | 2022 | Vol. 50, No. 5, 3009-3038.

The notion of multivariate total positivity has proved to be useful in finance and psychology but may be too restrictive in other applications. In this paper we propose a concept of local association, in which highly connected components in a graphical model are positively associated, and we study its properties. Our main motivation comes from gene expression data, where graphical models have become a popular exploratory tool. The models are instances of what we term mixed convex exponential families, and we show that a mixed dual likelihood estimator has simple exact properties for such families as well as asymptotic properties similar to the maximum likelihood estimator. We further relax the positivity assumption by penalizing negative partial correlations in what we term the positive graphical lasso. Finally, we develop a GOLAZO algorithm based on block-coordinate descent that applies to a number of optimization procedures arising in the context of graphical models, including the estimation problems described above. We derive results on the existence of the optimum for such problems.

By *Dengdeng Yu, Linbo Wang, Dehan Kong, and Hongtu Zhu*

Journal of the American Statistical Association | 2022 | 117(540), 1656-1668

Alzheimer's disease (AD) is a progressive form of dementia that results in problems with memory, thinking, and behavior. It often starts with abnormal aggregation and deposition of beta amyloid and tau, followed by neuronal damage such as atrophy of the hippocampi. The aim of this paper is to map the genetic-imaging-clinical pathway for AD in order to delineate the genetically regulated brain changes that drive disease progression, based on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. We develop a novel two-step approach to delineate the association between high-dimensional 2D hippocampal surface exposures and the Alzheimer's Disease Assessment Scale (ADAS) cognitive score, while taking into account the ultra-high dimensional clinical and genetic covariates at baseline. Analysis results suggest that the radial distance of each pixel of both hippocampi is negatively associated with the severity of behavioral deficits conditional on observed clinical and genetic covariates. These associations are stronger in Cornu Ammonis region 1 (CA1) and subiculum subregions compared to Cornu Ammonis region 2 (CA2) and Cornu Ammonis region 3 (CA3) subregions.

By *Wenlong Mou, Ashwin Pananjady, and Martin Wainwright*

Mathematics of Operations Research | 2023 | Article in Advance

Linear fixed-point equations in Hilbert spaces arise in a variety of settings, including reinforcement learning and computational methods for solving differential and integral equations. We study methods that use a collection of random observations to compute approximate solutions by searching over a known low-dimensional subspace of the Hilbert space. First, we prove an instance-dependent upper bound on the mean-squared error for a linear stochastic approximation scheme that exploits Polyak–Ruppert averaging. This bound consists of two terms: an approximation error term with an instance-dependent approximation factor and a statistical error term that captures the instance-specific complexity of the noise when projected onto the low-dimensional subspace. Using information-theoretic methods, we also establish lower bounds showing that both of these terms cannot be improved, again in an instance-dependent sense. A concrete consequence of our characterization is that the optimal approximation factor in this problem can be much larger than a universal constant. We show how our results precisely characterize the error of a class of temporal difference learning methods for the policy evaluation problem with linear function approximation, establishing their optimality.
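
The averaging scheme itself is simple to state. Below is a toy sketch of constant-step-size linear stochastic approximation with tail averaging for the fixed-point equation v = Av + b; the step size, horizon, and noise level are arbitrary illustrative choices, and the subspace projection studied in the paper is omitted:

```python
import numpy as np

def averaged_lsa(A, b, steps=20000, eta=0.05, noise=0.1, seed=0):
    """Polyak-Ruppert averaged stochastic approximation for v = A v + b,
    where each update sees the residual corrupted by additive noise.
    Returns the average of the second half of the iterate trajectory."""
    rng = np.random.default_rng(seed)
    d = len(b)
    v = np.zeros(d)
    avg = np.zeros(d)
    burn = steps // 2                       # discard the warm-up phase
    for k in range(steps):
        g = b + A @ v - v + noise * rng.standard_normal(d)  # noisy residual
        v = v + eta * g                     # constant-step-size update
        if k >= burn:
            avg += (v - avg) / (k - burn + 1)   # running tail average
    return avg
```

In a policy evaluation setting, A and b would themselves be built from observed transitions and rewards; here they are passed in directly to keep the sketch self-contained.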

By *Ziang Zhang, and Lei Sun*

Bioinformatics | 2023 | 39(4):btad139

Accurate power and sample size estimation is crucial to the design and analysis of genetic association studies. When analyzing a binary trait via logistic regression, important covariates such as age and sex are typically included in the model. However, their effects are rarely properly considered in power or sample size computation during study planning. Unlike when analyzing a continuous trait, the power of association testing between a binary trait and a genetic variant depends, explicitly, on covariate effects, even under the assumption of gene-environment independence. Earlier work recognized this hidden factor, but the implemented methods are not flexible. We thus propose and implement a generalized method for estimating power and sample size for (discovery or replication) association studies of binary traits that (i) accommodates different types of nongenetic covariates E, (ii) deals with different types of G-E relationships, and (iii) is computationally efficient. Extensive simulation studies show that the proposed method is accurate and computationally efficient for both prospective and retrospective sampling designs with various covariate structures. A proof-of-principle application focuses on the understudied African sample in the UK Biobank data. Results show that, in contrast to studying the continuous blood pressure trait, when analyzing the binary hypertension trait, ignoring the covariate effects of age and sex leads to overestimated power and underestimated replication sample size.
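
The phenomenon is easy to probe by simulation. The sketch below is a generic Monte Carlo power calculation for the genetic effect in a logistic model with one non-genetic covariate; it is not the authors' implemented method, and all parameter values are illustrative:

```python
import numpy as np

def logistic_irls(X, y, iters=25):
    """Fit a logistic regression by iteratively reweighted least squares;
    returns coefficient estimates and their covariance matrix."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        H = (X * (p * (1 - p))[:, None]).T @ X      # observed information
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    return beta, np.linalg.inv(H)

def sim_power(n=500, maf=0.3, beta_g=0.4, beta_e=1.0, b0=-1.0,
              reps=200, z_crit=1.96, seed=1):
    """Monte Carlo power of the Wald test for beta_g in
    logit P(Y=1) = b0 + beta_g*G + beta_e*E, with genotype
    G ~ Binomial(2, maf) independent of covariate E ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        G = rng.binomial(2, maf, n)              # genotype coded 0/1/2
        E = rng.standard_normal(n)               # non-genetic covariate
        lin = b0 + beta_g * G + beta_e * E
        y = (rng.random(n) < 1.0 / (1.0 + np.exp(-lin))).astype(float)
        X = np.column_stack([np.ones(n), G, E])
        beta, cov = logistic_irls(X, y)
        hits += abs(beta[1] / np.sqrt(cov[1, 1])) > z_crit
    return hits / reps
```

Re-running with a larger `beta_e`, even with G and E kept independent, changes the estimated power: the binary-trait covariate effect that the paper shows must be accounted for at the design stage.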

By *Ying Zhou, Dehan Kong, and Linbo Wang*

Biometrika | 2023 | accepted

A key challenge in causal inference from observational studies is the identification and estimation of causal effects in the presence of unmeasured confounding. In this paper, we introduce a novel approach for causal inference that leverages information in multiple outcomes to deal with unmeasured confounding. The key assumption in our approach is conditional independence among multiple outcomes. In contrast to existing proposals in the literature, the roles of multiple outcomes in our key identification assumption are symmetric, hence the name parallel outcomes. We show nonparametric identifiability with at least three parallel outcomes and provide parametric estimation tools under a set of linear structural equation models. Our proposal is evaluated through a set of synthetic and real data analyses.

By *Frank Röttger, Sebastian Engelke, and Piotr Zwiernik*

Annals of Statistics | 2023 | Vol. 51, No. 3, 962-1004.

Positive dependence is present in many real world data sets and has appealing stochastic properties that can be exploited in statistical modeling and in estimation. In particular, the notion of multivariate total positivity of order 2 (MTP2) is a convex constraint and acts as an implicit regularizer in the Gaussian case. We study positive dependence in multivariate extremes and introduce EMTP2, an extremal version of MTP2. This notion turns out to appear prominently in extremes, and in fact, it is satisfied by many classical models. For a Hüsler–Reiss distribution, the analogue of a Gaussian distribution in extremes, we show that it is EMTP2 if and only if its precision matrix is a Laplacian of a connected graph. We propose an estimator for the parameters of the Hüsler–Reiss distribution under EMTP2 as the solution of a convex optimization problem with Laplacian constraint. We prove that this estimator is consistent and typically yields a sparse model with possibly nondecomposable extremal graphical structure. Applying our methods to a data set of Danube River flows, we illustrate this regularization and the superior performance compared to existing methods.

By *T.-K. L. Wong, and J. Zhang*

IEEE Transactions on Information Theory | 2022 | 68 (8), 5353 - 5373

Tsallis and Rényi entropies, which are monotone transformations of each other, are deformations of the celebrated Shannon entropy. Maximization of these deformed entropies, under suitable constraints, leads to the q-exponential family which has applications in non-extensive statistical physics, information theory and statistics. We show that a generalized λ-duality, where λ = 1−q is to be interpreted as the constant information-geometric curvature, leads to a generalized exponential family which is essentially equivalent to the q-exponential family and has deep connections with Rényi entropy and optimal transport. Using this generalized convex duality and its associated logarithmic divergence, we show that our λ-exponential family satisfies properties that parallel and generalize those of the exponential family. Under our framework, the Rényi entropy and divergence arise naturally, and we give a new proof of the Tsallis/Rényi entropy maximizing property of the q-exponential family. We also introduce a λ-mixture family which may be regarded as the dual of the λ-exponential family, and connect it with other mixture-type families.

# Previous Publications

**Bidimensional Linked Matrix Factorization for Pan-omics Pan-cancer Analysis**

by *Eric F. Lock, Jun Young Park, Katherine A. Hoadley*

Annals of Applied Statistics | 2022 (published) | 16(1), 193-215

*Short Summary: *Several modern applications require the integration of multiple large data matrices that have shared rows and/or columns. For example, cancer studies that integrate multiple omics platforms across multiple types of cancer, pan-omics pan-cancer analysis, have extended our knowledge of molecular heterogeneity beyond what was observed in single tumor and single platform studies. However, these studies have been limited by available statistical methodology. We propose a flexible approach to the simultaneous factorization and decomposition of variation across such bidimensionally linked matrices, BIDIFAC+. BIDIFAC+ decomposes variation into a series of low-rank components that may be shared across any number of row sets (e.g., omics platforms) or column sets (e.g., cancer types). This builds on a growing literature for the factorization and decomposition of linked matrices which has primarily focused on multiple matrices that are linked in one dimension (rows or columns) only. Our objective function extends nuclear norm penalization, is motivated by random matrix theory, gives a unique decomposition under relatively mild conditions, and can be shown to give the mode of a Bayesian posterior distribution. We apply BIDIFAC+ to pan-omics pan-cancer data from TCGA, identifying shared and specific modes of variability across four different omics platforms and 29 different cancer types.
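
The nuclear-norm penalty at the heart of such objectives has a closed-form proximal operator, singular-value soft-thresholding. A sketch of just that generic building block (not the full bidimensionally linked decomposition):

```python
import numpy as np

def svt(X, lam):
    """Singular-value soft-thresholding: the proximal operator of the
    nuclear norm. Shrinks every singular value by lam and truncates at
    zero, yielding a low-rank estimate of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt
```

Iterating operators like this across the linked row and column blocks, with penalty levels motivated by random matrix theory, is the flavor of computation behind objectives of this kind.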

**D.A.S. Fraser: From Structural Inference to Asymptotics**

by *Nancy Reid*

Canadian Journal of Statistics | to appear in 2022 | Volume 50

*Short Summary: *Don Fraser was my collaborator and life partner, so I had a uniquely close view of his life in research. This note describes how his early work in the structure of models informed our work in asymptotic theory.

**Instrumental Variable Estimation of the Causal Hazard Ratio**

by *Linbo Wang, Eric Tchetgen Tchetgen, Torben Martinussen, and Stijn Vansteelandt*

Biometrics | 2022 (accepted)

*Short Summary: *Cox's proportional hazards model is one of the most popular statistical models to evaluate associations of exposure with a censored failure time outcome. When confounding factors are not fully observed, the exposure hazard ratio estimated using a Cox model is subject to unmeasured confounding bias. To address this, we propose a novel approach for the identification and estimation of the causal hazard ratio in the presence of unmeasured confounding factors. Our approach is based on a binary instrumental variable (IV) and an additional no-interaction assumption in a first-stage regression of the treatment on the IV and unmeasured confounders. We propose, to the best of our knowledge, the first consistent estimator of the (population) causal hazard ratio within an instrumental variable framework. A version of our estimator admits a closed-form representation. We derive the asymptotic distribution of our estimator, and provide a consistent estimator for its asymptotic variance. Our approach is illustrated via simulation studies and a data application.

**Learning Partial Correlation Graphs and Graphical Models by Covariance Queries**

by *Gábor Lugosi, Jakub Truszkowski, Vasiliki Velona, Piotr Zwiernik*

Journal of Machine Learning Research | 2021 | 22 (2021) 1-41

*Short Summary: *We study the problem of recovering the structure underlying large Gaussian graphical models or, more generally, partial correlation graphs. In high-dimensional problems it is often too costly to store the entire sample covariance matrix. We propose a new input model in which one can query single entries of the covariance matrix. We prove that it is possible to recover the support of the inverse covariance matrix with low query and computational complexity. Our algorithms work in a regime when this support is represented by tree-like graphs and, more generally, for graphs of small treewidth. Our results demonstrate that for large classes of graphs, the structure of the corresponding partial correlation graphs can be determined much faster than even computing the empirical covariance matrix.

**Model-robust Designs for Nonlinear Quantile Regression**

by *Selvakkadunko Selvaratnam, Linglong Kong, and Douglas Wiens*

Statistical Methods in Medical Research | 2021 | 30(1): 221-232

*Short Summary: *We construct robust designs for nonlinear quantile regression, in the presence of both a possibly misspecified nonlinear quantile function and heteroscedasticity of an unknown form. The asymptotic mean-squared error of the quantile estimate is evaluated and maximized over a neighbourhood of the fitted quantile regression model. This maximum depends on the scale function and on the design. We entertain two methods to find designs that minimize the maximum loss. The first is local – we minimize for given values of the parameters and the scale function, using a sequential approach, whereby each new design point minimizes the subsequent loss, given the current design. The second is adaptive – at each stage, the maximized loss is evaluated at quantile estimates of the parameters, and a kernel estimate of scale, and then the next design point is obtained as in the sequential method. In the context of a Michaelis–Menten response model for an estrogen/hormone study, and a variety of scale functions, we demonstrate that the adaptive approach performs as well, in large study sizes, as if the parameter values and scale function were known beforehand and the sequential method applied. When the sequential method uses an incorrectly specified scale function, the adaptive method yields an often substantial improvement. The performance of the adaptive designs for smaller study sizes is assessed and seen to still be very favourable, especially so since the prior information required to design sequentially is rarely available.

**Multiplicative Effect Modeling: The General Case**

by *Yin J., Markes S., Richardson T.S., and Wang L.*

Biometrika | 2022 (accepted)

*Short Summary: *Generalized linear models, such as logistic regression, are widely used to model the association between a treatment and a binary outcome as a function of baseline covariates. However, the coefficients of a logistic regression model correspond to log odds ratios, while subject-matter scientists are often interested in relative risks. Although odds ratios are sometimes used to approximate relative risks, this approximation is appropriate only when the outcome of interest is rare for all levels of the covariates. Poisson regressions do measure multiplicative treatment effects including relative risks, but with a binary outcome not all combinations of parameters lead to fitted means that are between zero and one. Enforcing this constraint makes the parameters variation dependent, which is undesirable for modeling, estimation and computation. Focusing on the special case where the treatment is also binary, Richardson et al. (2017) propose a novel binomial regression model that allows direct modeling of the relative risk. The model uses a log odds-product nuisance model, leading to variation-independent parameter spaces. Building on this, we present general approaches to modeling the multiplicative effect of a continuous or categorical treatment on a binary outcome. Monte Carlo simulations demonstrate the desirable performance of our proposed methods. A data analysis further exemplifies our methods.
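
The variation-independent parameterization can be illustrated concretely. Under the binary-treatment model of Richardson et al. (2017), the baseline and treated risks (p0, p1) are recovered from the relative risk and the odds product; a simple bisection sketch (not the authors' fitting code):

```python
from math import log

def probs_from_rr_op(rr, op, tol=1e-12):
    """Recover (p0, p1) from the relative risk rr = p1/p0 and the odds
    product op = [p0*p1] / [(1-p0)*(1-p1)]. The log odds product is
    monotone in p0 for fixed rr, so bisection on (0, min(1, 1/rr))
    suffices to invert the map."""
    def f(p0):
        p1 = rr * p0
        return log(p0) + log(p1) - log(1 - p0) - log(1 - p1) - log(op)
    lo, hi = tol, min(1.0, 1.0 / rr) - tol
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    p0 = 0.5 * (lo + hi)
    return p0, rr * p0
```

Because the pair (rr, op) is variation independent, any rr > 0 and op > 0 map back to valid probabilities, which is what makes the log odds-product nuisance model attractive for modeling and computation.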

**Outcome Model Free Causal Inference With Ultra-High Dimensional Covariates**

by *Dingke Tang, Dehan Kong, Wenliang Pan, Linbo Wang*

Biometrics | 2022 (accepted)

*Short Summary: *Causal inference has been increasingly reliant on observational studies with rich covariate information. To build tractable causal procedures, such as the doubly robust estimators, it is imperative to first extract important features from high or even ultra-high dimensional data. In this paper, we propose causal ball screening for confounder selection from modern ultra-high dimensional data sets. Unlike the familiar task of variable selection for prediction modeling, our confounder selection procedure aims to control for confounding while improving efficiency in the resulting causal effect estimate. Previous empirical and theoretical studies suggest excluding causes of the treatment that are not confounders. Motivated by these results, our goal is to keep all the predictors of the outcome in both the propensity score and outcome regression models. A distinctive feature of our proposal is that we use an outcome model-free procedure for propensity score model selection, thereby maintaining double robustness in the resulting causal effect estimator. Our theoretical analyses show that the proposed procedure enjoys a number of properties, including model selection consistency and pointwise normality. Synthetic and real data analysis show that our proposal performs favorably with existing methods in a range of realistic settings. Data used in preparation of this paper were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database.

**Three Influential Design Quantities on the Power of Wald-Type Tests for Treatment Comparisons in Clinical Trials**

by *Selvakkadunko Selvaratnam, Alwell Oyet, and Yanqing Yi*

Journal of Statistical Research | 2022 (accepted)

*Short Summary: *In clinical trials, efficient statistical inference is critical to the well-being of future patients. We therefore construct Wald-type tests for the hypothesis of treatment-by-covariate interaction when treatments are assigned to patients by an adaptive design and the true model is a generalized linear model. Our measure of efficiency is the power of the test, while the ethics of a trial, or the well-being of participating patients, is measured by the success rate of treatments. We demonstrate that the power of the test depends on the target allocation proportion, the bias of the randomization procedure from the target, and the variability induced by the randomization process (design variability) for adaptive designs. We prove that these quantities influence the power when the trial involves two treatments and a single covariate. We also show that, in this case, as design variability decreases the power increases. Due to the complexity of the problem, we demonstrate by simulation that this result still holds when more than one covariate is present in the model. In simulation studies, we compare the measures of efficiency and ethics under response-adaptive (RA), covariate-adjusted response-adaptive (CARA), and completely randomized (CR) designs. The methods are applied to data from a clinical trial on stroke prevention in atrial fibrillation (SPAF).

**Total Positivity in Exponential Families With Application to Binary Variables**

by *Steffen Lauritzen, Caroline Uhler, Piotr Zwiernik*

Annals of Statistics | 2021 | 49(3): 1436-1459 (June 2021)

*Short Summary: *We study exponential families of distributions that are multivariate totally positive of order 2 (MTP2), show that these are convex exponential families and derive conditions for existence of the MLE. Quadratic exponential families of MTP2 distributions contain attractive Gaussian graphical models and ferromagnetic Ising models as special examples. We show that these are defined by intersecting the space of canonical parameters with a polyhedral cone whose faces correspond to conditional independence relations. Hence MTP2 serves as an implicit regularizer for quadratic exponential families and leads to sparsity in the estimated graphical model. We prove that the maximum likelihood estimator (MLE) in an MTP2 binary exponential family exists if and only if both of the sign patterns (1,−1) and (−1,1) are represented in the sample for every pair of variables; in particular, this implies that the MLE may exist with n=d observations, in stark contrast to unrestricted binary exponential families where 2d observations are required. Finally, we provide a novel and globally convergent algorithm for computing the MLE for MTP2 Ising models similar to iterative proportional scaling and apply it to the analysis of data from two psychological disorders.
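
The pairwise existence condition is easy to check mechanically. A small sketch for ±1-coded binary data (the function name and example data are ours):

```python
import numpy as np

def mle_may_exist(samples):
    """Check the paper's condition for existence of the MLE in an MTP2
    binary exponential family: for every pair of variables, both
    discordant sign patterns (+1, -1) and (-1, +1) must occur in the
    sample. samples: iterable of equal-length ±1 tuples."""
    X = np.asarray(samples)
    d = X.shape[1]
    for i in range(d):
        for j in range(i + 1, d):
            pats = set(map(tuple, X[:, [i, j]]))   # observed pair patterns
            if (1, -1) not in pats or (-1, 1) not in pats:
                return False
    return True
```

With d = 3, for instance, four observations can already satisfy the condition, illustrating how few samples the MTP2 constraint can require relative to the 2^d worst case of unrestricted models.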

**Unifying Genetic Association Tests via Regression: Prospective and Retrospective, Parametric and Non-Parametric, and Genotype- and Allele-Based Tests**

by *Lin Zhang, and Lei Sun*

Canadian Journal of Statistics | Accepted

*Short Summary: *Genetic association analysis, evaluating the relationship between genetic markers and complex and heritable traits, is the basis of genome-wide association studies. In response, many association tests have been developed, and they are generally classified as prospective vs. retrospective, parametric vs. non-parametric, and genotype- vs. allele-based association tests. While method classification is useful, it can be confusing and challenging for practitioners to decide on the ‘optimal’ test to use for their data. Although differences between some of the popular association tests are well known, we provide new results that reveal the analytical connections between the different tests for both population- and family-based study designs.

**A generalized robust allele-based genetic association test**

by *Lin Zhang, and Lei Sun*

Biometrics | 2021 (Online)

*Short Summary:* The allele-based association test, comparing allele frequency difference between case and control groups, is locally most powerful. However, application of the classical allelic test is limited in practice, because the method is sensitive to the Hardy–Weinberg equilibrium (HWE) assumption, is not applicable to continuous traits, and does not easily account for covariate effects or sample correlation. To develop a generalized robust allelic test, we propose a new allele-based regression model with individual allele as the response variable. We show that the score test statistic derived from this robust and unifying regression framework contains a correction factor that explicitly adjusts for potential departure from HWE and encompasses the classical allelic test as a special case. When the trait of interest is continuous, the corresponding allelic test evaluates a weighted difference between individual-level allele frequency estimate and sample estimate where the weight is proportional to an individual's trait value, and the test remains valid under Y-dependent sampling. Finally, the proposed allele-based method can analyze multiple (continuous or binary) phenotypes simultaneously and multiallelic genetic markers, while accounting for covariate effect, sample correlation, and population heterogeneity. To support our analytical findings, we provide empirical evidence from both simulation and application studies.

**Bayesian model averaging for the X-chromosome inactivation dilemma in genetic association study**

by *Bo Chen, Radu V. Craiu, and Lei Sun*

Biostatistics | 2020 | 21(2) 319-335

*Short Summary: *The X-chromosome is often excluded from so-called “whole-genome” association studies due to the differences it exhibits between males and females. One particular analytical challenge is the unknown status of X-inactivation, where one of the two X-chromosome variants in females may be randomly selected to be silenced. In the absence of biological evidence in favor of one specific model, we consider a Bayesian model averaging framework that offers a principled way to account for the inherent model uncertainty, providing model averaging-based posterior density intervals and Bayes factors. We examine the inferential properties of the proposed methods via extensive simulation studies, and we apply the methods to a genetic association study of an intestinal disease occurring in about 20% of cystic fibrosis patients. Compared with the results previously reported assuming the presence of inactivation, we show that the proposed Bayesian methods provide more feature-rich quantities that are useful in practice.

**Checking the model and the prior for the constrained multinomial**

by *Berge Englert, Michael Evans, Gun Ho Jang, Hui Khoon Ng, David Nott, and Max Seah*

Metrika | 2021 | DOI: 10.1007/s00184-021-00811-8

*Short Summary:* Multinomial models can be difficult to use when constraints are placed on the probabilities. An exact model checking procedure for such models is developed based on a uniform prior on the full multinomial model. For inference, a nonuniform prior can be used and a consistency theorem is proved concerning a check for prior-data conflict with the chosen prior. Applications are presented and a new elicitation methodology is developed for multinomial models with ordered probabilities.

**Measuring and controlling bias for some Bayesian inferences and the relation to frequentist criteria**

by *Michael Evans and Yang Guo*

Entropy | 2021 | 23(2), 190. DOI: 10.3390/e23020190

*Short Summary:* A common concern with Bayesian methodology in scientific contexts is that inferences can be heavily influenced by subjective biases. As presented here, there are two types of bias for some quantity of interest: bias against and bias in favor. Based upon the principle of evidence, it is shown how to measure and control these biases for both hypothesis assessment and estimation problems. Optimality results are established for the principle of evidence as the basis of the approach to these problems. A close relationship is established between measuring bias in Bayesian inferences and frequentist properties that hold for any proper prior. This leads to a possible resolution to an apparent conflict between these approaches to statistical reasoning. Frequentism is seen as establishing figures of merit for a statistical study, while Bayes determines the inferences based upon statistical evidence.

**Modified likelihood root in high dimensions**

by *Yanbo Tang and Nancy Reid*

Journal of the Royal Statistical Society: Series B | 2020 | 82, 1349-1369

*Short Summary:* We examine a higher-order approximation to the significance function with increasing numbers of nuisance parameters, based on the normal approximation to an adjusted log-likelihood root. We show that the rate of the correction for nuisance parameters is larger than the correction for non-normality, when the parameter dimension $p$ is $O(n^{\alpha})$ for $\alpha < 1/2$. We specialize the results to linear exponential families and location-scale families and illustrate these with simulations.
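
For intuition, the adjusted (modified) likelihood root can be written out in the simplest scalar case, an exponential sample with no nuisance parameters; the formulas below are the standard Barndorff-Nielsen construction, shown only to illustrate the object being studied, not the paper's high-dimensional results.

```python
import math

def rstar_exponential(x, theta0):
    """Modified likelihood root r* for the rate of an exponential sample:
    r* = r + log(q/r)/r, with r the likelihood root and q a Wald-type
    quantity in the canonical parameterization. Undefined at r = 0 (the
    well-known singularity of r*)."""
    n, s = len(x), sum(x)
    mle = n / s                                       # MLE of the rate
    dev = 2 * n * (math.log(mle / theta0) - 1 + theta0 / mle)
    r = math.copysign(math.sqrt(dev), mle - theta0)   # likelihood root
    q = (mle - theta0) * math.sqrt(n) / mle           # Wald-type quantity
    rstar = r + math.log(q / r) / r                   # adjusted root
    return r, rstar, 0.5 * math.erfc(rstar / math.sqrt(2))  # P(Z > r*)
```

The normal approximation to `rstar` is typically third-order accurate here, versus first order for `r` itself.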

**Multiple block sizes and overlapping blocks for multivariate time series extremes**

by *Nan Zou, Stanislav Volgushev, and Axel Bücher*

Annals of Statistics | 2021 | 49(1), 295-320. DOI: 10.1214/20-AOS1957

*Short Summary:* Block maxima methods constitute a fundamental part of the statistical toolbox in extreme value analysis. However, most of the corresponding theory is derived under the simplifying assumption that block maxima are independent observations from a genuine extreme value distribution. In practice, however, block sizes are finite and observations from different blocks are dependent. Theory respecting the latter complications is not well developed, and, in the multivariate case, has only recently been established for disjoint blocks of a single block size. We show that using overlapping blocks instead of disjoint blocks leads to a uniform improvement in the asymptotic variance of the multivariate empirical distribution function of rescaled block maxima and any smooth functionals thereof (such as the empirical copula), without any sacrifice in the asymptotic bias. We further derive functional central limit theorems for multivariate empirical distribution functions and empirical copulas that are uniform in the block size parameter, which seems to be the first result of this kind for estimators based on block maxima in general. The theory allows for various aggregation schemes over multiple block sizes, leading to substantial improvements over the single block length case and opens the door to further methodology developments. In particular, we consider bias correction procedures that can improve the convergence rates of extreme-value estimators and shed some new light on estimation of the second-order parameter when the main purpose is bias correction.
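
The disjoint-versus-overlapping distinction is easy to make concrete: for block length b, disjoint blocks yield roughly n/b maxima while sliding blocks yield n-b+1 heavily correlated maxima, from which the empirical distribution function is computed. A minimal sketch (estimation of the extreme-value parameters themselves is beyond this illustration):

```python
import numpy as np

def disjoint_block_maxima(x, b):
    """Maxima over consecutive non-overlapping blocks of length b."""
    x = np.asarray(x)
    n = (len(x) // b) * b            # drop an incomplete final block
    return x[:n].reshape(-1, b).max(axis=1)

def sliding_block_maxima(x, b):
    """Maxima over all overlapping (sliding) blocks of length b."""
    x = np.asarray(x)
    return np.array([x[i:i + b].max() for i in range(len(x) - b + 1)])

def empirical_cdf(maxima, t):
    """Empirical distribution function of the block maxima, evaluated at t."""
    return float(np.mean(np.asarray(maxima) <= t))
```

The paper's result is that estimators built from the sliding-block empirical distribution function have asymptotic variance no larger than their disjoint-block counterparts, at the same asymptotic bias.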

**On set-based association tests: Insights from a regression using summary statistics**

by *Yanyan Zhao, and Lei Sun*

Canadian Journal of Statistics | 2021 (Online)

*Short Summary:* Motivated by, but not limited to, association analyses of multiple genetic variants, we propose here a summary statistics-based regression framework. The proposed method requires only variant-specific summary statistics, and it unifies earlier methods based on individual-level data as special cases. The resulting score test statistic, derived from a linear mixed-effect regression model, inherently transforms the variant-specific statistics using the precision matrix to improve power for detecting sparse alternatives. Furthermore, the proposed method can incorporate additional variant-specific information with ease, facilitating omic-data integration. We study the asymptotic properties of the proposed tests under the null and alternatives, and we investigate efficient P-value calculation in finite samples. Finally, we provide supporting empirical evidence from extensive simulation studies and two applications.
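
The precision-matrix transform of variant-level statistics can be illustrated with a basic quadratic form: under the global null the vector of z-scores is approximately N(0, Σ), so z'Σ⁻¹z is chi-squared with one degree of freedom per variant. This is a simplified stand-in for, not the paper's, mixed-model score statistic.

```python
import numpy as np
from scipy.stats import chi2

def set_test(z, sigma):
    """Quadratic set-based statistic from variant-level z-scores and their
    null correlation matrix sigma (e.g. from a linkage-disequilibrium
    reference panel). Sketch only; the paper's score test differs."""
    z = np.asarray(z, dtype=float)
    t = float(z @ np.linalg.solve(sigma, z))   # precision-matrix transform
    return t, chi2.sf(t, df=len(z))
```

Solving the linear system rather than inverting `sigma` explicitly is the usual numerically stable choice.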

**On specification tests for composite likelihood inference**

by *Jing Huang, Yang Ning, Nancy Reid, and Yong Chen*

Biometrika | 2020 | Volume: 107, 907-917

*Short Summary:* Composite likelihood functions are often used for inference in applications where the data have a complex structure. While inference based on composite likelihood can be more robust than inference based on the full likelihood, the inference is not valid if the associated conditional or marginal models are misspecified. In this paper, we propose a general class of specification tests for composite likelihood inference. The test statistics are motivated by the fact that the second Bartlett identity holds for each component of the composite likelihood function when these components are correctly specified. We construct the test statistics based on the discrepancy between the so-called composite information matrix and the sensitivity matrix. As an illustration, we study three important cases of the proposed tests and establish their limiting distributions under both null and local alternative hypotheses. Finally, we evaluate the finite-sample performance of the proposed tests in several examples.

**Rank-based Estimation under Asymptotic Dependence and Independence, with Applications to Spatial Extremes**

by *Michaël Lalancette, Sebastian Engelke, and Stanislav Volgushev*

Annals of Statistics | 2021 (accepted)

*Short Summary:* Multivariate extreme value theory is concerned with modeling the joint tail behavior of several random variables. Existing work mostly focuses on asymptotic dependence, where the probability of observing a large value in one of the variables is of the same order as observing a large value in all variables simultaneously. However, there is growing evidence that asymptotic independence is equally important in real world applications. Available statistical methodology in the latter setting is scarce and not well understood theoretically. We revisit non-parametric estimation and introduce rank-based M-estimators for parametric models that simultaneously work under asymptotic dependence and asymptotic independence, without requiring prior knowledge of which of the two regimes applies. Asymptotic normality of the proposed estimators is established under weak regularity conditions. We further show how bivariate estimators can be leveraged to obtain parametric estimators in spatial tail models, and again provide a thorough theoretical justification for our approach.

**Testing relevant hypotheses in functional time series via self-normalization**

by *Holger Dette, Kevin Kokot, and Stanislav Volgushev*

Journal of the Royal Statistical Society: Series B | 2020 | 82(3), 629-660

*Short Summary:* We develop methodology for testing relevant hypotheses about functional time series in a tuning-free way. Instead of testing for exact equality, e.g. for the equality of two mean functions from two independent time series, we propose to test the null hypothesis of no relevant deviation. In the two-sample problem this means that an L2-distance between the two mean functions is smaller than a prespecified threshold. For such hypotheses, self-normalization, which was introduced by Shao (2010) and Shao and Zhang (2010) and is commonly used to avoid the estimation of nuisance parameters, is not directly applicable. We develop new self-normalized procedures for testing relevant hypotheses in the one-sample, two-sample and change point problem and investigate their asymptotic properties. Finite sample properties of the tests proposed are illustrated by means of a simulation study and data examples. Our main focus is on functional time series, but extensions to other settings are also briefly discussed.

**The measurement of statistical evidence as the basis for statistical reasoning**

Proceedings of the 5th International Electronic Conference on Entropy and Its Applications | 2020 | 46(1), 7; DOI: 10.3390/ecea-5-06682

*Short Summary:* There are various approaches to the problem of how one is supposed to conduct a statistical analysis. Different analyses can lead to contradictory conclusions in some problems, so this is not a satisfactory state of affairs. It seems that all approaches make reference to the evidence in the data concerning questions of interest as a justification for the methodology employed. It is fair to say, however, that none of the most commonly used methodologies is absolutely explicit about how statistical evidence is to be characterized and measured. We will discuss the general problem of statistical reasoning and the development of a theory for this that is based on being precise about statistical evidence. This will be shown to lead to the resolution of a number of problems.

**Using prior expansions for prior-data conflict checking**

by *David Nott, Max Seah, Luai Al-Labadi, Michael Evans, Hui Khoon Ng, and Berge Englert*

Bayesian Analysis | 2021 | 16(1), 203-231

*Short Summary:* Any Bayesian analysis involves combining information represented through different model components, and when different sources of information are in conflict it is important to detect this. Here we consider checking for prior-data conflict in Bayesian models by expanding the prior used for the analysis into a larger family of priors, and considering a marginal likelihood score statistic for the expansion parameter. Consideration of different expansions can be informative about the nature of any conflict, and an appropriate choice of expansion can provide more sensitive checks for conflicts of certain types. Extensions to hierarchically specified priors and connections with other approaches to prior-data conflict checking are considered, and implementation in complex situations is illustrated with two applications. The first concerns testing for the appropriateness of a LASSO penalty in shrinkage estimation of coefficients in linear regression. Our method is compared with a recent suggestion in the literature designed to be powerful against alternatives in the exponential power family, and we use this family as the prior expansion for constructing our check. A second application concerns a problem in quantum state estimation, where a multinomial model is considered with physical constraints on the model parameters. In this example, the usefulness of different prior expansions is demonstrated for obtaining checks which are sensitive to different aspects of the prior.

**A generalized Levene's scale test for variance heterogeneity in the presence of sample correlation and group uncertainty**

by *David Soave* and *Lei Sun*

Biometrics | 2017 | 73(3):960-971

*Short Summary:* Why were we interested in generalizing Levene's test? It can be used to indirectly detect Gene-Environment interactions when E is missing!
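
The starting point, the classical Levene (Brown-Forsythe) scale test, is a one-liner with `scipy.stats.levene`; the toy data below are hypothetical, with an unmodelled G x E interaction (E missing) showing up as variance heterogeneity across genotype groups. The paper's generalization to correlated samples and uncertain group membership is not implemented here.

```python
import numpy as np
from scipy.stats import levene

# Hypothetical trait values grouped by genotype (0, 1, or 2 copies of an
# allele); the interaction inflates the spread in the carrier groups.
g0 = np.array([4.9, 5.1, 5.0, 4.8, 5.2, 5.0])
g1 = np.array([4.0, 6.2, 5.1, 3.8, 6.0, 5.2])
g2 = np.array([2.5, 7.6, 5.0, 2.2, 7.9, 4.8])   # widest spread

# Median-centred (Brown-Forsythe) variant of Levene's scale test
stat, p = levene(g0, g1, g2, center="median")
```

A small p-value flags variance heterogeneity across genotype groups, which can indirectly signal a G x E interaction even though E was never measured.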

**A new look at F-tests**

by *McCormack, A., Reid, N., Sartori, N., and Theivendran, S.-A.*

*Short Summary:* We show that the directional tests recently developed by Fraser, Reid, Sartori and Davison can be explicitly computed in a number of classical models, including normal theory linear regression, where the test reduces to the usual F-test.

**Adaptive Huber regression**

by *Qiang Sun, Wen-Xin Zhou, and Jianqing Fan*

Journal of the American Statistical Association | 2018

*Short Summary:* We propose the concept of tail-robustness, evidenced by better finite-sample performance than nonrobust methods in the presence of heavy-tailed data. To achieve this form of robustness, we propose adaptive Huber regression. The key difference between this and its classical counterpart, Huber regression, is that the robustification parameter must adapt to the sample size, dimensionality and unknown moments of the data, so that an optimal tradeoff between the effect of heavy-tailedness and statistical bias can be achieved.
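
A minimal sketch of the adaptive idea: let the Huber threshold grow with the sample size rather than fixing it. The rate `tau = c * sigma_hat * sqrt(n / log n)`, the MAD-based scale estimate and the pilot OLS fit below are illustrative choices in the spirit of the paper, not its exact calibration.

```python
import numpy as np
from scipy.optimize import minimize

def adaptive_huber(X, y, c=1.0):
    """Linear regression with a Huber loss whose robustification parameter
    tau adapts to n via a preliminary residual scale estimate (sketch)."""
    n, d = X.shape
    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)     # pilot OLS fit
    resid = y - X @ beta0
    sigma = np.median(np.abs(resid - np.median(resid))) / 0.6745 + 1e-12
    tau = c * sigma * np.sqrt(n / np.log(n))          # adaptive threshold

    def huber_loss(beta):
        r = y - X @ beta
        a = np.abs(r)
        # quadratic near zero, linear (bounded-influence) in the tails
        return np.sum(np.where(a <= tau, 0.5 * r**2, tau * a - 0.5 * tau**2))

    return minimize(huber_loss, beta0, method="BFGS").x
```

With light-tailed data `tau` is large and the fit is nearly least squares; with heavy tails the linear part of the loss bounds the influence of extreme observations.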

**Data-dependent PAC-Bayes priors via differential privacy**

by *G. K. Dziugaite, D. M. Roy*

Advances in Neural Information Processing Systems | 2018 (to appear)

*Short Summary*: The Probably Approximately Correct (PAC) Bayes framework (McAllester, 1999) can incorporate knowledge about the learning algorithm and data distribution through the use of distribution-dependent priors, yielding tighter generalization bounds on data-dependent posteriors. Using this flexibility, however, is difficult, especially when the data distribution is presumed to be unknown. We show how a differentially private prior yields a valid PAC-Bayes bound, and then show how non-private mechanisms for choosing priors obtain the same generalization bound provided they converge weakly to the private mechanism.

**Distributed inference for quantile regression processes**

by *Stanislav Volgushev, Shih-Kang Chao, and Guang Cheng*

Annals of Statistics | 2018 (to appear)

*Short Summary:* We provide novel approaches to quantile regression for big (massive) data and show one of the first examples where the failure of a popular computational approach, divide and conquer, can be characterized explicitly. The paper also provides new approaches to inference that explicitly use the divide-and-conquer framework for fast and simple inference.
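
A toy location-model version of the phenomenon: divide-and-conquer estimates a quantile by averaging per-block sample quantiles, and when the blocks become too small the per-block bias survives the averaging. (The paper treats full quantile regression processes; this sketch only conveys the failure mode.)

```python
import numpy as np

def dc_quantile(x, tau, n_blocks):
    """Divide-and-conquer quantile estimate: split the data into blocks and
    average the per-block sample quantiles."""
    blocks = np.array_split(np.asarray(x), n_blocks)
    return float(np.mean([np.quantile(b, tau) for b in blocks]))

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
full = float(np.quantile(x, 0.9))   # full-sample benchmark, approx. 1.2816
few = dc_quantile(x, 0.9, 10)       # large blocks: agrees with the benchmark
many = dc_quantile(x, 0.9, 5_000)   # 20 points per block: averaging bias dominates
```

With 10 blocks the average is essentially indistinguishable from the full-sample quantile; with 5,000 blocks the small-sample bias of each block quantile no longer averages out.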

**I-LAMM for sparse learning: Simultaneous control of algorithmic complexity and statistical error**

by *Jianqing Fan, Han Liu, Qiang Sun,* and *Tong Zhang*

The Annals of Statistics | 2018 | 46(2), 814-841

*Short Summary:* Nonconvex optimization has attracted much interest recently in both statistics and machine learning. This is possibly due to the popularity of big data, which enables the use of complex and nonconvex learning tools in practice. This paper shows that, by taking model structures and randomness into account, finding the global optima with a polynomial-time algorithm in nonconvex problems becomes possible, at least in the problem of nonconvex sparse regression. We propose such an algorithm and characterize its statistical and computational tradeoffs.
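
The shape of the iteration can be illustrated with its simplest convex relative: proximal-gradient (majorize-minimize) steps for l1-penalized least squares, where each step minimizes a quadratic majorizer of the loss and applies soft-thresholding. I-LAMM itself uses a *locally adaptive* majorizer and folded-concave penalties; this sketch conveys only the iteration's structure.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the l1 penalty."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_gradient_lasso(X, y, lam, iters=500):
    """ISTA-style majorize-minimize iterations for
    (1/2n)||y - X beta||^2 + lam * ||beta||_1."""
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n        # Lipschitz constant of the gradient
    beta = np.zeros(d)
    for _ in range(iters):
        grad = -X.T @ (y - X @ beta) / n
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta
```

Each update exactly minimizes the quadratic majorizer of the smooth loss plus the penalty, which is the majorize-minimize mechanism the paper's analysis tracks alongside the statistical error.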

**Quantile spectral analysis for locally stationary time series**

by *Stefan Birr, Stanislav Volgushev, Tobias Kley, Holger Dette, and Marc Hallin*

Journal of the Royal Statistical Society: Series B | 2017 | 79, 1619-1643

*Short Summary*: We develop new methods for time series analysis that can describe non-linear dynamics of non-stationary processes, and we show that many models routinely applied to time series are unable to capture the true dynamics of observed data.
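
The basic object, a rank-based quantile periodogram, can be sketched as the ordinary periodogram of level-crossing indicators; this stationary-world sketch is only the starting point, and the paper's locally stationary (time-varying) theory is not implemented here.

```python
import numpy as np

def quantile_periodogram(x, tau):
    """Periodogram of the copula rank indicators I{F_n(X_t) <= tau}, which
    captures serial dependence at quantile level tau that the ordinary
    (L2) spectrum can miss. Illustrative sketch only."""
    x = np.asarray(x)
    n = len(x)
    ranks = np.argsort(np.argsort(x)) + 1          # ranks 1..n
    ind = (ranks / n <= tau).astype(float)         # level-tau indicators
    d = np.fft.rfft(ind - ind.mean())              # centre, then transform
    return np.abs(d) ** 2 / (2 * np.pi * n)        # nonnegative, real-valued
```

Computing this at several levels `tau` (e.g. 0.1, 0.5, 0.9) reveals whether the dependence structure differs between the tails and the centre of the distribution.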

**Sampling and Estimation for (Sparse) Exchangeable Graphs**

by *V. Veitch, D. M. Roy*

Annals of Statistics | 2016 (to appear)

*Short Summary:* We develop the graphex framework (Veitch and Roy, 2015) as a tool for statistical network analysis by identifying the sampling scheme that is naturally associated with the models of the framework, and by introducing a general consistent estimator for the parameter (the graphex) underlying these models. Our results may be viewed as a generalization of consistent estimation via the empirical graphon from the dense graph regime to also include sparse graphs.

**Vine Copulas for Imputation of Monotone Non-Response**

by *Caren Hasler, Radu V. Craiu* and *Louis-Paul Rivest*

International Statistical Review | 2018 | 86: 488-511

*Short Summary:* Multiple imputation for sample surveys based on copula models.

**When should modes of inference disagree? Some simple but challenging examples.**

by *Fraser, D.A.S., Reid, N., and Lin, W.*

Annals of Applied Statistics | 2018 | 12, 750-770

*Short Summary:* This paper addresses eight illustrative problems that David Cox outlined for a recent conference. Each illustration raises difficulties for different theoretical approaches to inference. We discuss these from the viewpoint of our work on higher-order asymptotics.