# 2020-21

**A generalized robust allele-based genetic association test**

by *Lin Zhang, and Lei Sun*

Biometrics | 2021 (Online)

*Short Summary:* The allele-based association test, which compares the allele frequency difference between case and control groups, is locally most powerful. However, application of the classical allelic test is limited in practice, because the method is sensitive to the Hardy–Weinberg equilibrium (HWE) assumption, is not applicable to continuous traits, and cannot easily account for covariate effects or sample correlation. To develop a generalized robust allelic test, we propose a new allele-based regression model with individual allele as the response variable. We show that the score test statistic derived from this robust and unifying regression framework contains a correction factor that explicitly adjusts for potential departure from HWE and encompasses the classical allelic test as a special case. When the trait of interest is continuous, the corresponding allelic test evaluates a weighted difference between the individual-level allele frequency estimate and the sample estimate, where the weight is proportional to an individual's trait value, and the test remains valid under Y-dependent sampling. Finally, the proposed allele-based method can analyze multiple (continuous or binary) phenotypes simultaneously, as well as multiallelic genetic markers, while accounting for covariate effects, sample correlation, and population heterogeneity. To support our analytical findings, we provide empirical evidence from both simulation and application studies.
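
For reference, the classical allelic test that the proposed score test generalizes can be sketched in a few lines; the genotype coding and the simulated counts below are illustrative only, not taken from the paper.

```python
import numpy as np

def classical_allelic_test(cases, controls):
    """Classical 1-df allelic chi-square test: pool each individual's two
    alleles and compare allele counts between cases and controls. It is
    valid under HWE; the paper's score test adds an explicit correction
    factor for departure from HWE."""
    table = []
    for g in (np.asarray(cases), np.asarray(controls)):
        a = g.sum()                      # genotypes coded as 0/1/2 copies of allele A
        table.append([a, 2 * len(g) - a])
    o = np.array(table, dtype=float)     # 2x2 table of allele counts
    e = o.sum(axis=1, keepdims=True) * o.sum(axis=0) / o.sum()
    return ((o - e) ** 2 / e).sum()      # ~ chi-square(1) under the null

# Illustrative genotype counts (not from the paper):
cases = [2] * 50 + [1] * 30 + [0] * 20      # allele-A frequency 0.65
controls = [2] * 20 + [1] * 30 + [0] * 50   # allele-A frequency 0.35
stat = classical_allelic_test(cases, controls)
```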

**Bayesian model averaging for the X-chromosome inactivation dilemma in genetic association study**

by *Bo Chen, Radu V. Craiu, and Lei Sun*

Biostatistics | 2020 | 21(2) 319-335

*Short Summary:* The X chromosome is often excluded from so-called “whole-genome” association studies due to the differences it exhibits between males and females. One particular analytical challenge is the unknown status of X-inactivation, where one of the two X-chromosome variants in females may be randomly selected to be silenced. In the absence of biological evidence in favor of one specific model, we consider a Bayesian model averaging framework that offers a principled way to account for the inherent model uncertainty, providing model averaging-based posterior density intervals and Bayes factors. We examine the inferential properties of the proposed methods via extensive simulation studies, and we apply the methods to a genetic association study of an intestinal disease occurring in about 20% of cystic fibrosis patients. Compared with the results previously reported assuming the presence of inactivation, we show that the proposed Bayesian methods provide more feature-rich quantities that are useful in practice.

**Checking the model and the prior for the constrained multinomial**

by *Berge Englert, Michael Evans, Gun Ho Jang, Hui Khoon Ng, David Nott, and Max Seah*

Metrika | 2021 | DOI: 10.1007/s00184-021-00811-8

*Short Summary:* Multinomial models can be difficult to use when constraints are placed on the probabilities. An exact model checking procedure for such models is developed based on a uniform prior on the full multinomial model. For inference, a nonuniform prior can be used and a consistency theorem is proved concerning a check for prior-data conflict with the chosen prior. Applications are presented and a new elicitation methodology is developed for multinomial models with ordered probabilities.

**Measuring and controlling bias for some Bayesian inferences and the relation to frequentist criteria**

by *Michael Evans, and Yang Guo*

Entropy | 2021 | 23(2), 190 DOI:10.3390/e23020190

*Short Summary:* A common concern with Bayesian methodology in scientific contexts is that inferences can be heavily influenced by subjective biases. As presented here, there are two types of bias for some quantity of interest: bias against and bias in favor. Based upon the principle of evidence, it is shown how to measure and control these biases for both hypothesis assessment and estimation problems. Optimality results are established for the principle of evidence as the basis of the approach to these problems. A close relationship is established between measuring bias in Bayesian inferences and frequentist properties that hold for any proper prior. This leads to a possible resolution to an apparent conflict between these approaches to statistical reasoning. Frequentism is seen as establishing figures of merit for a statistical study, while Bayes determines the inferences based upon statistical evidence.

**Modified likelihood root in high dimensions**

by *Yanbo Tang, and Nancy Reid*

J Royal Statist Soc Series B | 2020 | Volume: 82, 1349-1369

*Short Summary:* We examine a higher-order approximation to the significance function with increasing numbers of nuisance parameters, based on the normal approximation to an adjusted log-likelihood root. We show that the rate of the correction for nuisance parameters is larger than the correction for non-normality, when the parameter dimension $p$ is $O(n^{\alpha})$ for $\alpha < 1/2$. We specialize the results to linear exponential families and location-scale families and illustrate these with simulations.

**Multiple block sizes and overlapping blocks for multivariate time series extremes**

by *Nan Zou, Stanislav Volgushev, and Axel Bücher*

Annals of Statistics | 2021 | 49(1): 295-320. DOI: 10.1214/20-AOS1957

*Short Summary:* Block maxima methods constitute a fundamental part of the statistical toolbox in extreme value analysis. However, most of the corresponding theory is derived under the simplifying assumption that block maxima are independent observations from a genuine extreme value distribution. In practice, however, block sizes are finite and observations from different blocks are dependent. Theory respecting the latter complications is not well developed, and, in the multivariate case, has only recently been established for disjoint blocks of a single block size. We show that using overlapping blocks instead of disjoint blocks leads to a uniform improvement in the asymptotic variance of the multivariate empirical distribution function of rescaled block maxima and any smooth functionals thereof (such as the empirical copula), without any sacrifice in the asymptotic bias. We further derive functional central limit theorems for multivariate empirical distribution functions and empirical copulas that are uniform in the block size parameter, which seems to be the first result of this kind for estimators based on block maxima in general. The theory allows for various aggregation schemes over multiple block sizes, leading to substantial improvements over the single block length case and opens the door to further methodology developments. In particular, we consider bias correction procedures that can improve the convergence rates of extreme-value estimators and shed some new light on estimation of the second-order parameter when the main purpose is bias correction.
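
The difference between the two blocking schemes is easy to see in code. The univariate numpy sketch below is purely illustrative (the paper's theory concerns the multivariate empirical distribution function and empirical copula); the series, block size, and seed are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)   # a toy stationary series
r = 50                          # block size (divides len(x) here)

# Disjoint blocks: n // r maxima, one per non-overlapping block.
disjoint = x.reshape(-1, r).max(axis=1)

# Overlapping (sliding) blocks: n - r + 1 maxima, one per window;
# reusing every window is what yields the asymptotic variance gain.
overlapping = np.array([x[i:i + r].max() for i in range(len(x) - r + 1)])

def ecdf(maxima, t):
    """Empirical distribution function of the block maxima at t."""
    return np.mean(maxima <= t)
```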

**On set-based association tests: Insights from a regression using summary statistics**

by *Yanyan Zhao, and Lei Sun*

Canadian Journal of Statistics | 2021 (Online)

*Short Summary:* Motivated by, but not limited to, association analyses of multiple genetic variants, we propose here a summary statistics-based regression framework. The proposed method requires only variant-specific summary statistics, and it unifies earlier methods based on individual-level data as special cases. The resulting score test statistic, derived from a linear mixed-effect regression model, inherently transforms the variant-specific statistics using the precision matrix to improve power for detecting sparse alternatives. Furthermore, the proposed method can incorporate additional variant-specific information with ease, facilitating omic-data integration. We study the asymptotic properties of the proposed tests under the null and alternatives, and we investigate efficient P-value calculation in finite samples. Finally, we provide supporting empirical evidence from extensive simulation studies and two applications.

**On specification tests for composite likelihood inference**

by *Jing Huang, Yang Ning, Nancy Reid, and Yong Chen*

Biometrika | 2020 | Volume: 107, 907-917

*Short Summary:* Composite likelihood functions are often used for inference in applications where the data have a complex structure. While inference based on composite likelihood can be more robust than inference based on the full likelihood, the inference is not valid if the associated conditional or marginal models are misspecified. In this paper, we propose a general class of specification tests for composite likelihood inference. The test statistics are motivated by the fact that the second Bartlett identity holds for each component of the composite likelihood function when these components are correctly specified. We construct the test statistics based on the discrepancy between the so-called composite information matrix and the sensitivity matrix. As an illustration, we study three important cases of the proposed tests and establish their limiting distributions under both null and local alternative hypotheses. Finally, we evaluate the finite-sample performance of the proposed tests in several examples.
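
The second Bartlett identity underlying these tests can be checked numerically in a toy, correctly specified model; the N(mu, 1) example below is our own illustration, not from the paper.

```python
import numpy as np

# For x ~ N(mu, 1) the score of the log-density at mu is (x - mu), so
# under correct specification the variability matrix E[score^2] equals
# the sensitivity matrix -E[d score / d mu] = 1 (second Bartlett identity).
rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 200_000)   # data generated under the model, mu = 0
score = x - 0.0
variability = np.mean(score ** 2)   # Monte Carlo estimate of E[score^2]
sensitivity = 1.0                   # -(second derivative) is exactly 1 here
discrepancy = variability - sensitivity   # near 0 when the model is correct
```

Misspecification (e.g., data with variance not equal to 1) would make this discrepancy bounded away from zero, which is what the proposed test statistics detect.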

**Rank-based Estimation under Asymptotic Dependence and Independence, with Applications to Spatial Extremes**

by *Michaël Lalancette, Sebastian Engelke, and Stanislav Volgushev*

Annals of Statistics | 2021 (accepted)

*Short Summary:* Multivariate extreme value theory is concerned with modeling the joint tail behavior of several random variables. Existing work mostly focuses on asymptotic dependence, where the probability of observing a large value in one of the variables is of the same order as observing a large value in all variables simultaneously. However, there is growing evidence that asymptotic independence is equally important in real world applications. Available statistical methodology in the latter setting is scarce and not well understood theoretically. We revisit non-parametric estimation and introduce rank-based M-estimators for parametric models that simultaneously work under asymptotic dependence and asymptotic independence, without requiring prior knowledge on which of the two regimes applies. Asymptotic normality of the proposed estimators is established under weak regularity conditions. We further show how bivariate estimators can be leveraged to obtain parametric estimators in spatial tail models, and again provide a thorough theoretical justification for our approach.

**Testing relevant hypotheses in functional time series via self-normalization**

by *Holger Dette, Kevin Kokot, and Stanislav Volgushev*

Journal of the Royal Statistical Society: Series B | 2020 | 82(3), 629-660

*Short Summary:* We develop methodology for testing relevant hypotheses about functional time series in a tuning-free way. Instead of testing for exact equality, e.g. for the equality of two mean functions from two independent time series, we propose to test the null hypothesis of no relevant deviation. In the two-sample problem this means that an L2-distance between the two mean functions is smaller than a prespecified threshold. For such hypotheses self-normalization, which was introduced in 2010 by Shao and by Shao and Zhang and is commonly used to avoid the estimation of nuisance parameters, is not directly applicable. We develop new self-normalized procedures for testing relevant hypotheses in the one-sample, two-sample and change point problem and investigate their asymptotic properties. Finite-sample properties of the proposed tests are illustrated by means of a simulation study and data examples. Our main focus is on functional time series, but extensions to other settings are also briefly discussed.

**The measurement of statistical evidence as the basis for statistical reasoning**

Proceedings of the 5th International Electronic Conference on Entropy and Its Applications | 2020 | 46(1), 7; DOI:10.3390/ecea-5-06682

*Short Summary:* There are various approaches to the problem of how one is supposed to conduct a statistical analysis. Different analyses can lead to contradictory conclusions in some problems, so this is not a satisfactory state of affairs. It seems that all approaches make reference to the evidence in the data concerning questions of interest as a justification for the methodology employed. It is fair to say, however, that none of the most commonly used methodologies is absolutely explicit about how statistical evidence is to be characterized and measured. We will discuss the general problem of statistical reasoning and the development of a theory for this that is based on being precise about statistical evidence. This will be shown to lead to the resolution of a number of problems.

**Using prior expansions for prior-data conflict checking**

by *David Nott, Max Seah, Luai Al-Labadi, Michael Evans, Hui Khoon Ng, and Berge Englert*

Bayesian Analysis | 2021 | 16(1), 203-231

*Short Summary:* Any Bayesian analysis involves combining information represented through different model components, and when different sources of information are in conflict it is important to detect this. Here we consider checking for prior-data conflict in Bayesian models by expanding the prior used for the analysis into a larger family of priors, and considering a marginal likelihood score statistic for the expansion parameter. Consideration of different expansions can be informative about the nature of any conflict, and an appropriate choice of expansion can provide more sensitive checks for conflicts of certain types. Extensions to hierarchically specified priors and connections with other approaches to prior-data conflict checking are considered, and implementation in complex situations is illustrated with two applications. The first concerns testing for the appropriateness of a LASSO penalty in shrinkage estimation of coefficients in linear regression. Our method is compared with a recent suggestion in the literature designed to be powerful against alternatives in the exponential power family, and we use this family as the prior expansion for constructing our check. A second application concerns a problem in quantum state estimation, where a multinomial model is considered with physical constraints on the model parameters. In this example, the usefulness of different prior expansions is demonstrated for obtaining checks which are sensitive to different aspects of the prior.

# Previous Publications

**A generalized Levene's scale test for variance heterogeneity in the presence of sample correlation and group uncertainty**

by *David Soave* and *Lei Sun*

Biometrics | 2017 | 73(3):960-971

*Short Summary:* Why were we interested in generalizing Levene's test? It can be used to indirectly detect gene-environment (G×E) interactions when E is missing!
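
For context, the classical Levene statistic (with known, uncorrelated groups) that the paper generalizes can be written in a few lines of numpy; the simulated groups and variance inflation below are purely illustrative.

```python
import numpy as np

def levene_stat(groups, center=np.median):
    """Classical Levene/Brown-Forsythe statistic W for variance
    heterogeneity across groups; compare against F(k-1, N-k)."""
    z = [np.abs(g - center(g)) for g in groups]     # absolute deviations
    n = np.array([len(zi) for zi in z])
    zbar_i = np.array([zi.mean() for zi in z])      # per-group means of z
    zbar = np.concatenate(z).mean()                 # grand mean of z
    k, N = len(groups), n.sum()
    between = np.sum(n * (zbar_i - zbar) ** 2)
    within = sum(((zi - m) ** 2).sum() for zi, m in zip(z, zbar_i))
    return (N - k) * between / ((k - 1) * within)

rng = np.random.default_rng(1)
# Three simulated genotype groups; the third has inflated variance,
# as would arise from an interaction with an unmeasured exposure E.
groups = [rng.normal(0.0, s, 200) for s in (1.0, 1.0, 2.0)]
W = levene_stat(groups)
```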

**A new look at F-tests**

by *McCormack, A., Reid, N., Sartori, N., and Theivendran, S.-A.*

*Short Summary:* We show that the directional tests recently developed by Fraser, Reid, Sartori and Davison can be explicitly computed in a number of classical models, including normal theory linear regression, where the test reduces to the usual F-test.

**Adaptive Huber regression**

by *Qiang Sun, Wen-Xin Zhou, and Jianqing Fan*

Journal of the American Statistical Association | 2018

*Short Summary:* We proposed the concept of tail-robustness, which is evidenced by better finite-sample performance than nonrobust methods in the presence of heavy-tailed data. To achieve this form of robustness, we proposed the adaptive Huber regression. The key difference between this and its classical counterpart, Huber regression, is that the robustification parameter needs to adapt to the sample size, dimensionality and unknown moments of the data, so that an optimal tradeoff between the effect of heavy-tailedness and statistical bias can be achieved.
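
A minimal sketch of the idea, specialized to mean estimation: the robustification parameter tau grows with the sample size rather than staying fixed. The sqrt(n / log n) scaling and the MAD-based standardization below are illustrative choices for this sketch, not the paper's exact tuning rule.

```python
import numpy as np

def adaptive_huber_mean(x, c=1.0, n_iter=100):
    """Huber-type mean with a sample-size-adaptive robustification
    parameter tau: large residuals are down-weighted, and tau grows
    with n so the estimator targets the mean rather than the median."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    scale = np.median(np.abs(x - np.median(x)))   # robust scale (MAD)
    tau = c * scale * np.sqrt(n / np.log(n))      # adaptive tuning (illustrative)
    mu = np.median(x)
    for _ in range(n_iter):                       # iteratively reweighted mean
        r = np.abs(x - mu)
        w = np.minimum(1.0, tau / np.maximum(r, 1e-12))
        mu = np.sum(w * x) / np.sum(w)
    return mu

rng = np.random.default_rng(3)
# Normal data with mean 1, plus a few gross outliers (heavy contamination).
x = np.concatenate([rng.normal(1.0, 1.0, 1000), np.full(10, 500.0)])
```

On this contaminated sample the ordinary mean is pulled far from 1, while the adaptively tuned Huber mean stays close to it.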

**Data-dependent PAC-Bayes priors via differential privacy**

by *G. K. Dziugaite and D. M. Roy*

Advances in Neural Information Processing Systems | 2018 (to appear)

*Short Summary:* The Probably Approximately Correct (PAC) Bayes framework (McAllester, 1999) can incorporate knowledge about the learning algorithm and data distribution through the use of distribution-dependent priors, yielding tighter generalization bounds on data-dependent posteriors. Using this flexibility, however, is difficult, especially when the data distribution is presumed to be unknown. We show how a differentially private prior yields a valid PAC-Bayes bound, and then show how non-private mechanisms for choosing priors obtain the same generalization bound provided they converge weakly to the private mechanism.

**Distributed inference for quantile regression processes**

by *Stanislav Volgushev, Shih-Kang Chao, and Guang Cheng*

Annals of Statistics | 2018 (to appear)

*Short Summary:* We provide novel approaches to quantile regression for big (massive) data and give one of the first examples where the failure of a popular computational approach, divide and conquer, can be characterized explicitly. The paper also provides new approaches to inference that explicitly use the divide-and-conquer framework for fast and simple inference.

**I-LAMM for sparse learning: Simultaneous control of algorithmic complexity and statistical error**

by *Jianqing Fan, Han Liu, Qiang Sun,* and *Tong Zhang*

The Annals of Statistics | 2018 | 46(2), 814-841

*Short Summary:* Nonconvex optimization has attracted much interest recently in both statistics and machine learning. This is possibly due to the popularity of big data, which enables the use of complex and nonconvex learning tools in practice. This paper shows that, by taking model structures and randomness into account, finding the global optima with a polynomial-time algorithm in nonconvex problems becomes possible, at least in the problem of nonconvex sparse regression. We propose such an algorithm and characterize its statistical and computational tradeoffs.

**Quantile spectral analysis for locally stationary time series**

by *Stefan Birr, Stanislav Volgushev, Tobias Kley, Holger Dette* and *Marc Hallin*

Journal of the Royal Statistical Society: Series B | 2017 | 79, 1619-1643

*Short Summary:* We develop new methods for time series analysis that allow us to describe non-linear dynamics of non-stationary processes, and show that many models routinely applied to study time series cannot capture the true dynamics of observed data.

**Sampling and Estimation for (Sparse) Exchangeable Graphs**

by *V. Veitch and D. M. Roy*

Annals of Statistics | 2016 (to appear)

*Short Summary:* We develop the graphex framework (Veitch and Roy, 2015) as a tool for statistical network analysis by identifying the sampling scheme that is naturally associated with the models of the framework, and by introducing a general consistent estimator for the parameter (the graphex) underlying these models. Our results may be viewed as a generalization of consistent estimation via the empirical graphon from the dense graph regime to also include sparse graphs.

**Vine Copulas for Imputation of Monotone Non-Response**

by *Caren Hasler, Radu V. Craiu, and Louis-Paul Rivest*

International Statistical Review | 2018 | 86: 488-511

*Short Summary:* Multiple imputation for sample surveys using copula models.

**When should modes of inference disagree? Some simple but challenging examples.**

by *Fraser, D.A.S., Reid, N., and Lin, W.*

Annals of Applied Statistics | 2018 | 12, 750-770

*Short Summary:* This paper addresses eight illustrative problems that David Cox outlined for a recent conference. Each illustration raises difficulties for different theoretical approaches to inference. We discuss these from the perspective of our work on higher-order asymptotics.