2022-23

A population-aware retrospective regression to detect genome-wide variants with sex difference in allele frequency

By Zhong Wang, Andrew Paterson, and Lei Sun

Annals of Applied Statistics | 2023 | Accepted

Sex difference in allele frequency is an emerging topic that is crucial to our understanding of data quality and features, particularly when it comes to the largely overlooked X chromosome. To detect sex differences in allele frequency for both X chromosomal and autosomal variants, the existing method is conservative when applied to samples from multiple ancestral populations. Additionally, it remains unexplored whether the sex difference in allele frequency varies between populations, which is important for trans-ancestral genetic studies. To answer these questions, we thus developed a novel, retrospective regression-based testing framework that led to interpretable and easy-to-implement solutions. We then applied the proposed methods to the high-coverage whole genome sequence data of the 1000 Genomes Project, robustly analyzing all samples available from the five super-populations. We had 97 novel findings by recognizing and modelling ancestral differences. Finally, we replicated the specific findings and overall conclusion using the gnomAD v3.1.2 data.

Leveraging Hardy-Weinberg disequilibrium for association testing in case-control studies

By Lin Zhang, Lisa Strug, and Lei Sun

Annals of Applied Statistics | 2023 | 17(2):1764-1781

Modern genome-wide association studies (GWAS) remove single nucleotide polymorphisms (SNPs) that are in Hardy–Weinberg disequilibrium (HWD), despite limited rigor for this practice. In a case-control GWAS, although HWD in the control sample is evidence of genotyping error, a truly associated SNP may be in HWD in the case and/or control populations. We, therefore, develop a new case-control association test that: (i) leverages HWD attributed to true association to increase power, (ii) is robust to HWD caused by genotyping error, and (iii) is easy to implement at the genome-wide level. The proposed robust allele-based joint test incorporates the difference in HWD between the case and control samples into the traditional association measure to gain power. We provide the asymptotic distribution of the proposed test statistic under the null hypothesis. We evaluate its type 1 error control at the genome-wide significance level of 5×10⁻⁸ in the presence of HWD attributed to factors unrelated to phenotype-genotype association, such as genotyping error. Finally, we demonstrate that the power of the proposed allele-based joint test is higher than the standard association test for a variety of genetic models, through derivations of the noncentrality parameters of the tests, as well as simulation and application studies.
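
For readers unfamiliar with the quantity involved, the following minimal sketch computes the textbook Hardy–Weinberg disequilibrium coefficient (observed homozygote frequency minus the squared allele frequency) separately in cases and controls and takes their difference. It is not the joint test statistic proposed in the paper; the function name and the genotype counts are hypothetical and chosen only for illustration.

```python
def hwd_coefficient(n_AA, n_Aa, n_aa):
    """Textbook disequilibrium coefficient: freq(AA) - freq(A)^2.

    Equals 0 when genotype frequencies match Hardy-Weinberg proportions.
    """
    n = n_AA + n_Aa + n_aa
    if n == 0:
        raise ValueError("at least one genotype is required")
    p_A = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    return n_AA / n - p_A ** 2


# Hypothetical genotype counts (AA, Aa, aa), for illustration only.
cases = (40, 40, 20)
controls = (25, 50, 25)

delta_cases = hwd_coefficient(*cases)
delta_controls = hwd_coefficient(*controls)
hwd_difference = delta_cases - delta_controls
```

In this made-up example the controls sit exactly at Hardy–Weinberg proportions while the cases show an excess of AA homozygotes, which is the kind of case-control contrast the abstract describes as informative.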

The hidden factor: accounting for covariate effects in power and sample size computation for a binary trait

By Ziang Zhang, and Lei Sun

Bioinformatics | 2023 | 39(4):btad139

Accurate power and sample size estimation is crucial to the design and analysis of genetic association studies. When analyzing a binary trait via logistic regression, important covariates such as age and sex are typically included in the model. However, their effects are rarely properly considered in power or sample size computation during study planning. Unlike when analyzing a continuous trait, the power of association testing between a binary trait and a genetic variant depends, explicitly, on covariate effects, even under the assumption of gene-environment independence. Earlier work recognizes this hidden factor but the implemented methods are not flexible. We thus propose and implement a generalized method for estimating power and sample size for (discovery or replication) association studies of binary traits that (i) accommodates different types of nongenetic covariates E, (ii) deals with different types of G-E relationships, and (iii) is computationally efficient. Extensive simulation studies show that the proposed method is accurate and computationally efficient for both prospective and retrospective sampling designs with various covariate structures. A proof-of-principle application focused on the understudied African sample in the UK Biobank data. Results show that, in contrast to studying the continuous blood pressure trait, when analyzing the binary hypertension trait, ignoring covariate effects of age and sex leads to overestimated power and underestimated replication sample size.

Previous Publications

2021-22

A Survey of Tasks and Visualizations in Multiverse Analysis Reports

by Brian Hall, Yang Liu, Yvonne Jansen, Pierre Dragicevic, Fanny Chevalier, Matthew Kay

Computer Graphics Forum | 2022 | 41 (1). pp. 402-426.

Short Summary: Analyzing data from experiments is a complex, multi-step process, often with multiple defensible choices available at each step. While analysts often report a single analysis without documenting how it was chosen, this can cause serious transparency and methodological issues. To make the sensitivity of analysis results to analytical choices transparent, some statisticians and methodologists advocate the use of “multiverse analysis”: reporting the full range of outcomes that result from all combinations of defensible analytic choices. Summarizing this combinatorial explosion of statistical results presents unique challenges; several approaches to visualizing the output of multiverse analyses have been proposed across a variety of fields (e.g., psychology, statistics, economics, neuroscience). In this article, we (1) introduce a consistent conceptual framework and terminology for multiverse analyses that can be applied across fields; (2) identify the tasks researchers try to accomplish when visualizing multiverse analyses; and (3) classify multiverse visualizations into “archetypes”, assessing how well each archetype supports each task. Our work sets a foundation for subsequent research on developing visualization tools and techniques to support multiverse analysis and its reporting.
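
As a small illustration of the "all combinations of defensible analytic choices" idea, the sketch below enumerates a toy multiverse with the Python standard library. The data, the two analytic choices, and the function names are hypothetical and are not taken from the article.

```python
from itertools import product
from statistics import mean, median

# Hypothetical analytic choices: each key is a decision point,
# each list holds the defensible options at that point.
choices = {
    "outlier_rule": ["keep_all", "drop_above_100"],
    "summary": ["mean", "median"],
}

# Hypothetical measurements with one extreme value.
data = [12, 15, 14, 13, 250]


def run_analysis(values, outlier_rule, summary):
    """Run one 'universe': apply one option per decision point."""
    if outlier_rule == "drop_above_100":
        values = [v for v in values if v <= 100]
    return mean(values) if summary == "mean" else median(values)


def multiverse(values, choices):
    """Return (specification, result) for every combination of options."""
    names = list(choices)
    results = []
    for combo in product(*(choices[name] for name in names)):
        spec = dict(zip(names, combo))
        results.append((spec, run_analysis(values, **spec)))
    return results


universes = multiverse(data, choices)
```

Two decision points with two options each already give four results; the number of universes is the product of the option counts, which is the combinatorial growth that makes summarizing and visualizing a multiverse difficult.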

Bayesian Inference of Globular Cluster Properties Using Distribution Functions

by Gwendolyn M. Eadie, Jeremy J. Webb, and Jeffrey S. Rosenthal

The Astrophysical Journal | 2022 | 926:211 (18 pp)

Short Summary: We present a Bayesian inference approach to estimating the cumulative mass profile and mean-squared velocity profile of a globular cluster (GC) given the spatial and kinematic information of its stars. Mock GCs with a range of sizes and concentrations are generated from lowered-isothermal dynamical models, from which we test the reliability of the Bayesian method to estimate model parameters through repeated statistical simulation. We find that given unbiased star samples, we are able to reconstruct the cluster parameters used to generate the mock cluster and the cluster’s cumulative mass and mean-squared velocity profiles with good accuracy. We further explore how strongly biased sampling, which could be the result of observing constraints, might affect this approach. Our tests indicate that if we instead have biased samples, then our estimates can be off in certain ways that are dependent on cluster morphology. Overall, our findings motivate obtaining samples of stars that are as unbiased as possible. This may be achieved by combining information from multiple telescopes (e.g., Hubble and Gaia), but will require careful modeling of the measurement uncertainties through a hierarchical model, which we plan to pursue in future work.

CLEAN: Leveraging Spatial Autocorrelation in Neuroimaging Data in Clusterwise Inference

by Jun Young Park, and Mark Fiecas

Neuroimage | 2021 (accepted) | To appear

Short Summary: While clusterwise inference is a popular approach in neuroimaging that improves sensitivity, current methods do not account for explicit spatial autocorrelations because most use univariate test statistics to construct cluster-extent statistics. Failure to account for such dependencies could result in decreased reproducibility. To address methodological and computational challenges, we propose a new powerful and fast statistical method called CLEAN (Clusterwise inference Leveraging spatial Autocorrelations in Neuroimaging). CLEAN computes multivariate test statistics by modelling brain-wise spatial autocorrelations, constructs cluster-extent test statistics, and applies a refitting-free resampling approach to control false positives. We validate CLEAN using simulations and applications to the Human Connectome Project. This novel method provides a new direction in neuroimaging that keeps pace with advances in high-resolution MRI data, which contain a substantial amount of spatial autocorrelation.

Clearing the Hurdle: The Mass of Globular Cluster Systems as a Function of Host Galaxy Mass

by Eadie, Gwendolyn, Harris, William, and Springford, Aaron

The Astrophysical Journal | 2022 | 926:162 (19pp)

Short Summary: Current observational evidence suggests that all large galaxies contain globular clusters (GCs), while the smallest galaxies do not. Over what galaxy mass range does the transition from GCs to no GCs occur? We investigate this question using galaxies in the Local Group (LG), nearby dwarf galaxies, and galaxies in the Virgo Cluster Survey. We consider four types of statistical model: (1) logistic regression to model the probability that a galaxy of stellar mass M* has any number of GCs; (2) Poisson regression to model the number of GCs versus M*; (3) linear regression to model the relation between GC system mass ($\mathrm{log}{M}_{\mathrm{gcs}}$) and host galaxy mass ($\mathrm{log}{M}_{\star }$); and (4) a Bayesian lognormal hurdle model of the GC system mass as a function of galaxy stellar mass for the entire data sample. From the logistic regression, we find that the 50% probability point for a galaxy to contain GCs is M* = 10^6.8 M⊙. From postfit diagnostics, we find that Poisson regression is an inappropriate description of the data. Ultimately, we find that the Bayesian lognormal hurdle model, which is able to describe how the mass of the GC system varies with M* even in the presence of many galaxies with no GCs, is the most appropriate model over the range of our data. In an Appendix, we also present photometry for the little-known GC in the LG dwarf Ursa Major II.

Major Sex Differences in Allele Frequencies for X Chromosomal Variants in Both the 1000 Genomes Project and GnomAD.

by Zhong Wang, Lei Sun, and Andrew Paterson

PLoS Genetics | Accepted

Short Summary: An unexpectedly high proportion of SNPs on the X chromosome in the 1000 Genomes Project phase 3 data were identified with significant sex differences in minor allele frequencies (sdMAF). sdMAF persisted for many of these SNPs in the recently released high coverage whole genome sequence of the 1000 Genomes Project that was aligned to GRCh38, and it was consistent between the five super-populations. Among the 245,825 common (MAF>5%) biallelic X-chromosomal SNPs in the phase 3 data presumed to be of high quality, 2,039 have genome-wide significant sdMAF (p-value < 5e-8). sdMAF varied by location: non-pseudo-autosomal region (NPR)=0.83%, pseudo-autosomal regions (PAR1)=0.29%, PAR2=13.1%, and X-transposed region (XTR)/PAR3=0.85% of SNPs had sdMAF, and they were clustered at the NPR-PAR boundaries, among others. sdMAF at the NPR-PAR boundaries is biologically expected due to sex-linkage, but has generally been ignored in association studies. For comparison, similar analyses found only 6, 1 and 0 SNPs with significant sdMAF on chromosomes 1, 7 and 22, respectively. Similar sdMAF results for the X chromosome were obtained from the high coverage whole genome sequence data from gnomAD v3.1.2 for both the non-Finnish European and African/African American samples. Future X chromosome analyses need to take sdMAF into account.

Non-Natural Manners of Death in Ontario: Effects of the COVID-19 Pandemic and Related Public Health Measures

by J.M. Dmetrichuk, J.S. Rosenthal, J. Man, M. Cullipa, R.A. Wells

Lancet Regional Health - Americas | 2022 | 7, 100130.

Short Summary: In this study, we examine data from the OCC-OFPS to identify trends in manners of death and types of death across four provincially-defined 'lockdown' stages of the COVID-19 pandemic. We show that homicide rates in Ontario were largely unaffected during the lockdown. Suicide rates slightly decreased during Stage 0, compared to recent years (2013 onwards). There was a substantial increase in the rate of drug-related fatalities during all stages of the lockdown. Accidental motor vehicle collision-associated fatalities decreased slightly in 2020; however, an effect attributed to the lockdown was not clearly evident, particularly when compared to recent years. Our data have implications applicable across North America and are key to future public policy development. Our analysis highlights the importance of the death investigation system in mobilizing such data to best inform public health practice and policy recommendations.

Permutation-based Inference for Spatially Localized Signals in Longitudinal MRI Data

by Jun Young Park, and Mark Fiecas

Neuroimage | 2021 (published) | 239, 118312

Short Summary: Alzheimer’s disease is a neurodegenerative disease in which the degree of cortical atrophy in specific structures of the brain serves as a useful imaging biomarker. Recent approaches using linear mixed effects (LME) models in longitudinal neuroimaging have been powerful and flexible in investigating the temporal trajectories of cortical thickness. However, massive-univariate analysis, a simplified approach that obtains a summary statistic (e.g., a p-value) for every vertex along the cortex, is insufficient to model cortical atrophy because it does not account for spatial similarities of the signals in neighboring locations. In this article, we develop a permutation-based inference procedure to detect spatial clusters of vertices showing statistically significant differences in the rates of cortical atrophy. The proposed method, called SpLoc, uses spatial information to combine the signals adaptively across neighboring vertices, yielding high statistical power while controlling family-wise error rate (FWER) accurately. When we reject the global null hypothesis, we use a cluster selection algorithm to detect the spatial clusters of significant vertices. We validate our method using simulation studies and apply it to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data to show its superior performance over existing methods. An R package for implementing SpLoc is publicly available.

Reward Design in Risk-Taking Contests

by Marcel Nutz, and Yuchong Zhang

SIAM Journal on Financial Mathematics | 2022 | 13(1), 129-146

Short Summary: Following the risk-taking model of Seel and Strack, $n$ players decide when to stop privately observed Brownian motions with drift and absorption at zero. They are then ranked according to their level of stopping and paid a rank-dependent reward. We study the problem of a principal who aims to induce a desirable equilibrium performance of the players by choosing how much reward is attributed to each rank. Specifically, we determine optimal reward schemes for principals interested in the average performance and the performance at a given rank. While the former can be related to reward inequality in the Lorenz sense, the latter can have a surprising shape.

Teamwise Mean Field Competitions

by Xiang Yu, Yuchong Zhang and Zhou Zhou

Applied Mathematics & Optimization | 2021 | 84, 903-942

Short Summary: This paper studies competitions with rank-based reward among a large number of teams. Within each sizable team, we consider a mean-field contribution game in which each team member contributes to the jump intensity of a common Poisson project process; across all teams, a mean field competition game is formulated on the rank of the completion time, namely the jump time of the Poisson project process, and the reward to each team is paid based on its ranking. On the layer of teamwise competition game, three optimization problems are introduced when the team size is determined by: (i) the team manager; (ii) the central planner; (iii) the team members’ voting as partnership. We propose a relative performance criterion for each team member to share the team’s reward and formulate some special cases of mean field games of mean field games, which are new to the literature. In all problems with homogeneous parameters, the equilibrium control of each worker and the equilibrium or optimal team size can be computed in an explicit manner, allowing us to analytically examine the impacts of some model parameters and discuss their economic implications. Two numerical examples are also presented to illustrate the parameter dependence and comparison between different team size decision making.

The Mass of the Milky Way from the H3 Survey

by Shen, Jeff, Eadie, Gwendolyn M., Murray, Norman, Zaritsky, Dennis, Speagle, Joshua S., Ting, Yuan-Sen, Conroy, Charlie, Cargile, Phillip A., Johnson, Benjamin D., Naidu, Rohan P., and Han, Jiwon Jesse

The Astrophysical Journal | 2022 | 925:1 (19pp)

Short Summary: The mass of the Milky Way is a critical quantity that, despite decades of research, remains uncertain within a factor of two. Until recently, most studies have used dynamical tracers in the inner regions of the halo, relying on extrapolations to estimate the mass of the Milky Way. In this paper, we extend the hierarchical Bayesian model applied in Eadie & Jurić to study the mass distribution of the Milky Way halo; the new model allows for the use of all available 6D phase-space measurements. We use kinematic data of halo stars out to 142 kpc, obtained from the H3 survey and Gaia EDR3, to infer the mass of the Galaxy. Inference is carried out with the No-U-Turn sampler, a fast and scalable extension of Hamiltonian Monte Carlo. We report a median mass enclosed within 100 kpc of $M(\lt 100\,\mathrm{kpc})={0.69}_{-0.04}^{+0.05}\times {10}^{12}\,{M}_{\odot }$ (68% Bayesian credible interval), or a virial mass of ${M}_{200}=M(\lt {216.2}_{-7.5}^{+7.5}\,\mathrm{kpc})={1.08}_{-0.11}^{+0.12}\times {10}^{12}\,{M}_{\odot }$, in good agreement with other recent estimates. We analyze our results using posterior predictive checks and find limitations in the model's ability to describe the data. In particular, we find sensitivity with respect to substructure in the halo, which limits the precision of our mass estimates to ∼15%.

2020-21

A generalized robust allele-based genetic association test

by Lin Zhang, and Lei Sun

Biometrics | 2021 (Online)

Short Summary: The allele-based association test, comparing allele frequency difference between case and control groups, is locally most powerful. However, application of the classical allelic test is limited in practice, because the method is sensitive to the Hardy–Weinberg equilibrium (HWE) assumption, not applicable to continuous traits, and not easy to account for covariate effect or sample correlation. To develop a generalized robust allelic test, we propose a new allele-based regression model with individual allele as the response variable. We show that the score test statistic derived from this robust and unifying regression framework contains a correction factor that explicitly adjusts for potential departure from HWE and encompasses the classical allelic test as a special case. When the trait of interest is continuous, the corresponding allelic test evaluates a weighted difference between individual-level allele frequency estimate and sample estimate where the weight is proportional to an individual's trait value, and the test remains valid under Y-dependent sampling. Finally, the proposed allele-based method can analyze multiple (continuous or binary) phenotypes simultaneously and multiallelic genetic markers, while accounting for covariate effect, sample correlation, and population heterogeneity. To support our analytical findings, we provide empirical evidence from both simulation and application studies.
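
For reference, the classical allelic test that this abstract takes as its starting point can be sketched as a Pearson chi-square test on the 2×2 table of allele counts (case/control by allele A/a). The sketch below shows only that classical test, which is valid under HWE; it does not implement the generalized robust test proposed in the paper, and the function names and genotype counts are hypothetical.

```python
import math


def allele_counts(n_AA, n_Aa, n_aa):
    """Convert genotype counts into counts of alleles A and a."""
    return 2 * n_AA + n_Aa, 2 * n_aa + n_Aa


def allelic_chi_square(case_genotypes, control_genotypes):
    """Classical allelic test: Pearson chi-square (1 df) on allele counts.

    Treats the two alleles of each individual as independent,
    which is why the test relies on Hardy-Weinberg equilibrium.
    """
    a, b = allele_counts(*case_genotypes)      # case alleles A, a
    c, d = allele_counts(*control_genotypes)   # control alleles A, a
    total = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the allele table is empty")
    chi2 = total * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical genotype counts (AA, Aa, aa), for illustration only.
chi2, p_value = allelic_chi_square((40, 40, 20), (25, 50, 25))
```

Because the table counts alleles rather than individuals, the effective sample size is doubled, and that doubling is only justified when the two alleles within a person are independent. This is the HWE sensitivity the abstract refers to.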

Bayesian model averaging for the X-chromosome inactivation dilemma in genetic association study

by Bo Chen, Radu V. Craiu, and Lei Sun

Biostatistics | 2020 | 21(2) 319-335

Short Summary: The X-chromosome is often excluded from so-called “whole-genome” association studies due to the differences it exhibits between males and females. One particular analytical challenge is the unknown status of X-inactivation, where one of the two X-chromosome variants in females may be randomly selected to be silenced. In the absence of biological evidence in favor of one specific model, we consider a Bayesian model averaging framework that offers a principled way to account for the inherent model uncertainty, providing model averaging-based posterior density intervals and Bayes factors. We examine the inferential properties of the proposed methods via extensive simulation studies, and we apply the methods to a genetic association study of an intestinal disease occurring in about 20% of cystic fibrosis patients. Compared with the results previously reported assuming the presence of inactivation, we show that the proposed Bayesian methods provide more feature-rich quantities that are useful in practice.

Cystic fibrosis–related diabetes onset can be predicted using biomarkers measured at birth

by Yu-Chung Lin, Katherine Keenan, Jiafen Gong, Naim Panjwani, Julie Avolio, Fan Lin, Damien Adam, Paula Barrett, Stéphanie Bégin, Yves Berthiaume, Lara Bilodeau, Candice Bjornson, Janna Brusky, Caroline Burgess, Mark Chilvers, Raquel Consunji-Araneta, Guillaume Côté-Maurais, Andrea Dale, Christine Donnelly, Lori Fairservice, Katie Griffin, Natalie Henderson, Angela Hillaby, Daniel Hughes, Shaikh Iqbal, Jennifer Itterman, Mary Jackson, Emma Karlsen, Lorna Kosteniuk, Lynda Lazosky, Winnie Leung, Valerie Levesque, Émilie Maille, Dimas Mateos-Corral, Vanessa McMahon, Mays Merjaneh, Nancy Morrison, Michael Parkins, Jennifer Pike, April Price, Bradley S. Quon, Joe Reisman, Clare Smith, Mary Jane Smith, Nathalie Vadeboncoeur, Danny Veniott, Terry Viczko, Pearce Wilcox, Richard van Wylick, Garry Cutting, Elizabeth Tullis, Felix Ratjen, Johanna M. Rommens, Lei Sun, Melinda Solomon, Anne L. Stephenson, Emmanuelle Brochiero, Scott Blackman, Harriet Corvol & Lisa J. Strug

Genetics in Medicine | 2021 | Volume: 23, 927-933

Short Summary: Cystic fibrosis (CF), caused by pathogenic variants in the CF transmembrane conductance regulator (CFTR), affects multiple organs including the exocrine pancreas, which is a causal contributor to cystic fibrosis–related diabetes (CFRD). Untreated CFRD causes increased CF-related mortality whereas early detection can improve outcomes.
Using genetic and easily accessible clinical measures available at birth, we constructed a CFRD prediction model using the Canadian CF Gene Modifier Study (CGS; n = 1,958) and validated it in the French CF Gene Modifier Study (FGMS; n = 1,003). We investigated genetic variants shown to associate with CF disease severity across multiple organs in genome-wide association studies.
The strongest predictors included sex, CFTR severity score, and several genetic variants including one annotated to PRSS1, which encodes cationic trypsinogen. The final model defined in the CGS shows excellent agreement when validated on the FGMS, and the risk classifier shows slightly better performance at predicting CFRD risk later in life in both studies.
We demonstrated clinical utility by comparing CFRD prevalence rates between the top 10% of individuals with the highest risk and the bottom 10% with the lowest risk. A web-based application was developed to provide practitioners with patient-specific CFRD risk to guide CFRD monitoring and treatment.

LocusFocus: Web-based colocalization for the annotation and functional follow-up of GWAS

by Naim Panjwani, Fan Wang, Scott Mastromatteo, Allen Bao, Cheng Wang, Gengming He, Jiafen Gong, Johanna M. Rommens, Lei Sun, and Lisa J. Strug

PLOS Computational Biology | 2020 | 16(10):e1008336

Short Summary: Genome-wide association studies (GWAS) have primarily identified trait-associated loci in the non-coding genome. Colocalization analyses of SNP associations from GWAS with expression quantitative trait loci (eQTL) evidence enable the generation of hypotheses about responsible mechanism, genes and tissues of origin to guide functional characterization. Here, we present a web-based colocalization browsing and testing tool named LocusFocus. LocusFocus formally tests colocalization using our established Simple Sum method to identify the most relevant genes and tissues for a particular GWAS locus in the presence of high linkage disequilibrium and/or allelic heterogeneity. We demonstrate the utility of LocusFocus, following up on a genome-wide significant locus from a GWAS of meconium ileus (an intestinal obstruction in cystic fibrosis). Using LocusFocus for colocalization analysis with eQTL data suggests variation in ATP12A gene expression in the pancreas rather than intestine is responsible for the GWAS locus. LocusFocus has no operating system dependencies and may be installed in a local web server. LocusFocus is available under the MIT license, with full documentation and source code accessible on GitHub.

On set-based association tests: Insights from a regression using summary statistics

by Yanyan Zhao, and Lei Sun

Canadian Journal of Statistics | 2021 (Online)

Short Summary: Motivated by, but not limited to, association analyses of multiple genetic variants, we propose here a summary statistics-based regression framework. The proposed method requires only variant-specific summary statistics, and it unifies earlier methods based on individual-level data as special cases. The resulting score test statistic, derived from a linear mixed-effect regression model, inherently transforms the variant-specific statistics using the precision matrix to improve power for detecting sparse alternatives. Furthermore, the proposed method can incorporate additional variant-specific information with ease, facilitating omic-data integration. We study the asymptotic properties of the proposed tests under the null and alternatives, and we investigate efficient P-value calculation in finite samples. Finally, we provide supporting empirical evidence from extensive simulation studies and two applications.

Statistical power in COVID-19 case-control host genomic study design

by Yu-Chung Lin, Jennifer D. Brooks, Shelley B. Bull, France Gagnon, Celia M. T. Greenwood, Rayjean J. Hung, Jerald Lawless, Andrew D. Paterson, Lei Sun, and Lisa J. Strug

Genome Medicine | 2020 | Volume 12, Article 115

Short Summary: The identification of genetic variation that directly impacts infection susceptibility to SARS-CoV-2 and disease severity of COVID-19 is an important step towards risk stratification, personalized treatment plans, therapeutic, and vaccine development and deployment. Given the importance of study design in infectious disease genetic epidemiology, we use simulation and draw on current estimates of exposure, infectivity, and test accuracy of COVID-19 to demonstrate the feasibility of detecting host genetic factors associated with susceptibility and severity in published COVID-19 study designs. We demonstrate that limited phenotypic data and exposure/infection information in the early stages of the pandemic significantly impact the ability to detect most genetic variants with moderate effect sizes, especially when studying susceptibility to SARS-CoV-2 infection. Our insights can aid in the interpretation of genetic findings emerging in the literature and guide the design of future host genetic studies.

Terminal Ranking Games

by Erhan Bayraktar, and Yuchong Zhang

Mathematics of Operations Research | 2021 | Ahead of Print

Short Summary: We analyze a mean field tournament: a mean field game in which the agents receive rewards according to the ranking of the terminal value of their projects and are subject to cost of effort. Using Schrödinger bridges we are able to explicitly calculate the equilibrium. This allows us to identify the reward functions which would yield a desired equilibrium and solve several related mechanism design problems. We are also able to identify the effect of reward inequality on the players’ welfare as well as calculate the price of anarchy.

2018-19

A generalized Levene's scale test for variance heterogeneity in the presence of sample correlation and group uncertainty

by David Soave and Lei Sun

Biometrics | 2017 | 73(3):960-971

Short Summary: Why were we interested in generalizing Levene's test? It can be used to indirectly detect Gene-Environment interactions when E is missing!

A Mean Field Competition

by Marcel Nutz and Yuchong Zhang

Mathematics of Operations Research | 2019 (to appear)

Short Summary: Introduces a solvable Poissonian mean field game where agents are rewarded based on the ranking of their goal-reaching times, and studies the principal-agent problem of designing an optimal reward scheme.

Increasing the Transparency of Research Papers with Explorable Multiverse Analyses

by Dragicevic P., Jansen Y., Sarma A., Kay M., and Chevalier F.

Proceedings of the SIGCHI Conference on Human Factors (CHI '19) | 2019 (to appear)

Short Summary: Introduces explorable multiverse analysis reports, a new approach to statistical reporting where readers of research papers can explore alternative analysis options by interacting with the paper itself. This approach combines multiverse analysis, a philosophy of statistical reporting where paper authors report the outcomes of many different statistical analyses in order to show how fragile or robust their findings are; and explorable explanations, narratives that can be read as normal explanations but where the reader can also become active by dynamically changing some elements of the explanation.

Provenance Network Analytics: An approach to data analytics using data provenance

by Huynh TD, Ebden M, Fischer J, Roberts S, and Moreau L

Data Mining and Knowledge Discovery | 2018 | 32(3): 708-35

Short Summary: Provenance is a description of what influenced the generation of a piece of data.

The Multiple Forces Behind Chinese Students' Self-segregation and How We May Counter Them

Transformative Dialogues: Teaching and Learning Journal | 2018

Short Summary: Multiple frameworks to understand the self-segregation phenomenon among Chinese international students and how we may build a more inclusive learning community

When Finance Meets Stories

Journal for Research and Practice in College Teaching | 2019 (to appear)

Short Summary: How narratives can be used to develop meaningful concept maps and a sense of purpose in learning for students studying finance.