The ever-increasing scope and scale of data collection has shifted the focus of data analysis away from testing pre-specified hypotheses and towards hypothesis generation. Researchers are often interested in performing exploratory data analysis on a data set to generate hypotheses, and then validating those hypotheses on the same data via tests of significance. Unfortunately, this type of "double-dipping" can lead to extremely inflated type I error rates.
In this talk, I will consider double-dipping on trees. First, I will focus on trees generated by hierarchical clustering, and consider testing for differences between clusters obtained by cutting the tree. I will propose a selective inference approach to test for a difference in means between two clusters that properly accounts for the fact that the choice of null hypothesis was made based on the data. Second, I will consider trees generated using the CART algorithm, and will again use a selective inference approach to conduct inference on the means of the terminal nodes. Applications include single-cell RNA-sequencing data and the Box Lunch Study.
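The type I error inflation from double-dipping on a hierarchical clustering tree can be seen in a small simulation. The sketch below is an illustration only, not the selective inference method from the talk. It assumes one-dimensional Gaussian data with known unit variance and no true clusters, Ward linkage, and a naive two-sided z-test for a difference in cluster means. The function name `naive_rejection_rate` and all parameter values are hypothetical choices made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import norm


def naive_rejection_rate(n_sims=500, n=50, alpha=0.05, seed=0):
    """Estimate how often a naive z-test rejects when no clusters exist.

    Assumptions (for illustration only): data are i.i.d. N(0, 1), so the
    null hypothesis of equal cluster means holds in every simulated data set.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        # One homogeneous population: there are no true clusters.
        x = rng.normal(size=(n, 1))

        # Step 1 (exploration): cut a hierarchical clustering tree into 2 clusters.
        tree = linkage(x, method="ward")
        labels = fcluster(tree, t=2, criterion="maxclust")
        a = x[labels == 1, 0]
        b = x[labels == 2, 0]

        # Step 2 (double-dipping): test for a difference in means on the
        # same data, ignoring that the clusters were chosen using that data.
        z = (a.mean() - b.mean()) / np.sqrt(1.0 / len(a) + 1.0 / len(b))
        p_value = 2.0 * norm.sf(abs(z))
        rejections += p_value < alpha
    return rejections / n_sims


if __name__ == "__main__":
    rate = naive_rejection_rate()
    print(f"Naive rejection rate at alpha = 0.05: {rate:.3f}")
```

A valid test would reject in roughly 5% of these simulations. Because the clusters are constructed to be well separated, the naive test rejects far more often than that, which is the problem a selective inference approach is designed to correct.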
This is joint work with Jacob Bien (University of Southern California), Daniela Witten (University of Washington), and Anna Neufeld (University of Washington).
Please join the event.
About Lucy Gao
Lucy is an Assistant Professor in the Department of Statistics and Actuarial Science at the University of Waterloo. She received her PhD in Biostatistics from the University of Washington. Her research interests are in statistical learning, selective inference, and experiment design.