Roger Peng: Building Data Analysis Proofs
Description
Data analyses and data analysis plans are often constructed in an imperative manner, where commands representing actions taken on the data are issued in a sequential order, often dictated by the specific structure of the data. The complexity of these commands can vary greatly, from the calculation of simple summary statistics to the fitting of sophisticated statistical models. Although such a construction may be natural for most data analysts, the outputs that are produced, namely the code and the results of running the code, can hide important details about the analyst's premises, expectations, and assumptions about the data. This analysis reasoning, often omitted from the code or comments, can be critical to evaluating the quality of an analysis. We argue that a different kind of construction, a logical construction, offers more useful information for evaluating the quality of a data analysis and for statically illustrating an analyst's reasoning. A formal representation that details the logical construction of a data analysis has the potential for externalizing the thought process of a data analysis for independent examination. In this paper we describe the logical construction of data analysis operations and how it might be applied to some common data analysis tasks.
BIO: Roger D. Peng is a Professor of Statistics and Data Sciences at the University of Texas at Austin. Previously, he was Professor of Biostatistics at the Johns Hopkins Bloomberg School of Public Health and the Co-Director of the Johns Hopkins Data Science Lab. He is the author of the popular book R Programming for Data Science and 10 other books on data science and statistics. Roger is a Fellow of the American Statistical Association and is the recipient of the Mortimer Spiegelman Award from the American Public Health Association, which honors a statistician who has made outstanding contributions to public health. Roger received a PhD in Statistics from the University of California, Los Angeles. His current research focuses on building analytic design theory for improving the quality of data analyses and on the development of statistical methods for addressing environmental health problems.