Bayesian Canonicalization of Voter Registration Files

When and Where

Thursday, November 05, 2020 3:30 pm to 4:30 pm
Zoom, Passcode: 224849


Andee Kaplan, Colorado State University


Entity resolution (record linkage or de-duplication) is the process of merging noisy databases to remove duplicate entities in the absence of a unique identifier. One major challenge of using linked data is identifying the canonical (or representative) records, free of duplicate information, to pass to a downstream inferential task. The canonicalization step is particularly crucial after entity resolution, as a multi-stage approach allows for multiple analyses to be performed on the same linked data. While this approach can be scalable, the uncertainty from each stage of the entity resolution process is not naturally propagated throughout the pipeline and into the downstream task. In this talk, we present five fully unsupervised methods to choose canonical records from linked data, including a fully Bayesian approach which propagates the error from linkage through to the downstream inference. This multi-stage approach is illustrated and evaluated on simulated entity resolution data sets as well as voter registration data available from the North Carolina State Board of Elections (NCSBE). The NCSBE has regularly released snapshots of its voter registration databases since 2005, providing a changing view of the voter registration information over time as new voters register, voters are dropped from the register, and voter information is updated. We compare the proposed canonicalization methods after performing entity resolution on five snapshots and examine the relationship between demographic information and party affiliation on the resulting canonical data sets.

Please join the event.

About Andee Kaplan

Andee Kaplan is a statistician working at the intersection of statistics and computing. She completed her Ph.D. in Statistics in 2017 at Iowa State University, working with Daniel Nordman and Stephen Vardeman. She also holds an M.S. in Statistics from Iowa State University and an M.A. and B.S. in Mathematics from The University of Texas at Austin. After two years as a postdoctoral associate with Rebecca Steorts in the Department of Statistical Science at Duke University, Dr. Kaplan joined the CSU Statistics faculty in August 2019. Her research interests lie in the application of statistical methodology to solve large social science problems, particularly those with complex dependence. Dr. Kaplan likes struggling with JavaScript and learning new languages, R being her first love.