Since most social science research relies
upon multiple data sources, merging data sets is an essential part
of researchers' workflow. Unfortunately, a unique identifier that
unambiguously links records is often unavailable, and data may
contain missing and inaccurate information. These problems are
especially severe when merging large-scale administrative records.
We develop a fast and scalable algorithm to implement a canonical
probabilistic model of record linkage that has many advantages over
deterministic methods frequently used by social scientists. The
proposed methodology efficiently handles millions of observations
while accounting for missing data and measurement error,
incorporating auxiliary information, and adjusting for uncertainty
about merging in post-merge analyses. We conduct comprehensive
simulation studies to evaluate the performance of our algorithm in
realistic scenarios. We also apply our methodology to merging
campaign contribution records, survey data, and nationwide voter
files. Open-source
software implementing the proposed methodology is
available.