Rosenman, Evan T.R., Santiago Olivella, and Kosuke Imai. (2023). ``Race and ethnicity data for first, middle, and last names.'' Scientific Data, Vol. 10, No. 299, pp. 1-11.

 

  Abstract

We provide the largest compiled publicly available dictionaries of first, middle, and surnames for the purpose of imputing race and ethnicity using, for example, Bayesian Improved Surname Geocoding (BISG). The dictionaries are based on the voter files of six U.S. Southern States that collect self-reported racial data upon voter registration. Our data cover the racial make-up of a larger set of names than any comparable dataset, containing 136 thousand first names, 125 thousand middle names, and 338 thousand surnames. Individuals are categorized into five mutually exclusive racial and ethnic groups --- White, Black, Hispanic, Asian, and Other --- and racial/ethnic probabilities by name are provided for every name in each dictionary. We provide both probabilities of the form P(race | name) and P(name | race), and conditions under which they can be assumed to be representative of a given target population. These conditional probabilities can then be deployed for imputation in a data analytic task for which self-reported racial and ethnic data are not available.
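To make the role of the two probability forms concrete, here is a minimal sketch of how P(race | surname) and P(name | race) might be combined in a BISG-style calculation. It is an illustration only and not the authors' implementation or the wru API: the function name `impute_race`, the dictionary layout, and all numeric values are hypothetical, and the calculation assumes that first name and geography are each conditionally independent of the surname given race.

```python
# Sketch of a BISG-style posterior under a conditional-independence assumption:
#   P(race | surname, first, geo)  is proportional to
#   P(race | surname) * P(first | race) * P(geo | race)
# All names and numbers below are hypothetical, chosen only for illustration.

RACES = ("white", "black", "hispanic", "asian", "other")


def impute_race(p_race_given_surname, p_first_given_race, p_geo_given_race):
    """Return normalized posterior probabilities over RACES."""
    unnormalized = {
        r: p_race_given_surname[r] * p_first_given_race[r] * p_geo_given_race[r]
        for r in RACES
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("every group has zero probability; cannot normalize")
    return {r: value / total for r, value in unnormalized.items()}


# Hypothetical dictionary entries for one individual.
p_race_given_surname = {"white": 0.60, "black": 0.30, "hispanic": 0.05,
                        "asian": 0.03, "other": 0.02}
p_first_given_race = {"white": 0.0020, "black": 0.0010, "hispanic": 0.0005,
                      "asian": 0.0005, "other": 0.0010}
p_geo_given_race = {"white": 0.0001, "black": 0.0004, "hispanic": 0.0002,
                    "asian": 0.0001, "other": 0.0001}

posterior = impute_race(p_race_given_surname, p_first_given_race, p_geo_given_race)
for race in RACES:
    print(f"{race}: {posterior[race]:.3f}")
```

In this sketch the surname dictionary supplies the prior in P(race | name) form, while the first-name dictionary enters as a likelihood in P(name | race) form, which is why both forms are useful to have.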

  Software and Related Papers

Khanna, Kabir, Brandon Bertelsen, Santiago Olivella, Evan Rosenman, and Kosuke Imai. ``wru: Who Are You? Bayesian Prediction of Racial Category Using Surname and Geolocation.'' Available through The Comprehensive R Archive Network and GitHub.
Imai, Kosuke and Kabir Khanna. (2016). ``Improving Ecological Inference by Predicting Individual Ethnicity from Voter Registration Record.'' Political Analysis, Vol. 24, No. 2 (Spring), pp. 263-272.
Imai, Kosuke, Evan T.R. Rosenman, and Santiago Olivella. (2022). ``Addressing Census data problems in race imputation via fully Bayesian Improved Surname Geocoding and name supplements.'' Science Advances, Vol. 8, No. 49, pp. 1-10.
Rosenman, Evan T.R., Santiago Olivella, and Kosuke Imai. (2023). ``Race and ethnicity data for first, middle, and last names.'' Scientific Data, Vol. 10, No. 299, pp. 1-11.
McCartan, Cory, Robin Fisher, Jacob Goldin, Daniel E. Ho, and Kosuke Imai. ``Estimating Racial Disparities When Race is Not Observed.'' Journal of the American Statistical Association, Forthcoming.

© Kosuke Imai
 Last modified: Mon Jun 23 21:52:11 BST 2025