|
|
We provide the largest publicly available compiled
dictionaries of first names, middle names, and surnames for imputing
race and ethnicity with methods such as Bayesian Improved
Surname Geocoding (BISG). The dictionaries are based on the voter
files of six Southern U.S. states that collect self-reported racial
data upon voter registration. Our data cover the racial make-up of a
larger set of names than any comparable dataset, containing 136
thousand first names, 125 thousand middle names, and 338 thousand
surnames. Individuals are categorized into five mutually exclusive
racial and ethnic groups --- White, Black, Hispanic, Asian, and Other
--- and racial/ethnic probabilities by name are provided for every
name in each dictionary. We provide probabilities of both forms,
P(race | name) and P(name | race), along with the conditions under which
they can be assumed to be representative of a given target population.
These conditional probabilities can then be used for imputation in any
data analysis task for which self-reported racial and ethnic data are
not available. |
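As a rough illustration of how the two kinds of probabilities can be combined, the sketch below applies Bayes' rule in the BISG style: a surname-based P(race | surname) is multiplied by P(first name | race) and P(location | race), then normalized. It assumes that first name, surname, and location are conditionally independent given race. The function name `bisg_posterior` and all numbers are hypothetical and are not taken from the dictionaries themselves.

```python
# Minimal sketch of a BISG-style update (illustrative only).
# Assumption: first name, surname, and location are conditionally
# independent given race, so that
#   P(race | surname, first, geo)
#     is proportional to P(race | surname) * P(first | race) * P(geo | race).

RACES = ("white", "black", "hispanic", "asian", "other")


def bisg_posterior(p_race_given_surname, p_first_given_race, p_geo_given_race):
    """Combine the three inputs and normalize over the five groups."""
    unnormalized = {
        r: p_race_given_surname[r] * p_first_given_race[r] * p_geo_given_race[r]
        for r in RACES
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("all combined probabilities are zero")
    return {r: value / total for r, value in unnormalized.items()}


# Hypothetical inputs, made up for this example.
p_race_given_surname = {
    "white": 0.60, "black": 0.30, "hispanic": 0.05, "asian": 0.02, "other": 0.03,
}
p_first_given_race = {
    "white": 0.001, "black": 0.004, "hispanic": 0.0005, "asian": 0.0002, "other": 0.001,
}
p_geo_given_race = {
    "white": 0.0001, "black": 0.0003, "hispanic": 0.0001, "asian": 0.0001, "other": 0.0001,
}

posterior = bisg_posterior(p_race_given_surname, p_first_given_race, p_geo_given_race)
most_likely = max(posterior, key=posterior.get)
```

In this form the surname enters as P(race | name) while the first name enters as P(name | race), which is one reason both kinds of probability are useful in practice.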
Khanna, Kabir, Brandon Bertelsen, Santiago
Olivella, Evan Rosenman, and Kosuke Imai. "wru: Who Are You?
Bayesian Prediction of Racial Category Using Surname and
Geolocation." Available through The Comprehensive R
Archive Network and GitHub. |
Imai, Kosuke, and Kabir
Khanna. (2016). "Improving
Ecological Inference by Predicting Individual Ethnicity from Voter
Registration Records." Political Analysis,
Vol. 24, No. 2 (Spring), pp. 263-272.
|
Imai, Kosuke, Evan T.R. Rosenman, and
Santiago Olivella. (2022). "Addressing Census data problems in
race imputation via fully Bayesian Improved Surname Geocoding and
name supplements." Science Advances,
Vol. 8, No. 49, pp. 1-10. |
Rosenman, Evan T.R., Santiago Olivella,
and Kosuke Imai. (2023). "Race and ethnicity
data for first, middle, and last names."
Scientific Data, Vol. 10, No. 299, pp. 1-11. |
McCartan, Cory, Robin Fisher, Jacob
Goldin, Daniel E. Ho, and Kosuke Imai. "Estimating Racial Disparities When Race is Not
Observed." Journal of the American Statistical
Association, Forthcoming. |