Dasanaike, Noah and Kosuke Imai. (2026). "Using Embedding Models to Improve Probabilistic Race Prediction."

Abstract

The estimation of racial disparities is often hampered by the lack of individual-level race data. Researchers have frequently adopted Bayesian Improved Surname Geocoding (BISG), which combines Census surname data with geolocation to predict race. Unfortunately, approximately 10% of the U.S. population has surnames absent from Census records, limiting BISG’s coverage. We propose embedding-powered BISG (eBISG), which leverages pre-trained text embeddings and neural networks trained on 2020 Census data to estimate race probabilities for names absent from Census records. We evaluate five methods, ranging from standard surname-only approaches to full-name embeddings trained on voter file data, and demonstrate that successive eBISG approaches improve predictions, with the most comprehensive method yielding the largest gains for Hispanic and Asian voters.
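
As an illustration of the BISG step described above, the following minimal Python sketch combines a surname-based race prior with a geographic likelihood through Bayes' rule. It assumes the usual BISG simplification that surname and location are conditionally independent given race. The function name `bisg_posterior`, the category labels, and all numbers are hypothetical and chosen only for illustration; they are not Census figures and this is not the wru implementation. Under eBISG, the surname prior for a name absent from Census records would instead come from a trained model, which is not shown here.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine a surname prior with a geographic likelihood via Bayes' rule.

    Both arguments are dicts keyed by race category (hypothetical labels).
    Returns P(race | surname, geography), normalized to sum to 1.
    """
    # Unnormalized posterior: P(race | surname) * P(geography | race)
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("all joint probabilities are zero")
    return {race: value / total for race, value in unnormalized.items()}


# Illustrative numbers only; these are NOT Census figures.
surname_prior = {"white": 0.60, "black": 0.20, "hispanic": 0.15, "asian": 0.05}
geo_likelihood = {"white": 0.001, "black": 0.004, "hispanic": 0.002, "asian": 0.001}

posterior = bisg_posterior(surname_prior, geo_likelihood)
for race, prob in posterior.items():
    print(f"{race}: {prob:.3f}")
```

In this toy case the geographic information shifts the most probable category away from the one favored by the surname prior alone, which is the kind of adjustment the geolocation component is meant to provide.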

Software

wru: Who Are You? Bayesian Predictions of Racial Category Using Surname and Geolocation (available on CRAN and GitHub)