Dasanaike, Noah and Kosuke Imai. (2026). “Using Embedding Models to Improve Probabilistic Race Prediction.”
The estimation of racial disparities is often hampered by the lack of individual-level race data. Researchers have frequently adopted Bayesian Improved Surname Geocoding (BISG), which combines Census surname data with geolocation to predict race. Unfortunately, approximately 10% of the U.S. population has surnames absent from Census records, limiting BISG’s coverage. We propose embedding-powered BISG (eBISG), which leverages pre-trained text embeddings and neural networks trained on 2020 Census data to estimate race probabilities for names absent from Census records. We evaluate five methods, ranging from standard surname-only approaches to full-name embeddings trained on voter file data, and demonstrate that successive eBISG approaches improve predictions, with the most comprehensive method yielding the largest gains for Hispanic and Asian voters.
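
To make the BISG step mentioned in the abstract concrete, the following is a minimal sketch of the standard Bayes update that combines a surname-based prior with a geographic likelihood. It is not the authors' implementation. The function name `bisg_posterior`, the group labels, and all numbers are hypothetical and chosen only for illustration; real inputs would come from Census surname and geography tables.

```python
from typing import Dict


def bisg_posterior(
    p_race_given_surname: Dict[str, float],
    p_geo_given_race: Dict[str, float],
) -> Dict[str, float]:
    """Combine a surname prior with a geographic likelihood via Bayes' rule.

    Computes P(race | surname, geo) proportional to
    P(race | surname) * P(geo | race), assuming surname and location are
    conditionally independent given race (the usual BISG assumption).
    """
    # Multiply prior and likelihood for each group.
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("All joint probabilities are zero; cannot normalize.")
    # Normalize so the posterior sums to one.
    return {race: value / total for race, value in unnormalized.items()}


# Illustrative (made-up) inputs for two hypothetical groups.
surname_prior = {"group_a": 0.5, "group_b": 0.5}
geo_likelihood = {"group_a": 0.2, "group_b": 0.6}
posterior = bisg_posterior(surname_prior, geo_likelihood)
print(posterior)
```

In this sketch the surname prior is simply passed in. For a surname absent from Census records there is no such prior to look up, which is the gap that the abstract says eBISG fills with probabilities estimated from text embeddings.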