Race/Ethnicity Imputation

Methodology: read first

Imputes self-identified race/ethnicity from name (and optionally address, party, sex) using established academic methods. All outputs are probabilities across seven categories: white, black, hispanic, asian, aian, multiple, other.

Data sources

Methods

1. Census Surname

Direct lookup; surname only.

2. BISG — Bayesian Improved Surname Geocoding

P(race | surname, geo) ∝ P(surname | race) · P(race | geo)

Elliott et al. 2009; Imai & Khanna 2016. Methodologically equivalent to the R package wru.
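The BISG update can be sketched in a few lines of Python. This is a minimal illustration of the proportionality above, not the tool's actual code: the function names and all input values are hypothetical, and real inputs would come from the Census surname file and the geographic composition tables.

```python
RACES = ["white", "black", "hispanic", "asian", "aian", "multiple", "other"]


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("no probability mass to normalize")
    return {race: w / total for race, w in weights.items()}


def bisg_posterior(p_surname_given_race, p_race_given_geo):
    """P(race | surname, geo) ∝ P(surname | race) · P(race | geo)."""
    unnormalized = {
        race: p_surname_given_race[race] * p_race_given_geo[race]
        for race in RACES
    }
    return normalize(unnormalized)


# Hypothetical inputs for illustration only (not real Census or ACS values).
p_surname_given_race = {
    "white": 0.0002, "black": 0.0006, "hispanic": 0.00001, "asian": 0.00001,
    "aian": 0.0001, "multiple": 0.0003, "other": 0.0001,
}
p_race_given_geo = {
    "white": 0.70, "black": 0.10, "hispanic": 0.10, "asian": 0.05,
    "aian": 0.01, "multiple": 0.03, "other": 0.01,
}

posterior = bisg_posterior(p_surname_given_race, p_race_given_geo)
```

In this toy case the surname is three times more likely among Black residents, yet the geographic prior is strong enough that white still ends up with the larger posterior share.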

3. BIFSG — adds first-name term

P(race | first, surname, geo) ∝ P(surname | race) · P(first | race) · P(race | geo)

Voicu 2018. Meaningful lift for distinguishing Black vs. white where surnames overlap.

4. BISGP — adds party-registration term

P(r | s, g, party) ∝ P(s | r) · P(party | r) · P(r | g)    (r = race, s = surname, g = geo)

Imai & Khanna 2016 also describe this extension. P(party | race) uses validated voter-study estimates (Pew, Catalist, ANES). The party term adds bias risk: it conflates partisanship with ethnicity and is sensitive to drift in voting patterns. Off by default; only available when a party column is mapped.

5. BIFSGP — first name + party

P(r | f, s, g, party) ∝ P(s | r) · P(f | r) · P(party | r) · P(r | g)    (r = race, f = first name, s = surname, g = geo)
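Methods 2 through 5 share one structure: a geographic prior multiplied by whichever conditional-likelihood terms are available. The sketch below shows that pattern under the same conditional-independence assumption as the formulas; the function name and all numbers are hypothetical, and it is not taken from the tool's source.

```python
from math import prod

RACES = ["white", "black", "hispanic", "asian", "aian", "multiple", "other"]


def family_posterior(p_race_given_geo, likelihood_terms):
    """Multiply P(race | geo) by each supplied P(x | race) term, then normalize.

    likelihood_terms is a list of dicts, e.g. [P(surname | race)] for BISG,
    with P(first | race) and/or P(party | race) appended for the variants.
    """
    unnormalized = {
        race: p_race_given_geo[race] * prod(term[race] for term in likelihood_terms)
        for race in RACES
    }
    total = sum(unnormalized.values())
    if total <= 0:
        raise ValueError("no probability mass to normalize")
    return {race: w / total for race, w in unnormalized.items()}


# Hypothetical values: a flat geographic prior and a surname shared equally
# by white and Black populations, so the surname alone cannot separate them.
flat_prior = dict.fromkeys(RACES, 1 / len(RACES))
p_surname = dict.fromkeys(RACES, 0.04) | {"white": 0.4, "black": 0.4}
p_party = dict.fromkeys(RACES, 0.5) | {"white": 0.4, "black": 0.8}

bisg = family_posterior(flat_prior, [p_surname])
bisgp = family_posterior(flat_prior, [p_surname, p_party])
```

Here `bisg` assigns white and Black equal probability, while `bisgp` shifts mass toward Black because the illustrative party term is twice as likely for that group. That shift is also where the bias risk described for the party term enters.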

6. ethnicolr

LSTM on full name; trained on the Florida voter registration file. No geography needed.

Sood & Laohaprapanon 2018 (arXiv:1805.02109). Native output covers four categories (NH White, NH Black, Hispanic, Asian); aian/multiple/other are reported as 0.
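Padding the four native categories out to the seven reported ones might look like the sketch below. The input key names (`nh_white`, `nh_black`, `hispanic`, `asian`) are an assumption about how the model's scores are labeled, and the function is hypothetical; it does not call ethnicolr itself.

```python
RACES = ["white", "black", "hispanic", "asian", "aian", "multiple", "other"]

# Assumed mapping from native category labels to the seven output categories.
NATIVE_TO_OUTPUT = {
    "nh_white": "white",
    "nh_black": "black",
    "hispanic": "hispanic",
    "asian": "asian",
}


def to_seven_categories(native_scores):
    """Copy the four native probabilities; leave aian/multiple/other at 0."""
    output = dict.fromkeys(RACES, 0.0)
    for native_label, race in NATIVE_TO_OUTPUT.items():
        output[race] = native_scores[native_label]
    return output


# Hypothetical scores for one name.
scores = {"nh_white": 0.15, "nh_black": 0.05, "hispanic": 0.75, "asian": 0.05}
seven = to_seven_categories(scores)
```

Because the three unmodeled categories are structurally zero rather than estimated, a zero here should be read as "not modeled", not as evidence against membership.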

7. Ensemble

Per-race weighted average of whichever methods you ran. Weights reflect published per-group accuracy.
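A per-race weighted average can be sketched as follows. The weights and predictions are hypothetical, only three categories are shown for brevity, and the final renormalization step is an assumption: per-race weights generally leave the blended values summing to something other than 1, and the source does not say how that is handled.

```python
def ensemble(predictions, weights):
    """Blend method outputs race by race.

    predictions: {method: {race: probability}}
    weights:     {method: {race: weight}}, e.g. derived from per-group accuracy
    """
    races = list(next(iter(predictions.values())))
    blended = {}
    for race in races:
        weight_sum = sum(weights[m][race] for m in predictions)
        if weight_sum == 0:
            blended[race] = 0.0  # no method is trusted for this category
            continue
        weighted = sum(weights[m][race] * predictions[m][race] for m in predictions)
        blended[race] = weighted / weight_sum
    # Assumed step: renormalize so the result is a probability distribution.
    total = sum(blended.values())
    return {race: p / total for race, p in blended.items()}


# Hypothetical inputs (three categories only, for brevity).
predictions = {
    "bifsg": {"white": 0.5, "black": 0.4, "hispanic": 0.1},
    "ethnicolr": {"white": 0.3, "black": 0.6, "hispanic": 0.1},
}
weights = {
    "bifsg": {"white": 2.0, "black": 1.0, "hispanic": 1.0},
    "ethnicolr": {"white": 1.0, "black": 1.0, "hispanic": 1.0},
}

combined = ensemble(predictions, weights)
```

Setting a method's weight to 0 for a category (for example ethnicolr on aian, multiple, and other) keeps its structural zeros from dragging down the blended estimate.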

Optional adjustments

Phonetic (Metaphone) fallback

When a surname doesn't appear in the Census file, the loader tries matching by Metaphone code (better than Soundex for non-English names) against a count-weighted index of phonetically equivalent Census surnames. Recovers a few percent of records that would otherwise return no surname signal. On by default; can be disabled.
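The fallback logic can be illustrated with a pluggable encoder. In the sketch below, `toy_encode` is a crude stand-in that is NOT Metaphone (it only drops vowels after the first letter); a real Metaphone implementation would be passed in its place. The table layout, function names, and counts are hypothetical.

```python
from collections import defaultdict


def toy_encode(name):
    """Stand-in phonetic key (not Metaphone): keep first letter, drop later vowels."""
    name = name.upper()
    return name[:1] + "".join(ch for ch in name[1:] if ch not in "AEIOUY")


def build_phonetic_index(surname_table, encode):
    """Count-weighted average of race probabilities per phonetic code.

    surname_table: {SURNAME: (count, {race: probability})}
    """
    sums = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    for surname, (count, probs) in surname_table.items():
        code = encode(surname)
        totals[code] += count
        for race, p in probs.items():
            sums[code][race] += count * p
    return {
        code: {race: s / totals[code] for race, s in by_race.items()}
        for code, by_race in sums.items()
    }


def lookup(surname, surname_table, phonetic_index, encode, use_fallback=True):
    """Exact match first; otherwise try the phonetic index if enabled."""
    key = surname.upper()
    if key in surname_table:
        return surname_table[key][1]
    if use_fallback:
        return phonetic_index.get(encode(key))
    return None


# Hypothetical two-row surname table (two categories only, for brevity).
table = {
    "SMITH": (900, {"white": 0.7, "black": 0.3}),
    "SMYTH": (100, {"white": 0.9, "black": 0.1}),
}
index = build_phonetic_index(table, toy_encode)
result = lookup("Smithe", table, index, toy_encode)
```

"Smithe" is absent from the table but shares a code with both rows, so it inherits their count-weighted mix, which is dominated by the more common spelling.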

Decennial 2020 block-level geo

Replaces ACS block-group composition with Decennial 2020 P2 at the full 15-digit block GEOID. This is more granular, which helps in racially mixed neighborhoods where adjacent blocks differ, but the data ages until the next decennial cycle. Block-level counts also carry some noise injected by the Census Bureau's Disclosure Avoidance System (DAS).
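For reference, a 15-digit block GEOID concatenates state (2 digits), county (3), tract (6), and block (4), and its first 12 digits are the enclosing block-group GEOID. A small sketch of splitting one, with a hypothetical function name and an example value chosen only to show the layout:

```python
def split_block_geoid(geoid):
    """Split a 15-digit census block GEOID into its component codes."""
    # Keep GEOIDs as strings: leading zeros are significant.
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError(f"expected a 15-digit block GEOID, got {geoid!r}")
    return {
        "state": geoid[:2],
        "county": geoid[2:5],
        "tract": geoid[5:11],
        "block": geoid[11:15],
        # The first digit of the block code is the block group number.
        "block_group_geoid": geoid[:12],
    }


parts = split_block_geoid("060750101001000")
```

Truncating to `block_group_geoid` gives the key needed to fall back to block-group composition when a block has no usable counts.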

Married-name correction

For records flagged as female, when the surname-only prediction is high-confidence (>0.6) for a different race than the first-name+geo prediction, the posterior is pulled toward the first-name+geo distribution by 0.5 (a soft blend, not a hard override). This is a heuristic, not a published method, and it requires a sex column. The underlying bias it addresses is documented in CFPB enforcement actions and Voicu 2018.
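One way to read the rule is as a linear interpolation between the current posterior and the first-name+geo distribution. The sketch below assumes that reading (a pull of 0.5 meaning an equal-weight blend) and treats "high-confidence" as the top surname-only category exceeding the threshold; the function and argument names are hypothetical.

```python
def married_name_correction(posterior, surname_only, first_geo,
                            is_female, threshold=0.6, pull=0.5):
    """Softly blend the posterior toward first-name+geo when surname disagrees."""
    if not is_female:
        return dict(posterior)
    top_surname = max(surname_only, key=surname_only.get)
    top_first_geo = max(first_geo, key=first_geo.get)
    confident = surname_only[top_surname] > threshold
    if confident and top_surname != top_first_geo:
        # Convex combination of two distributions, so it still sums to 1.
        return {
            race: (1 - pull) * posterior[race] + pull * first_geo[race]
            for race in posterior
        }
    return dict(posterior)


# Hypothetical record (three categories only, for brevity).
surname_only = {"white": 0.15, "black": 0.05, "hispanic": 0.80}
first_geo = {"white": 0.70, "black": 0.10, "hispanic": 0.20}
posterior = {"white": 0.30, "black": 0.10, "hispanic": 0.60}

corrected = married_name_correction(posterior, surname_only, first_geo, is_female=True)
unchanged = married_name_correction(posterior, surname_only, first_geo, is_female=False)
```

In this example the surname points strongly to hispanic while first name and geography point to white, so the corrected posterior moves from 0.60 hispanic to 0.40 without discarding the surname signal entirely.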

Expected accuracy

Approximate per-group accuracy (precision/recall, F1-equivalent) drawn from the published validations cited below. Values are illustrative approximations, not point estimates; actual numbers depend on the test set.

Group      Surname-only   + first + geo (BIFSG)   + party (BIFSGP)
hispanic   ~90%           ~92%                    marginal lift
asian      ~80%           ~85%                    small lift
black      ~60%           ~80%                    +2–5 pts (largest)
white      ~75%           ~88%                    small lift
aian       ~30%           ~35%                    none
multiple   ~10%           ~15%                    none
other      ~5%            ~8%                     none

Notes: White shows high recall but lower precision in surname-only methods (it absorbs unmatched surnames); BIFSG meaningfully improves both. Black sees the largest BIFSGP lift because partisan registration is most diagnostic for distinguishing Black from white voters at the surname-overlap margin. AIAN / Multiple / Other are poorly modeled across the entire BISG family — too few training examples, too much intra-group naming heterogeneity, and no reliable party-by-race signal.

Caveats — read before using results

Citations