Methodology: read first
Imputes self-identified race/ethnicity from name (and optionally address, party, sex) using established academic methods. All outputs are probability vectors over seven categories: white, black, hispanic, asian, aian, multiple, other.
Data sources
- Census surname file (2010 default) — 162k surnames with race percentages. Source: U.S. Census Bureau, "Frequently Occurring Surnames from the 2010 Census."
- Census surname file (2020, optional toggle) — 157k surnames; raw counts converted to probabilities. New federal race categories. Subject to differential-privacy noise (DAS); rare names show more noise than 2010.
- Census first-name file (2020) — 54k first names with race counts. Used by BIFSG and BIFSGP. Falls back to Tzioumis 2018 (~4.2k names) if the 2020 file isn't present.
- ACS 5-year, block-group level (2024 vintage) — block-group race composition (table B03002). Default geo source for BISG / BIFSG / BISGP / BIFSGP.
- Decennial 2020, block level (optional toggle) — block-level race composition (table P2). More granular than ACS, but ages over the decade.
- Augmented dictionaries (optional): if data/augmented_*.csv files are present (built from the NC voter file via scripts/build_augmented_names.py), they extend the Census tables, with augmented entries overriding Census ones at matching keys. NC voter file race is self-reported and considered higher-quality than name-frequency inference.
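Two of the steps above, converting raw counts to probabilities and letting augmented entries override Census ones, can be sketched as follows. This is a minimal illustration, not the loader's actual code: the function names, the in-memory dict layout, and every number are assumptions made for the example.

```python
RACES = ["white", "black", "hispanic", "asian", "aian", "multiple", "other"]


def counts_to_probs(counts):
    """Normalize a {race: count} mapping into probabilities over RACES."""
    total = sum(counts.get(r, 0) for r in RACES)
    if total <= 0:
        return None  # no usable signal for this name
    return {r: counts.get(r, 0) / total for r in RACES}


def merge_tables(census, augmented):
    """Return a new table in which augmented entries win at matching keys."""
    merged = dict(census)
    merged.update(augmented)
    return merged


# Toy rows; the counts and probabilities are invented for illustration.
census_table = {
    "SMITH": counts_to_probs({"white": 700, "black": 230, "hispanic": 30,
                              "asian": 10, "aian": 10, "multiple": 20}),
}
augmented_table = {
    "SMITH": {"white": 0.66, "black": 0.29, "hispanic": 0.02, "asian": 0.01,
              "aian": 0.01, "multiple": 0.01, "other": 0.0},
    "NEWNAME": {"white": 0.5, "black": 0.5, "hispanic": 0.0, "asian": 0.0,
                "aian": 0.0, "multiple": 0.0, "other": 0.0},
}
surname_table = merge_tables(census_table, augmented_table)
```

After the merge, SMITH carries the augmented probabilities and NEWNAME is available even though the Census table lacks it.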
Methods
1. Census Surname
Direct lookup; surname only.
2. BISG — Bayesian Improved Surname Geocoding
P(race | surname, geo) ∝ P(surname | race) · P(race | geo)
Elliott et al. 2009; Imai & Khanna 2016. Methodologically equivalent to the R package wru.
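The Census file reports P(race | surname) rather than P(surname | race). By Bayes' rule, P(surname | race) is proportional to P(race | surname) / P(race) for a fixed surname, so the formula above can be evaluated as in the sketch below. The function name and all probabilities are illustrative assumptions, and `p_race` stands for a population-wide race distribution that the caller must supply.

```python
RACES = ["white", "black", "hispanic", "asian", "aian", "multiple", "other"]


def bisg_posterior(p_race_given_surname, p_race_given_geo, p_race):
    """P(race | surname, geo), assuming surname and geo are independent given race."""
    unnorm = {}
    for r in RACES:
        prior = p_race[r]
        # P(surname | race) is proportional to P(race | surname) / P(race).
        likelihood = p_race_given_surname[r] / prior if prior > 0 else 0.0
        unnorm[r] = likelihood * p_race_given_geo[r]
    total = sum(unnorm.values())
    if total == 0:
        return None  # the two sources are incompatible; no posterior
    return {r: v / total for r, v in unnorm.items()}


# Invented inputs: a surname shared by white and Black populations,
# located in a block group that is 80% Black.
p_race = {"white": 0.60, "black": 0.12, "hispanic": 0.18, "asian": 0.06,
          "aian": 0.01, "multiple": 0.02, "other": 0.01}
p_surname = {"white": 0.50, "black": 0.40, "hispanic": 0.03, "asian": 0.01,
             "aian": 0.01, "multiple": 0.04, "other": 0.01}
p_geo = {"white": 0.10, "black": 0.80, "hispanic": 0.05, "asian": 0.02,
         "aian": 0.01, "multiple": 0.01, "other": 0.01}

posterior = bisg_posterior(p_surname, p_geo, p_race)
```

With these toy inputs the geography term moves most of the probability mass to black, which a surname-only lookup would not do.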
3. BIFSG — adds first-name term
P(race | first, surname, geo) ∝ P(surname | race) · P(first | race) · P(race | geo)
Voicu 2018. Meaningful lift for distinguishing Black vs. white where surnames overlap.
4. BISGP — adds party-registration term
P(race | surname, geo, party) ∝ P(surname | race) · P(party | race) · P(race | geo)
Imai & Khanna 2016 also describe this extension. P(party | race) uses validated voter-study estimates (Pew, Catalist, ANES). The party term adds bias risk: it conflates partisanship with ethnicity and is sensitive to drift in voting patterns. Off by default; only available when a party column is mapped.
5. BIFSGP — first name + party
P(race | first, surname, geo, party) ∝ P(surname | race) · P(first | race) · P(party | race) · P(race | geo)
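BISG, BIFSG, BISGP and BIFSGP differ only in which likelihood terms enter the product. A minimal sketch of that shared structure follows; the function name is hypothetical, the likelihood values are invented, and races missing from a likelihood dict are treated as probability 0 purely to keep the toy inputs short.

```python
RACES = ["white", "black", "hispanic", "asian", "aian", "multiple", "other"]


def product_posterior(p_race_given_geo, likelihoods):
    """Multiply P(race | geo) by each P(feature | race) term, then normalize.

    likelihoods is a list of {race: P(feature | race)} dicts; None entries
    (for example, no party column mapped) are skipped.
    """
    unnorm = {r: p_race_given_geo.get(r, 0.0) for r in RACES}
    for lik in likelihoods:
        if lik is None:
            continue
        for r in RACES:
            unnorm[r] *= lik.get(r, 0.0)
    total = sum(unnorm.values())
    return {r: v / total for r, v in unnorm.items()} if total > 0 else None


# Invented values for a surname that does not separate white from Black.
geo = {"white": 0.5, "black": 0.5}
surname_lik = {"white": 0.002, "black": 0.002}
first_lik = {"white": 0.001, "black": 0.003}
party_lik = {"white": 0.3, "black": 0.8}

bifsg = product_posterior(geo, [surname_lik, first_lik])
bifsgp = product_posterior(geo, [surname_lik, first_lik, party_lik])
```

Passing only the surname term reproduces BISG, and adding the party term alone gives BISGP. Every extra term inherits the conditional-independence assumption, which is one reason the party term is off by default.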
6. ethnicolr
LSTM on full name; trained on the Florida voter registration file. No geography needed.
Sood & Laohaprapanon 2018 (arXiv:1805.02109). Native output covers four categories (NH White, NH Black, Hispanic, Asian); aian/multiple/other are reported as 0.
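The padding of a four-category output to the seven-category vector can be sketched as below. The input key names (`nh_white`, `nh_black`, and so on) are hypothetical placeholders and are not claimed to match ethnicolr's actual column names.

```python
RACES = ["white", "black", "hispanic", "asian", "aian", "multiple", "other"]

# Hypothetical four-category keys mapped onto this tool's category names.
FOUR_TO_SEVEN = {
    "nh_white": "white",
    "nh_black": "black",
    "hispanic": "hispanic",
    "asian": "asian",
}


def expand_to_seven(four_way):
    """Copy the four modeled categories; leave aian/multiple/other at 0."""
    out = {r: 0.0 for r in RACES}
    for src, dst in FOUR_TO_SEVEN.items():
        out[dst] = float(four_way.get(src, 0.0))
    return out


seven = expand_to_seven(
    {"nh_white": 0.10, "nh_black": 0.05, "hispanic": 0.80, "asian": 0.05}
)
```

Because three categories are fixed at 0, this vector cannot express any belief in them, which matters when it is combined with other methods.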
7. Ensemble
Per-race weighted average of whichever methods you ran. Weights reflect published per-group accuracy.
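One way to read "per-race weighted average" is sketched below. The weights are invented, not the published accuracy figures, and the final renormalization step is an assumption added so that the blended vector sums to 1.

```python
RACES = ["white", "black", "hispanic", "asian", "aian", "multiple", "other"]


def ensemble(preds, weights):
    """Blend predictions race by race.

    preds:   {method: {race: probability}} for the methods that were run
    weights: {method: {race: weight}}; missing weights count as 0
    """
    blended = {}
    for r in RACES:
        num = sum(weights[m].get(r, 0.0) * p.get(r, 0.0) for m, p in preds.items())
        den = sum(weights[m].get(r, 0.0) for m in preds)
        blended[r] = num / den if den > 0 else 0.0
    total = sum(blended.values())
    # Renormalizing is an assumption; per-race averages need not sum to 1.
    return {r: v / total for r, v in blended.items()} if total > 0 else blended


preds = {
    "bisg": {"white": 0.6, "black": 0.4},
    "ethnicolr": {"white": 0.8, "black": 0.2},
}
weights = {
    "bisg": {"white": 3.0, "black": 1.0},
    "ethnicolr": {"white": 1.0, "black": 1.0},
}
combined = ensemble(preds, weights)
```

Only methods present in `preds` contribute, which matches averaging over whichever methods were run.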
Optional adjustments
Phonetic (Metaphone) fallback
When a surname doesn't appear in the Census file, the loader tries matching by Metaphone code (better than Soundex for non-English names) against a count-weighted index of phonetically equivalent Census surnames. Recovers a few percent of records that would otherwise return no surname signal. On by default; can be disabled.
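The count-weighted phonetic index can be sketched as follows. To stay self-contained, the example uses `crude_key`, a toy stand-in that is not Metaphone; the table layout, function names, and numbers are likewise assumptions for illustration.

```python
from collections import defaultdict

RACES = ["white", "black", "hispanic", "asian", "aian", "multiple", "other"]


def crude_key(name):
    """Toy phonetic key (NOT Metaphone): keep the first letter, drop later
    vowels plus H, W and Y, and collapse adjacent repeats."""
    letters = "".join(ch for ch in name.upper() if ch.isalpha())
    if not letters:
        return ""
    out = [letters[0]]
    for ch in letters[1:]:
        if ch in "AEIOUYHW":
            continue
        if ch != out[-1]:
            out.append(ch)
    return "".join(out)


def build_phonetic_index(table, key_fn):
    """table: {surname: (count, {race: prob})} -> {key: count-weighted probs}."""
    sums = defaultdict(lambda: {r: 0.0 for r in RACES})
    totals = defaultdict(float)
    for surname, (count, probs) in table.items():
        key = key_fn(surname)
        totals[key] += count
        for r in RACES:
            sums[key][r] += count * probs.get(r, 0.0)
    return {k: {r: sums[k][r] / totals[k] for r in RACES}
            for k in sums if totals[k] > 0}


def lookup(surname, table, index, key_fn):
    """Exact match first; otherwise fall back to the phonetic index."""
    s = surname.upper()
    if s in table:
        return table[s][1]
    return index.get(key_fn(s))  # None when no phonetic neighbor exists


# Invented counts and probabilities.
table = {
    "JOHNSON": (300, {"white": 0.6, "black": 0.4}),
    "JONSON": (100, {"white": 0.8, "black": 0.2}),
}
index = build_phonetic_index(table, crude_key)
result = lookup("Johnsen", table, index, crude_key)
```

Here "Johnsen" is absent from the table but shares a key with both entries, so it receives their count-weighted average. A real Metaphone function (for example from a third-party library such as jellyfish, if installed) could be passed as `key_fn` instead.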
Decennial 2020 block-level geo
Replaces ACS block-group composition with Decennial 2020 P2 at the full 15-digit block GEOID. More granular — useful in racially mixed neighborhoods where adjacent blocks differ — but the data ages until the next decennial cycle. Block-level data carries some DAS noise.
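A 15-digit block GEOID is the concatenation of state (2 digits), county (3), tract (6) and block (4), and its first 12 digits identify the enclosing block group. The sketch below splits a GEOID and picks a geography table accordingly. The function names, the `use_blocks` flag, and the behavior of falling back to the block group when a block is missing are assumptions, not the tool's documented behavior.

```python
def parse_block_geoid(geoid):
    """Split a 15-digit census block GEOID into its components."""
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError("expected a 15-digit block GEOID")
    return {
        "state": geoid[:2],
        "county": geoid[2:5],
        "tract": geoid[5:11],
        "block": geoid[11:15],
        "block_group_geoid": geoid[:12],  # first digit of the block is the group
    }


def geo_prior(geoid, block_table, block_group_table, use_blocks=False):
    """Return P(race | geo) from the block table if enabled and present,
    otherwise from the enclosing block group (assumed fallback)."""
    parts = parse_block_geoid(geoid)
    if use_blocks and geoid in block_table:
        return block_table[geoid]
    return block_group_table.get(parts["block_group_geoid"])


# Toy tables with invented compositions.
block_table = {"371830501001002": {"white": 0.2, "black": 0.8}}
block_group_table = {"371830501001": {"white": 0.5, "black": 0.5}}

parts = parse_block_geoid("371830501001002")
coarse = geo_prior("371830501001002", block_table, block_group_table)
fine = geo_prior("371830501001002", block_table, block_group_table, use_blocks=True)
```

In this toy case the block is far more homogeneous than its block group, which is the mixed-neighborhood situation where block-level data changes the result.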
Married-name correction
For records flagged as female, when the surname-only prediction is high-confidence (>0.6) in a different race than the first-name+geo prediction, the posterior is pulled toward the first+geo distribution by 0.5 (a soft blend, not a hard override). This is a heuristic, not a published method, and it requires a sex column. The underlying bias it addresses is documented in CFPB enforcement actions and Voicu 2018.
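The rule above can be sketched as a linear blend. Reading "pulled toward ... by 0.5" as interpolation with weight 0.5 is an interpretation, and the function name, argument names, and the `"F"` sex code are hypothetical.

```python
RACES = ["white", "black", "hispanic", "asian", "aian", "multiple", "other"]


def vec(**probs):
    """Build a full seven-category vector from keyword arguments."""
    return {r: probs.get(r, 0.0) for r in RACES}


def married_name_correction(posterior, surname_only, first_geo, sex,
                            threshold=0.6, blend=0.5):
    """Softly pull the posterior toward the first-name+geo distribution when
    a confident surname-only call disagrees with it (female records only)."""
    if sex != "F":
        return posterior
    top_surname = max(RACES, key=lambda r: surname_only[r])
    top_first_geo = max(RACES, key=lambda r: first_geo[r])
    if surname_only[top_surname] <= threshold or top_surname == top_first_geo:
        return posterior
    return {r: (1 - blend) * posterior[r] + blend * first_geo[r] for r in RACES}


# Invented vectors: the surname points to hispanic, first name + geo to white.
surname_only = vec(hispanic=0.8, white=0.2)
first_geo = vec(white=0.7, hispanic=0.3)
posterior = vec(hispanic=0.6, white=0.4)

corrected = married_name_correction(posterior, surname_only, first_geo, "F")
```

Because both inputs sum to 1, the blended vector does as well, so no renormalization is needed.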
Expected accuracy
Approximate per-group accuracy (precision/recall, F1-equivalent) drawn from the published validations cited below. Values are illustrative ranges, not point estimates; actual numbers depend on the test set.
| Group | Surname-only | + first + geo (BIFSG) | + party (BIFSGP) |
|---|---|---|---|
| hispanic | ~90% | ~92% | marginal lift |
| asian | ~80% | ~85% | small lift |
| black | ~60% | ~80% | +2–5 pts (largest) |
| white | ~75% | ~88% | small lift |
| aian | ~30% | ~35% | none |
| multiple | ~10% | ~15% | none |
| other | ~5% | ~8% | none |
Notes: White shows high recall but lower precision in surname-only methods (it absorbs unmatched surnames); BIFSG meaningfully improves both. Black sees the largest BIFSGP lift because partisan registration is most diagnostic for distinguishing Black from white voters at the surname-overlap margin. AIAN / Multiple / Other are poorly modeled across the entire BISG family — too few training examples, too much intra-group naming heterogeneity, and no reliable party-by-race signal.
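The precision/recall distinction in the notes above can be made concrete with a small sketch. It assumes each record has been reduced to a single predicted label (for example the highest-probability category), which is a simplification for evaluation only; the labels below are invented.

```python
def per_group_precision_recall(true_labels, pred_labels):
    """Return {group: (precision, recall)}; None where undefined."""
    pairs = list(zip(true_labels, pred_labels))
    out = {}
    for g in sorted(set(true_labels) | set(pred_labels)):
        tp = sum(1 for t, p in pairs if t == g and p == g)
        fp = sum(1 for t, p in pairs if t != g and p == g)
        fn = sum(1 for t, p in pairs if t == g and p != g)
        precision = tp / (tp + fp) if (tp + fp) else None
        recall = tp / (tp + fn) if (tp + fn) else None
        out[g] = (precision, recall)
    return out


# One Black record is absorbed into white.
truth = ["white", "white", "black", "black"]
predicted = ["white", "white", "white", "black"]
scores = per_group_precision_recall(truth, predicted)
```

In this toy case white has full recall but reduced precision, while black has full precision but reduced recall, the same pattern the notes describe for surname-only methods.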
Caveats — read before using results
- Married-name bias. Surname imputation systematically misclassifies people (often women) who took a spouse's surname from a different group. The optional married-name correction is a partial fix at best.
- Aggregate vs. individual. These methods are calibrated for aggregate disparity estimates. Courts have accepted BISG for VRA aggregate analyses but rejected it for individualized determinations.
- Geography matters a lot. Surname-only is materially worse than BISG. If your file has addresses, prefer BISG/BIFSG.
- Party signal is politically loaded. BISGP/BIFSGP use a party-by-race prior derived from validated voter studies. The signal is strongest for distinguishing Black voters, which means errors in that prior produce directional bias against that group. Use only when partisan signal is genuinely diagnostic and disclose it in any output.
- Differential privacy in 2020 data. The 2020 surname file and Decennial block data are post-DAS-noise. For rare names and small blocks this matters; the academic community is still debating when to migrate from 2010.
- Census-Hispanic ambiguity. Census treats Hispanic origin and race as separate dimensions; we collapse them into a single 7-category vector for compatibility with downstream tools. This loses information for Black-Hispanic and white-Hispanic records.
- Ungeocoded records. Addresses that don't match the Census Geocoder fall back to surname-only methods automatically. Match rates are typically 85–95% on cleaned voter files.
- Legal/ethical. For research, redistricting, or compliance work this is standard practice. For commercial targeting (pricing, advertising) it's a regulatory minefield — consult counsel.
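The aggregate-versus-individual caveat can be illustrated with a short sketch. Averaging posterior probabilities to estimate group shares is a common way to use these outputs in aggregate, but it is presented here as an assumption about usage, with hypothetical function names and invented data.

```python
RACES = ["white", "black", "hispanic", "asian", "aian", "multiple", "other"]


def aggregate_shares(posteriors):
    """Estimate group shares by averaging probabilities across records."""
    n = len(posteriors)
    if n == 0:
        return {r: 0.0 for r in RACES}
    return {r: sum(p.get(r, 0.0) for p in posteriors) / n for r in RACES}


def argmax_shares(posteriors):
    """Estimate group shares by first forcing one label per record."""
    n = len(posteriors)
    counts = {r: 0 for r in RACES}
    for p in posteriors:
        top = max(RACES, key=lambda r: p.get(r, 0.0))
        counts[top] += 1
    return {r: (counts[r] / n if n else 0.0) for r in RACES}


# Four records, each 60% white and 40% black.
records = [{"white": 0.6, "black": 0.4} for _ in range(4)]
soft = aggregate_shares(records)
hard = argmax_shares(records)
```

The probability average reports 40% black, while forcing individual labels reports 0%, which shows why hard per-person assignments are a poor basis for either individual determinations or aggregate estimates.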
Citations
- Elliott, M.N., et al. (2009). "Using the Census Bureau's surname list to improve estimates of race/ethnicity and associated disparities." Health Services and Outcomes Research Methodology, 9(2), 69–83.
- Imai, K. & Khanna, K. (2016). "Improving Ecological Inference by Predicting Individual Ethnicity from Voter Registration Records." Political Analysis, 24(2), 263–272.
- Voicu, I. (2018). "Using First Name Information to Improve Race and Ethnicity Classification." Statistics and Public Policy, 5(1), 1–13.
- Sood, G. & Laohaprapanon, S. (2018). "Predicting Race and Ethnicity From the Sequence of Characters in a Name." arXiv:1805.02109.
- Tzioumis, K. (2018). "Demographic aspects of first names." Scientific Data, 5, 180025.
- U.S. Census Bureau (2023). "Frequently Occurring Surnames from the 2020 Census."
- CFPB (2014). "Using publicly available information to proxy for unidentified race and ethnicity."