Mitigating Data Bias
Knowing where bias enters is necessary but not sufficient. You need mitigation strategies at each stage. The general principles below apply broadly; the specific tactics depend on your domain, your data, and what kind of bias you've identified.
- Representativeness. Ensure your dataset represents the target population — not just your convenient sampling frame.
- Diverse and inclusive collection. Draw on sources and participants spanning gender, race, ethnicity, age, socioeconomic status, geography, and other relevant attributes.
- Balanced distribution. Aim for comparable coverage across groups and categories. If a subgroup is genuinely rare, consider oversampling or synthetic augmentation.
- Objective and consistent labeling. Clear guidelines. Multiple annotators. Double-blind annotation for sensitive categories.
- Audit for sensitive attribute proxies. Postal code is often a proxy for race. The time of day someone shops can be a proxy for socioeconomic status. Removing the explicit sensitive attribute isn't enough if correlated proxies remain, because the model can reconstruct the attribute from them.
- Documentation and transparency. Collection process, sources, known biases, limitations — all documented.
- Regular monitoring. Especially with internet-sourced data — biases shift over time as the world changes.
- Diverse teams. People with different perspectives spot different problems.
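The proxy audit in the list above can be sketched in a few lines. This is a minimal illustration, not a standard fairness metric: it asks how much better you can guess the sensitive attribute once you know a candidate feature, compared with always guessing the most common group. The function name `proxy_strength` and the toy postal-code data are hypothetical, and the sketch assumes you hold the sensitive attribute separately for auditing even though it is excluded from the model.

```python
from collections import Counter, defaultdict

def proxy_strength(feature_values, sensitive_values):
    """Return (baseline_accuracy, proxy_accuracy) for guessing the sensitive attribute.

    baseline_accuracy: always guess the most common sensitive value.
    proxy_accuracy: guess the most common sensitive value within each feature value.
    A large gap suggests the feature acts as a proxy.
    """
    if len(feature_values) != len(sensitive_values) or not feature_values:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(sensitive_values)

    baseline = Counter(sensitive_values).most_common(1)[0][1] / n

    # Count sensitive values separately for each value of the candidate feature.
    by_feature = defaultdict(Counter)
    for f, s in zip(feature_values, sensitive_values):
        by_feature[f][s] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_feature.values())
    return baseline, correct / n

# Made-up toy data: postal area "A" is mostly group x, "B" is mostly group y.
postal = ["A", "A", "A", "B", "B", "B", "C", "C"]
group = ["x", "x", "x", "y", "y", "y", "x", "y"]
baseline, with_proxy = proxy_strength(postal, group)
print(f"baseline={baseline:.3f} with_proxy={with_proxy:.3f}")
```

Because the guess is scored on the same rows it was fitted to, the number is optimistic for features with many distinct values; a held-out split would give a fairer estimate. Treat a large gap as a prompt to investigate, not as proof of discrimination.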
Synthetic Data as One Tool
Synthetic data can address data scarcity for underrepresented groups, rebalance distributions, protect privacy when real personal data is too sensitive to expose, and enable controlled experiments.
The challenges are real. The generative process must accurately capture the true underlying distribution; if it makes wrong assumptions, it can introduce new biases. The test set still needs to be real, because a model evaluated on synthetic data has only been checked against the generator's assumptions. And there is a re-identification risk if synthetic records can be traced back to the original individuals. Synthetic data is a tool. It can do a lot of good in the right hands and cause new problems in the wrong hands.
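One way to respect the "test set stays real" constraint is to split first and augment only the training portion. The sketch below assumes rows are plain dictionaries, and it uses random duplication with replacement as a stand-in for a real synthetic generator; the function name `split_then_oversample` and the `group` field are hypothetical.

```python
import random

def split_then_oversample(rows, group_key, test_fraction=0.2, seed=0):
    """Hold out a real test set, then rebalance groups in the training set only."""
    if len(rows) < 2:
        raise ValueError("need at least two rows to split")
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)

    # The test set is carved out BEFORE any augmentation, so it contains
    # only original rows and none of them can leak into training copies.
    n_test = min(len(shuffled) - 1, max(1, int(len(shuffled) * test_fraction)))
    test, train = shuffled[:n_test], shuffled[n_test:]

    by_group = {}
    for row in train:
        by_group.setdefault(row[group_key], []).append(row)

    # Bring every group present in training up to the size of the largest one.
    target = max(len(members) for members in by_group.values())
    balanced = []
    for members in by_group.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced, test

# Made-up toy data: 16 majority rows and 4 minority rows.
rows = [{"id": i, "group": "majority"} for i in range(16)]
rows += [{"id": 16 + i, "group": "minority"} for i in range(4)]
train, test = split_then_oversample(rows, "group")
```

A group that happens to land entirely in the test split cannot be rebalanced this way, so with very rare groups a stratified split is the safer choice. Swapping the duplication step for a generative model leaves the ordering unchanged: split, then synthesize.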
Self-Driving Cars and Synthetic Data
Self-driving car companies use synthetic data extensively because collecting real-world data for rare but critical scenarios (a child running into the road, an unusual sensor failure mode) is expensive and dangerous. Healthcare research uses synthetic data for a related reason: rare disease research suffers chronically from small sample sizes, which synthetic augmentation can help address without exposing patient records.
You remove 'race' from your loan model's features to prevent discriminatory outcomes. A fairness audit later shows the model still produces disparate outcomes across racial groups. What most likely explains this?