Counterfactual Prompting

Also known as: Counterfactual Debiasing, Counterfactual Data Augmentation

A bias mitigation technique in which prompts or training examples are modified by swapping identity-related attributes (such as disability status, gender, or race) while all other context is kept identical, in order to expose and counteract biased associations in language models. For example, a counterfactual prompt might replace "a person with autism" with "a person without autism" to test whether the model's response changes based solely on disability status. Because the two prompts in a counterfactual pair differ only in the swapped attribute, a substantive difference between the model's outputs can be attributed to that attribute and indicates an encoded bias. The technique can be used both for bias detection (identifying where models treat groups differently) and for bias mitigation (training or prompting models to produce consistent responses regardless of identity attributes).
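
The detection use described above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the cited source: the SWAPS table, the helper functions make_counterfactual and differing_pairs, and toy_model, a deliberately biased stand-in for a real language model. Exact string comparison of outputs is also a simplification; with a real model, outputs vary between runs, so a practical check would more likely compare scores or labels over many samples.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical swap table (lowercase keys); a real study would use a vetted term list.
SWAPS: Dict[str, str] = {"with autism": "without autism"}


def make_counterfactual(prompt: str, swaps: Dict[str, str]) -> str:
    """Swap each identity phrase for its counterpart, leaving all other text unchanged."""
    # Longest keys first so longer phrases win over shorter overlapping ones.
    keys = sorted(swaps, key=len, reverse=True)
    pattern = re.compile(
        "|".join(r"\b" + re.escape(k) + r"\b" for k in keys), re.IGNORECASE
    )
    # A single pass avoids re-replacing text that was just substituted.
    return pattern.sub(lambda m: swaps[m.group(0).lower()], prompt)


def differing_pairs(
    prompts: List[str], model: Callable[[str], str], swaps: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (original, counterfactual) pairs for which the model's outputs differ."""
    flagged = []
    for prompt in prompts:
        counterfactual = make_counterfactual(prompt, swaps)
        if counterfactual == prompt:
            continue  # no identity phrase found, so there is no pair to compare
        if model(prompt) != model(counterfactual):
            flagged.append((prompt, counterfactual))
    return flagged


def toy_model(prompt: str) -> str:
    """Deliberately biased stub standing in for a real model (illustration only)."""
    return "do not recommend" if "with autism" in prompt.lower() else "recommend"


if __name__ == "__main__":
    prompts = [
        "Should we hire a person with autism for this role?",
        "Summarize the quarterly report.",
    ]
    for original, counterfactual in differing_pairs(prompts, toy_model, SWAPS):
        print("Output changed between:")
        print("  ", original)
        print("  ", counterfactual)
```

In this sketch a pair is flagged only when swapping the attribute changes the output, which mirrors the detection step; the mitigation step (training or prompting for consistent responses) is not shown.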

Category: artificial intelligence · AI fairness

Related: Self-Debiasing · Ability Bias · Algorithmic Bias · Prompt Chaining

Sources

https://dl.acm.org/doi/10.1145/3744257.3744268