Reinforcement Learning from Human Feedback

Also known as: RLHF

A machine learning technique used to fine-tune large language models (LLMs) by incorporating human judgments about response quality. Human annotators rank or rate model outputs, and this feedback trains a reward model that guides the LLM toward producing preferred responses. While RLHF is effective for aligning AI outputs with general human preferences, it relies on the perspectives of its annotators, which may not represent marginalised groups, including disabled or neurodivergent users. This can embed normative biases into supposedly "aligned" models.
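
Example: the reward-model step described above can be illustrated with a small sketch. The entry does not specify a loss function; the code below assumes a pairwise Bradley-Terry style loss, which is one common choice, and a linear reward over hand-made feature vectors. The function names, the feature vectors, and the learning-rate and epoch values are all hypothetical and chosen only for illustration. The later step, in which the reward model guides the LLM, is not shown.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-probability that the preferred response wins.

    Assumes a Bradley-Terry model: P(preferred wins) = sigmoid(r_pref - r_rej).
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)), written in a form that avoids overflow
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def train_reward_weights(pairs, dim, lr=0.1, epochs=200):
    """Fit a linear reward r(x) = w . x from (preferred, rejected) feature pairs.

    `pairs` is a list of (preferred_features, rejected_features) tuples,
    standing in for annotator rankings of two model outputs.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in pairs:
            diff = [p - r for p, r in zip(preferred, rejected)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Gradient of -log(sigmoid(margin)) w.r.t. w is -(1 - sigmoid(margin)) * diff,
            # so a gradient-descent step adds lr * (1 - sigmoid(margin)) * diff.
            coeff = 1.0 - sigmoid(margin)
            w = [wi + lr * coeff * di for wi, di in zip(w, diff)]
    return w


# Hypothetical annotator data: in both pairs the preferred output
# scores higher on feature 0 and lower on feature 1.
toy_pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.2], [0.1, 0.9]),
]
weights = train_reward_weights(toy_pairs, dim=2)
```

Because the reward is fitted only to the pairs the annotators supplied, it reproduces whatever those annotators preferred. This is how the bias concern raised in the definition can enter the model: preferences that are absent from the annotation pool never contribute to the learned weights.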

Category: Artificial Intelligence

Related: Large Language Model · Human-AI Alignment · AI Bias

Sources

https://doi.org/10.1145/3663547.3746361