Reinforcement Learning from Human Feedback
Also known as: RLHF
A machine learning technique used to fine-tune large language models by incorporating human judgments about response quality. Human annotators rank or rate model outputs, and this feedback trains a reward model that guides the LLM toward producing preferred responses. While RLHF is effective for aligning AI outputs with general human preferences, it relies on annotator perspectives that may not represent marginalised groups including disabled or neurodivergent users, potentially embedding normative biases into supposedly "aligned" models.
Category: Artificial Intelligence
Related: Large Language Model · Human-AI Alignment · AI Bias