LLM-as-Judge

Also known as: LLM as a Judge, Model-as-Judge

An evaluation methodology in which a large language model is prompted to assess the quality of an artifact (generated text, code, a UI, or a response from another model) against a structured rubric. LLM-as-judge is attractive because it extends automated evaluation to dimensions such as meaning, tone, helpfulness, and accessibility semantics, which deterministic rule-based checkers cannot measure. In accessibility work, LLM judges have been used to flag non-descriptive alt text, vague link purposes, and generic form labels that pass syntactic tools such as axe-core. LLM judges require calibration, typically through controlled fault injection or by measuring agreement with human annotators, and their outputs inherit the biases and blind spots of the underlying model.
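Example: the two calibration checks named above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes the judge's verdicts have already been collected as plain labels, and the function names (cohen_kappa, fault_detection_rate) and sample labels are hypothetical. Cohen's kappa is used here as one common chance-corrected agreement statistic; the entry itself does not prescribe a particular metric.

```python
from collections import Counter


def cohen_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between judge and human labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def fault_detection_rate(judge_flags, injected):
    """Share of deliberately injected faults that the judge flagged."""
    # Keep only the judge's verdicts on items that carry an injected fault.
    on_faults = [flag for flag, is_fault in zip(judge_flags, injected) if is_fault]
    if not on_faults:
        raise ValueError("no injected faults in the sample")
    return sum(on_faults) / len(on_faults)


# Hypothetical verdicts on four alt-text samples (assumed data).
judge = ["pass", "fail", "fail", "pass"]
human = ["pass", "fail", "pass", "pass"]
print(cohen_kappa(judge, human))

# Hypothetical fault-injection run: True in `injected` marks a seeded defect.
flags = [True, False, True, True]
injected = [True, True, False, True]
print(fault_detection_rate(flags, injected))
```

Low agreement or a low detection rate on seeded faults suggests the rubric or prompt needs revision before the judge's scores are trusted. Detection rate alone ignores false positives (the third sample above is flagged but has no seeded fault), so it is usually read alongside a precision or agreement figure.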

Category: Artificial Intelligence · Accessibility Testing · Evaluation Methods · Machine Learning

Related: Large Language Model · Automated accessibility testing · Semantic Accessibility

Sources

https://arxiv.org/abs/2306.05685