BLEU Score

Also known as: BiLingual Evaluation Understudy, BLEU

A metric for evaluating the quality of machine-generated text by comparing it to one or more reference (human-written) translations. BLEU computes a modified n-gram precision: it counts how many n-grams (contiguous sequences of words) in the predicted text also appear in the reference text, capping each n-gram's count at the largest number of times it occurs in any single reference so that repeated words are not over-rewarded. The precisions for each n-gram length are combined with a geometric mean and multiplied by a brevity penalty that lowers the score of outputs shorter than the reference. BLEU-1 uses single words only, BLEU-2 uses single words and word pairs, and so on up to BLEU-4. In accessibility research, BLEU scores are used to evaluate sign language translation systems, automatic captioning, and image description generators. While widely used, BLEU has limitations: it rewards exact word matches and may not capture semantic equivalence or fluency, which is particularly relevant for sign language, where multiple valid translations often exist.
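The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`ngrams`, `modified_precision`, `bleu`) are made up for this example, inputs are assumed to be already split into word tokens, scoring is per sentence rather than per corpus, and no smoothing is applied. Established evaluation toolkits differ in tokenization and aggregation, so scores from this sketch are not comparable to published numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against one or more references."""
    cand_counts = ngrams(candidate, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip each candidate count so repeated words are not over-rewarded.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights over orders 1..max_n."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean is zero if any order has no matches
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: compare against the reference length closest to the
    # candidate length (ties go to the shorter reference).
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_mean)


reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
bleu_1 = bleu(candidate, [reference], max_n=1)
bleu_2 = bleu(candidate, [reference], max_n=2)
bleu_4 = bleu(candidate, [reference], max_n=4)
```

For this pair, BLEU-1 is 5/6 and BLEU-2 is about 0.707, but unsmoothed BLEU-4 is 0 because the candidate shares no 4-gram with the reference. This shows how a single changed word can collapse the score of a short sentence even when the meaning is close, which is the limitation noted above.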

Category: Machine Learning · Evaluation Methods · Natural Language Processing

Related: Sign Language Translation · Machine Translation · Natural Language Processing

Sources

https://doi.org/10.3115/1073083.1073135
https://doi.org/10.1145/3477498