Methods for Evaluation of Imperfect Captioning Tools by Deaf or Hard-of-Hearing Users at Different Reading Literacy Levels

Larwan Berke, Sushant Kafle, Matt Huenerfauth · 2018 · Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI '18) · doi:10.1145/3173574.3173665

Summary

This CHI 2018 paper (awarded an Honourable Mention) is the originating methodological study behind the group’s later Alonzo et al. work on evaluating Automatic Text Simplification (ATS). It asks: when Deaf and Hard-of-Hearing (DHH) participants evaluate imperfect captions produced by Automatic Speech Recognition (ASR), which evaluation probes (question types) actually detect differences in caption quality, and do those probes work equally well across literacy levels? The authors argue that accessibility researchers have a responsibility to measure the usability of ASR-captioning tools carefully before deploying them, particularly because DHH readers span a wide range of English literacy (approximately 30% of deaf US high-school graduates are described as functionally illiterate, and most Deaf university students read at around a sixth-grade level). The lab study involved 107 DHH participants (69 Deaf, 36 Hard of Hearing, 2 Other; 48 women; ages 18-30; recruited via university channels; $40 compensation) who watched 12 mock-meeting videos (a human-resources officer discussing hiring, scripted at an 8th-grade Flesch-Kincaid level) captioned at three known Word Error Rate (WER) levels: Human ~5%, Cloud ~15% (Sphinx/Watson-class), and Desktop ~50% (degraded ASR simulation). Participants were split by WRAT-4 sentence-comprehension score into three literacy subgroups: WRAT-L (low), WRAT-M (medium), and WRAT-H (high). After each video, they answered seven types of probes: (a) Boolean 'noticed errors,' (b) Likert 'errors prevented understanding,' (c) Likert 'ASR did a good job,' (d) five-point ordinal accuracy rating, (e) numeric 0-100 accuracy estimate, (f) multiple-choice comprehension quiz, and (g) comprehension-quiz response time.
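The three caption conditions are defined by their Word Error Rate. As a rough illustration of what that number measures, the sketch below computes WER in the conventional way: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the caption text, divided by the number of reference words. The function name and the example sentences are invented for illustration, and whitespace tokenisation is an assumption; this summary does not describe the scoring pipeline the authors actually used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()   # assumption: simple whitespace tokenisation
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from caption
            insertion = curr[j - 1] + 1  # extra word in caption
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substitution ("hire" -> "higher") and one deletion ("this")
reference = "we will hire two new engineers this quarter"
caption = "we will higher two new engineers quarter"
example_wer = word_error_rate(reference, caption)  # 2 errors / 8 words = 0.25
```

On this scale, the Human, Cloud, and Desktop conditions correspond to roughly 1 in 20, 3 in 20, and 1 in 2 reference words being wrong.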

Key findings

Two hypotheses were tested per probe: H1, discriminative ability (does the probe reveal significant differences between WER levels?), and H2, literacy bias (do response scores differ systematically across literacy groups?). The ordinal accuracy scale (Evaluation of Accuracy, d) was the most broadly successful probe: it fully distinguished caption quality levels for WRAT-H and WRAT-M participants, and partially for WRAT-L. Among WRAT-H participants, six of seven probes worked, with the Likert 'errors prevented understanding' (b) and the ordinal accuracy scale (d) being the most sensitive. Among WRAT-L participants, very few probes worked: only the comprehension quiz (partially) and the ordinal accuracy scale (barely), suggesting that lower-literacy DHH readers struggle with probes that demand meta-cognitive reflection. Literacy bias was present for four probes: 'noticed errors' (a), where WRAT-H participants reported noticing errors roughly twice as often as WRAT-L participants; 'ASR did a good job' (c), where WRAT-H gave more critical scores; ordinal accuracy (d), where WRAT-H was again more critical; and the multiple-choice comprehension quiz (f), where higher literacy correlated with higher scores independent of caption quality. Notably, the numeric accuracy estimate (e) showed no literacy bias, even though the ordinal version of the same question (d) showed significant bias. The comprehension-quiz response time (g) did not work for any group or at any WER level.
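To make the two hypotheses concrete, here is a minimal sketch of how one probe's responses could be checked for H1 and H2. This summary does not say which statistical tests the authors used; the Kruskal-Wallis H statistic is an assumption here, chosen only because it is a common rank-based test for ordinal ratings. The ratings are invented toy values, not data from the paper, and the tie correction is omitted for brevity, so a real analysis should use a statistics package.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H statistic (no tie correction) for a dict of label -> list of scores."""
    pooled = [score for scores in groups.values() for score in scores]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted = 0.0
    start = 0
    for scores in groups.values():
        n = len(scores)
        rank_sum = sum(ranks[start:start + n])
        weighted += rank_sum ** 2 / n
        start += n
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# H1 (discriminative ability): within one literacy group, compare the
# probe's scores across the three caption conditions. Toy ratings only.
by_condition = {
    "Human": [5, 4, 5, 4],
    "Cloud": [4, 3, 4, 3],
    "Desktop": [2, 1, 2, 1],
}
h1_statistic = kruskal_wallis_h(by_condition)

# H2 (literacy bias): pool across conditions and compare literacy groups.
by_literacy = {
    "WRAT-L": [4, 4, 3, 5],
    "WRAT-M": [3, 4, 3, 4],
    "WRAT-H": [2, 3, 2, 3],
}
h2_statistic = kruskal_wallis_h(by_literacy)
```

A larger H means the groups' rank distributions differ more. For H1 a large value is desirable (the probe separates caption qualities), whereas for H2 a large value signals that the probe's scores depend on who is answering, not only on what they watched.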

Relevance

For accessibility researchers evaluating ASR-based captioning, live transcription, or any other imperfect real-time language technology with DHH users, this paper gives a concrete probe-selection guide: use an ordinal accuracy rating as the primary evaluation instrument, report participant literacy levels (ideally via WRAT), and supplement with multiple-choice comprehension questions only when the stimulus quality differences are large (Desktop vs Human, not Cloud vs Human). For practitioners designing ASR captioning deployments, the underlying finding matters: DHH users with lower literacy tolerate or fail to notice caption errors that higher-literacy users find problematic, so positive feedback from lower-literacy users does not show that the captions are actually high quality. Limitations include a university-recruited sample (ages 18-30, under-representing older DHH adults), a single scripted business-meeting genre at an 8th-grade reading level, a Sphinx-plus-manual-error-injection stimulus pipeline (not a real current commercial ASR system), and a lack of open-ended or eye-tracking probes. This paper directly motivated the group’s subsequent methodological-research programme on ATS evaluation.

Tags: captioning · deaf and hard of hearing · automatic speech recognition · research methodology · literacy · accessibility research · qualitative research