Watch It, Don't Imagine It: Creating a Better Caption-Occlusion Metric by Collecting More Ecologically Valid Judgments from DHH Viewers

Akhter Al Amin, Saad Hassan, Sooyeon Lee, Matt Huenerfauth · 2022 · Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI '22) · doi:10.1145/3491102.3517681

Summary

This CHI 2022 paper develops an improved automated metric for the severity of caption occlusion: the problem of closed captions blocking important on-screen visual content during television programming. DHH viewers consistently report that even perfectly transcribed captions become annoying or harmful when they cover a speaker's mouth, a news headline, a sports score, or an ASL interpreter. Yet standard caption-quality metrics such as Word Error Rate (WER), Weighted WER, the NER model, and Automatic Caption Evaluation measure only text accuracy and say nothing about placement. The authors build directly on the prior state-of-the-art model (the 'Component Judgment Model', from the same research group), which collected judgments by showing DHH participants static diagrams of TV screen layouts and asking them to imagine how bad it would be if a caption blocked each region. The central methodological bet of this paper is that ecologically valid judgments, collected after participants actually watch real captioned videos, should yield a better predictive model than judgments based on imagination.

The research team assembled 104 thirty-second video clips across six live-television genres (news, weather news, sports, emergency announcements, interviews, political debates) sourced from CNN, ABC, Fox, NBC, ESPN, and others, and produced four caption-placement variants of each by burning in captions at different vertical locations. Twenty-four DHH participants rated the quality of caption placement on a 10-point scale after watching each video. The team annotated each video with three measures per information region (maximum occlusion percentage, minimum occlusion percentage, and occlusion time), then fit per-genre multiple linear regression models and used Lindeman-Merenda-Gold (LMG) relative-importance analysis to decompose feature contributions.
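To make the three per-region annotations concrete, the following is a minimal Python sketch of how they might be computed from bounding boxes. It is not the authors' released implementation: the `Box` type, the function names, and the per-frame input format are hypothetical, and taking the minimum only over frames in which the caption actually overlaps the region is an assumption, since the summary does not define that measure precisely.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in pixels (hypothetical representation)."""
    x: float
    y: float
    w: float
    h: float

    @property
    def area(self) -> float:
        return self.w * self.h


def overlap_fraction(region: Box, caption: Box) -> float:
    """Fraction of the region's area covered by the caption box (0.0 to 1.0)."""
    ix = min(region.x + region.w, caption.x + caption.w) - max(region.x, caption.x)
    iy = min(region.y + region.h, caption.y + caption.h) - max(region.y, caption.y)
    if ix <= 0 or iy <= 0 or region.area <= 0:
        return 0.0
    return (ix * iy) / region.area


def occlusion_features(region: Box,
                       caption_per_frame: Sequence[Optional[Box]],
                       fps: float) -> dict:
    """Summarise how one information region is occluded over a clip.

    caption_per_frame holds the caption box for each frame, or None when no
    caption is on screen. Assumption: the minimum is taken over occluded
    frames only, so it is not trivially zero whenever captions pause.
    """
    fractions = [overlap_fraction(region, c) if c is not None else 0.0
                 for c in caption_per_frame]
    occluded = [f for f in fractions if f > 0.0]
    return {
        "max_occlusion_pct": 100.0 * max(occluded, default=0.0),
        "min_occlusion_pct": 100.0 * min(occluded, default=0.0),
        "occlusion_time_s": len(occluded) / fps,
    }


# Illustrative 4-second clip at 30 fps on a 1920x1080 frame (made-up geometry).
score_box = Box(x=0, y=900, w=400, h=100)
frames = ([Box(0, 950, 1920, 130)] * 30    # caption covers half the region
          + [Box(0, 975, 1920, 105)] * 30  # caption covers a quarter
          + [None] * 60)                   # no caption on screen
features = occlusion_features(score_box, frames, fps=30)
```

In this sketch, one such feature dictionary per information region, per clip variant, would form the predictor columns for the per-genre regression.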

Key findings

Per-genre regression models explained between 5% and 28% of the variance in ratings: Weather News R² = 0.28, Sports R² = 0.22, Emergency Announcements R² = 0.18, News R² = 0.13, Interviews R² = 0.10, Political Debates R² = 0.05, all statistically significant. The features that mattered most differed substantially from the prior imagination-based study. For Sports, the top predictors were current game score, player statistics, and quarter/timer, whereas the prior model had weighted the speakers' mouth, discussion topic, and speakers' eyes most heavily. For News, none of the top three features matched those of the prior model. For Emergency Announcements, the ASL signer's hand and face emerged as the two largest predictors, reflecting a reality that static-diagram judgments missed. The speaker's mouth, historically assumed critical because of speechreading, did not appear among the top features for any genre, likely because during rapid multi-speaker dialogue DHH viewers rely on captions rather than lipreading. Slowly changing elements (time, temperature, persistent headlines) scored lower than in the imagination-based study, because in a real video any brief gap between captions reveals the underlying information. Rapidly changing elements (scrolling news tickers, ASL interpreter hands) scored higher. On the shared evaluation dataset, the new holistic metric significantly outperformed the prior Component Judgment Model on two of six genres (Weather News z = 2.36, p < 0.05; Sports z = 2.58, p < 0.01) and matched it on the other four.
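The R² values and feature rankings above come from multiple linear regression combined with LMG relative importance, which assigns each predictor its average incremental R² over all orders in which predictors can enter the model. The sketch below illustrates that general technique with NumPy on synthetic data. It is not the authors' code: the three columns, their coefficients, and the noise level are invented for illustration, and the resulting numbers bear no relation to the paper's results.

```python
from itertools import permutations

import numpy as np


def r_squared(X: np.ndarray, y: np.ndarray, cols) -> float:
    """R² of an ordinary least-squares fit of y on the given columns of X."""
    cols = list(cols)
    if not cols:
        return 0.0  # intercept-only model explains no variance
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def lmg_importance(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Average each feature's R² gain over every ordering of the features.

    Cost grows factorially with the number of features, so this brute-force
    form only suits a handful of predictors.
    """
    p = X.shape[1]
    cache = {}

    def r2(cols) -> float:
        key = frozenset(cols)
        if key not in cache:
            cache[key] = r_squared(X, y, sorted(key))
        return cache[key]

    shares = np.zeros(p)
    orders = list(permutations(range(p)))
    for order in orders:
        for i, feat in enumerate(order):
            shares[feat] += r2(order[:i + 1]) - r2(order[:i])
    return shares / len(orders)


# Synthetic example: columns stand in for hypothetical occlusion features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
# Invented relationship: column 0 matters most, column 2 not at all.
y = 8.0 - 0.04 * X[:, 0] - 0.01 * X[:, 1] + rng.normal(0, 1, 200)

shares = lmg_importance(X, y)
full_r2 = r_squared(X, y, range(3))
```

A useful property of this decomposition is that the per-feature shares sum to the full-model R², so each share can be read as that feature's portion of the explained variance.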

Relevance

For broadcasters, streaming platforms, regulators (FCC, Ofcom), and caption-service vendors, this paper contributes a publicly released software implementation of a genre-aware caption-occlusion metric that can either guide prospective caption placement in new broadcasts or retrospectively audit the placement quality of existing content. It directly complements accuracy metrics such as WER and the NER model, and it addresses a long-standing gap: the FCC's closed-captioning quality rules mention placement but provide no quantitative way to measure it. The paper is also a useful methodological reference for accessibility researchers more broadly, arguing that imagination-based user studies systematically mis-weight factors that only matter in motion, and that holistic ratings on dynamic stimuli combined with regression modeling yield more trustworthy user-preference models. Practical takeaways for anyone positioning captions: during sports and weather broadcasts, protect the score/timer/weather-map region; during emergency announcements, never cover the ASL interpreter's hands or face; for fast-changing tickers, even brief occlusion is judged harshly. Limitations include the short 30-second stimuli, a sample of 24 young DHH participants, coverage of only six genres, and a remote COVID-era protocol without controlled screen sizes.
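As an illustration of the prospective use case, the sketch below scores several candidate caption positions with a per-genre linear model and picks the one with the highest predicted rating. Everything specific here is hypothetical: the feature names, the intercept and weights, and the candidate feature values are invented and are not the paper's fitted coefficients. Clamping to a 1 to 10 range is also an assumption about the endpoints of the 10-point scale.

```python
# Hypothetical per-genre model; weights are NOT taken from the paper.
SPORTS_MODEL = {
    "intercept": 8.0,
    "weights": {
        "score_max_occlusion_pct": -0.04,
        "timer_occlusion_time_s": -0.10,
        "player_stats_max_occlusion_pct": -0.02,
    },
}


def predicted_rating(model: dict, features: dict) -> float:
    """Linear prediction of placement quality, clamped to an assumed 1-10 scale."""
    raw = model["intercept"] + sum(
        weight * features.get(name, 0.0)
        for name, weight in model["weights"].items()
    )
    return min(10.0, max(1.0, raw))


def best_placement(model: dict, candidates: dict) -> str:
    """Return the name of the candidate with the highest predicted rating."""
    return max(candidates, key=lambda name: predicted_rating(model, candidates[name]))


# Made-up occlusion features for four vertical caption positions in one clip.
candidates = {
    "top": {"score_max_occlusion_pct": 100.0, "timer_occlusion_time_s": 12.0},
    "upper_middle": {"player_stats_max_occlusion_pct": 60.0},
    "lower_middle": {"player_stats_max_occlusion_pct": 20.0},
    "bottom": {"score_max_occlusion_pct": 40.0, "timer_occlusion_time_s": 6.0},
}

choice = best_placement(SPORTS_MODEL, candidates)
ratings = {name: predicted_rating(SPORTS_MODEL, f) for name, f in candidates.items()}
```

The same `predicted_rating` function could serve the retrospective use case by scoring the placement a broadcast actually used and flagging clips that fall below a chosen threshold.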

Tags: captioning · captions · caption occlusion · deaf and hard of hearing · television accessibility · video accessibility · accessibility metrics · accessibility research · live captioning

Standards referenced: FCC Closed Captioning Quality Rules · BBC Subtitle Guidelines