Modeling Word Importance in Conversational Transcripts: Toward improved live captioning for Deaf and hard of hearing viewers

Akhter Al Amin, Saad Hassan, Matt Huenerfauth, Cecilia O. Alm · 2023 · Proceedings of the 20th International Web for All Conference (W4A) · doi:10.1145/3587281.3587290

Summary

This paper investigates how to model word importance in conversational transcripts to improve live captioning quality for Deaf and hard of hearing (DHH) viewers. Live captions generated by automatic speech recognition (ASR) systems inevitably contain errors, but not all errors are equally damaging to comprehension. DHH readers typically skim captions quickly, with only two to three lines visible at any time, making keyword recognition critical for understanding. Standard caption quality metrics like Word Error Rate (WER) and Named Entity Recognition (NER) treat all words equally, failing to reflect how DHH users actually perceive and process captions. The authors build on a human-annotated word importance dataset derived from the Switchboard conversational speech corpus, where approximately 25,000 tokens were labeled by annotators as high, medium, or low importance. They explore the linguistic characteristics of words at different importance levels, finding that high-importance words are predominantly nouns and adjectives, while low-importance words tend to be determiners, pronouns, and prepositions. The research then develops classification models to automatically predict word importance. The authors augment the training data with part-of-speech (POS) tags and use BERT-based masked language modeling to generate synthetic examples, addressing class imbalance in the dataset. Multiple model architectures are evaluated, including logistic regression, random forests, and neural networks, with and without POS augmentation.
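
The claim that WER treats all words equally can be made concrete with a small sketch. The code below is a generic word-level edit-distance implementation of WER, not code from the paper, and the example sentences are made up for illustration. It shows that misrecognizing a determiner and misrecognizing a key noun yield the same score.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not taken from the Switchboard corpus).
reference = "the meeting moved to friday"
function_word_error = "a meeting moved to friday"    # error on a determiner
content_word_error = "the meeting moved to monday"   # error on the key noun

print(word_error_rate(reference, function_word_error))
print(word_error_rate(reference, content_word_error))
```

Both hypotheses contain one substitution in a five-word reference, so both receive the same WER even though only the second changes the meaning of the caption.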

Key findings

The best-performing model was a neural network trained on POS-augmented data, achieving an F1 score of 0.64 and an accuracy of 0.71 for word importance classification. POS tag augmentation consistently improved performance across model types; for example, logistic regression improved from an F1 of 0.57 to 0.59 with POS features. Analysis of POS tag distributions revealed clear patterns: high-importance words were dominated by common nouns (NN), plural nouns (NNS), proper nouns (NNP), and adjectives (JJ), while low-importance words were primarily determiners (DT), personal pronouns (PRP), and prepositions (IN). The BERT-based data augmentation technique generated approximately 5% additional training examples, helping to reduce class imbalance. The authors also released their augmented dataset publicly to support future research. While the models showed meaningful improvement over baselines, the moderate F1 scores indicate that word importance prediction in conversational speech remains a challenging task, likely due to the inherently subjective and context-dependent nature of importance judgments.
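
The kind of POS distribution analysis described here can be sketched as a simple tally of tags per importance level. The rows below are hypothetical toy data, not entries from the annotated dataset, and `pos_share_by_importance` is an illustrative helper name rather than anything from the authors' code.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows: (token, Penn Treebank POS tag, importance label).
ANNOTATED_TOKENS = [
    ("the", "DT", "low"),
    ("new", "JJ", "high"),
    ("teacher", "NN", "high"),
    ("gave", "VBD", "medium"),
    ("us", "PRP", "low"),
    ("books", "NNS", "high"),
    ("about", "IN", "low"),
    ("chicago", "NNP", "high"),
    ("in", "IN", "low"),
]


def pos_share_by_importance(rows):
    """Return, for each importance label, the share of tokens carrying each POS tag."""
    counts = defaultdict(Counter)
    for _token, pos_tag, importance in rows:
        counts[importance][pos_tag] += 1

    shares = {}
    for importance, tag_counts in counts.items():
        total = sum(tag_counts.values())
        # most_common() orders tags from most to least frequent.
        shares[importance] = {tag: n / total for tag, n in tag_counts.most_common()}
    return shares


for importance, tag_shares in pos_share_by_importance(ANNOTATED_TOKENS).items():
    print(importance, tag_shares)
```

On real annotations, the same tally would show whether noun and adjective tags concentrate in the high-importance class and function-word tags in the low-importance class, as the paper reports.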

Relevance

This research has direct implications for improving live captioning systems used in education, workplaces, and media. Current caption quality metrics do not account for the fact that errors on important content words are far more disruptive than errors on function words. A reliable word importance model could enable weighted error metrics that better predict DHH users' actual comprehension, leading to more meaningful evaluation of ASR systems. For accessibility practitioners, this work highlights that caption accuracy alone is insufficient: the distribution of errors across word types matters significantly. Organizations providing live captioning services could use importance-weighted metrics to better assess and compare captioning quality. The research also underscores the value of involving DHH perspectives in defining what constitutes caption quality, moving beyond purely technical metrics toward user-centered evaluation.
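
One way an importance-weighted error metric could work is sketched below. This is an assumption-laden illustration, not the metric proposed or evaluated in the paper: the weights assigned to high, medium, and low importance, the fixed insertion cost, and the function name `importance_weighted_wer` are all hypothetical choices.

```python
# Hypothetical mapping from importance label to error cost.
IMPORTANCE_WEIGHTS = {"high": 1.0, "medium": 0.5, "low": 0.1}


def importance_weighted_wer(ref_words, ref_labels, hyp_words, insertion_cost=0.5):
    """Weighted edit distance normalized by the total weight of the reference.

    Substituting or deleting a reference word costs that word's importance
    weight. Inserted words have no reference label, so they are charged a
    fixed insertion_cost (an arbitrary choice for this sketch).
    """
    if len(ref_words) != len(ref_labels):
        raise ValueError("each reference word needs exactly one importance label")
    weights = [IMPORTANCE_WEIGHTS[label] for label in ref_labels]
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("reference must carry positive total weight")

    # prev[j] is the cheapest way to align the reference words processed so far
    # with the first j hypothesis words.
    prev = [j * insertion_cost for j in range(len(hyp_words) + 1)]
    for ref_word, weight in zip(ref_words, weights):
        curr = [prev[0] + weight]
        for j, hyp_word in enumerate(hyp_words, start=1):
            match_or_substitution = prev[j - 1] + (0.0 if ref_word == hyp_word else weight)
            deletion = prev[j] + weight
            insertion = curr[j - 1] + insertion_cost
            curr.append(min(match_or_substitution, deletion, insertion))
        prev = curr
    return prev[-1] / total_weight


# Illustrative reference with made-up importance labels.
reference = "the meeting moved to friday".split()
labels = ["low", "high", "medium", "low", "high"]

print(importance_weighted_wer(reference, labels, "a meeting moved to friday".split()))
print(importance_weighted_wer(reference, labels, "the meeting moved to monday".split()))
```

With these assumed weights, the determiner error scores about 0.04 and the noun error about 0.37, whereas plain WER would score both at 0.2. In practice the labels would come from a word importance classifier rather than being set by hand, and the weights would need to be calibrated against DHH readers' judgments.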

Tags: live captioning · deaf and hard of hearing · automatic speech recognition · word importance · natural language processing · caption quality metrics