Methods for Evaluating the Fluency of Automatically Simplified Texts with Deaf and Hard-of-Hearing Adults at Various Literacy Levels

Oliver Alonzo, Jessica Trussell, Matthew Watkins, Sooyeon Lee, Matt Huenerfauth · 2022 · Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI '22) · doi:10.1145/3491102.3517566

Summary

This CHI 2022 paper is a methodological study, not a product evaluation: the authors ask how researchers should measure the fluency of Automatic Text Simplification (ATS) output when the evaluators are Deaf and Hard-of-Hearing (DHH) adults spanning a wide range of English literacy levels. ATS is already used to build reading-assistance tools for groups including people with dyslexia or aphasia, second-language learners, and DHH adults, but its output can introduce grammatical and semantic errors. Alongside the usual questions of whether the simplified text is easier to read (complexity) and whether it preserves meaning (faithfulness), researchers therefore need to evaluate whether it still reads as coherent English (fluency). Prior DHH research had established how to measure complexity with this population, but no one had validated metrics for fluency.

To fill that gap, the authors engineered stimuli at three known fluency levels (low, medium, high) by mixing sentences from two state-of-the-art ATS systems (a hybrid rule-based/data-driven model and a Transformer-based model) with human-written simplifications from Newsela. Twenty-nine DHH participants (average age 25.6), recruited via social media, read the texts in jsPsych and responded to a battery of objective and subjective metrics. Sessions were conducted remotely over Zoom during the COVID-19 pandemic with an ASL-fluent research assistant. Participants were split into higher- and lower-literacy groups (WRAT-H and WRAT-L) at the sample's median score on the sentence-comprehension sub-test of the Wide Range Achievement Test (WRAT).
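The two bookkeeping steps in this design, computing reading speed and splitting participants at the median literacy score, can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' analysis code: the function names, the whitespace-based word count, and the choice to place scores equal to the median in the lower group are all assumptions, and the participant IDs and scores in the usage note are invented.

```python
from statistics import median


def words_per_minute(text: str, seconds: float) -> float:
    """Reading speed: whitespace-delimited word count per minute of reading time.

    Assumption: words are counted by splitting on whitespace; the paper's
    exact counting rule is not given in this summary.
    """
    if seconds <= 0:
        raise ValueError("reading time must be positive")
    n_words = len(text.split())
    return 60.0 * n_words / seconds


def median_split(scores: dict[str, float]) -> dict[str, str]:
    """Label each participant WRAT-H or WRAT-L relative to the sample median.

    Assumption: a score exactly equal to the median goes to the lower group.
    """
    cut = median(scores.values())
    return {
        pid: ("WRAT-H" if score > cut else "WRAT-L")
        for pid, score in scores.items()
    }
```

For example, `words_per_minute("one two three four five six", 3.0)` gives 120 words per minute, and `median_split({"p1": 70, "p2": 85, "p3": 90, "p4": 100})` places p1 and p2 in WRAT-L and p3 and p4 in WRAT-H, because the median of those made-up scores is 87.5.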

Key findings

Two metrics reliably distinguished fluency levels across both literacy groups: reading speed in words per minute (Kruskal-Wallis p < 0.001 for both groups) and subjective Likert-scale judgements of grammaticality (WRAT-L p = 0.003; WRAT-H p = 0.001). Reading speed had previously been used only for complexity evaluation, so its effectiveness for fluency is a new finding. Subjective judgements of readability and understandability worked only for higher-literacy readers, suggesting that these metrics demand a degree of meta-cognitive awareness and that lower-literacy readers are therefore less reliable judges of disfluencies. Comprehension questions, written at both high and low linguistic complexity, failed to distinguish fluency levels for either group, though they did exhibit a literacy bias (WRAT-H participants scored higher), as did understandability judgements. For score prediction, system-performance judgements, and readability, TOST (two one-sided tests) equivalence testing confirmed no measurable literacy bias.
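To make the Kruskal-Wallis comparison concrete, the sketch below computes the H statistic and an approximate p-value for one metric measured at the three fluency levels. It is a hedged illustration using only the Python standard library, not the authors' analysis: the function name is hypothetical, and the p-value relies on the usual large-sample chi-square approximation, which for three groups (2 degrees of freedom) reduces to exp(-H/2). With samples as small as those in this study, an exact or permutation test may be preferable.

```python
import math


def kruskal_wallis_3(low, medium, high):
    """Return (H, p) for three independent samples, with tie correction.

    p uses the chi-square approximation with 2 degrees of freedom,
    whose survival function is exp(-H / 2).
    """
    groups = [list(low), list(medium), list(high)]
    if any(len(g) == 0 for g in groups):
        raise ValueError("each group needs at least one observation")
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)

    # Assign average ranks to tied values and accumulate the tie term.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                                # size of this tie block
        rank_of[pooled[i]] = (i + 1 + j) / 2.0   # mean of ranks i+1 .. j
        tie_term += t**3 - t
        i = j

    rank_sum_term = sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    )
    h = 12.0 / (n * (n + 1)) * rank_sum_term - 3.0 * (n + 1)

    correction = 1.0 - tie_term / (n**3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    h = max(h / correction, 0.0)  # guard against tiny negative rounding error
    return h, math.exp(-h / 2.0)
```

With invented reading speeds that separate perfectly, such as `kruskal_wallis_3([1, 2, 3], [4, 5, 6], [7, 8, 9])`, the rank sums are 6, 15, and 24, giving H = 7.2 and p = exp(-3.6), roughly 0.027. A metric that fails to distinguish fluency levels would instead yield an H near zero and a p-value near 1.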

Relevance

For accessibility practitioners and researchers building or evaluating reading-assistance tools, this paper offers a concrete methodological recipe: use reading speed as the primary fluency metric, and supplement it with a grammaticality Likert item when reading speed is impractical to capture. The work also underscores that DHH adults are not a monolithic population: 30% of deaf high-school graduates in the U.S. have been described as functionally illiterate, and recruiting only 'expert' native-English readers for ATS evaluations systematically excludes the very users most likely to benefit from the tools. The authors also recommend that researchers report participant literacy levels (e.g., WRAT scores) so results can be compared across studies. Limitations include the small sample size, stimuli restricted to science-topic news articles at a single complexity level, and the absence of a faithfulness evaluation.

Tags: automatic text simplification · deaf and hard of hearing · readability · reading accessibility · natural language processing · research methodology · literacy · qualitative research