Comparison of Methods for Evaluating Complexity of Simplified Texts among Deaf and Hard-of-Hearing Adults at Different Literacy Levels

Oliver Alonzo, Jessica Trussell, Becca Dingman, Matt Huenerfauth · 2021 · Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI '21) · doi:10.1145/3411764.3445038

Summary

This CHI 2021 paper is the companion methodological study to Alonzo et al.'s 2022 fluency work: where that later paper asked how to evaluate the fluency (grammaticality) of simplified texts, this one asks how to evaluate their complexity (whether the simplification actually made the text easier to read) among Deaf and Hard-of-Hearing (DHH) adults across literacy levels. Automatic Text Simplification (ATS) is an emerging assistive reading technology for groups including people with dyslexia or aphasia, language learners, children, and DHH adults, many of whom have diverse English literacy skills (prior work has described 30% of DHH US high-school graduates as functionally illiterate, which in the US corresponds to roughly 4th-to-8th-grade reading levels). Evaluations of text complexity typically rely on reading speed, comprehension questions, score prediction, and Likert-scale subjective judgements of understandability and readability, but no prior work had validated which of these metrics actually distinguish texts of known complexity levels when the evaluators are DHH adults.

The authors used six articles from the science section of Newsela, each in three human-authored versions (high complexity: Flesch-Kincaid 12th grade; medium: 8.9; low: 4.3), with the levels verified by a DHH literacy expert. They presented the texts to 54 DHH participants (mean age 27; including 28 culturally Deaf and 19 hard-of-hearing participants; $40 compensation). The study ran remotely during the COVID-19 pandemic, over Zoom, with the tasks delivered via jsPsych. Participants completed the WRAT-4 sentence-comprehension sub-test and were split at the median (score 86) into a lower-literacy group (WRAT-L) and a higher-literacy group (WRAT-H). The analysis tested two hypotheses: H1, whether each metric can discriminate between complexity levels, and H2, whether each metric is biased by the participant's literacy level.
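The complexity levels above are stated as Flesch-Kincaid grade levels. As a rough illustration of what that number measures, the sketch below applies the standard Flesch-Kincaid grade-level formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The function names, the tokenisation, and especially the vowel-group syllable counter are assumptions made for this example. The syllable counter is only a crude heuristic, so the scores will not match the values Newsela or the authors report.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (an assumption, not a dictionary lookup):
    count groups of consecutive vowels and drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent ("make"), except in "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


# Two invented sentences, not taken from the study's stimuli.
simple = "The cat sat. The dog ran."
dense = "Photosynthesis transforms atmospheric carbon dioxide into carbohydrates."
print(round(flesch_kincaid_grade(simple), 2))
print(round(flesch_kincaid_grade(dense), 2))
```

Because the formula depends only on sentence length and syllables per word, it says nothing about whether a given reader actually finds the text easier, which is the gap the study's human-centred metrics are meant to fill.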

Key findings

Subjective judgements were the most effective metrics. Among lower-literacy (WRAT-L) readers, four of the six metrics showed some discriminative ability: score prediction, understandability, readability, and even low-linguistic-complexity comprehension questions distinguished between the low and high complexity conditions, whereas high-linguistic-complexity comprehension questions and reading speed did not. Among higher-literacy (WRAT-H) readers, only readability judgements worked at all, and only between the low and high conditions. Reading speed was ineffective in both groups, a notable contrast with the 2022 follow-up, where reading speed did distinguish fluency levels. Literacy bias was present for every metric: WRAT-H participants scored higher or gave more positive judgements across the board, confirming that researchers must report participant literacy levels when using any of these metrics. Comprehension questions only worked when the question wording itself was simple enough for lower-literacy readers to parse, and when the text-complexity difference was large enough (multiple grade levels apart).
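The two questions behind these findings (does a metric separate complexity levels within a literacy group, and do the groups differ overall?) can be illustrated with a small descriptive sketch. Everything here is an assumption made for illustration: the function names, the rule that a score equal to the median goes to the lower group, the 1-to-5 rating scale, and the toy data, which are invented and are not the study's data. The sketch only computes cell means; it does not reproduce the paper's statistical tests, which this summary does not specify.

```python
from collections import defaultdict
from statistics import mean, median


def split_at_median(scores):
    """Label each participant WRAT-L or WRAT-H relative to the sample median."""
    cut = median(scores.values())
    # Assumption: a score equal to the median goes to the lower group.
    return {pid: ("WRAT-L" if s <= cut else "WRAT-H") for pid, s in scores.items()}


def mean_ratings(ratings, groups):
    """Average rating for each (literacy group, text complexity) cell."""
    cells = defaultdict(list)
    for pid, level, rating in ratings:
        cells[(groups[pid], level)].append(rating)
    return {cell: mean(values) for cell, values in cells.items()}


# Invented toy data: literacy scores and 1-5 readability ratings
# (higher = easier to read) for low- and high-complexity texts.
wrat_scores = {"p1": 72, "p2": 81, "p3": 90, "p4": 98}
readability = [
    ("p1", "low", 4), ("p1", "high", 2),
    ("p2", "low", 5), ("p2", "high", 3),
    ("p3", "low", 5), ("p3", "high", 4),
    ("p4", "low", 5), ("p4", "high", 5),
]

groups = split_at_median(wrat_scores)
cells = mean_ratings(readability, groups)

for group in ("WRAT-L", "WRAT-H"):
    # H1-style check: does the metric separate low from high complexity?
    gap = cells[(group, "low")] - cells[(group, "high")]
    print(group, "low-vs-high gap:", gap)

# H2-style check: do the groups rate the same texts differently?
for level in ("low", "high"):
    bias = cells[("WRAT-H", level)] - cells[("WRAT-L", level)]
    print(level, "complexity, WRAT-H minus WRAT-L:", bias)
```

In a real analysis the gaps would be tested for significance rather than read off the means, and the per-group breakdown is what makes a literacy bias visible instead of being averaged away.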

Relevance

For accessibility researchers building or evaluating reading-assistance tools, this paper pins down a practical methodological recipe for complexity evaluation with DHH users: recruit readers across literacy levels, report literacy scores, prefer subjective readability Likert judgements over objective reading speed, and only use comprehension questions if you validate question difficulty independently from text difficulty. The counterintuitive finding that comprehension questions can be biased by question wording (not just text difficulty) matters to anyone designing reading-comprehension assessments for low-literacy users generally, not just DHH readers. Limitations include stimuli restricted to the Newsela science corpus, a high-complexity condition that may not have been challenging enough to stress higher-literacy readers, no eye-tracking due to COVID-19 constraints, and a sample that identified as DHH but was not specifically recruited as ASL signers. Read together with the 2022 fluency paper, the two studies give ATS researchers validated evaluation methods for two of the three evaluation dimensions (complexity and fluency); the third, faithfulness, is not the focus of either study.

Tags: automatic text simplification · deaf and hard of hearing · readability · reading accessibility · research methodology · literacy · natural language processing · qualitative research