Towards a Tool for Keystroke Level Modeling of Skilled Screen Reading

Shari Trewin, Bonnie E. John, John Richards, Cal Swart, Jonathan Brezin, Rachel Bellamy, John Thomas · 2010 · Proceedings of the 12th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2010) · doi:10.1145/1878803.1878811

Summary

This paper explores whether cognitive models of human performance, specifically the Keystroke-Level Model (KLM), can be adapted to predict task completion times for skilled screen reader users. The goal is to help designers evaluate the usability of their interfaces for blind users without needing direct access to screen reader users. The authors observed an expert JAWS user (P1) performing a practiced, real-world financial data retrieval task across Yahoo Finance and Google Finance pages in Internet Explorer. P1 listened to synthesized speech at approximately 900 words per minute and used highly optimized navigation strategies, including concatenating keystroke sequences without waiting for speech feedback, navigating by structural HTML landmarks, and pressing the Control key to interrupt JAWS speech. The researchers then used CogTool, a cognitive modeling tool based on KLM and the ACT-R cognitive architecture, to build a baseline model of the same task. CogTool was originally designed for visual GUI interaction, and this work is its first application to screen reader use. The model predicted a task time of 91.9 seconds against an observed 48.7 seconds, nearly double, revealing fundamental mismatches between the tool's assumptions and the realities of skilled auditory interaction.
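
In its classic form, a KLM prediction is the sum of the durations of the operators a task requires, and the headline result above can be read as a ratio of predicted to observed time. The sketch below illustrates both under stated assumptions: the durations (0.28 s per keystroke K, 1.2 s per Think operator M) are illustrative values, not figures from the paper, and the names `Operator`, `klm_estimate`, and `overestimate_ratio` are hypothetical. CogTool itself executes models through ACT-R rather than computing a plain sum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """A KLM operator: a short label and an assumed duration in seconds."""
    name: str
    seconds: float


# Illustrative durations (assumptions, not values reported in the paper).
K = Operator("K", 0.28)  # one keystroke
M = Operator("M", 1.2)   # one Think (mental preparation) operator


def klm_estimate(ops: list[Operator]) -> float:
    """Classic KLM prediction: total time is the sum of operator durations."""
    return sum(op.seconds for op in ops)


def overestimate_ratio(predicted: float, observed: float) -> float:
    """How many times longer the predicted task time is than the observed one."""
    return predicted / observed


# Hypothetical fragment: decide, then press three keys.
print(round(klm_estimate([M, K, K, K]), 2))      # 2.04
# Figures from the paper: 91.9 s predicted vs. 48.7 s observed.
print(round(overestimate_ratio(91.9, 48.7), 2))  # 1.89
```

Because the sum is strictly additive, every operator the model inserts but the user never performs, and every activity the user overlaps with another, inflates the prediction. That is the kind of mismatch the paper reports.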

Key findings

The standard KLM as implemented in CogTool substantially overestimated task time for screen reader use because of several key mismatches. First, CogTool's rules for placing Think (cognitive) operators were derived from visual interfaces and inserted pauses before actions such as pressing Enter, but the screen reader user had automated these keystroke sequences into cognitive units that required no deliberation. Removing the unwarranted Think operators reduced the estimate from 91.9 to 83.7 seconds. Second, the model assumed strictly sequential processing (listen, then think, then act), whereas the skilled user heard, thought, and acted in parallel, often issuing the next command after hearing only the first few syllables of speech output. Third, ACT-R's speech timing parameters, calibrated for sighted listeners at normal speech rates, did not account for the extremely fast listening rates of experienced screen reader users (about 900 wpm). Fourth, the model's typing assumptions (home-row finger placement between keystrokes) did not match screen reader usage patterns; observed inter-key times were about one-third of the model's predictions. The authors concluded that a practical modeling tool would need to distinguish between content that must be heard in full and confirmation items that can be partially heard.
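
The effect of parallel processing, fast speech, and partial listening can be illustrated with a toy timing model for a single listen/think/act step. This is a minimal sketch under stated assumptions, not the paper's model: speech time is taken as word count divided by speech rate, a skilled user acts after hearing only a fraction of a confirmation item, thinking may overlap with listening, and `think_s=0` stands in for removing Think operators from automated keystroke sequences. The 180 wpm "normal" rate, the 1.2 s Think time, the 0.28 s keystroke, and the names `speech_seconds` and `step_time` are all hypothetical.

```python
def speech_seconds(words: float, wpm: float) -> float:
    """Seconds needed to hear `words` words of synthesized speech at `wpm`."""
    return words * 60.0 / wpm


def step_time(speech_words: float, wpm: float, action_s: float,
              think_s: float = 0.0, fraction_heard: float = 1.0,
              overlapped: bool = False) -> float:
    """Toy duration of one listen/think/act step (hypothetical model).

    fraction_heard: share of the utterance heard before acting; 1.0 for
        content that must be heard in full, less for confirmation items
        the user recognizes from the first few syllables.
    overlapped: if True, thinking runs in parallel with listening, so only
        the longer of the two precedes the action; if False, listen, think,
        and act happen strictly in sequence, as in the baseline model.
    """
    if not 0.0 < fraction_heard <= 1.0:
        raise ValueError("fraction_heard must be in (0, 1]")
    listen = speech_seconds(speech_words, wpm) * fraction_heard
    if overlapped:
        return max(listen, think_s) + action_s
    return listen + think_s + action_s


# Sequential, baseline-style step at an assumed normal rate of 180 wpm.
baseline = step_time(15, wpm=180, action_s=0.28, think_s=1.2)
# Skilled step: 900 wpm, only the first fifth heard, and a brief
# (assumed 0.1 s) decision hidden under listening.
skilled = step_time(15, wpm=900, action_s=0.28, think_s=0.1,
                    fraction_heard=0.2, overlapped=True)
print(round(baseline, 2), round(skilled, 2))  # 6.48 0.48
```

With these assumed numbers, the same 15-word announcement costs about 6.48 s as a sequential step at 180 wpm but about 0.48 s when heard at 900 wpm with only the first fifth needed before the next keystroke. The exact values matter less than the structure of the calculation: the tool has to know which items must be heard in full and which are merely confirmations.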

Relevance

This paper makes a compelling case that existing usability evaluation tools and cognitive models are fundamentally inadequate for predicting screen reader interaction performance, because they were built on assumptions about visual interfaces. The findings have two direct implications for accessibility practitioners. First, skilled screen reader users interact with interfaces in ways radically different from sighted users, including processing audio and acting in parallel at speech rates approaching 900 wpm. Second, automated accessibility compliance testing reveals little about actual screen reader usability, since two technically compliant pages can offer vastly different user experiences depending on their structural markup and navigation affordances. The envisioned design tool, which would predict screen reader task times, would let developers compare alternative designs and optimize for auditory interaction early in the design process, without requiring scarce access to screen reader users for testing.

Tags: screen readers · cognitive modeling · usability · keyboard navigation · JAWS · keystroke-level model · auditory interface · human-computer interaction