Silence is a Feature, Not a Bug: A Deaf Developer’s Autoethnography on Agency and Local AI

Chenyang Gong · 2026 · Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA ’26) · doi:10.1145/3772363.3798715

Summary

This CHI 2026 Extended Abstract is a three-page autoethnographic provocation by a Deaf computer science graduate student who uses a MED-EL cochlear implant. The author refuses the medical-model framing of deafness as deficit and instead argues that the ability to remove the processor and enter absolute silence is a privilege hearing people do not have. The real barrier is not silence but the inability to control what enters awareness when the processor is on: the "cocktail party effect" problem of filtering out unwanted signals. The paper targets a prevailing design assumption in AI-mediated accessibility: that more transcription equals better accessibility. The author argues that cloud-based live-captioning pipelines, benchmarked on Word Error Rate (WER), systematically ignore three costs that matter to Deaf users: latency, context, and agency. Hundreds of milliseconds of network delay break turn-taking and social synchrony; general-purpose models hallucinate on technical jargon and non-native accents; and always-on cloud processing converts users into passive readers while exposing environmental audio to third parties. Drawing on Crip Technoscience (Hamraie and Fritsch) and autobiographical design (Neustaedter and Sengers), the author positions themselves as an architect of their own tools and proposes SilentAgent: a Rust-based, local-first AI agent that runs quantized speech recognition on-device, accepts user-defined lexicons (lab terminology, acronyms), and stays silent by default, speaking only when the user explicitly engages.
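
The "silent by default" behaviour can be pictured as an opt-in gate in front of the recognizer's output. The sketch below is illustrative only: it is written in Python for brevity rather than the Rust the author uses, and the class name SilentGate and its methods are hypothetical, not taken from the paper or from SilentAgent.

```python
from dataclasses import dataclass, field


@dataclass
class SilentGate:
    """Hypothetical opt-in gate: transcripts are dropped unless the user has engaged."""

    engaged: bool = False                       # assumption: the gate starts closed
    shown: list[str] = field(default_factory=list)

    def engage(self) -> None:
        """The user explicitly asks for captions."""
        self.engaged = True

    def disengage(self) -> None:
        """The user returns to silence."""
        self.engaged = False

    def on_transcript(self, text: str) -> bool:
        """Handle one recognized phrase; return True if it was surfaced."""
        if not self.engaged:
            return False                        # silent by default: nothing is shown or kept
        self.shown.append(text)
        return True
```

The design point is that the default branch does nothing at all: text reaches the user only after an explicit engage() call, which inverts the always-on assumption the paper criticizes.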

Key findings

This is a position paper, not an empirical study, so its contribution is argumentative rather than statistical. Four claims stand out. First, WER is the wrong metric for conversational accessibility: the author argues they would trade five percentage points of accuracy for a 50 ms latency improvement, because a delayed laugh is "functionally equivalent to no laugh." Second, cloud generalist models systematically fail in specialist contexts — the author reports phonetic hallucinations such as "LLM inference" transcribed as "Ellen's friends" and "Rust Traits" as "Rust Trades" in a bilingual Chinese-English lab, requiring real-time mental reverse-engineering that drains cognitive bandwidth needed for intellectual participation. Third, always-on captioning is a form of over-assistance that erodes agency by presuming the user always wants transcription; silence should be the default. Fourth, architecture matters. The author describes moving prototypes from Python to Rust specifically to escape the Global Interpreter Lock and garbage-collection pauses, using the CPAL crate for low-overhead audio capture and Sherpa-onnx for on-device inference with JSON lexicon injection to bias recognition toward domain terms. The stated evaluation plan explicitly rejects WER-only benchmarking in favour of interactional metrics: perceived conversational timing, cognitive load, and sense of agency. The author acknowledges generative-AI assistance (Gemini Pro) in drafting.
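
To make the first claim concrete, the sketch below computes WER in its standard form (word-level edit distance divided by the number of reference words) and applies it to the two mis-transcriptions reported above. It is in Python for brevity, not the author's Rust; the function name word_error_rate and the lowercase, whitespace-based tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: case-insensitive comparison, tokens split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,          # deletion
                curr[j - 1] + 1,      # insertion
                prev[j - 1] + cost,   # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("Rust Traits", "Rust Trades"))        # 0.5
print(word_error_rate("LLM inference", "Ellen's friends"))  # 1.0
```

The function takes only two strings, so a caption that arrives 50 ms after the speech and one that arrives two seconds late receive the same score. That absence of any timing term is the gap the author's proposed interactional metrics are meant to cover.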

Relevance

For accessibility practitioners and AT developers, this paper is a useful corrective to the "transcribe everything" paradigm that dominates current captioning products. The author's framing of epistemic exclusion in technical seminars — the moment when a Deaf researcher withdraws from a debate not from lack of ideas but because captions cannot keep pace — names a specific harm that WER benchmarks do not measure. Practitioners designing captioning, speech-to-text, or real-time meeting tools should take seriously the argument that latency is a social signal and that user-controlled activation may matter more than raw accuracy. The local-first + lexicon-injection pattern is also directly applicable to any specialist setting (legal, medical, research) where generalist ASR fails. Caveats: n=1, no deployed evaluation yet, and the accuracy/latency trade-off the author endorses is a personal preference that should not be generalized to all Deaf and Hard of Hearing users. Still, as a design provocation, it reframes accessibility as agency rather than coverage.

Tags: autoethnography · deaf and hard of hearing · cochlear implant · automatic speech recognition · captioning · local-first software · privacy · crip technoscience · AI accessibility