AudioWiz: Nearly Real-Time Audio Transcriptions

Samuel White · 2010 · Proceedings of the 12th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2010) · doi:10.1145/1878803.1878885

Summary

This student research paper presents AudioWiz, a mobile application that provides near-real-time transcriptions of both spoken words and environmental sounds for deaf and hard of hearing users. The author identifies a critical gap in existing automated transcription systems: they filter out environmental noises and focus only on speech, leaving deaf users unaware of important non-speech audio events such as a faulty appliance grinding, a guard dog barking, a doorbell ringing, or a telephone ringing. AudioWiz addresses this by using human crowdworkers rather than automated speech recognition, enabling transcription of any audio content, including contextual and environmental sounds with no speech whatsoever. The system has two components: an iPhone 3GS client application that records and buffers up to thirty seconds of audio with a real-time waveform visualization, and a server that handles worker recruitment and job queuing. Users visually monitor the scrolling waveform to identify significant audio events, then press a "Transcribe It!" button to compress and upload that audio segment for human transcription. Workers are recruited from Amazon's Mechanical Turk via TurKit/quickTurKit, an abstraction layer developed at the University of Rochester that begins recruiting the moment the client app launches, ensuring workers are available when the first audio arrives.
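The client-side buffering described above can be illustrated with a minimal Python sketch. It is not the paper's iPhone implementation: the class name `RollingAudioBuffer`, its methods, and the 8000 Hz sample rate are assumptions made for illustration. Only the thirty-second limit comes from the summary.

```python
from collections import deque


class RollingAudioBuffer:
    """Hypothetical rolling buffer that keeps only the most recent audio."""

    def __init__(self, seconds=30, sample_rate=8000):
        # sample_rate is an assumed value; the summary does not state one.
        self.sample_rate = sample_rate
        # A deque with maxlen discards the oldest samples once it is full.
        self._samples = deque(maxlen=seconds * sample_rate)

    def append(self, chunk):
        """Add newly recorded samples, pushing out the oldest if needed."""
        self._samples.extend(chunk)

    def snapshot(self):
        """Return a copy of the buffered audio, oldest sample first."""
        return list(self._samples)

    def duration_seconds(self):
        """Length of the buffered audio in seconds."""
        return len(self._samples) / self.sample_rate
```

In this sketch, pressing a button like "Transcribe It!" would correspond to calling `snapshot()` and handing the result to a compression and upload step, which is left out here.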

Key findings

The system achieves transcription turnaround in as little as one minute, providing a realistic way for deaf users to understand ambient audio information. Workers are paid one cent per transcription and are not pre-trained; they receive simple instructions to listen for both verbal and nonverbal events, and in the absence of significant events, to describe everything they hear in as much detail as possible. Workers are barred from transcribing the same audio more than once to ensure independent descriptions. The iPhone 3GS was chosen specifically for its hardware MPEG-4 encoder, which compresses audio before transmission significantly faster than software encoding, reducing latency. The waveform visualization serves a dual purpose: it gives users visual feedback about ambient audio levels (higher peaks indicate louder or more active sounds) and helps them decide which audio segments are worth submitting for transcription, optimizing both cost and worker time.
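The worker rules above (one cent per transcription, no worker describing the same audio twice) can be sketched as a small Python data structure. This is an illustrative assumption rather than the paper's server code or the Mechanical Turk API: `TranscriptionJob`, `submit`, and `cost_cents` are hypothetical names.

```python
from dataclasses import dataclass, field

REWARD_CENTS = 1  # per-transcription payment stated in the summary


@dataclass
class TranscriptionJob:
    """Hypothetical record of the descriptions collected for one audio clip."""

    audio_id: str
    descriptions: dict = field(default_factory=dict)  # worker_id -> text

    def submit(self, worker_id, text):
        """Accept a description unless this worker already described the clip."""
        if worker_id in self.descriptions:
            return False  # repeat submissions are rejected
        self.descriptions[worker_id] = text
        return True

    def cost_cents(self):
        """Total payout so far: one reward per accepted description."""
        return REWARD_CENTS * len(self.descriptions)
```

Keying the accepted descriptions by worker ID is one simple way to guarantee that each description of a clip comes from a different person.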

Relevance

AudioWiz represents an early and innovative application of human computation to accessibility, predating the widespread availability of AI-powered sound recognition. The core insight, that deaf users need awareness of environmental sounds and not just speech transcription, remains relevant and has since been taken up by automated features such as Apple's Sound Recognition and Android's Sound Notifications. The crowdsourcing approach trades scalability for quality: humans can describe complex, ambiguous audio scenes that automated systems still struggle with, though the cost and latency make it impractical for continuous use. For accessibility practitioners, the key design lesson is that sound awareness for deaf users extends far beyond captioning spoken words. The system also anticipated later developments in remote accessibility services, where human workers provide real-time assistance to people with disabilities through on-demand platforms.

Tags: deaf and hard of hearing · audio transcription · crowdsourcing · human computation · environmental sounds · sound awareness · mobile accessibility