Glossary

Terms used in accessibility research and practice. Each entry has a definition, common aliases, and category tags.

Search results

Acoustic Model(also: AM): An acoustic model is the component of an automatic speech recognition (ASR) system that maps short segments of audio (typically 10–25 ms frames of spectral features) to the linguistic units that produced them — most commonly phonemes or sub-phonetic states. Classical acoustic…
Automated Speech Recognition(also: ASR, Speech-to-Text, Voice Recognition): Technology that converts spoken language into written text using machine learning and signal processing algorithms. In accessibility, ASR is used for real-time captioning, voice control of devices and software, and generating transcripts of audio and video content. While ASR…
Automatic Speech Recognition (ASR)(also: ASR, Speech-to-Text, Voice Recognition): Technology that converts spoken language into written text using computational algorithms and machine learning models. ASR powers auto-captioning features in video conferencing, media players, and assistive devices. While ASR has improved significantly, its accuracy is affected…
Caption quality metric(also: ACE metric, Caption evaluation metric): A measure designed to predict how understandable automatically generated captions are for Deaf and Hard-of-Hearing users, as an alternative to standard Word Error Rate which correlates poorly with actual DHH comprehension. The Automatic Caption Evaluation (ACE) metric combines…
Code-switching(also: Language switching, Code-mixing): Code-switching is the practice of alternating between two or more languages, dialects, or communication styles within a single conversation or even a single sentence. It is common in multilingual households, immigrant communities, and among speakers of non-standard dialects.…
Connected Speech Recognition(also: Continuous Speech Recognition): A form of automatic speech recognition in which users speak words naturally, with normal coarticulation and minimal pauses, rather than pausing between each word as required by older 'discrete' or 'isolated-word' recognisers. Connected-speech recognition was a significant…
Deaf-Accented Speech(also: Deaf Accent, Deaf-Accented English): Speech produced by Deaf or Hard of Hearing people whose articulation, prosody, and voicing patterns differ from those of typical hearing speakers because the speaker has limited or no auditory feedback on their own voice. Deaf-accented speech is intelligible to familiar listeners but is…
DementiaBank: A shared database of multimedia interactions for the study of communication in dementia, maintained as part of the TalkBank system. DementiaBank contains longitudinal recordings of people with Alzheimer's disease and matched controls performing tasks like the "cookie theft"…
Dialog Act(also: Dialogue Act, Speech Act): A classification label representing the communicative intention behind a spoken or written utterance in a conversational system. In the context of accessible technology, dialog acts are used to interpret what a user wants to accomplish when issuing voice commands — for example,…
Distant speech recognition(also: Far-field ASR, Far-field speech recognition): Automatic speech recognition performed on audio captured by microphones positioned at a distance from the speaker (typically 2+ meters), rather than close-talk input from headsets or handheld devices. Distant speech recognition is significantly more challenging than close-talk…
Dysarthric Speech(also: Dysarthria): Dysarthric speech is speech affected by dysarthria, a motor speech disorder resulting from neurological injury or conditions that affect the muscles used for speech production. Characteristics include imprecise articulation, irregular speech rate, abnormal pitch and…
Error-spread modelling(also: Error propagation modelling, Error radiation): An approach to evaluating the impact of speech recognition errors that accounts for how a single misrecognized word degrades comprehension of its neighbouring words, not just the word itself. For example, misrecognizing "kitchen" as "kitten" makes the subsequent word "area"…
Goodness of Pronunciation(also: GOP, GOP Score): A computational measure used in automatic speech recognition to assess how closely a spoken utterance matches expected pronunciation patterns. GOP scores are calculated by comparing phone sequences from unrestricted ASR against forced alignment to the actual word sequence. In…
Hidden Markov Model(also: HMM): A statistical model used extensively in pattern recognition where the system being modeled is assumed to follow a Markov process with hidden (unobserved) states. HMMs have been foundational in both automatic speech recognition and sign language recognition, as they can model…
Jitter and Shimmer(also: Voice perturbation measures, Cycle-to-cycle variability): Acoustic measures of voice quality that capture short-term irregularity in vocal fold vibration. Jitter is the cycle-to-cycle variability in pitch (fundamental frequency), while shimmer is the cycle-to-cycle variability in amplitude. Elevated jitter and shimmer are…
SLPAT(also: Speech and Language Processing for Assistive Technologies): A special interest group jointly supported by the Association for Computational Linguistics (ACL) and the International Speech Communication Association (ISCA), focused on speech and language technology for assistive applications. SLPAT brings together researchers from…
Speaker Adaptation(also: Voice Adaptation, Speaker-Adaptive Training, Voice Personalization): Speaker adaptation is the process of adjusting an existing automatic speech recognition (ASR) system — usually one trained on a large, demographically broad corpus of able-bodied speakers — to a particular individual's voice using a relatively small amount of that person's…
Vocal Programming(also: Voice Coding, Speech-Based Programming, Voice Programming): The practice of writing, editing, and navigating computer code using speech recognition rather than keyboard input. Vocal programming is an important accessibility concern because conventional software development tools implicitly require the use of a keyboard, creating a…
Voice Recognition(also: Speech Recognition, Voice Control, Voice Input): Technology that identifies and processes human speech to convert it into text or execute commands. Voice recognition serves as a critical assistive technology for people with motor disabilities who cannot use a keyboard or mouse, enabling them to navigate websites, dictate text,…
Voice User Interface(also: VUI, Voice Command Interface, Voice Interface): An interface that allows users to interact with a device or application through spoken language commands rather than touch, mouse, or keyboard input. Voice user interfaces use automated speech recognition (ASR) to convert speech to text and natural language understanding (NLU)…
Wake Word(also: Hotword, Trigger Word, Activation Word): A specific word or phrase that activates a voice-controlled device, such as "Hey Google," "Alexa," or "Hey Siri." The wake word must be spoken before any command for the device to begin listening. Wake words present accessibility barriers for people with speech disfluencies, as…
Word Error Rate(also: WER): A metric used to evaluate the accuracy of automatic speech recognition (ASR) and captioning systems, calculated as the number of word-level errors (insertions, deletions, and substitutions) divided by the total number of words in the reference transcript. Lower WER indicates…
Word Lattice(also: Recognition Lattice, Speech Lattice): A graph data structure produced by a speech recognizer that represents multiple competing word hypotheses explored during recognition, along with their acoustic and language model scores. Each path through the lattice represents a possible transcription of the spoken input. Word…
Word error rate(also: WER): The standard metric for evaluating automatic speech recognition accuracy, calculated as the number of substitutions, deletions, and insertions divided by the total number of words in the reference transcript. Research with DHH users has shown that WER correlates poorly with…
iVector(also: Identity Vector, i-vector): A low-dimensional representation of voice characteristics widely used in speaker recognition and verification systems. iVectors capture many acoustic aspects of a speaker's voice in a compact form, making them useful for automatically estimating speech intelligibility in people…

25 results.
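
Worked example for the Word Error Rate entries above. The sketch below is one common way to compute WER as the entries define it: the minimum number of substitutions, deletions, and insertions needed to turn the reference transcript into the recognizer output, divided by the number of words in the reference. The function name word_error_rate, the whitespace tokenization, and the sample sentences are assumptions made for illustration; real evaluations usually normalize case and punctuation first. The sample borrows the "kitchen"/"kitten" misrecognition from the Error-spread modelling entry.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the word-level edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in recognizer output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("the kitchen area is clean", "the kitten area is clean"))
```

Because the denominator is the reference length, a hypothesis with many inserted words can produce a WER above 1.0. The score also counts every error equally, which is one reason the Caption quality metric and Error-spread modelling entries describe alternatives.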
