Glossary

Terms used in accessibility research and practice. Each entry has a definition, common aliases, and category tags.

AI-Generated Speech(also: Synthetic Speech, AI Speech): Speech audio produced by artificial intelligence systems — typically neural text-to-speech or voice cloning models — rather than recorded from a human speaker. Deaf and hard-of-hearing content creators increasingly use AI-generated speech to add spoken-language tracks to signed…
Acoustic Analysis(also: Acoustic Signal Analysis): The computational examination of sound signals to extract measurable properties such as duration, fundamental frequency (pitch), intensity, spectral characteristics, and formant structure. In accessibility and clinical contexts, acoustic analysis is used to objectively assess…
Acoustic Model(also: AM): An acoustic model is the component of an automatic speech recognition (ASR) system that maps short segments of audio (typically 10–25 ms frames of spectral features) to the linguistic units that produced them — most commonly phonemes or sub-phonetic states. Classical acoustic…
Canonical Syllable(also: Canonical Babbling, Well-Formed Syllable): A canonical syllable is a well-formed syllable in infant babbling that consists of a consonant-like closure (closant) produced by an oral cavity constriction followed by a vowel-like opening (vocant). Canonical syllables typically appear between 5 and 10 months of age in the…
Computer Feedback System(also: CFS, Computerized Feedback System): A technology system that detects a user's behavior — such as vocalizations, movements, or physiological signals — and provides immediate audio, visual, or haptic responses mapped to that behavior. In speech and communication interventions, computer feedback systems translate…
Computer-Assisted Language Learning(also: CALL, Computer-Aided Language Learning): Computer-Assisted Language Learning (CALL) refers to the use of computers and digital technology to support language education and pronunciation training. CALL systems often incorporate automatic speech recognition to provide feedback on learner pronunciation, detect…
Computer-Based Speech Training(also: CBST, Computer-Aided Speech Training, CAST): Computer-based speech training (CBST) refers to software systems designed to help individuals improve their speech production through automated exercises, feedback, and practice. These systems typically present target words or utterances, capture the user's speech through a…
Concatenated Speech Synthesis(also: Concatenative Synthesis, Unit Selection Synthesis): A method of producing synthetic speech by connecting pre-recorded segments of human speech, typically diphones (transitions between phonemes) or demi-syllables, to form complete words and sentences. Concatenated speech synthesis produces more natural-sounding output than older…
Concatenative Synthesis(also: Unit Selection Synthesis): A text-to-speech method that generates synthetic speech by concatenating (joining together) pre-recorded segments of human speech. These segments, called units, may be phonemes, diphones, syllables, or words. The system selects and joins appropriate units from a large database…
Connected Speech Recognition(also: Continuous Speech Recognition): A form of automatic speech recognition in which users speak words naturally, with normal coarticulation and minimal pauses, rather than pausing between each word as required by older 'discrete' or 'isolated-word' recognisers. Connected-speech recognition was a significant…
DECtalk: A text-to-speech synthesis system originally developed by Digital Equipment Corporation in the 1980s, using rule-based formant synthesis to generate speech from text input. DECtalk offered several preset voices (including "Paul" and "Betty") and was widely adopted in AAC…
Data-based Synthesis(also: Corpus-based Synthesis, Unit Selection Synthesis): A speech synthesis technique that generates speech by selecting and concatenating segments from a large database of prerecorded human speech, rather than using rules to generate acoustic waveforms from scratch. The database is indexed with phoneme boundaries, pitch, and prosodic…
Deaf-Accented Speech(also: Deaf Accent, Deaf-Accented English): Speech produced by Deaf or Hard of Hearing people whose articulation, prosody, and voicing patterns differ from those of typical hearing speakers because they have limited or no auditory feedback for their own voice. Deaf-accented speech is intelligible to familiar listeners but is…
Diphone(also: Diphone Synthesis): A unit of speech used in text-to-speech synthesis, consisting of the transition from the middle of one phoneme to the middle of the next. Diphone-based synthesis works by recording a set of all possible phoneme-to-phoneme transitions in a language and concatenating the…
Dragon NaturallySpeaking(also: Dragon Dictation, Dragon Speech Recognition, Nuance Dragon): Dragon NaturallySpeaking is a commercial speech recognition software product, originally developed by Dragon Systems and later acquired by Nuance Communications (now part of Microsoft). It converts spoken words into text and computer commands, enabling hands-free computer…
DragonDictate(also: Dragon Dictate): An early discrete speech recognition system developed by Dragon Systems that allowed users to control computers and dictate text by speaking one word at a time with brief pauses between words. Released in the early 1990s, DragonDictate was one of the first commercially viable…
ElevenLabs: A commercial AI voice platform that generates realistic synthetic speech and voice clones from text. ElevenLabs is used in accessibility contexts for producing narrated video voiceovers, audiobook-style readings, and personalized text-to-speech voices, and it has been adopted in…
Endpoint Detection(also: Voice Activity Detection, VAD): The process by which a speech-recognition system decides when a user has finished speaking, so the system can stop listening and send the captured audio for recognition. Off-the-shelf voice assistants typically use a silence threshold of 500 ms to 1 s, which cuts off users who pause,…
Forced Alignment(also: Phonetic Alignment, Phone-Level Alignment): Forced alignment is an automatic speech processing technique that aligns a speech recording with its known transcription at the phoneme or word level. Unlike free speech recognition, which determines the most likely sequence of sounds, forced alignment constrains the recognizer…
Formant Synthesis(also: Rule-based Synthesis, Parametric Synthesis): A text-to-speech method that generates synthetic speech by modeling the acoustic properties of human vocal production, particularly formants (resonant frequencies of the vocal tract). Rather than using recorded speech segments, formant synthesizers use mathematical rules and…
Gaussian Mixture Model(also: GMM): A Gaussian Mixture Model (GMM) is a probabilistic model that represents data as a weighted combination of multiple Gaussian (normal) distributions. Each component Gaussian has its own mean and covariance, allowing GMMs to model complex, multimodal distributions. In speech…
Grid-Based Navigation(also: Grid Navigation, Grid Cursor Control): A speech-controlled cursor positioning technique that divides the screen into numbered regions, allowing users to select progressively smaller areas by speaking numbers until the cursor reaches the target location. This alternative input method enables people with upper-body…
JSML(also: Java Speech Markup Language): An XML-based markup language developed by Sun Microsystems that provides directives for controlling the output of speech synthesis engines. JSML allows developers to specify pronunciation details including speaking rate, volume, pitch, emphasis, pauses, gender of synthetic…
Landmark Detection(also: Acoustic Landmark Detection, Stevens Landmark Theory): Landmark detection is a speech analysis method based on Kenneth Stevens' acoustic model of speech production, which identifies perceptually significant points in the acoustic signal where listeners extract information about underlying distinctive features. Three primary landmark…
Listening Window: The interval during which a voice assistant or speech-recognition system actively captures user audio after being activated (by wake word or button press). A short or fixed listening window causes premature cut-offs for users who pause while formulating speech — common for…
Math-to-Speech(also: Mathematical Speech Generation, Math Speech): The process of converting mathematical notation into spoken language that can be rendered by text-to-speech engines or read aloud by screen readers. Math-to-speech is significantly more complex than reading ordinary text because mathematical expressions are two-dimensional,…
Mispronunciation Detection(also: Pronunciation Error Detection, Mispronunciation Diagnosis): Mispronunciation detection is the automated process of identifying errors in a speaker's pronunciation by comparing their speech production against a target or expected utterance. In assistive technology and speech training systems, mispronunciation detection goes beyond simple…
Natural Speech Output(also: Recorded Speech, Digitized Speech): Speech output produced from digital recordings of actual human speakers, as opposed to artificially generated synthetic speech. Natural speech output preserves the prosody, intonation, emotion, and vocal quality of the original speaker, making it generally more pleasant and…
Neural Vocoder: A deep-learning model that synthesises audio waveforms from intermediate acoustic representations such as mel-spectrograms or discrete speech units. Examples include HiFi-GAN, WaveNet, WaveGlow, and SoundStream. Neural vocoders have largely replaced classical signal-processing…
Non-Verbal Vocalization(also: Non-Speech Vocalization, Vocal Gesture, Non-speech Vocalisation): A sound produced by the voice that is not a spoken word, such as a sustained vowel sound ("Ahhhhh"), hum, or other vocal noise. In assistive technology and alternative input contexts, non-verbal vocalizations can serve as continuous control signals for cursor movement or other…
Perceptual Linear Prediction(also: PLP, PLP Coefficients): Perceptual Linear Prediction (PLP) is an acoustic feature extraction technique used in speech processing that models human auditory perception. PLP analysis applies psychoacoustic principles including critical band frequency resolution, equal-loudness pre-emphasis, and…
Re-speaking(also: Respeaking, Speech-to-Text Relay): A captioning technique in which a trained operator listens to a speaker and repeats (re-speaks) their words clearly into a high-quality microphone in a controlled environment, allowing automatic speech recognition software to generate captions with higher accuracy than direct…
Recurrent Neural Network(also: RNN): A recurrent neural network (RNN) is a type of artificial neural network designed to process sequential data by maintaining an internal state (memory) that captures information from previous inputs in the sequence. Unlike feedforward networks, RNNs have connections that loop…
Repair Mechanism(also: Conversational Repair): In conversational interface design, a feature that helps the user and the system recover from misrecognition, ambiguity, or misunderstanding — for example, clarification prompts ("Did you mean the [X] cricket match?"), visible candidate lists, or "try again" affordances that…
SAPI(also: Speech Application Programming Interface, Microsoft SAPI): The Speech Application Programming Interface (SAPI) is a Microsoft Windows API that enables applications to use speech recognition and text-to-speech synthesis. SAPI provides a standardized interface between speech engines and applications, meaning that a synthetic voice built…
Semantically Unpredictable Sentences(also: SUS, SUS Test): A standardised method for evaluating speech intelligibility in which listeners are presented with sentences that are grammatically correct but semantically meaningless, such as "A polite art jumps beneath the arms" or "The law that finished shows the boots." Because the…
Speaker Adaptation(also: Voice Adaptation, Speaker-Adaptive Training, Voice Personalization): Speaker adaptation is the process of adjusting an existing automatic speech recognition (ASR) system — usually one trained on a large, demographically broad corpus of able-bodied speakers — to a particular individual's voice using a relatively small amount of that person's…
Speaker Diarisation(also: Speaker Diarization, Speaker Segmentation): The automatic process of segmenting an audio recording by speaker identity — answering "who spoke when" — and labelling each segment. A critical pre-requisite for accessible transcripts of multi-voice audio such as interviews, podcasts, and meetings, since a flat transcript…
Spectrogram(also: Sonogram, Spectral Display): A spectrogram is a visual representation of the frequency spectrum of a signal as it varies over time, typically showing time on the horizontal axis, frequency on the vertical axis, and intensity represented by color or brightness. In speech science and accessibility research,…
Speech Composer(also: Speech Generation, Message Composition Engine): A software component in AAC (Augmentative and Alternative Communication) systems that takes user input — whether typed text, selected symbols, or telegraphic phrases — and processes it for spoken output through a text-to-speech synthesiser. Advanced speech composers may include…
Speech Diversity(also: Diverse Speech, Non-Typical Speech): The full range of ways human speech varies from the narrow 'typical' speech on which most speech-AI systems are trained and benchmarked. Speech diversity includes people who stutter, d/Deaf and Hard-of-Hearing speakers, people with dysarthria, aphasia, or other neurological…
Speech Language Model(also: SLM, Audio Language Model, Speech Foundation Model): A class of large neural models that processes both speech and text in a single end-to-end framework, integrating tasks — automatic speech recognition, spoken language understanding, dialogue, speech generation — that traditionally required separate modular systems. Examples…
Speech Neuroprosthesis(also: Speech BCI, Speech Brain-Computer Interface): A brain-computer interface that decodes neural activity associated with attempted or imagined speech and converts it into text, synthesized voice, or both. Speech neuroprostheses are designed for people with anarthria or severe dysarthria from ALS, brainstem stroke, locked-in…
Speech Prosodics(also: Prosodic Features, Suprasegmental Features): Speech prosodics refers to the nonverbal acoustic features of speech that convey meaning beyond the words themselves, including pitch (fundamental frequency), rhythm, stress, intonation patterns, pausing, and speaking rate. In accessibility research, prosodic analysis serves as…
Speech Rate(also: Speaking Rate, Articulation Rate): The speed at which speech is produced, typically measured in words per minute (WPM) or syllables per second. Normal conversational speech ranges from 120 to 180 WPM, while screen reader users often configure synthetic speech at rates of 300 to 400 WPM or higher. Speech rate settings…
Speech Repair(also: Self-Correction, Speech Self-Repair, Command Correction): Speech repair is the process of correcting or modifying a spoken utterance after it has been produced, either within the same turn or in a subsequent one. In natural conversation, speakers commonly interrupt themselves to fix errors, change wording, or update information using…
Speech Visualization(also: Visual Speech Display, Speech-to-Visual Display): Speech visualization refers to techniques that convert spoken language into visual representations to aid comprehension, particularly for individuals who are deaf or hard of hearing. These displays can range from real-time captioning and waveform displays to more abstract…
Speech-Generating Device(also: SGD, Voice Output Communication Aid, VOCA): An electronic AAC device that produces spoken output from text or symbol input, enabling people with speech disabilities to communicate verbally with others. Speech-generating devices range from dedicated hardware (such as Tobii Dynavox devices) to software applications running…
Speech-to-Speech(also: S2S, Speech-to-Speech Conversion): A class of systems that transform one speech signal directly into another — for example, converting atypical input (whispered, dysarthric, accented, or cross-lingual speech) into clear, intelligible output in a target voice or language. Speech-to-speech systems differ from…
Spoken Dialogue System(also: SDS, Voice Dialogue System): A computer system that communicates with users through spoken natural language, allowing them to interact via voice rather than visual or manual interfaces. Spoken dialogue systems are used in telecare, customer service, and home care applications, and are particularly relevant…
