Glossary

Terms used in accessibility research and practice. Each entry has a definition, common aliases, and category tags.

Category: Automatic Speech Recognition

Search results

Automatic Captions (also: Auto-Generated Captions, Auto Captions, ASR Captions): Captions produced by automatic speech recognition (ASR) systems without human transcription, typically generated by the hosting platform (e.g., YouTube, Zoom, Microsoft Teams) as an optional layer on uploaded or live video. Automatic captions have dramatically expanded caption…
Character Error Rate (also: CER): A metric for evaluating automatic speech recognition (ASR) and optical character recognition (OCR) accuracy, measuring the minimum number of character-level edits (insertions, deletions, substitutions) needed to transform the system output into the reference text, divided by the…
Endpoint Detection (also: Voice Activity Detection, VAD): The process by which a speech-recognition system decides when a user has finished speaking, so the system can stop listening and send the captured audio for recognition. Off-the-shelf voice assistants typically use a silence threshold of 500 ms to 1 s, which cuts off users who pause,…
Forced Alignment (also: Phonetic Alignment, Phone-Level Alignment): Forced alignment is an automatic speech processing technique that aligns a speech recording with its known transcription at the phoneme or word level. Unlike free speech recognition, which determines the most likely sequence of sounds, forced alignment constrains the recognizer…
Perceptual Linear Prediction (also: PLP, PLP Coefficients): Perceptual Linear Prediction (PLP) is an acoustic feature extraction technique used in speech processing that models human auditory perception. PLP analysis applies psychoacoustic principles including critical band frequency resolution, equal-loudness pre-emphasis, and…
Universal Background Model (also: UBM): A Universal Background Model (UBM) is a large Gaussian Mixture Model trained on speech from many speakers to represent speaker-independent acoustic characteristics. The UBM serves as a reference distribution against which individual speaker models are compared, typically using…
Whisper (also: OpenAI Whisper, Whisper ASR): An open-source automatic speech recognition (ASR) model released by OpenAI in 2022, trained on 680,000 hours of multilingual and multitask supervised audio data. Whisper supports transcription in dozens of languages, translation into English, language identification, and…
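
As an illustration of the Character Error Rate entry above, the following minimal Python sketch counts character-level edits with a standard Levenshtein dynamic program. The listing truncates the definition before it names the divisor, so the sketch assumes the common convention of dividing by the number of characters in the reference text. The function names `edit_distance` and `character_error_rate` are hypothetical and chosen only for this example.

```python
def edit_distance(hyp: str, ref: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn hyp (system output) into ref."""
    # prev[j] holds the distance between the processed prefix of hyp
    # and the first j characters of ref.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]  # turning i hypothesis characters into "" takes i deletions
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(
                prev[j] + 1,         # delete h from the hypothesis
                curr[j - 1] + 1,     # insert r into the hypothesis
                prev[j - 1] + cost,  # substitute h with r (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(hyp: str, ref: str) -> float:
    """Edits divided by reference length (assumed normalization)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(hyp, ref) / len(ref)


if __name__ == "__main__":
    print(character_error_rate("kitten", "sitting"))
```

Under this assumed normalization the value can exceed 1.0 when the system output is much longer than the reference, so it is an error rate rather than a bounded percentage.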

7 results.
