Glossary

Terms used in accessibility research and practice. Each entry has a definition, common aliases, and category tags.

Re-speaking (also: Respeaking, Speech-to-Text Relay): A captioning technique in which a trained operator listens to a speaker and repeats (re-speaks) their words clearly into a high-quality microphone in a controlled environment, allowing automatic speech recognition software to generate captions with higher accuracy than direct…
Real-Time Captioning (also: Live Captioning, Live Transcription): The process of converting spoken language into text simultaneously as it is being spoken, displayed with minimal delay. Real-time captioning is essential for deaf and hard of hearing individuals to participate in live events, meetings, lectures, and conversations. Methods…
Real-Time Captioning (also: CART, Communication Access Realtime Translation, Live Captioning): The instant conversion of spoken language into text displayed simultaneously as speech occurs, provided either by a trained human captioner or through automatic speech recognition (ASR) technology. Real-time captioning is a critical accessibility service for Deaf and…
Real-Time Captioning (also: Live Captioning, Real-Time Text): The process of converting spoken language into text that is displayed simultaneously or near-simultaneously as the speech occurs. Real-time captioning is essential for deaf and hard of hearing individuals to participate in live events, meetings, and educational settings. Methods…
Real-Time Captioning (also: Live Captioning, CART, Communication Access Realtime Translation): The process of converting spoken language into displayed text in real time, typically with only a few seconds of delay. Professional real-time captioning (CART) uses stenographers with specialised shorthand keyboards who can type at speaking rates of 170+ words per minute,…
Real-Time Captioning (also: Live Captioning, Live Speech-to-Text): The process of converting spoken language to text simultaneously or with minimal delay as the speech occurs. Real-time captioning can be produced by human transcriptionists (CART, C-Print, TypeWell), crowd workers, automatic speech recognition (ASR), or hybrid approaches. Unlike…
Remote Captioning (also: Remote CART, Remote Real-Time Captioning): A live captioning service delivered at a distance, in which a human captioner (CART provider) or automatic speech recognition system receives an audio feed from a meeting, classroom, or event over the internet or a phone line and transmits transcribed text back to the user in…
Respeaking (also: Speech-to-Speech Captioning, Voice Writing): A real-time captioning method in which a trained operator listens to speech and repeats it clearly into a speech recognition system optimized for their voice, producing captions. Respeaking is commonly used in broadcast television captioning and live events. It requires less…
Roll-up Captions (also: Roll-up style, Scroll-up Captions): A captioning display style in which text is added one word or line at a time, scrolling upward as new text arrives and pushing earlier lines off the top. Roll-up is typically used in live captioning because it can display words as they are produced without waiting for a full…
SRT (also: SubRip, SubRip Text, SRT Subtitle Format): SRT (SubRip Text) is a widely used plain-text subtitle file format originally created by the SubRip software for extracting subtitles from DVDs. An SRT file contains sequentially numbered subtitle entries, each with a time range (start and end timestamps in…
Shadow Speaking (also: Shadow Captioning, Respeaking): A captioning technique where a trained human operator listens to live speech and repeats (or "respeaks") it clearly into a speech recognition system, which then generates real-time captions. The shadow speaker simplifies and normalizes the speech, removing overlapping dialogue,…
Social Media Video Captions (also: SMVC): An umbrella term for the textual or symbolic elements (platform-generated captions, creator-edited captions, user-generated captions, and non-speech information such as sound effects, music cues, or onomatopoeia) that are temporally aligned with video content on social media…
Sound Event Detection (also: Audio Tagging, Automatic Sound Recognition): A machine learning technique that automatically identifies and classifies sounds within an audio stream, such as music, applause, laughter, environmental noises, and other non-speech audio events. In accessibility contexts, sound event detection can complement automatic speech…
Sound Representation (also: Sound Depiction): The methods and conventions used to convey audio information through text in captions and other written formats. Common approaches include descriptive text (explaining the sound source and quality), onomatopoeia (words that mimic sounds), and sensory quality-focused descriptions…
Speaker Identification (also: Speaker ID, Speaker Attribution): Methods used in captions and subtitles to indicate which person is currently speaking, enabling viewers to follow conversations among multiple participants. Common in-text speaker identification techniques include double chevrons (>>) with speaker names, different text colors…
Speech-modulated Typography (also: Speech-driven Typography, Prosody-driven Typography): A design technique in which the visual properties of text (typically font weight, width, or size on a variable-font axis) are modulated in real time by features extracted from a corresponding speech signal, such as pitch, loudness, rhythm, or an inferred emotional-arousal…
Stenographer (also: Stenocaptioner, Court Reporter): A trained professional who produces real-time verbatim transcription of speech, typically using a stenotype machine that maps chorded key combinations to phonetic syllables. In accessibility contexts, stenographers (sometimes called stenocaptioners or CART providers) produce…
Stenographic Keyboard (also: Steno Machine, Stenotype, Shorthand Keyboard): A specialized keyboard used by CART captioners and court reporters that allows simultaneous pressing of multiple keys to represent syllables, words, or phrases in a single stroke, enabling transcription speeds of 200+ words per minute. Each captioner maintains a personal…
Stenotype (also: Stenography, Shorthand Typing, Machine Shorthand): A specialised text-entry method that uses a keyboard with fewer keys than a standard QWERTY layout, where multiple keys are pressed simultaneously (chording) to represent phonetic sounds, syllables, or entire words. Stenotype enables trained operators to achieve speeds of…
Subtitle (also: Subtitles, Open captions (video), Movie subtitles): On-screen text that reproduces the spoken dialogue of a video, most commonly rendered in a "movie subtitle" style (white text with a black outline, one or two lines at the bottom of the frame). Subtitles are closely related to captions but are conventionally distinguished in…
Subtitles: Text displayed on screen that represents the spoken language in audio-visual content, primarily intended for viewers who do not understand the language being spoken. While often used interchangeably with captions, subtitles and captions serve different purposes: subtitles…
Subtitles (also: Captions, Closed Captions, CC): Text displayed on screen that represents the spoken dialogue and other relevant audio information in video content. Subtitles (called captions in North America) are essential for deaf and hard of hearing viewers but are also widely used by hearing audiences in noisy…
Tactile Captions (also: Haptic Captions, Vibrotactile Captions): An enhanced captioning approach that supplements traditional text-based captions with vibrotactile feedback, allowing deaf and hard of hearing viewers to feel non-speech sounds (such as phone rings, doorbells, footsteps, or objects falling) through a wrist-worn or body-worn…
Teletext (also: Ceefax, Oracle): A text-based information service broadcast within the television signal that allowed viewers to access pages of text and simple graphics using their TV remote control. Originating in the UK with the BBC's Ceefax service in 1974, teletext provided news, weather, sports results,…
Text Alignment (also: Sequence Alignment, Transcript Alignment): The process of matching corresponding segments between two or more text sequences that represent the same content but may differ in timing, wording, or structure. In captioning systems, text alignment is used to synchronize parallel transcription streams, such as…
Tracked Captions (also: Speaker-following captions, Dynamic captions): Captions that move dynamically within the video frame to stay near the current speaker's face or mouth, rather than remaining anchored at a fixed position (typically the bottom of the video). Tracked captions reduce the visual effort required for Deaf and Hard-of-Hearing viewers…
Transcript (also: Text Transcript, Video Transcript, Audio Transcript): A written document containing the complete text of spoken content from a video or audio recording, presented separately from the media rather than synchronized with it. Unlike captions, which appear on-screen in real time as speech occurs, transcripts provide all text at once,…
Transcription (also: Speech-to-Text Transcription, Real-Time Transcription): The process of converting spoken language into written text, either in real time or after the fact. In accessibility contexts, transcription services provide communication access for deaf and hard of hearing individuals by producing text versions of spoken content in classrooms,…
Transcripts (also: Transcript, Text Transcript): A written, text-based representation of spoken audio or audiovisual content. WCAG 2.1 success criterion 1.2.1, Audio-only and Video-only (Prerecorded), requires an alternative for time-based media, typically a transcript, for pre-recorded audio-only content such as podcasts,…
Typographic Modulation (also: Typographic Variation, Dynamic Typography): Systematic variation of a typeface's visual parameters (weight, width, slant, size, colour, letter spacing, baseline shift, opacity) to carry information beyond the literal words, typically driven by an external signal such as speech pitch, loudness, emotional arousal, or…
User-Generated Captions (also: UGC captions): Captions created and added to video content by non-professional contributors (typically the video's own creator or community members) rather than by professional captioners or fully automated systems. On social media, user-generated captions are often implemented as open…
Verbatim Captioning (also: Verbatim Captions): A captioning approach that reproduces every spoken word exactly as uttered, including filler words, false starts, and repetitions. Regulators in many countries (e.g., the Canadian CRTC, the US FCC) emphasize verbatim accuracy as a quality requirement. Verbatim captions preserve…
Visual Attention Split (also: Split Attention, Divided Visual Attention): The cognitive challenge of needing to divide visual focus between two or more sources of information simultaneously. For deaf and hard of hearing people, visual attention split is a pervasive accessibility barrier: they must look at captions or a sign language interpreter while…
Wav2Vec (also: Wav2Vec2, Wav2Vec 2.0): A family of self-supervised speech representation models from Meta AI that learn rich acoustic embeddings directly from raw waveform audio without requiring transcribed training data. Wav2Vec 2.0, introduced in 2020, became a backbone for low-resource automatic speech…
WebVTT (also: Web Video Text Tracks, Web Video Text Tracks Format): WebVTT (Web Video Text Tracks) is the W3C standard text format for providing timed text tracks (including captions, subtitles, descriptions, chapters, and metadata) synchronized with HTML5 <video> and <audio> elements. WebVTT evolved from the earlier SRT subtitle format,…
Word Error Rate (also: WER): A standard metric for evaluating speech recognition and captioning accuracy, calculated as the number of insertions, deletions, and substitutions needed to transform the transcribed text into the reference text, divided by the total number of words in the reference. Lower WER…
Word Error Rate (also: WER): A metric used to evaluate the accuracy of automatic speech recognition (ASR) and captioning systems, calculated as the number of word-level errors (insertions, deletions, and substitutions) divided by the total number of words in the reference transcript. Lower WER indicates…
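
The Word Error Rate calculation described in the entries above can be illustrated with a minimal sketch. It counts word-level insertions, deletions, and substitutions with a standard edit-distance table and divides by the number of reference words. The function name `word_error_rate` and the tokenization (lowercasing and splitting on whitespace) are assumptions made for illustration; real evaluations usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count.

    Tokenization here (lowercase + whitespace split) is an illustrative
    assumption, not a standard prescribed by the glossary.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i reference words vs. empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One word ("the") is dropped from a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors but do not add to the reference length, the value can exceed 1.0 when the hypothesis contains many extra words.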
