Glossary

Terms used in accessibility research and practice. Each entry has a definition, common aliases, and category tags.

Search results

Key Frame Extraction (also: Keyframe Selection, Key Frame Selection): A computer vision technique that automatically identifies and selects the most representative or highest-quality frames from a continuous video stream. In accessibility contexts, key frame extraction is used in mobile assistive applications to select well-focused,…
Keyframe (also: Key Frame): A keyframe is a single representative frame selected from a video scene or shot that best captures the essential visual content of that segment. In automated audio description and video captioning systems, keyframe selection is a critical step: the chosen frame is analyzed by…
Landmark Extraction (also: Keypoint Detection, Skeletal Tracking): A computer vision technique that identifies and tracks specific anatomical points (landmarks or keypoints) on the human body, hands, and face from images or video. In sign language technology, landmark extraction is a critical preprocessing step that converts raw video into…
Large Vision Model (also: LVM): A large vision model is a foundation model trained on very large image (and often video) datasets to produce general-purpose visual representations that support object detection, segmentation, captioning, or feature extraction without task-specific retraining. Examples include…
Linear Discriminant Analysis (also: Fisher Discriminant Analysis, Fisherfaces): A statistical method used in pattern recognition and machine learning that finds a linear combination of features to best separate two or more classes of objects. In the context of face recognition, LDA (also known as the Fisherfaces method) projects face images into a…
Meal Assistance Technology (also: Dining Assistance Technology, Food Accessibility Technology): Assistive technologies designed to help people with disabilities identify, locate, and consume food independently during mealtimes. For people with visual impairments, these systems may use computer vision to recognize dishes, voice interfaces to provide information about food…
MediaPipe: An open-source framework by Google for building multimodal machine learning pipelines, commonly used for real-time face, hand, and body tracking. In accessibility applications, MediaPipe Holistic extracts 3D landmarks from the user's body and hands via webcam, while MediaPipe…
Microsoft Kinect (also: Kinect, Kinect sensor): A motion-sensing device that captures RGB video, depth images, and skeletal tracking data simultaneously. Originally developed for gaming, the Kinect became widely adopted in accessibility research due to its affordable price point (compared to laboratory equipment) and ability…
Motion Capture (also: MoCap, Movement Tracking): Technology that records the movements of people or objects, typically using cameras, sensors, or computer vision, and translates them into digital data for animation or analysis. In sign language applications, motion capture tracks hand, body, and facial movements to drive…
Motion History Image (also: MHI): A computer vision technique that represents motion in video sequences as a single grayscale image, where pixel intensity indicates recency of movement. Brighter pixels represent more recent motion while darker pixels show older movement patterns. In accessibility applications,…
Multimodal Features (also: multimodal data, multimodal fusion): Information extracted from multiple sensory channels or data types, such as combining visual (RGB), depth, audio, and skeletal data, to improve recognition accuracy. In accessibility systems, multimodal approaches often outperform single-modality methods because different data…
OCR (Optical Character Recognition) (also: OCR, Optical Character Recognition, Text Recognition): A computer-vision technology that converts images of printed, handwritten, or on-screen text into machine-readable character data. OCR is foundational to a wide range of accessibility tools: extracting alt-text for image-based PDFs, reading labels for screen-reader users (e.g.,…
ORBIT Dataset (also: Object Recognition for Blind Image Training): A disability-first machine learning dataset for teachable object recognition, contributed by people who are blind or have low vision. The original ORBIT dataset (Massiceti et al., 2021) contains 3,822 videos of 486 objects from 67 data collectors, predominantly in the UK and…
Object Detection (also: Object Recognition): A computer vision technique that identifies and locates specific objects within images or video frames, typically by drawing bounding boxes around detected items and classifying them. In video accessibility, object detection enables automatic identification of video elements…
Object Recognition (also: Object Detection): A computer vision capability that identifies and classifies objects within images or video frames. In visual assistance technologies, object recognition enables automated description of what the camera captures, helping blind users identify items in their environment. However,…
Object Status Recognition (also: Object State Recognition, Object Transformation Detection): The computer vision task of identifying the current condition or transformation state of objects, such as whether an ingredient is raw, chopped, sautéed, or blended. Object status recognition goes beyond simple object detection (identifying what is present) to understand how…
Open-Vocabulary Detection (also: Open-Vocabulary Object Detection, OVD): A class of computer vision object detection models that accept arbitrary text queries at inference time rather than being restricted to a fixed set of pre-trained classes. Instead of only recognizing, for example, the 80 COCO categories, an open-vocabulary detector (such as…
OpenPose: An open-source computer vision library developed by Carnegie Mellon University that detects human body, hand, facial, and foot keypoints in real-time from images or video. OpenPose extracts 25 body keypoints, 21 keypoints per hand, and 70 facial landmarks, providing a skeletal…
Optical Flow: A computer vision method that estimates the apparent motion of objects between consecutive video frames by tracking pixel displacement patterns. Optical flow calculates velocity vectors showing movement direction and speed across an image. In assistive technology, optical flow…
Optical Music Recognition (also: OMR): Computer vision technology that automatically converts images of printed or handwritten music notation into machine-readable digital formats such as MusicXML. OMR is analogous to OCR (Optical Character Recognition) for text. While OMR can potentially streamline the creation of…
Overlay Detection (also: Overlay Recognition): The process of automatically identifying graphical or textual elements overlaid on top of video content, such as pop-up graphics, watermarks, banners, subtitles, logos, and text annotations. Overlay detection uses computer vision techniques including edge detection, shape…
Pedestrian Detection (also: Person Detection, Human Detection): A computer vision task that identifies and locates people in images or video frames, typically using deep learning models such as convolutional neural networks. In accessibility applications, pedestrian detection is used in wearable assistive technologies for blind and low…
Perceptual Hashing (also: Image Hashing, pHash): A technique that generates a compact fingerprint (hash) of an image based on its visual content rather than its raw data. Unlike cryptographic hashes that change completely with any modification, perceptual hashes produce similar values for visually similar images, allowing…
Personal Object Recognizer (also: Teachable Object Recognizer, Custom Object Classifier): A computer vision system that allows individual users to train their own object recognition models by providing a small number of example photos and custom labels. Unlike generic object recognizers that use pre-defined categories, personal object recognizers let users define…
Personalized Object Recognition (also: Teachable Object Recognition): A class of computer vision systems that allow an individual user, typically someone who is blind or has low vision, to train their device to recognize a small set of personally relevant objects (a specific coffee mug, a particular set of keys, a favourite notebook) by…
Phase-Based Motion Processing (also: Phase-Based Video Motion Processing, Phase-Based Motion Magnification): A family of computer vision techniques that decompose video frames into complex steerable pyramids and analyse changes in the temporal phase of each scale and orientation to recover motion, including sub-pixel movements invisible to the naked eye. Because it operates in the…
Point Cloud: A set of data points in three-dimensional space, where each point represents a position on the surface of an object or environment, typically captured by depth cameras, LiDAR scanners, or photogrammetry. In accessibility applications, point clouds are used to create virtual…
Polar Motion Profile (also: PMP): A Polar Motion Profile (PMP) is a computational technique used in sign language detection that models the quantity and distribution of motion relative to a detected face using polar coordinates. The method captures the characteristic hand and arm movements associated with…
Principal Component Analysis (also: PCA): A statistical technique that reduces the dimensionality of data by identifying the principal axes of variation in a dataset. In accessibility and assistive technology contexts, PCA is commonly used in face recognition systems (as the basis of the Eigenfaces method), gesture…
RANSAC (also: Random Sample Consensus): An iterative algorithm (Fischler and Bolles, 1981) for fitting a mathematical model to data that contains a significant proportion of outliers. In accessibility-focused indoor navigation systems, RANSAC is commonly used to detect the floor plane from a LiDAR point cloud: points…
RGBD Camera (also: RGB-D Camera, Depth Camera, Stereo Camera): A camera that captures both a colour (RGB) image and a per-pixel depth (D) measurement of the scene, yielding a 3D representation of the environment. Depth can be produced by stereo vision, structured light, or time-of-flight sensing. In accessibility research RGBD cameras…
Scene Classification (also: Scene Recognition, Scene Understanding): Scene classification is a computer vision task that categorizes images or video frames into predefined scene types such as indoor/outdoor, kitchen, office, or street. For accessibility, scene classification helps automated systems provide context about environments in image…
Scene Segmentation (also: Scene Detection, Shot Boundary Detection): Scene segmentation is the process of automatically dividing a video into discrete scenes or segments based on visual changes such as cuts, transitions, or the appearance of new elements in the frame. In the context of accessibility, scene segmentation is a foundational component…
Scene Text Recognition (also: Scene Text Detection, Text in the Wild, Environmental Text Detection): The computer vision task of detecting and reading text that appears naturally in real-world environments, such as street signs, product labels, shop names, and building numbers. Unlike optical character recognition (OCR) for scanned documents where text layout is predictable,…
Screen Recognition: A computer vision feature in Apple's VoiceOver screen reader that automatically interprets the pixels of a graphical user interface to identify and label interactive elements when applications have not properly implemented accessibility APIs. Screen Recognition analyses the…
Semantic Segmentation (also: Pixel-Level Classification, Scene Parsing): A computer vision technique that classifies every pixel in an image into a predefined category, producing a detailed map of what objects are present and where they are located. Unlike object detection (which draws bounding boxes around objects), semantic segmentation provides…
SigLIP (also: Sigmoid Loss for Language Image Pre-Training): A vision-language model that aligns images with text descriptions using a pairwise sigmoid loss instead of the softmax-based contrastive loss. SigLIP improves upon CLIP by using a more efficient training objective that computes image-text similarity without requiring large batch sizes. In accessibility…
Sign Language Generation (also: Sign Language Synthesis, Signing Generation): The automatic production of sign language content, typically through computer-generated animations of signing avatars or video synthesis. Sign language generation systems convert text or symbolic representations of signs into visual output, often using motion-capture data,…
Sign Recognition (also: Indoor sign recognition, Signage recognition): The task of automatically detecting, reading, and interpreting signs in an environment; for accessibility purposes, typically indoor directional signs (arrows pointing to corridors or facilities) and textual signs (room numbers, department names, wayfinding labels). Sign…
Sign Spotting (also: Sign Detection, Continuous Sign Spotting): Sign spotting is the task of automatically locating instances of specific signs within a continuous signing video, as opposed to classifying a pre-segmented isolated sign. It is a building block for search-by-sign in archive footage, automatic captioning of signed media, and…
Sign Language Detection (also: SL detection, Signing detection): The automated identification of whether video content contains sign language communication, using computer vision techniques to analyse motion patterns around detected faces. Sign language detection is distinct from sign language recognition (which interprets specific signs): it…
Skeleton Tracking (also: skeletal tracking, body tracking, pose estimation): Technology that detects and tracks the positions of human body joints (such as head, shoulders, elbows, hands) in real-time from camera or depth sensor data. In accessibility applications, skeleton tracking enables gesture-based interfaces, sign language recognition, and…
Spatiotemporal Saliency (also: Spatiotemporal Saliency Estimation, Spatio-Temporal Saliency): A computer vision technique that estimates, for each pixel in a video, how visually important it is at a given moment by combining spatial contrast (features that stand out within a frame) with temporal contrast (regions that change or move differently from their recent…
Speaker Segmentation (also: Person Segmentation, Human Segmentation): The process of identifying and isolating the speaker or presenter in a video frame, separating them from the background and other visual elements. Speaker segmentation uses computer vision models to create precise masks around the speaker, enabling layout customization options…
Stereo Vision (also: Stereoscopic Vision, Stereo Camera System, Stereopsis): A computer vision technique that uses two or more cameras positioned at slightly different viewpoints to extract three-dimensional depth information from a scene, mimicking the way human binocular vision perceives depth. In assistive technology, stereo vision systems have been…
Stereoscopic Camera (also: Stereo Camera, Depth Camera, 3D Camera): A camera system that uses two or more lenses to capture images from slightly different perspectives, mimicking human binocular vision to compute depth information (disparity maps). In accessibility applications, stereoscopic cameras are used in assistive devices for visually…
Talking Head (also: Virtual Talking Head, Animated Face, 3D Talking Head): A talking head is a computer-generated 3D or 2D animated representation of a human face and articulatory system that produces visible speech movements synchronised with audio output. In accessibility and speech therapy contexts, talking heads are particularly valuable because…
Teachable Object Recognition (also: Teachable Object Recognizer, TOR, Personalized Object Recognition): A machine learning approach that allows users to train an object recognition system to identify their own personal items by providing a small number of training examples, typically photos or videos. This technology is particularly valuable for blind and low vision users who need…
Teachable Object Recognizer (also: Teachable Machine, Personalized Object Recognizer): A machine learning application that allows end users to train custom object recognition models by providing their own example images, rather than relying on pre-trained models with fixed categories. In accessibility contexts, teachable object recognizers empower blind and…
Text Spotting (also: Scene Text Detection): A computer vision technique that detects and localizes text within images in real time, without performing character recognition (OCR). Text spotting algorithms identify where text appears in a camera frame, its boundaries, and orientation. In accessibility applications, text…
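
Illustration for the Motion History Image entry: the rule that brighter pixels mark more recent motion can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference implementation. The function name `update_mhi` and the parameters `tau`, `decay`, and `threshold` are hypothetical, and motion is detected here by simple frame differencing, which is only one of several possible choices.

```python
from typing import List

# A grayscale frame as rows of 0-255 intensities (assumed representation).
Frame = List[List[int]]


def update_mhi(mhi: Frame, prev: Frame, curr: Frame,
               tau: int = 255, decay: int = 32, threshold: int = 20) -> Frame:
    """Return the next motion history image.

    Pixels whose intensity changed by more than `threshold` between
    `prev` and `curr` are set to `tau` (brightest, most recent motion).
    All other pixels fade by `decay` per frame, never going below 0.
    """
    out: Frame = []
    for mhi_row, prev_row, curr_row in zip(mhi, prev, curr):
        row = []
        for history, before, now in zip(mhi_row, prev_row, curr_row):
            if abs(now - before) > threshold:
                row.append(tau)                       # motion seen in this frame
            else:
                row.append(max(0, history - decay))   # older motion darkens
        out.append(row)
    return out
```

Calling `update_mhi` once per incoming frame, passing the previous result back in as `mhi`, yields a single grayscale image in which older movement gradually fades out.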
