Collecting a Motion-Capture Corpus of American Sign Language for Data-Driven Generation Research

Pengfei Lu, Matt Huenerfauth · 2010 · Proceedings of the NAACL HLT 2010 Workshop on Speech and Language Processing for Assistive Technologies (SLPAT '10) · doi:10.5555/1867750.1867765

Summary

This workshop paper describes the first year of a multi-year project at CUNY to build a motion-capture corpus of American Sign Language (ASL) intended specifically to support data-driven ASL animation and machine-translation research. The authors argue that existing ASL animation systems produce output that native signers find hard to understand because those systems were built without the kind of richly annotated corpora that drive modern NLP. They enumerate six linguistic phenomena that make ASL particularly hard for natural language processing: Timing, Spatial Reference, Inflection of verbs to encode subject/object locations, Coarticulation between adjacent signs, Non-Manual markers carried on the face and head, and Evaluation in the absence of a written form. They then show that prior sign-language corpora (transcription-based or single-sign lexicons) cannot support modelling these phenomena at the level of detail needed to drive a virtual signing character.

The bulk of the paper describes the lab's novel motion-capture configuration: two 22-sensor Immersion CyberGloves for handshape, an Applied Science Labs H6 head-mounted eye-tracker, an Intersense IS-900 acoustic/inertial system for head position, an Animazoo IGS-190 bodysuit for upper-body joint angles, three synchronised high-speed video cameras (front, facial close-up, and side), and a blue-screen backdrop so that the video can later be reused for computer-vision research. During data collection, a native-signer 'prompter' sits behind the camera to keep the studio ASL-immersive and to prevent English code-switching. In year one, 58 passages totalling ~40 minutes were collected from 6 signers.
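For a sense of scale, the year-one figures imply passages well under a minute long on average. The following is a minimal arithmetic sketch, assuming the ~40 minutes is total recorded signing time; the constant and function names are illustrative and do not come from the paper.

```python
# Year-one corpus figures as quoted in the summary; the 40 minutes is approximate.
PASSAGES = 58
TOTAL_MINUTES = 40
SIGNERS = 6


def mean_passage_seconds(passages: int, total_minutes: float) -> float:
    """Average passage duration in seconds."""
    return total_minutes * 60 / passages


def mean_passages_per_signer(passages: int, signers: int) -> float:
    """Average number of passages contributed by each signer."""
    return passages / signers


print(f"{mean_passage_seconds(PASSAGES, TOTAL_MINUTES):.0f} s per passage")
print(f"{mean_passages_per_signer(PASSAGES, SIGNERS):.1f} passages per signer")
```

These are averages only; the summary does not say how the passages were distributed across signers.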

Key findings

The paper's empirical contribution is an evaluation study confirming that the authors' motion-capture configuration produces data of sufficient quality to drive understandable ASL animation. This is a non-trivial result, since an earlier project by the same group using different equipment produced motion-capture animations that were 'barely understandable' due to dropped connections, poor calibration, and data noise. Twelve native ASL signers rated three versions of ten matched stories: (a) scripted animations with the authors' linguistically motivated timing/pause algorithm, (b) scripted animations with default timing, and (c) animations driven directly from the motion-capture recording (with no face, since the mocap rig does not digitise facial expression). Using 10-point Likert scales plus multiple-choice comprehension questions, the motion-capture animations scored similarly to the state-of-the-art scripted animations on grammaticality and understandability, and significantly higher on naturalness of movement, despite a slightly different signing rate (1.12 vs 1.2 signs/sec) and a complete lack of facial expression. The scripted timing-algorithm animations retained the edge on comprehension-question accuracy, which the authors attribute to their linguistically motivated pauses and to the presence of facial expressions in the scripted output.
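The two signing rates quoted above are close: the gap is roughly 7% of the slower rate. The following is a small sketch of that arithmetic; the helper name is hypothetical, and the code makes no assumption about which rate belongs to which animation condition.

```python
def relative_gap(rate_a: float, rate_b: float) -> float:
    """Gap between two rates, expressed as a fraction of the slower one."""
    slow, fast = sorted((rate_a, rate_b))
    return (fast - slow) / slow


# Signing rates in signs per second, as quoted in the summary.
gap = relative_gap(1.12, 1.2)
print(f"{gap:.1%}")
```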

Relevance

For accessibility practitioners and researchers working on deaf-focused assistive technology, this paper is significant mainly as a methodology artefact. It documents in practical detail what a usable ASL motion-capture pipeline looks like: the specific equipment combination, the calibration and synchronisation protocols, and the studio conventions (native prompter, blue screen, strobe-synced cameras) that distinguish research-grade ASL data from the limited transcription-based 'corpora' used in earlier sign-language MT work. For consumers of ASL animation systems, the key takeaway is that animations driven by good motion-capture data can score similarly to scripted animations on understandability and higher on naturalness of movement, but comprehension still depends heavily on correct timing, pausing, and non-manual (facial) expressions, none of which comes 'for free' from raw motion capture. Limitations worth flagging: the mocap rig captures no facial data, the evaluation uses only 12 participants and 10 stories, and the broader ethical debate about signing avatars as substitutes for human interpreters is outside the paper's scope. The long-term vision of making ASL a 'normal' language for NLP researchers remains the right strategic goal.

Tags: American Sign Language · ASL animation · motion capture · sign language corpus · deaf accessibility · natural language processing · signing avatar · data-driven generation · computational linguistics · sign language machine translation