Design and Evaluation of an American Sign Language Generator

Matt Huenerfauth, Liming Zhao, Erdan Gu, Jan Allbeck · 2007 · Proceedings of the Workshop on Embodied Language Processing (EmbodiedNLP 2007) · doi:10.5555/1610065.1610072

Summary

Huenerfauth, Zhao, Gu, and Allbeck (2007) describe the implementation and user evaluation of a prototype system that generates animations of American Sign Language (ASL) classifier predicates: spatially complex hand movements that trace the location, motion, shape, or contour of real-world entities in the 3D space around the signer. The motivation is accessibility. Roughly half a million deaf Americans use ASL as a primary language, and most deaf U.S. 18-year-olds read English well below the level of a typical 10-year-old hearing student, so English-to-ASL translation systems can broaden access to information and services. Prior English-to-ASL MT systems (Sáfár & Marshall 2001; Zhao et al. 2000) could not generate classifier predicates, the spatial constructions most essential for describing scenes, so the authors' contribution fills a specific and long-standing gap. The system takes a 3D scene model as input (coordinates and orientations for the objects being described) and uses a library of template-based planning operators to produce a multichannel animation plan that coordinates handshape, hand location, eye gaze, head tilt, and brow position across the timeline of the classifier predicate. The resulting specification drives a virtual human character (the Greta facial animation engine plus the Virtual Human Testbed body); inverse-kinematics-based arm-pose selection prefers natural, collision-free poses. The paper also reports a user study with 15 native ASL signers comparing the prototype's classifier-predicate animations against Signed English transliterations, the broad-coverage baseline representing the state of the art at the time.
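To make the idea of a multichannel animation plan concrete, here is a minimal Python sketch of how one template might turn a scene object into time-aligned segments on the five channels named above. It is an illustration only, not the authors' implementation: every name in it (SceneObject, Segment, AnimationPlan, locate_object_template), the coordinate values, the handshape label, and the choice to aim eye gaze at the hand's target location are assumptions made for the example.

```python
from dataclasses import dataclass, field

# The five channels listed in the summary; the identifiers are hypothetical.
CHANNELS = ("handshape", "hand_location", "eye_gaze", "head_tilt", "brow")


@dataclass
class SceneObject:
    """One entity from the 3D scene model (hypothetical fields)."""
    name: str
    position: tuple   # (x, y, z) in signing-space coordinates, illustrative
    handshape: str    # classifier handshape label for this entity, illustrative


@dataclass
class Segment:
    """A value held on one channel over a time interval, in seconds."""
    start: float
    end: float
    value: object


@dataclass
class AnimationPlan:
    """Parallel timelines, one list of segments per channel."""
    channels: dict = field(default_factory=lambda: {c: [] for c in CHANNELS})

    def add(self, channel, start, end, value):
        self.channels[channel].append(Segment(start, end, value))


def locate_object_template(obj, start, duration=1.0):
    """Hypothetical template: show where an object is located.

    All five channels receive a segment over the same interval, which is
    what keeps the manual and non-manual parts of the sign coordinated.
    """
    plan = AnimationPlan()
    end = start + duration
    plan.add("handshape", start, end, obj.handshape)
    plan.add("hand_location", start, end, obj.position)
    plan.add("eye_gaze", start, end, obj.position)  # assumption: gaze follows the hand
    plan.add("head_tilt", start, end, "neutral")
    plan.add("brow", start, end, "neutral")
    return plan


# Illustrative input: a single object with made-up coordinates.
car = SceneObject(name="car", position=(0.2, 0.0, 0.4), handshape="vehicle-classifier")
plan = locate_object_template(car, start=0.0)
for channel, segments in plan.channels.items():
    print(channel, [(s.start, s.end, s.value) for s in segments])
```

Keeping each channel as its own timeline means a downstream animation engine could read the face and body channels independently while still relying on shared start and end times.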

Key findings

The user study used four outcome measures: grammaticality, understandability, naturalness of movement, and a comprehension 'matching task' in which signers chose which of three animated 3D scenes matched each ASL sentence. The prototype outperformed the Signed English baseline on all four measures, and the differences were statistically significant under pairwise Mann-Whitney U tests with Bonferroni correction (α = 0.05). The gap on the matching task was particularly striking: signers could reliably identify the scene described by the ASL classifier predicate but often could not do so from the Signed English transliteration, which carries no spatial information beyond English word order. The authors also argue, on methodological grounds, that string-based automatic evaluation metrics are poorly suited to sign-language generation: there is no standard ASL writing system, classifier predicates encode 3D information that cannot be represented as a string, non-manual signals are lost in string form, and real users experience animation output rather than text. They propose user-based evaluation instead, in which native signers score grammaticality and naturalness on Likert scales and complete comprehension matching tasks. Qualitative feedback from signers identified concrete areas for improvement: more facial expression, less animation choppiness, more eye-gaze variation, and specific lexical corrections.
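For readers unfamiliar with the statistics mentioned above, the following Python sketch shows one way a Mann-Whitney U test with Bonferroni correction can be computed using only the standard library. It is a sketch under stated assumptions, not the authors' analysis: the scores are invented for illustration, the p-value uses a tie-corrected normal approximation without continuity correction, and the set of comparisons that the correction is applied over is an assumption.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return (U, two-sided p) using a tie-corrected normal approximation."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    combined = list(a) + list(b)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    # Tie correction: sum of (t^3 - t) over each group of tied values.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # every observation identical: no evidence of a difference
        return u, 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return u, p


def bonferroni_comparisons(comparisons, alpha=0.05):
    """Run each (label, sample_a, sample_b) test and adjust p by the test count."""
    m = len(comparisons)
    results = {}
    for label, a, b in comparisons:
        u, p = mann_whitney_u(a, b)
        p_adj = min(1.0, p * m)
        results[label] = (u, p_adj, p_adj < alpha)
    return results


# Invented ratings, for illustration only; these are NOT data from the paper.
prototype_scores = [8, 9, 7, 9, 8, 10, 7, 9]
baseline_scores = [4, 5, 3, 6, 5, 4, 6, 5]
results = bonferroni_comparisons([
    ("grammaticality", prototype_scores, baseline_scores),
    ("naturalness", prototype_scores, baseline_scores),
])
for label, (u, p_adj, significant) in results.items():
    print(label, u, round(p_adj, 4), significant)
```

A rank-based test suits Likert ratings because it relies only on the ordering of the scores and does not assume they are normally distributed. For small samples an exact p-value is preferable to the normal approximation used here.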

Relevance

This paper is foundational for anyone evaluating signing-avatar technology, whether as an accessibility practitioner procuring such a product or as a researcher building one. Two lessons travel well. First, string-based automatic metrics should not be trusted for sign-language systems; comprehension matching tasks with native signers are the defensible ground truth. Second, Signed English transliteration is a weak baseline: it may look like sign language to hearing observers, but it fails Deaf users on exactly the spatial-description tasks that matter most, so accessibility products that rely on signed-English-style output are not delivering ASL access. The paper also supports the broader sign-avatar design principle that non-manual signals (eye gaze, brow raising, head tilt) and classifier predicates must be modelled together, because piecemeal implementations produce uncanny, ungrammatical output. Limitations include a narrow test set of ten sentences, a single signing-avatar character, and a prototype that produces only location/movement classifier predicates rather than the full range.

Tags: ASL · American Sign Language · deaf accessibility · sign language · sign language animation · sign language generation · signing avatar · virtual human · animation · classifier predicates · non-manual markers · eye gaze · Signed English · transliteration · inverse kinematics · user-based evaluation · evaluation methodology · Deaf community · computational linguistics · natural language generation