American Sign Language Generation: Multimodal NLG with Multiple Linguistic Channels
Matt Huenerfauth · 2005 · Proceedings of the ACL Student Research Workshop (ACLstudent '05) · doi:10.5555/1628960.1628968
Summary
This short student-research-workshop paper presents the design rationale for Huenerfauth's English-to-ASL machine translation system, framing American Sign Language generation as a form of multimodal natural language generation (NLG) with multiple parallel linguistic channels. The motivation is familiar from the author's other work: most deaf U.S. high school graduates read English at around a fourth-grade level, so translating English text into an animated virtual character performing ASL would make content such as closed captions, TTY, and user interfaces more accessible. The paper's contribution is conceptual rather than empirical. It argues that ASL NLG differs fundamentally from text-based NLG, and also from prior multimodal NLG (driving directions, embodied conversational agents), because no single linguistic channel dominates: meaning is spread simultaneously and hierarchically across the hands, eye gaze, mouth, facial expression, head tilt, and shoulder tilt, all of which must be generated and time-coordinated. The paper distinguishes two ASL subsystems: lexical signing (LS), where signs are syntactically combined and 3D space is used arbitrarily for pronouns and verb agreement, and classifier predicates (CPs), where the signer's hands 'draw' a topologically accurate 3D scene using handshapes chosen from semantic classes (moving vehicles, seated animals, upright humans, etc.). CPs occur 1–17 times per minute depending on genre. Because English sentences that translate to CPs look structurally very different from their ASL forms, they are exactly the sentences that low-literacy deaf readers most need translated.
Key findings
Huenerfauth's design is notable as the first English-to-ASL MT system to attempt classifier predicates at all. The system is organised around four features driven by ASL's multichannel and multi-subsystem nature: (1) grammar-like coordination formalisms that let complex signals on multiple channels be represented together rather than flattened to a string; (2) ASL-tailored computational-linguistic models of discourse, semantics, syntax, and sign phonology, all time-indexed to an animation timeline; (3) a 3D semantic model in which invisible placeholders representing real-world objects are positioned in the space around the signer (for CPs, scene-visualisation software analyses the English input and the resulting 3D layout is 'overlaid' in front of the virtual signer, so that object motion paths drive hand motion paths); and (4) a multi-path architecture in which only sentences that require CP generation go through the expensive scene-visualisation pipeline, while ordinary sentences take a simpler MT route. CPs are produced by a planning-based NLG approach that selects and parameterises templates of prototypical CP performances. An evaluation was planned but had not been conducted at the time of writing: native ASL signers would rate system output, with motion-captured human CP performances as a control. The author explicitly argues that user-based evaluation is more meaningful than automatic string-comparison metrics because ASL has no widely used writing system against which to compare output.
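To make the multichannel and multi-path ideas concrete, here is a minimal Python sketch. It is not the paper's formalism or implementation: the `Event` and `Utterance` classes, the `choose_path` function, the path labels, and the example channel values are all hypothetical and chosen only for illustration. The channel names are the ones listed in the summary above.

```python
from dataclasses import dataclass, field

# Channel names follow the summary; everything else here is an illustrative assumption.
CHANNELS = ("hands", "eye_gaze", "mouth", "facial_expression",
            "head_tilt", "shoulder_tilt")


@dataclass(frozen=True)
class Event:
    """One signal on one channel, indexed to a shared animation timeline."""
    channel: str
    start: float  # seconds
    end: float    # seconds (exclusive)
    value: str


@dataclass
class Utterance:
    """Parallel channels kept side by side rather than flattened to a string."""
    events: list = field(default_factory=list)

    def add(self, channel, start, end, value):
        if channel not in CHANNELS:
            raise ValueError(f"unknown channel: {channel}")
        if end <= start:
            raise ValueError("event must have positive duration")
        self.events.append(Event(channel, start, end, value))

    def active_at(self, t):
        """Return {channel: value} for every event whose interval covers time t."""
        return {e.channel: e.value for e in self.events if e.start <= t < e.end}


def choose_path(needs_classifier_predicate):
    """Hypothetical router: only CP sentences take the costly scene pipeline."""
    return "scene_visualisation" if needs_classifier_predicate else "simple_mt"


# Example: three channels overlapping in time (values are made up).
u = Utterance()
u.add("hands", 0.0, 1.2, "vehicle handshape moves along path")
u.add("eye_gaze", 0.0, 1.2, "follows moving hand")
u.add("facial_expression", 0.4, 1.2, "illustrative facial signal")
```

Querying `u.active_at(0.5)` returns entries for all three channels at once, which is the property a flat string of glosses cannot express. How a real system decides whether a sentence needs a classifier predicate is outside the scope of this sketch.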
Relevance
For accessibility practitioners and researchers, this paper is worth reading less for empirical results than for the framing it establishes: sign-language generation is not simply text-to-text translation with an animated front-end bolted on, but a fundamentally different linguistic problem that requires multiple time-synchronised output channels and topologically accurate 3D spatial representation. That framing is still load-bearing for modern signing-avatar and neural sign-language generation systems, which continue to struggle with non-manual markers, classifier predicates, and spatial reference. The paper also makes a useful methodological point for anyone evaluating multichannel generation systems: individual channels can be isolated for evaluation (e.g., combining a motion-captured body with system-generated hands), enabling experiments that are not possible with single-channel text generation. Limitations: this is a 2005 design overview of a system whose implementation was still in progress, it reports no empirical results, and the state of the art has moved substantially since. Even so, the conceptual framework remains a reasonable starting point for understanding why signing-avatar systems are hard and why high-quality ASL generation cannot be reduced to sequence-to-sequence translation.
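The channel-isolation idea can be sketched in a few lines of Python. This is an assumption-laden illustration, not the paper's evaluation procedure: the `mix_channels` function and the dictionary-per-performance representation are hypothetical, and the channel values are placeholders.

```python
def mix_channels(mocap, generated, from_system):
    """Build a test stimulus: motion-capture on every channel except those
    named in from_system, which are taken from the system's output instead."""
    missing = [c for c in from_system if c not in generated]
    if missing:
        raise KeyError(f"system output lacks channels: {missing}")
    mixed = dict(mocap)              # start from the human performance
    for channel in from_system:
        mixed[channel] = generated[channel]
    return mixed


# Placeholder performances keyed by channel name.
mocap = {"hands": "human hands", "eye_gaze": "human gaze",
         "head_tilt": "human head"}
generated = {"hands": "system hands", "eye_gaze": "system gaze",
             "head_tilt": "system head"}

# Isolate the hands: only that channel comes from the system.
stimulus = mix_channels(mocap, generated, from_system={"hands"})
```

Here `stimulus` keeps the human gaze and head movement while swapping in system-generated hands, so any difference in ratings relative to the full motion-capture control can be attributed to the hand channel.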
Tags: American Sign Language · natural language generation · sign language machine translation · multimodal NLG · classifier predicates · embodied conversational agents · deaf accessibility · signing avatar · computational linguistics