Effect of Displaying Human Videos During an Evaluation Study of American Sign Language Animation

Hernisa Kacorri, Pengfei Lu, Matt Huenerfauth · 2013 · ACM Transactions on Accessible Computing · doi:10.1145/2517038

Summary

This paper addresses a critical methodological question in sign language animation research: how do the choice of upper baseline (human video vs. high-quality animation) and the modality of comprehension questions affect evaluation results? The authors conducted three phases of experiments with native ASL signers to isolate these effects. Phase 1 examined hand movements using a verb inflection animation system, finding that video upper baselines received higher subjective Likert scores than animation baselines. Phase 2 added facial expressions to the animations and found that video baselines also achieved higher comprehension scores than animation baselines. Phase 3 investigated whether presenting comprehension questions as videos versus animations affected participants' responses to the actual stimuli being evaluated.

The research involved careful methodological controls, including matching video recordings to animation content through scripting, coaching signers on facial expressions, and controlling for signing space locations. Participants were recruited from the New York City deaf community, with 16-18 native ASL signers per phase. The experiments used a fully factorial design in which no participant saw the same story twice, presentation order was randomized, and each participant viewed multiple quality levels of animation.
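The counterbalancing constraints described above (no story repeated for a participant, several quality levels seen by each participant, randomized presentation order) can be sketched with a Latin-square-style rotation. This is a minimal illustration only, not the authors' actual assignment procedure; the function name `assign_conditions`, the story IDs, and the condition labels are all hypothetical.

```python
import random


def assign_conditions(num_participants, stories, conditions, seed=0):
    """Build one trial list per participant.

    Each participant sees every story exactly once, and conditions are
    rotated across stories by participant index so that every condition
    is seen equally often. Illustrative sketch, not the paper's procedure.
    """
    if len(stories) % len(conditions) != 0:
        raise ValueError("story count must be a multiple of condition count")

    rng = random.Random(seed)  # seeded so the schedule is reproducible
    schedule = []
    for p in range(num_participants):
        # Shift the condition paired with each story by the participant index.
        trials = [
            (story, conditions[(i + p) % len(conditions)])
            for i, story in enumerate(stories)
        ]
        rng.shuffle(trials)  # randomize presentation order for this participant
        schedule.append(trials)
    return schedule


# Hypothetical stimuli: six stories, three quality levels.
stories = ["s1", "s2", "s3", "s4", "s5", "s6"]
conditions = ["upper_baseline", "model", "lower_baseline"]

for participant, trials in enumerate(assign_conditions(3, stories, conditions)):
    print(participant, trials)
```

With three conditions and three consecutive participants, the rotation pairs every story with every condition once, which is the property a fully factorial, within-subjects design of this kind relies on.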

Key findings

Video upper baselines consistently received higher subjective scores (grammaticality, understandability, naturalness) than animation upper baselines, with statistically significant differences in both the Phase 1 and Phase 2 experiments. When video upper baselines were displayed alongside other animations in side-by-side comparisons, the Likert scores for non-video stimuli decreased by 10-20% compared to when animation upper baselines were used. This "depressive effect" was significant during simultaneous presentation and partially significant during sequential presentation, particularly for animations with facial expressions. For comprehension question accuracy, Phase 2 showed that video upper baselines achieved significantly higher scores than animation baselines, likely because facial expressions (critical for ASL grammar) are rendered better by human signers than by current animation technology. However, Phase 3 demonstrated that the modality used to present comprehension questions (video vs. animation) did not significantly affect either comprehension scores or Likert ratings, provided the animations were of high quality. TOST (two one-sided tests) equivalence testing confirmed statistically equivalent results between the two question presentation modalities.
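For readers unfamiliar with TOST, the idea is to run two one-sided tests against a pre-chosen equivalence margin and to claim equivalence only if both reject. The sketch below uses a large-sample normal approximation with Python's standard library; it is not the authors' analysis. The function name `tost_equivalence`, the margin, the independent-groups assumption, and the sample scores are all assumptions made for illustration. With samples of 16-18 participants, a t-based version would be the more appropriate choice in practice.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_equivalence(a, b, margin, alpha=0.05):
    """Two one-sided tests (normal approximation) for two independent groups.

    Returns (p_lower, p_upper, equivalent). Equivalence is claimed only when
    both one-sided null hypotheses are rejected at level alpha.
    """
    diff = mean(a) - mean(b)
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    if se == 0:
        raise ValueError("zero standard error; the test is undefined")

    z = NormalDist()
    # H0: diff <= -margin, rejected when diff lies well above -margin.
    p_lower = 1 - z.cdf((diff + margin) / se)
    # H0: diff >= +margin, rejected when diff lies well below +margin.
    p_upper = z.cdf((diff - margin) / se)
    return p_lower, p_upper, max(p_lower, p_upper) < alpha


# Made-up comprehension accuracies (fraction correct), NOT data from the paper.
video_questions = [0.70, 0.75, 0.80, 0.72, 0.78, 0.74, 0.76, 0.73]
animation_questions = [0.71, 0.74, 0.79, 0.73, 0.77, 0.75, 0.76, 0.72]

p_lower, p_upper, equivalent = tost_equivalence(
    video_questions, animation_questions, margin=0.10
)
print(p_lower, p_upper, equivalent)
```

The choice of margin drives the conclusion: the same data can be "equivalent" under a generous margin and inconclusive under a strict one, so the margin should be fixed before looking at the results.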

Relevance

This research provides essential methodological guidance for anyone conducting user studies to evaluate sign language animation systems. The findings reveal that methodological choices, which may seem incidental, can systematically bias results, making fair comparisons between studies difficult. Practitioners developing sign language accessibility tools should understand that their evaluation metrics may appear worse if they use video upper baselines, even if their animation quality is identical to that of competitors who used animation baselines. The paper offers concrete recommendations: use video baselines when studying photorealism or communicating results to lay audiences; use animation baselines when precise control over linguistic variables is needed. The challenge of producing matching video content (due to ASL's lack of a standard written form and its regional and dialectal variation) also highlights the complexity of evaluation in this field. For accessibility researchers more broadly, this work demonstrates the importance of understanding how evaluation methodology itself can influence the perceived quality of assistive technologies.

Tags: sign language animation · ASL · deaf accessibility · evaluation methodology · user studies · virtual humans