Multimodal Summarization of Complex Sentences

Naushad UzZaman, Jeffrey P. Bigham, James F. Allen · 2011 · Proceedings of the 16th International Conference on Intelligent User Interfaces (IUI 2011) · doi:10.1145/1943403.1943412

Summary

This paper introduces the concept of multimodal summarization (MMS) for complex sentences: automatically generating diagrams that combine pictures, simplified and compressed text, and structural layout to help people understand difficult text. The system, ROC-MMS, targets people who have difficulty reading, including children, older adults, people with cognitive disabilities, second language learners, and the estimated 2 million Americans with significant communication impairments. ROC-MMS works in three steps: (1) identifying the main event and its related entities (subject, object, prepositions) in a complex sentence, using the TRIOS temporal event extraction system and the Stanford dependency parser; (2) extracting a representative picture for each entity from Wikipedia and web image search; and (3) combining these elements into a structured visual layout that shows who did what, to whom, and how. Unlike prior automatic illustration systems, which worked only with simple text, the system handles complex and compound sentences. Pictures are sourced primarily from Wikipedia to ensure quality, using infobox images and filename-based scoring, with web image search as a fallback. Temporal expressions receive special treatment: they are shown with representative images rather than illustrated literally.
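To make the "who did what, to whom, and how" structure concrete, the following is a minimal illustrative sketch of how the output of the three steps could be held in a data structure and laid out as text. It is not the ROC-MMS implementation: the class names `Entity` and `EventDiagram`, the `render` method, the example sentence, and the image file names are all hypothetical, and the extraction and image lookup steps are assumed to have already run.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    """One entity attached to the main event (hypothetical structure)."""
    role: str                    # "subject", "object", or a preposition such as "in"
    text: str                    # compressed text displayed with the picture
    image: Optional[str] = None  # placeholder reference to a picture, if one was found


@dataclass
class EventDiagram:
    """Main event plus its entities, rendered as a simple text layout."""
    event: str
    entities: List[Entity] = field(default_factory=list)

    def render(self) -> str:
        # One line for the event, then one indented line per entity.
        lines = [f"EVENT: {self.event}"]
        for entity in self.entities:
            picture = entity.image if entity.image else "(no picture)"
            lines.append(f"  {entity.role}: {entity.text} [{picture}]")
        return "\n".join(lines)


# Hypothetical example: "The committee approved the budget in March."
diagram = EventDiagram(
    event="approved",
    entities=[
        Entity("subject", "the committee", "committee.jpg"),
        Entity("object", "the budget", "budget.jpg"),
        Entity("in", "March"),  # temporal expression, no picture assigned here
    ],
)
print(diagram.render())
```

Keeping the role label, the compressed text, and the picture together in one record mirrors the summary's point that the diagram depends on all three elements, not on pictures alone.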

Key findings

The main event identification classifier achieved an F-score of 75.98% (precision 79.10%, recall 73.11%) on Wikipedia domain sentences. Entity extraction achieved a relaxed precision of 76.76% and a recall of 83.82%. The image extraction system produced acceptable pictures approximately 65% of the time, comparable to the annotator-to-annotator agreement of 66.66%. Critically, the evaluation demonstrated that pictures alone are insufficient for understanding complex sentences: when crowd workers were shown picture-only diagrams, even ones with human-selected images, they largely could not produce accurate explanations of the original sentence. In contrast, ROC-MMS diagrams that combine pictures with compressed text and structure yielded significantly better comprehension, with a ROUGE-1 F-score of 0.24 compared to 0.089 for the annotators' picture-only diagrams. This demonstrates that the combination of text, pictures, and structure is essential; none of them alone suffices for understanding complex content.
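For readers unfamiliar with the metrics above, the sketch below shows how an F-score follows from precision and recall, and how a ROUGE-1 F-score can be computed from unigram overlap. It assumes the standard balanced F1 definition and a simplified ROUGE-1 with lowercasing and whitespace tokenization; the paper's exact evaluation setup (tokenization, stemming, stop-word handling) is not stated in this summary, and the function names are illustrative.

```python
from collections import Counter


def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def rouge1_f(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F-score based on clipped unigram overlap."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return f_score(precision, recall)


# Reported precision and recall for main event identification.
print(round(f_score(0.7910, 0.7311), 3))

# Toy strings (not from the paper) comparing an explanation to a reference.
print(rouge1_f("the cat sat", "the cat sat on the mat"))
```

With the reported precision and recall, the first call gives roughly 0.76, which agrees with the stated 75.98% up to rounding of the inputs.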

Relevance

This research addresses a critical accessibility need: making complex text comprehensible to people with reading difficulties, cognitive disabilities, or limited language proficiency. The multimodal approach aligns with augmentative and alternative communication (AAC) principles and with universal design for learning, both of which call for multiple representations of the same information. For accessibility practitioners, the key insight is that pictures alone do not adequately convey the meaning of complex content; structured text is an essential complement. This has implications for text simplification and content accessibility: rather than choosing between text and visual alternatives, combining them with clear structure produces the best comprehension. The work is also relevant to plain language initiatives, automated text summarization for cognitive accessibility, and symbol-based communication systems used by people with intellectual or developmental disabilities.

Tags: cognitive accessibility · reading accessibility · natural language processing · augmentative and alternative communication · text simplification · image accessibility · multimodal interaction