Visual Document Understanding

Also known as: VDU, Document Understanding

A field of AI research focused on the interpretation and analysis of visually rich digital documents such as forms, tables, menus, reports, receipts, and academic papers. Visual document understanding goes beyond basic OCR text extraction: it also interprets the spatial layout, visual relationships, legends, icons, and hierarchical structure that convey meaning in documents designed for visual consumption. Techniques include OCR-dependent approaches (such as LayoutLMv2, which combines OCR-extracted text with layout information) and OCR-free approaches (such as Donut, which maps document images directly to structured outputs). In accessibility, visual document understanding is critical for making image-based documents such as restaurant menus, infographics, and scanned forms usable by screen reader users.
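As a rough illustration of the OCR-dependent idea, the sketch below pairs each OCR word with a bounding box scaled to a 0 to 1000 grid, the coordinate convention used by LayoutLM-family models. The OcrWord type, the function names, and the sample menu data are hypothetical and exist only for this example; a real pipeline would also run a tokenizer and a model, which are omitted here.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """One word returned by an OCR engine (hypothetical structure)."""
    text: str
    box: tuple  # (x0, y0, x1, y1) in page pixels


def normalize_box(box, page_width, page_height, scale=1000):
    """Scale a pixel box to a 0..scale grid so layout is independent of page size."""
    x0, y0, x1, y1 = box
    return (
        scale * x0 // page_width,
        scale * y0 // page_height,
        scale * x1 // page_width,
        scale * y1 // page_height,
    )


def build_layout_inputs(words, page_width, page_height):
    """Pair each word's text with its normalized box: text plus layout."""
    return [
        (w.text, normalize_box(w.box, page_width, page_height))
        for w in words
    ]


# Illustrative data: one line of a restaurant menu on an 800x1000 pixel page.
words = [
    OcrWord("Espresso", (50, 100, 250, 140)),
    OcrWord("$3.00", (600, 100, 700, 140)),
]
inputs = build_layout_inputs(words, page_width=800, page_height=1000)
```

Both words share the same vertical coordinates, which is the kind of layout signal that lets a model associate the price with the item on the same row. An OCR-free approach skips this step entirely and takes the page image as its only input.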

Category: artificial intelligence · computer vision · document accessibility

Related: Optical Character Recognition · Multimodal Large Language Model · Document Accessibility · Reading Order

Sources

https://dl.acm.org/doi/10.1145/3744257.3744275