Extraction of Tabular Data from Document Images

Manolis Vasileiadis, Nikolaos Kaklanis, Konstantinos Votis, Dimitrios Tzovaras · 2017 · Proceedings of the 14th International Web for All Conference (W4A) · doi:10.1145/3058555.3058581

Summary

This demonstration paper presents an open-source tool that automatically detects tabular data in document images, both scanned and digitally created, and converts it into accessible HTML. The work addresses a significant accessibility gap: tables are widely used in documents and reports to organize data efficiently, but when a document exists only as an image (a scanned PDF, a photograph of a printed page), its tabular structure is invisible to screen readers. Running OCR alone is insufficient because plain text extraction loses the row-column relationships that give the data meaning.

The proposed method is a four-stage heuristic pipeline. First, the document image is preprocessed: it is binarized using Wolf-Jolion binarization, resampled to 300 dpi for optimal OCR performance, and stripped of grid lines, and page segmentation via the Leptonica library separates text from non-text areas. Multi-column layouts are detected by searching for continuous vertical empty spaces. Second, Google Tesseract v3.04 OCR extracts the text along with bounding boxes, confidence scores, and font information for each word. Third, the table reconstruction algorithm classifies each line into one of three types based on its word segment pattern: text lines (a single long segment), table lines (multiple segments), and unknown lines (a single short segment). Adjacent table and unknown lines are grouped into initial table areas, and a column generation algorithm then aligns segments horizontally to identify columns, with heuristic rules handling edge cases such as merged columns and multi-line rows. Fourth, the reconstructed tables are exported as HTML.
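The line classification and grouping step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Word` type, the pixel thresholds `GAP_THRESHOLD` and `LONG_SEGMENT`, and the rule that a table area must contain at least one table line are all assumptions made for illustration, since this summary does not give the paper's actual values or rules.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical thresholds in pixels; the paper's real values are not stated here.
GAP_THRESHOLD = 40   # a horizontal gap this wide starts a new segment
LONG_SEGMENT = 400   # a lone segment at least this wide counts as running text


@dataclass
class Word:
    """One OCR word with its horizontal extent (hypothetical structure)."""
    text: str
    left: int
    right: int


def segments(words: List[Word]) -> List[Tuple[int, int]]:
    """Merge horizontally close words into (left, right) segments."""
    segs: List[Tuple[int, int]] = []
    for w in sorted(words, key=lambda w: w.left):
        if segs and w.left - segs[-1][1] < GAP_THRESHOLD:
            segs[-1] = (segs[-1][0], max(segs[-1][1], w.right))
        else:
            segs.append((w.left, w.right))
    return segs


def classify(words: List[Word]) -> str:
    """Label a line as 'text', 'table', or 'unknown' from its segments."""
    segs = segments(words)
    if len(segs) > 1:
        return "table"      # multiple segments
    if segs and segs[0][1] - segs[0][0] >= LONG_SEGMENT:
        return "text"       # single long segment
    return "unknown"        # single short segment (or an empty line)


def table_areas(labels: List[str]) -> List[Tuple[int, int]]:
    """Group runs of adjacent table/unknown lines into (first, last) ranges.

    Assumption: a run made only of unknown lines is not kept as a table area.
    """
    areas: List[Tuple[int, int]] = []
    start = None
    for i, label in enumerate(labels + ["text"]):  # sentinel closes a trailing run
        if label in ("table", "unknown"):
            if start is None:
                start = i
        elif start is not None:
            if "table" in labels[start:i]:
                areas.append((start, i - 1))
            start = None
    return areas


# Made-up input: one paragraph line, three table-like lines, one paragraph line.
lines = [
    [Word("Quarterly", 0, 150), Word("results", 160, 260), Word("by", 270, 300),
     Word("sales", 310, 380), Word("region", 390, 480)],
    [Word("Region", 0, 90), Word("Sales", 300, 370)],
    [Word("North", 0, 70), Word("120", 300, 340)],
    [Word("Total", 0, 60)],
    [Word("Figures", 0, 110), Word("are", 120, 160), Word("in", 170, 195),
     Word("thousands", 205, 340), Word("of", 350, 375), Word("euros", 385, 460)],
]
labels = [classify(line) for line in lines]
areas = table_areas(labels)
print(labels)
print(areas)
```

The lone "Total" line is too short to be judged on its own, so it is labeled unknown and absorbed into the neighboring table area. Column generation would then run only inside each returned range.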

Key findings

In an evaluation on 45 document images randomly collected from the internet, covering diverse layouts (tables with and without grid lines, multi-line and multi-column text, tables with and without headers), the tool achieved 88.30% precision and 97.22% recall for table detection, and 89.45% precision and 93.70% recall for individual cell recognition. False positive table detections occurred mainly with strictly formatted non-tabular text such as right-aligned text blocks. Cell accuracy decreased when table data was poorly aligned, particularly with misaligned headers. The method requires no prior training and is layout-agnostic: it does not rely on document-specific cues such as the word "Table" or visible grid lines, and instead treats tables as strictly formatted text segments identified by spatial patterns. The tool also handles multi-page documents, including automatic detection and removal of repeated headers and footers.
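For readers less familiar with the two metrics, the short Python sketch below shows how precision and recall are computed from detection counts. The counts are invented for illustration and are not taken from the paper, which is summarized here only by its percentages. The function name and the zero-denominator fallback are likewise assumptions.

```python
from typing import Tuple


def precision_recall(true_positives: int, false_positives: int,
                     false_negatives: int) -> Tuple[float, float]:
    """Return (precision, recall); a zero denominator yields 0.0 by assumption."""
    detected = true_positives + false_positives   # everything the tool reported
    actual = true_positives + false_negatives     # everything really present
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Illustrative counts only, NOT the paper's data:
# 8 tables found correctly, 2 spurious detections, 1 table missed.
p, r = precision_recall(true_positives=8, false_positives=2, false_negatives=1)
print(f"precision={p:.2%} recall={r:.2%}")
```

Recall above precision, as in the reported table detection figures, means that spurious detections were more common than missed tables. That fits the false positives described for strictly formatted non-tabular text.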

Relevance

This work addresses a practical and persistent accessibility barrier: tabular data trapped in document images is inaccessible to screen reader users. While born-digital documents can include properly marked-up HTML tables, a vast amount of existing documentation (government forms, financial reports, academic papers, historical records) exists only as scanned images. Converting these to accessible formats requires not just OCR but a structural understanding of table layouts. Because the tool is open source (available on GitHub under the EU-funded Prosperity4All project), it is a practical resource for organizations working to remediate document accessibility. For web accessibility practitioners, the research highlights that accessible table markup (proper use of the th, td, and caption elements and the scope attribute) is essential: these are exactly the semantic structures that must be reconstructed when converting inaccessible document images. The authors note that further work is needed to make the tool's own interface accessible to its target audience of visually impaired users.
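As a concrete picture of the target markup, the following Python sketch serializes a reconstructed table into HTML with a caption, column headers, and row headers. It is a hedged illustration rather than the tool's actual output format. The function `to_accessible_html` and its parameters are hypothetical, and treating the first cell of each row as a row header is an assumption that will not suit every table.

```python
from html import escape
from typing import List


def to_accessible_html(caption: str, header: List[str],
                       rows: List[List[str]]) -> str:
    """Build an HTML table with caption, th, scope, and td markup."""
    parts = ["<table>", f"  <caption>{escape(caption)}</caption>",
             "  <thead>", "    <tr>"]
    # Column headers: scope="col" ties each header to the cells below it.
    parts += [f'      <th scope="col">{escape(h)}</th>' for h in header]
    parts += ["    </tr>", "  </thead>", "  <tbody>"]
    for row in rows:
        parts.append("    <tr>")
        # Assumption: the first cell labels the row, so it becomes a row header.
        parts.append(f'      <th scope="row">{escape(row[0])}</th>')
        parts += [f"      <td>{escape(cell)}</td>" for cell in row[1:]]
        parts.append("    </tr>")
    parts += ["  </tbody>", "</table>"]
    return "\n".join(parts)


# Made-up table content for illustration.
html_table = to_accessible_html(
    caption="Sales by region",
    header=["Region", "Sales"],
    rows=[["North", "120"], ["R&D", "<10"]],
)
print(html_table)
```

With this markup a screen reader can announce the relevant column and row headers as the user moves between cells, which plain OCR text cannot support.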

Tags: document accessibility · OCR · table recognition · visual impairment · document conversion · image processing