What Visual Language Models Can (Not) Do for Document Analysis

Visual Language Models (VLMs) are rapidly reshaping document analysis, promising unified reasoning over text, layout, and visual structure. Yet important questions remain open: How capable are lightweight and open models in real-world scenarios? What do current evaluations actually measure? And are fully end-to-end systems truly the right direction for document understanding?

In this talk, I present recent work on lightweight VLMs for tasks such as structured OCR, information extraction, and document image machine translation. Drawing on evaluations on historical archives and complex structured documents, I discuss common challenges, including reading-order reasoning, schema adherence, hallucinations, and the gap between benchmark performance and practical usability.

I argue that efficient, deployable document understanding systems may depend less on scaling model size and more on hybrid multimodal pipelines and constrained decoding.