Extracting Structured Data from Multi-Modal Input

Aishni Parab
University of California, Los Angeles (UCLA)
Statistics

Many real-world images, such as tables, charts, road signs, and maps, combine text and visual elements. Because these multi-modal images tightly integrate vision and language, extraction methods must preserve the semantics of both modalities. Extracting a table, for example, means recovering both its layout and the meaning of its content. Programs are a powerful, interpretable representation for the information extracted from such images. They can be executed to accurately reproduce the image digitally, and they integrate with software tools such as spreadsheets for downstream workflows. Programs also provide disentangled representations that isolate individual components and the relations between them, so one component can be manipulated without affecting the rest. Finally, programs generalize well: by abstracting patterns independently of content, they support reusable templates and scalable operations across datasets and domains. In this talk, I will explore key techniques for translating multi-modal data into code, with a focus on structured data extraction from tables. I will cover purely neural methods, neuro-symbolic approaches, and modern LLM-based techniques, discussing their strengths, limitations, and the challenges involved.
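
As a small illustration of what a program representation of a table might look like, the Python sketch below uses a hypothetical `TableProgram` class. The class, its methods, and the sample values are assumptions made for this example and are not taken from the talk. The program is executed to rebuild the table as a grid, one cell is edited in isolation, and the result is exported as CSV for use in a spreadsheet.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Cell:
    """One table cell: position, extent, and text (hypothetical schema)."""
    row: int
    col: int
    text: str
    rowspan: int = 1
    colspan: int = 1


@dataclass
class TableProgram:
    """Hypothetical program-style representation of an extracted table."""
    n_rows: int
    n_cols: int
    cells: list = field(default_factory=list)

    def add(self, row, col, text, rowspan=1, colspan=1):
        # Each call is one "instruction" that places a single component.
        self.cells.append(Cell(row, col, text, rowspan, colspan))
        return self

    def to_grid(self):
        # Executing the program reproduces the table as a dense grid;
        # a spanning cell repeats its text in every position it covers.
        grid = [["" for _ in range(self.n_cols)] for _ in range(self.n_rows)]
        for c in self.cells:
            for r in range(c.row, c.row + c.rowspan):
                for k in range(c.col, c.col + c.colspan):
                    grid[r][k] = c.text
        return grid

    def to_csv(self):
        # Export for downstream tools such as spreadsheets.
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(self.to_grid())
        return buf.getvalue()


# Illustrative data only: a title row spanning two columns, a header row,
# and one data row.
table = (
    TableProgram(n_rows=3, n_cols=2)
    .add(0, 0, "Results", colspan=2)
    .add(1, 0, "Model").add(1, 1, "Accuracy")
    .add(2, 0, "Baseline").add(2, 1, "0.81")
)

# Edit one component without touching the layout or the other cells.
table.cells[-1].text = "0.84"

csv_text = table.to_csv()
print(csv_text)
```

Layout (positions and spans) and content (cell text) are stored separately here, which is what allows the single-cell edit to leave the rest of the table unchanged.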



Presented at Workshop III: Naturalistic Approaches to Artificial Intelligence