April 25, 2024

Tabular Data Extraction and Recognition

Evaluation Dataset: WikiTableSet, available at https://github.com/namtuanly/WikiTableSet
Related Research: Rethinking Image-Based Table Recognition Using Weakly Supervised Methods (ICPRAM, 2023)

The tabular format is an information-rich format widely used in communication, research, and data analysis. Tables commonly appear in research papers, books, handwritten notes, invoices, financial documents, and many other places. As a result, table recognition and analysis has become one of the essential techniques in document analysis systems and has attracted the attention of numerous researchers. As shown in Fig. 1, a table recognition and analysis system consists of two main parts: table detection and table recognition. Table detection locates the table regions in a document image. Table recognition takes a table image, recognizes its structure and content, and converts them into a machine-readable format such as HTML or LaTeX. The aim of this research is to improve table recognition and analysis systems for multilingual documents.
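To make the output of the table recognition step concrete, the following is a minimal Python sketch of the last stage only: serializing an already recognized table structure and its cell contents to HTML. It assumes the recognizer has produced a list of rows of cells; the names Cell and table_to_html are hypothetical and are not taken from WikiTableSet or the related paper.

```python
from dataclasses import dataclass
from html import escape
from typing import List


@dataclass
class Cell:
    """One recognized table cell (hypothetical structure for illustration)."""
    text: str
    rowspan: int = 1
    colspan: int = 1
    header: bool = False


def table_to_html(rows: List[List[Cell]]) -> str:
    """Serialize recognized rows of cells into a single HTML table string."""
    parts = ["<table>"]
    for row in rows:
        parts.append("<tr>")
        for cell in row:
            tag = "th" if cell.header else "td"
            attrs = ""
            # Emit span attributes only for cells that cover several rows or columns.
            if cell.rowspan > 1:
                attrs += f' rowspan="{cell.rowspan}"'
            if cell.colspan > 1:
                attrs += f' colspan="{cell.colspan}"'
            # Escape cell text so characters such as & or < yield valid HTML.
            parts.append(f"<{tag}{attrs}>{escape(cell.text)}</{tag}>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


# Assumed example input: a header row with a spanning cell and one data row.
example = [
    [Cell("Year", header=True), Cell("Revenue", colspan=2, header=True)],
    [Cell("2023"), Cell("1.2"), Cell("R&D")],
]
html_out = table_to_html(example)
print(html_out)
```

Keeping structure (spans, header flags) separate from content (cell text) mirrors the two kinds of information the recognition step has to recover; the recognition models themselves are outside the scope of this sketch.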

Possible topics that interns can join:

  • Language model for table recognition
  • Domain adaptation for multilingual table recognition
  • Embeddings in tabular data
Figure 1: Table recognition and analysis system