Nougat: Neural Optical Understanding for Academic Documents
By L. Blecher et al.
Published on Aug. 25, 2023
Table of Contents
1 Introduction
2 Related Work
3 Model
3.1 Setup
3.2 Data Augmentation
4 Datasets
4.1 Splitting the pages
4.2 Ground truth artifacts
5 Results & Evaluation
Summary
Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. Nougat (Neural Optical Understanding for Academic Documents) is a Visual Transformer model that performs an Optical Character Recognition (OCR) task: it converts images of scientific document pages into formatted markup text, bridging the gap between human-readable documents and machine-readable text. The approach is intended to make scientific knowledge more accessible in the digital age, and the authors release the code and models to support future work on scientific text recognition.

The model does not require any OCR-related inputs or modules; the text is recognized implicitly by the network. The architecture is based on Donut and uses a Swin Transformer for visual encoding. The model is trained with the AdamW optimizer and uses data augmentation techniques to improve generalization. The training dataset is built from source code collected from arXiv, XML files from PubMed Central (PMC), and OCR text from the Industry Documents Library (IDL) dataset.
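The summary names AdamW as the training optimizer without saying how it differs from plain Adam. The sketch below shows a single AdamW update in pure Python, where weight decay is applied directly to the parameter rather than folded into the gradient. The function name `adamw_step` and all default hyperparameter values are illustrative assumptions, not settings taken from the paper.

```python
import math


def adamw_step(params, grads, m, v, t,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to flat lists of floats.

    `t` is the 1-based step count. Returns new (params, m, v) lists.
    All default hyperparameters are illustrative assumptions.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Exponential moving averages of the gradient and its square.
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g

        # Bias correction for the zero-initialised moment estimates.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)

        # Decoupled weight decay: shrink the parameter directly,
        # independent of the adaptive gradient step.
        p = p - lr * weight_decay * p

        # Adam step scaled by the second-moment estimate.
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)

        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


if __name__ == "__main__":
    params, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
    grads = [0.5, -0.25]
    params, m, v = adamw_step(params, grads, m, v, t=1)
    print(params)
```

Because the decay term acts on the parameter itself, a weight with zero gradient still shrinks by a factor of `1 - lr * weight_decay` per step, which is the behaviour that distinguishes AdamW from Adam with L2 regularisation.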