Researchers have developed an OCR system that can convert PDFs of scientific papers dense with mathematical equations into markup text. It outputs the mathematical equations as LaTeX.
"Next to HTML, PDFs are the second most prominent data format on the internet, making up 2.4% of common crawl. However, the information stored in these files is very difficult to extract into any other formats. This is especially true for highly specialized documents, such as scientific research papers, where the semantic information of mathematical expressions is lost. Existing Optical Character Recognition (OCR) engines, such as Tesseract OCR, excel at detecting and classifying individual characters and words in an image, but fail to understand the relationship between them due to their line-by-line approach. This means that they treat superscripts and subscripts in the same way as the surrounding text, which is a significant drawback for mathematical expressions. In mathematical notations like fractions, exponents, and matrices, relative positions of characters are crucial. Converting academic research papers into machine-readable text also enables accessibility and searchability of science as a whole. The information of millions of academic papers can not be fully accessed because they are locked behind an unreadable format. Existing corpora, such as the S2ORC dataset, capture the text of 12M papers using GROBID, but are missing meaningful representations of the mathematical equations. To this end, we introduce Nougat, a transformer based model that can convert images of document pages to formatted markup text."
The researchers have released a pre-trained model capable of converting a PDF to a lightweight markup language.
"Our method is only dependent on the image of a page, allowing access to scanned papers and books."
"To the best of our knowledge there is no paired dataset of PDF pages and corresponding source code out there, so we created our own from the open access articles on arXiv. For layout diversity we also include a subset of the PubMed Central (PMC) open access non-commercial dataset. During the pretraining, a portion of the Industry Documents Library (IDL) is included."
The model they came up with to do this is called Nougat, "an end-to-end trainable encoder-decoder transformer based model for converting document pages to markup." It's basically a vision transformer model.
A lot of the paper is concerned with technicalities such as splitting pages, ignoring headers and footers with page numbers, and handling the compression and distortion artifacts, blur, and noise that can exist in the image to be OCRed.
To measure the performance of the model, they calculated edit distance, BLEU score, METEOR score, and F1-score.
"The edit distance, or Levenshtein distance, measures the number of character manipulations (insertions, deletions, substitutions) it takes to get from one string to another. In this work we consider the normalized edit distance, where we divide by the total number of characters."
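To make the metric concrete, here is a minimal Python sketch of a normalized edit distance. It is not the paper's code. The function names are made up for illustration, and dividing by the length of the longer string is an assumption about what "the total number of characters" means.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance scaled to the range 0..1.

    Assumption: normalize by the length of the longer string."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(prediction, reference) / longest
```

For example, `normalized_edit_distance(r"\frac{a}{b}", r"\frac{a}{c}")` is 1/11: one substituted character out of eleven. A score of 0 means the output matches the reference exactly.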
"The BLEU metric was originally introduced for measuring the quality of text that has been machine-translated from one language to another. The metric computes a score based on the number of matching n-grams between the candidate and reference sentence."
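The n-gram matching at the heart of BLEU can be sketched in a few lines of Python. This is only the clipped n-gram precision for a single value of n, not the full metric, which also combines the precisions for several n-gram sizes and applies a brevity penalty. The function names and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the
    reference ("clipping"), so repeating a correct word does not help.
    Assumption: tokens are split on whitespace."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0  # candidate is shorter than n tokens
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total
```

With the candidate "the cat sat on the mat" and the reference "the cat is on the mat", five of the six candidate unigrams match (precision 5/6), and three of the five bigrams match (precision 3/5).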
METEOR is "another machine-translating metric with a focus on recall instead of precision."
The F1-score combines precision and recall into a single number, their harmonic mean: "We also compute the F1-score and report the precision and recall."
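As a rough illustration of how precision, recall, and F1 relate, here is a small Python sketch. Comparing the output and the reference as bags of whitespace-separated tokens is an assumption made for this example. The quoted text does not say how the paper matches tokens, and the function name is hypothetical.

```python
from collections import Counter


def precision_recall_f1(prediction: str, reference: str) -> tuple[float, float, float]:
    """Token-overlap precision, recall, and F1.

    Assumption: both texts are compared as bags of whitespace-separated
    tokens, ignoring word order."""
    pred = Counter(prediction.split())
    ref = Counter(reference.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred & ref).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For the prediction "a b c d" against the reference "a b e", two tokens overlap, so precision is 2/4, recall is 2/3, and F1 is 4/7 (about 0.57). Precision falls when the output contains extra tokens, and recall falls when it misses tokens from the reference.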
They compared Nougat with a previous OCR system, GROBID with LaTeX OCR. For edit distance (smaller is better), GROBID with LaTeX OCR got 0.727, while Nougat Small (250 million parameters) got 0.117 and Nougat Base (350 million parameters) got 0.128 on math equations. For BLEU (larger is better), the numbers were 0.3 for GROBID + LaTeX OCR, 56.0 for Nougat Small, and 56.9 for Nougat Base. For METEOR (larger is better), the numbers were 5.0 for GROBID + LaTeX OCR, 74.7 for Nougat Small, and 75.4 for Nougat Base. For F1 (larger is better), the numbers were 9.7 for GROBID + LaTeX OCR, 76.9 for Nougat Small, and 76.5 for Nougat Base.
This sounds like something that could be incredibly useful.