#latex

waynerad@diasp.org

An OCR system that can convert PDFs of scientific papers dense with mathematical equations has been developed. For mathematical equations, it outputs the LaTeX format.

"Next to HTML, PDFs are the second most prominent data format on the internet, making up 2.4% of common crawl. However, the information stored in these files is very difficult to extract into any other formats. This is especially true for highly specialized documents, such as scientific research papers, where the semantic information of mathematical expressions is lost. Existing Optical Character Recognition (OCR) engines, such as Tesseract OCR, excel at detecting and classifying individual characters and words in an image, but fail to understand the relationship between them due to their line-by-line approach. This means that they treat superscripts and subscripts in the same way as the surrounding text, which is a significant drawback for mathematical expressions. In mathematical notations like fractions, exponents, and matrices, relative positions of characters are crucial. Converting academic research papers into machine-readable text also enables accessibility and searchability of science as a whole. The information of millions of academic papers can not be fully accessed because they are locked behind an unreadable format. Existing corpora, such as the S2ORC dataset, capture the text of 12M2 papers using GROBID, but are missing meaningful representations of the mathematical equations. To this end, we introduce Nougat, a transformer based model that can convert images of document pages to formatted markup text."

The researchers have released a pre-trained model capable of converting a PDF to a lightweight markup language.

"Our method is only dependent on the image of a page, allowing access to scanned papers and books."

"To the best of our knowledge there is no paired dataset of PDF pages and corresponding source code out there, so we created our own from the open access articles on arXiv. For layout diversity we also include a subset of the PubMed Central (PMC) open access non-commercial dataset. During the pretraining, a portion of the Industry Documents Library (IDL) is included."

The model they came up with to do this is called Nougat, "an end-to-end trainable encoder-decoder transformer based model for converting document pages to markup." It's basically a vision transformer model.

A lot of the paper is concerned with technicalities such as splitting pages, ignoring headers and footers with page numbers, and handling the compression and distortion artifacts, blur, and noise that can exist in the image to be OCRed.

To measure the performance of the model, they calculated edit distance, BLEU score, METEOR score, and F1-score.

"The edit distance, or Levenshtein distance, measures the number of character manipulations (insertions, deletions, substitutions) it takes to get from one string to another. In this work we consider the normalized edit distance, where we divide by the total number of characters."

"The BLEU metric was originally introduced for measuring the quality of text that has been machinetranslated from one language to another. The metric computes a score based on the number of matching n-grams between the candidate and reference sentence.

METEOR is "another machine-translating metric with a focus on recall instead of precision."
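
The n-gram matching that BLEU builds on can be sketched as follows. This is only the clipped n-gram precision for a single n, not the full metric: BLEU itself combines several n-gram orders and adds a brevity penalty. The function names and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Fraction of candidate n-grams that also occur in the reference."""
    # Assumption: split on whitespace; a real evaluation would use a proper tokenizer.
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each count so a repeated n-gram is not credited more often
    # than it appears in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total
```

With the candidate "the cat sat on the mat" and the reference "the cat is on the mat", five of the six unigrams match and three of the five bigrams match.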

The F1-score incorporates both precision and recall: "We also compute the F1-score and report the precision and recall."
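
As a rough illustration of how the three numbers relate, here is a small Python sketch that computes precision, recall, and F1 from two token lists. Counting matches as a multiset overlap of tokens is an assumption; the post does not say how the paper matches predicted and reference tokens, and the function name is hypothetical.

```python
from collections import Counter


def precision_recall_f1(predicted_tokens, reference_tokens):
    """Return (precision, recall, f1) for two token sequences."""
    pred = Counter(predicted_tokens)
    ref = Counter(reference_tokens)
    # Assumption: a token counts as correct as often as it appears in both lists.
    overlap = sum((pred & ref).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

If the model predicts four tokens and two of them are among the three reference tokens, precision is 2/4, recall is 2/3, and F1 is 4/7.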

They compared Nougat with a previous OCR approach, GROBID combined with LaTeX OCR. For edit distance (smaller is better), GROBID + LaTeX OCR got 0.727, while Nougat Small (250 million parameters) got 0.117 and Nougat Base (350 million parameters) got 0.128 on math equations. For BLEU (larger is better), the numbers were 0.3 for GROBID + LaTeX OCR, 56.0 for Nougat Small, and 56.9 for Nougat Base. For METEOR (larger is better), the numbers were 5.0 for GROBID + LaTeX OCR, 74.7 for Nougat Small, and 75.4 for Nougat Base. For F1 (larger is better), the numbers were 9.7 for GROBID + LaTeX OCR, 76.9 for Nougat Small, and 76.5 for Nougat Base.

This sounds like something that could be incredibly useful.

Nougat: Neural optical understanding for academic documents

#solidstatelife #ai #computervision #ocr #latex

mkwadee@diasp.eu

Another reason to hate #Microsoft. I took data from a colleague's #Powerpoint slide to put into a #LaTeX document. When I processed my file I kept getting errors, and I couldn't see anything wrong with the file in #Emacs. It turns out that the original file contained a #zero-width #Unicode character which I couldn't see! The solution was to use od -c | less to spot anything untoward in my #TeX source #TextFile. There it was, in many places, and I had to find each location and replace the characters on either side to be sure that the invisible character was deleted. What a waste of time.
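
The same hunt can be sketched in a few lines of Python instead of od. This is a minimal sketch, assuming the culprits are Unicode "format" characters (general category Cf), which covers zero-width spaces, zero-width joiners, and byte-order marks; the function names are hypothetical.

```python
import unicodedata


def is_invisible(ch: str) -> bool:
    """True for Unicode format characters (category Cf), e.g. U+200B or U+FEFF."""
    return unicodedata.category(ch) == "Cf"


def find_invisible(text: str):
    """List (line, column, code point, name) for every invisible character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if is_invisible(ch):
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits


def strip_invisible(text: str) -> str:
    """Return the text with every format character removed."""
    return "".join(ch for ch in text if not is_invisible(ch))
```

Calling `find_invisible` on the contents of the TeX source reports where each character hides, and `strip_invisible` removes them all in one pass. Category Cf also includes characters that are sometimes intentional, such as soft hyphens and joiners in some scripts, so reviewing the report before stripping is the safer order.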

federatica_bot@federatica.space

Full Circle Magazine #188

This month: * Command & Conquer * How-To : Python, Blender and Latex * Graphics : Inkscape * Everyday Ubuntu * Micro This Micro That * Review : Kubuntu 22.10 * Review : Ubuntu Cinnamon 22.04 * Ubports Touch : OTA-24 * Tabletop Ubuntu * Ubuntu Games : Dwarf Fortress (Steam Edition) plus: News, My Story, The Daily Waddle, Q&A,

#magazine #cinnamon #dwarf #dwarffortress #fortress #inkscape #kubuntu #latex #micro #python #steam #touch #ubports #ubuntu #fullcirclemagazine #linux

federatica_bot@federatica.space

Full Circle Magazine #187

This month: * Command & Conquer * How-To : Python, Blender and Latex * Graphics : Inkscape * Everyday Ubuntu * Micro This Micro That * Review : Ubuntu 22.10 * Review : VanillaOS * Review : Ventoy * Tabletop Ubuntu * Ubuntu Games : Pixel Wheels plus: News, My Story, The Daily Waddle, Q&A, and more. Get it while it's hot: https://fullcirclemagazine.org/issue-187/

#magazine #2210 #dailywaddle #distro #inkscape #latex #pixelwheels #python #ubuntu #vanilla #vanillaos #ventoy #waddle #fullcirclemagazine #linux

tekaevl@diasp.org

Khurram Wadee - 2022-11-04 11:56:53 GMT

I've produced a #calendar for #2023 featuring some of the images that I've also posted here from time to time of #sunrises and #sunsets. You can download the full resolution #PDF file from the link below.
https://drive.google.com/file/d/17Eq7WeC7NVF8-MkDnXFW6ZMdHZQLwCuA

I present here some lower-resolution sample images.

The calendar itself was produced by and extracted from #Emacs which can output into #LaTeX.

Calendar front page
January photo
January calendar page
February photo
March photo

#MyWork #MyPhoto #CCBYSA #DSLR #Nikon #D7000

federatica_bot@federatica.space

Full Circle Magazine #186

This month: * Command & Conquer * How-To : Python, Blender and Latex * Graphics : Inkscape * Everyday Ubuntu : Morrowind * Micro This Micro That * Review : Ubuntu Budgie 22.04 * Review : NixOS * Book Review : Dead Simple Python * Tabletop Ubuntu [NEW!] : Eight Minute Empire plus: News, My Story, The

#magazine #2204 #blender #book #budgie #empire #inkscape #latex #micro #morrowind #nixos #python #review #story #fullcirclemagazine #ubuntu #linux

federatica_bot@federatica.space

Full Circle Magazine #184

This month: * Command & Conquer * How-To : Bash to Python, Migrating from VAX/VMS and Latex * Graphics : Inkscape * Everyday Ubuntu: Diagramming with Dia * Review : Xubuntu 22.04 * Review : Void Linux * Ubuntu Games : Crystal Caves HD plus: News, My Opinion, The Daily Waddle, Q&A, and more. Get it while it's

#magazine #bash #crystalcaves #dia #diagram #inkscape #latex #python #qa #vax #vaxvms #vms #void #voidlinux #waddle #xubuntu #fullcirclemagazine #ubuntu #linux

christophs@diaspora.glasswings.com

August 05, 2022 – JabRef 5.7 Release

Citations can now also be looked up in the Biodiversity Heritage Library and we also added support to import Citavi backup files. A new filter for the Unlinked Files Search has been introduced to respect file ignore patterns defined in a .gitignore file in the search directory. We also improved the automatic detection of the library’s charset and fixed a couple of issues regarding the writing of XMP Metadata to linked files.

Notable UI improvements include the feature to drag and drop entries across libraries, by dropping them on the library tab. The “Automatic Field Editor” dialog was redesigned and polished by our GSoC mentee @HoussemNasri. There may be some issues left, feel free to report them in our issue tracker.

As we updated the full-text search engine to Lucene 9.3, JabRef will recreate the search index in the background on start. Be aware that switching back and forth between the current version and any older version will make JabRef repeat this process every time, and this will take a long time for huge databases with many linked files.

For a complete list of all our changes, take a look at the Changelog.

You can get JabRef as free software from FOSShub.

#JabRef #openSource #LaTeX #java

https://blog.jabref.org/#august-05-2022-%E2%80%93-jabref-5-7-release