Hello everyone, I'm #nouveauici. My interests are #accordéon, #emacs, #esperanto, #krishnamurti, #latex, #linux, #musique and #texinfo. And also #montaigne #spinoza etc.
This blog post starts a series of posts on the 20-year anniversary of JabRef. We asked various contributors about their stories on JabRef and other insights. One of the early contributors was David Weitzman, who started contributing to JabRef while he was in high school...
An OCR system that can convert PDFs of scientific papers dense with mathematical equations has been developed. For mathematical equations, it outputs the LaTeX format.
"Next to HTML, PDFs are the second most prominent data format on the internet, making up 2.4% of common crawl. However, the information stored in these files is very difficult to extract into any other formats. This is especially true for highly specialized documents, such as scientific research papers, where the semantic information of mathematical expressions is lost. Existing Optical Character Recognition (OCR) engines, such as Tesseract OCR, excel at detecting and classifying individual characters and words in an image, but fail to understand the relationship between them due to their line-by-line approach. This means that they treat superscripts and subscripts in the same way as the surrounding text, which is a significant drawback for mathematical expressions. In mathematical notations like fractions, exponents, and matrices, relative positions of characters are crucial. Converting academic research papers into machine-readable text also enables accessibility and searchability of science as a whole. The information of millions of academic papers can not be fully accessed because they are locked behind an unreadable format. Existing corpora, such as the S2ORC dataset, capture the text of 12M papers using GROBID, but are missing meaningful representations of the mathematical equations. To this end, we introduce Nougat, a transformer based model that can convert images of document pages to formatted markup text."
The researchers have released a pre-trained model capable of converting a PDF to a lightweight markup language.
"Our method is only dependent on the image of a page, allowing access to scanned papers and books."
"To the best of our knowledge there is no paired dataset of PDF pages and corresponding source code out there, so we created our own from the open access articles on arXiv. For layout diversity we also include a subset of the PubMed Central (PMC) open access non-commercial dataset. During the pretraining, a portion of the Industry Documents Library (IDL) is included."
The model they came up with to do this is called Nougat, "an end-to-end trainable encoder-decoder transformer based model for converting document pages to markup." It's basically a vision transformer model.
A lot of the paper is concerned with technicalities such as splitting pages, ignoring headers and footers with page numbers, and coping with the compression and distortion artifacts, blur, and noise that can exist in the image to be OCRed.
To measure the performance of the model, they calculated edit distance, BLEU score, METEOR score, and F1-score.
"The edit distance, or Levenshtein distance, measures the number of character manipulations (insertions, deletions, substitutions) it takes to get from one string to another. In this work we consider the normalized edit distance, where we divide by the total number of characters."
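To make the metric concrete, here is a minimal sketch of a normalized edit distance in Python. It is not the authors' code. The quote does not say which "total number of characters" is meant, so dividing by the length of the longer string is my assumption, and the function names are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of insertions, deletions and substitutions turning a into b."""
    # Keep only one previous row of the dynamic-programming table.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance scaled to [0, 1].

    Assumption: normalize by the longer of the two strings.
    """
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(prediction, reference) / longest


if __name__ == "__main__":
    print(normalized_edit_distance(r"\frac{a}{b}", r"\frac{a}{c}"))
```

With this normalization, 0 means the output matches the reference exactly and 1 means nothing lines up, which is why smaller is better.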
"The BLEU metric was originally introduced for measuring the quality of text that has been machine-translated from one language to another. The metric computes a score based on the number of matching n-grams between the candidate and reference sentence."
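The "matching n-grams" part can be sketched as a clipped n-gram precision, which is one ingredient of BLEU. This is only an illustration: full BLEU also combines several n-gram orders and applies a brevity penalty, and the whitespace tokenization and function names below are my assumptions, not anything taken from the paper.

```python
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as many times as it appears in the
    reference, so repeating a correct word does not inflate the score.
    Assumption: tokens are split on whitespace.
    """
    candidate_counts = ngram_counts(candidate.split(), n)
    reference_counts = ngram_counts(reference.split(), n)
    total = sum(candidate_counts.values())
    if total == 0:
        return 0.0  # candidate is shorter than n tokens
    matched = sum(
        min(count, reference_counts[gram])
        for gram, count in candidate_counts.items()
    )
    return matched / total


if __name__ == "__main__":
    candidate = "the cat sat on the mat"
    reference = "the cat is on the mat"
    for n in (1, 2):
        print(n, clipped_ngram_precision(candidate, reference, n))
```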
METEOR is "another machine-translating metric with a focus on recall instead of precision."
The F1-score combines precision and recall into a single number. As the authors put it, "We also compute the F1-score and report the precision and recall."
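For anyone who wants to see how precision, recall and F1 fit together, here is a small sketch. The post does not say what units the paper counts, so treating each text as a bag of whitespace-separated tokens is my assumption, as is the function name.

```python
from collections import Counter


def precision_recall_f1(prediction: str, reference: str) -> tuple[float, float, float]:
    """Token-level precision, recall and F1.

    Assumption: both texts are compared as bags of whitespace-separated tokens.
    """
    predicted = Counter(prediction.split())
    expected = Counter(reference.split())
    # Multiset intersection: tokens present in both, counted with multiplicity.
    overlap = sum((predicted & expected).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(predicted.values())  # share of the output that is right
    recall = overlap / sum(expected.values())      # share of the reference that was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


if __name__ == "__main__":
    print(precision_recall_f1("a b c d", "a b e"))
```

Because F1 is a harmonic mean, it stays low unless precision and recall are both high.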
They compared with a previous OCR system, GROBID with LaTeX OCR. For edit distance, GROBID with LaTeX OCR got 0.727, while Nougat Small (250 million parameters) got 0.117 and Nougat Base (350 million parameters) got 0.128 on math equations. On edit distance, smaller is better. For BLEU, the numbers were 0.3 for GROBID + LaTeX OCR, 56.0 for Nougat Small and 56.9 for Nougat Base -- larger is better. On METEOR, the numbers were 5.0 for GROBID + LaTeX OCR, 74.7 for Nougat Small and 75.4 for Nougat Base -- larger is better. For F1, the numbers were 9.7 for GROBID + LaTeX OCR, 76.9 for Nougat Small, and 76.5 for Nougat Base -- larger is better.
This sounds like something that could be incredibly useful.
At last, there seems to be a tool for embedding #MovieClips into #Beamer #presentations (i.e. #PDF files) for #GNU plus #Linux systems. It's a previewer called #pdfpc which recognizes the movie, and you can start it, pause it, etc. with your mouse. If your machine is attached to a second screen, it automatically displays the presentation #Fullscreen on it.
Another reason to hate #Microsoft. I took data from a colleague's #Powerpoint slide to put into a #LaTeX document. When I processed my file I kept getting errors and I couldn't see anything with #Emacs that was wrong with the file. It turns out that the original file contained a #zero-width #Unicode character which I couldn't see! The solution was to use od -c | less
to spot anything untoward in my #TeX source #TextFile. There it was, in many places, and I had to find each location and replace the characters on either side to be sure that the invisible character was deleted. What a waste of time.
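For anyone hitting the same problem, a short script can list the invisible characters with their line and column instead of reading an od dump by eye. This is only a sketch. It assumes the file is UTF-8 and treats every character in Unicode category Cf ("format", which includes the zero-width space U+200B and the byte-order mark U+FEFF) as suspect. The function names are made up for illustration.

```python
import sys
import unicodedata


def find_invisible_characters(text: str) -> list[tuple[int, int, str, str]]:
    """Return (line, column, code point, name) for each format character.

    Assumption: anything in Unicode category "Cf" is unwanted in a TeX source.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for column, char in enumerate(line, start=1):
            if unicodedata.category(char) == "Cf":
                name = unicodedata.name(char, "UNKNOWN")
                hits.append((line_no, column, f"U+{ord(char):04X}", name))
    return hits


def strip_invisible_characters(text: str) -> str:
    """Drop every category-Cf character and leave the rest untouched."""
    return "".join(char for char in text if unicodedata.category(char) != "Cf")


if __name__ == "__main__":
    # Usage: python find_invisible.py yourfile.tex  (file name is hypothetical)
    with open(sys.argv[1], encoding="utf-8") as handle:
        source = handle.read()
    for line_no, column, code_point, name in find_invisible_characters(source):
        print(f"line {line_no}, column {column}: {code_point} {name}")
```

One caveat: category Cf also covers the zero-width joiner and non-joiner, which some scripts need for correct rendering, so check the report before stripping anything from text that is not plain Latin.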
The kid is making a presentation for school. Subject: biology, naturally with #LaTeX, the beamer class and a communist bibliography :-)
This month: * Command & Conquer * How-To : Python, Blender and Latex * Graphics : Inkscape * Everyday Ubuntu * Micro This Micro That * Review : Kubuntu 22.10 * Review : Ubuntu Cinnamon 22.04 * Ubports Touch : OTA-24 * Tabletop Ubuntu * Ubuntu Games : Dwarf Fortress (Steam Edition) plus: News, My Story, The Daily Waddle, Q&A,
#magazine #cinnamon #dwarf #dwarffortress #fortress #inkscape #kubuntu #latex #micro #python #steam #touch #ubports #ubuntu #fullcirclemagazine #linux
We are proud to announce the release of version 5.8 of our favorite citation manager JabRef, just in time for the holidays! And a huge welcome to @ThiloTe in our team of maintainers and as the new community manager!
#jabref #LaTeX #bibtex #openSource
https://blog.jabref.org/#december-18-2022-%E2%80%93-%F0%9F%8E%84-jabref-5-8-release-%F0%9F%8E%84
This month: * Command & Conquer * How-To : Python, Blender and Latex * Graphics : Inkscape * Everyday Ubuntu * Micro This Micro That * Review : Ubuntu 22.10 * Review : VanillaOS * Review : Ventoy * Tabletop Ubuntu * Ubuntu Games : Pixel Wheels plus: News, My Story, The Daily Waddle, Q&A, and more. Get it while it's hot: https://fullcirclemagazine.org/issue-187/
#magazine #2210 #dailywaddle #distro #inkscape #latex #pixelwheels #python #ubuntu #vanilla #vanillaos #ventoy #waddle #fullcirclemagazine #linux
♲ Khurram Wadee - 2022-11-04 11:56:53 GMT
I've produced a #calendar for #2023 featuring some of the images that I've also posted here from time to time of #sunrises and #sunsets. You can download the full resolution #PDF file from the link below.
https://drive.google.com/file/d/17Eq7WeC7NVF8-MkDnXFW6ZMdHZQLwCuA
I present here some lower-resolution sample images.
The calendar itself was produced by and extracted from #Emacs which can output into #LaTeX.
This month: * Command & Conquer * How-To : Python, Blender and Latex * Graphics : Inkscape * Everyday Ubuntu : Morrowind * Micro This Micro That * Review : Ubuntu Budgie 22.04 * Review : NixOS * Book Review : Dead Simple Python * Tabletop Ubuntu [NEW!] : Eight Minute Empire plus: News, My Story, The
#magazine #2204 #blender #book #budgie #empire #inkscape #latex #micro #morrowind #nixos #python #review #story #fullcirclemagazine #ubuntu #linux
This month: * Command & Conquer * How-To : Bash to Python, Migrating from VAX/VMS and Latex * Graphics : Inkscape * Everyday Ubuntu: Diagramming with Dia * Review : Xubuntu 22.04 * Review : Void Linux * Ubuntu Games : Crystal Caves HD plus: News, My Opinion, The Daily Waddle, Q&A, and more. Get it while it's
#magazine #bash #crystalcaves #dia #diagram #inkscape #latex #python #qa #vax #vaxvms #vms #void #voidlinux #waddle #xubuntu #fullcirclemagazine #ubuntu #linux
Citations can now also be looked up in the Biodiversity Heritage Library and we also added support to import Citavi backup files. A new filter for the Unlinked Files Search has been introduced to respect file ignore patterns defined in a .gitignore file in the search directory. We also improved the automatic detection of the library’s charset and fixed a couple of issues regarding the writing of XMP Metadata to linked files.
Notable UI improvements include the ability to drag and drop entries across libraries by dropping them on the library tab. The “Automatic Field Editor” dialog was redesigned and polished by our GSoC mentee @HoussemNasri. There may be some issues left; feel free to report them in our issue tracker.
As we updated the full-text search engine to Lucene 9.3, JabRef will recreate the search index in the background on start. Be aware that switching back and forth between the current version and any older version will make JabRef repeat this process every time, and this will take a long time for huge databases with many linked files.
For a complete list of all our changes, take a look at the Changelog.
You can get JabRef as free software from FOSShub.
#JabRef #openSource #LaTeX #java
https://blog.jabref.org/#august-05-2022-%E2%80%93-jabref-5-7-release