dredmorbius@joindiaspora.com

Stupid Awk text-processing tricks: Reframe your record and field delimiters

TL;DR: sometimes changing record / field separators can be exceptionally useful.

I've been wrestling with document conversions, from PDF, of what's really a set of structured data.[1] The tools for actually getting text out of PDFs have ... improved markedly over the years. The Poppler library's tools in particular.

But you've still got to manage the output. And what I'm getting has semantic columns, spaces, indents, text, Unicode, lions, tigers, bears... All structured within multi-page documents.

Awk's default processing model is to read a line of input at a time, and break that into fields based on whitespace.
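
For instance, with everything left at its defaults, this one-rule program prints the field count and the first field of each input line (input.txt being whatever file you have to hand):

    awk '{ print NF, $1 }' input.txt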

But ... you're not limited to this.

There's a set of command-line arguments and internal variables which can change all of this, as well as some ... surprisingly useful functions. The gawk(1) manpage and the GNU Awk User's Guide are especially helpful here.

Most useful to me are the RS and FS variables, and the split(s, a [, r [, seps]]) function.

RS defines the record separator. By default that's a newline, but if what you're working with is more sensibly thought of as a page of data, well, you can set it to "\f", that is, the form-feed character (hex 0x0C, octal 014).

FS defines the field separator, a single space (" ") by default (hex 0x20, octal 040). Here it's more sensible to treat each line as an individual field, by setting FS to the newline, "\n".

Simply by setting these two values, suddenly I'm reading a full page of text at a time, automatically splitting that into fields consisting of a single complete line each, with useful values such as NF ("number of fields") now meaning "number of lines on the page".
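
In sketch form (standard awk, nothing gawk-specific; the sub() strips the newline that usually precedes the form feed, which would otherwise show up as an empty final field):

    BEGIN {
        RS = "\f"    # record separator: form feed -- one record per page
        FS = "\n"    # field separator: newline -- one field per line
    }
    {
        sub(/\n$/, "")   # drop the trailing newline; assigning to $0 re-splits fields
        printf "page %d: %d lines\n", NR, NF
    }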

If you've ever found yourself wanting to scroll backwards and forwards through a record ... well, now you can.
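
For instance -- and this heuristic is purely illustrative -- finding a non-blank line flanked by blank lines, something like a lone heading:

    # assumes RS = "\f" and FS = "\n" as above
    {
        for (i = 2; i < NF; i++)
            if ($i != "" && $(i-1) == "" && $(i+1) == "")
                printf "page %d, line %d: %s\n", NR, i, $i
    }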

The split() function was the next realisation I had. In the LCSH file, "columns" are separated by, some testing confirms, two or more space characters. More or less.

(There are all manner of special cases in the data, but getting the basic structure set ... helps a lot.)

The arguments to split(s, a [, r [, seps]]) are:

  • s: the source string. Here, an input line from my raw text file.
  • a: the results array. This is conveniently cleared (as is seps; hold tight) when split() is invoked.
  • r: the field-separator regular expression. Since I'm looking for two or more spaces, " {2}" works for me here.
  • seps: another array, this one holding the separators found between the fields. Also cleared, as is a.
  • return value: the number of fields extracted.

The square brackets mean that some of those arguments are optional -- if r isn't supplied, it defaults to the current value of FS, and the separators themselves are discarded.
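
Putting the pieces together -- note that the four-argument split() is gawk-specific, and I've used the interval expression / {2,}/ here so that runs of three or more spaces count as a single gap:

    #!/usr/bin/gawk -f
    BEGIN { RS = "\f"; FS = "\n" }
    {
        sub(/\n$/, "")   # trailing newline would otherwise be an empty field
        for (i = 1; i <= NF; i++) {
            # split line i of the page into columns, keeping the gaps in seps
            n = split($i, cols, / {2,}/, seps)
            for (j = 1; j <= n; j++)
                printf "page %d, line %d, col %d (gap after: %d): %s\n",
                       NR, i, j, length(seps[j]), cols[j]
        }
    }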

So suddenly I've got the means to access a page of lines which I can split into columns and keep track of the gaps between them, as well as counts of pages, lines, columns, gaps, lengths of columns and gaps, and All That Other Jazz which makes figuring out the Pieces to the Puzzle possible.

Since the entire LCSH collection is about 760,000 entries, getting a script to do this is Much Easier (and faster, and more replicable) than Trying to Do This by Hand.

I suspect this isn't an especially well-hidden secret, but I'd been finding lots of nothing looking for ways of rescoping the text-extraction problem. Reframing the data as "lines in page" rather than "text on lines" makes the concepts, and opportunities, of working with the data vastly more tractable. Often the trick to solving a problem is one of framing it the right way, and that's exactly what I'm able to do here.

I suspect the notion could be expanded to the point of inhaling complete files in a single fell swoop, a frequently-applied method in Perl. I don't need to do that yet, but should I need to ... the option exists: set RS to something that never occurs in the input. (Files don't literally end with an EOF character, but any byte absent from the text will do, say the EOT control character -- decimal 4, hex 0x04, octal 004 -- and gawk also accepts a regular expression that can never match, such as RS = "^$".) Which I may yet play with.
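
A sketch of the slurp, assuming gawk (the never-matching "^$" trick comes from the gawk documentation):

    #!/usr/bin/gawk -f
    BEGIN { RS = "^$" }    # a regexp that can never match: the whole input is one record
    { printf "slurped %d characters\n", length($0) }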


Notes:

  1. Library of Congress Classification and Subject Headings definitions. Freely available ... as PDFs. See <https://www.loc.gov/aba/publications/FreeLCC/freelcc.html> and <https://www.loc.gov/aba/publications/FreeLCSH/freelcsh.html>.

#LCC #LoCCS #LCSH #LoC #Libraries #Classifications #Ontologies #awk #gawk #TextExtraction