#awk

jojan@wk3.org

I am going to teach some colleagues to use the shell. Starting with ls, cd, mkdir et al to navigate, and creating files and directories. Then some more advanced tools like grep and find, as well as a bit of sed and awk. Ending with more advanced shell scripting. I will teach using bash, as it is the default on many systems today, but thought I would mention some others and how they differ.

The idea now is to have five sessions of two hours each. I will try to make a proper outline this weekend. This is roughly what I have today:

  • Session 1: Navigating in Shell, Unix file system, creating files, pipes (|, < and > and perhaps 2>&1 et al)
  • Session 2: grep, find, sed, tr
  • Session 3: aliases and .bashrc
  • Session 4: scripts
  • Session 5: more scripts?

Do you have any suggestion of what I should mention? Do you think this outline looks all right?

#bash #sed #awk #gnu #linux #unix #terminal

dredmorbius@joindiaspora.com

COVID-19: A Laycat's US Outbreak Model

This is a non-expert's simple extrapolation of the past 11 days' COVID-19 experience within the US, projecting both further likely spread of the COVID-19 outbreak and the possible actual extent of infected individuals based on a presumed testing lag.

As with my earlier China extrapolation: The real message here is how quickly experience deviates below the projection here, suggesting containment efforts are effective. In the case of China, that began about two weeks after my initial post. I am a space alien cat on the Internet, not an expert.

I've probably fucked up all kinds of things. Cluebats welcomed.

How this model works

I'm using a simple exponential growth formula, projecting the expected number of cases (and deaths) forward from the 5 March 2020 case and death counts, with a growth rate based on what appears to be native community spread within the US from 20 February 2020 through 5 March (the period of visible community spread). This is a short window, though one showing rapid growth.

It is overwhelmingly evident that the US does NOT have a solid handle on monitoring, and likely won't for at least another week, possibly several. This makes both the data presented and the model based on them more uncertain, and means that as monitoring improves, apparent case counts will likely increase rapidly. Again, this reflects experience in China.

Virus behaviour, population behaviour, public health measures, weather changes, sunspots, and timelords could all change things markedly.

Exponential growth function

The formula for exponential growth is:

y(t) = a * e^(k * t)

See: https://www.mathsisfun.com/algebra/exponential-growth.html

Where:

  • y(t): quantity at time t
  • a: initial quantity
  • e: the base of the natural logarithm, about 2.7183
  • k: the growth rate per period
  • t: the number of periods

"Period" here is "days".

We can solve for k:

k = ln(y(t)/a)/t

This gives us the growth rate given two measurements t periods apart.

We can solve for t:

t = ln(y(t)/a)/k

In particular, setting y(t) = 2 and a = 1 and solving for t gives the doubling time, t = ln(2)/k.

I've written a simple gawk script which computes k and the doubling time, and also projects the weekly (7 day) and fortnightly (14 day) growth rates.
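For illustration, here's a minimal gawk sketch of just that calculation. It is not the script linked under "Source Code" below, and the counts and day span in it are placeholders to edit:

```awk
#!/usr/bin/gawk -f
# Sketch only: compute growth rate, doubling time, and 7-/14-day factors
# from two case counts. Values below are placeholders -- edit to taste.
BEGIN {
    a    = 14      # initial case count (placeholder)
    yt   = 175     # later case count (placeholder)
    days = 14      # days between the two counts (placeholder)

    k         = log(yt / a) / days   # growth rate per day
    daily     = exp(k)               # daily multiplication factor
    doubling  = log(2) / k           # doubling time, in days
    weekly    = exp(7 * k)           # 7-day growth factor
    fortnight = exp(14 * k)          # 14-day growth factor

    printf("k %.4f, daily %.3fx, doubling %.2f days, 7-day %.2fx, 14-day %.2fx\n",
           k, daily, doubling, weekly, fortnight)
}
```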

Detection lag

A huge problem within the US is that confirmed cases are lagging actual infection dates by a substantial amount. How long that is is ... not entirely clear, though I'm going to assume a 14 day (two week) lag based on:

  • Initial infection is followed by a non-symptomatic period of about a week on average.
  • Seeking medical assistance has seen a further lag of several days in getting an appointment / performing a test.
  • Test results themselves take 4 days based on information I've seen.

The total lag is about 2 weeks.

I'd suggested that this could lead to as much as a 100-fold understatement of actual cases. Based on current data, that seems pessimistic: it's "only" about 47x greater than the published confirmed cases count -- a number that's moved around considerably, by the way, so don't put too much faith in that either. But it gives an indication.

We also get a doubling time of about 2.2 days, which means that however bad the situation is now, it's going to be twice as bad in a little over 48 hours. When you hear statements that the situation is "rapidly evolving" this is what is being referenced. Things are changing very quickly. Locations which may have low risk today may have a high risk in a day or two.

You should be finalising preparations and supplies runs about now, if not already.

Again: non-expert extrapolation based on early data, a simple model, and many uncertainties. I expect we'll likely see numbers following this trend, if not overshooting it, for a week or two, mostly as monitoring catches up to reality. I'm very much hoping we'll start to see low-side numbers about two weeks out (18-22 March), as containment efforts begin to be effective. The caveat is that I don't see effective containment measures being enacted, certainly not on the scale that China performed starting ~22 January. In which case the projection here could well fit actual experience for longer.

As before, I'm posting this as a line in the sand of what my projection was. I hope and expect to be proved wrong on this within a couple of weeks. I'm dying to see how well this matches reality.

The professionals are apparently doing this as well

Dr. Messonnier of the CDC mentioned on 5 March in an NPR interview that there were numerous groups doing epidemic modelling to try to estimate the actual spread of SARS-CoV-2 within the US, though she pointedly refused to give any numbers herself. I have yet to find any published projections, but would be interested in seeing any.

The script

Hardcoded in (edit to modify) are the initial and current case counts. You'll need to supply the number of days between these measures as well. Data are taken from Wikipedia's 2020 Coronavirus Outbreak in the United States article.

The script calculates the growth rate, with an arbitrary high and low bound (basically assuming one day more or less error in the reported range -- it's kind of weak sauce but gives some idea of sensitivity), the doubling time, the weekly growth rate, and the 14-day growth rate.

It then produces two reports, one every day for 29 days, the other every seven days for 200 days. Both cut off if the infected population exceeds the total US population, given as 330.4 million. Shown are projected deaths, cases, cases at a low or high growth rate, and, as "w/ 14 day lag", the possible ground truth of total cases from which confirmed cases are drawn. I'll note that this presently exceeds 10,000 cases, and ... doubles every 2.2 days or so. A rate which will hit 1,000,000 by 18 March.
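A sketch of how such a projection loop might look (again, this is not the linked script; the starting counts and growth factor below are simply the assumption values shown in the output further down):

```awk
#!/usr/bin/gawk -f
# Sketch of a daily projection loop using the assumption values shown
# in the output below; the real script is linked under "Source Code".
BEGIN {
    cases  = 175          # confirmed cases on day 0
    deaths = 11           # deaths on day 0
    daily  = 1.316        # daily growth factor
    lag14  = daily ^ 14   # 14-day factor, for the "w/ 14d lag" column
    uspop  = 330.4e6      # cutoff: total US population

    for (day = 1; day <= 29 && cases * daily <= uspop; day++) {
        cases  *= daily
        deaths *= daily
        printf("%2d  deaths %8d  cases %10d  w/ 14d lag %12d\n",
               day, deaths, cases, cases * lag14)
    }
}
```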

By April 25, if present rates continue, the entire US is infected. At the WHO's 3.4% fatality rate, 11.2 million die, and given economic modelling, your retirement fund is trash.

(And then the disease may return in the fall....)

For the rest of the world, you can substitute in values for that outbreak for a similar model. (I've got a separate script for this.) As values are hardcoded, it's a tad inflexible.

Program Output

Minor reformatting aside, this is the output as it currently stands.

COVID-19 US Outbreak Model

Assumptions:
- init cases (2020-4-26): 14
- cases (2020-3-5): 175
- deaths (2020-3-5): 11
- daily growth rate: 1.316
- doubling time (days): 2.195
- 7 day growth: 6.83x
- 14 day growth/mon. lag: 46.59x

Daily report:

| day | date | deaths | cases | @ lo dbl | @ hi dbl | w/ 14d lag |
|----:|------|-------:|------:|---------:|---------:|-----------:|
| 1 | Mar 06, 2020 | 14 | 230 | 224 | 238 | 10,726 |
| 2 | Mar 07, 2020 | 19 | 302 | 287 | 324 | 14,113 |
| 3 | Mar 08, 2020 | 25 | 398 | 367 | 440 | 18,569 |
| 4 | Mar 09, 2020 | 32 | 524 | 470 | 600 | 24,431 |
| 5 | Mar 10, 2020 | 43 | 689 | 602 | 816 | 32,145 |
| 6 | Mar 11, 2020 | 57 | 907 | 771 | 1,111 | 42,294 |
| 7 | Mar 12, 2020 | 75 | 1,194 | 988 | 1,512 | 55,647 |
| 8 | Mar 13, 2020 | 98 | 1,571 | 1,266 | 2,057 | 73,216 |
| 9 | Mar 14, 2020 | 129 | 2,067 | 1,621 | 2,800 | 96,331 |
| 10 | Mar 15, 2020 | 171 | 2,720 | 2,076 | 3,811 | 126,744 |
| 11 | Mar 16, 2020 | 224 | 3,579 | 2,659 | 5,186 | 166,760 |
| 12 | Mar 17, 2020 | 296 | 4,709 | 3,405 | 7,057 | 219,409 |
| 13 | Mar 18, 2020 | 389 | 6,196 | 4,360 | 9,603 | 288,680 |
| 14 | Mar 19, 2020 | 512 | 8,152 | 5,584 | 13,068 | 379,821 |
| 15 | Mar 20, 2020 | 674 | 10,726 | 7,151 | 17,784 | 499,736 |
| 16 | Mar 21, 2020 | 887 | 14,113 | 9,159 | 24,201 | 657,511 |
| 17 | Mar 22, 2020 | 1,167 | 18,569 | 11,729 | 32,933 | 865,098 |
| 18 | Mar 23, 2020 | 1,535 | 24,431 | 15,021 | 44,816 | 1,138,224 |
| 19 | Mar 24, 2020 | 2,020 | 32,145 | 19,236 | 60,987 | 1,497,580 |
| 20 | Mar 25, 2020 | 2,658 | 42,294 | 24,635 | 82,992 | 1,970,390 |
| 21 | Mar 26, 2020 | 3,497 | 55,647 | 31,548 | 112,938 | 2,592,474 |
| 22 | Mar 27, 2020 | 4,602 | 73,216 | 40,402 | 153,688 | 3,410,959 |
| 23 | Mar 28, 2020 | 6,055 | 96,331 | 51,740 | 209,142 | 4,487,854 |
| 24 | Mar 29, 2020 | 7,966 | 126,744 | 66,261 | 284,604 | 5,904,742 |
| 25 | Mar 30, 2020 | 10,482 | 166,760 | 84,856 | 387,295 | 7,768,965 |
| 26 | Mar 31, 2020 | 13,791 | 219,409 | 108,670 | 527,038 | 10,221,752 |
| 27 | Apr 01, 2020 | 18,145 | 288,680 | 139,167 | 717,203 | 13,448,923 |
| 28 | Apr 02, 2020 | 23,874 | 379,821 | 178,222 | 975,983 | 17,694,965 |
| 29 | Apr 03, 2020 | 31,412 | 499,736 | 228,238 | 1,328,136 | 23,281,550 |

Weekly report:

| day | date | deaths | cases | @ lo dbl | @ hi dbl | w/ 14d lag |
|----:|------|-------:|------:|---------:|---------:|-----------:|
| 1 | Mar 06, 2020 | 14 | 230 | 224 | 238 | 10,726 |
| 8 | Mar 13, 2020 | 98 | 1,571 | 1,266 | 2,057 | 73,216 |
| 15 | Mar 20, 2020 | 674 | 10,726 | 7,151 | 17,784 | 499,736 |
| 22 | Mar 27, 2020 | 4,602 | 73,216 | 40,402 | 153,688 | 3,410,959 |
| 29 | Apr 03, 2020 | 31,412 | 499,736 | 228,238 | 1,328,136 | 23,281,550 |
| 36 | Apr 10, 2020 | 214,403 | 3,410,959 | 1,289,346 | 11,477,413 | 158,908,518 |
| 43 | Apr 17, 2020 | 1,463,411 | 23,281,550 | 7,283,681 | 99,184,812 | 1,084,632,112 |
| 50 | Apr 24, 2020 | 9,988,535 | 158,908,518 | 41,146,424 | 857,129,291 | 7,403,170,243 |

Source Code

https://pastebin.com/raw/Sn2jrG5f

Please note any observed errors / corrections.

#coronavirus #covid-19 #covid19 #ncov2019 #epidemiology #epidemics #exponentialGrowth #IHopeIAmWrong #awk

dredmorbius@joindiaspora.com

Stupid Awk text-processing tricks: Reframe your record and field delimiters

TL;DR: sometimes changing record / field separators can be exceptionally useful.

I've been wrestling with document conversions, from PDF, of what's really a set of structured data.[1] The tools for actually getting text out of PDFs have ... improved markedly over the years. The Poppler library's tools in particular.

But you've still got to manage the output. And what I'm getting has semantic columns, spaces, indents, text, unicode, lions, tigers, bears... All structured within multi-paged documents.

Awk's default processing model is to read a line of input at a time, and break that into fields based on whitespace.
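For instance, with the defaults untouched, a one-liner like this prints each input line's field count and first whitespace-separated field (the filename is just a placeholder):

```awk
# Default behaviour: one record per line, fields split on whitespace.
gawk '{ print NF, $1 }' extracted.txt
```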

But ... you're not limited to this.

There are a set of arguments and internal variables which can change all of this, as well as some ... surprisingly useful functions. The gawk(1) manpage and the GNU Awk User's Guide are especially helpful.

Most useful to me are the RS and FS variables, and the split(s, a [, r [, seps]]) function.

RS defines the record separator. By default, that's a newline, but if what you're working with is more sensibly thought of as a page of data, well, you can set it to "\f", that is, the form-feed character (hex 0x0C, octal 014).

FS defines the field separator, a space (" ") by default (hex 0x20, octal 040). Here, it's more sensible to set it to "\n", so that each line of the page becomes an individual field.

Simply by setting these two values, suddenly I'm reading a full page of text at a time, automatically splitting that into fields consisting of a single complete line each, and setting useful values such as NF ("number of fields"), now "number of lines on the page".

If you've ever found yourself wanting to scroll backwards and forwards through a record ... well, now you can.
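Here's a minimal sketch of that setup, assuming form-feed-separated pages as described above:

```awk
#!/usr/bin/gawk -f
# Sketch: treat each form-feed-delimited page as one record,
# and each line on the page as one field.
BEGIN {
    RS = "\f"    # records are now pages
    FS = "\n"    # fields are now lines
}
{
    # NR is now the page number; NF is the number of lines on this page.
    printf("page %d: %d lines, first line: %s\n", NR, NF, $1)
}
```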

The split() function was the next realisation I had. In the LCSH file, "columns" are separated by, some testing confirms, two or more space characters. More or less.

(There are all manner of special cases in the data, but getting the basic structure set ... helps a lot.)

The arguments to split(s, a [, r [, seps]]) are:

  • s: the source string. Here, an input line from my raw text file.
  • a: the results array. This is conveniently cleared (as is seps, hold tight) when invoked.
  • r: the field separator regular expression. Since I'm looking for two or more spaces, " {2}" works for me here.
  • seps: another array, this time consisting of the separators between the fields. Also cleared as is a.
  • return value: the number of fields extracted.

The square brackets mean that some of those arguments are optional -- if r isn't supplied, FS is used as the separator, and if seps is omitted, the separators themselves are discarded.
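Put together, a sketch of what that looks like in practice (the " {2}" separator is the one described above; everything else is illustrative):

```awk
#!/usr/bin/gawk -f
# Sketch: page per record, line per field, then split each line into
# columns on the two-space separator, keeping the gaps in seps[].
BEGIN { RS = "\f"; FS = "\n" }
{
    for (i = 1; i <= NF; i++) {
        ncols = split($i, cols, " {2}", seps)
        printf("page %d, line %d: %d columns\n", NR, i, ncols)
        for (c = 1; c <= ncols; c++)
            printf("  col %d: [%s] (gap after: %d chars)\n",
                   c, cols[c], length(seps[c]))
    }
}
```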

So suddenly I've got the means to access a page of lines which I can split into columns and keep track of the gaps between them, as well as counts of pages, lines, columns, gaps, lengths of columns and gaps, and All That Other Jazz which makes figuring out the Pieces to the Puzzle possible.

Since the entire LCSH collection is about 760,000 entries, getting a script to do this is Much Easier (and faster, and more replicable) than Trying to Do This by Hand.

I suspect this isn't an especially well-hidden secret, but I'd been finding lots of nothing looking for ways of rescoping the text-extraction problem. Reframing the data as "lines in page" rather than "text on lines" makes the concepts, and opportunities, of working with the data vastly more tractable. Often the trick to solving a problem is one of framing it the right way, and that's exactly what I'm able to do here.

I suspect the notion could be expanded to the point of inhaling complete files in a single fell swoop, which is a frequently-applied method in Perl. I don't need to do that yet, but should I need to ... the option seems to exist: set RS to a character that never occurs in the input, such as the EOT character (decimal 4, hex 0x04, octal 004). Which I may yet play with.
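A sketch of that whole-file variant, assuming the input never actually contains that byte:

```awk
#!/usr/bin/gawk -f
# Sketch: slurp the whole input as a single record by picking a record
# separator assumed never to occur in it (EOT, octal 004, here).
BEGIN { RS = "\004" }
{
    printf("read %d characters in one gulp\n", length($0))
}
```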


Notes:

  1. Library of Congress Classification and Subject Headings definitions. Freely available ... as PDFs. See <https://www.loc.gov/aba/publications/FreeLCC/freelcc.html> and <https://www.loc.gov/aba/publications/FreeLCSH/freelcsh.html>.

#LCC #LoCCS #LCSH #LoC #Libraries #Classifications #Ontologies #awk #gawk #TextExtraction