Some poor, lost soul wrote #Wolfenstein3d in #awk.
From a terminal, ssh -t root@teso.segfault.net "awk -f /everyone/AlarmFlock/awkaster.awk"
Password is segfault
Video: Brian Kernighan (the k in #awk) and the early days of UNIX https://www.montanalinux.org/videos-lca-bk-early-days-of-unix.html
Reducing Overhead Associated With Video Production | Techrights http://techrights.org/2022/01/10/video-production-shell-awk/ | Gemini address: gemini://gemini.techrights.org/2022/01/10/video-production-shell-awk/ #Techrights #GNU #Linux #FreeSW #awk
NEWS: #DataSwamp #BSD: Using #awk to pretty-display #OpenBSD packages update changes https://dataswamp.org/~solene/2021-12-04-openbsd-package-update-report.html
A quick cross-file comparison with #AWK https://www.datafix.com.au/BASHing/2021-11-10.html #programming
I am going to teach some colleagues to use the shell. Starting with ls, cd, mkdir et al. to navigate, and creating files and directories. Then some more advanced tools like grep and find, as well as a bit of sed and awk. Ending with advanced shell scripting. I will teach using bash, as it is the default on many systems today, but thought I would mention some others and how they differ.
The idea now is to have five sessions of two hours each. I will try to make a proper outline this weekend; this is about what I have today.
Do you have any suggestion of what I should mention? Do you think this outline looks all right?
NEWS: #Medium #Programming: Analyzing Big Data with #grep and #awk https://medium.com/cloud-computer/analyzing-big-data-with-grep-and-awk-c07d362b6ab8
This is a non-expert's simple extrapolation of the past 11 days' COVID-19 experience within the US, projecting both further likely spread of the COVID-19 outbreak and the possible actual extent of infected individuals based on a presumed testing lag.
As with my earlier China extrapolation: The real message here is how quickly experience deviates below the projection here, suggesting containment efforts are effective. In the case of China, that began about two weeks after my initial post. I am a space alien cat on the Internet, not an expert.
I've probably fucked up all kinds of things. Cluebats welcomed.
I'm using a simple exponential growth formula, basing the expected number of cases (and deaths) on the 5 March 2020 case and death counts, and on what appears to be the native community-spread rate within the US from 20 February 2020 through 5 March (the period of visible community spread). This is a short window, though one showing rapid growth.
It is overwhelmingly evident that the US does NOT have a solid handle on monitoring, and likely won't for at least another week, possibly several. This both makes the data presented, and the model based on them, more uncertain, and means that as monitoring improves, apparent case counts will likely increase rapidly. Again, this reflects experience in China.
Virus behaviour, population behaviour, public health measures, weather changes, sunspots, and timelords could all change things markedly.
The formula for exponential growth is:
y(t) = a * e^(k * t)
See: https://www.mathsisfun.com/algebra/exponential-growth.html
Where:
- y(t) is the number of cases after t periods,
- a is the initial number of cases,
- e is Euler's number (the base of natural logarithms),
- k is the growth rate per period, and
- t is the number of elapsed periods.
"Period" here is "days".
We can solve for k:
k = ln(y(t)/a)/t
This gives us the growth rate given two measurements t periods apart.
We can solve for t:
t = ln(y(t)/a)/k
In particular, if we solve for y(t) = 2 and a = 1, we get the doubling time.
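Setting y(t) = 2 and a = 1 in that formula gives the doubling time directly:

t_double = ln(2) / k

For the US numbers below, this works out to the roughly 2.2 day doubling time cited throughout.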
I've written a simple gawk script which computes k and the doubling time, and also projects the weekly (7 day) and fortnightly (14 day) growth rates.
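A minimal sketch of that calculation (my reconstruction, not the published script; the case counts and day span are illustrative placeholders):

```awk
#!/usr/bin/gawk -f
# Compute growth rate k, the doubling time, and the 7- and 14-day
# growth multiples from two case counts taken t days apart.
BEGIN {
    a = 14        # initial case count (placeholder)
    y = 175       # current case count (placeholder)
    t = 9         # days between the two counts (placeholder)

    k = log(y / a) / t          # growth rate per day (log is natural log)
    printf "daily growth rate:    %.3f\n", exp(k)
    printf "doubling time (days): %.3f\n", log(2) / k
    printf "7 day growth:         %.2fx\n", exp(k * 7)
    printf "14 day growth:        %.2fx\n", exp(k * 14)
}
```

Edit the three values in the BEGIN block to rerun against updated counts.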
A huge problem within the US is that confirmed cases lag actual infection dates by a substantial amount. How long that lag is is ... not entirely clear, though I'm going to assume a 14 day (two week) lag, based on the combined delays between infection, symptom onset, testing, and reporting: the total lag is about 2 weeks.
I'd suggested that this could lead to as much as a 100-fold understatement of actual cases. Based on current data, that seems pessimistic: it's "only" about 47x greater than the published confirmed cases count -- a number that's moved around considerably, by the way, so don't put too much faith in that either. But it gives an indication.
We also get a doubling time of about 2.2 days, which means that however bad the situation is now, it's going to be twice as bad in a little over 48 hours. When you hear statements that the situation is "rapidly evolving" this is what is being referenced. Things are changing very quickly. Locations which may have low risk today may have a high risk in a day or two.
You should be finalising preparations and supplies runs about now, if not already.
Again: non-expert extrapolation based on early data, a simple model, and many uncertainties. I expect we'll likely see numbers following the trend, if not overshooting it, for a week or two, mostly as monitoring catches up to reality. I'm very much hoping we'll start to see low-side numbers starting about two weeks out (18-22 March), as containment efforts begin to be effective. The caveat is that I don't see effective containment measures being enacted, certainly not on the scale that China performed starting ~22 January. In which case the projection here could well fit actual experience for longer.
As before, I'm posting this as a line in the sand of what my projection was. I hope and expect to be proved wrong on this within a couple of weeks. I'm dying to see how well this matches reality.
Dr. Messonnier of the CDC mentioned in an NPR interview on 5 March that there were numerous groups doing epidemic modelling to try to estimate the actual spread of SARS-CoV-2 within the US, though she pointedly refused to give any numbers herself. I have yet to find any published projections, but would be interested in seeing any.
Hardcoded in (edit to modify) are the initial and current case counts. You'll need to supply days between these measures as well. Data are taken from Wikipedia's 2020 Coronavirus Outbreak in the United States article.
The script calculates the growth rate, with arbitrary high and low bounds (basically assuming one day more or less error in the reported range -- it's kind of weak sauce but gives some idea of sensitivity), the doubling time, the weekly growth rate, and the 14-day growth rate.
It then produces two reports, one every day for 29 days, the other every seven days for 200 days. Both cut off if the infected population exceeds the total US population, given as 330.4 million. Shown are projected deaths, cases, cases at a low or high growth rate, and, as "w/ 14 day lag", the possible ground truth of total cases from which confirmed cases are drawn. I'll note that this presently exceeds 10,000 cases, and ... doubles every 2.2 days or so. A rate which will hit 1,000,000 by 18 March.
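A sketch of how such a projection loop might look (again my reconstruction, not the published script; it omits the lo/hi sensitivity columns, and the starting counts and growth rate are placeholders):

```awk
#!/usr/bin/gawk -f
# Project cases and deaths forward at a fixed daily growth rate,
# stopping if projected cases exceed the total US population.
BEGIN {
    cases  = 175          # confirmed cases on day 0 (placeholder)
    deaths = 11           # deaths on day 0 (placeholder)
    k      = 0.275        # daily growth rate (placeholder)
    lag    = exp(k * 14)  # 14-day lag multiple
    uspop  = 330.4e6      # total US population

    printf "%4s | %10s | %10s | %12s |\n", "day", "deaths", "cases", "w/ 14d lag"
    for (day = 1; day <= 29; day++) {
        c = cases  * exp(k * day)
        d = deaths * exp(k * day)
        if (c > uspop) break   # saturated: entire population infected
        printf "%4d | %10d | %10d | %12d |\n", day, d, c, c * lag
    }
}
```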
By April 25, if present rates continue, the entire US is infected. At the WHO's 3.4% fatality rate, 11.2 million die, and given economic modelling, your retirement fund is trash.
(And then the disease may return in the fall....)
For the rest of the world, you can substitute in values for that outbreak for a similar model. (I've got a separate script for this.) As values are hardcoded, it's a tad inflexible.
## Program Output
Minor reformatting aside, this is the output as it currently stands.
COVID-19 US Outbreak Model
Assumptions:
- init cases (2020-2-26): 14
- cases (2020-3-5): 175
- deaths (2020-3-5): 11
- daily growth rate: 1.316
- doubling time (days): 2.195
- 7 day growth: 6.83x
- 14 day growth/mon. lag: 46.59x
day | date | deaths | cases | @ lo dbl | @ hi dbl | w/ 14d lag |
---|---|---|---|---|---|---|
1 | Mar 06, 2020 | 14 | 230 | 224 | 238 | 10,726 |
2 | Mar 07, 2020 | 19 | 302 | 287 | 324 | 14,113 |
3 | Mar 08, 2020 | 25 | 398 | 367 | 440 | 18,569 |
4 | Mar 09, 2020 | 32 | 524 | 470 | 600 | 24,431 |
5 | Mar 10, 2020 | 43 | 689 | 602 | 816 | 32,145 |
6 | Mar 11, 2020 | 57 | 907 | 771 | 1,111 | 42,294 |
7 | Mar 12, 2020 | 75 | 1,194 | 988 | 1,512 | 55,647 |
8 | Mar 13, 2020 | 98 | 1,571 | 1,266 | 2,057 | 73,216 |
9 | Mar 14, 2020 | 129 | 2,067 | 1,621 | 2,800 | 96,331 |
10 | Mar 15, 2020 | 171 | 2,720 | 2,076 | 3,811 | 126,744 |
11 | Mar 16, 2020 | 224 | 3,579 | 2,659 | 5,186 | 166,760 |
12 | Mar 17, 2020 | 296 | 4,709 | 3,405 | 7,057 | 219,409 |
13 | Mar 18, 2020 | 389 | 6,196 | 4,360 | 9,603 | 288,680 |
14 | Mar 19, 2020 | 512 | 8,152 | 5,584 | 13,068 | 379,821 |
15 | Mar 20, 2020 | 674 | 10,726 | 7,151 | 17,784 | 499,736 |
16 | Mar 21, 2020 | 887 | 14,113 | 9,159 | 24,201 | 657,511 |
17 | Mar 22, 2020 | 1,167 | 18,569 | 11,729 | 32,933 | 865,098 |
18 | Mar 23, 2020 | 1,535 | 24,431 | 15,021 | 44,816 | 1,138,224 |
19 | Mar 24, 2020 | 2,020 | 32,145 | 19,236 | 60,987 | 1,497,580 |
20 | Mar 25, 2020 | 2,658 | 42,294 | 24,635 | 82,992 | 1,970,390 |
21 | Mar 26, 2020 | 3,497 | 55,647 | 31,548 | 112,938 | 2,592,474 |
22 | Mar 27, 2020 | 4,602 | 73,216 | 40,402 | 153,688 | 3,410,959 |
23 | Mar 28, 2020 | 6,055 | 96,331 | 51,740 | 209,142 | 4,487,854 |
24 | Mar 29, 2020 | 7,966 | 126,744 | 66,261 | 284,604 | 5,904,742 |
25 | Mar 30, 2020 | 10,482 | 166,760 | 84,856 | 387,295 | 7,768,965 |
26 | Mar 31, 2020 | 13,791 | 219,409 | 108,670 | 527,038 | 10,221,752 |
27 | Apr 01, 2020 | 18,145 | 288,680 | 139,167 | 717,203 | 13,448,923 |
28 | Apr 02, 2020 | 23,874 | 379,821 | 178,222 | 975,983 | 17,694,965 |
29 | Apr 03, 2020 | 31,412 | 499,736 | 228,238 | 1,328,136 | 23,281,550 |
day | date | deaths | cases | @ lo dbl | @ hi dbl | w/ 14d lag |
---|---|---|---|---|---|---|
1 | Mar 06, 2020 | 14 | 230 | 224 | 238 | 10,726 |
8 | Mar 13, 2020 | 98 | 1,571 | 1,266 | 2,057 | 73,216 |
15 | Mar 20, 2020 | 674 | 10,726 | 7,151 | 17,784 | 499,736 |
22 | Mar 27, 2020 | 4,602 | 73,216 | 40,402 | 153,688 | 3,410,959 |
29 | Apr 03, 2020 | 31,412 | 499,736 | 228,238 | 1,328,136 | 23,281,550 |
36 | Apr 10, 2020 | 214,403 | 3,410,959 | 1,289,346 | 11,477,413 | 158,908,518 |
43 | Apr 17, 2020 | 1,463,411 | 23,281,550 | 7,283,681 | 99,184,812 | 1,084,632,112 |
50 | Apr 24, 2020 | 9,988,535 | 158,908,518 | 41,146,424 | 857,129,291 | 7,403,170,243 |
https://pastebin.com/raw/Sn2jrG5f
Please note any observed errors / corrections.
#coronavirus #covid-19 #covid19 #ncov2019 #epidemiology #epidemics #exponentialGrowth #IHopeIAmWrong #awk
TL;DR: sometimes changing record / field separators can be exceptionally useful.
I've been wrestling with document conversions, from PDF, of what's really a set of structured data.[1] The tools for actually getting text out of PDFs have ... improved markedly over the years. The Poppler library's tools in particular.
But you've still got to manage the output. And what I'm getting has semantic columns, spaces, indents, text, unicode, lions, tigers, bears... All structured within multi-paged documents.
Awk's default processing model is to read a line of input at a time, and break that into fields based on whitespace.
But ... you're not limited to this.
There are a set of arguments and internal variables which can change all of this, as well as some ... surprisingly useful functions. The gawk(1) manpage and the GNU Awk User's Guide are especially helpful here.
Most useful to me are the RS and FS variables, and the split(s, a [, r [, seps]]) function.
RS defines the record separator. By default, that's a newline, but if what you're working with is more sensibly thought of as a page of data, well, you can set it to "\f", that is, the form-feed character (hex 0x0C, octal 014).
FS defines the field separator, a space (" ") by default (hex 0x20, octal 040). Here, it's more sensible to set it to "\n", so that each line becomes an individual field.
Simply by setting these two values, suddenly I'm reading a full page of text at a time, automatically splitting that into fields consisting of a single complete line each, and setting useful values such as NF ("number of fields"), now "number of lines on the page".
If you've ever found yourself wanting to scroll backwards and forwards through a record ... well, now you can.
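A minimal sketch of that setup (my illustration, not the author's script; the per-page report is just a placeholder action):

```awk
#!/usr/bin/gawk -f
# Treat each form-feed-separated page as one record, and each
# line on the page as one field.
BEGIN {
    RS = "\f"   # record separator: form feed (one record per page)
    FS = "\n"   # field separator: newline (one field per line)
}
{
    # NR is now the page number, NF the number of lines on the page.
    printf "page %d: %d lines\n", NR, NF
    for (i = 1; i <= NF; i++) {
        # $i is line i of the current page; scan back and forth at will.
        if ($i ~ /^[A-Z]/) upper++
    }
}
END { printf "%d lines beginning with a capital letter\n", upper }
```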
The split() function was the next realisation I had. In the LCSH file, "columns" are separated by, some testing confirms, two or more space characters. More or less.
(There are all manner of special cases in the data, but getting the basic structure set ... helps a lot.)
The arguments to split(s, a [, r [, seps]]) are: s, the string to split; a, an array which receives the resulting fields; r, a regexp describing the field separator; and seps, an array which receives the actual separator strings. A separator regexp of " {2}" works for me here. The square braces mean that some of those arguments are optional -- if r is not supplied, the default field separator FS is used, and if seps is not supplied, the separators themselves are simply discarded.
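A hedged sketch of how that call might be used on one page (my own illustration, using a two-or-more-spaces regexp where the author's pattern was " {2}"; the output format is hypothetical):

```awk
#!/usr/bin/gawk -f
# Split each line of a page into columns on runs of two or more
# spaces, keeping the separators so gap widths can be measured.
BEGIN { RS = "\f"; FS = "\n" }
{
    for (i = 1; i <= NF; i++) {
        ncols = split($i, cols, /  +/, seps)   # 4-arg split is a gawk extension
        for (j = 1; j <= ncols; j++) {
            # length(seps[j]) is the width of the gap after column j,
            # which helps recover the original column layout.
            printf "page %d line %d col %d (gap %d): %s\n", \
                   NR, i, j, length(seps[j]), cols[j]
        }
    }
}
```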
So suddenly I've got the means to access a page of lines which I can split into columns and keep track of the gaps between them, as well as counts of pages, lines, columns, gaps, lengths of columns and gaps, and All That Other Jazz which makes figuring out the Pieces to the Puzzle possible.
Since the entire LCSH collection is about 760,000 entries, getting a script to do this is Much Easier (and faster, and more replicable) than Trying to Do This by Hand.
I suspect this isn't an especially well-hidden secret, but I'd been finding lots of nothing looking for ways of rescoping the text-extraction problem. Reframing the data as "lines in page" rather than "text on lines" makes the concepts, and opportunities, of working with the data vastly more tractable. Often the trick to solving a problem is one of framing it the right way, and that's exactly what I'm able to do here.
I suspect the notion could be expanded to the point of inhaling complete files in a single fell swoop, which is a frequently-applied method in Perl. I don't need to do that yet, but should I need to ... the option seems to exist: set RS to something which never appears in the data, say the EOT character (Ctrl-D: decimal 4, hex 0x04, octal 004). Which I may yet play with.
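A tiny sketch of the slurp (an assumption on my part, not something from the original: rather than the EOT character, it uses the common gawk idiom of setting RS to a regexp that can never match):

```awk
#!/usr/bin/gawk -f
# Slurp each input file into a single record.  In gawk, a multi-
# character RS is treated as a regexp; "^$" can never match non-empty
# input, so the whole file arrives as one record in $0.
BEGIN { RS = "^$" }
{ printf "%s: %d characters\n", FILENAME, length($0) }
```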
Notes:
#LCC #LoCCS #LCSH #LoC #Libraries #Classifications #Ontologies #awk #gawk #TextExtraction