#databasetheory

waynerad@diasp.org

I had never heard of the British Post Office scandal (also known as the Horizon IT scandal) until @balduin@diasp.org mentioned it on here, even though it's been going on for a long time -- the first trials were in 1999. Some programmers made some buggy software, which resulted in 980 people getting criminally prosecuted, 236 people going to prison, an unspecified number of bankruptcies, and 4 suicides.

In trying to find out what happened, I found that most of the news relates to the trials, but I kept wondering: what were the software bugs? Eventually, I came across these videos -- one from Computerphile, featuring computer science professor Steven Murdoch, and one from Continuous Delivery, aka Dave Farley's channel -- which give a cursory outline of what the software bugs might have been.

Basically, Fujitsu made an accounting system called Horizon. It was a large and complex distributed system, with software designed to perform operations at post offices anywhere even while offline, and then synchronize and reconcile everything with the main system over only intermittent connections.

However, they failed to properly maintain what's known as "ACID" compliance. "ACID" is an acronym from database theory that stands for "atomicity, consistency, isolation, and durability." "Atomicity" means a transaction either goes through entirely or not at all -- nothing in between. All the changes to all the relevant accounts have to happen "atomically" or not at all. "Consistency" means if you have data spread across multiple computers, any computer you ask must give the same answers to the same questions -- you can't have one machine saying the bank balance is one thing while another says it's something else, because then who is right? Consistency is hard to maintain in distributed systems. "Isolation" means transactions must not inadvertently interfere with each other -- they must all be "isolated" from one another, as if each ran alone. "Durability" means when you do a transaction, the next day it hasn't disappeared -- once committed, every update to the database is "durable".
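To make "atomicity" concrete, here's a minimal sketch using SQLite (the table names and amounts are made up for illustration, not taken from the Horizon system). A transfer touches two accounts; if anything fails partway through, the transaction rolls back and neither half persists:

```python
import sqlite3

# Illustrative ledger: two accounts, each starting with a balance of 1000.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('branch', 1000), ('central', 1000)")
conn.commit()

try:
    # "with conn:" opens a transaction that commits on success
    # and rolls back automatically if an exception escapes.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - 800 WHERE name = 'branch'")
        raise RuntimeError("simulated crash mid-transfer")
        conn.execute(
            "UPDATE accounts SET balance = balance + 800 WHERE name = 'central'")
except RuntimeError:
    pass  # the crash happened between the two updates

# Atomicity means the half-finished debit was rolled back:
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'branch': 1000, 'central': 1000}
```

Without that rollback, the branch would show £800 missing that the central system never received -- which is exactly the kind of discrepancy this scandal turned on.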

So in what way were these principles violated? One example that seems to be mentioned often: if a user in a post office pushed a button, and the system seemed not to respond, and they pushed the button again, the transaction would be recorded multiple times in the central system but only once on their local system. If you're familiar with the word "idempotent", the operation was idempotent in one place but not the other. (Idempotent means doing an operation multiple times produces the same result as doing it once -- an important principle in implementing reliable distributed systems.) So they could push a button for $8,000 -- er, this is the UK, so £8,000 -- four times, and the central office would think £8,000 was deposited four times, but the local post office system would show the £8,000 deposited only once. Corporate would call them up and demand the missing £24,000. But of course, they don't have it, and can't pay it. So they get criminally charged.
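The standard way to make such an operation idempotent is to tag each submission with a unique transaction ID and have the server ignore IDs it has already applied, so retries are absorbed. This is a hypothetical sketch of that technique, not Horizon's actual design:

```python
import uuid

class Ledger:
    """Toy central ledger that deduplicates submissions by transaction ID."""

    def __init__(self):
        self.balance = 0
        self.seen = set()  # transaction IDs already applied

    def deposit(self, txn_id, amount):
        if txn_id in self.seen:
            return  # a retry of an already-applied submission: do nothing
        self.seen.add(txn_id)
        self.balance += amount

central = Ledger()

# The ID is generated once per button press at the counter, so every
# network-level retry of that press carries the same ID.
txn_id = str(uuid.uuid4())

# The terminal times out and resends the same £8,000 deposit four times:
for _ in range(4):
    central.deposit(txn_id, 8000)

print(central.balance)  # 8000, not 32000 -- the retries were absorbed
```

Without the `seen` check, the central balance would be 32,000 while the counter shows one deposit of 8,000 -- the £24,000 phantom shortfall described above.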

Regarding the criminal prosecutions -- I wish I could express surprise, but I can't, because for me it's become a familiar story. I learned about this pattern from reading books about mistakes (books I found after the Boeing crashes, and others related to the Chernobyl accident). The way "mistakes" play out in human social hierarchies is: blame is assigned to whoever is most "proximal" to the accident, and blame flows down the social status hierarchy to whoever is at the bottom.

For example, in the Chernobyl accident, people knew about the flaws in the reactor design but were unable to get that information to the plant operators, because it could not propagate through the bureaucracy (social status hierarchy) of the Soviet Union. When the accident happened, the designers of the reactor were not blamed, nor were any of the people high up in the social status hierarchy who failed to propagate the relevant information. Who was blamed? The plant operators -- they were both most proximal to the accident itself and at the bottom of the social status hierarchy, unable to pass blame on to anybody else. Most of the plant operators died of cancer before they got their prison sentences, but it's worth noting that only the plant operators ever received prison sentences or any other punishment. The plant operators were actually not at fault -- at all -- because they weren't responsible for the flaws in the reactor design, and furthermore, given those flaws, they were never provided the proper information about what the flaws were and how to mitigate them.

In the case of the British Post Office scandal, all of the people who went to prison, went bankrupt, or committed suicide were subpostmasters. "Subpostmaster" is a term I had never encountered before -- apparently it's a British term for what we here in the US would call a "branch manager". As far as I've been able to tell, nobody in Post Office Limited management or Fujitsu was convicted, got a prison sentence, went bankrupt, or committed suicide. Evidently they fought very hard, both in the legal system and in terms of PR, to protect themselves. The programmers who wrote the code have likewise not been convicted or imprisoned, nor have they gone bankrupt or committed suicide.

#solidstatelife #relationalmodel #databasetheory

https://www.youtube.com/watch?v=hBJm9ZYqL10