#informationtheory

waynerad@diasp.org

Using compression algorithms to do text classification, competitive with deep neural networks. Neural networks have proven so effective at so many things that it's interesting to see non-neural-network solutions that can compete with them, or even outcompete them.

The theory here is that text compression algorithms work by using information theory to remove redundancy within a text. If you compress two texts concatenated together and compare the result with each text compressed on its own, you get an approximation of the information distance between them. It can only be an approximation, because no compression algorithm can be proven to be the maximum possible compressor.
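If I'm reading it right, the distance in question is the normalized compression distance (NCD). A minimal sketch in Python, using gzip's compressed length as a stand-in for the ideal compressor (the function names are mine, not the paper's):

```python
import gzip

def compressed_len(data: bytes) -> int:
    # Length of the gzip-compressed bytes, used as a rough stand-in
    # for the (uncomputable) ideal compressed length.
    return len(gzip.compress(data))

def ncd(x: str, y: str) -> float:
    # Normalized compression distance:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx = compressed_len(x.encode("utf-8"))
    cy = compressed_len(y.encode("utf-8"))
    cxy = compressed_len((x + " " + y).encode("utf-8"))
    return (cxy - min(cx, cy)) / max(cx, cy)
```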

To make an actual classification system, they combined the compression algorithm with something called kNN. "kNN" stands for k-nearest-neighbors. The idea is you pick some number "k", and to classify a new item you find the k already-labeled items that are closest to it and let them vote on the label. To do this, the algorithm needs a "distance" measure, and the compression-based distance above is what gets plugged in.

Here they tried bz2, lzma, zstd, and gzip, and found gzip did the best.
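To see how the pieces fit together, here's a toy, self-contained gzip+kNN classifier in Python. This is a sketch under my own assumptions, not the paper's code; the function names, the made-up training examples, and the choice of k are all mine. Swapping gzip.compress for bz2.compress or lzma.compress gives some of the other variants they tested.

```python
import gzip
from collections import Counter

def ncd(x: str, y: str) -> float:
    # Normalized compression distance, as sketched above, using gzip.
    cx = len(gzip.compress(x.encode("utf-8")))
    cy = len(gzip.compress(y.encode("utf-8")))
    cxy = len(gzip.compress((x + " " + y).encode("utf-8")))
    return (cxy - min(cx, cy)) / max(cx, cy)

def knn_classify(query, labeled_texts, k=3):
    # Rank every labeled text by its distance to the query,
    # then take a majority vote among the k nearest.
    ranked = sorted(labeled_texts, key=lambda pair: ncd(query, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy usage with made-up examples (not from the paper's datasets):
train = [
    ("the team won the championship game last night", "sports"),
    ("the striker scored twice in the second half", "sports"),
    ("the central bank raised interest rates again", "finance"),
    ("stock markets fell after the earnings report", "finance"),
]
print(knn_classify("the goalkeeper saved a penalty in extra time", train, k=3))
```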

They then compared gzip with the following text classification systems, most of them neural-network-based: TFIDF+LR, LSTM, Bi-LSTM+Attn, HAN, charCNN, textCNN, RCNN, VDCNN, fastText, BERT, W2V, SentBERT, and TextLength. They tested them on the following datasets: AGNews (news articles), DBpedia (extracted from Wikipedia), YahooAnswers (from Yahoo, obviously), 20News (the old 20 Newsgroups collection of Usenet postings from 1995), Ohsumed (abstracts from medical journals between 1987 and 1991), R8 and R52 (two datasets of news from Reuters), KirundiNews and KinyarwandaNews (two datasets of news in low-resource African languages), SwahiliNews (news in Swahili, a language from East Africa), DengueFilipino (dengue-related posts in Filipino), and SogouNews (news in Chinese, written in pinyin).

The only neural network that consistently outperformed their gzip+kNN system was BERT.

Oh, but there's a catch. BERT only beat the gzip+kNN system when classifying data similar to what it was trained on. When classifying text significantly different from what it was trained on, "out-of-distribution" data in the parlance of statisticians, BERT actually did worse, and the gzip+kNN system beat everything. In addition, the gzip+kNN system requires less computing power.

"Low-resource" text classification: A parameter-free classification method with compressors

#solidstatelife #ai #informationtheory #compression #gzip #knn

dredmorbius@joindiaspora.com

Fairness Reconsidered: Receiving Public as a Commons

The conceit of the Fairness Doctrine was that broadcast spectrum was a commons, and a limited public resource, arbitrarily allocated to a given (usually private) party. The right came with the obligation to manage this common resource in the public interest. The doctrine went through a few iterations, notably the Mayflower Decision (1941), before arriving at the "Fairness Doctrine" formula in 1949. There is similar history, though often arriving at different policies, elsewhere, notably the heavy reliance on government-owned or -controlled broadcasting through much of what was otherwise free Europe: the BBC, Germany, France, etc., much of that strongly informed by the rise of fascism and Nazi Germany in the 1920s and 1930s. (The US had its own fascist / populist demagogues, notably Father Charles Edward Coughlin and Joseph McCarthy.)

This past week's On the Media podcast has a good introduction to the Fairness Doctrine, in the context of Fox News and why the F.D. itself is inadequate to address Fox. (Hint: Cable subscribers.)

The past 5, 10, or 20 years of experience in the online world, or whatever timeframe you care to throw at it, suggest that treating digital media carried over (mostly) private infrastructure as strictly private ... has some pronounced failure modes, to use a technical understatement.

I haven't seen others making this argument yet, though I suspect some are, but my view is, roughly, that public mindshare is itself a commons, and should be held and managed in the public interest. There's a point at which reach or penetration itself becomes exploitation of a public resource, and concerns over the impacts of such reach are legitimate public concerns.

If you look at the fundamentals of information theory, there are three (or four) major components:

Sender -> Channel -> Receiver

You could also add noise, encoding, and decoding.

The Fairness Doctrine concerned the channel.

Both free-speech and classic censorship matters concern the sender (and, to at least some extent, the channel).

The new doctrine I'm suggesting covers the receiver, and specifically the general public as a general message recipient.

One could argue that disinformation, fake news, propaganda, and distraction are forms of intentionally introduced noise, and I'm sure there are elements concerning encoding and decoding which might be similarly considered.

Again, I'm not aware of anyone else offering a similar view, but it seems to me that our traditional models of speech, publishing, broadcasting, censorship, and responsibility are failing us here.

#FairnessDoctrine #FCC #Broadcasting #DigitalMedia #Media #OnTheMedia #Commons #Audience #InformationTheory