#gzip

waynerad@diasp.org

Using compression algorithms to do text classification, competitive with deep neural networks. Neural networks have proven so effective at so many things that it's interesting to see non-neural-network solutions that can compete with them, or even outcompete them.

The theory here is that text compression algorithms work by using information theory to reduce redundancy within a text. Compressing two texts together and comparing the result against compressing each one separately gives an approximation of the "information distance" between them. It can only be an approximation because the true information distance is defined in terms of Kolmogorov complexity, which is uncomputable: no compression algorithm can be proven to be the maximum compressor.
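Concretely, the distance the paper uses is the normalized compression distance (NCD): compress each text alone, compress their concatenation, and normalize. Here's a minimal sketch in Python using the standard library's gzip module (function names like compressed_len are my own; here the two texts are joined with a space before compressing):

```python
import gzip

def compressed_len(text: str) -> int:
    # Length in bytes of the gzip-compressed UTF-8 encoding.
    return len(gzip.compress(text.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # Normalized compression distance: if x and y share a lot of
    # structure, compressing them together saves space, so the
    # numerator (and hence the distance) shrinks.
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```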

To make an actual classification system, they combined the compression algorithm with something called kNN. "kNN" stands for k-nearest-neighbors. The idea is you pick some number "k", and to classify a new text, you find the k labeled training texts closest to it and assign whatever label the majority of those neighbors have. For that, the algorithm needs a "distance" measure, and the compression-based distance above is what supplies it.
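A minimal sketch of that step, reusing the ncd function from the sketch above (the knn_classify helper and its interface are my own illustration; k=2 here is just a small example value):

```python
from collections import Counter

def knn_classify(test_text: str,
                 train_set: list[tuple[str, str]],
                 k: int = 2) -> str:
    # Rank every labeled training text by its NCD to the test text,
    # then take a majority vote among the k nearest neighbors.
    neighbors = sorted(train_set, key=lambda pair: ncd(test_text, pair[0]))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]
```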

Here they tried bz2, lzma, zstd, and gzip, and found gzip did the best.
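Swapping compressors in an experiment like that just means swapping the function that measures compressed length. bz2 and lzma ship in Python's standard library; zstd would need a third-party binding (the COMPRESSED_LEN table below is my own illustration):

```python
import bz2
import gzip
import lzma

# Each entry maps a compressor name to a "compressed length" function
# that could stand in for gzip in compressed_len() above.
COMPRESSED_LEN = {
    "gzip": lambda b: len(gzip.compress(b)),
    "bz2":  lambda b: len(bz2.compress(b)),
    "lzma": lambda b: len(lzma.compress(b)),
    # "zstd": not in the standard library; a third-party package
    # such as zstandard would be needed.
}
```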

They then compared gzip with the following text classification systems (most of them neural networks): TFIDF+LR, LSTM, Bi-LSTM+Attn, HAN, charCNN, textCNN, RCNN, VDCNN, fastText, BERT, W2V, SentBERT, and TextLength. They tested them on the following datasets: AGNews (academic news), DBpedia (extracted from Wikipedia), YahooAnswers (from Yahoo obviously), 20News (an old news dataset from 1995), Ohsumed (news from medical journals between 1987 and 1991), R8 and R52 (two datasets of news from Reuters), KirundiNews and KinyarwandaNews (two datasets of news in low-resource African languages), SwahiliNews (news in Swahili, a language from east Africa), DengueFilipino (news in Filipino), and SogouNews (news in Chinese, written in pinyin).

The only neural network that consistently outperformed their gzip+kNN system was BERT.

Oh, but there's a catch. BERT only beat the gzip+kNN system when classifying data similar to what it was trained on. When classifying text significantly different from what it was trained on, "out-of-distribution" data in the parlance of statisticians, BERT actually did worse, and the gzip+kNN system beat everything. In addition, the gzip+kNN system requires less computing power.

"Low-resource" text classification: A parameter-free classification method with compressors

#solidstatelife #ai #informationtheory #compression #gzip #knn