#gzip

waynerad@diasp.org

Using compression algorithms to do text classification, competitive with deep neural networks. Neural networks have proven so effective at so many things that it's interesting to see non-neural-network solutions that can compete with them, or even outcompete them.

The theory here is that text compression algorithms work by using information theory to reduce redundancy within a text. Compressing two texts together and comparing the result against compressing each one separately gives an approximation of the "information distance" between them. It can only be an approximation because the true information distance is defined in terms of Kolmogorov complexity, which is uncomputable: no compression algorithm can be proven to be the maximum compressor.
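Concretely, the distance the paper uses is the normalized compression distance (NCD): compress each text alone, compress their concatenation, and normalize. Here's a minimal sketch in Python using the standard library's gzip module (function names like compressed_len are my own; here the two texts are joined with a space before compressing):

```python
import gzip

def compressed_len(text: str) -> int:
    # Length in bytes of the gzip-compressed UTF-8 encoding.
    return len(gzip.compress(text.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # Normalized compression distance: if x and y share a lot of
    # structure, compressing them together saves space, so the
    # numerator (and hence the distance) shrinks.
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```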

To make an actual classification system, they combined the compression algorithm with something called kNN. "kNN" stands for k-nearest-neighbors. The idea is you pick some number "k", and to classify a new text, you find the k labeled training texts closest to it and assign whatever label the majority of those neighbors have. For that, the algorithm needs a "distance" measure, and the compression-based distance above is what supplies it.
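A minimal sketch of that step, reusing the ncd function from the sketch above (the knn_classify helper and its interface are my own illustration; k=2 here is just a small example value):

```python
from collections import Counter

def knn_classify(test_text: str,
                 train_set: list[tuple[str, str]],
                 k: int = 2) -> str:
    # Rank every labeled training text by its NCD to the test text,
    # then take a majority vote among the k nearest neighbors.
    neighbors = sorted(train_set, key=lambda pair: ncd(test_text, pair[0]))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]
```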

Here they tried bz2, lzma, zstd, and gzip, and found gzip did the best.
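Swapping compressors in an experiment like that just means swapping the function that measures compressed length. bz2 and lzma ship in Python's standard library; zstd would need a third-party binding (the COMPRESSED_LEN table below is my own illustration):

```python
import bz2
import gzip
import lzma

# Each entry maps a compressor name to a "compressed length" function
# that could stand in for gzip in compressed_len() above.
COMPRESSED_LEN = {
    "gzip": lambda b: len(gzip.compress(b)),
    "bz2":  lambda b: len(bz2.compress(b)),
    "lzma": lambda b: len(lzma.compress(b)),
    # "zstd": not in the standard library; a third-party package
    # such as zstandard would be needed.
}
```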

They then compared gzip with the following text classification systems (most of them neural networks): TFIDF+LR, LSTM, Bi-LSTM+Attn, HAN, charCNN, textCNN, RCNN, VDCNN, fastText, BERT, W2V, SentBERT, and TextLength. They tested them on the following datasets: AGNews (academic news), DBpedia (extracted from Wikipedia), YahooAnswers (from Yahoo obviously), 20News (an old news dataset from 1995), Ohsumed (news from medical journals between 1987 and 1991), R8 and R52 (two datasets of news from Reuters), KirundiNews and KinyarwandaNews (two datasets of news in low-resource African languages), SwahiliNews (news in Swahili, a language from east Africa), DengueFilipino (news in Filipino), and SogouNews (news in Chinese, written in pinyin).

The only neural network that consistently outperformed their gzip+kNN system was BERT.

Oh, but there's a catch. BERT only beat the gzip+kNN system when classifying data similar to what it was trained on. When classifying text significantly different from what it was trained on, "out-of-distribution" data in the parlance of statisticians, BERT actually did worse, and the gzip+kNN system beat everything. In addition, the gzip+kNN system requires less computing power.

"Low-resource" text classification: A parameter-free classification method with compressors

#solidstatelife #ai #informationtheory #compression #gzip #knn