OpenAI's latest models for dense text embeddings are far larger and more expensive than previous models, but no better and sometimes worse, according to Nils Reimers, an AI researcher at Hugging Face.

First, a bit about what "dense embeddings" are. "Embeddings" are vectors that capture something of the semantic meaning of words, such that vectors close together represent words with similar meanings, and relationships between vectors correlate with relationships between words. Don't worry if calling this an "embedding" makes no sense. As for the "dense" part: embeddings can be "sparse" or "dense". A "sparse" embedding has thousands of dimensions, most of them 0; a "dense" embedding has fewer dimensions (say, 400), but most of them are non-zero. Most of the embeddings you're familiar with are the dense kind: word2vec, fastText, GloVe, etc.
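To make the distinction concrete, here's a minimal sketch in Python. The vectors, the vocabulary size, and the index of "cat" are all made up for illustration; what matters is the shape of each representation, and that cosine similarity is the usual way to compare embeddings.

```python
import numpy as np

# Sparse: one dimension per vocabulary word, almost all zeros.
# "cat" gets a 1 only at its own index (index chosen arbitrarily here).
vocab_size = 50_000
sparse_cat = np.zeros(vocab_size)
sparse_cat[1234] = 1.0

# Dense: a few hundred dimensions, nearly all non-zero, like the
# vectors produced by word2vec / fastText / GloVe-style models.
rng = np.random.default_rng(0)
dense_cat = rng.normal(size=400)                          # stand-in for a learned vector
dense_dog = dense_cat + rng.normal(scale=0.3, size=400)   # a "nearby" word
dense_car = rng.normal(size=400)                          # an unrelated word

def cosine(a, b):
    """Cosine similarity: 1.0 means identical direction, ~0 means unrelated."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(dense_cat, dense_dog))  # high: vectors close together
print(cosine(dense_cat, dense_car))  # near 0: unrelated words
```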
In his summary he says, "The OpenAI text similarity models perform poorly and much worse than the state of the art."
"The text search models perform quite well, giving good results on several benchmarks. But they are not quite state-of-the-art compared to recent, freely available models."
"The embedding models are slow and expensive: Encoding 10 million documents with the smallest OpenAI model will cost about $80,000. In comparison, using an equally strong open model and running it on cloud will cost as little as $1. Also, operating costs are tremendous: Using the OpenAI models for an application with 1 million monthly queries costs up to $9,000 / month. Open models, which perform better at much lower latencies, cost just $300 / month for the same use-case."
"They generate extremely high-dimensional embeddings, significantly slowing down downstream applications while requiring much more memory."
Usually newer is better and bigger is better, but not always.
Reimers's full analysis: OpenAI GPT-3 Text Embeddings -- Really a new state-of-the-art in dense text embeddings?