PvQ LLM Leaderboard.

"Recently, we've been building a small application called PvQ, a question and answer site driven by open weight large-language-models (LLMs). We started with ~100k questions from the StackOverflow dataset, and had an initial set of 7 open weight LLMs to produce an answer using a simple zero shot prompt. We needed a way to see the site with useful rankings to help push the better answers two the top without us manually reviewing each answer. While it is far from an perfect approach, we decided to use the Mixtral model from Mistral.AI, to review the answers together, and vote on the quality in regards to the original question."

"Over a few weeks we generated ~700k answers for the following models:"

"Mistral 7B Instruct"
"Gemma 7B Instruct"
"Gemma 2B Instruct"
"Deepseek-Coder 6.7B"
"Codellama"
"Phi 2.0"
"Qwen 1.5 4b"

But if you look at the leaderboard today, you'll see it now includes non-open models as well, such as GPT-4 Turbo, GPT-4o mini, Claude 3.5 Sonnet, Gemini Pro 1.0, and so on.

WizardLM from Microsoft, which I had never heard of before, did unexpectedly well.

#solidstatelife

https://pvq.app/leaderboard
