"Even LLMs need education -- quality data makes LLMs overperform."

In other words, textbooks are all you need?

The idea is that instead of making a huge language model, you zero in on the best possible training data -- which for a large language model means textbooks, or "textbook-like data" -- or even create your own, known as "synthetic data".

These researchers developed "a data set of toddler-level stories called TinyStories that could be used to create models of less than ten million parameters that still produced comprehensible outputs. They trained a whole LLM from the ground up in a single day using only a single GPU -- probably less than $100 worth of compute time. The stories it produced were grammatically correct, maintained consistency, and showed reasoning."
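To give a sense of scale, here's a rough sketch of what training a model that small from scratch looks like -- this is not the researchers' actual code, just a minimal illustration assuming the Hugging Face transformers/datasets libraries and the publicly released "roneneldan/TinyStories" dataset:

```python
# Minimal sketch: train a tiny GPT-style model on TinyStories from scratch.
# Assumes the Hugging Face "roneneldan/TinyStories" dataset with a "text" column.
from datasets import load_dataset
from transformers import (AutoTokenizer, GPT2Config, GPT2LMHeadModel,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Tiny configuration: only a few transformer blocks and a small hidden size,
# giving a model in the single-digit-millions of parameters rather than billions.
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    n_positions=512,
    n_embd=128,   # small hidden size
    n_layer=4,    # only 4 transformer blocks
    n_head=4,
)
model = GPT2LMHeadModel(config)  # randomly initialized, trained from the ground up

dataset = load_dataset("roneneldan/TinyStories", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="tinystories-tiny-gpt",
        per_device_train_batch_size=32,
        num_train_epochs=1,
        learning_rate=5e-4,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

A model this size fits comfortably on one consumer GPU, which is what makes the "one day, one GPU" claim plausible -- the whole trick is that the curated, toddler-level data is simple enough for a tiny model to learn from.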

"If you were to ask someone to learn how to build a rocket ship just by searching the internet, you'd likely not have great results. Sure, there may be some good resources and communities that ahem get you off the ground. But there's also a lot of cruft out there -- anyone can put something on the internet and there's nobody to vet it."

"If you instead gave someone a textbook on rocketry, they'd at least know how to start, what the concepts are, and how to move towards an answer."

Even LLMs need education -- quality data makes LLMs overperform

#solidstatelife #ai #genai #llms