"OLMo: Open Language Model."

On benchmarks, it doesn't quite match LLaMA (for the 7-billion-parameter model) or StableLM (for the 1-billion-parameter model), but what it offers is total openness: they're releasing all the training data, everything used to construct the model, and even "checkpoints" of the model saved as it was being trained.

Each model comes with the following:

"Full training data used for these models, including code that produces the training data, from AI2's Dolma, and WIMBD for analyzing pretraining data."

"Full model weights, training code, training logs, training metrics in the form of Weights & Biases logs, and inference code."

"500+ checkpoints per model, from every 1000 steps during the training process, available as revisions on HuggingFace."

"Evaluation code under the umbrella of AI2's Catwalk and Paloma."

"Fine-tuning code and adapted models (coming soon with Open Instruct)"

"All code, weights, and intermediate checkpoints are released under the Apache 2.0 License."

"Dolma" in turn consists of Common Crawl (web pages), The Stack (code), C4 (web pages), Reddit, peS2o (STEM papers), Project Gutenberg (books), and Wikipedia and Wikibooks.

"Dolma is built using a pipeline of (1) language filtering, (2) quality filtering, (3) content filtering, (4) deduplication, (5) multi-source mixing, and (6) tokenization."

Training was done on clusters of both Nvidia and AMD GPUs.

This comes from the Allen Institute for Artificial Intelligence (remember Paul Allen?).

#solidstatelife #ai #genai #llms
