#stabilityai

waynerad@diasp.org

StableLM dropped yesterday. Er, day before yesterday. Or maybe the day before that. Bah, I'm going to have to get more powerful hardware. I couldn't run Stable Diffusion because my GPU wasn't powerful enough, and I probably can't run this, either.

Anyway, this is a language model made by the same people who made Stable Diffusion. More precisely, it's the first of a suite of large language models from Stability AI. This release is actually two models, one with 3 billion and one with 7 billion parameters. Models with 15 billion and 30 billion parameters are promised.
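For those of you with beefier hardware than mine, the checkpoints are up on Hugging Face and load with the standard transformers library. Here's a minimal sketch; note that the repo id "stabilityai/stablelm-base-alpha-7b" is my assumption about where the 7 billion parameter base model lives, so double-check it on Stability AI's Hugging Face page:

```python
# Minimal sketch of running the 7B base model with Hugging Face transformers.
# Assumptions: the checkpoint is published as "stabilityai/stablelm-base-alpha-7b"
# and you have a CUDA GPU with enough memory (half precision needs roughly 16 GB).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "stabilityai/stablelm-base-alpha-7b"  # assumed repo id; verify on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model = model.cuda().eval()

prompt = "The Pile is a dataset that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The 3 billion parameter model should be the same recipe with the model name swapped; that's the one I'd try first on modest hardware.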

They're released under a license called CC BY-SA 4.0. That means Creative Commons Attribution-ShareAlike 4.0. Under that license, you can share the model and adapt the model, but you have to give attribution to Stability AI, and if you modify the model, you have to put the same license on your version, requiring whoever else uses it to also give attribution to Stability AI.

Because it's an open model, I can tell you what it was trained on. It was trained on a dataset called "The Pile". "The Pile" consists of the following subsets (I total the sizes up in a quick script after the list):

Pile-CC - 227.12 GiB
PubMed Central - 90.27 GiB
Books3 - 100.96 GiB
OpenWebText2 - 62.77 GiB
ArXiv - 56.21 GiB
GitHub - 95.16 GiB
FreeLaw - 51.15 GiB
Stack Exchange - 32.20 GiB
USPTO Backgrounds - 22.90 GiB
PubMed Abstracts - 19.26 GiB
Gutenberg (PG-19) - 10.88 GiB
OpenSubtitles - 12.98 GiB
Wikipedia (en) - 6.38 GiB
DM Mathematics - 7.75 GiB
Ubuntu IRC - 5.52 GiB
BookCorpus2 - 6.30 GiB
EuroParl - 4.59 GiB
HackerNews - 3.90 GiB
YoutubeSubtitles - 3.73 GiB
PhilPapers - 2.38 GiB
NIH ExPorter - 1.89 GiB
Enron Emails - 0.88 GiB
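Those sizes, added up, come to about 825 GiB, which matches the total I've seen quoted for The Pile. Here's the promised quick script, just hard-coding the numbers from the list above and printing each subset's share:

```python
# Back-of-the-envelope: total size of The Pile and each subset's share.
# Sizes (in GiB) are copied straight from the list above.
sizes_gib = {
    "Pile-CC": 227.12, "PubMed Central": 90.27, "Books3": 100.96,
    "OpenWebText2": 62.77, "ArXiv": 56.21, "GitHub": 95.16,
    "FreeLaw": 51.15, "Stack Exchange": 32.20, "USPTO Backgrounds": 22.90,
    "PubMed Abstracts": 19.26, "Gutenberg (PG-19)": 10.88,
    "OpenSubtitles": 12.98, "Wikipedia (en)": 6.38, "DM Mathematics": 7.75,
    "Ubuntu IRC": 5.52, "BookCorpus2": 6.30, "EuroParl": 4.59,
    "HackerNews": 3.90, "YoutubeSubtitles": 3.73, "PhilPapers": 2.38,
    "NIH ExPorter": 1.89, "Enron Emails": 0.88,
}

total = sum(sizes_gib.values())
print(f"Total: {total:.2f} GiB")  # about 825 GiB
for name, size in sorted(sizes_gib.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {size:7.2f} GiB  {100 * size / total:5.1f}%")
```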

Here's what's in each of those:

Pile-CC - CC stands for "Common Crawl"; this is a collection of website crawls from 2008 onwards.
PubMed Central - a dataset from the National Center for Biotechnology Information (NCBI) containing biomedical research.
Books3 - a dataset of books, with a mix of fiction and nonfiction.
OpenWebText2 - a web scrape that uses upvotes on Reddit submissions as a proxy for outgoing link quality. It has content up to 2020, in multiple languages.
ArXiv - you probably know it because I bring you all so much stuff from there. It's a preprint server for research papers in math, computer science, and physics.
GitHub - the open-source code repository website, which you all probably also know because I talk about it all the time, assuming you don't use it yourself.
FreeLaw - the Free Law Project has millions of legal opinions from federal and state courts, plus academic studies that analyze legal decisions.
Stack Exchange - a network of websites centered around user-contributed questions and answers (including Stack Overflow, the famous question-and-answer site for coding, which was the first).
USPTO Backgrounds - background sections from patents granted by the United States Patent and Trademark Office.
PubMed Abstracts - abstracts of 30 million PubMed research papers that are not part of PubMed Central, mentioned above.
Gutenberg (PG-19) - Project Gutenberg is a collection of classic Western literature, and the PG-19 dataset specifically consists of Project Gutenberg books from before 1919.
OpenSubtitles - English-language subtitles from movies and television shows.
Wikipedia (en) - the online encyclopedia you all know, chosen unsurprisingly for its well-written expository prose spanning many domains.
DM Mathematics - the DeepMind Mathematics dataset, a collection of mathematical problems from topics such as algebra, arithmetic, calculus, number theory, and probability, formatted as natural language prompts.
Ubuntu IRC - publicly available chat logs of all Ubuntu-related channels on the Freenode IRC chat server.
BookCorpus2 - a dataset of books written by "as of yet unpublished authors."
EuroParl - proceedings of the European Parliament in 21 European languages from 1996 until 2012, considered valuable because it's a multilingual "parallel corpus" -- a corpus that has the same text in multiple languages.
HackerNews - Hacker News you all probably know because I send stuff from there your way. It's a news aggregator run by Y Combinator, a startup accelerator in Silicon Valley, and articles there tend to focus on computer science and entrepreneurship. This news announcement (StableLM) is probably there right now.
YoutubeSubtitles - just what the name suggests: closed captions gathered from YouTube, human-generated only (auto-generated captions are excluded).
PhilPapers - philosophy publications from the Center for Digital Philosophy at the University of Western Ontario.
NIH ExPorter - grant abstracts for awarded NIH grant applications, from the ExPORTER service, covering 1985 to the present.
Enron Emails - the weird one. Apparently it was included because there generally aren't any publicly-available email datasets; Enron's emails became public during the investigation into the company's collapse, so they were included so the language model can learn how people talk in email.
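By the way, if you want to poke at The Pile yourself: as far as I know it's distributed as zstandard-compressed jsonlines shards, where each record has the document text plus a "pile_set_name" field saying which subset it came from. Treat the file name and field layout in this sketch as assumptions to verify against your download, but it shows roughly how you'd count documents per subset in one shard:

```python
# Sketch: tally documents per Pile subset in one downloaded shard.
# Assumptions: the shard is a zstandard-compressed jsonlines file (here "00.jsonl.zst")
# and each record looks like {"text": ..., "meta": {"pile_set_name": ...}}.
import io
import json
from collections import Counter

import zstandard as zstd  # pip install zstandard

counts = Counter()
with open("00.jsonl.zst", "rb") as f:
    # Stream-decompress so we never hold the whole multi-GiB shard in memory.
    text_stream = io.TextIOWrapper(
        zstd.ZstdDecompressor().stream_reader(f), encoding="utf-8"
    )
    for line in text_stream:
        record = json.loads(line)
        counts[record["meta"]["pile_set_name"]] += 1

for subset, n in counts.most_common():
    print(f"{subset:25s} {n:>10,} documents")
```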

Brrrrp! "StableLM is trained on a new experimental dataset built on The Pile, but three times larger with 1.5 trillion tokens of content. We will release details on the dataset in due course."

So what's described above is less than what StableLM was actually trained on. If I were to guess, I'd guess that wherever a subset stops at 2020 or some such, they've extended it with data up to the present.

Stability AI Launches the First of its StableLM Suite of Language Models

#solidstatelife #ai #generativemodels #nlp #llms #stabilityai #stablelm