wow

Wayne Radinsky - 2023-03-10 04:19:24 GMT

"Take the DNA Delorean: the promise of large language models in genomics."

I can't improve on the article's own wording, so I'm going to quote a few sentences for each of the major developments it highlights, which will still take a bunch of space. Click through to the full article for details and links to the specific technologies mentioned.

"Genomic instrument companies such as Oxford Nanopore Technologies, PacBio, Singular, and Ultima have publicly announced using graphics processing units inside their sequencing platforms for AI-based base calling. These models span CNN, RNN, and transformer-based AI models, including DeepConsensus in PacBio's instruments which uses gap-aware sequence transformers to correct errors and enable read accuracy."

"AI has helped accelerate variant calling, variant filtering, and base calling in genomic instruments and analysis, but what about in other areas that include predictions? Large language models (LLMs) are AI models built on transformer architecture, and their application to DNA, RNA, and proteins is a burgeoning field in genomics."

"Compared to the vocabulary of 20 amino acids and an average sequence length of 350 amino acids for proteins, genomic LLMs operate on a vocabulary of four nucleotides and very long sequences -- the haploid human genome is three billion nucleotide pairs."
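To make that vocabulary-versus-length contrast concrete, here's a minimal Python sketch of my own (not from the article). The numbers for genome length and average protein length come from the quote above. The overlapping k-mer tokenization and the function names `kmer_vocab` and `tokenize` are assumptions for illustration only; the article doesn't say how any particular model tokenizes its input.

```python
from itertools import product

NUCLEOTIDES = "ACGT"               # the four-letter DNA vocabulary
GENOME_LENGTH = 3_000_000_000      # haploid human genome, per the quote
AVG_PROTEIN_LENGTH = 350           # average protein length, per the quote


def kmer_vocab(k):
    """Return every possible k-mer over the nucleotide alphabet (4**k items)."""
    return ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]


def tokenize(seq, k=1):
    """Split a sequence into overlapping k-mers with stride 1 (hypothetical scheme)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Single-nucleotide tokens give a vocabulary of 4; grouping into
# k-mers grows the vocabulary exponentially (4**k) without making
# the sequence meaningfully shorter when the stride is 1.
vocab_1 = kmer_vocab(1)
vocab_3 = kmer_vocab(3)
tokens = tokenize("ACGTAC", k=3)

# How many average-length proteins would fit end to end in one genome?
length_ratio = GENOME_LENGTH / AVG_PROTEIN_LENGTH

print(len(vocab_1), len(vocab_3), tokens)
print(f"genome is ~{length_ratio:,.0f}x longer than an average protein")
```

The point is just that the hard part for genomic models isn't the vocabulary, which is tiny, but the sequence length, which is millions of times longer than a typical protein.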

"At this year's SuperComputing conference, we shared the Gordon Bell special award with more than two dozen academic and commercial researchers from Argonne National Laboratory, the University of Chicago, and others. The honored work was a genomic LLM that tracks the genetic mutations and predicts variants of concern in SARS-CoV-2, the virus behind COVID-19. With anywhere from 2.5 to 25 billion trainable parameters, the Genome-Scale language models (GenSLMs) represent some of the first and largest whole genome LLMs trained on over 100 million nucleotide sequences."

"In September of this year, Nature featured a deep generative model focusing on regulatory DNA and predictions of lowest and highest levels of expression in yeast."

"Enformer -- released in 2021 -- is a deep learning model with a transformer architecture for genomic enhancers that predicts gene expression from DNA sequences and can integrate information from long-range interactions in the genome. This model helps scientists understand how noncoding DNA makes decisions about gene expression in different cell types, such as in skin, liver, and heart cells, among others."

"scBERT -- released in September 2022 -- is another groundbreaking genomic LLM that understands gene-gene interactions and is trained on large corpora of unlabeled scRNA-Seq data."

"DNABERT -- released in 2021 -- is another genomic LLM that understands nucleotide sequences and can make downstream predictions of promoters, splice sites, and transcription factor binding sites."


#solidstatelife #ai #nlp #llms #biology #genomics #proteomics
