#proteomics

tekaevl@diasp.org

wow

Wayne Radinsky - 2023-03-10 04:19:24 GMT

"Take the DNA Delorean: the promise of large language models in genomics."

I can't improve on just quoting from the article, so I'm going to quote a few sentences for each of the major developments highlighted, which will still take up a fair amount of space. Click through to the full article for details and links to the specific technologies mentioned.

"Genomic instrument companies such as Oxford Nanopore Technologies, PacBio, Singular, and Ultima have publicly announced using graphics processing units inside their sequencing platforms for AI-based base calling. These models span CNN, RNN, and transformer-based AI models, including DeepConsensus in PacBio's instruments which uses gap-aware sequence transformers to correct errors and enable read accuracy."

"AI has helped accelerate variant calling, variant filtering, and base calling in genomic instruments and analysis, but what about in other areas that include predictions? Large language models (LLMs) are AI models built on transformer architecture, and their application to DNA, RNA, and proteins is a burgeoning field in genomics."

"Compared to the vocabulary of 20 amino acids and an average sequence length of 350 amino acids for proteins, genomic LLMs operate on a vocabulary of four nucleotides and very long sequences -- the haploid human genome is three billion nucleotide pairs."

"At this year's SuperComputing conference, we shared the Gordon Bell special award with more than two dozen academic and commercial researchers from Argonne National Laboratory, the University of Chicago, and others. The honored work was a genomic LLM that tracks the genetic mutations and predicts variants of concern in SARS-CoV-2, the virus behind COVID-19. With anywhere from 2.5 to 25 billion trainable parameters, the Genome-Scale language models (GenSLMs) represent some of the first and largest whole genome LLMs trained on over 100 million nucleotide sequences."

"In September of this year, Nature featured a deep generative model focusing on regulatory DNA and predictions of lowest and highest levels of expression in yeast."

"Enformer -- released in 2021 -- is a deep learning model with a transformer architecture for genomic enhancers that predicts gene expression from DNA sequences and can integrate information from long-range interactions in the genome. This model helps scientists understand how noncoding DNA makes decisions about gene expression in different cell types, such as in skin, liver, and heart cells, among others."

"scBERT -- released in September 2022 -- is another groundbreaking genomic LLM that understands gene-gene interactions and is trained on large corpora of unlabeled scRNA-Seq data."

"DNABERT -- released in 2021 -- is another genomic LLM that understands nucleotide sequences and can make downstream predictions of promoters, splice sites, and transcription factor binding sites."

Take the DNA Delorean: the promise of large language models in genomics

#solidstatelife #ai #nlp #llms #biology #genomics #proteomics

waynerad@pluspora.com

"For more than a decade, molecular biologist Martin Beck and his colleagues have been trying to piece together one of the world's hardest jigsaw puzzles: a detailed model of the largest molecular machine in human cells.

"This behemoth, called the nuclear pore complex, controls the flow of molecules in and out of the nucleus of the cell, where the genome sits. Hundreds of these complexes exist in every cell. Each is made up of more than 1,000 proteins that together form rings around a hole through the nuclear membrane."

"These 1,000 puzzle pieces are drawn from more than 30 protein building blocks that interlace in myriad ways. Making the puzzle even harder, the experimentally determined 3D shapes of these building blocks are a potpourri of structures gathered from many species, so don't always mesh together well. And the picture on the puzzle's box -- a low-resolution 3D view of the nuclear pore complex -- lacks sufficient detail to know how many of the pieces precisely fit together."

"Then, last July, London-based firm DeepMind, part of Alphabet -- Google's parent company -- made public an artificial intelligence (AI) tool called AlphaFold2."

"This is like an earthquake. You can see it everywhere. There is before July and after."

"This year, DeepMind plans to release a total of more than 100 million structure predictions. That is nearly half of all known proteins -- and hundreds of times more than the number of experimentally determined proteins in the Protein Data Bank (PDB) structure repository."

What's next for AlphaFold and the AI protein-folding revolution

#solidstatelife #ai #biology #proteomics #proteinfolding #deepmind #alphafold