#genomics

waynerad@diasp.org

The complete genomes of 240 mammal species have been sequenced, along with a "whole-genome alignment", where they line up the DNA sequences in one species with the equivalent DNA sequences in the other species. The researchers identified 4,552 "conserved" regions. When talking about DNA, a region is considered either "fast-evolving" if it changes rapidly, or "conserved" if it stays the same over time. If a region is "conserved", there's probably a reason -- the sequence needs to be the way it is because it encodes some essential function. Knowing that a region is "conserved", however, doesn't tell you what the reason is. It's just a clue that there's something important going on. And just as a region can be "conserved" within a species over time, a region conserved across related species is probably conserved for a reason, even if you don't know what that reason is.

Examining the conserved regions, the researchers found regions that relate to hibernation, olfaction, vocal learning, and brain size. Not only that, but these evolved multiple times -- sort of. For example, hibernation evolved in both bears and bats. But it is thought that the underlying "coding" genes are conserved across mammal species -- it's the non-coding DNA that controls what genes get switched "on" and "off" that is thought to have evolved independently. "Coding" means the DNA encodes proteins which are assembled by ribosomes after the information on the DNA is transcribed to messenger RNA (mRNA) and ferried over to the ribosomes. "Non-coding" DNA doesn't encode for proteins. It was thought to be "junk" before it was realized that often this "non-coding" DNA controls the expression of the "coding" DNA, through such mechanisms as transcription factors.

A highly detailed phylogenetic tree of mammal species was made that resolves disputes in previous phylogenetic trees. The effects of the Cretaceous-Paleogene (K-Pg) extinction, which wiped out the dinosaurs and enabled placental mammals to take over land and rapidly diversify, are measurable in the genomes.

In addition to the 4,552 nearly perfectly preserved regions, 10.7% of the human genome was considered "unusually conserved" across mammal species. 57.6% of coding bases fall into this category, but on the flip side, 80.7% of the bases in this category are non-coding. If that makes you scratch your head, review your Bayes' Theorem: A given B is different from B given A. Given a base known to be coding, there's a 57.6% chance it's in the "unusually conserved" category. Given a base known to be "unusually conserved", there's an 80.7% chance it's non-coding.
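Here is a back-of-the-envelope sketch of that "A given B vs. B given A" point, using only the three percentages quoted above. It assumes every base is either coding or non-coding and that all three figures are measured over the same set of bases, which the summary above doesn't confirm, so treat the implied coding fraction as illustrative rather than a number from the paper.

```python
# Percentages quoted in the post above.
p_conserved = 0.107                  # P(conserved): share of the human genome
p_conserved_given_coding = 0.576     # P(conserved | coding)
p_noncoding_given_conserved = 0.807  # P(non-coding | conserved)

# Assumption: a base is either coding or non-coding, nothing else.
p_coding_given_conserved = 1.0 - p_noncoding_given_conserved

# Bayes' theorem rearranged to solve for the prior:
# P(coding) = P(coding | conserved) * P(conserved) / P(conserved | coding)
p_coding = p_coding_given_conserved * p_conserved / p_conserved_given_coding

print(f"P(conserved | coding) = {p_conserved_given_coding:.3f}")
print(f"P(coding | conserved) = {p_coding_given_conserved:.3f}")
print(f"Implied P(coding)     = {p_coding:.4f}")
```

The two conditional probabilities differ so much because coding bases are a small slice of the genome: most of them can be conserved and still be outnumbered by conserved non-coding bases.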

Approximately 439,461 candidates for "regulatory elements", regions of non-coding DNA which regulate the transcription of genes, were identified for further study. In addition, 2,024,062 transcription factor binding sites were found. Transcription factors are proteins that turn genes "on" and "off" by controlling their "transcription" from DNA to messenger RNA (mRNA), which goes to the ribosome to direct the assembly of a protein.

In addition to finding regions that are the same compared with other mammals, the researchers looked for regions that are different -- the so-called human accelerated regions. These are regions that are shared within the human lineage but differ from other mammals, especially our closest relatives, the chimpanzees. They are thought to underlie human-specific traits, and some of them are thought to involve neurodevelopmental genes.

There are also human-specific conserved deletions, where short (2-3 base pair) deletions took place. 10,032 of these human-specific conserved deletions were identified. Many of these are believed to affect neurons and nerve tissue and have affected our cognitive function.

For alignment, they created a new machine-learning tool that can align genes even if they underwent translocations (where a gene or part of a gene moves to a different chromosome) or inversions (where a portion of a DNA sequence gets flipped end to end, so it reads in the reverse direction), and works for non-coding regions as well as the parts within genes that don't encode protein and get spliced out of the mRNA ("introns" as opposed to "exons").
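To make "inversion" concrete, here is a minimal sketch of what one does to a sequence as read along a single strand: the segment is replaced by its reverse complement. This is only an illustration of the rearrangement itself; it is not the researchers' alignment tool, and the function names and the toy sequence are made up for the example.

```python
# Map each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def invert_segment(seq: str, start: int, end: int) -> str:
    """Flip seq[start:end] in place, as an inversion would (0-based, end exclusive)."""
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]

original = "AAACCGTTTT"                    # toy sequence, not real data
inverted = invert_segment(original, 3, 6)  # flips the "CCG" segment

print(original)
print(inverted)
# Inverting the same segment again restores the original sequence.
print(invert_segment(inverted, 3, 6) == original)
```

An aligner that only looks for matches in the forward direction would miss the flipped segment, which is why handling inversions takes extra work.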

They also looked at transposable elements, colloquially known as "jumping genes". These are DNA sequences that change their positions within a genome, sometimes with dramatic effects such as altering the cell's genetic identity or even its genome size. The lowest genomic percentage of transposable elements was found in the star-nosed mole (27.6%), and the largest percentage was seen in the aardvark (74.5%), whose higher transposable element count corresponds with larger genome size.

Finally, they looked at species that are nearing extinction and found that small population sizes decrease heterozygosity and increase the deleterious genetic load, making extinction even more likely. To determine deleterious genetic load, they looked at regions known to be vital in familiar species such as mice.

What makes a mammal? 423,000 newly identified DNA regions guide our genes

#discoveries #biology #genomics #mammals

tekaevl@diasp.org

wow

Wayne Radinsky - 2023-03-10 04:19:24 GMT

"Take the DNA Delorean: the promise of large language models in genomics."

I can't improve on just quoting from the article so I'm going to quote a few sentences for each of the major developments highlighted, which will still take a bunch of space. Click through to the full article for details and links to the specific technologies mentioned.

"Genomic instrument companies such as Oxford Nanopore Technologies, PacBio, Singular, and Ultima have publicly announced using graphics processing units inside their sequencing platforms for AI-based base calling. These models span CNN, RNN, and transformer-based AI models, including DeepConsensus in PacBio's instruments which uses gap-aware sequence transformers to correct errors and enable read accuracy."

"AI has helped accelerate variant calling, variant filtering, and base calling in genomic instruments and analysis, but what about in other areas that include predictions? Large language models (LLMs) are AI models built on transformer architecture, and their application to DNA, RNA, and proteins is a burgeoning field in genomics."

"Compared to the vocabulary of 20 amino acids and an average sequence length of 350 amino acids for proteins, genomic LLMs operate on a vocabulary of four nucleotides and very long sequences -- the haploid human genome is three billion nucleotide pairs."

"At this year's SuperComputing conference, we shared the Gordon Bell special award with more than two dozen academic and commercial researchers from Argonne National Laboratory, the University of Chicago, and others. The honored work was a genomic LLM that tracks the genetic mutations and predicts variants of concern in SARS-CoV-2, the virus behind COVID-19. With anywhere from 2.5 to 25 billion trainable parameters, the Genome-Scale language models (GenSLMs) represent some of the first and largest whole genome LLMs trained on over 100 million nucleotide sequences."

"In September of this year, Nature featured a deep generative model focusing on regulatory DNA and predictions of lowest and highest levels of expression in yeast."

"Enformer -- released in 2021 -- is a deep learning model with a transformer architecture for genomic enhancers that predicts gene expression from DNA sequences and can integrate information from long-range interactions in the genome. This model helps scientists understand how noncoding DNA makes decisions about gene expression in different cell types, such as in skin, liver, and heart cells, among others."

"scBERT -- released in September 2022 -- is another groundbreaking genomic LLM that understands gene-gene interactions and is trained on large corpora of unlabeled scRNA-Seq data."

"DNABERT -- released in 2021 -- is another genomic LLM that understands nucleotide sequences and can make downstream predictions of promoters, splice sites, and transcription factor binding sites."
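To make the vocabulary point in the quotes above concrete, here is a small sketch of one way to turn a nucleotide sequence into "words" for a language model: overlapping k-mers. That any particular model quoted above tokenizes exactly this way is my assumption, and the function name and toy sequence are invented for illustration.

```python
def kmer_tokens(seq: str, k: int) -> list:
    """Split a sequence into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

sequence = "ACGTAC"  # toy sequence, not real data
tokens = kmer_tokens(sequence, 3)

# With 4 nucleotides there are 4**k possible k-mers.
vocab_size = 4 ** 3

print(tokens)
print(vocab_size)
```

Larger k gives a bigger vocabulary (4 to the power k) but does not shorten the token sequence much when the k-mers overlap, so the very long sequences mentioned in the quote remain a challenge.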

Take the DNA Delorean: the promise of large language models in genomics

#solidstatelife #ai #nlp #llms #biology #genomics #proteomics

waynerad@pluspora.com

A plan to de-extinct the thylacine with the goal of re-introducing it into the wild. To be done by a company called Colossal in partnership with the Thylacine Integrated Genomic Restoration Research Lab (TIGRR), based at the University of Melbourne.

"Of all the species that humanity has wiped off the face of the Earth, the thylacine is possibly the most tragic loss. A wolf-sized marsupial sometimes called the Tasmanian tiger, the thylacine met its end in part because the government paid its citizens a bounty for every animal killed. That end came recently enough that we have photographs and film clips of the last thylacines ending their days in zoos. Late enough that in just a few decades, countries would start writing laws to prevent other species from seeing the same fate."

"As with Colossal's mammoth plans, TIGRR intends to obtain thylacine genomes, identify key differences between that genome and related lineages (mostly quolls), and then edit those differences into marsupial stem cells, which would then be used for IVF. It, too, faces some significant hurdles, in that nobody has made marsupial stem cells yet, nor has anyone cloned a marsupial -- two things that have at least been done in placental mammals (though not pachyderms)."

But the thylacine is a more tractable system than a mammoth: there are more museum samples because it survived until much more recently, and a marsupial embryo gets to the point of birth with less nutritional demand, with the rest of development taking place in the mother's pouch.

De-extinction company sets its next (first?) target: The thylacine

#genomics

waynerad@pluspora.com

"Comparative genomic study, the largest to date, includes genetic and phenotypic information of 57 species of mammals and identifies the greater stability of proteins as a common feature in the longest-living species".

So this research is all about convergent amino acid substitutions, and I had to do some work to understand what convergent amino acid substitutions are about. Basically, you can have a mutation in DNA, and that mutation, even if it is just a change to a single nucleotide, can change one of the amino acids in the protein the DNA encodes for. This change is what the word "substitution" refers to. The change can have no effect on the function of the protein, if the change is located somewhere that isn't related to the key parts that control the function of the protein. It also might have little effect, even if it is in one of the key parts, if the new amino acid is similar enough to the old one. It is also possible, however, for the change to have a very radical effect on the function of the protein.

The "convergent" part of that phrase has to do with the concept of convergent evolution. Convergent evolution is when two distantly related species independently evolve the same trait. For example, wings evolved in reptiles to form birds, and independently in mammals to form bats.

But the headline tells you this is about lifespan, so the question then becomes what all this has to do with longevity. Well, the way this research was structured was to find convergent amino acid substitutions in a variety of species, compare those with each species' lifespan, and see which substitutions go with lifespans that are longer or shorter than expected. This is different from previous genome-wide association studies (GWAS) that have focused on humans only.

They found 2,737 convergent amino acid substitutions in 2,004 genes where long-lived species have one amino acid and short-lived species have another. They further narrowed this down to 996 genes that they believe represent "true longevity signals" using statistical tests ("phylogenetic ANOVA test", whatever that is). The research paper has a lot of complex statistics that I didn't understand and can't summarize for you.
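As a rough sketch of the "long-lived species have one amino acid and short-lived species have another" idea, the snippet below scans a toy protein alignment for positions where the two groups are each uniform but differ from each other. The species labels and sequences are invented, and the real analysis also has to establish that the substitutions arose independently on the tree and pass the statistical tests mentioned above, none of which this does.

```python
def segregating_sites(alignment, long_lived, short_lived):
    """Return (position, long_residue, short_residue) for each alignment column
    where every long-lived species shares one residue and every short-lived
    species shares a different one."""
    length = len(next(iter(alignment.values())))
    sites = []
    for pos in range(length):
        long_residues = {alignment[s][pos] for s in long_lived}
        short_residues = {alignment[s][pos] for s in short_lived}
        if (len(long_residues) == 1 and len(short_residues) == 1
                and long_residues != short_residues):
            sites.append((pos, long_residues.pop(), short_residues.pop()))
    return sites

# Hypothetical aligned protein fragments, one per species.
toy_alignment = {
    "long_A": "MKVLA",
    "long_B": "MKVLG",
    "short_A": "MRVLA",
    "short_B": "MRVLS",
}

sites = segregating_sites(
    toy_alignment,
    long_lived={"long_A", "long_B"},
    short_lived={"short_A", "short_B"},
)
print(sites)
```

Only the second column (index 1) qualifies here: the last column varies, but not in a way that separates the two groups.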

You may be wondering what the amino acid changes do. They think that they increase protein "stability", which is to say, proteins tend to "destabilize" with age and the convergent evolution of certain amino acid choices in long-lived species leads to those species having proteins that better resist this "destabilization".

They speculate what leads to this increased "stabilization" is "contacts in the hydrophobic core" and "a reduction in Van der Waals clashes".

The idea behind the "hydrophobic core" theory is that a protein will maintain a stable structure by constructing a shape such that there is a "center" created by a group of highly hydrophobic amino acids -- hydrophobic means they don't like water. Hydrophilic -- water-loving -- amino acids will be on the outside and will perform the function of the protein. Mutations that increase the stability of this "hydrophobic core" would, then, increase longevity, while mutations that decrease the stability of the "hydrophobic core" would decrease longevity.
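To give a feel for "hydrophobic" versus "hydrophilic" amino acids, here is a small sketch that averages a standard hydropathy scale (Kyte-Doolittle, where positive means more hydrophobic) over a peptide. Using this scale is my choice for illustration; I don't know that the paper uses it, and the two peptides are made up. It says nothing about stability by itself, it only shows that residues differ a lot in how much they "like" water.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5,
    "M": 1.9, "A": 1.8, "G": -0.4, "T": -0.7, "S": -0.8,
    "W": -0.9, "Y": -1.3, "P": -1.6, "H": -3.2, "E": -3.5,
    "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}

def mean_hydropathy(peptide: str) -> float:
    """Average hydropathy over all residues in the peptide."""
    return sum(KYTE_DOOLITTLE[aa] for aa in peptide) / len(peptide)

core_like = "LIVFA"     # hypothetical stretch of hydrophobic residues
surface_like = "KDERN"  # hypothetical stretch of charged and polar residues

core_score = mean_hydropathy(core_like)
surface_score = mean_hydropathy(surface_like)

print(f"{core_like}: {core_score:.2f}")
print(f"{surface_like}: {surface_score:.2f}")
```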

Unfortunately I don't understand Van der Waals forces well enough to explain what "Van der Waals clashes" are. Van der Waals forces are weak forces that arise from quantum fluctuations in electron shells. The fluctuations produce transient positive or negative charges, which attract or repel parts of molecules that are very close to each other. They play a central role in organic chemistry, so I should probably learn about them.

The evolution of mammals reveals 2,000 new genes key to longevity in humans

#discoveries #evolution #proteome #genomics #gwas #longevity

waynerad@pluspora.com

"An analysis of data from 1.5 million people has identified 579 locations in the genome associated with a predisposition to different behaviors and disorders related to self-regulation, including addiction and child behavioral problems."

"With these findings, researchers have constructed a genetic risk score -- a number reflecting a person's overall genetic propensity based on how many risk variants they carry -- that predicts a range of behavioral, medical and social outcomes, including education levels, obesity, opioid use disorder, suicide, HIV infections, criminal convictions and unemployment."

"Genes don't code for a particular disorder or outcome; there are no genes 'for' substance use disorder, or 'for' behavior problems. Instead, genes influence the way our brains are wired, which can make us more at risk for certain outcomes. In this case, we find that there are genes that broadly influence self-control or impulsivity, and that this predisposition then confers risk for a variety of life outcomes."
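The "genetic risk score" described in the quotes above can be sketched as a weighted sum over the risk variants a person carries. The variant names, weights, and genotype below are all hypothetical, and the study's actual scoring method may differ in ways the quoted summary doesn't spell out.

```python
def risk_score(genotype: dict, weights: dict) -> float:
    """Sum of (per-variant weight) x (number of risk alleles carried: 0, 1 or 2)."""
    return sum(weight * genotype.get(variant, 0)
               for variant, weight in weights.items())

# Hypothetical effect weights, as might be estimated by an association study.
weights = {"variant_1": 0.02, "variant_2": 0.05, "variant_3": -0.01}

# Hypothetical person: count of risk alleles at each variant.
person = {"variant_1": 2, "variant_2": 1, "variant_3": 0}

score = risk_score(person, weights)
print(f"{score:.2f}")
```

No single variant decides the outcome in a score like this; each one nudges the total a little, which fits the point in the quote that there are no genes "for" a particular disorder.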

Study identifies 579 genetic locations linked to anti-social behavior, alcohol use, opioid addiction and more

#discoveries #psychology #genomics