The Next Frontier: Large Language Models In Biology

Community Article · Published October 12, 2025

fig0-cover

Introduction

At its most fundamental level, I think biology can be thought of as an information processing system, albeit an extraordinarily complex and dynamic one. Just as mathematics turned out to be the right description language for physics, biology may turn out to be the perfect type of regime for the application of AI. - Demis Hassabis

Everyone should cultivate an interest in biology. While the laws of physics and chemistry govern the universe, the principles of biology govern you. We aren't just subject to biology; we are the system itself. When scientists discuss the gut microbiome, they're talking about the trillions of microorganisms living in your own digestive tract. You see, everyone on this planet is a biological stakeholder. To ignore biology is to ignore your own owner's manual. You don't have to be a scientist to care about biology; you just have to be alive. If these words didn't quite sway you, Abhishaike Mahajan offers a more dramatic and intense version of this argument in Ask not why would you work in biology, but rather: why wouldn't you?

Following this logic, the application of language models in biology presents tremendous opportunities for understanding the complicated web of biological systems, which is why you should get interested in biological language models (for DNA, RNA, proteins, and other biological modalities). These models could be the essential bridge to fascinating breakthroughs in many research areas, such as personalized medicine and novel drug design.

This is a long article, but it is written in a self-contained, semi-conversational style that focuses on the intuition behind these models rather than on dumping information.

The Human Genome Project

If there is anything worth doing twice, it's the human genome. - David Haussler

In 2003, the Human Genome Project reached its conclusion, the culmination of an inspiring journey of unprecedented scale: thirteen years of intense international collaboration spanning twenty organizations across six countries. Although the final draft covered about 92% of the genome, it represented a monumental achievement given the technological limitations of sequencing at the time. With that published draft, humanity gained direct access to the book of life, the fundamental recipe dictating how every single cell in our bodies is built and operates: three billion base pairs of wonder! So, what comes next?

fig1

Biology was totally transformed. It stopped being a science of looking at just one gene at a time and became a data-driven discipline focused on huge, complex systems. The tools to read DNA grew cheaper and faster every year, so quickly that the original Genome Project now feels like a walk in the park. We shot past just genomes and started mapping everything: all the RNA, all the proteins, and even the neural connections in the brain. Welcome to the omics era! Our problem now is that our understanding of biology simply can't keep up with the sheer amount of data being generated; we are drowning in it and struggling to make sense of it. There wasn't enough pressure for evolution to optimize our brains for omics comprehension. But what about other (non-human) brains? Like dolphins, for example. Kidding! You already know what we're talking about.

The Language of Life

Don't worry, we're keeping this section brief! If any terms are new, a quick search on Wikipedia or in your favorite Biology textbook will clear things right up.

Biological Molecules

Life relies on three fundamental classes of biomolecules: DNA, RNA, and proteins. These are all polymers, meaning they are constructed as long chains of repeating structural units. DNA and RNA are built from nucleotides, whereas proteins are assembled from amino acids. Each nucleotide consists of three primary components: a phosphate group, a sugar (ribose in RNA or deoxyribose in DNA), and a nitrogen-containing nucleobase. The genetic alphabet of DNA is composed of four nucleobases: cytosine [C], guanine [G], adenine [A], and thymine [T]. In RNA, uracil [U] is used in place of thymine.

fig2

Proteins are the functional machinery of the cell, as they are directly involved in virtually every process required for life. They act as catalytic enzymes, provide essential structural support, facilitate molecular transport, and regulate signaling pathways. This versatility is due to their complex three-dimensional structure, which dictates their specific job (keep that in mind). Proteins are the direct result of the execution of the genetic code (more on that shortly). Like other major biological polymers, proteins are constructed from smaller, repeating units, in this case amino acids. While hundreds of amino acids exist naturally, cellular protein synthesis relies on a standardized set of 20 proteinogenic amino acids to build all functional proteins (a number that extends to 22 when including specialized forms used by certain organisms). A peptide is a short chain of amino acids, typically containing fewer than 50 units.

fig3

The protein's primary structure is simply its unique, linear chain of amino acids. This sequence contains the information needed to determine the final, functional tertiary structure: the specific three-dimensional shape of the folded protein. This relationship is what makes it possible, at least in principle, to predict the tertiary structure, and thus the protein's function, directly from the amino acid sequence.

fig4

The Central Dogma

The flow of information within the cell is governed by the Central Dogma of molecular biology, summarized as follows: DNA is used to create RNA, and RNA is subsequently used to synthesize protein. The formation of RNA from DNA is called transcription, while the synthesis of protein from RNA is called translation. It is important to acknowledge that the word dogma is technically misleading; exceptions to this flow do exist (for instance, the synthesis of DNA from RNA via reverse transcription). Regarding the naming, there is a fascinating article from Asimov Press about the specific meaning that Francis Crick intended when he originally coined the term.

fig5

The Genetic Code

The Genetic Code is the set of rules cells use to translate the information encoded in RNA into the sequence of amino acids that make up proteins. The challenge is that RNA has a four-letter alphabet (four nucleotide bases) to specify 20 different amino acids. This is resolved by organizing the bases into groups of three, known as codons. A three-letter codon yields 64 possible combinations, which is far more than the 20 required. This is why the code is described as degenerate: most amino acids are specified by more than one unique codon. This redundancy helps safeguard against errors during replication or transcription, ensuring that minor genetic changes are less likely to disrupt protein structure and function.

fig6
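To make the codon logic concrete, here is a minimal Python sketch of the central dogma in action: it transcribes a short stretch of coding-strand DNA into mRNA and then translates the codons into amino acids. The codon table is deliberately partial (the real one has 64 entries) and the sequence is a toy example, not a real gene.

```python
# A toy illustration of transcription and translation.
# The codon table below is deliberately partial; the standard table has 64 entries.

CODON_TABLE = {  # partial standard genetic code, one-letter amino acid symbols
    "AUG": "M",                                        # methionine, also the start codon
    "UUU": "F", "UUC": "F",                            # phenylalanine (two codons -> degeneracy)
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",    # glycine
    "UAA": "*", "UAG": "*", "UGA": "*",                # stop codons
}

def transcribe(coding_strand_dna: str) -> str:
    """Transcription, simplified: the mRNA matches the coding strand with T replaced by U."""
    return coding_strand_dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Translation: read codons (triplets) until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")  # "?" = codon not in our partial table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

dna = "ATGTTTGGCTAA"       # toy coding-strand DNA
mrna = transcribe(dna)      # "AUGUUUGGCUAA"
print(translate(mrna))      # "MFG" (Met-Phe-Gly, then the UAA stop codon)
```

Notice the degeneracy in miniature: UUU and UUC both map to phenylalanine, so a mutation in that third position leaves the protein unchanged.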

The Rise of Language Models

fig7

The Language Metaphor

Metaphors and analogies are valuable conceptual tools, offering a framework for understanding complex information by drawing a comparison to something we are familiar with. One thing we’re all familiar with is language. In this article, we’ve already used "book of life" to describe our genome, and terms like "vocabulary" and "alphabet" to describe the building blocks of DNA, RNA, and proteins. Interestingly, both linguistics and molecular biology experienced a massive upheaval in the 1950s: the discovery of the structure of DNA occurred around the same time Noam Chomsky introduced generative grammar in linguistics.

Taking proteins as an example, we can think of amino acids as letters; domains (regions conserved across evolution that perform a defined task) as words; and the entire folded protein as a sentence. The ultimate meaning of this sentence, the complete three-dimensional structure of the protein, is its specific biological function. This line of thinking directly suggests we can leverage tools from Natural Language Processing (NLP) to analyze and understand biological sequences.

fig8

It is important to note, however, that there is no clear and precise correspondence between natural language and biological sequences. The analogy is subject to various interpretations; for instance, some research views domains as accurate analogs for words, while others consider them more akin to complex phrases. Nevertheless, these linguistic comparisons remain highly inspiring if used with caution, and the underlying concepts are compelling to explore. For those interested in a visual explanation of this topic, there is a fantastic Kurzgesagt video that we strongly recommend watching.

fig9

Another important aspect of the relationship between language and biomolecules is that DNA, RNA, and proteins are best viewed as different biological modalities rather than different languages. Regarding the broader linguistic comparison, the concept of a "language" can be compared to a species, an analogy that was employed by Charles Darwin himself.

Inevitable Evolution

Large Language Models (LLMs) have revolutionized the field of language understanding, redefining what is possible in artificial intelligence. LLMs’ ability to generate natural, fluid, and coherent text has rendered previous metrics of language comprehension obsolete! Why not employ their comprehension skills in decoding the language of life?

Hold on, isn’t this preposterous? Aren’t we taking the language analogy too far? Why would we assume that the success of LLMs in deciphering natural language can simply be transferred to a vastly more complex system like biological molecules?

Pondering this question forces us to confront an equally profound mystery: How did LLMs ever succeed in comprehending natural language in the first place? It is fascinating that the more we understand the mechanics behind LLMs, the more baffling and perplexing (pun intended!) their success appears.

A language model is simply a predictive tool that anticipates the next token in a given sequence. Fundamentally, written language is just a series of arbitrary symbols or tokens, unique only because of the way it was generated: by us, human beings. Out of all the myriad possible combinations of words, we selectively craft text to convey intended meaning. This is precisely why LLMs, when trained on colossal corpora of human text, are able to derive semantic meaning. Similarly, biological sequences are not random; they were rigorously selected by evolution out of all other possible sequence combinations, specifically to convey a "meaning" represented by a function essential for life. Therefore, a message that can be decoded must exist in these systems. While some gibberish might slip through in human text, an unfortunate organism carrying a gibberish, non-functional protein sequence will not survive long enough for its sequence to be used in training our language models. It’s all just tokens; what truly matters is the message.
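If "predict the next token" still feels abstract, here is a deliberately naive sketch: a bigram counter over a few made-up protein fragments. Real protein language models replace these counts with deep networks trained on hundreds of millions of sequences, but the prediction task is conceptually the same. The fragments below are toy strings, not real proteins.

```python
# A minimal sketch of language modeling as next-token prediction:
# a bigram (previous residue -> next residue) count model over toy fragments.
from collections import Counter, defaultdict

toy_sequences = ["MKTAYIAKQR", "MKVLAAGICA", "MKTIIALSYI"]  # toy fragments

counts = defaultdict(Counter)
for seq in toy_sequences:
    for prev, nxt in zip(seq, seq[1:]):
        counts[prev][nxt] += 1  # how often `nxt` follows `prev`

def next_token_probs(prev: str) -> dict:
    """Empirical probability distribution over the residue following `prev`."""
    total = sum(counts[prev].values())
    return {aa: c / total for aa, c in counts[prev].items()}

print(next_token_probs("K"))  # {'T': 0.5, 'Q': 0.25, 'V': 0.25}
```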

A Different Kind of Vectors

In 2013, the introduction of Word2vec marked the beginning of a new era in natural language processing. It has been cited over 40,000 times and, ten years after its publication, was recognized with the NeurIPS 2023 Test of Time award. Soon after, in 2015, the success of Word2vec in generating useful word embeddings was carried over to biological sequences with BioVec, a general-purpose vector representation applicable to a wide range of biological problems, with ProtVec as its instantiation for protein sequences.

fig10
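As a rough sketch of the ProtVec idea, one can chop protein sequences into 3-mer "words" and feed them to Word2vec; gensim is used here as a convenient implementation choice, the sequences are toy placeholders, and the original paper's exact protocol (three shifted, non-overlapping 3-mer lists per sequence) is simplified to overlapping 3-mers.

```python
# A ProtVec-style sketch: protein 3-mers as "words", embedded with Word2vec.
from gensim.models import Word2Vec

def to_kmers(sequence: str, k: int = 3) -> list[str]:
    """Break a protein sequence into overlapping k-mers."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

corpus = [to_kmers(seq) for seq in ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
                                    "MKVLAAGICALLASSHAVEAQ"]]  # toy sequences

model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1)
vector = model.wv["MKT"]  # 100-dimensional embedding for the 3-mer "MKT"
print(vector.shape)
```

Just as with words, 3-mers that occur in similar sequence contexts should end up with nearby vectors.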

Biological language models (BioLLMs) are predictive tools that anticipate tokens in a biological sequence, whether the next token or a masked one, and whether that token is a nucleotide in DNA and RNA (Genomic Language Model or gLM) or an amino acid in proteins (Protein Language Model or pLM). This is the primary distinction when comparing them to natural language models. Fortunately, and unlike the often-messy raw text data, biological data is already, to some extent, curated and maintained by experts within specialized biological databases. Leveraging this data, various deep learning architectures, from feedforward networks and Recurrent Neural Networks (RNNs) to Transformers, have been used to develop BioLLMs.
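As a concrete example, here is a minimal sketch of loading a small protein language model from the Hugging Face Hub and embedding a sequence. The choice of the facebook/esm2_t6_8M_UR50D checkpoint is an assumption made for illustration (any ESM-2 checkpoint works the same way), and the sequence is a toy example.

```python
# A minimal sketch: tokenize a protein sequence and extract per-residue embeddings.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "facebook/esm2_t6_8M_UR50D"  # small ESM-2 checkpoint, chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy protein sequence
inputs = tokenizer(sequence, return_tensors="pt")  # one token per amino acid (plus special tokens)

with torch.no_grad():
    outputs = model(**inputs)

residue_embeddings = outputs.last_hidden_state  # shape: (1, sequence length + special tokens, hidden dim)
print(residue_embeddings.shape)
```

These per-residue representations are exactly what we will probe in the next section on evaluation.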

Challenges In Evaluation

Another important distinction between LLMs and BioLLMs lies in our domain knowledge. In the case of the former, we apply something we barely understand (deep learning and language modeling technology) to process something we do understand (natural language), allowing us to easily judge performance by reviewing the output. This is not true for BioLLMs. Here, we barely understand both the technology and the underlying data; we cannot simply "read" DNA or protein sequences to review the output. Therefore, after training these models on huge amounts of biological data, a fundamental challenge remains: How do we truly know if their predictions are correct?

Language model evaluation is a rapidly evolving research field. Better evaluation methods drive progress and lead to better models. This is especially true for BioLLMs, but the task is significantly more challenging. In the case of Protein Language Models (pLMs), we can determine if the model has learned meaningful biological information through two approaches: examining its learned representations in a zero-shot manner, or by fine-tuning the model on various protein-related tasks where the ground truth is already known, and then assessing its performance.

For zero-shot evaluation, the most straightforward metric is perplexity, which measures how confident the model is in its sequence predictions from the set of available options (formally, the exponential of the average negative log-likelihood per token). A lower perplexity score indicates better performance. Since the pLM's vocabulary consists of roughly 25 tokens (the set of available options), an optimal score is 1 (certainty in a specific amino acid), while 25 would indicate that the model is making completely random predictions. Another example of zero-shot evaluation is examining whether the model has encoded underlying biochemical knowledge within its representations. Amino acids can be grouped according to their chemical characteristics, such as polarity, ionization, and side-chain type (aliphatic, aromatic, polar, etc.). Therefore, if we visualize the model's representations (embeddings) of these amino acids, we should expect to see them clustered accordingly.

fig11
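One way to run this "biochemistry check" is sketched below, again assuming the small facebook/esm2_t6_8M_UR50D checkpoint: grab the learned embedding vectors of the 20 standard amino acid tokens and project them to two dimensions with PCA, then look at which residues land near each other.

```python
# A sketch of the zero-shot biochemistry check: project the learned embeddings of
# the 20 standard amino acid tokens to 2D and inspect whether chemically similar
# residues (hydrophobic, aromatic, charged, ...) cluster together.
import torch
from sklearn.decomposition import PCA
from transformers import AutoTokenizer, AutoModel

model_name = "facebook/esm2_t6_8M_UR50D"  # chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

amino_acids = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues
token_ids = tokenizer.convert_tokens_to_ids(amino_acids)

with torch.no_grad():
    embeddings = model.get_input_embeddings().weight[token_ids]  # (20, hidden dim)

coords = PCA(n_components=2).fit_transform(embeddings.numpy())
for aa, (x, y) in zip(amino_acids, coords):
    print(f"{aa}: ({x:+.2f}, {y:+.2f})")
```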

We can also fine-tune these models on available datasets for various protein-related tasks. For instance, fluorescent proteins (FPs), like the Green Fluorescent Protein (GFP) that glows when exposed to light of a specific wavelength, are widely used in cell imaging to track molecular events inside living cells. The ability to predict fluorescence is important for selecting the best FP variants based on brightness, photostability, or other properties. Using available experimental data, we can evaluate the model's prediction of a protein's fluorescence intensity, which is fundamentally a regression task.

fig12
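A lightweight version of this evaluation is sketched below: rather than full fine-tuning, it fits a simple regression head on frozen, mean-pooled embeddings. The checkpoint choice, the GFP-variant fragments, and the brightness labels are all placeholders standing in for a real dataset such as the fluorescence set from TAPE.

```python
# A sketch of fluorescence prediction as regression on frozen pLM embeddings.
import torch
from sklearn.linear_model import Ridge
from transformers import AutoTokenizer, AutoModel

model_name = "facebook/esm2_t6_8M_UR50D"  # chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

def embed(sequence: str) -> torch.Tensor:
    """Mean-pooled embedding of one protein sequence."""
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, length, dim)
    return hidden.mean(dim=1).squeeze(0)

variants = ["MSKGEELFTG", "MSKGEALFTG", "MSKGEELFTA"]   # placeholder GFP-variant fragments
brightness = [3.7, 1.2, 3.5]                            # placeholder fluorescence labels

X = torch.stack([embed(seq) for seq in variants]).numpy()
regressor = Ridge().fit(X, brightness)
print(regressor.predict(X[:1]))  # predicted brightness for the first variant
```

Fully fine-tuning the backbone usually helps further; the frozen-embedding baseline is just a cheap first check of how much task-relevant signal the representations already carry.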

Another valuable task is protein stability prediction. Designing stable proteins is critical for ensuring that drug candidates remain intact long enough to act before they are degraded. Since traditional experimental methods are highly time-consuming, a model capable of predicting how sequence changes or environmental factors affect stability would be immensely useful. Beyond these two tasks, numerous others exist that could be used to evaluate BioLLMs and would greatly benefit from their capabilities.

fig13

And finally, the most important task is protein structure prediction: the holy grail of structural biology, and the very task for which AlphaFold2 changed things forever.

The AlphaFold Revolution

E pur si muove (And yet it moves). - Galileo Galilei

fig14

Protein structure prediction is the process of deducing a protein's three-dimensional structure from its linear amino acid sequence: predicting the secondary and tertiary structure from the primary structure. As previously mentioned, the protein’s three-dimensional structure is what encodes its function, and this prediction capability is, needless to say, essential for countless applications in medicine, drug design, and biotechnology.

To predict a protein's final structure, we must determine how its unstable linear chain of amino acids will fold precisely in three-dimensional space. But protein folding is an immensely difficult (NP-hard) problem, so difficult that it seems paradoxical that proteins themselves manage to fold so quickly and reliably (Levinthal's paradox). And obtaining a protein's structure experimentally is an extremely involved and time-consuming process, typically requiring large amounts of purified protein. Techniques such as X-ray crystallography and cryo-electron microscopy (cryo-EM) rely on highly specialized, expensive equipment. Due to this complexity, researchers often spend months or even years obtaining a single validated, high-resolution three-dimensional structure.

In 1994, the Critical Assessment of Structure Prediction (CASP) competition was launched to motivate more researchers to work on the protein structure prediction problem. Every two years, hundreds of research groups from all over the world participate in CASP. In its 14th iteration, held in 2020, DeepMind's AlphaFold2 delivered an "ImageNet moment" for computational biology by achieving accuracy significantly higher than any previous entry, scoring above 90 (on the 0-100 GDT scale) for roughly two-thirds of the target proteins.

fig15

One prominent modification in AlphaFold2, compared to its previous iteration, is the Evoformer: a specialized attention-based module that plays a role analogous to a protein language model, operating over sets of evolutionarily related sequences. This achievement led to Demis Hassabis and John Jumper of Google DeepMind being awarded one half of the 2024 Nobel Prize in Chemistry for "protein structure prediction", while the other half went to David Baker for "computational protein design" (another story to tell).

But the most inspiring achievement of AlphaFold2 is arguably the way it demonstrates that we can still overcome incredibly hard problems by finding solutions that are practically sufficient. AlphaFold2 didn’t solve the protein folding problem, but its predictions are accurate and useful enough to immediately open the gate to several life-changing applications.

Overview of Biological Language Models

Let’s now have a look at some of the landmark examples of protein language models in this concluding section. It’s important to note that the field is currently experiencing a sort of Cambrian explosion; countless other seminal works cannot be fairly covered in a single article. Here, we will focus on the historical aspect, briefly reviewing three different models that represent distinct architectural choices and follow varying development philosophies.

ESM

Philosophy: Scale

Introduced in 2019 by Meta (then Facebook AI Research), the Evolutionary Scale Model (ESM) became the first protein language model to successfully scale both its Transformer architecture and its training data. Previous work, such as TAPE, utilized Transformers but was limited to models smaller than 50 million parameters trained on only about 30 million protein sequences. In contrast, the ESM models range in size from 40 million parameters up to the 650 million parameters of the large ESM-1b variant, which was trained on a massive dataset of 250 million protein sequences. ESM was trained with a Masked Language Modeling (MLM) objective similar to BERT's. The initial ESM model was succeeded by ESM-2, which kept a very similar architecture while doubling down on scale, and later by the multimodal ESM-3. Although Meta disbanded its protein research team in 2023, the work fortunately continued: members of the team regrouped to form the new company EvolutionaryScale, where they developed ESM-3.

fig16
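To see the MLM objective in action, here is a minimal sketch that hides a single residue and asks the model to recover it from context. The small facebook/esm2_t6_8M_UR50D checkpoint is used as a stand-in for the larger ESM models, and the sequence is again a toy example.

```python
# A minimal sketch of ESM's BERT-style masked language modeling objective:
# mask one residue and ask the model to predict it from the surrounding context.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "facebook/esm2_t6_8M_UR50D"  # chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence
inputs = tokenizer(sequence, return_tensors="pt")

residue_index = 5                      # 0-based index of the residue to hide ("I")
token_position = residue_index + 1     # +1 because position 0 is the <cls> token
masked_ids = inputs.input_ids.clone()
masked_ids[0, token_position] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(input_ids=masked_ids, attention_mask=inputs.attention_mask).logits

predicted_id = logits[0, token_position].argmax().item()
print("True residue:", sequence[residue_index],
      "| model's guess:", tokenizer.convert_ids_to_tokens(predicted_id))
```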

ProGen

Philosophy: Control

ProGen, introduced in 2020 by Salesforce Research, is dedicated to controllable protein generation: a protein design task where sequences are generated subject to specified conditions, such as the host organism or cellular location. It adopts a decoder-only architecture, following the controllable language generation approach pioneered by Salesforce's earlier CTRL model. This 1.2-billion-parameter model was trained on approximately 280 million protein sequences, conditioned on taxonomic and keyword tags such as molecular function. Notable authors include Richard Socher and Bryan McCann, who later founded the search engine You.com. ProGen was succeeded by ProGen-2 (also from Salesforce) and ProGen-3, the latter developed by Profluent, a recent startup founded by ProGen authors.

fig17

Ankh

Philosophy: Optimize

Named after the ancient Egyptian symbol for the key of life, Ankh was introduced by Proteinea in 2023. Ankh champions a data-efficient, knowledge-guided optimization strategy that explicitly balances cost and performance, rather than relying solely on massive scaling. It adopts an encoder-decoder architecture based on ProtT5 and offers two models: Ankh-base (approximately 450 million parameters) and Ankh-large (approximately 1.1 billion parameters). Remarkably, the smaller Ankh-base achieved performance comparable to the 15-billion-parameter ESM-2 model. Ankh was succeeded by Ankh-2 and Ankh-3, though limited details are available concerning these later releases.

fig18

Suggested Readings

TAPE: One of the earliest general benchmarks for protein language models; its paper provides a great overview of the many protein-related tasks used to evaluate these models.

BERTology Meets Biology: This amazing paper from Salesforce Research extends the line of interpretability research to Transformer protein language models through the lens of attention.

AlphaFold2 and its applications: This Nature paper offers a nice overview of the many applications of AlphaFold2 in the fields of biology and medicine.
