An Introduction to Large Language Models in Biology and Bioinformatics

Large Language Models, a special kind of “Artificial Neural Network” trained to “understand” and “reason” over natural text, computer code, and the languages of biology, are transforming bioinformatics. Read on in this primer to learn the basics about LLMs in all the forms useful to your research, and to see how we at Nexco are using them to innovate bioinformatics tools and services, including adapting LLMs to your specific needs with privacy and customizability in mind.

AI-generated representation of a neural network

Index

Introduction
- Large Language Models
- Direct Predictions vs. Transfer Learning/Fine-Tuning vs. Analysis of Embeddings
Natural Language Models in Biology
- Mining Literature, as a 24/7 Assistant, and Smoothing Interactions with Software
Biological Large Language Models
- Example: Protein Language Models
- Example: Single-Cell Language Models
- Foundational Models for Biology
Leveraging Large Language Models at Nexco

Introduction

Various flavors of “Artificial Intelligence” (AI), also called “Machine Learning” or “Deep Learning” methods and originally known as “Artificial Neural Networks” (ANNs), have been applied to the natural sciences for at least two to three decades. However, it is only in the last 5–10 years that AI has made a substantial impact, a particularly strong one in disciplines like biology. This happened mainly due to two factors: the availability of massive amounts of data, which AI models absolutely require in order to “learn”, and the development of new mathematical and computational algorithms at the core of these methods, which are, in essence, “nothing but” highly non-linear mathematical models with multidimensional inputs and outputs.

Among the most relevant advances in modern AI, a special kind of model called Large Language Models (LLMs) has gained attention for the ability to handle complex patterns in sequential data and to “reason” on it, even displaying “sparks of intelligence” as some have called it (a statement taken from a study of OpenAI’s GPT-4 by Microsoft Research). At their core, and like all AI models, LLMs are built upon ANNs, which are computational/mathematical frameworks designed to mimic, if we can permit ourselves such a comparison, how biological neurons work in the brain. Like their flesh analogs, each neuron of an ANN takes inputs from other neurons and sends outputs to others. Mathematically, each neuron performs a very simple operation, essentially a weighted sum of its inputs passed through a simple non-linear “activation” function; however, the dense network of connections between neurons allows mixing different kinds of inputs and outputs and concatenating them into highly nested computations, so that the network as a whole models highly non-linear responses efficiently even though each individual neuron’s processing is almost trivial. That’s how ANNs become highly non-linear mathematical models with multidimensional inputs and outputs, with high malleability and adaptability to different kinds of data and tasks.

In particular, LLMs excel at analyzing and predicting sequences of information, such as the words or word pieces in natural language (called “tokens”, which do not necessarily map exactly to words or syllables but often to one or a few consecutive characters), or the arrangement of amino acids in a protein, bases in a gene, or atoms in a small molecule (again handled as “tokens”, which in these cases usually do correspond to individual amino acids, bases, or atoms).
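
To make the notion of “tokens” more concrete, here is a minimal sketch (assuming the Hugging Face transformers library and two publicly available tokenizers) showing how a natural-language sentence is split into sub-word pieces while a protein sequence is tokenized amino acid by amino acid:

```python
# A minimal sketch of tokenization, assuming the Hugging Face "transformers"
# library is installed. The checkpoints are real public models; any equivalent
# tokenizers would illustrate the same point.
from transformers import AutoTokenizer

# Natural-language tokenizer: the sentence is split into sub-word pieces,
# which do not always correspond to whole words.
nl_tok = AutoTokenizer.from_pretrained("gpt2")
print(nl_tok.tokenize("RNA splicing removes introns"))

# Protein-language tokenizer (ESM-2): typically one token per amino acid.
prot_tok = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
print(prot_tok.tokenize("MKTAYIAKQR"))
```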

In this blog post we will focus on LLMs, in particular on how they can be used to boost biological research in various ways: from useful tasks that every biologist can do with a simple LLM interface like ChatGPT, to LLM-based boosts of a bioinformatician’s skills, to the integration of LLMs into other software to facilitate its use, and to training specialized LLMs on protein sequences or genome data to analyze datasets and make biological predictions.

Large Language Models

A widely recognized example of an LLM is ChatGPT, which many biologists may already be familiar with. You may also have heard of, or even used, other LLMs such as Gemini, Llama, or Claude. Each with its own variations in the exact mathematical architecture and in the kinds of data used for training, these AI systems are all trained to recognize patterns within text by processing massive amounts of data, mostly human-readable text and computer code. The ability of these LLMs to “understand”, “reason”, and generate human language and computer code stems from this training.

For broad-knowledge LLMs, the data used for training consists of a huge corpus of text dominated by human language but also containing a substantial amount of other kinds of information, such as computer code. Thus, the basic capabilities that emerge naturally upon training include not only quite good performance at understanding and generating text but also at analyzing and writing code. Of direct use to biologists and bioinformaticians, these “basic” capabilities allow LLMs to perform tasks like summarizing scientific papers, looking for specific pieces of information in a huge repository of text, or assisting with coding for bioinformatics applications. Most importantly, LLMs can be adapted to solve problems outside of their original training domain, a flexibility that makes them valuable across a range of applications, including biological research, as we will see.

The power of LLMs comes from their underlying “architecture”, the word used in jargon to refer to how the artificial neurons are “wired” together. The key mathematical/computational unit in LLMs is the so-called “transformer”, developed by Google Brain back in 2017, which is capable of capturing relationships across tokens spread throughout long sequences. As we saw above, these tokens can be pieces of natural language text or computer code, amino acids or bases in a biological macromolecule, atoms in a molecule, etc. As a transformer (or a whole LLM) is exposed to large datasets during training, it “learns” the intricate patterns that govern the relationships between tokens, and thus the structure and relationships of the underlying data, which ultimately allows LLMs to “reason” on the data and to generate outputs that make logical sense.
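
For readers curious about what the transformer actually computes, below is a purely illustrative sketch of its core operation, scaled dot-product self-attention, written with NumPy: each token’s output becomes a weighted mix of all the tokens in the sequence, which is what lets the model capture relationships across long stretches of text or sequence. All shapes and values are made up.

```python
# A minimal, illustrative sketch of scaled dot-product self-attention,
# the core operation inside a transformer. Shapes and values are made up.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (n_tokens, d_model) token embeddings; W*: learned projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # similarity between every pair of tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the tokens
    return weights @ V                        # each token = weighted mix of all tokens

rng = np.random.default_rng(0)
n_tokens, d = 6, 8                            # e.g. a 6-token sequence
X = rng.normal(size=(n_tokens, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (6, 8): one updated vector per token
```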

Training of an LLM takes place by hiding tokens, either by masking tokens within the sequence or by truncating it, and asking the model to predict them back. The training procedure adapts the numerical values (the “weights”) in the mathematical functions that connect the artificial neurons until the model can reliably recover the hidden tokens. By being careful with the training procedure, the quality of the training data, the use of independent data to check the model’s performance and reliability, and various other factors, one can obtain the kind of “smart” models we have today at our disposal.
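
The sketch below illustrates the masked-token objective using a small public masked language model from Hugging Face transformers (bert-base-uncased is just a convenient example); a real training run would back-propagate this loss over a huge corpus, whereas here we only compute it for a single sentence:

```python
# A hedged sketch of the masked-token training objective. In real training this
# loss is minimized over billions of tokens; here we only compute it once.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "RNA splicing removes introns from pre-mRNA transcripts."
inputs = tok(text, return_tensors="pt")
labels = inputs["input_ids"].clone()

# Randomly hide ~15% of the (non-special) tokens; the model must predict them back.
special = torch.tensor(
    tok.get_special_tokens_mask(labels[0].tolist(), already_has_special_tokens=True)
).bool()
mask = (torch.rand(labels.shape[-1]) < 0.15) & ~special
if not mask.any():                             # make sure at least one token is hidden
    mask[1] = True
inputs["input_ids"][0, mask] = tok.mask_token_id
labels[0, ~mask] = -100                        # only hidden positions contribute to the loss

loss = model(**inputs, labels=labels).loss     # cross-entropy on the hidden tokens
print(float(loss))
```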

Direct Predictions vs. Transfer Learning/Fine-Tuning vs. Analysis of Embeddings

LLMs can be put to work in at least three different ways:

  • By directly making predictions, i.e. generating text given a user-provided prompt that contains questions and guidelines, possibly together with a few examples included in the prompt (what’s called “few-shot learning”). This relies on the model’s inherent abilities to generate immediate predictions; and here’s where “prompt engineering” comes in handy, as it has been shown that proper prompting of LLMs can largely enhance their reasoning capabilities (see DeepMind’s study of various LLMs).
  • Through transfer learning or other task-specific “fine-tunings” and adaptations of the core LLM. This also builds on the model’s inherent abilities, but the model’s outputs or internal representations are typically fed into additional modules tuned to carry out more specific tasks.
  • By analyzing the so-called “embeddings”, the numerical patterns with which the artificial neurons “fire” when a certain piece of information is passed through the LLM. This leverages the internal representations learned during training, applying them to more specialized tasks.

Direct Predictions and Few-Shot Learning. Direct prediction tasks harness a model’s pre-trained knowledge to make immediate assessments or generate outputs. For instance, protein language models can predict the likelihood of specific amino acids appearing in a sequence without requiring any additional training data. By masking parts of the sequence and using its pre-trained understanding of protein structures, the model can estimate the effects of mutations or identify functionally relevant residues. In single-cell biology, these models can simulate gene perturbations or deletions to predict changes in cell states directly.
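
As a concrete, hedged sketch of direct prediction with a protein language model (using the small public ESM-2 checkpoint facebook/esm2_t6_8M_UR50D; the sequence, position and mutation are made up), one can mask a position of interest and compare the probabilities the model assigns to the wild-type and mutant residues:

```python
# A hedged sketch of "direct prediction" with a protein language model:
# mask one position and compare the model's probabilities for the wild-type
# residue and a candidate mutation. Sequence, position and mutation are made up.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "facebook/esm2_t6_8M_UR50D"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
pos = 10                                       # 0-based position in the protein sequence

ids = tok(seq, return_tensors="pt")["input_ids"]
ids[0, pos + 1] = tok.mask_token_id            # +1 to skip the <cls> token

with torch.no_grad():
    logits = model(input_ids=ids).logits
probs = logits[0, pos + 1].softmax(dim=-1)     # distribution over amino acids at that site

wt, mut = seq[pos], "W"                        # e.g. score a substitution to tryptophan
wt_p = float(probs[tok.convert_tokens_to_ids(wt)])
mut_p = float(probs[tok.convert_tokens_to_ids(mut)])
print(f"P({wt}) = {wt_p:.3f}   P({mut}) = {mut_p:.3f}")
```

A much lower probability for the mutant residue than for the wild-type one is commonly read as a hint that the substitution may be poorly tolerated.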

Additionally, few-shot learning enhances direct predictions by allowing a model to adapt to new scenarios with minimal examples. For example, when using natural language models for biological literature review, researchers can prompt the model with a few examples of complex queries to obtain tailored responses. This flexibility enables the models to interpret and respond to nuanced biological questions, even if they were not explicitly trained for that specific context.
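
Here is a minimal sketch of few-shot prompting through an API, assuming the openai Python client and a valid key; the model name, the worked examples and the extraction task are placeholders, and any chat-capable LLM would work similarly:

```python
# A minimal sketch of few-shot prompting via an LLM API. Assumes the "openai"
# Python client and an API key in the environment; the model name and the
# examples are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

few_shot = [
    {"role": "system", "content": "You extract gene-disease associations from text as 'GENE - DISEASE'."},
    # Two worked examples ("shots") that show the model the expected format:
    {"role": "user", "content": "BRCA1 mutations strongly predispose carriers to breast cancer."},
    {"role": "assistant", "content": "BRCA1 - breast cancer"},
    {"role": "user", "content": "Loss of function of CFTR causes cystic fibrosis."},
    {"role": "assistant", "content": "CFTR - cystic fibrosis"},
    # The actual query:
    {"role": "user", "content": "Variants in LRRK2 are a common genetic cause of Parkinson's disease."},
]

reply = client.chat.completions.create(model="gpt-4o-mini", messages=few_shot)
print(reply.choices[0].message.content)        # ideally: "LRRK2 - Parkinson's disease"
```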

Transfer Learning and Fine-Tuning. While direct predictions can be powerful, some biological tasks require more specialized adjustments. Transfer learning enables researchers to leverage the broad knowledge encoded within a pre-trained model and adapt it to new, more specific tasks. This is commonly achieved by fine-tuning: modifying the model’s parameters using a smaller, task-specific dataset. For example, protein language models like ESM-2 can be fine-tuned to predict specific properties, such as protein stability or immune escape potential in viral proteins, or even 3D protein structures (ESMFold). Similarly, single-cell models like Geneformer can be fine-tuned for cell-type labeling or multimodal data integration. We will cover these examples in some more depth later on.

Fine-tuning offers the advantage of customizing a generalist model’s broad understanding to make it highly accurate for a particular task. However, this process can be computationally intensive. To address this, researchers have developed more efficient fine-tuning techniques, enabling smaller laboratories to use large-scale models for specific research needs without requiring extensive computational resources.
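
One popular efficient technique is low-rank adaptation (LoRA), which trains only small adapter matrices while keeping the pre-trained weights frozen. The sketch below (assuming the transformers and peft libraries, and using the small ESM-2 checkpoint for a hypothetical binary classification task) shows the general idea:

```python
# A hedged sketch of parameter-efficient fine-tuning (LoRA) of a protein
# language model for a hypothetical binary task (e.g. "stable" vs "unstable").
# Assumes the transformers and peft libraries; the task and labels are made up.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

name = "facebook/esm2_t6_8M_UR50D"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# Only small low-rank adapter matrices (and the new classification head) are
# trained; the pre-trained weights stay frozen, keeping compute and memory low.
lora = LoraConfig(
    task_type="SEQ_CLS",
    r=8,
    lora_alpha=16,
    target_modules=["query", "value"],
    modules_to_save=["classifier"],
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()             # typically only a small fraction of the model

# From here, a standard transformers Trainer (or a plain PyTorch loop) over a
# labeled dataset of sequences would complete the fine-tuning.
```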

Embedding Analysis. A key strength of both natural and biological language models lies in their ability to generate embeddings: internal vector representations that capture complex relationships within the input data. During the processing of text, protein sequences, or single-cell gene expression profiles, the model creates embeddings for individual elements (such as amino acids or genes) and their context within the sequence.

Embeddings act as dimensionality reduction and data-linking units, and since their calculation requires no additional training of the model, they provide a fast, powerful and resource-efficient way to explore large-scale biological datasets. Namely, embeddings can be used to extract meaningful insights without altering the model’s parameters.

In protein research, clustering protein embeddings can identify homologous proteins, aiding in the construction of multiple sequence alignments and informing protein engineering efforts. Similarly, in single-cell omics, the embeddings for each gene can be combined to generate cell representations that facilitate cell clustering, visualization, and identification of subtle subtypes, even in the presence of batch effects.
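
As a hedged sketch of embedding analysis for proteins (again with the small ESM-2 checkpoint and made-up sequences), one can mean-pool the per-residue hidden states into one vector per protein and cluster those vectors with standard tools such as scikit-learn:

```python
# A hedged sketch of embedding analysis: mean-pool per-residue embeddings from
# a protein language model into one vector per protein, then cluster the
# proteins. Sequences are made up; scikit-learn handles the clustering.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans

name = "facebook/esm2_t6_8M_UR50D"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
             "MKTAYIAKQRQISFVKSHFARQLEERLGLIEVQ",   # close homolog of the first
             "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS"]   # unrelated sequence

embeddings = []
with torch.no_grad():
    for seq in sequences:
        inputs = tok(seq, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state[0]    # (length, hidden_dim)
        embeddings.append(hidden.mean(dim=0).numpy())    # one vector per protein

labels = KMeans(n_clusters=2, n_init="auto").fit_predict(embeddings)
print(labels)   # related sequences should fall into the same cluster
```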

In protein structure prediction models like Meta’s ESMFold, the underlying LLM (ESM-2) itself is trained to predict masked amino acids, but in learning to do this the LLM also acquired an internal representation of protein structure and evolution. Thus, extracting and properly processing the right set of embeddings allows the model to produce 3D structural models of the protein sequences it processes.

The choice between using direct predictions, fine-tuning/transfer learning, and embedding analysis often depends on the specific research question and the available computational resources. Direct predictions, potentially enhanced with few-shot learning, provide quick and adaptable responses for many biological queries. In contrast, transfer learning and fine-tuning offer highly specialized solutions at the cost of increased computational effort. Embedding analysis, meanwhile, serves as a versatile, lightweight tool for exploring relationships and patterns in biological data.

Natural Language Models in Biology

Mining Literature, as a 24/7 Assistant, and Smoothing Interactions with Software

LLMs trained on human-readable text, which as we saw above includes computer code, are transforming how scientists can access, browse, search, understand, and utilize the vast amount of information available in scientific literature. LLMs can quickly navigate large volumes of text to summarize complex biological concepts, assist with unfamiliar terms, and streamline coding tasks. Their versatility makes them a valuable tool for biologists who deal with both technical language and specialized software. Let’s look at some examples.

Processing biological literature. Biological research generates an overwhelming amount of written content, from research papers and reviews to textbooks and online resources. For example, PubMed alone gathers over 37 million abstracts, and it covers only biomedical and life-science-related fields.

Keeping up with such amounts of information is certainly challenging. Even simply searching for the answer to a very specific question, or finding out whether a long text merely touches on a topic, can involve huge effort with traditional methods. But LLMs excel at understanding and summarizing natural language, which allows them to process and distill vast amounts of biological literature into easily understandable summaries, as well as to search, extract, and even format specific pieces of information. LLMs are also knowledgeable enough to help researchers get quick overviews, at an introductory level, of virtually any topic. Moreover, by using the right prompts, users can make LLMs tune their explanations to their exact background and specific needs.

For example, if a scientist asks an LLM to “explain a simple algorithm for multiple sequence alignment for someone with an introductory biology background”, the LLM won’t spend time explaining what protein sequences are but will instead try to keep the explanation of the algorithm itself simple; the opposite will happen if the request ends with “for someone expert in computer science”. Modern LLMs have been trained with very broad knowledge and fine-tuned to be clear in their “reasoning” steps, and they are remarkably adaptive, capable of offering explanations that range from basic overviews to detailed technical discussions, even writing pseudocode or code, depending on the user’s needs.

Beyond summarizing concepts, natural language models can help researchers quickly digest new scientific publications. For instance, they can read a paper’s abstract or full text and generate a quick summary or answer specific questions about the methods or findings. This capability reduces the time required to stay up-to-date with the latest developments, making literature reviews more efficient.

Assisting with unfamiliar concepts. One of the most immediate applications of LLMs trained on natural language text is their ability to assist researchers when they encounter unfamiliar terms or concepts. Biology, like all other natural sciences, is filled with specialized terminology, and grasping this language can be quite difficult, especially for interdisciplinary scientists. LLMs can act as intelligent assistants, offering real-time explanations that adapt to context and the user’s own background. For example, if a biologist comes across a term like “RNA splicing” while reading a paper, they can quickly ask an LLM for a clear explanation or even contextual information related to their specific area of interest.

This can be especially useful in fields like bioinformatics or systems biology, where researchers might frequently encounter new terms and methodologies. By providing on-the-spot definitions and clarifications, LLMs can help biologists focus on their research without having to constantly pause to consult textbooks or online resources.

Writing code and commanding software. LLMs also enhance productivity by simplifying the work of bioinformaticians, particularly in areas where coding new methods and algorithms for data analysis is central. LLMs can also be extremely useful for assisting users of specialized software that operates via commands typed in a terminal or via scripting: users can now ask the LLM what commands to type in order to carry out a given operation. All this allows scientists to focus on the biological questions at hand rather than getting bogged down in technical implementation.

In addition, LLMs can help debug code. Researchers can input broken code or error messages into the model, which can then diagnose the issue and suggest corrections. This not only saves time but also reduces the frustration of troubleshooting unfamiliar programming problems.

Moreover, as LLMs become easier to plug into existing software, the programs themselves will begin to incorporate the option of just accepting inputs in natural language that are then internally converted into functions.

Finally, it is worth noting that LLMs trained to use software tools can learn to handle complex tasks in whole new ways, even creating largely automated pipelines. ChemCrow is an example of this: dedicated specifically to the field of chemistry, it allows users to pose complex problems in everyday language and have the model handle the technical aspects of interfacing with software for molecular synthesis.
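
To illustrate how natural-language requests can be converted into concrete software actions, here is a hedged sketch of “tool calling” with the openai Python client; the tool name run_blast, its schema and the model name are hypothetical placeholders, not an actual Nexco or ChemCrow interface:

```python
# A hedged sketch of "tool calling": the LLM turns a plain-English request into
# a structured call to a hypothetical bioinformatics function. Assumes the
# "openai" client; the tool "run_blast" and its schema are placeholders.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "run_blast",                    # hypothetical wrapper around a BLAST search
        "description": "Run a BLAST search of a protein sequence against a database.",
        "parameters": {
            "type": "object",
            "properties": {
                "sequence": {"type": "string"},
                "database": {"type": "string", "enum": ["nr", "swissprot"]},
                "evalue": {"type": "number"},
            },
            "required": ["sequence", "database"],
        },
    },
}]

reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content":
               "Search MKTAYIAKQR against SwissProt with an E-value cutoff of 1e-5."}],
    tools=tools,
)

call = reply.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
# The surrounding software would then actually execute the requested search.
```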

We have covered various ways to use natural language models to power bioinformatics, with more specific examples, here:

Biological Large Language Models

By properly adapting the kind of tokens and the network’s architecture, large language models can also be trained on digital information coming from biology rather than on human-readable text or computer code. For example, protein sequences, DNA sequences, RNA-seq data, etc. are all excellent substrates.

Example: Protein Language Models

Protein language models benefit from being pre-trained on massive datasets of protein sequences, allowing them to grasp important properties like evolutionary constraints and functional features of proteins. This pre-training enables the model to generalize across diverse sequences. Once these models are fine-tuned on smaller, specialized datasets, they can predict crucial protein properties, including stability, interactions, and even assist in designing new proteins with specific structures. Like natural language models, there are various types of protein language models, some trained on specific protein families or using different masking techniques to uncover sequence relationships.

Probably the most notable example of a protein language model is ESM-2, a transformer-based neural network trained on over 250 million protein sequences. During training, random amino acids in each sequence were masked, and the model was tasked with predicting the original amino acids based on the surrounding sequence. By mastering this task, the model learned structural patterns in protein families and even how their evolution is constrained by structure.

Application of protein language models in direct prediction mode is of limited use by itself, but it can help to computationally predict, for instance, which mutations are tolerated and likely to occur and which are instead deleterious. Embedding analysis and transfer learning have, instead, many more use cases. Embeddings can help to identify similar proteins by clustering related sequences, and can be used to construct multiple sequence alignments without prior annotations or evolutionary alignments, a common requirement for other methods. Through transfer learning, the knowledge gained during training can be repurposed for specific tasks by extracting the learned embeddings and using them in smaller, task-specific models; for example, to predict protein stability, immune escape potential in viruses, and even the pathogenicity of mutations using minimal labeled data. As a culmination, as we saw earlier, ESMFold, an AI model that predicts protein structures from sequences, was derived from the ESM-2 model trained just to predict masked amino acids.

Example: Single-Cell Language Models

Single-cell gene expression data is highly complex, with millions of transcriptomes available across different species, tissues, and conditions. To make sense of this high-dimensional data, single-cell language models are trained on large datasets to understand gene interactions and cell-specific expression patterns. After pre-training, these models can be fine-tuned for tasks like cell-type identification and batch correction, making them adaptable across various datasets and for various purposes.

For example, we covered earlier in our blog Geneformer, a language model trained to predict masked genes in lists of genes that describe cells and their states. Training this model on around 30 million single-cell transcriptomes allowed it to gain a fairly fundamental understanding of gene network dynamics, which has been used to predict dosage-sensitive disease genes, identify candidate therapeutic targets, and predict chromatin and network dynamics.

In direct prediction mode, these models can simulate genetic perturbations or gene deletions to predict how gene expression changes in specific cells. Embedding analysis provides direct, simplified representations of cells, akin to dimensionality reduction methods, that then allow downstream analyses such as clustering, denoising, visualization, batch effect detection and removal, etc. Some argue that since these models are trained on large, diverse datasets, they can distinguish subtle differences between cell types even in the presence of experimental variability. Finally, single-cell models can be fine-tuned to predict specific properties of cells, such as their type or state. For example, models like scGPT, which we have also covered in previous posts, are capable of integrating multimodal data, combining gene expression with other factors like chromatin accessibility or protein levels. This makes them highly versatile for applications requiring dataset integration across different biological modalities.
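
As a sketch of what such downstream analysis might look like in practice, assuming cell embeddings have already been extracted with a single-cell language model and stored in an AnnData object under a made-up key, and that the scanpy library is available:

```python
# A hedged sketch of downstream analysis on cell embeddings produced by a
# single-cell language model (e.g. Geneformer or scGPT). Assumes the embeddings
# were already computed and stored in adata.obsm["X_llm"] (a made-up key), and
# that a "batch" column exists in adata.obs.
import scanpy as sc

adata = sc.read_h5ad("my_dataset.h5ad")        # hypothetical AnnData with precomputed embeddings

# Build the neighborhood graph on the model's cell embeddings instead of raw counts or PCA.
sc.pp.neighbors(adata, use_rep="X_llm")

sc.tl.leiden(adata, key_added="llm_clusters")  # cluster cells in embedding space
sc.tl.umap(adata)                              # 2D visualization of the embedding space
sc.pl.umap(adata, color=["llm_clusters", "batch"])  # inspect clusters and residual batch effects
```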

Foundational Models for Biology

Many of the models we have described here are referred to as “foundational”. A foundational model is trained on large-scale, diverse datasets and captures a wide range of features and relationships inherent in the data. Often, direct application of foundational models is not of utmost interest; rather, their value lies in the general knowledge and understanding they encode, which can then be put into practical application via embedding analysis or, most often, by fine-tuning and transfer learning.

Foundational models are thus broad in scope and computationally demanding to train, as well as demanding in terms of human expert time for curating data and overseeing the training process; but in return they provide a broad base of understanding useful for various tasks, especially so in scientific applications. For biology, we saw above Geneformer and scGPT, and we have also covered other more “fundamental” foundational models such as Evo, which is what’s called “multimodal”: it can understand more than a single kind of data, in this case DNA, RNA, and protein sequences, thus allowing it to grasp the central dogma of molecular biology (and potentially everything that arises from it, although at this point we are probably setting expectations a bit too high for the moment).

AI models begin to capture the whole central dogma of molecular biology

We could also mention other fundamental, foundational, and also multimodal models such as AlphaFold 3. Although this one is not purely a language model, it does use modules that can parse and process tokens describing amino acids, nucleobases, and atoms:

AlphaFold 3 Advances the Future AI Technologies for Pharma and Biotech

For more on foundational models meant specifically to support biology and bioinformatics, you can consult our dedicated blog post:

Foundation Models for Biology to Support Bioinformatics

Leveraging Large Language Models at Nexco

At Nexco we are harnessing the power of modern LLMs to innovate bioinformatics tools and services along the various axes presented above. While for some applications we use paid API-based options, crucially, recent developments have made LLMs so much more accessible and efficient that we can run them locally. More specifically, advances in open-weight models and the availability of smaller, more manageable LLMs mean that we can now run LLMs on our own computers, providing significant benefits in terms of costs, customizability, and privacy, because the data we process never leaves our servers.

Unlike the massive LLMs that require extensive infrastructure, these “small” models contain “only” millions to a few billion parameters, and their training has been optimized for efficiency, targeting high inference accuracy at low cost, often for specific tasks. This holds for natural-language LLMs, with various competing “small” models, and also for models trained on biological data, such as ProtGPT2.
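
As a minimal sketch of what running an open-weight model locally looks like (assuming the transformers library; the checkpoint name is just one example of a small, publicly available instruction-tuned model):

```python
# A hedged sketch of running a small open-weight LLM entirely on local hardware
# with Hugging Face transformers. The checkpoint name is just one example of a
# small instruction-tuned model; any similar open model would do, and no data
# ever leaves the machine.
from transformers import pipeline

# Load a small instruction-tuned open-weight model; it runs on laptop-class hardware.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

prompt = "In one or two sentences, what does 'samtools flagstat' report?"
out = generator(prompt, max_new_tokens=80, do_sample=False)
print(out[0]["generated_text"])
```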

In addition to the portability and lower costs, the ability to train or fine-tune customized models in-house enables us to protect sensitive data, such as patient or corporate information, while maintaining compliance with data protection regulations, as opposed to LLMs that run in the cloud and hence require users to send information to the provider’s servers.
