Large Language Models like ChatGPT Begin to Permeate Bioinformatics
Exploring the Impact and Future Potential of AI Language Models in Unraveling Biological Complexity
Large Language Models (LLMs) like ChatGPT are gradually being integrated into biology, opening up new ways to do text mining, ontology work, and bioinformatics. Given the potential of this technology to reshape how researchers approach data analysis and knowledge extraction, we at Nexco are closely monitoring its evolution and its concrete applications. Let's explore together recent works that apply LLMs to problems in biology, discussing their collective impact and their transformative potential.
Text Mining Like Never Before
The idea of automatically extracting information from scientific texts and parsing relationships between biological concepts and mechanisms (such as “Gene X negatively regulates gene Y” or “Drug X is an allosteric inhibitor of protein Y”) has captivated scientists for decades. Unfortunately, the challenges associated with such automated text-mining procedures resulted in rather poor performance; therefore, most software solutions for scientific text mining were of limited usefulness.
LLMs, however, are a game-changer. Researchers are increasingly turning to them to mine vast repositories of scientific literature for valuable insights, for example to unravel gene networks or predict gene functions. The ability of these models to process and prioritize information from diverse sources accelerates the pace of discovery.
Just as early applications of (not so large) language models to "sentiment analysis" could distinguish positive, negative, and neutral statements quite accurately, modern LLMs are complex and informed enough to extract meaningful relationships from textual data, linking different biological entities (genes, metabolites, etc.) through functional relationships. For example, PEDL+, a user-friendly tool for relation extraction, showcases the adaptability and programmability of these models in complex data analysis pipelines. The success of PEDL+ in pathway curation projects underscores the potential of LLMs to enhance information extraction, providing a bridge between textual data and biological insights.
The PlantConnectome database wonderfully exemplifies how LLMs streamline the extraction of biological knowledge from an ever-expanding corpus of scientific literature. To develop PlantConnectome, researchers processed over 100,000 abstracts of plant biology literature with one of OpenAI's GPT models, accessed through the API using prompts (instructions) optimized with ChatGPT. The approach unveiled nearly 400,000 functional relationships connecting biological entities such as genes, metabolites, and tissues, among many others. The system, fully automated and with an accuracy above 85%, is now available to researchers as a user-friendly database with demonstrated utility in the exploration of gene regulatory networks, protein-protein interactions, and developmental and stress responses.
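To make the general recipe concrete, here is a minimal sketch of prompt-based relation extraction. The prompt wording, the "entity | relation | entity" output format, and the example model reply are all illustrative assumptions on our part, not PlantConnectome's actual pipeline; in a real setting the reply would come from a GPT API call rather than a hardcoded string.

```python
# Sketch of LLM-based relation extraction from an abstract.
# The prompt template and the triple format are assumptions for
# illustration, not any published tool's actual implementation.

EXTRACTION_PROMPT = (
    "Extract all functional relationships between biological entities "
    "(genes, metabolites, tissues, ...) from the abstract below. "
    "Output one relationship per line as: entity | relation | entity.\n\n"
    "Abstract:\n{abstract}"
)

def build_prompt(abstract: str) -> str:
    """Fill the instruction template with one abstract."""
    return EXTRACTION_PROMPT.format(abstract=abstract)

def parse_relations(model_reply: str) -> list[tuple[str, str, str]]:
    """Turn 'A | relation | B' lines from the model into triples."""
    triples = []
    for line in model_reply.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples

# Made-up model reply for a hypothetical plant-biology abstract.
reply = "AtMYB12 | activates | flavonol biosynthesis\nCHS | is expressed in | petals"
print(parse_relations(reply))
# → [('AtMYB12', 'activates', 'flavonol biosynthesis'), ('CHS', 'is expressed in', 'petals')]
```

Parsing into triples like this is what lets the extracted relationships be loaded into a queryable database rather than remaining free text.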
LLMs and Their Impact on Bioinformatics
The unveiling of the most modern LLMs, such as GPT-4 from OpenAI, Llama 2 from Meta, and Gemini from Google, opens a new chapter in the integration of AI with bioinformatics. Excitement surrounds the potential of these models to transform how scientists work and interact with information, a major issue given the huge volumes of literature and data produced every year.
Of course, there are many challenges ahead. The closed-source nature of models like GPT-4 raises concerns about transparency and reproducibility, although open-source alternatives such as Meta's Llama 2 exist. Ethical considerations, including bias and safety, demand careful navigation. And of utmost relevance to scientific applications is the concern of hallucination: the LLM produces output that lacks support in the training data, so its veracity cannot be assessed, or that is simply wrong despite being presented confidently. While this can be partially tackled by considering the scores associated with the tokens ("syllables") output by the model (see here), the problem is far from solved and is of critical relevance at the moment.
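The token-score idea can be sketched in a few lines: several APIs expose per-token log-probabilities alongside the generated text, and unusually improbable tokens are candidates for extra scrutiny. The threshold and the sample values below are arbitrary illustrative choices, not recommended settings.

```python
import math

def flag_low_confidence(tokens, logprobs, threshold=0.5):
    """Flag tokens whose model probability falls below a threshold.

    `logprobs` are natural-log probabilities, as exposed by APIs that
    return per-token scores; the 0.5 threshold is an arbitrary
    illustrative choice, not a recommended setting.
    """
    flagged = []
    for tok, lp in zip(tokens, logprobs):
        prob = math.exp(lp)  # convert log-probability back to probability
        if prob < threshold:
            flagged.append((tok, round(prob, 3)))
    return flagged

# Made-up example: the model was confident everywhere except the date.
tokens = ["The", "tower", "was", "built", "in", "1889"]
logprobs = [-0.01, -0.02, -0.05, -0.1, -0.2, -1.6]
print(flag_low_confidence(tokens, logprobs))
# → [('1889', 0.202)]
```

A low-probability token is not proof of a hallucination, and confidently wrong output can carry high scores, which is exactly why the problem remains open.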
Another important point to consider when developing and using an LLM for a scientific application is the balance between LLM size, the amount of text and data available for training, and proper training frameworks and procedures, especially those aimed at suppressing hallucinations. Further complicating the situation, recent work from DeepMind on some of the best LLMs revealed a surprisingly large effect of computationally optimized prompts on the accuracy of the generated texts, especially when the LLM was tasked with solving problems (summarized and exemplified hands-on here). All these factors are tractable, but take time and resources to optimize.
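The core idea behind such prompt optimization is simple to sketch, even if the real systems are far more sophisticated: score each candidate prompt on a small evaluation set and keep the best one. In the toy sketch below, `run_llm` is a stand-in stub that mimics a model doing better with step-by-step instructions; in practice it would call a real model, and the candidates would themselves be proposed by an LLM.

```python
# Crude sketch of automated prompt search: score candidate prompts on a
# small evaluation set and keep the best. `run_llm` is a stub standing
# in for real model calls; its behavior is a deliberate simplification.

def run_llm(prompt: str, question: str) -> str:
    """Stub: pretend step-by-step prompts answer arithmetic correctly."""
    if "step by step" in prompt:
        return {"2+2": "4", "3*3": "9"}.get(question, "?")
    return "?"

def score_prompt(prompt, eval_set):
    """Fraction of evaluation questions answered correctly."""
    hits = sum(run_llm(prompt, q) == answer for q, answer in eval_set)
    return hits / len(eval_set)

eval_set = [("2+2", "4"), ("3*3", "9")]
candidates = ["Answer:", "Let's think step by step.", "Solve:"]
best = max(candidates, key=lambda p: score_prompt(p, eval_set))
print(best)  # → Let's think step by step.
```

Even this toy loop illustrates why the process takes time and resources: every candidate prompt must be evaluated against the whole evaluation set with real model calls.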
The Future and LLMs at Nexco
As bioinformatics continues to evolve, the future holds promise for highly sophisticated tools that can dramatically accelerate the pace of research. However, it is critical that we and LLM developers address the challenges listed above.
The journey has just begun, and while companies continue to develop novel, bigger LLMs (in parallel with their pursuit of AGI, Artificial General Intelligence), there are several routes and concrete applications that we can already explore with the current models today.
No doubt, blending the power of LLMs with biological research opens doors to unprecedented possibilities. We at Nexco are following all these trends closely, as we have shared in this blog post, and are right now exploring how to apply modern LLMs to our services and your requests.