# How retrieval-augmented generation (RAG) can transform drug discovery

BioStrand 12.14.2023

In a recent article on [knowledge graphs and large language models (LLMs) in drug discovery](https://blog.biostrand.ai/integrating-knowledge-graphs-and-large-language-models-for-next-generation-drug-discovery), we noted that despite the transformative potential of LLMs in drug discovery, several critical challenges have to be addressed to ensure that these technologies conform to the rigorous standards demanded by life sciences research. [Synergizing knowledge graphs with LLMs](https://blog.biostrand.ai/knowledge-graphs-and-black-box-llms) into one bidirectional data- and knowledge-based reasoning framework addresses several concerns related to hallucinations and lack of interpretability. However, that still leaves the challenge of giving LLMs access to external data sources that address their limitations in factual accuracy and up-to-date knowledge recall. Retrieval-augmented generation (RAG), together with knowledge graphs and LLMs, is the third critical node in the trifecta of techniques required for the robust and reliable integration of language models into drug discovery pipelines.

## Why Retrieval-Augmented Generation?

One of the key limitations of general-purpose LLMs is their training data cutoff, which means that their responses are typically out of step with the rapidly evolving state of knowledge. This is a serious drawback, especially in fast-paced domains like life sciences research.
Retrieval-augmented generation enables biomedical research pipelines to [optimize LLM output](https://www.oracle.com/in/artificial-intelligence/generative-ai/retrieval-augmented-generation-rag/) by:

1. [Grounding](https://research.ibm.com/blog/retrieval-augmented-generation-RAG) the language model in external sources of targeted, up-to-date knowledge, constantly refreshing the LLM's internal representation of information without completely retraining the model. This ensures that responses are based on the most current data and are more contextually relevant.
2. Providing access to the model's sources, so that its claims can be validated for relevance and accuracy.

In short, retrieval-augmented generation provides the framework necessary to augment the recency, accuracy, and interpretability of LLM-generated information.

## How does retrieval-augmented generation work?

Retrieval-augmented generation is a natural language processing (NLP) approach that [combines](https://stackoverflow.blog/2023/10/18/retrieval-augmented-generation-keeping-llms-relevant-and-current/) elements of both information retrieval and text generation models to enhance performance on [knowledge-intensive tasks](https://arxiv.org/abs/2005.11401). The retrieval component aggregates information relevant to a specific query from a predefined set of documents or knowledge sources, which then serves as the context for the generation model. Once the information has been retrieved, it is combined with the [input context](https://medium.com/@minh.hoque/retrieval-augmented-generation-grounding-ai-responses-in-factual-data-b7855c059322) to create an integrated context containing both the original query and the relevant retrieved information.
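As a toy illustration, the retrieve-and-augment step can be sketched in a few lines of Python. The keyword-overlap retriever, corpus, and function names here are illustrative stand-ins, not any real retriever or library API:

```python
import re

# Toy sketch of the RAG "retrieve + augment" step. The corpus, scoring
# scheme, and function names are illustrative stand-ins, not a real
# retriever or any specific library's API.

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the query (a stand-in
    for a real dense-vector or BM25 retriever)."""
    q = _tokens(query)
    ranked = sorted(documents, key=lambda d: len(q & _tokens(d)), reverse=True)
    return ranked[:k]

def build_context(query: str, documents: list[str]) -> str:
    """Combine the query with retrieved passages into one integrated context."""
    passages = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return f"Context:\n{passages}\n\nQuestion: {query}"

corpus = [
    "BRAF V600E mutations are common in melanoma.",
    "Vemurafenib targets the BRAF V600E mutation.",
    "Aspirin inhibits cyclooxygenase enzymes.",
]
prompt = build_context("Which drug targets BRAF V600E?", corpus)
print(prompt)
```

In a production pipeline, the `retrieve` step would query a vector store or search index, and `prompt` would be passed to a generation model rather than printed.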
This integrated context is then fed into a generation model to [generate](https://www.singlestore.com/blog/a-guide-to-retrieval-augmented-generation-rag/#:~:text=The%20RAG%20model%2C%20which%20integrates,LLM%20returns%20response.) an accurate, coherent, and contextually appropriate response based on both pre-trained knowledge and retrieved query-specific information. The RAG approach gives life sciences research teams more [control over the grounding data](https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview) used by a biomedical LLM by focusing it on enterprise- and domain-specific knowledge sources. It also enables the integration of a range of [external data sources](https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-customize-rag.html), such as document repositories, databases, or APIs, that are most relevant to enhancing the model's response to a query.

## The value of RAG in biomedical research

Conceptually, the retrieve+generate model's capabilities in dealing with dynamic external information sources, minimizing hallucinations, and enhancing interpretability make it a natural and complementary fit for augmenting the performance of bioLLMs. To quantify this augmentation, a recent research effort evaluated a [retrieval-augmented generative agent](https://openreview.net/pdf?id=clU5xWyItb) on biomedical question answering against LLMs (GPT-3.5/4), state-of-the-art commercial tools (Elicit, Scite, and Perplexity), and human biomedical researchers. The RAG agent, PaperQA, was first evaluated on a standard multiple-choice LLM-evaluation dataset, PubMedQA, with the provided context removed to test the agent's ability to retrieve information. In this case, the RAG agent beat GPT-4 by nearly 30 points (86.3% vs. 57.9%).
Next, the researchers constructed a more complex and more contemporary dataset (LitQA), based on recent full-text research papers outside the bounds of the LLMs' pre-training data, to compare the integrated abilities of PaperQA, LLMs, and human researchers to retrieve the right information and generate an accurate answer based on it. Again, the RAG agent outperformed both pre-trained LLMs and commercial tools, with overall accuracy (69.5%) and precision (87.9%) scores on par with biomedical researchers. More importantly, the RAG model produced zero hallucinated citations, compared with hallucination rates of 40-60% for the LLMs. Although this is just a narrow evaluation of the retrieval+generation approach in biomedical QA, the research does demonstrate the significantly enhanced value that RAG+bioLLM pipelines can deliver compared with purely generative AI. The combined sophistication of retrieval and generation models can be harnessed to enhance the accuracy and efficiency of a range of processes across the drug discovery and development pipeline.

## Retrieval-augmented generation in drug discovery

In the context of drug discovery, RAG can be applied to a range of tasks, from literature reviews to biomolecule design. Generative models have demonstrated potential for de novo molecular design but are still hampered by their inability to [integrate multimodal information](https://www.sciencedirect.com/science/article/pii/S2666379122003494) or provide interpretability. The RAG framework can facilitate the retrieval of [multimodal information](https://arxiv.org/pdf/2303.10868.pdf) from a range of sources, such as chemical databases, biological data, clinical trials, and images, that can significantly augment generative molecular design.
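Retrieving external information to ground a generative model follows a common retrieve-then-generate template. The following sketch is purely illustrative (the data and function names are hypothetical, not any platform's implementation); each drug discovery task plugs in its own retriever and generator:

```python
from typing import Callable

# Illustrative sketch (not any specific platform's implementation) of the
# generic retrieve-then-generate template: each task supplies its own
# retriever over domain knowledge and its own grounded generation step.

def rag_task(query: str,
             retrieve: Callable[[str], list[str]],
             generate: Callable[[str, list[str]], str]) -> str:
    evidence = retrieve(query)        # task-specific knowledge lookup
    return generate(query, evidence)  # generation grounded in that evidence

# Toy instantiation for compound design: retrieve known properties of a
# compound and produce a grounded summary (a real system would call a
# generative model here). The data is a hypothetical placeholder.
known_properties = {"aspirin": ["COX inhibitor", "anti-inflammatory"]}

answer = rag_task(
    "aspirin",
    retrieve=lambda q: known_properties.get(q, []),
    generate=lambda q, ev: f"{q}: " + "; ".join(ev),
)
print(answer)  # aspirin: COX inhibitor; anti-inflammatory
```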
The same expanded retrieval + augmented generation template applies to a whole range of applications in drug discovery, for example, compound design (retrieve compounds and their properties, generate improvements or new properties), [drug-target interaction prediction](https://www.sciencedirect.com/science/article/abs/pii/S1476927123001184) (retrieve known drug-target interactions, generate potential interactions between new compounds and specific targets), and adverse effects prediction (retrieve known adverse effects, generate modifications to eliminate them). The template even applies to several sub-processes and tasks within drug discovery, leveraging a broader swathe of existing knowledge to generate novel, reliable, and actionable insights. In [target validation](https://www.cell.com/trends/pharmacological-sciences/fulltext/S0165-6147(23)00137-2#secst0075), for example, retrieval-augmented generation can enable the comprehensive generative analysis of a target of interest based on an extensive review of all existing knowledge about it: expression patterns and functional roles, known binding sites, pertinent biological pathways and networks, potential biomarkers, etc. In short, the more efficient and scalable retrieval of timely information ensures that generative models are grounded in factual, sourceable knowledge, a combination with enormous potential to transform drug discovery.

## An integrated approach to retrieval-augmented generation

Retrieval-augmented generation addresses several critical limitations of bioLLMs and augments their generative capabilities. However, additional design rules and multiple technological profiles have to come together to successfully address the specific requirements and challenges of life sciences research.
Our LENSai™ Integrated Intelligence Platform seamlessly unifies the semantic proficiency of knowledge graphs, the versatile information retrieval capabilities of retrieval-augmented generation, and the reasoning capabilities of large language models to reinvent the Understand-Retrieve-Generate cycle in biomedical research. This unified approach empowers researchers to query a harmonized life sciences knowledge layer that integrates unstructured information and ontologies into a knowledge graph. A semantics-first approach enables a more accurate understanding of research queries, which in turn results in the retrieval of the content most pertinent to the query. The platform also integrates retrieval-augmented generation with structured biomedical data from our HYFT technology to enhance the accuracy of generated responses. Finally, LENSai combines deep learning LLMs with neuro-symbolic logic techniques to deliver comprehensive and interpretable outcomes. To experience this unified solution in action, please [contact us here](https://www.biostrand.ai/contact).

Tags: [Knowledge Graphs](https://blog.biostrand.ai/tag/knowledge-graphs#categories), [AI](https://blog.biostrand.ai/tag/ai#categories), [NLP](https://blog.biostrand.ai/tag/nlp#categories), [Drug discovery](https://blog.biostrand.ai/tag/drug-discovery#categories), [Life sciences data management](https://blog.biostrand.ai/tag/life-sciences-data-management#categories), [Large language models](https://blog.biostrand.ai/tag/large-language-models#categories), [Retrieval Augmented Generation](https://blog.biostrand.ai/tag/retrieval-augmented-generation#categories)

# NLP, NLU & NLG: What is the difference?
BioStrand 08.02.2023

In 2022, ELIZA, an early [natural language processing](https://www.csail.mit.edu/news/eliza-wins-peabody-award) (NLP) system developed in 1966, won a Peabody Award for demonstrating that software could be used to create empathy. Over 50 years later, human language technologies have evolved significantly beyond the basic pattern-matching and substitution methodologies that powered ELIZA. As we enter the new age of ChatGPT, generative AI, and large language models (LLMs), here’s a quick primer on the key components of NLP systems: NLP, NLU (natural language understanding), and NLG (natural language generation).

## **What is NLP?**

NLP is an [interdisciplinary field](https://www.techopedia.com/definition/653/natural-language-processing-nlp) that combines multiple techniques from linguistics, computer science, AI, and statistics to enable machines to [understand, interpret, and generate](https://www.linkedin.com/pulse/how-machines-understand-human-language-) human language. The earliest language models were [rule-based systems](https://levelup.gitconnected.com/the-brief-history-of-large-language-models-a-journey-from-eliza-to-gpt-4-and-google-bard-167c614af5af) that were extremely limited in scalability and adaptability. The field soon shifted towards data-driven statistical models that used probability estimates to predict sequences of words. Though this approach was more powerful than its predecessor, it was still limited in its ability to scale across large sequences and capture long-range dependencies.
The advent of recurrent neural networks (RNNs) helped address several of these limitations, but it would take the emergence of transformer models in 2017 to bring NLP into the age of LLMs. The transformer model introduced a new architecture based on attention mechanisms. Unlike sequential models like RNNs, [transformers](https://communities.surf.nl/en/artificial-intelligence/article/from-eliza-to-chatgpt-the-stormy-development-of-language-models) are capable of processing all words in an input sentence in parallel. More importantly, the concept of attention allows them to model long-term dependencies even over long sequences. [Transformer-based LLMs](https://www.techopedia.com/definition/34948/large-language-model-llm) trained on huge volumes of data can autonomously predict the next contextually relevant token in a sentence with an exceptionally high degree of accuracy. In recent years, domain-specific biomedical language models have helped augment and expand the capabilities and scope of ontology-driven bioNLP applications in biomedical research. These domain-specific models have [evolved](https://arxiv.org/pdf/2305.16326.pdf) from non-contextual models, such as [BioWordVec](https://www.nature.com/articles/s41597-019-0055-0) and [BioSentVec](https://arxiv.org/pdf/1810.09302.pdf), to masked language models, such as [BioBERT](https://academic.oup.com/bioinformatics/article/36/4/1234/5566506) and [BioELECTRA](https://aclanthology.org/2021.bionlp-1.16.pdf), and on to generative language models, such as [BioGPT](https://academic.oup.com/bib/article/23/6/bbac409/6713511?guestAccessKey=a66d9b5d-4f83-4017-bb52-405815c907b9) and [BioMedLM](https://www.mosaicml.com/blog/introducing-pubmed-gpt). [Knowledge-enhanced](https://www.sciencedirect.com/science/article/abs/pii/S1532046423001132?via%3Dihub) biomedical language models have proven to be more effective at knowledge-intensive BioNLP tasks than generic LLMs.
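The attention mechanism behind transformers can be sketched in plain Python. This is a minimal, toy rendering of scaled dot-product attention (the vectors and function names are illustrative only), showing how every token's output blends information from all positions at once:

```python
import math

# Pure-Python sketch of the scaled dot-product attention at the heart of
# the transformer: every token's query is compared against every key in
# parallel, and the resulting weights mix all value vectors, which is how
# long-range dependencies are captured. Toy vectors only.

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """queries/keys/values: one vector per token, all the same length."""
    d = len(keys[0])
    out = []
    for q in queries:
        # similarity of this token with every position, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # weighted mix of all value vectors -> contextualised representation
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# Three toy token vectors; each output row blends information from all tokens.
toks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ctx = attention(toks, toks, toks)
print(ctx)
```

Real transformers apply learned projection matrices to produce the queries, keys, and values, and run many attention heads in parallel; the core weighted-mixing step is the same.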
In 2020, researchers [created](https://www.microsoft.com/en-us/research/blog/domain-specific-language-model-pretraining-for-biomedical-natural-language-processing/?OCID=msr_blog_BioMed_tw) the Biomedical Language Understanding and Reasoning Benchmark (BLURB), a comprehensive benchmark and leaderboard to accelerate the development of biomedical NLP.

## **NLP = NLU + NLG + NLQ**

[NLP](https://www.biostrand.ai/natural-language-processing) is a field of artificial intelligence (AI) that focuses on the interaction between human language and machines. It employs a constantly expanding range of techniques, such as tokenization, lemmatization, syntactic parsing, semantic analysis, and machine translation, to extract meaning from unstructured natural language and to facilitate more natural, bidirectional communication between humans and machines.

![How NLP, NLU and NLG are related](https://blog.biostrand.ai/hs-fs/hubfs/How%20NLP%2c%20NLU%20and%20NLG%20are%20related.png?width=1554&height=1182&name=How%20NLP%2c%20NLU%20and%20NLG%20are%20related.png) SOURCE: [TechTarget](https://www.techtarget.com/searchenterpriseai/definition/natural-language-generation-NLG)

Modern NLP systems are powered by three distinct natural language technologies (NLTs): NLP, NLU, and NLG. It takes a combination of all three to convert unstructured data into actionable information that can drive insights, decisions, and actions. According to Gartner’s Hype Cycle for NLTs, there has been increasing adoption of a fourth category called [natural language query](https://www.whiz.ai/whitepaper-and-report/where-gartner-positions-natural-language-query-on-the-hype-cycle-and-what-it-means-for-life-sciences) (NLQ). So, here’s a quick dive into NLU, NLG, and NLQ.
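Two of the techniques mentioned above, tokenization and lemmatization, can be illustrated with a deliberately naive sketch (the suffix rules here are hypothetical toy rules, closer to crude stemming than true lemmatization, which relies on dictionaries and part-of-speech context):

```python
import re

# Toy illustration of two basic NLP techniques: tokenization and a naive
# suffix-stripping "lemmatizer" (really closer to stemming; production
# systems use dictionaries and part-of-speech context).

SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())

def lemmatize(token: str) -> str:
    for suffix, replacement in SUFFIX_RULES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)] + replacement
    return token

tokens = tokenize("Genes encoding kinases")
print([lemmatize(t) for t in tokens])  # ['gene', 'encod', 'kinase']
```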
## **NLU**

While NLP converts unstructured language into structured machine-readable data, NLU helps bridge the [gap](https://www.expert.ai/blog/dont-mistake-nlu-for-nlp-heres-why/#:~:text=While%20both%20NLP%20and%20NLU,language%2C%20NLU%20provides%20language%20comprehension) between human language and machine comprehension by enabling machines to understand the meaning, context, sentiment, and intent behind human language. NLU systems process human language across three broad [linguistic levels](https://www.xenonstack.com/blog/difference-between-nlp-nlu-nlg): a syntactical level to understand language based on grammar and syntax, a semantic level to extract meaning, and a pragmatic level to decipher context and intent. These systems leverage several advanced techniques, including semantic analysis, named entity recognition, relation extraction, and coreference resolution, to [assign structure, rules, and logic](https://www.expert.ai/blog/dont-mistake-nlu-for-nlp-heres-why/) to language so that machines can achieve a human-level comprehension of natural language. The challenge is to evolve from [pipeline models](https://pubmed.ncbi.nlm.nih.gov/36125190/), where each task is performed separately, to blended models that combine critical bioNLP tasks, such as [biomedical named entity recognition](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04994-3) (BioNER) and [biomedical relation extraction](https://www.sciencedirect.com/science/article/pii/S1476927122001888) (BioRE), into one unified framework.

## **NLG**

Where NLU focuses on transforming complex human languages into machine-understandable information, NLG, another subset of NLP, involves interpreting complex machine-readable data and presenting it in natural, human-like language.
This typically involves a [six-stage process](https://www.techtarget.com/searchenterpriseai/definition/natural-language-generation-NLG) flow that includes content analysis, data interpretation, information structuring, sentence aggregation, grammatical structuring, and language presentation. NLG systems generate understandable and relevant narratives from large volumes of structured and unstructured machine data and present them as natural language outputs, thereby simplifying and accelerating the transfer of knowledge between machines and humans. To explain the NLP-NLU-NLG synergies in extremely simple terms: NLP converts language into structured data, NLU provides the syntactic, semantic, grammatical, and contextual comprehension of that data, and NLG generates natural language responses based on the data.

## **NLQ**

The increasing sophistication of modern language technologies has renewed research interest in [natural language interfaces](https://arxiv.org/pdf/2212.13074.pdf) like NLQ that allow even non-technical users to search, interact with, and extract insights from data using everyday language. Most NLQ systems feature both NLU and NLG modules. The NLU module extracts and classifies the [utterances](https://www.linkedin.com/pulse/conversational-bi-text-sql-debmalya-biswas), keywords, and phrases in the input query in order to understand the intent behind the database search. NLG becomes [part of the solution](https://www.arria.com/blog/a-guide-to-natural-language-technologies/) when the results pertaining to the query are generated as written or spoken natural language. NLQ tools are broadly [categorized](https://www.plutora.com/blog/natural-language-queries-explained#:~:text=Natural%20language%20queries%20(NLQ)%20is,they%20can%20make%20business%20decisions) as either search-based or guided NLQ.
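Put together, an NLQ round-trip can be sketched as a toy pipeline (the grammar, data, and function names below are entirely hypothetical, not any product's implementation): an NLU step extracts intent and entities, a lookup answers the query, and an NLG step phrases the result:

```python
import re

# Toy sketch of an NLQ flow (hypothetical data and names, not any
# product's pipeline): an NLU step extracts the intent and key entity
# from the question, a lookup answers it, and an NLG step phrases the
# result in natural language.

RECORDS = {
    ("target", "vemurafenib"): "BRAF V600E",
    ("target", "imatinib"): "BCR-ABL",
}

def nlu(question: str):
    """Tiny grammar covering questions like 'What is the target of <drug>?'"""
    m = re.search(r"target of (\w+)", question.lower())
    return ("target", m.group(1)) if m else None

def nlg(drug: str, target: str) -> str:
    return f"The known target of {drug} is {target}."

def answer(question: str) -> str:
    key = nlu(question)
    if key and key in RECORDS:
        return nlg(key[1], RECORDS[key])
    return "Sorry, I could not interpret that question."

print(answer("What is the target of vemurafenib?"))
```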
The search-based approach uses a [free text search bar](https://www.yellowfinbi.com/campaign/natural-language-query-5-key-benefits-of-guided-approach-to-nlq) for typing queries, which are then matched to information in different databases. A key limitation of this approach is that it requires users to know enough about the data to frame the right questions. The [guided approach to NLQ](https://www.yellowfinbi.com/blog/what-is-guided-nlq) addresses this limitation by proactively guiding users to structure their data questions using [modeled questions](https://www.techtarget.com/searchbusinessanalytics/news/252525829/Yellowfin-enhances-NLQ-tool-in-analytics-platform-update), [autocomplete suggestions](https://www.techtarget.com/searchbusinessanalytics/feature/252509219/QuickSight-Q-a-potential-winner-for-Amazon-BI-platform), and other relevant filters and options.

## **Augmenting life sciences research with NLP**

At BioStrand, our mission is to enable an authentic systems biology approach to life sciences research, and natural language technologies play a central role in achieving that mission. Our LENSai Integrated Intelligence Platform leverages the power of our HYFT® framework to organize the entire biosphere as a multidimensional network of 660 million data objects. Our proprietary bioNLP framework then integrates unstructured data from text-based information sources to enrich the structured sequence data and metadata in the biosphere. The platform also leverages the latest developments in LLMs to bridge the [gap](https://www.ipatherapeutics.com/blog/ai-technology/closing-the-gap-of-text-and-the-biosphere-with-lensai-nlp-link) between syntax (sequences) and semantics (functions).
For instance, the use of [retrieval-augmented generation](https://huggingface.co/docs/transformers/model_doc/rag) (RAG) models enables the platform to scale beyond the typical limitations of LLMs, such as [knowledge cutoff and hallucinations](https://medium.com/neo4j/knowledge-graphs-llms-fine-tuning-vs-retrieval-augmented-generation-30e875d63a35), and provide the up-to-date contextual reference required for biomedical NLP applications. With LENSai, researchers can launch their research by searching for a specific biological sequence, or they can search the scientific literature with a general exploratory hypothesis related to a particular biological domain, phenomenon, or function. In either case, our unique technological framework returns all connected sequence-structure-text information, ready for further in-depth exploration and AI analysis. By combining the power of HYFT®, NLP, and LLMs, we have created a unique platform that facilitates the integrated analysis of all life sciences data. Thanks to our retrieval-augmented multimodal approach, we can now overcome LLM limitations such as hallucinations and limited knowledge. Stay tuned for more in [our next blog](https://blog.biostrand.ai/knowledge-graphs-and-black-box-llms).

Tags: [ML](https://blog.biostrand.ai/tag/ml#categories), [AI](https://blog.biostrand.ai/tag/ai#categories), [NLP](https://blog.biostrand.ai/tag/nlp#categories)