Artificial intelligence (AI) is a rapidly developing field that relies on large language models to process substantial amounts of information. Large language models (LLMs) are trained on data, which the AI then uses to generate output for a given input. However, this method has flaws that often lead to inaccurate responses. This is where retrieval augmented generation (RAG) comes in. RAG returns more accurate information by storing documents in a searchable form, allowing us to efficiently pull relevant information from large databases of documents.
Introduction
You have 20 documents containing information regarding your next project. The hours are ticking by as you pore over each document because you never know which fact will come in handy later. The temptation to use AI lingers, but you know that it may either convey incorrect information or leave out details. You simply cannot trust AI.
Currently, AI systems are built on large language models (LLMs), which give them access to a large database of information. LLMs decide which stored information is most relevant to an input question by comparing individual words in the question to words in the stored information. However, words can have different meanings in different contexts, so LLMs sometimes assume that two unrelated pieces of data correspond to each other, producing incorrect answers. Luckily, there is a solution that allows LLMs to provide more accurate results: retrieval augmented generation (RAG), a framework that can be applied to models built on LLMs. Adding RAG to large language models transforms the performance of current AI, increasing efficiency across industries.
Large Language Models
Large language models analyze data from across the internet to find connections between words in order to answer prompts. They work by finding patterns between words in the question and the information in the AI's database. For example, if the database had an essay about cats and another about dogs and a user asked what cats like, the LLM would look for the word "cat" and output any relevant information. Essentially, an LLM must be trained on data before it can answer questions; otherwise, it does not know what counts as a pattern. Moreover, an LLM needs to learn how to categorize information and determine which words are important. Without training, it could assume that a question and a document are related just because both contain the word "and." As humans, we know that conjunctions do not determine the topic. AI does not, so engineers must train LLMs to determine context and efficiently link a query to a relevant document.
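To make this concrete, here is a toy Python sketch of keyword matching with a stop-word filter, built around the article's cat example. The documents, stop-word list, and scoring rule are simplified illustrations, not how a production system actually ranks text.

```python
# Toy keyword search: score each document by how many topic words it shares
# with the query. Stop words like "and" are filtered out so that shared
# conjunctions do not make two unrelated texts look related.

STOP_WORDS = {"and", "the", "a", "in", "to", "of", "what", "do", "is"}

documents = {
    "cat_essay": "cats like to nap in the sun and chase string",
    "dog_essay": "dogs like to fetch sticks and go on walks",
}

def keyword_score(query: str, document: str) -> int:
    """Count the topic words the query and document share."""
    query_words = {w for w in query.lower().split() if w not in STOP_WORDS}
    doc_words = {w for w in document.lower().split() if w not in STOP_WORDS}
    return len(query_words & doc_words)

query = "what do cats like"
scores = {name: keyword_score(query, text) for name, text in documents.items()}
print(scores)  # {'cat_essay': 2, 'dog_essay': 1}: "cats" and "like" vs. only "like"
```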
Within seconds, AI can find matches between hundreds of words. LLMs are attractive because they search databases quickly and can even answer questions about information they were not explicitly trained on [1]. This suggests that LLMs can respond to a wide range of queries because they do not need to be pre-trained on every type of information. However, these models can only provide a limited amount of information, and their responses lose accuracy when the training data does not contain what the user is asking for [2]. AI will often wrongly presume that whatever information it finds is useful for answering the prompt, leading to wrong or irrelevant answers. Without the RAG process, LLMs are inefficient because they need retraining every time new data arrives. Applying RAG lets LLMs access more data and produce accurate outputs without that additional training.
Fundamentals of RAG
The retrieval augmented generation framework does not require additional training because documents are analyzed as they are added to the AI model's database. This works because of RAG's different search method. RAG relies on semantic search, which looks for information based on the meaning of the overall document, while standard LLM pipelines rely on keyword search, which compares individual words. For example, suppose the database held one document about plants that mentioned sunlight often while describing photosynthesis, and another document describing how light waves work, and a user asked the AI what sunlight is. An AI using keyword search alone would describe photosynthesis instead of sunlight itself, because it matched the word "sunlight." An AI applying the RAG framework would recognize that the light waves document is the relevant one, even though it rarely mentions the word "sunlight," because the AI would know that the document and the question are both about how light works rather than about an application of sunlight.
With semantic search, RAG can determine which documents are most relevant to a question. It stores each document as a numerical vector derived from the meaning, or context, of the overall document rather than from individual words [3]. RAG transforms each piece of data (a document, website, or other source) into a vector: a list of numbers that places the data as a point on a multidimensional graph according to its content. For instance, in a highly simplified scheme, a document about light and heat might be assigned one number per important word, such as light = 123 and heat = 324, forming the vector (123, 324). The embedding algorithm assigns these numbers so that similar documents end up next to each other in a graphical representation of the database, as shown in Figure 1. When a document is placed into the database, RAG analyzes it and assigns coordinates that depend on the document's content. By analyzing words throughout the document to plot the data, RAG categorizes by overall content instead of individual words.
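Below is a minimal sketch of this document-to-vector step, assuming a hashed bag-of-words in place of a real embedding model; production systems use learned embeddings with hundreds of dimensions. The document texts and the eight-dimension size are made up for illustration.

```python
import zlib
import numpy as np

# Minimal "document -> point" sketch: each word nudges one coordinate of a
# small vector, so documents sharing many words end up pointing in similar
# directions. A stand-in for a learned embedding model, not a real one.

DIMENSIONS = 8

def embed(text: str) -> np.ndarray:
    """Map a document to a vector based on the words it contains."""
    vector = np.zeros(DIMENSIONS)
    for word in text.lower().split():
        vector[zlib.crc32(word.encode()) % DIMENSIONS] += 1.0
    norm = np.linalg.norm(vector)
    return vector / norm if norm else vector  # unit length so only direction matters

# Similar documents land near each other on the multidimensional "graph".
vector_store = {
    "photosynthesis_doc": embed("plants use sunlight to make food by photosynthesis"),
    "light_waves_doc": embed("light travels as waves with wavelength and frequency"),
}
print(vector_store["photosynthesis_doc"])
```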
Whenever a user submits a query, RAG converts the query itself into a vector and finds the documents nearest to it. As shown in Figure 1, the query lands next to documents with similar content. RAG retrieves the information most relevant to the question and generates a response almost instantly, since new documents require no retraining and only need to be added to the database. Given the limitations of current AI technologies, RAG is often a necessary addition that lets LLMs respond with relevant, accurate, and up-to-date information.

Figure 1. The RAG Process
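As a concrete illustration of the retrieval step in Figure 1, the sketch below ranks documents by cosine similarity to a query vector and pastes the winner into a prompt. The three-number vectors are invented stand-ins for real embeddings, and the prompt format is just one plausible way to hand retrieved context to an LLM.

```python
import numpy as np

# Minimal retrieval step of RAG: embed the query, find the nearest stored
# document, and hand it to the LLM as grounding. The vectors below are
# made-up stand-ins for real embeddings.

doc_vectors = {
    "photosynthesis_doc": np.array([0.9, 0.1, 0.4]),
    "light_waves_doc": np.array([0.2, 0.9, 0.3]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Measure how closely two vectors point in the same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vector: np.ndarray, k: int = 1) -> list[str]:
    """Return the names of the k documents nearest to the query vector."""
    ranked = sorted(doc_vectors,
                    key=lambda name: cosine_similarity(query_vector, doc_vectors[name]),
                    reverse=True)
    return ranked[:k]

query = "what is sunlight?"
query_vector = np.array([0.3, 0.8, 0.4])  # pretend embedding of the query
context = retrieve(query_vector)          # -> ['light_waves_doc']
prompt = f"Answer using only this context: {context}\nQuestion: {query}"
print(prompt)  # the retrieved document, not the whole database, grounds the answer
```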
Enhancing LLMs Through RAG
RAG is necessary because, without it, outliers in the data the LLM was trained on can lead to inaccurate results, otherwise known as hallucinations. An AI hallucination occurs when the LLM makes up details that do not exist in the given information. For example, when an AI was asked how many times the Golden Gate Bridge has moved to Egypt, it output 2, which is not true [4]. When the AI does not know the answer to a question, it constructs one from irrelevant information; the number 2 came from other data that does not actually answer the user's question. Because AI is trained to discover patterns in its input data, any outliers caused by hallucinated data can lead the large language model to incorrectly determine the parameters of a pattern.
While hallucinated output is not always inaccurate, it cannot be relied upon because it is not always based on facts. If the goal is not a correct answer but creative output, hallucinations can even produce art [5]. Still, even when the goal of the LLM is to create art, a hallucinating AI cannot be counted on to produce a desired result: we cannot control its output because it is not grounded in accurate information. To combat hallucinations, the data behind the LLM needs fine-tuning, meaning that information irrelevant to the desired topic is filtered out. This is where RAG is useful: it reduces the amount of irrelevant information the AI outputs by ranking retrieved results by relevancy before presenting them to the user.
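A hedged sketch of that ranking step follows: keep only retrieved passages whose similarity score clears a cutoff, then present the survivors in order. The scores and the 0.5 threshold are invented for illustration; real systems tune such values empirically.

```python
# Rank retrieved passages by relevance and drop the ones below a cutoff,
# so irrelevant context never reaches the user or the generator.
# Scores and the threshold are illustrative, not standard values.

retrieved = [
    ("passage about photosynthesis", 0.32),
    ("passage about light waves", 0.91),
    ("passage about the visible spectrum", 0.74),
]

RELEVANCE_CUTOFF = 0.5

def rank_and_filter(passages: list[tuple[str, float]]) -> list[str]:
    """Keep passages scoring at or above the cutoff, best first."""
    relevant = [(text, score) for text, score in passages if score >= RELEVANCE_CUTOFF]
    relevant.sort(key=lambda item: item[1], reverse=True)
    return [text for text, _ in relevant]

print(rank_and_filter(retrieved))
# ['passage about light waves', 'passage about the visible spectrum']
```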
With RAG, technologies can continuously update their information, giving users access to accurate answers without excessive research. In a study of university students, LLMs implementing the RAG structure had higher utility ratings than a conventional search engine [6]. Because continuous updates give RAG access to current information, it does not need to make up data; with the facts on hand to answer a user's question, it is far less likely to hallucinate. A consistently updated database would allow any researcher to find answers effectively. While LLMs on their own may not yield beneficial artificial intelligence, combining the RAG architecture with the AI produces increasingly advantageous results.
Fine-Tuned Models
RAG is not the only method of improving LLMs for AI. Alternatively, an LLM can be fine-tuned into a Fine-Tuned (FN) model. Essentially, an FN model is an LLM that learns patterns from data specific to a certain task. It answers more accurately, but only for the subject it was trained on. For example, if you trained an FN model on cooking, the AI would answer every question with cooking in mind; its data would all relate to cooking, so it would be unable to correctly answer questions about other topics. While RAG models tend to be more efficient, with overall performance better by more than 15%, FN models score approximately 8% higher on a creativity measure [7]. RAG performs better overall because, as shown in Figure 2, FN models are more accurate only within their specialized topics, while RAG is more flexible at answering a variety of questions. FN models will hallucinate when asked questions outside their training data, which makes them better suited to creative purposes. In other words, RAG is more efficient and accurate than FN models, and FN models are more likely to produce a unique response.

Figure 2. RAG vs. Fine-tuning: Features of using each framework [8].
One might think that AI could use both FN models and the RAG process to get the best of both worlds. However, this combination tends to slow performance because it increases the model's complexity and tunes the data toward a specific subject instead of general usage. To use the combination broadly, we would need a separate model for each subject so that each could correctly answer questions in its own domain. That said, combining FN models with RAG can increase accuracy when applied to a specific subject. In finance, fine-tuning was applied to the RAG framework over a database of financial data, news articles, and historical market trends, resulting in a 22% increase in accuracy compared with using LLMs alone [8]. While the combination improves precision for a single topic, speed also matters for most AI models, since applications often require quick decision-making, which suggests RAG alone is the better enrichment for LLMs.
Healthcare
RAG increases reliability, which is necessary for real-life applications that involve precarious decisions. In the healthcare system, RAG architecture can help diagnose patients and provide healthcare advice based on their data, provided it maintains patient confidentiality [9]. Instead of spending time searching through each patient's information and then consulting resources to determine the best course of action, doctors can employ the RAG framework by simply entering their patients' data and asking the AI for its recommendations. For instance, RAG can take in a client's nutritional information and then determine whether the patient is at risk of malnutrition. Using generative AI models to summarize information regarding nutritional status achieved 93% accuracy, and incorporating the RAG framework increased this to 99% [10]. While using RAG is not flawless, it could significantly cut down the time required to treat patients who are not facing serious issues.
Reducing the time doctors must spend on each patient is critical amid the massive demand for doctors, which has contributed to mental health issues in the profession. Over 50% of medical professionals feel burned out, 20% plan to retire within a year, and approximately 300 physicians die by suicide each year [11]. Doctors face a stressful workload because they put in extra hours and are still unable to help everyone. With RAG, AI becomes accurate enough to complete analytic tasks so that doctors have more time to save lives. And with the ability to make precise decisions, AI can be applied to sensitive tasks.
Critical Applications
With the increased accuracy that RAG provides, AI could also be useful for defense. For example, AI models with RAG can support cybersecurity tasks such as "intelligence analysis, surveillance, and autonomous decision making" [9]. With RAG, AI can rapidly analyze large datasets on previous attacks to determine how to defend against a security threat. However, allowing AI to access confidential information raises ethical and legal concerns: AI-generated responses create challenges around research integrity and plagiarism [12]. It would be counterproductive to use an AI model to protect sensitive data if the model then reproduced that information. By encrypting the data before the AI accesses it, however, the model cannot expose the sensitive content, helping ensure its security. With such security measures in place, the RAG process can be applied to any scenario that requires data analysis.
RAG can also help determine where wildfires are most likely to strike by drawing on data from previous years. We could predict where wildfires may occur, target those specific areas, and protect the thousands of people devastated by rampaging fires. Existing wildfire prediction models only determine which direction an active fire will spread, which does not leave enough time to prevent devastation [13]. Models that attempt to predict where a fire will occur, before it has started, are not yet accurate enough to be relied upon. However, as shown in Figure 3, AI models with RAG could analyze information about wind patterns, foliage, and historical fires to output potential fire targets. While it would take immense research to determine all the factors the model would need, once we have that information, RAG could accurately predict and help prevent fires. As AI technologies continue to improve, we can further optimize the RAG process.

Figure 3. Using RAG to Determine Fire Patterns
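As a hypothetical sketch of the idea in Figure 3, the code below represents each historical fire as a feature vector (wind speed, vegetation dryness, days since rain) and retrieves the past fires most similar to current conditions, the kind of context a RAG model could then reason over. All feature names, numbers, and fire names are invented.

```python
import numpy as np

# Hypothetical wildfire retrieval: find the historical fires whose recorded
# conditions most resemble today's, so a RAG model can cite them as context.
# Features: [wind speed (mph), vegetation dryness (0-1), days since rain].

historical_fires = {
    "2018 valley fire": np.array([15.0, 0.4, 5.0]),
    "2020 canyon fire": np.array([40.0, 0.9, 30.0]),
    "2021 ridge fire": np.array([35.0, 0.8, 25.0]),
}

current_conditions = np.array([38.0, 0.85, 28.0])

def most_similar_fires(conditions: np.ndarray, k: int = 2) -> list[str]:
    """Retrieve the k past fires recorded under the most similar conditions."""
    return sorted(historical_fires,
                  key=lambda name: np.linalg.norm(historical_fires[name] - conditions))[:k]

print(most_similar_fires(current_conditions))
# ['2020 canyon fire', '2021 ridge fire']: candidate context for the model
```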
The Future of RAG
AI models with RAG can be trained to focus on specific parts of a question to refine their answers toward the user's topic. While standard RAG uses the input question as given to generate a response, an extension called RAG-end2end can refine the user's query and train the response generator on a particular topic for a more specific answer [14]. RAG is not yet a mature technology, but the RAG-end2end extension increases its benefits by reshaping the prompt into a more precise question. By conditioning RAG to generate responses from data focused on a specific topic in the question, the risk of hallucination drops. RAG-end2end optimizes both the query and the response generation to improve on the standard RAG model.
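The sketch below illustrates the query-refinement idea in spirit only: it expands a vague question with related domain terms before retrieval. The expansion table is an invented stand-in for the learned rewriting that RAG-end2end actually trains [14].

```python
# Toy query refinement: append related domain terms so the retriever lands
# on more specific documents. The expansion table is invented for illustration;
# RAG-end2end learns this refinement rather than hard-coding it.

DOMAIN_EXPANSIONS = {
    "sunlight": ["solar radiation", "light waves"],
    "fire": ["wildfire", "burn area"],
}

def refine_query(query: str) -> str:
    """Sharpen a query by appending related domain terms before retrieval."""
    extras: list[str] = []
    for word in query.lower().split():
        extras.extend(DOMAIN_EXPANSIONS.get(word.strip("?."), []))
    return query if not extras else f"{query} ({', '.join(extras)})"

print(refine_query("what is sunlight?"))
# what is sunlight? (solar radiation, light waves)
```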
As AI advances, a new concept called Interactive AI (IAI) is emerging. IAI is better at interacting with humans, in part because it supports voice recognition. IAI builds on the RAG framework to further optimize responses, offering more effective user communication and a smoother experience [15]. This tool will therefore be better at interpreting human input, whether a typed message or a voice command, and will better understand the context in which the user is asking a question. By understanding the situation, IAI can provide more relevant responses, and RAG within IAI can push current AI efficiency toward higher performance. Developers have already started using RAG for basic tasks in crucial industries such as healthcare and defense. As accuracy increases, the RAG framework will allow AI to handle more precise tasks, such as assisting with surgeries. With continued innovation, the RAG framework could become revolutionary.
For More Information:
- LLM: https://www.cloudflare.com/learning/ai/what-is-large-language-model/
- AI: https://uit.stanford.edu/service/techtraining/ai-demystified/llm
- RAG: https://aws.amazon.com/what-is/retrieval-augmented-generation/
Additional Multimedia:
- RAG and Fine Tuning: https://www.youtube.com/watch?v=00Q0G84kq3M
- LLMs: https://www.youtube.com/watch?v=osKyvYJ3PRM
- LLM Examples: https://www.youtube.com/watch?v=lXIedWJRqd4
References
[1] S. Ornes, "How Quickly Do Large Language Models Learn Unexpected Skills?" Quanta Magazine. Accessed: Jan. 23, 2025. [Online]. Available: https://www.quantamagazine.org/how-quickly-do-large-language-models-learn-unexpected-skills-20240213/
[2] "What is retrieval-augmented generation (RAG)?" McKinsey & Company. Accessed: Jan. 23, 2025. [Online]. Available: https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-is-retrieval-augmented-generation-rag
[3] J. Pambou, "A Simple Guide To Retrieval Augmented Generation Language Models," Smashing Magazine. Accessed: Jan. 23, 2025. [Online]. Available: https://www.smashingmagazine.com/2024/01/guide-retrieval-augmented-generation-language-models/
[4] S. Bordoloi, "The hilarious & horrifying hallucinations of AI," Sify. Accessed: Feb. 8, 2025. [Online]. Available: https://www.sify.com/ai-analytics/the-hilarious-and-horrifying-hallucinations-of-ai/
[5] R. Pradhan, "Addressing AI hallucinations with retrieval-augmented generation," InfoWorld, 2023. [Online]. Available: https://www.proquest.com/docview/2880265392
[6] L. Pasquarelli, C. Koutcheme, and A. Hellas, "Comparing the Utility, Preference, and Performance of Course Material Search Functionality and Retrieval-Augmented Generation Large Language Model (RAG-LLM) AI Chatbots in Information-Seeking Tasks," arXiv, 2024. [Online]. Available: https://arxiv.org/abs/2410.13326
[7] R. Lakatos, P. Pollner, A. Hajdu, and T. Joo, "Investigating the performance of Retrieval-Augmented Generation and fine-tuning for the development of AI-driven knowledge-based systems," arXiv, 2024. [Online]. Available: https://www.proquest.com/docview/2962923564
[8] Prajna, "Hybrid approaches: Combining RAG and finetuning for optimal LLM performance," Medium, Aug. 5, 2024. Accessed: May 8, 2025. [Online]. Available: https://prajnaaiwisdom.medium.com/hybrid-approaches-combining-rag-and-finetuning-for-optimal-llm-performance-35d2bf3582a9
[9] R. Shan, "Certifying Generative AI: Retrieval-Augmented Generation Chatbots in High-Stakes Environments," Computer, vol. 57, no. 9, pp. 35–44, 2024, doi: 10.1109/MC.2024.3401085. [Online]. Available: https://ieeexplore.ieee.org/document/10660589
[10] M. Alkhalaf, P. Yu, M. Yin, and C. Deng, "Applying generative AI with retrieval augmented generation to summarize and extract key clinical information from electronic health records," Journal of Biomedical Informatics, vol. 156, Art. no. 104662, 2024, doi: 10.1016/j.jbi.2024.104662. [Online]. Available: https://pubmed.ncbi.nlm.nih.gov/38880236/
[11] L. Sausser, "Burnout threatens primary care workforce and doctors' mental health," CBS News, Jun. 7, 2023. Accessed: May 8, 2025. [Online]. Available: https://www.cbsnews.com/news/doctor-burnout-primary-care-medical-workforce-mental-health/
[12] S. H. Park, "Use of Generative Artificial Intelligence, Including Large Language Models Such as ChatGPT, in Scientific Publications: Policies of KJR and Prominent Authorities," National Library of Medicine. Accessed: Jan. 23, 2025. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC10400373/
[13] S. Brodsky, "California fires drive race for AI detection tools," IBM, Jan. 20, 2025. Accessed: May 8, 2025. [Online]. Available: https://www.ibm.com/think/news/ai-fire-prediction
[14] S. Siriwardhana, R. Weerasekera, E. Wen, T. Kaluarachchi, R. Rana, and S. Nanayakkara, "Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering," MIT Press Direct. Accessed: Jan. 23, 2025. [Online]. Available: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00530/114590/Improving-the-Domain-Adaptation-of-Retrieval
[15] R. Zhang et al., "Interactive AI With Retrieval-Augmented Generation for Next Generation Networking," IEEE Network, vol. 38, no. 6, pp. 414–424, 2024, doi: 10.1109/MNET.2024.3401159. [Online]. Available: https://ieeexplore.ieee.org/document/10531073
