Better LLM Agent Quality Through Code Generation and RAG

Sharing our insights and learnings from the journey of building LLM agents.


The rise of large language models (LLMs) is transforming the way companies enable natural, conversational access to their platforms. By integrating LLMs with tools such as calculators, search engines or databases, we can automate a wide range of tasks. These agents can select the most appropriate service or API for a given problem and invoke it with the correct settings and parameters.

We have built our own LLM-powered agent system, PromptAI, to bridge users directly to our operational data, making data insights more accessible. It allows users to interactively explore vast amounts of time-series data through a chat interface, suggesting actionable insights drawn from Conviva’s extensive knowledge base.

 

We are excited to share our insights and learnings from the journey of building LLM agents, with the hope that they could benefit others building LLM agents.

The key takeaways are:

  • Code as an interface between API and LLM, not JSON: Though most APIs require JSON or another serialized format as an interface, we found that prompting the LLM to write Python code leads to higher accuracy. This is because LLMs are more fluent at writing code than JSON, and because code is better suited to expressing the complex reasoning behind intricate parameter values.
  • Better accuracy of API calls with retrieval-augmented generation (RAG): When API parameters are of high cardinality, using RAG for parameter selection can reduce the space of values the LLM has to consider. This approach reduces the input context length, resulting in higher accuracy and efficiency.

LLM Writes Code to Use Tools

In general, most APIs require the input to be in a machine-readable format, such as JSON. For example, to answer a question like “How many users were active for the last three days?”, the LLM should generate the JSON payload shown in Table 1. Based on the given question, the LLM identifies the proper value for each parameter according to the specification of the API interface. However, directly generating this serialized format presents unique challenges.
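
As a rough sketch of such a payload (the field names and values below are hypothetical, not Conviva’s actual API schema, and are shown as a Python dict for readability), note that every parameter must already be fully resolved when the LLM emits it:

```python
# Hypothetical payload for "How many users were active for the last three days?"
# The LLM would have to emit the equivalent JSON verbatim, with the time window
# already computed as absolute timestamps.
payload = {
    "metric": "active_users",
    "from": "2024-06-10T00:00:00Z",   # three days before "now" -- must be calculated by the LLM
    "to": "2024-06-13T00:00:00Z",     # the current date-time
    "granularity": "1d",
}
```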

The first challenge is that LLMs are less accurate at generating structured output such as JSON. If the schema of the payload is complex, LLMs frequently hallucinate keys, use incorrect value types or add an unnecessary trailing comma, resulting in syntax errors.

Another challenge is that finding the proper value of a parameter can require multiple steps of reasoning. For example, to compute the correct values for the “from” and “to” parameters, the LLM must know the current date-time, then subtract three days from it to obtain the value for “from”. This is like trying to determine the date three days ago without a calendar on hand: it involves tracking days, months and even the different lengths of months. When generating the final JSON output, however, the LLM must complete this calculation in a single step, which is difficult and error-prone.

These challenges raise an important question: What is the most appropriate representation or interface between the LLM and APIs?

In PromptAI, we use code as an interface between the LLM and APIs, capitalizing on the LLM’s familiarity with writing code. Unlike serialized data formats like JSON, code enables the LLM to express more complex reasoning and perform intricate tasks.

Figure 1: Example input prompt and output using Python code as an interface

In this Python-based approach, the LLM can fully express the reasoning and calculation to determine the correct parameter values. In JSON format, the LLM must provide the final answer directly. However, in code, it can document intermediate results. As demonstrated in chain-of-thought prompting, writing down intermediate steps can improve the overall accuracy of the LLM’s responses. By executing or parsing the code generated by the LLM, we can obtain the correct parameter values to call the APIs.
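
As a minimal sketch of what such LLM-generated code might look like (the `query_time_series` helper and its parameter names are hypothetical stand-ins, not PromptAI’s actual API):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical stand-in for the real time-series API; the actual interface
# and parameter names used in PromptAI are not shown here.
def query_time_series(metric: str, start: str, end: str) -> dict:
    return {"metric": metric, "from": start, "to": end}

# The generated code writes out intermediate steps explicitly, chain-of-thought
# style, instead of emitting final timestamp strings in one shot.
now = datetime.now(timezone.utc)            # current date-time
three_days_ago = now - timedelta(days=3)    # start of the "last three days" window

payload = query_time_series(
    metric="active_users",
    start=three_days_ago.isoformat(),
    end=now.isoformat(),
)
print(payload)
```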

Another benefit of using code as an interface is that LLMs are more familiar with writing code than directly outputting structured data in formats like JSON, where syntax errors, such as trailing commas, can easily occur. Generating the desired JSON structure sometimes requires additional fine-tuning. However, LLMs are already trained on large corpora of code and can write code effectively. This makes code a lower-overhead, high-accuracy choice for API integration, especially when using specialized coding LLMs like CodeLlama.

Using RAG to Define Valid Parameter Values

Code serves as a powerful interface between the LLM and data retrieval tools, and one key element of this interaction is ensuring that the LLM selects valid input values for specific tasks. In Python, the enum module’s Enum class is particularly useful for defining the finite set of valid values that a variable can take. This helps us constrain the space of valid inputs, made up of metrics and dimensions.

At Conviva, Python’s Enum is used to constrain the LLM’s input choices to valid metrics and dimensions. However, this Enum setup can become unwieldy: the sheer number of potential values may exceed token limits, and most of them are irrelevant to any given query, consuming tokens without adding value. To optimize this, we intelligently reduce the values provided to the LLM, focusing only on what is relevant in context, which saves tokens while still covering all valid possibilities.
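
A minimal sketch of the Enum idea follows, assuming illustrative metric and dimension names rather than Conviva’s real schema:

```python
from enum import Enum

# Illustrative metric and dimension names; the real Enums are generated from
# Conviva's schema and contain far more members.
class Metric(Enum):
    ACTIVE_USERS = "active_users"
    PLAYS = "plays"
    REBUFFERING_RATIO = "rebuffering_ratio"

class Dimension(Enum):
    DEVICE_TYPE = "device_type"
    COUNTRY = "country"
    CDN = "cdn"

# The Enum definitions are placed in the prompt, and generated code is expected
# to reference members such as Metric.ACTIVE_USERS rather than free-form strings.
```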

Retrieval-augmented generation (RAG) provides an efficient way to extend the LLM’s capabilities by fetching relevant information from external sources. In Conviva’s case, we use RAG to dynamically populate the values of an Enum, further optimizing the interaction between the LLM and our extensive data. For example, if a user asks for active users specific to a certain device type, RAG can retrieve only the relevant device options from the vector database, ensuring that the LLM is working with accurate, context-specific data without overwhelming it with irrelevant values.

To enable RAG, we use a vector database, which stores embeddings of the Enum values along with relevant metadata, such as each value’s description. A vector database allows us to conduct highly customizable searches that combine vector similarity with traditional text search. By retrieving candidate values from the vector database, we ensure that the LLM works with a refined, relevant subset of possible values instead of being overwhelmed with options.
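
A toy sketch of this flow follows, using an in-memory store and placeholder embeddings instead of a real vector database; the device-type values are illustrative only:

```python
from enum import Enum

# Toy in-memory "vector store": value -> (embedding, description).
# A real deployment would use a vector database and an embedding model.
VALUE_STORE = {
    "Smart TV": ([0.9, 0.1], "Television apps on smart TVs"),
    "Mobile":   ([0.2, 0.8], "Phones and tablets"),
    "Desktop":  ([0.5, 0.5], "Web browsers on PCs"),
}

def embed(text: str) -> list:
    # Placeholder embedding; a real system would call an embedding model.
    return [0.85, 0.15] if "tv" in text.lower() else [0.3, 0.7]

def top_k(query: str, k: int = 2) -> list:
    query_vec = embed(query)
    def score(name):
        vec = VALUE_STORE[name][0]
        return sum(a * b for a, b in zip(query_vec, vec))
    return sorted(VALUE_STORE, key=score, reverse=True)[:k]

# Build an Enum containing only the candidate values relevant to the question,
# and include that (much smaller) Enum in the LLM prompt.
candidates = top_k("How many active users on smart TVs?")
DeviceType = Enum("DeviceType", {name.upper().replace(" ", "_"): name for name in candidates})
print(list(DeviceType))
```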

In terms of cognitive processing, we can think of this as akin to the Type 1 and Type 2 thinking paradigm. RAG acts as a rapid, Type 1 thinker, retrieving quick but relevant results, while the LLM acts as the slower, more thorough Type 2 thinker, analyzing the candidates to select the correct value. This two-step approach improves accuracy. In fact, we observed instances where the LLM alone guessed incorrectly, but with the RAG-enhanced system narrowing the set of choices, the correct value was identified.

Collaboration of Multiple Agents

Another challenge of building agents is that the LLM must interpret the results from these tools and present them meaningfully to users. This requires additional context and knowledge for the LLM, so we combined API integration with RAG over a knowledge base to enhance the quality of the response. As mentioned earlier, we use CodeLlama, which specializes in coding, to improve the accuracy of API integration. However, for generating the comprehensive final response, a more generalized LLM like Llama provides superior descriptions due to its broader understanding. By combining the strengths of different models, we maximize both the accuracy of API calling and the quality of the response.
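
A rough sketch of this two-model collaboration follows; the model names, prompts and helper functions are placeholders rather than PromptAI’s actual implementation:

```python
# Toy sketch of the two-model split; call_llm and execute_generated_code are
# placeholders, not the real inference or sandboxing interfaces used in PromptAI.
def call_llm(model: str, prompt: str) -> str:
    return f"[{model} response to: {prompt[:40]}...]"

def execute_generated_code(code: str) -> str:
    # Placeholder for sandboxed execution of the generated Python.
    return "{'active_users': 12345}"

def answer(question: str, knowledge_snippets: list) -> str:
    # Step 1: a coding model (e.g. CodeLlama) generates Python that calls the data APIs.
    code = call_llm("codellama", f"Write Python to answer: {question}")
    api_result = execute_generated_code(code)

    # Step 2: a general-purpose model (e.g. Llama) turns the raw result plus
    # retrieved knowledge-base context into an explanation for the user.
    context = "\n".join(knowledge_snippets)
    return call_llm("llama", f"Question: {question}\nData: {api_result}\nContext: {context}")

print(answer("How many users were active for the last three days?",
             ["Active users counts unique devices seen in the period."]))
```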

Conclusion

LLM agents hold tremendous potential for automating a wide range of human tasks by leveraging existing APIs and services. When integrating LLMs with APIs, framing the problem in a well-known programming language is often more effective than in JSON or other serialized data formats. RAG further strengthens the system by dynamically providing relevant context, allowing precise parameter selection for APIs. By combining these techniques, we have improved the accuracy of our LLM agent system, PromptAI, enabling users to interact with data conversationally. We hope our experiences and insights can serve as a valuable guide for others working to build LLM agents.

Originally published on The New Stack