Let's Build An AI Agent: trendrBOT answers natural language questions based on Google Search trends
A lesson in why AI agents are so exciting but also so difficult to scale
Background on AI Agents
As the popularity of LLMs took off with the launch of ChatGPT, the idea of agents as the next big thing was already taking shape. The concept is to move beyond a simple prompt-response workflow that relies solely on the model's own internal logic. Programmatic libraries such as LangChain and productized services like OpenAI GPTs have cropped up to address these needs at varying levels of customization and complexity. Even the core LLM-powered services themselves, such as ChatGPT and Google Gemini, have natively integrated more advanced agent-like workflows for popular use cases such as generating images and querying the web.
Gaudy demos showcasing more advanced agent-like capabilities have continued to surface to sell the idea of completing tasks such as designing user interfaces, booking vacations, writing complex code, and more. But the reality is that most advanced agents have not fully materialized just yet, and even fairly basic agents have not been adopted as quickly as expected. Why is that?
To understand why agents have been slower to evolve than many predicted, let's build a fairly basic agent from scratch. We will end by reviewing what lessons we can learn and extrapolating what this means for the future of building complex AI agents.
TrendrBOT
The purpose of our agent is to allow users to ask questions in natural language about trending topics online. We want to be able to ask basic questions such as “What are the top 10 trends in the US?” but also more complex, nuanced questions such as “Are there noticeable differences in trends between European and Asian countries?”.
You can skip directly to the trendrBOT Live Demo or deploy the code directly from the TrendrBOT Github repo. Here’s a view of the Web UI:
For those interested, let’s do a deeper dive on how trendrBOT works.
TrendrBOT Deep Dive
To make trendrBOT work we have to figure out how to best integrate a large amount of structured data (our trends data) with an LLM. Because we are dealing with structured data, a vanilla RAG (Retrieval Augmented Generation) architecture would be less than ideal. Another option is to leverage text2SQL, in other words prompting an LLM to generate a SQL query. This can work under idealized circumstances, but in my experience it is still very brittle and error-prone in most use cases today. A third option is to leverage the increasingly available LLMs with large context windows and pass our structured data directly to the LLM. I decided to use this third option, with an additional step to intelligently select and query the structured data because of the remaining context window limitations.
Here is a logical view of the application:
An explanation:
Step 1: Based on user input, use an LLM to check for a relevant data source
Step 2: Query the matched data source. If a relevant data source does not exist, tell the user and bail out.
Step 3: Take the query results and provide them as context when asking the LLM to answer the question
Step 4: Return the results to the user
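Before diving into each step, here is a minimal Python sketch of how these four steps fit together. The helper names (select_data_source, run_trends_query, answer_with_context) are hypothetical stand-ins for the pieces described below, not the exact code from the repo:

# Hypothetical orchestration of the four steps above; helper names are illustrative.
def handle_question(question: str) -> str:
    # Step 1: ask the LLM which data source (if any) fits the question
    data_source = select_data_source(question)  # e.g. "US Google Trends"

    # Step 2: bail out if no relevant data source exists
    if data_source is None:
        return "Sorry, I don't have a data source that can answer that question."

    # Step 2 (continued): run the pre-defined BigQuery query for that source
    trends_csv = run_trends_query(data_source, question)

    # Step 3: pass the query results as context and ask the LLM to answer
    answer = answer_with_context(question, trends_csv)

    # Step 4: return the result to the user
    return answer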
For the LLM we will use Gemini 1.5 Pro. This model is a good choice for this use case because we are dealing with large amounts of data, so the large 1M token context window will be critical.
As for our trends data source, we will use the Google Trends BigQuery dataset, which provides top 25 overall and top 25 rising terms both in the United States and internationally. This is part of the 200+ free-to-use, publicly available datasets hosted by Google Cloud.
For the UI we will use Streamlit, which can be deployed locally or in the cloud.
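For a sense of how little UI code this requires, a bare-bones Streamlit front end could look something like this sketch (handle_question is the hypothetical orchestration function from the sketch above; the actual app in the repo has more polish):

import streamlit as st

st.title("trendrBOT")
question = st.text_input("Ask a question about Google Search trends")

if question:
    with st.spinner("Thinking..."):
        answer = handle_question(question)  # hypothetical orchestration function
    st.write(answer)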
The Data Sources
The public Google Trends datasets consist of four BigQuery tables:
international_top_rising_terms: Fast-rising terms for non-US markets
international_top_terms: Top terms for non-US markets
top_rising_terms: Fast-rising terms for the US market
top_terms: Top terms for the US market
I decided to stick with the US and international top terms tables to start, and ignored the rising terms for this initial version.
Step 1: Initial Prompt Design
Let’s refer back to our already defined first step:
Step 1: Based on user input, use an LLM to check for a relevant data source
Here is a first attempt to prompt for the correct data source:
Based on the following question, determine which data source is best suited to provide an answer.
Data Sources:
International Google Trends
US Google Trends
Question: [User Question]
Answer:
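To run this prompt, a call along these lines can be used. This is a sketch using the google-generativeai Python SDK; the exact setup in the repo may differ:

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumes an API key is available
model = genai.GenerativeModel("gemini-1.5-pro")

prompt_template = """Based on the following question, determine which data source is best suited to provide an answer.

Data Sources:
International Google Trends
US Google Trends

Question: {question}
Answer:"""

response = model.generate_content(prompt_template.format(question="What's trending in the US?"))
print(response.text)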
When we run this against the question “What's trending in the US?” we get this response:
Data Source for US Trends
For the question "What's trending in the US?", the most suitable data source would be:
US Google Trends
Reason:
Specificity: US Google Trends focuses specifically on search trends within the United States, providing insights into the topics and queries that are currently popular among American users.
Relevance: It directly addresses the geographical scope of the question, offering data that is relevant to the US audience.
Granularity: US Google Trends allows for analysis at the national, state, and even city level, providing a more detailed understanding of regional trends within the US.
While International Google Trends offers a broader perspective on global search interests, it wouldn't provide the specific insights needed to answer the question about US trends.
This answer is correct, but all we want is the data source name. Let’s add the sentence “Respond with just the data source” to our prompt and try again:
US Google Trends
This example looks good but I noticed that sometimes the model would respond with something like the following:
Answer: US Google Trends
This is acceptable for a basic chatbot, but since we need to use these results downstream in the next step of our agent architecture, we either need strict output enforcement or custom parsing code. There were also a few cases where the wrong dataset was suggested, even though the right choice was "fairly obvious" from my point of view. Luckily, we have a technique called few-shot prompting that can help.
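Before we get to that, the formatting quirks above can also be handled defensively with simple output normalization and validation. Here is a minimal sketch (the data source names match this example; the actual parsing code in the repo may differ):

VALID_SOURCES = {"US Google Trends", "International Google Trends"}

def parse_data_source(raw_response: str) -> str | None:
    """Strip boilerplate like a leading 'Answer:' and validate against known sources."""
    cleaned = raw_response.strip()
    if cleaned.lower().startswith("answer:"):
        cleaned = cleaned[len("answer:"):].strip()
    return cleaned if cleaned in VALID_SOURCES else None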
Few Shot Prompting
Few-shot prompting is just a fancy way of saying that we will provide our LLM with examples so that it better understands our expected output format. Choosing proper examples that demonstrate the structure of the output without overfitting can be difficult. In general, I would recommend investing in or building a model evaluation framework and a few-shot prompt building process to facilitate this optimization.
Let’s skip those fine details for now and jump right to the 3-shot prompt that was built:
Based on the following question, determine which data source is best suited to provide an answer.
Data Sources:
International Google Trends
US Google Trends
Question: What was trending in the US?
Answer: US Google Trends
Based on the following question, determine which data source is best suited to provide an answer.
Data Sources:
International Google Trends
US Google Trends
Question: What has been trending outside of the US?
Answer: International Google Trends
Based on the following question, determine which data source is best suited to provide an answer.
Data Sources:
International Google Trends
US Google Trends
Question: What is popular in Africa?
Answer: International Google Trends
Based on the following question, determine which data source is best suited to provide an answer.
Data Sources:
International Google Trends
US Google Trends
Question: [User Question]
Answer:
This prompt seemed to generate results consistently. One thing that was added for the final version and not shown here is a list of countries to target. So now, even this relatively simple prompt is starting to get complex as requirements change. You can imagine what would be needed if we had, for example, 100 data sources. In that scenario, we might have to create dozens or hundreds of examples to guide the LLM properly, and at that point model tuning starts to make sense.
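As a side note, this kind of few-shot prompt is easier to maintain when it is assembled programmatically, so examples can be added or swapped without rewriting the template by hand. Here is a rough sketch mirroring the examples above (the repo's actual prompt-building code may differ):

TEMPLATE = (
    "Based on the following question, determine which data source is best suited "
    "to provide an answer.\n\n"
    "Data Sources:\n"
    "International Google Trends\n"
    "US Google Trends\n\n"
    "Question: {question}\n"
    "Answer:{answer}"
)

FEW_SHOT_EXAMPLES = [
    ("What was trending in the US?", " US Google Trends"),
    ("What has been trending outside of the US?", " International Google Trends"),
    ("What is popular in Africa?", " International Google Trends"),
]

def build_few_shot_prompt(user_question: str) -> str:
    # Render each example, then append the real question with an empty answer slot.
    shots = [TEMPLATE.format(question=q, answer=a) for q, a in FEW_SHOT_EXAMPLES]
    shots.append(TEMPLATE.format(question=user_question, answer=""))
    return "\n\n".join(shots)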
Step 2: Query the Database
Here is our previously defined second step:
Step 2: Query the matched data source. If a relevant data source does not exist, tell the user and bail out.
This is where we choose between the two data sources output by the LLM in the previous step and execute a SQL query. I wrote a pre-defined SQL query for each data source that returns the most recent and highest-ranking results, truncated to 50,000 rows. One feature added later is country designation: if the user input specifies an individual country, we modify the query to add a condition to the WHERE clause. Here is the query for the international trends dataset:
WITH abc AS (
SELECT term, rank, week, country_name
FROM `bigquery-public-data.google_trends.international_top_terms`
GROUP BY term, rank, week, country_name
ORDER BY week DESC, rank ASC
)
SELECT DISTINCT term, rank, week
FROM abc
WHERE country_name = [COUNTRY]
ORDER BY week DESC, rank ASC
LIMIT 50000
These results are then stored as a CSV, with a header row prepended.
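As a rough sketch, executing such a query and capturing the results as CSV with the BigQuery Python client might look like the following. Note that this version uses a parameterized @country filter as a simplified stand-in for the [COUNTRY] placeholder above; the repo's actual query-building code may differ:

from google.cloud import bigquery

def run_trends_query(sql: str, country: str | None = None) -> str:
    """Run a pre-defined trends query and return the results as CSV text."""
    client = bigquery.Client()
    params = []
    if country:
        params.append(bigquery.ScalarQueryParameter("country", "STRING", country))
    job_config = bigquery.QueryJobConfig(query_parameters=params)
    df = client.query(sql, job_config=job_config).to_dataframe()
    return df.to_csv(index=False)  # the header row is included automatically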
I left out more advanced querying options for this version because of the nuance in the way the data is represented. As one example, I would love to filter results by US state, but the US dataset actually stores data in DMAs (see Designated Market Areas). These DMAs do not follow strict state boundaries, so matching regional results to user expectations would be difficult. This type of nuance is typical of real-world datasets, which is what makes text2SQL very difficult to accomplish as a generalized use case.
Step 3: Prompt for an Answer
Here is our previously defined third step:
Step 3: Take the query results and provide them as context when asking the LLM to answer the question
Here is our prompt:
Answer the following question based on the provided trending topics. When in doubt, answer with the most recent data
Question:
[ORIGINAL USER QUESTION]
Google Search trends data:
[DATA GENERATED FROM STEP 2]
There is probably room for improvement here. We could better describe what type of output we are expecting or provide few-shot examples. My experience is that few-shot prompting for this type of output risks overfitting, so I decided to leave it as is for this version.
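For completeness, here is a rough sketch of how this step might be wired up, again using the google-generativeai SDK; model is assumed to be the Gemini 1.5 Pro GenerativeModel instance created in Step 1:

def answer_with_context(question: str, trends_csv: str) -> str:
    prompt = (
        "Answer the following question based on the provided trending topics. "
        "When in doubt, answer with the most recent data\n\n"
        f"Question:\n{question}\n\n"
        f"Google Search trends data:\n{trends_csv}"
    )
    # model: the gemini-1.5-pro GenerativeModel instance from Step 1
    response = model.generate_content(prompt)
    return response.text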
Step 4: Return the Results
The final results look pretty good overall for many questions that I tried out. Here is one basic example:
Question: What are the top 10 trends in the US for the latest available data? Simply list them in bullet points.
Answer:
O.J. Simpson
Real Madrid vs Man City
Masters 2024
Total eclipse 2024
UFC 300
Purdue basketball
Powerball
PSG vs Barcelona
Champions League
What time is the eclipse
And here is a more complex example:
Question: Are there noticeable differences in trends between European and Asian countries?
Answer:
Trend Differences between Europe and Asia:
Based on the provided Google Search trends data from the week of April 7, 2024, we can observe some noticeable differences in trending topics between European and Asian countries:
Europe:
Dominance of Football: A large portion of the top trends relates to European football leagues and matches, like the Champions League ("Ligue des champions", "Liga Mistrzów", etc.), and specific games like "Man United vs Liverpool". This indicates a strong interest in football within the European region.
Solar Eclipse Interest: Several European countries show interest in the upcoming solar eclipse on April 8th, 2024, with searches like "Eclipsa de Soare 2024 romania" and "Sonnenfinsternis 2024".
Regional Events: Some trends highlight regional events or personalities, such as "Wybory samorządowe 2024" (Polish local elections) and "אבשלום פלד" (an Israeli personality).
Asia:
Variety of topics: Compared to Europe, the top trends in Asia exhibit a wider variety of interests. While football ("Man utd Liverpool", "Arsenal đấu với Bayern") and the solar eclipse ("Gerhana matahari 8 April 2024") are present, there's also a mix of entertainment, cultural events and personalities.
Entertainment and Culture: Searches like "桜花賞" (a Japanese horse race), "KKR vs CSK" (an Indian Premier League cricket match), and "버거킹 와퍼 판매 종료 이유" (reasons for Burger King Whopper discontinuation in Korea) point towards diverse entertainment and cultural interests.
Religious Observances: The presence of "Idul Fitri 2024" and related searches like "Niat sholat Idul Fitri" in several Asian countries indicates the significance of this religious holiday in the region.
Conclusion:
While there's some overlap in trending topics, particularly with major events like the solar eclipse and popular sports, European trends lean more heavily towards football, while Asian trends showcase a wider range of interests encompassing entertainment, culture, and religion.
Limitations
Do not expect perfect results for all types of questions. Some limitations include:
The Google Trends Public Dataset is sporadically updated
Even the full 1M token context window provided by Gemini 1.5 Pro is not large enough to contain the entirety of the Google Trends public dataset, so query results are usually truncated when submitting for analysis
With only the limited data points provided, the LLM will often speculate, rightly or wrongly, on the categorization or context of trends
Due to its inherent architecture, the LLM's ability to perform precise calculations is limited
Execution can take up to 1 minute. A simple optimization would be to cache the BigQuery results, but the two Gemini 1.5 Pro requests, the second of which has a large context, will take some time regardless.
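For the caching idea mentioned in that last point, Streamlit's built-in cache decorator is one simple option. Here is a sketch (the TTL value is arbitrary, and cached_trends_query wraps the hypothetical query helper from Step 2):

import streamlit as st

@st.cache_data(ttl=3600)  # cache BigQuery results for an hour; the TTL is arbitrary
def cached_trends_query(sql: str, country: str | None = None) -> str:
    # run_trends_query is the hypothetical helper sketched in Step 2
    return run_trends_query(sql, country)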
Final thoughts: The road to more complex agents
Whew! That was a lot of work. Our trendrBOT agent, while imperfect, is complete for now.
I hope this breakdown of building a relatively simple agent demonstrates why it is so difficult to scale these applications as scope and complexity increase. It comes down to the combination of these factors:
Large models are autoregressive and sample their outputs, which makes them non-deterministic and thus hard to predict and control
Our current tools for mitigating the difficulty of working with large models lack maturity. I purposely avoided using agent frameworks such as LangChain, but my opinion is that, in their current state, these tools can actually hinder development in a number of ways. Those details will have to be covered in a future article.
As each additional step and feature is added to an agent application, defining, designing and validating each LLM dependent task becomes increasingly difficult. A single point of failure or unexpected behavior can cause a chain reaction, taking down a carefully crafted logical flow.