Retrieval Augmented Generation (RAG) On Google Cloud
A comprehensive breakdown of available services and recommended architectures by use case
This article was written by me and cross-posted on the Zencore Engineering blog. Zencore is a cloud consulting and engineering firm led by a team of former Google engineers and executives.
Overview
Google Cloud offers many different options for building Retrieval Augmented Generation (RAG) powered applications. These include the discrete components that make up a RAG solution (embeddings, vector search, LLMs) as well as options that combine multiple steps, or even the entire RAG application, in a single service. The best option for you will depend on factors such as your use case, engineering expertise, existing tech stack, and future needs.
We will start with a set of use cases and design a solution architecture for each using the most appropriate options. After that, we will go through a detailed breakdown of the full list of services, with the pros, cons, and recommendations for when to use each.
Disclaimer: These services are all under active development so expect that these details may change over time.
Use Cases
Let’s consider a few different use cases and what an appropriate architecture might look like for each. This is not a comprehensive list of use cases but rather illustrative examples.
Use Case #1 | No-Code Educational Chatbot
Description: We would like to add a chatbot to our website homepage to answer questions based on a series of educational articles hosted on our website. We do not have a large frontend or backend engineering team.
Recommendation: Vertex AI Agent Builder offers a no-code solution to ingest web pages into a data store, build a chatbot connected to that data store, and host that chatbot on your website.
Steps:
Create a Vertex AI Search data store to crawl and embed the web pages automatically
Create a Vertex AI Agent Builder app and connect to the data store
Use the agent's website JavaScript embed feature to add the chatbot to an existing web page
Use Case #2 | Enhance a Custom Built Chatbot
Description: We have an existing custom-built, web-based chatbot powered by Gemini 1.0 Pro. We would like to add the ability for users to select from a list of data corpora and then ask questions about the documents in that corpus. What’s the easiest way to enable this?
Recommendation: Vertex AI Grounding is the easiest way to add document retrieval to an existing Gemini based chatbot.
Steps:
Upload each collection of documents into separate folders or buckets in Cloud Storage
Ingest each set of documents as a separate Vertex AI Search data store
Add your list of data stores as a selectable dropdown menu to your chatbot UI
When the user makes a chatbot request with an associated data store, set grounding on that data store and process the request (see the sketch below)
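Here is a minimal sketch of that last step in Python, using the Vertex AI SDK’s grounding support. The project, location, and data store resource names are placeholders, and the grounding module may graduate out of preview over time:

```python
import vertexai
from vertexai.generative_models import GenerativeModel, Tool
from vertexai.preview.generative_models import grounding

vertexai.init(project="your-project-id", location="us-central1")

# Map each dropdown option in the chatbot UI to a Vertex AI Search data store.
# These resource names are hypothetical placeholders.
DATA_STORES = {
    "Product Docs": (
        "projects/your-project-id/locations/global/"
        "collections/default_collection/dataStores/product-docs"
    ),
}

def answer(question: str, corpus: str) -> str:
    # Ground Gemini on the data store the user selected in the UI.
    retrieval_tool = Tool.from_retrieval(
        grounding.Retrieval(grounding.VertexAISearch(datastore=DATA_STORES[corpus]))
    )
    model = GenerativeModel("gemini-1.0-pro")
    response = model.generate_content(question, tools=[retrieval_tool])
    return response.text
```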
Use Case #3 | Literature Study Assistant Chatbot
Description: Our literature website hosts free classical literature that is in the public domain (e.g. the complete works of William Shakespeare). We want to build a study assistant where users can ask detailed questions about characters, events, and interpretations of individual books. An example question might be: “What was different about character A in chapter 1 compared to chapter 10?”
Recommendation: Use the extended 1M+ token context window of Gemini 1.5 Pro to pass entire books into the LLM’s context window, which should generate superior results compared to a typical embedding-based retrieval approach.
Steps:
Upload each book into Cloud Storage
Add your list of books as a selectable dropdown menu to your chatbot UI
When the user makes a chatbot request with an associated book, pass the entire book as context to help answer each question (see the sketch below)
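Here is a minimal sketch of steps 1 and 3 combined, assuming each book is stored as a plain-text file in Cloud Storage (the bucket, file, and model version names are placeholders):

```python
import vertexai
from google.cloud import storage
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")

def ask_about_book(bucket_name: str, book_file: str, question: str) -> str:
    # Fetch the full book text from Cloud Storage.
    blob = storage.Client().bucket(bucket_name).blob(book_file)
    book_text = blob.download_as_text()

    # Pass the entire book as context; the 1M+ token window of Gemini 1.5 Pro
    # fits most full-length books.
    model = GenerativeModel("gemini-1.5-pro-preview-0409")  # version name will change
    prompt = f"{book_text}\n\nUsing only the book above, answer: {question}"
    return model.generate_content(prompt).text

print(ask_about_book("my-books-bucket", "hamlet.txt",
                     "What was different about Hamlet in Act 1 compared to Act 5?"))
```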
Use Case #4 | AI Art Website Advanced Image Search
Description: Our AI art website has nearly ten million images and receives over one million unique visitors per day. We have a feature to filter images by image attributes such as category, size, tags and more. The images are stored on Cloud Storage and the attributes in a PostgreSQL database. We would like to add a feature to allow users to upload an image to find similar images and to filter by image attributes.
Recommendation: Use Vertex AI Multimodal Embeddings to embed your images, store those embeddings with pgvector in AlloyDB (roughly 10x the vector search performance of standard PostgreSQL), and query AlloyDB for both image similarity and image attribute filters.
Steps:
Migrate your PostgreSQL table that stores your image attributes to AlloyDB, a PostgreSQL-compatible database
Add a vector column (pgvector’s vector type) to your AlloyDB table
Use the Vertex AI Multimodal Embeddings API to embed all of your images and store the results in the vector column in AlloyDB
When a website image similarity search request is made, embed the uploaded image, then perform the vector similarity search and image attribute filtering as a single query within AlloyDB and return the top similar images (see the sketch below)
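Here is a minimal sketch of the search path, assuming a hypothetical images table in AlloyDB with a category attribute and an embedding vector(1408) column, connected via psycopg2 (connection details are placeholders):

```python
import psycopg2
from vertexai.vision_models import Image, MultiModalEmbeddingModel

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

def find_similar(image_path: str, category: str, limit: int = 10) -> list[str]:
    # Embed the uploaded image (1408-dimensional by default).
    embedding = model.get_embeddings(image=Image.load_from_file(image_path)).image_embedding

    # Similarity search plus attribute filter in a single query; pgvector's <->
    # operator is L2 distance (use <=> for cosine distance).
    conn = psycopg2.connect(host="your-alloydb-ip", dbname="images", user="app")
    with conn, conn.cursor() as cur:
        cur.execute(
            """SELECT uri FROM images
               WHERE category = %s
               ORDER BY embedding <-> %s::vector
               LIMIT %s""",
            (category, str(embedding), limit),
        )
        return [row[0] for row in cur.fetchall()]
```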
Service Details
Embeddings
Vertex AI Text Embeddings
About:
Compatible with text input only
Supervised parameter-efficient tuning is available
Input: Priced per input character, with a discount for batch requests
Output: No charge
Quota: 1,500 requests/minute
Pros:
Simple, easy to use API
Cons:
The 2,000-token input limit is relatively small
Generates a single 768-dimensional vector per embedding with no custom sizing, which may be limiting
Open source alternatives are abundant and generally inexpensive to host
Recommended for:
If you do not want to manage open source models and have limited customization needs (see the usage sketch below)
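Here is a minimal usage sketch with the Vertex AI SDK (the model version string is a placeholder that changes over time):

```python
from vertexai.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")  # version may change

# Each result carries a single 768-dimensional vector in .values.
embeddings = model.get_embeddings([
    "What is Retrieval Augmented Generation?",
    "RAG combines retrieval with LLM generation.",
])
for embedding in embeddings:
    print(len(embedding.values))  # 768
```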
Vertex AI Multimodal Embeddings
About:
Compatible with text, image and video input
Input: Priced per input character, per image, and per video second (with varying fps options)
Output: No charge
Quota: 120 requests/minute
Pros:
Native video embedding is very convenient and the varying video embed options can help to control costs
Generates default 1408-dimensional vector per embedding. You can also specify lower-dimension embeddings (128, 256, or 512 float vectors) for text and image data.
Cons:
No tuning options
The 32-token text input limit is very restrictive
Recommended for:
Image (and especially video) embedding where similarity search quality is important (see the video sketch below)
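Here is a minimal video embedding sketch. VideoSegmentConfig controls how often a segment embedding is produced, which is also the main cost lever; the file name is a placeholder:

```python
from vertexai.vision_models import MultiModalEmbeddingModel, Video, VideoSegmentConfig

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

# One 1408-dimensional embedding per 16-second segment; wider intervals cost less.
response = model.get_embeddings(
    video=Video.load_from_file("local_clip.mp4"),  # placeholder local file
    video_segment_config=VideoSegmentConfig(interval_sec=16),
)
for segment in response.video_embeddings:
    print(segment.start_offset_sec, segment.end_offset_sec, len(segment.embedding))
```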
Vector Search
Vertex AI Vector Search
About:
Dedicated vector similarity search database based on Google Research’s ANN / ScaNN techniques
Size your deployment by selecting from various VM machine and cluster sizes
Hybrid Search (dense+sparse) announced at Google Cloud Next ’24 and is being prepared for public preview
Pricing is determined by the size of your data, the number of queries per second (QPS) you want to support, and the number of nodes you use
Quotas:
Request the necessary quotas per VM type and region before deploying
Pros:
Best-in-class vector search for speed and scalability. Scale up to billions of entries and return results in single-digit milliseconds.
Cons:
Can be expensive at scale
Primitive API and SDK. I recommend leaning heavily on Python helper functions.
Requires boot-up time for the underlying VM cluster, which is then always on
Your vectors live separately from the rest of your data, and while you can do basic query filtering, it is limited compared to a traditional database
Recommended for:
High-performance, at-scale use cases only (see the query sketch below)
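Here is a minimal query sketch against an index that has already been built and deployed (the resource names and vector dimension are placeholders):

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# Connect to an index endpoint with an index already deployed to it.
endpoint = aiplatform.MatchingEngineIndexEndpoint(
    index_endpoint_name="projects/123/locations/us-central1/indexEndpoints/456"  # placeholder
)

neighbors = endpoint.find_neighbors(
    deployed_index_id="my_deployed_index",  # placeholder
    queries=[[0.1] * 768],                  # one query vector; must match the index dimension
    num_neighbors=10,
)
for neighbor in neighbors[0]:
    print(neighbor.id, neighbor.distance)
```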
BigQuery Vector Search
About:
Search options: ANN or exact
Use the BigQuery native function ML.GENERATE_EMBEDDING to create embeddings right within BigQuery. This uses the Vertex AI Text Embeddings API under the hood.
Use the BigQuery native statement CREATE VECTOR INDEX to create the vector index and the VECTOR_SEARCH function to search.
Uses standard compute pricing
There is no charge for the processing required to build and refresh your vector indexes when the total size of indexed table data in your organization is below the 20 TB limit
Tables must have between 5,000 and 1 billion rows
Other very specific restrictions apply
Pros:
Generate and store embeddings and perform vector search all self contained within BigQuery
Combine vector search with joins and filters on your related data in a single step
Pricing should be very competitive for small to moderate BigQuery data sizes and query frequency
Cons:
Preview state
There are various table and row size restrictions (see quotas above)
Recommended for:
If BigQuery is your primary datastore and you either need to join and filter search results against related data or only need basic similarity search functionality (see the sketch below)
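Here is a minimal sketch of the native SQL run from Python. The dataset, table, and model names are hypothetical, and it assumes an embedded articles table and a remote embedding model have already been created:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Build an ANN index once; BigQuery refreshes it as the table changes.
# (Assumes mydataset.articles has an ARRAY<FLOAT64> column named `embedding`.)
client.query("""
CREATE VECTOR INDEX IF NOT EXISTS articles_idx
ON mydataset.articles(embedding)
OPTIONS (index_type = 'IVF', distance_type = 'COSINE')
""").result()

# Embed the query text and search in one statement; mydataset.embedding_model is
# a hypothetical remote model over the Vertex AI Text Embeddings API.
rows = client.query("""
SELECT base.title, distance
FROM VECTOR_SEARCH(
  TABLE mydataset.articles, 'embedding',
  (SELECT ml_generate_embedding_result AS embedding
   FROM ML.GENERATE_EMBEDDING(MODEL mydataset.embedding_model,
        (SELECT 'vector databases' AS content))),
  top_k => 5, distance_type => 'COSINE')
""").result()

for row in rows:
    print(row.title, row.distance)
```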
CloudSQL for PostgreSQL (pgvector)
About:
Google Cloud’s managed PostgreSQL database
Utilize the built in open source pgvector functionality
pgvector offers various search options and supports vectors of up to 2,000 dimensions
Standard CloudSQL for PostgreSQL pricing applies
Standard CloudSQL for PostgreSQL quotas
Pros:
Combine vector search with joins and filters in a single step
Cons:
Embeddings must be generated separately
Recommended for:
If PostgreSQL is your primary datastore and you are performing batch lookups that are not performance-sensitive (see the setup sketch below)
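Here is a minimal setup sketch using psycopg2, with hypothetical connection details and table names; the same SQL applies to AlloyDB in the next section:

```python
import psycopg2

conn = psycopg2.connect(host="your-cloudsql-ip", dbname="appdb", user="app")
with conn, conn.cursor() as cur:
    # Enable the extension once per database.
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")

    # Dimension must match your embedding model (768 for Vertex AI Text Embeddings).
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id BIGSERIAL PRIMARY KEY,
            content TEXT,
            embedding vector(768)
        )""")

    # Optional ANN index; without it, pgvector performs exact search.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS documents_embedding_idx
        ON documents USING ivfflat (embedding vector_cosine_ops)""")
```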
AlloyDB for PostgreSQL (pgvector)
About:
Google Cloud’s managed PostgreSQL compatible database offering better performance
Utilize the built in open source pgvector extension
pgvector offers various search options and supports vectors of up to 2,000 dimensions
Standard AlloyDB pricing applies
Standard AlloyDB quotas
Pros:
Combine vector search with joins and filters in a single step
10x vector search performance compared to CloudSQL for PostgreSQL
Cons:
Embeddings must be generated separately
AlloyDB is about 40% more expensive than CloudSQL for PostgreSQL
Recommended for:
If PostgreSQL is your primary datastore and you are performing interactive lookups that are performance-sensitive, choose AlloyDB over CloudSQL for PostgreSQL
Spanner Vector Search
About:
Implement embedding generation and vector search right within Spanner via custom code (see the examples provided in the official Google Cloud documentation).
Standard Spanner, embeddings API and Spanner custom ML model pricing applies
Standard Spanner quotas apply
Pros:
Generate and store embeddings and perform vector search all self contained in Spanner
Cons:
Preview state
Implementation is done via custom code versus a formally productized solution
Recommended for:
If Spanner is your primary datastore and you need to join and filter search results against related data (see the sketch below)
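Here is a minimal sketch of the custom-code approach, assuming a hypothetical documents table with an ARRAY&lt;FLOAT64&gt; embedding column and using Spanner’s COSINE_DISTANCE function (in preview at the time of writing):

```python
from google.cloud import spanner

client = spanner.Client(project="your-project-id")
database = client.instance("your-instance").database("your-database")

def nearest_documents(query_embedding: list[float], limit: int = 5):
    # Exact (brute-force) nearest-neighbor search ordered by cosine distance.
    with database.snapshot() as snapshot:
        results = snapshot.execute_sql(
            """SELECT id, content,
                      COSINE_DISTANCE(embedding, @query) AS distance
               FROM documents
               ORDER BY distance
               LIMIT @limit""",
            params={"query": query_embedding, "limit": limit},
            param_types={
                "query": spanner.param_types.Array(spanner.param_types.FLOAT64),
                "limit": spanner.param_types.INT64,
            },
        )
        return list(results)
```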
Grounding, RAG & Agent Solutions
Vertex AI Agent Builder
About:
Formerly branded as Gen App Builder and later Search & Conversation, Vertex AI Agent Builder offers end-to-end apps or per-component features to build custom search indexes, recommendation engines, chatbots, and agents.
App types:
Search: Get quality results out-of-the-box and easily customize the engine
Chat: Answer complex questions out-of-the-box
Recommendations: Create a content recommendation engine
Agent: Built using natural language, agents can answer questions from data, connect with business systems through tools, and more
Data stores:
About: Vertex AI Search data stores are an all-in-one managed service to ingest different types of assets, create embeddings, and act as a vector database for search
Data sources: Create a data store based on files in Cloud Storage or Google Drive, by indexing your website, from BigQuery and more via the cloud console or API
Data types: Ideal for unstructured data (PDFs, HTML, TXT) but also compatible with different structured data formats as well (with many limitations)
Usage: Data stores can be connected to any of the Agent Builder app types, used as a source for Vertex AI Grounding or used in any custom fashion via API
You can use your own custom embeddings
Pay separately for app queries, data store queries, and data store index size. Use the estimateDataSize method to calculate expected index size.
Various Agent Builder quotas and limits apply such as document count and query frequency
Pros:
Full UI creation (in addition to API), so this can be helpful for teams with limited engineering resources
Provides embeddable web UI code designed for each app type
Flexibility to use one or more components discretely or combine for a full end-to-end solution
Cons:
Preview state
Each component comes with pre-selected embeddings, vector search, and LLM choices, which can be limiting
It can be relatively pricey
Understanding the full set of rapidly changing products and features can be confusing (see the constant rebranding)
Recommended for:
The easiest no-code way to build a RAG retriever (embeddings and vector search) and a RAG-based chatbot with an embeddable UI (see the search sketch below)
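Here is a minimal sketch of querying a data store directly through the Discovery Engine API, with placeholder project and data store IDs:

```python
from google.cloud import discoveryengine_v1beta as discoveryengine

client = discoveryengine.SearchServiceClient()

# Serving config of an existing search app / data store (placeholder IDs).
serving_config = client.serving_config_path(
    project="your-project-id",
    location="global",
    data_store="your-datastore-id",
    serving_config="default_config",
)

response = client.search(discoveryengine.SearchRequest(
    serving_config=serving_config,
    query="how do I reset my password?",
    page_size=5,
))
for result in response.results:
    print(result.document.id)
```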
Vertex AI Grounding
About:
Select a preexisting Vertex AI Search data source (enterprise edition) and your model (Gemini) and it performs RAG automatically
See Ground to your own data for more information
Pricing:
Gemini 1.0 Pro: pay per input/output character
Vertex AI Search (enterprise edition): pay for data indexing and per query
Quotas:
See Search data source quotas for number of documents to index and query volume
Pros:
Can be enabled right within the Vertex AI Gemini prompting UI and easily added to Gemini API calls
Reuse data stores across Grounding, Agent Builder and Search options (some restrictions will apply)
Cons:
Preview state
Only available for Gemini 1.0 Pro today
Limits the potential for customization (e.g. hybrid search)
Recommended for:
The easiest way to add basic RAG to existing Gemini chatbots and custom codebases (see the sketch under Use Case #2 above)
LangChain on Vertex AI
About:
Based on LangChain, a very popular open source library, Google Cloud offers a custom SDK with LangChain integrations
LangChain supports general-purpose LLM-based applications, so it works for RAG use cases plus many more
Pricing:
N/A
Quotas:
N/A
Pros:
Developers already familiar with LangChain can get started with Google Cloud integrated LangChain services quickly
Extensive set of helper functions and code examples
Has connectors across both Google Cloud and non-Google Cloud services, which makes it easy to swap out services as needed
Cons:
Preview state
Pure Python implementation, so engineering team must fully implement and maintain
For LangChain in general, versions and features/parameters can differ between the LangChain library and the direct Google Cloud equivalent, with questionable error handling. This can make it harder to optimize and debug compared to using the direct APIs and services.
It is unclear what advantages most users will have using LangChain on Vertex AI versus the regular open source implementation of LangChain
Recommended for:
Teams experienced with LangChain and/or teams with existing LangChain code that will be easy to port to Google Cloud (see the sketch below)
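Here is a minimal RAG sketch using the LangChain Google Cloud integrations. The corpus and retrieval are deliberately toy-sized (brute-force cosine similarity in NumPy); at scale you would swap in one of the vector stores covered above:

```python
import numpy as np
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings

embedder = VertexAIEmbeddings(model_name="textembedding-gecko@003")
docs = ["Vertex AI Vector Search is built on ScaNN.",
        "AlloyDB supports pgvector for similarity search."]
doc_vectors = np.array(embedder.embed_documents(docs))

def retrieve(query: str, k: int = 1) -> list[str]:
    # Brute-force cosine similarity over the toy corpus.
    q = np.array(embedder.embed_query(query))
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)
chain = prompt | ChatVertexAI(model_name="gemini-1.0-pro") | StrOutputParser()

question = "What powers Vertex AI Vector Search?"
print(chain.invoke({"context": "\n".join(retrieve(question)), "question": question}))
```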
Vertex AI RAG (LlamaIndex)
About:
Based on LlamaIndex, an open source framework purpose-built for retrieval-style applications
Add your documents, select a model (Gemini), and it performs RAG automatically
Document sources: single file, Cloud Storage and Google Drive
Pricing:
N/A
Quotas are present across data ingestion, data management and retrieval
Pros:
Developers already familiar with LlamaIndex can get started with Google Cloud’s integrated LlamaIndex services quickly
Extensive set of helper functions and code examples
Has connectors across both Google Cloud and non-Google Cloud services, which makes it easy to swap out services as needed
Cons:
Preview state
Pure Python implementation, so engineering team must fully implement and maintain
Recommended for:
Teams experienced with LlamaIndex working specifically on RAG-style projects (see the sketch below)
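Here is a minimal sketch of the preview API. Because the service is in preview, the exact module path and parameter names have been shifting between releases, so treat the names below (corpus, bucket, and query text included) as illustrative placeholders:

```python
import vertexai
from vertexai.preview import rag

vertexai.init(project="your-project-id", location="us-central1")

# Create a corpus and ingest documents from a Cloud Storage prefix (placeholder paths).
corpus = rag.create_corpus(display_name="my-rag-corpus")
rag.import_files(corpus.name, ["gs://your-bucket/docs/"], chunk_size=512)

# Retrieve the chunks most relevant to a query.
response = rag.retrieval_query(
    rag_resources=[rag.RagResource(rag_corpus=corpus.name)],
    text="What are the key themes of chapter one?",
    similarity_top_k=5,
)
print(response.contexts)
```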
Other Options
Gemini 1.5 Pro — Full document analysis using 1M+ token context window
About:
Gemini 1.5 Pro is Google’s latest foundation model available via API and the cloud console
New to Gemini 1.5 Pro is an extra-large context window of 1 million (soon to be 2 million) tokens. This allows very large text documents (and other input types) to be processed at runtime.
As an alternative to RAG, provide full documents as context. This should generally outperform RAG.
For multi-document use cases, a document-to-user-query lookup will be required. The Document Ranking API can help solve this.
Pricing is constantly changing, so check out the latest at the link provided.
Input: Priced per text character, per image, per video second and per audio second
Output: Priced per character
As of this writing, Gemini 1.5 Pro has a very low default quota of 5 queries/minute. This should increase over time, closer to the current Gemini 1.0 Pro quota (300 queries/minute).
Pros:
Better recall performance than RAG: 99% accuracy. This is a huge improvement over prior-generation LLMs, where recall suffered as context length increased.
Removes the need for the complexity of implementing RAG (depending on use case)
Cons:
Preview state with very low default quota
Pay-per-token pricing means that many requests with large context length may get quite expensive
Response times increase linearly with context length, which could lead to worse response times compared to RAG.
Recommended for:
Cases where queries are generally about single documents that each fit within the model’s token limit and precision matters more than inference time (see the sketch under Use Case #3 above)