RAG Pipeline Cost · for RAG builders & CTOs

Full RAG stack cost - in one calculator

Embeddings + vector DB + rerank + generation, with prompt-cache savings modeled. Pick a preset or build your own stack across 9 embedding models, 8 vector DBs, and 132 generation models.

Pricing verified: 2026-07-22 5-stage cost model Cache-aware

What this calculator does

End-to-end cost of a RAG stack — embeddings + vector DB + rerank + generation + prompt cache — in one place.

Why use it

See which of the 5 stages actually dominates your bill (usually generation)
Compare 4 pre-built vendor stacks (Budget, Balanced, Premium, Self-host) at your workload
Avoid the classic RAG cost traps: top-K bloat, missing prompt cache, aggressive re-indexing
Get a shareable URL that captures every input — send to your team as a decision artifact

New to this calculator? Start with the ⚡ Playground — a few sliders, instant ballpark. Then switch to the 🧮 Calculator for your exact number.

Two ways to use this: visualize in the Playground, then get your number in the Calculator.

▶Playground A quick, visual way to see which factors move your result the most. Open the playground → ▤Calculator Enter your real workload for a precise result you can apply to your own usage. Go to the calculator →

What will a RAG pipeline cost?

Query-time LLM calls dominate; indexing is a small one-time slice. Balanced tier, ~30 days assumed.

Estimated monthly cost

—

Change the input sliders below to see new estimates.

Queries / day 5,000

Retrieved tokens / query 6,000

Output tokens / query 500

How we got this estimate

—

💡Cost = queries × (retrieved + output tokens) × rate × ~30 days. More retrieved context means more to read every single query.

A RAG stack has many cost layers — embeddings, storage, retrieval, generation. Enter your numbers below to see the full monthly picture.

📊 Not sure of a value? Fields with a ▾ Typical pill offer broad industry ballparks (sourced typical ranges; Doc sizes: tech docs ~5K tokens, legal ~31K, long reports ~80K; dense pages ~260-520 tokens.) so you can move forward now — your result gets more accurate as you replace them with your own measured numbers. Values marked * are rough estimates.

🎮 Interactive Guide & Calculator Playground →

Enter key values

Monthly queries, chunk size × top-K (these drive generation cost), and pick a generation model. Leave the rest at defaults.

Pick a preset

Click Budget / Balanced / Premium / Self-host. Each preset swaps the embedding, vector DB, and generation model in one click.

Read the big number

Top card = total monthly cost. Green bar below shows which of the 5 stages eats the most.

Act on it

If generation > 60% of total, try a cheaper gen model or enable prompt cache. If storage > generation, check if you need managed vs self-hosted.

Enter every field

Start with monthly queries (the single biggest driver). Then corpus docs × tokens-per-doc for storage. Chunk size (256/512/1024) affects both storage and generation — 512 is the safe default. Top-K retrieved chunks are shoved into every generation call, so higher K = higher bill. Re-indexes/year only matters if you change embedding models; leave at 2 unless you know otherwise.

Pick a preset then tune

Presets swap 3 things together: embedding model, vector DB, generation model. Budget = OpenAI small + pgvector + GPT-5-mini. Premium = Voyage + Pinecone + Claude Sonnet 4.6. Then adjust the prompt cache hit rate slider — 40% is typical for RAG with stable docs; 0% if your context changes every call. Toggle reranker for an extra ~$0.002/query in exchange for better retrieval quality.

Read every panel

Top card = monthly total + per-query cost. 5-stage breakdown shows ingest / vector DB / query embedding / rerank / generation — the cost-share bar visualizes proportions. Comparison table runs all 4 presets at your workload; green row wins on cost. Recommendations panel auto-flags top-K bloat, cache misuse, and vendor mismatches.

Next actions

If a single stage dominates (>60%), optimize there first. If generation is biggest, see Multi-Model Router to route easy queries cheaper. If you have a cache-capable model at hit rate 0, visit Prompt Cache ROI — could be another 30-50% off. At >500K queries/mo, consider whether fine-tuning beats RAG entirely (RAG vs Fine-Tune).

👇 Now try the calculator below with your own AI workloads

🎛 CALCULATOR

🧩 Your RAG workload

Pick your use case, then a tool stack — tweak anything after.

1 · What's your RAG use case? (sets your workload)

2 · Pick a tool stack (sets the models)

Monthly queries Hint: Searches per month that hit your docs.

Retrievals per query Hint: Lookups per question. Simple Q&A = 1; agent-style = 2-3.

Corpus size (docs) Hint: How many documents are in your set. A round number is fine.

Tokens per doc Hint: Length of a typical doc, in words. A one-pager ~600.

Chunk size

Chunk overlap

Retrieval top-K

Full re-indexes / year Hint: How many times a year you rebuild the whole index.

Prompt cache hit rate Hint: How often the same context repeats across questions. 0 if every query is fresh.

Embedding model Hint: The model that turns docs into searchable form.

Vector DB Hint: Which search-index service stores the vectors.

Generation model Hint: The model that writes the final answer.

Include reranker (Cohere rerank-v3 @ $0.002/search)

📈 RESULTS

Monthly cost for this stack

- -

📥

Ingest

🗄️

Vector DB

🔎

Query embed

🎯

Rerank

🧠

Generation

💡 Recommendations

📋 Compare all 4 preset stacks at your workload

Same queries, same corpus - vendor mix varies. Green row = cheapest, gold = your current config.

Stack	Components	Per query	Monthly	Annual

Vector DB deep-dive → Embedding model comparison → RAG vs Fine-Tuning → Get a RAG architecture review →

🎯 Use this result to

📚 Budget your RAG pipeline — Embedding + retrieval + completion + reranking. Full TCO not just LLM cost.
🔍 Find your cost bottleneck — Usually one stage dominates. Calc surfaces which and how much it costs.
📈 Project at scale — Linear at first, non-linear above 10K queries/day. See your scaling curve.
🔌 Integrate with your AI agents — MCP available for agentic workflow integration. Cost-aware RAG routing.

📅 Schedule a call to apply this to your workload

📋 What now?

Generation usually dominates — at scale the LLM answer call is most of the bill, so routing the generation model (or raising cache hit rate) moves cost far more than tweaking chunking.
Embeddings are mostly one-time — indexing is paid once; only re-indexing and query embeddings recur, so a big corpus is cheaper to run than it looks.
Top-K and rerank are quality/cost dials — higher top-K and a reranker improve recall but add per-query cost; tune against real answer quality.

📅 Book a working session to apply this to your workload →

Need help using this calculator for your workloads?

AICost.ai has 50+ calculators and playbooks. Schedule an AvatarVA meeting and we'll work through your real cost scenarios across AI & Cloud: visibility, cost reduction, optimization, forecasting and capacity planning, without sacrificing accuracy or performance.

📅 Schedule an AvatarVA meeting →

Vendor / Model

Field

Why it’s inferred

Anthropic — Claude Sonnet 4.6

cachedInput

Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.

Anthropic — Claude Sonnet 4.5

cachedInput

Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.

Anthropic — Claude Sonnet 4.5

batchInput

Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.

Anthropic — Claude Sonnet 4.5

batchOutput

Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.

Anthropic — Claude Haiku 4.5

cachedInput

Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.

OpenAI — GPT-5.4 Mini

cachedInput

Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.

OpenAI — GPT-5.4 Nano

cachedInput

Derived at 10% of input — OpenAI 90% cache-hit convention.

OpenAI — GPT-5.4 Nano

batchInput

Derived at 50% of input — OpenAI Batch API uniform 50% discount.

OpenAI — GPT-5.4 Nano

batchOutput

Derived at 50% of output — OpenAI Batch API uniform 50% discount.

OpenAI — GPT-5.4 Pro

cachedInput

Derived at 10% of input — OpenAI 90% cache-hit convention.

OpenAI — GPT-5.4 Pro

batchInput

Derived at 50% of input — OpenAI Batch API uniform 50% discount.

OpenAI — GPT-5.4 Pro

batchOutput

Derived at 50% of output — OpenAI Batch API uniform 50% discount.

OpenAI — GPT-5.2

cachedInput

Derived at 10% of input; no residency uplift.

OpenAI — GPT-5.2

batchInput

Derived at 50% of input.

OpenAI — GPT-5.2

batchOutput

Derived at 50% of output.

OpenAI — GPT-5

cachedInput

Derived at 10% of input.

OpenAI — GPT-5

batchInput

Derived at 50% of input.

OpenAI — GPT-5

batchOutput

Derived at 50% of output.

OpenAI — GPT-5.5 Pro

cachedInput

Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.

OpenAI — GPT-5.5 Pro

batchInput

Derived at 50% of input.

OpenAI — GPT-5.5 Pro

batchOutput

Derived at 50% of output.

OpenAI — GPT-5.2 Pro

cachedInput

Derived at 10% of input — pro-tier convention.

OpenAI — GPT-5.2 Pro

batchInput

Derived at 50% of input.

OpenAI — GPT-5.2 Pro

batchOutput

Derived at 50% of output.

OpenAI — GPT-5.1

batchInput

Derived at 50% of input.

OpenAI — GPT-5.1

batchOutput

Derived at 50% of output.

OpenAI — GPT-5 Pro

batchInput

Derived at 50% of input.

OpenAI — GPT-5 Pro

batchOutput

Derived at 50% of output.

OpenAI — GPT-5 Nano

cachedInput

Derived at 10% of input.

OpenAI — GPT-5 Nano

batchInput

Derived at 50% of input.

OpenAI — GPT-5 Nano

batchOutput

Derived at 50% of output.

Google — Gemini 3 Flash

cachedInput

Derived at 10% of input — Google caching discount convention ~90%.

Google — Gemini 3.1 Flash-Lite

cachedInput

Derived at 10% of input — Google caching convention.

Google — Gemini 3.1 Flash-Lite

batchInput

Derived at 50% of input — Google Batch API uniform 50% discount.

Google — Gemini 3.1 Flash-Lite

batchOutput

Derived at 50% of output — Google Batch API uniform 50% discount.

Google — Gemini 2.5 Pro

cachedInput

Derived at 10% of input.

Google — Gemini 2.5 Flash

cachedInput

Derived at 10% of input.

Google — Gemini 2.5 Flash-Lite

cachedInput

Derived at 10% of input — Google caching convention.

Google — Gemini 2.5 Flash-Lite

batchInput

Derived at 50% of input — Google Batch API uniform 50% discount.

Google — Gemini 2.5 Flash-Lite

batchOutput

Derived at 50% of output — Google Batch API uniform 50% discount.

Google — Gemini 2.0 Flash

cachedInput

Derived at 25% of input per Google 2.0 family caching rates.

Google — Gemini 2.0 Flash

batchInput

Derived at 50% of input — Google Batch API uniform 50% discount.

Google — Gemini 2.0 Flash

batchOutput

Derived at 50% of output — Google Batch API uniform 50% discount.

Google — Gemini 2.0 Flash-Lite

cachedInput

Derived at 10% of input — Google caching convention.

Google — Gemini 2.0 Flash-Lite

batchInput

Derived at 50% of input — Google Batch API uniform 50% discount.

Google — Gemini 2.0 Flash-Lite

batchOutput

Derived at 50% of output — Google Batch API uniform 50% discount.

xAI — Grok 4 (legacy)

cachedInput

Extrapolated at 25% of base.

Full RAG stack cost - in one calculator

RAG Pipeline Calculator

Results

Go deeper

Need help using this calculator for your workloads?

Methodology

Primary sources

Inferred values (marked with * in calculator tables)

Immediate steps you can take