Methodology: RAG Pipeline Cost

End-to-end RAG stack: embedding + vector DB + retrieval + generation.

Citations last refreshed: 2026-06-03 · Pricing snapshot: 2026-06-05

How we keep this honest

Every number on aicost.ai is verified by 11 independent audit layers that run every day at 03:30 EDT — covering structural integrity, math correctness, source-side freshness, and cross-source agreement. We publish today's snapshot date and per-vendor verification timestamps below so you can verify any number yourself.

50 calculators · 31 vendors verified · 58 cited claims · 581 days of history · 11 audit layers
The 11 audit layers:
  • Layer 1: Architecture (12 structural invariants)
  • Layer 2: Smoke test (every calc page renders)
  • Layer 3: Golden values (math correctness vs reference)
  • Layer 4: Source resilience (independent reference data sources reachable)
  • Layer 5: Math gotchas (static code analysis)
  • Layer 6: Hybrid reconciliation (cross-source agreement)
  • Layer 7: Drift detection (day-over-day price changes)
  • Layer 8: Vendor cache (per-vendor freshness wiring)
  • Layer 9: Cross-vendor reachability (live vendor pricing page probes)
  • Layer 10: Rendered HTML drift (calc page DOM contracts, 45 pages daily)
  • Layer 11: Pricing freshness (cron heartbeat + per-vendor age tracking)

All 11 layers must pass before any pricing data is considered fresh. The infrastructure runs daily and publishes results to an internal dashboard. If any layer flags an issue, it is treated as stop-the-line work.
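The all-layers-must-pass rule can be sketched as a simple gate. This is a minimal illustration, not the site's actual pipeline code: `LayerResult`, `pricing_is_fresh`, and `failing_layers` are hypothetical names, and the count of 11 is taken from the layer list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerResult:
    name: str     # e.g. "Golden values"
    passed: bool  # outcome of that layer's daily run


def pricing_is_fresh(results: list[LayerResult], expected_layers: int = 11) -> bool:
    """Fresh only if every expected layer reported a result and all of them passed."""
    return len(results) == expected_layers and all(r.passed for r in results)


def failing_layers(results: list[LayerResult]) -> list[str]:
    """Names of the layers that need stop-the-line attention."""
    return [r.name for r in results if not r.passed]
```

Requiring the result count to match guards against a layer that silently did not run, which would otherwise look like a pass.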

How this calculator sources its numbers

Every value falls into one of five categories. Numbers without an asterisk are vendor-published, directly observable, or computed by arithmetic on published data. Numbers marked with * are typical best-target values — we state the working range and invite you to override with your own number.

Vendor-published: Directly from the vendor's pricing or docs page. No asterisk.
Published benchmark: Independent benchmark (e.g. Chatbot Arena, ANN-Benchmarks, vLLM). Cited with date.
Research paper: Peer-reviewed or widely-accepted research (e.g. LLMLingua, RAGAS).
Typical target (*): No single canonical source exists. We state the working range and explain why.
Computed: Arithmetic on vendor-published values (e.g. batch discount × standard rate).

Any individual claim may also be tagged with * if its source has not yet been re-verified against the current vendor page — treat such claims as approximate until the next verification cycle resolves them.

Vendor verification freshness

Each vendor's pricing page is independently re-checked on a cadence ranging from daily to weekly. Below: when each relevant vendor was last verified by our automated pipeline.

cohere | 2026-06-05 | verified today | by auto-pipeline
openai | 2026-06-05 | verified today | by auto-pipeline
pinecone | 2026-06-05 | verified today | by auto-crawler
azure-openai-pricing | 2026-04-25 | verified 41 days ago
anthropic-docs | 2026-04-17 | verified 49 days ago
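The "verified N days ago" labels follow from date arithmetic against the pricing snapshot date. A small sketch of that calculation; the function names are illustrative, and the 7-day staleness threshold is an assumption based on the "daily to weekly" cadence stated above.

```python
from datetime import date


def days_since_verified(verified_on: date, snapshot: date) -> int:
    """Whole days between a vendor's last verification and the snapshot date."""
    return (snapshot - verified_on).days


def freshness_label(verified_on: date, snapshot: date) -> str:
    age = days_since_verified(verified_on, snapshot)
    if age == 0:
        return "verified today"
    return f"verified {age} day{'s' if age != 1 else ''} ago"


def is_stale(verified_on: date, snapshot: date, max_age_days: int = 7) -> bool:
    """Assumed rule: anything older than the weekly cadence counts as stale."""
    return days_since_verified(verified_on, snapshot) > max_age_days
```

With the snapshot date 2026-06-05, `freshness_label(date(2026, 4, 25), date(2026, 6, 5))` gives "verified 41 days ago", matching the azure-openai-pricing entry.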

Vendor-published values

Directly from the vendor's own docs. See per-vendor verification dates in the panel above.

Anthropic 1-hour cache writes are billed at 2 times the base input token price.

“1-hour cache write tokens are 2 times the base input tokens price”
Value: 2.0 multiplier
Source: Anthropic, Prompt Caching documentation · 2026-06-03

Applies to all supported models; 1-hour cache writes are billed at 200% of base input token price.

Anthropic prompt cache reads are billed at 0.1 times the base input token price.

“Cache reads are 0.1 times the base input tokens price”
Value: 0.1 fraction of base
Source: Anthropic, Prompt Caching documentation · 2026-06-03

Applies to all supported models; cache reads are billed at 10% of base input token price.

Anthropic prompt cache has a default 5-minute TTL.

“By default, the cache has a 5-minute lifetime.”
Value: 5 minutes
Source: Anthropic, Prompt Caching documentation · 2026-06-03

Default TTL applies to all supported models unless explicitly set to 1-hour.

Anthropic 5-minute cache writes are billed at 1.25 times the base input token price.

“5-minute cache write tokens are 1.25 times the base input tokens price”
Value: 1.25 multiplier
Source: Anthropic, Prompt Caching documentation · 2026-06-03

Applies to all supported models; 5-minute cache writes are billed at 125% of base input token price.
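To see how the cache multipliers combine, the sketch below prices a shared prompt prefix that is written once and then read on later requests. The multipliers are the ones quoted above; the base price in the usage note is a made-up illustration, and the sketch assumes every read lands inside the cache TTL.

```python
CACHE_WRITE_5M = 1.25  # 5-minute cache write, multiple of base input price
CACHE_WRITE_1H = 2.0   # 1-hour cache write, multiple of base input price
CACHE_READ = 0.1       # cache read, fraction of base input price


def cached_prefix_cost(prefix_tokens: int, base_usd_per_m: float, reads: int,
                       write_multiplier: float = CACHE_WRITE_5M) -> float:
    """USD cost of one cache write plus `reads` cache hits on the same prefix."""
    per_token = base_usd_per_m / 1_000_000
    write = prefix_tokens * per_token * write_multiplier
    hits = prefix_tokens * per_token * CACHE_READ * reads
    return write + hits


def uncached_prefix_cost(prefix_tokens: int, base_usd_per_m: float, requests: int) -> float:
    """USD cost of sending the same prefix uncached on every request."""
    return prefix_tokens * base_usd_per_m / 1_000_000 * requests
```

For a 10,000-token prefix at a hypothetical $3.00 per 1M base rate, one 5-minute write plus nine reads costs $0.0645, against $0.30 for ten uncached requests. Under these multipliers a 5-minute write pays off after a single read (1.25 + 0.1 = 1.35 versus 2.0 uncached), while a 1-hour write needs two reads (2.0 + 0.2 = 2.2 versus 3.0).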

GPT-5.5 cached input pricing is $0.50 per 1M tokens.

“Cached input: $0.50 / 1M tokens”
Value: $0.50 per 1M tokens
Source: OpenAI, Prompt Caching · 2026-06-03

Flagship model pricing; applies to GPT-5.5.

GPT-5.4 cached input pricing is $0.25 per 1M tokens.

“Cached input: $0.25 / 1M tokens”
Value: $0.25 per 1M tokens
Source: OpenAI, Prompt Caching · 2026-06-03

Previous-generation model pricing; applies to GPT-5.4.

LlamaIndex defaults to a chunk overlap of 20 tokens for RAG pipelines.

“By default, LlamaIndex uses a chunk overlap of 20 tokens.”
Value: 20 tokens
Source: LlamaIndex, Production RAG Defaults · 2026-05-27

Applies to default RAG pipeline configurations in LlamaIndex framework.

LlamaIndex defaults to a chunk size of 1024 tokens for RAG pipelines.

“By default, LlamaIndex uses a chunk size of 1024 tokens.”
Value: 1024 tokens
Source: LlamaIndex, Production RAG Defaults · 2026-05-27

Applies to default RAG pipeline configurations in LlamaIndex framework.
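Chunk size and overlap together determine how many chunks a document produces and how many tokens get embedded. The sketch below uses a plain fixed-stride model with the defaults quoted above; it is an approximation, since a real splitter that respects sentence boundaries will produce slightly different counts. The function names are illustrative.

```python
import math


def chunk_count(doc_tokens: int, chunk_size: int = 1024, overlap: int = 20) -> int:
    """Number of chunks when each chunk starts (chunk_size - overlap) tokens after the last."""
    if doc_tokens <= 0:
        return 0
    if doc_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    return 1 + math.ceil((doc_tokens - chunk_size) / stride)


def embedded_tokens(doc_tokens: int, chunk_size: int = 1024, overlap: int = 20) -> int:
    """Total tokens sent to the embedder: the document plus one overlap per chunk boundary."""
    n = chunk_count(doc_tokens, chunk_size, overlap)
    return doc_tokens + overlap * max(n - 1, 0)
```

A 10,000-token document yields 10 chunks and 10,180 embedded tokens under this model, so a 20-token overlap adds under 2% to embedding volume.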

Cohere Embed v4 pricing is $5.00 per instance per hour for Medium performance tier.

“Embed 4 Medium $5.00 $3,250”
Value: $5.00 per instance per hour
Source: Cohere, Pricing · 2026-06-03

Pricing applies to Embed v4 Medium tier; billed per instance (hourly or monthly).
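Because the tier can be billed hourly or monthly, it helps to know where the two options cross over. The sketch below assumes the two figures in the quoted row are the hourly price ($5.00) and the monthly price ($3,250) for the same instance; that reading is an assumption, and the function names are illustrative.

```python
def break_even_hours(hourly_usd: float, monthly_usd: float) -> float:
    """Hours per month at which hourly billing costs the same as the monthly price."""
    return monthly_usd / hourly_usd


def cheaper_plan(hours_needed: float, hourly_usd: float, monthly_usd: float) -> str:
    """Pick the cheaper billing mode for an expected number of instance-hours per month."""
    return "monthly" if hours_needed * hourly_usd > monthly_usd else "hourly"
```

Under that reading, Embed 4 Medium breaks even at 650 instance-hours, so an instance that runs around the clock (about 720 hours in a 30-day month) is cheaper on monthly billing, while a part-time workload is cheaper hourly.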

Cohere Rerank v4 Pro pricing is $10.00 per instance per hour for the Large performance tier.

“Rerank 4 Pro Large $10.00 $6,500”
Value: $10.00 per instance per hour
Source: Cohere, Pricing · 2026-06-03

Pricing applies to Rerank 4 Pro Large tier; billed per instance (hourly or monthly).

OpenAI embeddings (text-embedding-3-large) are billed at $0.13 per 1M tokens.

“text-embedding-3-large: $0.13 / 1M tokens”
Value: $0.13 per 1M tokens
Source: OpenAI, API Pricing · 2026-06-03

OpenAI embeddings (text-embedding-3-small) are billed at $0.02 per 1M tokens.

“text-embedding-3-small: $0.02 / 1M tokens”
Value: $0.02 per 1M tokens
Source: OpenAI, API Pricing · 2026-06-03
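Embedding cost is a single multiplication on these rates, since embeddings bill input tokens only. A minimal sketch using the two OpenAI prices quoted above; the dictionary and function names are illustrative, and the corpus size in the usage note is made up.

```python
# USD per 1M input tokens, from the vendor-published values above.
EMBED_USD_PER_M = {
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
}


def embedding_cost_usd(tokens: int, model: str) -> float:
    """One-time cost of embedding `tokens` input tokens with the given model."""
    return tokens / 1_000_000 * EMBED_USD_PER_M[model]
```

For a hypothetical 500M-token corpus, this gives $65.00 with text-embedding-3-large and $10.00 with text-embedding-3-small. Re-embedding after a chunking or model change incurs the same cost again.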

Pinecone serverless read units pricing ranges from $16 to $18 per million reads (varies by cloud and region).

“Unlimited [$16-$18 per million (varies by cloud and region)]”
Value: $16 to $18 per 1M read units
Source: Pinecone, Pricing · 2026-06-03

Applies to Standard plan for serverless read units.

Pinecone serverless storage pricing is $0.33 per GB per month.

“Unlimited $0.33/GB/mo”
Value: $0.33 per GB per month
Source: Pinecone, Pricing · 2026-06-03

Applies to Standard and Enterprise plans for serverless storage.
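The two Pinecone rates combine into a monthly estimate once you know index size and read volume. The sketch below is a rough model under stated assumptions: storage is approximated as raw float32 vectors only (no metadata or index overhead), 1 GB is taken as 10^9 bytes, and the number of read units per query is left to the caller because it depends on the workload. All names are illustrative.

```python
STORAGE_USD_PER_GB_MONTH = 0.33
READ_USD_PER_M_LOW = 16.0   # low end of the published range
READ_USD_PER_M_HIGH = 18.0  # high end of the published range


def raw_vector_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Approximate storage for raw vectors only, in GB (10^9 bytes)."""
    return num_vectors * dims * bytes_per_value / 1e9


def monthly_cost_range(storage_gb: float, read_units: int) -> tuple[float, float]:
    """(low, high) monthly USD estimate across the published read-unit price range."""
    storage = storage_gb * STORAGE_USD_PER_GB_MONTH
    millions = read_units / 1_000_000
    return (storage + millions * READ_USD_PER_M_LOW,
            storage + millions * READ_USD_PER_M_HIGH)
```

For example, 1M vectors at 1,536 dimensions is about 6.14 GB, or roughly $2.03 per month in storage; 10M read units adds $160 to $180, so reads dominate the bill at this scale.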

Voyage AI voyage-3 is priced at $0.06 per 1M tokens; voyage-3-lite at $0.02 per 1M tokens.*

Value: $0.06 per 1M tokens
Source: Voyage AI Pricing · 2026-04-01

Retrieval-tuned embedder often used by RAG-first teams. The lite tier matches OpenAI text-embedding-3-small on price ($0.02 per 1M tokens).

Vendor pricing pages referenced

All vendor-published prices used by this calculator are sourced from the pages below. See the verification panel above for when each was last re-checked.

See an error or stale value?

We treat methodology as a living document. If a price is wrong, a benchmark is outdated, or you have a better citation, let us know and we will verify within 48 hours.

Email [email protected]
Data sources & methodology
161 text models · 9 embeddings · 24 vision · 41 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-05

Methodology

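  • Unless otherwise noted, prices are USD per 1 million tokens, current as of 2026-06-05. Vector DB storage, read units, and per-instance tiers use the units shown with each value.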
  • Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
  • Batch API discounts are 50% off standard rates across providers that offer Batch mode.
  • Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
  • Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
  • Long-context pricing tiers apply when input exceeds the model's threshold.
  • Embedding prices are input-only (no output tokens generated).

Primary sources

Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.

Anthropic | 2026-06-05 | https://www.anthropic.com/pricing | Daily snapshot since Sep 2023 · 578 days captured
Anthropic Docs | 2026-06-05 | https://platform.claude.com/docs/en/about-claude/pricing | Daily snapshot since Sep 2023 · 578 days captured
OpenAI | 2026-06-05 | https://openai.com/api/pricing/ | Daily snapshot since Sep 2023 · 579 days captured
Google AI | 2026-06-05 | https://ai.google.dev/gemini-api/docs/pricing | Daily snapshot since Dec 2023 · 554 days captured
Google Vertex | 2026-06-05 | https://cloud.google.com/vertex-ai/generative-ai/pricing | Daily snapshot since Dec 2023 · 554 days captured
DeepSeek | 2026-06-05 | https://api-docs.deepseek.com/quick_start/pricing | Daily snapshot since May 2024 · 493 days captured
xAI | 2026-06-05 | https://x.ai/api | Daily snapshot since Nov 2024 · 411 days captured
Mistral | 2026-06-05 | https://mistral.ai/pricing | Daily snapshot since Dec 2023 · 552 days captured
Cohere | 2026-06-05 | https://cohere.com/pricing | Daily snapshot since Sep 2023 · 578 days captured

Inferred values (marked with * in calculator tables)

Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

Vendor / Model | Field | Why it's inferred
Anthropic / Claude Sonnet 4.6 | cachedInput | Derived at 10% of input rate; Anthropic publishes a 90% cache-hit discount on this tier.
Anthropic / Claude Sonnet 4.5 | cachedInput | Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic / Claude Sonnet 4.5 | batchInput | Derived at 50% of standard input; Anthropic documents a uniform 50% Batch discount.
Anthropic / Claude Sonnet 4.5 | batchOutput | Derived at 50% of standard output; Anthropic documents a uniform 50% Batch discount.
Anthropic / Claude Haiku 4.5 | cachedInput | Derived at 10% of input rate; Anthropic 90% cache-hit discount convention.
OpenAI / GPT-5.4 Mini | cachedInput | Derived at 10% of input; OpenAI documents an automatic 90% discount on cache hits across the GPT-5.x tier.
OpenAI / GPT-5.4 Nano | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention.
OpenAI / GPT-5.4 Nano | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Nano | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Pro | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention.
OpenAI / GPT-5.4 Pro | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Pro | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.2 | cachedInput | Derived at 10% of input; no residency uplift.
OpenAI / GPT-5.2 | batchInput | Derived at 50% of input.
OpenAI / GPT-5.2 | batchOutput | Derived at 50% of output.
OpenAI / GPT-5 | cachedInput | Derived at 10% of input.
OpenAI / GPT-5 | batchInput | Derived at 50% of input.
OpenAI / GPT-5 | batchOutput | Derived at 50% of output.
OpenAI / GPT-5.5 Pro | cachedInput | Derived at 10% of input; OpenAI does not publish a cached rate for *-pro models, so the family convention is used.
OpenAI / GPT-5.5 Pro | batchInput | Derived at 50% of input.
OpenAI / GPT-5.5 Pro | batchOutput | Derived at 50% of output.
OpenAI / GPT-5.2 Pro | cachedInput | Derived at 10% of input; pro-tier convention.
OpenAI / GPT-5.2 Pro | batchInput | Derived at 50% of input.
OpenAI / GPT-5.2 Pro | batchOutput | Derived at 50% of output.
OpenAI / GPT-5.1 | batchInput | Derived at 50% of input.
OpenAI / GPT-5.1 | batchOutput | Derived at 50% of output.
OpenAI / GPT-5 Pro | batchInput | Derived at 50% of input.
OpenAI / GPT-5 Pro | batchOutput | Derived at 50% of output.
OpenAI / GPT-5 Nano | cachedInput | Derived at 10% of input.
OpenAI / GPT-5 Nano | batchInput | Derived at 50% of input.
OpenAI / GPT-5 Nano | batchOutput | Derived at 50% of output.
Google / Gemini 3 Flash | cachedInput | Derived at 10% of input; Google caching discount convention of about 90%.
Google / Gemini 3.1 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
Google / Gemini 3.1 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 3.1 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.5 Pro | cachedInput | Derived at 10% of input.
Google / Gemini 2.5 Flash | cachedInput | Derived at 10% of input.
Google / Gemini 2.5 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
Google / Gemini 2.5 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.5 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash | cachedInput | Derived at 25% of input per Google 2.0 family caching rates.
Google / Gemini 2.0 Flash | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
Google / Gemini 2.0 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
xAI / Grok 4 (legacy) | cachedInput | Extrapolated at 25% of base.
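The derivations in this table follow one pattern: use the vendor's published rate when it exists, otherwise derive the field from the base rate and mark it with an asterisk. A sketch of that pattern, assuming the 10% cache and 50% batch conventions stated above; `Rate` and `fill_rates` are hypothetical names, not the calculator's actual code, and the prices in the usage note are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    usd_per_m: float
    inferred: bool  # True when derived by convention rather than vendor-published

    def label(self) -> str:
        """Format for a calculator table; inferred values carry an asterisk."""
        return f"${self.usd_per_m:.2f}" + ("*" if self.inferred else "")


def fill_rates(input_per_m: float, output_per_m: float,
               published: dict[str, float], cache_fraction: float = 0.10) -> dict[str, Rate]:
    """Prefer published values; otherwise derive from base rates and flag as inferred."""
    derived = {
        "cachedInput": input_per_m * cache_fraction,  # 0.25 for rows that use 25% of base
        "batchInput": input_per_m * 0.50,
        "batchOutput": output_per_m * 0.50,
    }
    return {
        field: Rate(published[field], False) if field in published else Rate(value, True)
        for field, value in derived.items()
    }
```

For a hypothetical model at $2.50 input and $10.00 output with only a published cached rate of $0.25, this yields "$0.25" for cachedInput and the inferred "$1.25*" and "$5.00*" for the two batch fields.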

Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old and new values for a full audit trail.