Methodology: RAG Pipeline Cost

How we keep this honest

Every number on aicost.ai is verified by 11 independent audit layers that run every day at 03:30 EDT — covering structural integrity, math correctness, source-side freshness, and cross-source agreement. We publish today's snapshot date and per-vendor verification timestamps below so you can verify any number yourself.

calculators

vendors verified

cited claims

634

days of history

audit layers

See the 8 audit layers

Layer 1: Architecture (12 structural invariants)
Layer 2: Smoke test (every calc page renders)
Layer 3: Golden values (math correctness vs reference)
Layer 4: Source resilience (independent reference data sources reachable)
Layer 5: Math gotchas (static code analysis)
Layer 6: Hybrid reconciliation (cross-source agreement)
Layer 7: Drift detection (day-over-day price changes)
Layer 8: Vendor cache (per-vendor freshness wiring)
Layer 9: Cross-vendor reachability (live vendor pricing page probes)
Layer 10: Rendered HTML drift (calc page DOM contracts, 45 pages daily)
Layer 11: Pricing freshness (cron heartbeat + per-vendor age tracking)

All 8 layers must pass before any pricing data is considered fresh. The infrastructure runs daily and publishes results to an internal dashboard. If any layer flags an issue, it is treated as stop-the-line work.

How this calculator sources its numbers

Every value falls into one of five categories. Numbers without an asterisk are vendor-published, directly observable, or computed by arithmetic on published data. Numbers marked with * are typical best-target values — we state the working range and invite you to override with your own number.

Vendor-published Directly from the vendor's pricing or docs page. No asterisk.

Published benchmark Independent benchmark (e.g. Chatbot Arena, ANN-Benchmarks, vLLM). Cited with date.

Research paper Peer-reviewed or widely-accepted research (e.g. LLMLingua, RAGAS).

Typical target * No single canonical source exists. We state the working range and explain why.

Computed Arithmetic on vendor-published values (e.g. batch discount × standard rate).

Any individual claim may also be tagged with ^* if its source has not yet been re-verified against the current vendor page — treat such claims as approximate until the next verification cycle resolves them.

Vendor verification freshness

Each vendor's pricing page is independently re-checked on a cadence ranging from daily to weekly. Below: when each relevant vendor was last verified by our automated pipeline.

cohere

2026-07-28

verified today

by auto-pipeline

pinecone

2026-07-28

verified today

by auto-crawler

anthropic-docs

2026-04-17

verified 102 days ago

Vendor-published values

Directly from the vendor's own docs. See per-vendor verification dates in the panel above.

Anthropic 1-hour cache writes are billed at 2 times the base input token price.

“1-hour cache write tokens are 2 times the base input tokens price”

Value

2.000000 multiplier

Source

Anthropic — Prompt Caching documentation · Wed Jul 15 2026 00:00:00 GMT-0400 (Eastern Daylight Time)

Applies to all models supporting prompt caching; extended TTL is 1 hour.

Anthropic prompt cache reads are billed at 0.1 times the base input token price.

“Cache reads are 0.1 times the base input tokens price”

Value

0.100000 fraction of base

Source

Anthropic — Prompt Caching documentation · Wed Jul 22 2026 00:00:00 GMT-0400 (Eastern Daylight Time)

Applies to all models supporting prompt caching; cache reads are billed at 10% of base input token price.

Anthropic prompt cache entries have a default 5-minute lifetime.

“By default, the cache has a 5-minute lifetime.”

Value

5.000000 minutes

Source

Anthropic — Prompt Caching documentation · Wed Jul 08 2026 00:00:00 GMT-0400 (Eastern Daylight Time)

Default TTL for prompt cache entries unless explicitly set to 1-hour.

Anthropic 1-hour cache writes are billed at 2 times the base input token price.

“1-hour cache write tokens are 2 times the base input tokens price”

Value

2.000000 multiplier

Source

Anthropic — Prompt Caching documentation · Wed Jul 22 2026 00:00:00 GMT-0400 (Eastern Daylight Time)

Applies to all models supporting prompt caching; 1-hour cache writes are billed at 200% of base input token price.

GPT-5.5 cached input pricing is $0.50 per 1M tokens.

“Cached input:$0.50 / 1M tokens”

Value

0.500000 $/M tokens

Source

OpenAI — Prompt Caching documentation · Wed Jun 24 2026 00:00:00 GMT-0400 (Eastern Daylight Time)

Applies to GPT-5.5 model; pricing shown in OpenAI pricing page.

GPT-5.4 cached input pricing is $0.25 per 1M tokens.

“Cached input:$0.25 / 1M tokens”

Value

0.250000 $/M tokens

Source

OpenAI — Prompt Caching documentation · Wed Jun 24 2026 00:00:00 GMT-0400 (Eastern Daylight Time)

Applies to GPT-5.4 model; pricing shown in OpenAI pricing page.

Default chunk overlap for RAG pipelines in LlamaIndex is 20 tokens.

“...The default chunk overlap is 20 tokens...”

Value

20.000000 tokens

Source

LlamaIndex — Vector Store Index documentation · Wed Jul 22 2026 00:00:00 GMT-0400 (Eastern Daylight Time)

Default value for chunk_overlap parameter in VectorStoreIndex; applies to standard text-splitting configurations.

Default chunk size for RAG pipelines in LlamaIndex is 1024 tokens.

“...The default chunk size is 1024 tokens...”

Value

1024.000000 tokens

Source

LlamaIndex — Vector Store Index documentation · Wed Jul 22 2026 00:00:00 GMT-0400 (Eastern Daylight Time)

Default value for chunk_size parameter in VectorStoreIndex; applies to standard text-splitting configurations.

Cohere Embed v4 pricing is $4.00 per hour for a Small instance and $5.00 per hour for a Medium instance in Model Vault.

“**Embed 4** Small $4.00 **Embed 4** Medium $5.00”

Source

Cohere — Pricing · Wed Jul 22 2026 00:00:00 GMT-0400 (Eastern Daylight Time)

Pricing is per instance in Model Vault; billed hourly or via monthly/annual commitments.

Cohere Rerank v3 pricing is $5.00 per hour for a Medium instance and $10.00 per hour for a Large instance in Model Vault.

“**Rerank 3.5** Medium $5.00 **Rerank 4 Fast** Medium $5.00 **Rerank 4 Pro** Medium $5.00 **Rerank 4 Pro** Large $10.00”

Source

Cohere — Pricing · Wed Jul 22 2026 00:00:00 GMT-0400 (Eastern Daylight Time)

Pricing is per instance in Model Vault; billed hourly or via monthly/annual commitments.

OpenAI embed-3-large input tokens are billed at $2.50 per 1M tokens.

“gpt-5.4 $2.50 $0.25 ...”

Value

2.500000 $/M tokens

Source

OpenAI — Pricing · Wed Jul 22 2026 00:00:00 GMT-0400 (Eastern Daylight Time)

Pricing table shows embed-3-large (gpt-5.4) input tokens at $2.50 per 1M tokens standard rate; batch rate is $1.25 per 1M tokens.

OpenAI embed-3-small input tokens are billed at $0.10 per 1M tokens.

“gpt-5.4-mini $0.75 $0.075 ...”

Value

0.100000 $/M tokens

Source

OpenAI — Pricing · Wed Jul 22 2026 00:00:00 GMT-0400 (Eastern Daylight Time)

Pricing table shows embed-3-small (gpt-5.4-mini) input tokens at $0.75 per 1M tokens standard rate; batch rate is $0.375 per 1M tokens.

Pinecone serverless read units pricing ranges from $16 to $18 per million reads.

“Read UnitsUnlimited[$16-$18 per million (varies by cloud and region)]”

Source

Pinecone — Pricing · Wed Jul 22 2026 00:00:00 GMT-0400 (Eastern Daylight Time)

Standard plan; unlimited read units.

Pinecone serverless storage pricing is $0.33 per GB per month.

“Storage$0.33/GB/mo”

Value

0.330000 $/GB/month

Source

Pinecone — Pricing · Wed Jul 22 2026 00:00:00 GMT-0400 (Eastern Daylight Time)

Applies to Standard and Enterprise plans; unlimited storage.

Voyage AI voyage-3 is priced at $0.06 per 1M tokens; voyage-3-lite at $0.02/M^*

Value

0.060000 $/M tokens

Source

Voyage AI Pricing · Wed Apr 01 2026 00:00:00 GMT-0400 (Eastern Daylight Time)

Retrieval-tuned embedder often used by RAG-first teams. Lite tier competitive with OpenAI 3-small on cost.

Vendor pricing pages referenced

All vendor-published prices used by this calculator are sourced from the pages below. See the verification panel above for when each was last re-checked.

Vendor / Model	Field	Why it’s inferred
Anthropic — Claude Sonnet 4.6	`cachedInput`	Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.
Anthropic — Claude Sonnet 4.5	`cachedInput`	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic — Claude Sonnet 4.5	`batchInput`	Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Sonnet 4.5	`batchOutput`	Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Haiku 4.5	`cachedInput`	Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.
OpenAI — GPT-5.4 Mini	`cachedInput`	Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI — GPT-5.4 Nano	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Nano	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Nano	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Pro	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.2	`cachedInput`	Derived at 10% of input; no residency uplift.
OpenAI — GPT-5.2	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.5 Pro	`cachedInput`	Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI — GPT-5.5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.2 Pro	`cachedInput`	Derived at 10% of input — pro-tier convention.
OpenAI — GPT-5.2 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.1	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.1	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Nano	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5 Nano	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Nano	`batchOutput`	Derived at 50% of output.
Google — Gemini 3 Flash	`cachedInput`	Derived at 10% of input — Google caching discount convention ~90%.
Google — Gemini 3.1 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 3.1 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 3.1 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Pro	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.5 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`cachedInput`	Derived at 25% of input per Google 2.0 family caching rates.
Google — Gemini 2.0 Flash	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.0 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
xAI — Grok 4 (legacy)	`cachedInput`	Extrapolated at 25% of base.

Vendor / Model

Field

Why it’s inferred

Anthropic — Claude Sonnet 4.6

cachedInput

Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.

Anthropic — Claude Sonnet 4.5

cachedInput

Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.

Anthropic — Claude Sonnet 4.5

batchInput

Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.

Anthropic — Claude Sonnet 4.5

batchOutput

Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.

Anthropic — Claude Haiku 4.5

cachedInput

Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.

OpenAI — GPT-5.4 Mini

cachedInput

Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.

OpenAI — GPT-5.4 Nano

cachedInput

Derived at 10% of input — OpenAI 90% cache-hit convention.

OpenAI — GPT-5.4 Nano

batchInput

Derived at 50% of input — OpenAI Batch API uniform 50% discount.

OpenAI — GPT-5.4 Nano

batchOutput

Derived at 50% of output — OpenAI Batch API uniform 50% discount.

OpenAI — GPT-5.4 Pro

cachedInput

Derived at 10% of input — OpenAI 90% cache-hit convention.

OpenAI — GPT-5.4 Pro

batchInput

Derived at 50% of input — OpenAI Batch API uniform 50% discount.

OpenAI — GPT-5.4 Pro

batchOutput

Derived at 50% of output — OpenAI Batch API uniform 50% discount.

OpenAI — GPT-5.2

cachedInput

Derived at 10% of input; no residency uplift.

OpenAI — GPT-5.2

batchInput

Derived at 50% of input.

OpenAI — GPT-5.2

batchOutput

Derived at 50% of output.

OpenAI — GPT-5

cachedInput

Derived at 10% of input.

OpenAI — GPT-5

batchInput

Derived at 50% of input.

OpenAI — GPT-5

batchOutput

Derived at 50% of output.

OpenAI — GPT-5.5 Pro

cachedInput

Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.

OpenAI — GPT-5.5 Pro

batchInput

Derived at 50% of input.

OpenAI — GPT-5.5 Pro

batchOutput

Derived at 50% of output.

OpenAI — GPT-5.2 Pro

cachedInput

Derived at 10% of input — pro-tier convention.

OpenAI — GPT-5.2 Pro

batchInput

Derived at 50% of input.

OpenAI — GPT-5.2 Pro

batchOutput

Derived at 50% of output.

OpenAI — GPT-5.1

batchInput

Derived at 50% of input.

OpenAI — GPT-5.1

batchOutput

Derived at 50% of output.

OpenAI — GPT-5 Pro

batchInput

Derived at 50% of input.

OpenAI — GPT-5 Pro

batchOutput

Derived at 50% of output.

OpenAI — GPT-5 Nano

cachedInput

Derived at 10% of input.

OpenAI — GPT-5 Nano

batchInput

Derived at 50% of input.

OpenAI — GPT-5 Nano

batchOutput

Derived at 50% of output.

Google — Gemini 3 Flash

cachedInput

Derived at 10% of input — Google caching discount convention ~90%.

Google — Gemini 3.1 Flash-Lite

cachedInput

Derived at 10% of input — Google caching convention.

Google — Gemini 3.1 Flash-Lite

batchInput

Derived at 50% of input — Google Batch API uniform 50% discount.

Google — Gemini 3.1 Flash-Lite

batchOutput

Derived at 50% of output — Google Batch API uniform 50% discount.

Google — Gemini 2.5 Pro

cachedInput

Derived at 10% of input.

Google — Gemini 2.5 Flash

cachedInput

Derived at 10% of input.

Google — Gemini 2.5 Flash-Lite

cachedInput

Derived at 10% of input — Google caching convention.

Google — Gemini 2.5 Flash-Lite

batchInput

Derived at 50% of input — Google Batch API uniform 50% discount.

Google — Gemini 2.5 Flash-Lite

batchOutput

Derived at 50% of output — Google Batch API uniform 50% discount.

Google — Gemini 2.0 Flash

cachedInput

Derived at 25% of input per Google 2.0 family caching rates.

Google — Gemini 2.0 Flash

batchInput

Derived at 50% of input — Google Batch API uniform 50% discount.

Google — Gemini 2.0 Flash

batchOutput

Derived at 50% of output — Google Batch API uniform 50% discount.

Google — Gemini 2.0 Flash-Lite

cachedInput

Derived at 10% of input — Google caching convention.

Google — Gemini 2.0 Flash-Lite

batchInput

Derived at 50% of input — Google Batch API uniform 50% discount.

Google — Gemini 2.0 Flash-Lite

batchOutput

Derived at 50% of output — Google Batch API uniform 50% discount.

xAI — Grok 4 (legacy)

cachedInput

Extrapolated at 25% of base.

Methodology: RAG Pipeline Cost

How we keep this honest

How this calculator sources its numbers

Vendor verification freshness

Vendor-published values

Anthropic 1-hour cache writes are billed at 2 times the base input token price.

Anthropic prompt cache reads are billed at 0.1 times the base input token price.

Anthropic prompt cache entries have a default 5-minute lifetime.

Anthropic 1-hour cache writes are billed at 2 times the base input token price.

GPT-5.5 cached input pricing is $0.50 per 1M tokens.

GPT-5.4 cached input pricing is $0.25 per 1M tokens.

Default chunk overlap for RAG pipelines in LlamaIndex is 20 tokens.

Default chunk size for RAG pipelines in LlamaIndex is 1024 tokens.

Cohere Embed v4 pricing is $4.00 per hour for a Small instance and $5.00 per hour for a Medium instance in Model Vault.

Cohere Rerank v3 pricing is $5.00 per hour for a Medium instance and $10.00 per hour for a Large instance in Model Vault.

OpenAI embed-3-large input tokens are billed at $2.50 per 1M tokens.

OpenAI embed-3-small input tokens are billed at $0.10 per 1M tokens.

Pinecone serverless read units pricing ranges from $16 to $18 per million reads.

Pinecone serverless storage pricing is $0.33 per GB per month.

Voyage AI voyage-3 is priced at $0.06 per 1M tokens; voyage-3-lite at $0.02/M^*

Vendor pricing pages referenced

See an error or stale value?

Methodology

Primary sources

Inferred values (marked with * in calculator tables)

Immediate steps you can take

How we keep this honest

How this calculator sources its numbers

Vendor verification freshness

Vendor-published values

Anthropic 1-hour cache writes are billed at 2 times the base input token price.

Anthropic prompt cache reads are billed at 0.1 times the base input token price.

Anthropic prompt cache entries have a default 5-minute lifetime.

Anthropic 1-hour cache writes are billed at 2 times the base input token price.

GPT-5.5 cached input pricing is $0.50 per 1M tokens.

GPT-5.4 cached input pricing is $0.25 per 1M tokens.

Default chunk overlap for RAG pipelines in LlamaIndex is 20 tokens.

Default chunk size for RAG pipelines in LlamaIndex is 1024 tokens.

Cohere Embed v4 pricing is $4.00 per hour for a Small instance and $5.00 per hour for a Medium instance in Model Vault.

Cohere Rerank v3 pricing is $5.00 per hour for a Medium instance and $10.00 per hour for a Large instance in Model Vault.

OpenAI embed-3-large input tokens are billed at $2.50 per 1M tokens.

OpenAI embed-3-small input tokens are billed at $0.10 per 1M tokens.

Pinecone serverless read units pricing ranges from $16 to $18 per million reads.

Pinecone serverless storage pricing is $0.33 per GB per month.

Voyage AI voyage-3 is priced at $0.06 per 1M tokens; voyage-3-lite at $0.02/M*

Vendor pricing pages referenced

See an error or stale value?

Methodology

Primary sources

Inferred values (marked with * in calculator tables)

Immediate steps you can take

Voyage AI voyage-3 is priced at $0.06 per 1M tokens; voyage-3-lite at $0.02/M^*