Batch vs Realtime · for async workloads

Every major provider offers 50% off via batch API. Can you use it?

OpenAI, Anthropic, Google, Mistral, and Bedrock all offer ~50% off for async requests with a 24-hour SLA. The discount is flat - the real question is what fraction of your workload can tolerate the latency.

Pricing verified: 2026-07-22 · 42/154 models support batch · 50% discount · 24h SLA

What this calculator does

Every major provider offers 50% off for batch API requests with a 24-hour SLA. See how much you would save — based on what fraction of your workload can wait.

Why use it

The discount is uniform (50%) across OpenAI, Anthropic, Google, Mistral, Bedrock — no vendor shopping needed
Your only real decision is: what % of my workload is actually async-tolerant?
Batch API has much higher rate limits than realtime — often the only way to handle million-row backfills
Savings stack with routing and caching: the three levers compose multiplicatively, and together they can reach 80%+ total

New to this calculator? Start with the ⚡ Playground — a few sliders, instant ballpark. Then switch to the 🧮 Calculator for your exact number.

Two ways to use this: visualize in the Playground, then get your number in the Calculator.

▶ Playground: A quick, visual way to see which factors move your savings the most. Open the playground →
▤ Calculator: Enter your real workload for a precise savings figure you can apply to your own usage. Go to the calculator →

Batch vs Realtime — how much does batching save?

Realtime is your full bill. Batching the share of work that can tolerate delay earns a discount on those calls.

Playground inputs (slider defaults):

Current spend / mo: 8,000
Batch-eligible share: 35%
Batch discount: 50%

Playground outputs: Cheaper option, All realtime (/ month), With batch (/ month). Change the input sliders to compare.

💡Almost always a pure cost win — the only question is how much of your workload can wait.
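A minimal sketch of the arithmetic behind the playground, assuming the sliders combine the obvious way: only the batch-eligible share earns the discount, so the blended bill is spend × (1 − share × discount). The function name is illustrative, and the share is treated here as a share of spend, which is an assumption rather than the calculator's documented behavior.

```python
def blended_monthly_cost(spend: float, batch_share: float, discount: float) -> float:
    """Monthly cost after moving `batch_share` of spend to batch at `discount`.

    Both `batch_share` and `discount` are fractions between 0 and 1.
    """
    if not (0.0 <= batch_share <= 1.0 and 0.0 <= discount <= 1.0):
        raise ValueError("batch_share and discount must be fractions between 0 and 1")
    return spend * (1.0 - batch_share * discount)


# Playground defaults: 8,000 per month, 35% batch-eligible, 50% discount.
all_realtime = 8_000.0
with_batch = blended_monthly_cost(all_realtime, batch_share=0.35, discount=0.50)

print(f"All realtime: {all_realtime:,.0f} / month")
print(f"With batch:   {with_batch:,.0f} / month")
print(f"Savings:      {all_realtime - with_batch:,.0f} / month")
```

With the default slider values this works out to about 6,600 per month with batch, a saving of about 1,400.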

Batch API gives you 50% off if you can tolerate some delay. Enter your workload below to see exactly how much you'd save — and whether the latency trade-off is worth it.

📊 Not sure of a value? Fields with a ▾ Typical pill offer broad industry ballparks (sourced typical ranges) so you can move forward now — your result gets more accurate as you replace them with your own measured numbers. Values marked * are rough estimates.

🎮 Interactive Guide & Calculator Playground →

📥 Inputs you provide

Model: Pick the model you run
Monthly requests: Total monthly call volume
Input tokens / request: Average input size
Output tokens / request: Average output size
Batch-tolerant portion: Share of work that can wait

📤 Outputs you get

All realtime: Cost with everything realtime
Your mix: Cost with the batch share moved over
Monthly savings: Dollars saved per month
Headroom: Savings if everything eligible moved
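The mapping from the five inputs to the four outputs can be sketched as follows. This is an illustration under stated assumptions, not the calculator's actual code: the per-million-token rates are placeholders rather than real prices, the names `Rates` and `workload_costs` are hypothetical, and "headroom" is read here as the savings if the whole workload ran on batch.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Dollars per million tokens. Placeholder values only, not real pricing."""
    input_per_m: float
    output_per_m: float


def workload_costs(rates: Rates, requests: int, in_tokens: int, out_tokens: int,
                   batch_share: float, discount: float = 0.50) -> dict:
    """Return the four calculator outputs for one model and one workload."""
    per_request = (in_tokens * rates.input_per_m
                   + out_tokens * rates.output_per_m) / 1_000_000
    all_realtime = requests * per_request
    your_mix = all_realtime * (1 - batch_share * discount)
    all_batch = all_realtime * (1 - discount)
    return {
        "all_realtime": all_realtime,
        "your_mix": your_mix,
        "monthly_savings": all_realtime - your_mix,
        # Assumption: headroom = savings with 100% of the workload on batch.
        "headroom": all_realtime - all_batch,
    }


example = workload_costs(Rates(input_per_m=2.0, output_per_m=8.0),
                         requests=1_000_000, in_tokens=1_000, out_tokens=250,
                         batch_share=0.70)
for name, dollars in example.items():
    print(f"{name:16s} {dollars:10,.2f}")
```

Note that request volume and token counts only scale the dollar figures; the percentage saved is `batch_share * discount` regardless of volume.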

🎯 Use your results to

⏱️

Decide what can wait

Classify each call type as user-blocking or not; the non-blocking share is your batch-eligible bucket

💰

Quantify the cut

Real monthly and annual dollars from a 50% discount on the eligible portion

🆚

Compare across models

Same workload across every model — see if switching provider alongside batch saves more

🔌

Integrate with your agents

MCP available so agentic workflows can pull batch economics programmatically

Enter key values

Pick your model (most support batch; o1 does not). Set requests/month. Then move the hero slider: batch-tolerant %.

Pick a workload preset

Chatbot (5%), support (40%), content gen (70%), doc processing (85%), data enrichment (95%), log moderation (90%). Use the latency tier picker below to self-classify.

Read the savings card

Left = all-realtime cost. Middle = your mix. Right = savings. Green/red split bar shows exactly where your dollars go.

Act on it

If you're already at 80%+ batch, audit the remaining 20% — some of it might be moveable. Then stack with routing + caching.

Enter every field

The model dropdown flags which models support batch (one currently does not: the o1 reasoning series). Monthly requests and input/output tokens scale absolute savings, while the % savings depends only on your batch-tolerant mix. Note that batch API has separate rate limits from realtime, often 10x higher, which matters for large backfills.

Calibrate your batch %

The hero slider is the single most important control. Use presets if you're unsure. The latency tier picker below the slider helps you self-classify: if your workload tolerates 24h latency (🟢), it's batch-friendly; if it needs <1s (🔴), keep it realtime. Mixed workloads are the norm — most teams have a long tail of async work (analytics, moderation, enrichment) even if their main product is realtime.
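One way to turn the self-classification into a number for the slider is sketched below. The call types, volumes, and tolerances are made-up examples, and the rule "batch-tolerant means it can wait the full 24 hours" is taken from the SLA stated above; the share is computed by request count, matching the slider's "% of requests" wording.

```python
SLA_SECONDS = 24 * 60 * 60  # the 24h batch SLA

# Hypothetical audit: (call type, monthly requests, tolerated latency in seconds)
call_types = [
    ("chat_reply",      300_000, 1),
    ("log_moderation",  500_000, 24 * 3600),
    ("data_enrichment", 200_000, 7 * 24 * 3600),
]


def batch_tolerant_share(calls, sla_seconds: int = SLA_SECONDS) -> float:
    """Fraction of monthly requests whose latency tolerance covers the SLA."""
    total = sum(requests for _, requests, _ in calls)
    if total == 0:
        return 0.0
    eligible = sum(requests for _, requests, tolerance in calls
                   if tolerance >= sla_seconds)
    return eligible / total


share = batch_tolerant_share(call_types)
print(f"Batch-tolerant portion: {share:.0%}")
```

Anything that tolerates less than the full SLA stays realtime in this sketch, which is the conservative choice given that the 24h figure is the worst case.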

Interpret all panels

Top 3 cards show all-realtime vs your mix vs savings. Split bar shows dollar distribution between batch and realtime portions (not request distribution). Cross-model comparison table runs your workload across all models — reveals if switching provider alongside enabling batch would save more. No-batch models (e.g., o1) show greyed-out rows.
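The cross-model table can be approximated with a short loop. The model names and rates below are placeholders invented for illustration (they are not real prices), and the handling of no-batch models, where the mix simply equals the all-realtime cost, is an assumption about what the greyed-out rows represent.

```python
# Hypothetical catalogue: name -> (input $/M tokens, output $/M tokens, supports batch)
MODELS = {
    "model-a": (3.00, 15.00, True),
    "model-b": (0.50, 2.00, True),
    "model-c": (15.00, 60.00, False),  # stands in for a no-batch model
}


def compare_models(models, requests, in_tokens, out_tokens,
                   batch_share, discount=0.50):
    """Rows of (model, all realtime, your mix, all batch, savings now), cheapest mix first."""
    rows = []
    for name, (in_rate, out_rate, has_batch) in models.items():
        realtime = requests * (in_tokens * in_rate + out_tokens * out_rate) / 1_000_000
        effective = discount if has_batch else 0.0
        mix = realtime * (1 - batch_share * effective)
        all_batch = realtime * (1 - discount) if has_batch else None
        rows.append((name, realtime, mix, all_batch, realtime - mix))
    return sorted(rows, key=lambda row: row[2])


for name, realtime, mix, all_batch, savings in compare_models(
        MODELS, requests=1_000_000, in_tokens=1_000, out_tokens=500, batch_share=0.70):
    batch_cell = f"{all_batch:,.0f}" if all_batch is not None else "n/a"
    print(f"{name:8s} {realtime:>10,.0f} {mix:>10,.0f} {batch_cell:>10s} {savings:>10,.0f}")
```

Sorting by the "your mix" column makes it easy to spot cases where a cheaper model plus batch beats batch alone on the current model.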

Next actions

Typical batch completion is 1-4 hours even though the SLA is 24h — design for worst case but expect fast turnaround. Stack with routing (40-70% off) and caching (20-40% off) — all three compose multiplicatively. At 1M+ requests/mo, batch API's higher rate limits often matter more than the discount itself.
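To see what "compose multiplicatively" means in practice, here is a small sketch. It assumes each lever applies independently to whatever bill remains after the others, and that the batch discount only touches the batch-tolerant share; both are simplifications, and the function name is illustrative.

```python
def stacked_savings(*discounts: float) -> float:
    """Total fractional savings when each discount applies to the remaining bill."""
    remaining = 1.0
    for d in discounts:
        if not 0.0 <= d <= 1.0:
            raise ValueError("each discount must be a fraction between 0 and 1")
        remaining *= 1.0 - d
    return 1.0 - remaining


batch_share = 0.70
batch = batch_share * 0.50  # 50% off, but only on the batch-tolerant share

low = stacked_savings(batch, 0.40, 0.20)   # low ends of the routing and caching ranges
high = stacked_savings(batch, 0.70, 0.40)  # high ends of the same ranges

print(f"Low end:  {low:.1%}")
print(f"High end: {high:.1%}")
```

Under these assumptions a 70% batch share lands at roughly 69% to 88% total savings. Simply adding the percentages would overstate the result, and at the high end would exceed 100%.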

👇 Now try the calculator below with your own AI workloads

🎛 CALCULATOR

📦 Your workload

Pick a preset or estimate manually.

Model Hint: Pick the model you'll run. The note below shows its rates.

Monthly requests Hint: Total runs per month across everything.

Input tokens / request Hint: Words you send in per run. Doc-heavy = a few thousand.

Output tokens / request Hint: Words it sends back per run.

Batch-tolerant portion of workload 70%

% of requests that can wait up to 24 hours for a response. Use the preset above if you're unsure.

Does your latency budget allow batch?

Hint: Share of work that can wait (overnight jobs, reports). That part runs about half price.

📈 RESULTS

Cards: All realtime · Your mix · Monthly savings

Split bar: Batch portion (50% off) · Realtime portion (full price)

💡 Recommendations

📋 Same workload across all models

50% batch discount is uniform across providers - absolute savings scale with model price.

Model	All realtime	Your mix	All batch	Savings now

Stack prompt cache savings →
Stack multi-model routing →
Single-model baseline →
Get an AI cost architecture review →

📋 What now?

Audit your call types — list every AI call and mark each user-blocking or not. The "not" bucket is your batch-eligible share; migrate the easy wins (reports, embeddings, moderation, summaries) first.
Design for the 24h SLA, expect faster — batch typically completes in 1-4 hours, but build for worst-case so nothing user-facing depends on it.
Then stack the other levers — routing (40-70% off) and caching (20-40% off) compose multiplicatively with batch, not additively.
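Designing for the worst case can be reduced to a simple deadline check: a job is safe to batch only if it can be submitted at least 24 hours before its result is needed. The sketch below assumes that rule; the one-hour safety margin and the function names are illustrative choices, not provider guidance.

```python
from datetime import datetime, timedelta, timezone

BATCH_SLA = timedelta(hours=24)      # worst-case turnaround stated above
SAFETY_MARGIN = timedelta(hours=1)   # assumption: buffer for retries and post-processing


def latest_safe_submission(needed_by: datetime,
                           sla: timedelta = BATCH_SLA,
                           margin: timedelta = SAFETY_MARGIN) -> datetime:
    """Latest time a job can be submitted to batch and still meet `needed_by`."""
    return needed_by - sla - margin


def can_use_batch(now: datetime, needed_by: datetime) -> bool:
    """True if submitting now still leaves the full SLA plus margin."""
    return now <= latest_safe_submission(needed_by)


report_due = datetime(2026, 1, 2, 9, 0, tzinfo=timezone.utc)
print("Submit by:", latest_safe_submission(report_due).isoformat())
print("Batch OK at 07:00 the day before:",
      can_use_batch(datetime(2026, 1, 1, 7, 0, tzinfo=timezone.utc), report_due))
```

Jobs that fail the check stay on realtime; jobs that pass will usually finish well ahead of the deadline given the typical 1-4 hour completion time.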

📅 Book a cost-architecture review to apply this to your workload →

Need help using this calculator for your workloads?

AICost.ai has 50+ calculators and playbooks. Schedule an AvatarVA meeting and we'll work through your real cost scenarios across AI & Cloud: visibility, cost reduction, optimization, forecasting and capacity planning, without sacrificing accuracy or performance.

📅 Schedule an AvatarVA meeting →

Inferred values (marked with * in calculator tables)

Vendor / Model	Field	Why it’s inferred
Anthropic — Claude Sonnet 4.6	`cachedInput`	Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.
Anthropic — Claude Sonnet 4.5	`cachedInput`	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic — Claude Sonnet 4.5	`batchInput`	Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Sonnet 4.5	`batchOutput`	Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Haiku 4.5	`cachedInput`	Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.
OpenAI — GPT-5.4 Mini	`cachedInput`	Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI — GPT-5.4 Nano	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Nano	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Nano	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Pro	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.2	`cachedInput`	Derived at 10% of input; no residency uplift.
OpenAI — GPT-5.2	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.5 Pro	`cachedInput`	Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI — GPT-5.5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.2 Pro	`cachedInput`	Derived at 10% of input — pro-tier convention.
OpenAI — GPT-5.2 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.1	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.1	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Nano	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5 Nano	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Nano	`batchOutput`	Derived at 50% of output.
Google — Gemini 3 Flash	`cachedInput`	Derived at 10% of input — Google caching discount convention ~90%.
Google — Gemini 3.1 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 3.1 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 3.1 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Pro	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.5 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`cachedInput`	Derived at 25% of input per Google 2.0 family caching rates.
Google — Gemini 2.0 Flash	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.0 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
xAI — Grok 4 (legacy)	`cachedInput`	Extrapolated at 25% of base.
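The derivations in the table follow a small set of rules, sketched below. The base rates passed in are placeholders rather than published prices, and the function name is hypothetical; the 10%, 25%, and 50% fractions come from the table itself.

```python
def infer_rates(input_rate: float, output_rate: float,
                cached_fraction: float = 0.10,
                batch_fraction: float = 0.50) -> dict:
    """Derive the inferred fields from a model's standard per-token rates."""
    return {
        "cachedInput": input_rate * cached_fraction,
        "batchInput": input_rate * batch_fraction,
        "batchOutput": output_rate * batch_fraction,
    }


# Default convention: cached input at 10% of input, batch at 50% of standard.
print(infer_rates(input_rate=2.00, output_rate=8.00))

# Exceptions listed in the table (Gemini 2.0 Flash, Grok 4 legacy) use 25% for cached input.
print(infer_rates(input_rate=2.00, output_rate=8.00, cached_fraction=0.25))
```

Because these values are derived rather than published, they are worth replacing with the provider's listed rates whenever those become available.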
