Multi-Model Router · for AI engineers & FinOps

Route simple queries to Haiku. Keep Opus for the hard ones.

Most production workloads don't need a frontier model for every query. A 70/25/5 split between cheap/mid/premium typically saves 40-70% with no quality loss - provided you classify correctly. See exactly what you'd save.

Pricing verified: 2026-06-13 · 112 models · Classifier overhead included

📖 What this is / how to use

What this calculator does

Route easy queries to Haiku 4.5 / GPT-5-mini / Gemini 3.1 Flash; reserve Opus 4.8 / GPT-5.5 / Gemini 3.1 Pro for the hard ones. See exactly how much you save.

Why use it

Typical production workloads see 40-70% cost reduction — no quality loss if routed correctly
Pick from 6 preset mixes (RAG, support, coding, research, content, balanced) or build your own
Includes classifier overhead — LLM-as-judge / embedding / none — because the classifier itself costs something
Compare your mix against every preset + sanity-check extremes (all-cheap, all-premium)


Below are the inputs you provide, the outputs you get, and ways to use the results for your AI workloads.

📥 Inputs you provide

Monthly requests: Total monthly call volume
Input tokens / request: Average input size
Output tokens / request: Average output size
Tier split: Cheap / mid / premium query mix
Baseline model: The single model you compare against
Routing method: How queries get assigned to a tier

📤 Outputs you get

Baseline cost: Cost with one model for everything
Routed cost: Cost of your tier mix
Monthly savings: Dollars saved per month
Classifier overhead: Cost of the routing decision

🎯 Use your results to

🔀 Tune your tier split: Test cheap/mid/premium ratios; many workloads hit 40-60% savings around an 80/15/5 mix
🆚 Compare against baseline: Exact dollar gap vs running one model for everything, plus every preset and the extremes
⚙️ Price the classifier: Routing overhead netted out so the savings are honest, not gross
🔌 Integrate with your agents: MCP available so agentic workflows can pull routing economics programmatically

👇 Now try the calculator below with your own AI workloads

Enter every field

Monthly requests sets the absolute scale. Input and output tokens per request apply equally to every tier, so set them to your real averages. Then set the three-tier split: cheap % (simple queries: FAQ, retrieval Q&A, autocomplete), mid % (standard reasoning: explanations, synthesis), and premium % (hard reasoning: architecture, edge cases). The three should sum to 100; the sliders auto-adjust. The baseline selector is the single model you'd use if you weren't routing (it defaults to your premium tier).
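The arithmetic behind these fields can be sketched in a few lines. This is a minimal illustration, not the calculator's actual implementation: the `Price` class, the function names, and every price below are assumptions invented for the example, and the normalization step mirrors the calculator's note that splits not summing to 100% get normalized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per 1M tokens (placeholder values below, not real list prices)."""
    input_per_m: float
    output_per_m: float


def cost_per_request(price: Price, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the given average token counts."""
    return (input_tokens * price.input_per_m
            + output_tokens * price.output_per_m) / 1_000_000


def normalize(split: dict[str, float]) -> dict[str, float]:
    """Rescale tier percentages so the shares sum to 1."""
    total = sum(split.values())
    if total <= 0:
        raise ValueError("tier split must have a positive total")
    return {tier: share / total for tier, share in split.items()}


def monthly_costs(requests, input_tokens, output_tokens, split, tier_prices, baseline_price):
    """Compare one-model-for-everything against a routed tier mix."""
    shares = normalize(split)
    per_tier = {
        tier: requests * share * cost_per_request(tier_prices[tier], input_tokens, output_tokens)
        for tier, share in shares.items()
    }
    routed = sum(per_tier.values())
    baseline = requests * cost_per_request(baseline_price, input_tokens, output_tokens)
    return {
        "per_tier": per_tier,
        "routed": routed,
        "baseline": baseline,
        "savings": baseline - routed,
        "savings_pct": 100 * (baseline - routed) / baseline,
    }


# Hypothetical prices (USD per 1M tokens), chosen only to keep the arithmetic easy to follow.
PRICES = {"cheap": Price(1.0, 5.0), "mid": Price(3.0, 15.0), "premium": Price(5.0, 25.0)}

result = monthly_costs(
    requests=1_000_000, input_tokens=1_000, output_tokens=500,
    split={"cheap": 70, "mid": 25, "premium": 5},
    tier_prices=PRICES, baseline_price=PRICES["premium"],
)
print(f"baseline ${result['baseline']:,.0f}, routed ${result['routed']:,.0f}, "
      f"savings {result['savings_pct']:.0f}%")
```

With these placeholder prices the 70/25/5 mix comes out at $5,950 against a $17,500 premium-only baseline, about 66% lower. Real numbers depend entirely on the prices of the models you pick, and this figure excludes classifier overhead.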

Pick a classifier strategy

No classifier means rules-based routing (regex, length, intent-in-URL) at zero overhead. LLM-as-judge uses a small model to classify each query at ~$0.0001/query (about $100 per million queries). Embedding embeds the query and matches it to intent clusters; it is cheaper still, but accuracy is ~85-92% vs ~90-95% for LLM-as-judge. Mis-routed queries trigger retries and can erode savings, so factor in a 5-10% "escalation overhead" if accuracy is tight.
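To see how these overheads eat into gross savings, here is a small sketch. It assumes the escalation overhead is applied as a percentage surcharge on routed model spend, which is one reasonable reading of the figure above rather than the calculator's confirmed formula. The function name and the dollar inputs are hypothetical.

```python
def net_routing_savings(baseline_cost, routed_model_cost, monthly_requests,
                        classifier_cost_per_query=0.0001, escalation_overhead=0.0):
    """Net out classifier and escalation costs from gross routing savings.

    escalation_overhead is a fraction (0.10 = 10%) applied to routed model
    spend; this interpretation is an assumption, not the calculator's formula.
    """
    classifier_cost = monthly_requests * classifier_cost_per_query
    escalation_cost = routed_model_cost * escalation_overhead
    total_routed_cost = routed_model_cost + classifier_cost + escalation_cost
    return {
        "classifier_cost": classifier_cost,
        "escalation_cost": escalation_cost,
        "total_routed_cost": total_routed_cost,
        "net_savings": baseline_cost - total_routed_cost,
    }


# Hypothetical workload: 1M requests/month, $17,500 baseline, $5,950 routed model spend.
summary = net_routing_savings(
    baseline_cost=17_500, routed_model_cost=5_950, monthly_requests=1_000_000,
    classifier_cost_per_query=0.0001,  # LLM-as-judge figure quoted above
    escalation_overhead=0.10,          # upper end of the 5-10% range
)
print(f"classifier ${summary['classifier_cost']:,.0f}, "
      f"escalation ${summary['escalation_cost']:,.0f}, "
      f"net savings ${summary['net_savings']:,.0f}")
```

In this made-up scenario the classifier adds roughly $100 and escalation roughly $595, so the escalation term is the one to watch when accuracy is tight.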

Interpret all panels

The top 3 cards compare baseline vs routed cost. The per-tier blocks show monthly cost and per-request cost for each tier; use them to spot tiers where you're paying more than necessary. The cost-share bar visualizes how routed dollars are distributed, including the classifier. The comparison table stacks your mix against every preset and the sanity-check extremes ("all cheap", "all premium"), which is useful for bounding max savings.
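The bounding idea behind the comparison table can be sketched as follows. The per-request costs and the strategy names are hypothetical, the real presets are not reproduced here, and the baseline is assumed to be the premium tier.

```python
def blended_monthly_cost(requests, per_request_cost, mix):
    """Monthly cost of a tier mix; mix values are percentages and get normalized."""
    total_share = sum(mix.values())
    return requests * sum(
        per_request_cost[tier] * share / total_share for tier, share in mix.items()
    )


def compare_strategies(requests, per_request_cost, strategies, baseline_tier="premium"):
    """Return (name, monthly, savings, savings_pct) rows, cheapest first."""
    baseline = requests * per_request_cost[baseline_tier]
    rows = []
    for name, mix in strategies.items():
        monthly = blended_monthly_cost(requests, per_request_cost, mix)
        rows.append((name, monthly, baseline - monthly, 100 * (baseline - monthly) / baseline))
    return sorted(rows, key=lambda row: row[1])


# Placeholder per-request costs in USD, not real model prices.
PER_REQUEST = {"cheap": 0.0035, "mid": 0.0105, "premium": 0.0175}
STRATEGIES = {
    "your mix (70/25/5)": {"cheap": 70, "mid": 25, "premium": 5},
    "all cheap": {"cheap": 100, "mid": 0, "premium": 0},
    "all premium": {"cheap": 0, "mid": 0, "premium": 100},
}

rows = compare_strategies(1_000_000, PER_REQUEST, STRATEGIES)
for name, monthly, savings, pct in rows:
    print(f"{name:<20} ${monthly:>9,.0f}  saves ${savings:>9,.0f} ({pct:.0f}%)")
```

The "all cheap" row is the ceiling on savings and "all premium" is the floor; any realistic mix lands between them.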

Next actions

Validate the savings before shipping. Run an offline eval: does your cheap-tier model handle your cheap-tier queries at acceptable quality? If yes, deploy with monitoring. Monitor the cheap-tier escalation rate (how often cheap-tier output is rejected and retried on premium); if it exceeds 10%, your classifier needs tuning. Stack routing with prompt caching and the batch API: the savings multiply rather than add.
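The escalation-rate check is simple enough to express directly. This is a sketch with hypothetical function names and made-up counts; in practice the two counters would come from your own request logs.

```python
def escalation_rate(cheap_tier_requests: int, escalated_requests: int) -> float:
    """Share of cheap-tier requests that were rejected and retried on premium."""
    if cheap_tier_requests <= 0:
        return 0.0
    return escalated_requests / cheap_tier_requests


def classifier_needs_tuning(rate: float, threshold: float = 0.10) -> bool:
    """Flag the classifier when the escalation rate exceeds the 10% rule of thumb."""
    return rate > threshold


# Made-up monthly counts for illustration.
rate = escalation_rate(cheap_tier_requests=700_000, escalated_requests=84_000)
print(f"escalation rate {rate:.1%}, needs tuning: {classifier_needs_tuning(rate)}")
```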


🎛 CALCULATOR

🧭 Your workload shape

Start with a preset, then tune the mix.

Monthly requests

Input tokens / request

Output tokens / request

⚠️ Your tier percentages don't sum to 100% - results will be normalized.

🟢 Cheap tier - simple queries

70%

🟡 Mid tier - standard queries

25%

🟣 Premium tier - complex queries

Compare against (baseline): What would you be using if you weren't routing? Baseline defaults to your premium-tier model.

Routing method

📈 RESULTS

Baseline (single model)

With routing

Monthly savings

Cheap tier cost · Mid tier cost · Premium tier cost · Classifier overhead

💡 Recommendations

📋 Compare routing strategies side-by-side

All strategies at your workload. Green = biggest savings, gold = your current mix.

Strategy	Mix	Monthly	vs baseline	Savings


📋 What now?

Validate the cheap tier before you ship — run an offline eval on your simple-query slice; the savings only hold if the cheap model answers those at acceptable quality.
Add an escalation path — route low-confidence answers up to premium so misroutes don't turn into retries and complaints. Watch the cheap-tier escalation rate; above ~10% means the classifier needs tuning.
Then stack the other levers — prompt caching and batch processing multiply with routing rather than just adding, often another 20-40% off the routed number.
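One way to read "multiply rather than add" is that each lever discounts whatever cost is left after the previous one. The sketch below works through that reading with illustrative percentages taken from the ranges quoted on this page; the function name is hypothetical.

```python
from math import prod


def stacked_savings(*lever_savings: float) -> float:
    """Combined savings fraction when each lever applies to the cost left by the previous one."""
    for saving in lever_savings:
        if not 0.0 <= saving <= 1.0:
            raise ValueError("each saving must be a fraction between 0 and 1")
    return 1.0 - prod(1.0 - saving for saving in lever_savings)


# Illustrative values: 60% from routing, then 30% off the routed number from caching and batch.
combined = stacked_savings(0.60, 0.30)
print(f"combined savings vs baseline: {combined:.0%}")
```

Under this reading, 60% from routing followed by 30% off the routed number leaves 0.40 × 0.70 = 0.28 of the baseline, a 72% combined reduction rather than the 90% that naive addition would suggest.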

📅 Book a routing-architecture review to apply this to your workload →

Need help using this calculator for your workloads?

AICost.ai has 50+ calculators and playbooks. Schedule an AvatarVA meeting and we'll work through your real cost scenarios across AI & Cloud: visibility, cost reduction, optimization, forecasting and capacity planning, without sacrificing accuracy or performance.

📅 Schedule an AvatarVA meeting →

Vendor / Model	Field	Why it’s inferred
Anthropic / Claude Sonnet 4.6	cachedInput	Derived at 10% of input rate; Anthropic publishes 90% cache-hit discount on this tier.
Anthropic / Claude Sonnet 4.5	cachedInput	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic / Claude Sonnet 4.5	batchInput	Derived at 50% of standard input; Anthropic documents uniform 50% Batch discount.
Anthropic / Claude Sonnet 4.5	batchOutput	Derived at 50% of standard output; Anthropic documents uniform 50% Batch discount.
Anthropic / Claude Haiku 4.5	cachedInput	Derived at 10% of input rate; Anthropic 90% cache-hit discount convention.
OpenAI / GPT-5.4 Mini	cachedInput	Derived at 10% of input; OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI / GPT-5.4 Nano	cachedInput	Derived at 10% of input; OpenAI 90% cache-hit convention.
OpenAI / GPT-5.4 Nano	batchInput	Derived at 50% of input; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Nano	batchOutput	Derived at 50% of output; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Pro	cachedInput	Derived at 10% of input; OpenAI 90% cache-hit convention.
OpenAI / GPT-5.4 Pro	batchInput	Derived at 50% of input; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Pro	batchOutput	Derived at 50% of output; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.2	cachedInput	Derived at 10% of input; no residency uplift.
OpenAI / GPT-5.2	batchInput	Derived at 50% of input.
OpenAI / GPT-5.2	batchOutput	Derived at 50% of output.
OpenAI / GPT-5	cachedInput	Derived at 10% of input.
OpenAI / GPT-5	batchInput	Derived at 50% of input.
OpenAI / GPT-5	batchOutput	Derived at 50% of output.
OpenAI / GPT-5.5 Pro	cachedInput	Derived at 10% of input; OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI / GPT-5.5 Pro	batchInput	Derived at 50% of input.
OpenAI / GPT-5.5 Pro	batchOutput	Derived at 50% of output.
OpenAI / GPT-5.2 Pro	cachedInput	Derived at 10% of input; pro-tier convention.
OpenAI / GPT-5.2 Pro	batchInput	Derived at 50% of input.
OpenAI / GPT-5.2 Pro	batchOutput	Derived at 50% of output.
OpenAI / GPT-5.1	batchInput	Derived at 50% of input.
OpenAI / GPT-5.1	batchOutput	Derived at 50% of output.
OpenAI / GPT-5 Pro	batchInput	Derived at 50% of input.
OpenAI / GPT-5 Pro	batchOutput	Derived at 50% of output.
OpenAI / GPT-5 Nano	cachedInput	Derived at 10% of input.
OpenAI / GPT-5 Nano	batchInput	Derived at 50% of input.
OpenAI / GPT-5 Nano	batchOutput	Derived at 50% of output.
Google / Gemini 3 Flash	cachedInput	Derived at 10% of input; Google caching discount convention ~90%.
Google / Gemini 3.1 Flash-Lite	cachedInput	Derived at 10% of input; Google caching convention.
Google / Gemini 3.1 Flash-Lite	batchInput	Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 3.1 Flash-Lite	batchOutput	Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.5 Pro	cachedInput	Derived at 10% of input.
Google / Gemini 2.5 Flash	cachedInput	Derived at 10% of input.
Google / Gemini 2.5 Flash-Lite	cachedInput	Derived at 10% of input; Google caching convention.
Google / Gemini 2.5 Flash-Lite	batchInput	Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.5 Flash-Lite	batchOutput	Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash	cachedInput	Derived at 25% of input per Google 2.0 family caching rates.
Google / Gemini 2.0 Flash	batchInput	Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash	batchOutput	Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash-Lite	cachedInput	Derived at 10% of input; Google caching convention.
Google / Gemini 2.0 Flash-Lite	batchInput	Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash-Lite	batchOutput	Derived at 50% of output; Google Batch API uniform 50% discount.
xAI / Grok 4 (legacy)	cachedInput	Extrapolated at 25% of base.
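The derivation rules in this table reduce to two ratios applied to the published rates. The sketch below shows that arithmetic; the function name and the example rates are hypothetical placeholders, and only the field names (`cachedInput`, `batchInput`, `batchOutput`) and the 10%, 25%, and 50% ratios come from the table.

```python
def infer_rates(input_rate: float, output_rate: float,
                cached_ratio: float = 0.10, batch_ratio: float = 0.50) -> dict[str, float]:
    """Derive unpublished rates from standard input/output rates (USD per 1M tokens)."""
    return {
        "cachedInput": input_rate * cached_ratio,
        "batchInput": input_rate * batch_ratio,
        "batchOutput": output_rate * batch_ratio,
    }


# Placeholder rates, not real list prices.
default_rates = infer_rates(input_rate=3.00, output_rate=15.00)

# A few rows in the table use 25% of the input rate for cached input instead of 10%.
legacy_rates = infer_rates(input_rate=3.00, output_rate=15.00, cached_ratio=0.25)

print(default_rates)
print(legacy_rates)
```

Because these values are derived rather than published, treat any cost that leans heavily on cached or batch pricing as an estimate until the vendor's own rate is confirmed.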


Methodology

Primary sources

Inferred values (marked with * in calculator tables)