Multi-Model Router · for AI engineers & FinOps

Route simple queries to Haiku. Keep Opus for the hard ones.

Most production workloads don't need a frontier model for every query. A 70/25/5 split across cheap/mid/premium tiers typically cuts cost by 40-70% with no quality loss, provided queries are classified correctly. See exactly what you'd save.

Pricing verified: 2026-06-13 · 112 models · Classifier overhead included
📖 What this is / how to use
What this calculator does

Route easy queries to Haiku 4.5 / GPT-5-mini / Gemini 3.1 Flash; reserve Opus 4.8 / GPT-5.5 / Gemini 3.1 Pro for the hard ones. See exactly how much you save.

Why use it
  • Typical production workloads see 40-70% cost reduction — no quality loss if routed correctly
  • Pick from 6 preset mixes (RAG, support, coding, research, content, balanced) or build your own
  • Includes classifier overhead — LLM-as-judge / embedding / none — because the classifier itself costs something
  • Compare your mix against every preset + sanity-check extremes (all-cheap, all-premium)

The sections below list the inputs you provide, the outputs you get, and ways to apply the results to your AI workloads.

📥 Inputs you provide
  • Monthly requests: Total monthly call volume
  • Input tokens / request: Average input size
  • Output tokens / request: Average output size
  • Tier split: Cheap / mid / premium query mix
  • Baseline model: The single model you compare against
  • Routing method: How queries get assigned to a tier
📤 Outputs you get
  • Baseline cost: Cost with one model for everything
  • Routed cost: Cost of your tier mix
  • Monthly savings: Dollars saved per month
  • Classifier overhead: Cost of the routing decision
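The relationship between these inputs and outputs can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual implementation: the `Price` type, the function names, and every price below are placeholders chosen for the example, not vendor rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Model price in USD per 1 million tokens."""
    input: float
    output: float


def monthly_cost(requests: float, in_tok: float, out_tok: float, price: Price) -> float:
    """Monthly cost of sending `requests` calls to a single model."""
    per_request = (in_tok * price.input + out_tok * price.output) / 1_000_000
    return requests * per_request


def routed_cost(requests, in_tok, out_tok, split, prices, classifier_per_request=0.0):
    """Monthly cost of a tier mix plus the per-request classifier overhead.

    `split` maps tier name -> fraction of traffic (fractions sum to 1).
    `prices` maps tier name -> Price.
    """
    tiers = sum(
        monthly_cost(requests * share, in_tok, out_tok, prices[tier])
        for tier, share in split.items()
    )
    return tiers + requests * classifier_per_request


# Placeholder prices for illustration only (NOT vendor rates).
PRICES = {
    "cheap": Price(input=1.0, output=5.0),
    "mid": Price(input=3.0, output=15.0),
    "premium": Price(input=5.0, output=25.0),
}
SPLIT = {"cheap": 0.70, "mid": 0.25, "premium": 0.05}

# 1M requests/month, 1,000 input and 500 output tokens per request.
baseline = monthly_cost(1_000_000, 1_000, 500, PRICES["premium"])
routed = routed_cost(1_000_000, 1_000, 500, SPLIT, PRICES)
savings = baseline - routed
print(f"baseline ${baseline:,.0f}  routed ${routed:,.0f}  savings ${savings:,.0f}")
```

The sketch assumes every tier sees the same average input and output size; if hard queries are also longer, a per-tier token count would be the natural extension.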
🎯 Use your results to
🔀
Tune your tier split

Test cheap/mid/premium ratios; many workloads hit 40-60% savings around an 80/15/5 mix

🆚
Compare against baseline

Exact dollar gap vs running one model for everything, plus every preset and the extremes

⚙️
Price the classifier

Routing overhead netted out so the savings are honest, not gross

🔌
Integrate with your agents

MCP available so agentic workflows can pull routing economics programmatically
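To make the gross-versus-net distinction concrete, here is a rough sketch of how classifier overhead might be priced and subtracted. The three method names mirror the routing methods listed earlier (LLM-as-judge, embedding, none); the function names, the default prices, and the assumption that an LLM judge reads the whole input and emits a short label are all illustrative.

```python
def classifier_cost_per_request(
    method: str,
    in_tok: float,
    *,
    judge_in_price: float = 1.0,   # placeholder USD per 1M input tokens
    judge_out_price: float = 5.0,  # placeholder USD per 1M output tokens
    judge_out_tok: float = 5,      # assumed length of the tier label
    embed_price: float = 0.02,     # placeholder USD per 1M embedded tokens
) -> float:
    """Per-request cost of the routing decision for one method."""
    if method == "none":
        return 0.0
    if method == "embedding":
        # Embeddings are input-only: no output tokens are billed.
        return in_tok * embed_price / 1_000_000
    if method == "llm_judge":
        return (in_tok * judge_in_price + judge_out_tok * judge_out_price) / 1_000_000
    raise ValueError(f"unknown routing method: {method!r}")


def net_savings(baseline: float, routed_models_only: float,
                requests: float, overhead_per_request: float):
    """Return (gross savings, classifier overhead, net savings) per month."""
    gross = baseline - routed_models_only
    overhead = requests * overhead_per_request
    return gross, overhead, gross - overhead


# Illustrative monthly figures, not calculator output.
per_req = classifier_cost_per_request("llm_judge", in_tok=1_000)
gross, overhead, net = net_savings(17_500.0, 5_950.0, 1_000_000, per_req)
print(f"gross ${gross:,.0f}  overhead ${overhead:,.0f}  net ${net:,.0f}")
```

If the overhead approaches the gross savings, a cheaper routing method or a higher cheap-tier share is worth testing before committing to the design.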

👇 Now try the calculator below with your own AI workloads

📊 Calculator at a glance
[Diagram: how the Multi-Model Router works]
🎛 CALCULATOR
🧭 Your workload shape

Start with a preset, then tune the mix.

⚠️ Your tier percentages don't sum to 100% - results will be normalized.
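The page does not say how the normalization is done. One plausible reading is proportional rescaling, sketched below; `normalize_split` is a hypothetical helper, not the calculator's code.

```python
def normalize_split(cheap: float, mid: float, premium: float):
    """Rescale three tier percentages proportionally so they sum to 1.0."""
    parts = (cheap, mid, premium)
    if any(p < 0 for p in parts):
        raise ValueError("tier percentages must be non-negative")
    total = sum(parts)
    if total == 0:
        raise ValueError("at least one tier must be non-zero")
    return tuple(p / total for p in parts)


# 80 + 25 + 5 = 110, so each share is divided by 110.
shares = normalize_split(80, 25, 5)
print([round(s, 3) for s in shares])
```

Proportional rescaling keeps the ratios between tiers intact, which is usually what a user who over-allocated by a few points intended.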
🟢 Cheap tier - simple queries: 70%
🟡 Mid tier - standard queries: 25%
🟣 Premium tier - complex queries: 5%

How your traffic divides across cheap, mid, and premium models. This is the core routing decision: in most production workloads, 50-70% of queries are ones a cheap model handles fine.

How to choose:
  • Cheap (🟢): simple/structured work, such as classification, extraction, FAQ, autocomplete, status checks.
  • Mid (🟡): standard reasoning, such as explanations, synthesis, summaries.
  • Premium (🟣): hard reasoning, such as architecture, multi-step logic, edge cases.

Start from a preset, then drag the sliders; they co-adjust toward 100%.
Read the full guide →
What would you be using if you weren't routing? Baseline defaults to your premium-tier model.
📈 RESULTS
  • Baseline (single model)
  • With routing
  • Monthly savings
  • Cost breakdown: cheap tier cost, mid tier cost, premium tier cost, classifier overhead
💡 Recommendations
    📋 Compare routing strategies side-by-side

    All strategies at your workload. Green = biggest savings, gold = your current mix.

    Columns: Strategy · Mix · Monthly · vs baseline · Savings
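A side-by-side comparison of this kind can be approximated as follows. The sketch uses only mixes named on this page (the two extremes, 70/25/5, and 80/15/5); the per-request costs are placeholders and `compare_strategies` is a hypothetical helper.

```python
def blended_monthly_cost(requests, per_request_cost, mix):
    """Monthly cost of a (cheap, mid, premium) mix given per-request tier costs."""
    return requests * sum(share * cost for share, cost in zip(mix, per_request_cost))


def compare_strategies(requests, per_request_cost, strategies, baseline_cost):
    """Return rows (name, mix, monthly, saved, saved_fraction), biggest savings first."""
    rows = []
    for name, mix in strategies.items():
        monthly = blended_monthly_cost(requests, per_request_cost, mix)
        saved = baseline_cost - monthly
        rows.append((name, mix, monthly, saved, saved / baseline_cost))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Placeholder per-request costs in USD for (cheap, mid, premium); not vendor rates.
PER_REQUEST = (0.0035, 0.0105, 0.0175)
STRATEGIES = {
    "all-cheap": (1.00, 0.00, 0.00),
    "all-premium": (0.00, 0.00, 1.00),
    "70/25/5": (0.70, 0.25, 0.05),
    "80/15/5": (0.80, 0.15, 0.05),
}

requests = 1_000_000
baseline_cost = blended_monthly_cost(requests, PER_REQUEST, STRATEGIES["all-premium"])
table = compare_strategies(requests, PER_REQUEST, STRATEGIES, baseline_cost)
for name, mix, monthly, saved, frac in table:
    print(f"{name:12s} {mix}  ${monthly:>9,.0f}  saves ${saved:>9,.0f} ({frac:.0%})")
```

The all-cheap row is a ceiling on savings rather than a recommendation: it ignores the quality loss from sending hard queries to the cheap tier.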
    📋 What now?
    📅 Book a routing-architecture review to apply this to your workload →

    Go deeper

    Our playbooks on cutting this number.

    • 💾 Prompt Cache ROI: Stack caching on top of routing
    • 🧩 RAG Pipeline Cost: Where routing shines most
    • 📦 Batch vs Realtime: Another 50% off for async workloads
    • 🧮 Baseline LLM Cost: Single-model sanity check

    Need help using this calculator for your workloads?

    AICost.ai has 50+ calculators and playbooks. Schedule an AvatarVA meeting and we'll work through your real cost scenarios across AI & Cloud: visibility, cost reduction, optimization, forecasting and capacity planning, without sacrificing accuracy or performance.

    📅 Schedule an AvatarVA meeting →
    📖 Data sources & methodology: 171 text models · 9 embeddings · 30 vision · 46 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-13

    Methodology

    • All prices are USD per 1 million tokens, current as of 2026-06-13.
    • Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
    • Batch API discounts are 50% off standard rates across providers that offer Batch mode.
    • Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
    • Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
    • Long-context pricing tiers apply when the input exceeds the model's long-context threshold.
    • Embedding prices are input-only (no output tokens generated).
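The discount and surcharge rules above translate into simple multipliers on a base rate. The helpers below are a hedged sketch with hypothetical names; they treat each adjustment separately and make no claim about whether a given provider lets batch and cache discounts stack, or about cache-write charges.

```python
def cached_input_price(base: float, hit_rate: float, cache_discount: float = 0.90) -> float:
    """Blended input rate when `hit_rate` of input tokens are served from cache.

    `cache_discount` is the fraction taken off cached tokens (0.80-0.90 is typical).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return base * (1.0 - hit_rate * cache_discount)


def batch_price(base: float, batch_discount: float = 0.50) -> float:
    """Rate under Batch mode, for providers that offer it."""
    return base * (1.0 - batch_discount)


def with_residency(base: float, multiplier: float = 1.10) -> float:
    """Rate with a regional data-residency surcharge (not included in base rates)."""
    return base * multiplier


base = 3.0  # placeholder USD per 1M input tokens, not a vendor rate
print(cached_input_price(base, hit_rate=0.5))  # half of input tokens hit the cache
print(batch_price(base))
print(with_residency(base))
```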

    Primary sources

    Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.
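The fallback rule for the last-verified date can be expressed compactly. This sketch assumes the two sources have already been reduced to lists of dates of successful runs; the function name and input shape are hypothetical, and only the table names come from the text above.

```python
from datetime import date
from typing import Optional, Sequence


def last_verified(snapshot_dates: Sequence[date],
                  crawler_run_dates: Sequence[date]) -> Optional[date]:
    """Most recent successful snapshot, else most recent successful crawler run.

    `snapshot_dates` stands in for rows of aicost_pricing_snapshots and
    `crawler_run_dates` for rows of aicost_crawler_runs.
    """
    if snapshot_dates:
        return max(snapshot_dates)
    if crawler_run_dates:
        return max(crawler_run_dates)
    return None  # vendor not yet verified


snapshots = [date(2026, 6, 12), date(2026, 6, 13)]
crawls = [date(2026, 6, 10)]
print(last_verified(snapshots, crawls))
print(last_verified([], crawls))
```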

    • Anthropic (verified 2026-06-13): https://www.anthropic.com/pricing · Daily snapshot since Sep 2023 · 586 days captured
    • Anthropic Docs (verified 2026-06-13): https://platform.claude.com/docs/en/about-claude/pricing · Daily snapshot since Sep 2023 · 586 days captured
    • OpenAI (verified 2026-06-13): https://openai.com/api/pricing/ · Daily snapshot since Sep 2023 · 587 days captured
    • Google AI (verified 2026-06-13): https://ai.google.dev/gemini-api/docs/pricing · Daily snapshot since Dec 2023 · 562 days captured
    • Google Vertex (verified 2026-06-13): https://cloud.google.com/vertex-ai/generative-ai/pricing · Daily snapshot since Dec 2023 · 562 days captured
    • DeepSeek (verified 2026-06-13): https://api-docs.deepseek.com/quick_start/pricing · Daily snapshot since May 2024 · 501 days captured
    • xAI (verified 2026-06-13): https://x.ai/api · Daily snapshot since Nov 2024 · 419 days captured
    • Mistral (verified 2026-06-13): https://mistral.ai/pricing · Daily snapshot since Dec 2023 · 560 days captured
    • Cohere (verified 2026-06-13): https://cohere.com/pricing · Daily snapshot since Sep 2023 · 586 days captured

    Inferred values (marked with * in calculator tables)

    Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

    Vendor / Model | Field | Why it's inferred
    Anthropic / Claude Sonnet 4.6 | cachedInput | Derived at 10% of input rate; Anthropic publishes 90% cache-hit discount on this tier.
    Anthropic / Claude Sonnet 4.5 | cachedInput | Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
    Anthropic / Claude Sonnet 4.5 | batchInput | Derived at 50% of standard input; Anthropic documents uniform 50% Batch discount.
    Anthropic / Claude Sonnet 4.5 | batchOutput | Derived at 50% of standard output; Anthropic documents uniform 50% Batch discount.
    Anthropic / Claude Haiku 4.5 | cachedInput | Derived at 10% of input rate; Anthropic 90% cache-hit discount convention.
    OpenAI / GPT-5.4 Mini | cachedInput | Derived at 10% of input; OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
    OpenAI / GPT-5.4 Nano | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention.
    OpenAI / GPT-5.4 Nano | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount.
    OpenAI / GPT-5.4 Nano | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount.
    OpenAI / GPT-5.4 Pro | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention.
    OpenAI / GPT-5.4 Pro | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount.
    OpenAI / GPT-5.4 Pro | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount.
    OpenAI / GPT-5.2 | cachedInput | Derived at 10% of input; no residency uplift.
    OpenAI / GPT-5.2 | batchInput | Derived at 50% of input.
    OpenAI / GPT-5.2 | batchOutput | Derived at 50% of output.
    OpenAI / GPT-5 | cachedInput | Derived at 10% of input.
    OpenAI / GPT-5 | batchInput | Derived at 50% of input.
    OpenAI / GPT-5 | batchOutput | Derived at 50% of output.
    OpenAI / GPT-5.5 Pro | cachedInput | Derived at 10% of input; OpenAI does not publish a cached rate for *-pro models, so the family convention is used.
    OpenAI / GPT-5.5 Pro | batchInput | Derived at 50% of input.
    OpenAI / GPT-5.5 Pro | batchOutput | Derived at 50% of output.
    OpenAI / GPT-5.2 Pro | cachedInput | Derived at 10% of input; pro-tier convention.
    OpenAI / GPT-5.2 Pro | batchInput | Derived at 50% of input.
    OpenAI / GPT-5.2 Pro | batchOutput | Derived at 50% of output.
    OpenAI / GPT-5.1 | batchInput | Derived at 50% of input.
    OpenAI / GPT-5.1 | batchOutput | Derived at 50% of output.
    OpenAI / GPT-5 Pro | batchInput | Derived at 50% of input.
    OpenAI / GPT-5 Pro | batchOutput | Derived at 50% of output.
    OpenAI / GPT-5 Nano | cachedInput | Derived at 10% of input.
    OpenAI / GPT-5 Nano | batchInput | Derived at 50% of input.
    OpenAI / GPT-5 Nano | batchOutput | Derived at 50% of output.
    Google / Gemini 3 Flash | cachedInput | Derived at 10% of input; Google caching discount convention ~90%.
    Google / Gemini 3.1 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
    Google / Gemini 3.1 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
    Google / Gemini 3.1 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
    Google / Gemini 2.5 Pro | cachedInput | Derived at 10% of input.
    Google / Gemini 2.5 Flash | cachedInput | Derived at 10% of input.
    Google / Gemini 2.5 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
    Google / Gemini 2.5 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
    Google / Gemini 2.5 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
    Google / Gemini 2.0 Flash | cachedInput | Derived at 25% of input per Google 2.0 family caching rates.
    Google / Gemini 2.0 Flash | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
    Google / Gemini 2.0 Flash | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
    Google / Gemini 2.0 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
    Google / Gemini 2.0 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
    Google / Gemini 2.0 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
    xAI / Grok 4 (legacy) | cachedInput | Extrapolated at 25% of base.
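The derivation pattern in this table (use the published value when one exists, otherwise apply the convention and mark the result with *) might look roughly like this. The field names come from the table; `fill_inferred`, `label`, and the prices in the example are hypothetical.

```python
from typing import Dict, Tuple


def fill_inferred(
    input_price: float,
    output_price: float,
    published: Dict[str, float],
    cache_ratio: float = 0.10,  # cached input = 10% of base by convention
    batch_ratio: float = 0.50,  # Batch API = 50% of base by convention
) -> Dict[str, Tuple[float, bool]]:
    """Map each field to (value, inferred); inferred is True when derived."""
    conventions = {
        "cachedInput": input_price * cache_ratio,
        "batchInput": input_price * batch_ratio,
        "batchOutput": output_price * batch_ratio,
    }
    result = {}
    for field, derived in conventions.items():
        if field in published:
            result[field] = (published[field], False)  # vendor-published: no mark
        else:
            result[field] = (derived, True)            # inferred: shown with *
    return result


def label(value: float, inferred: bool) -> str:
    """Format a rate the way the tables mark inferred values."""
    return f"{value:.2f}{'*' if inferred else ''}"


# Placeholder model: $2 input, $10 output, only batchInput published.
row = fill_inferred(2.0, 10.0, published={"batchInput": 1.0})
print({field: label(*pair) for field, pair in row.items()})
```

For the rows derived at 25% rather than 10%, the same helper would be called with `cache_ratio=0.25`.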

    Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for full audit trail. Read the full methodology →
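As an illustration of the audit trail described here, a changelog entry could be produced by diffing two daily snapshots. The snapshot shape (a dict keyed by model and field) and the `diff_snapshots` helper are assumptions made for this sketch; the real schema of aicost_price_changelog is not shown on this page.

```python
from typing import Dict, List, Tuple

Snapshot = Dict[Tuple[str, str], float]  # (model, field) -> USD per 1M tokens


def diff_snapshots(old: Snapshot, new: Snapshot) -> List[dict]:
    """List price changes between two snapshots with old and new values.

    Only keys present in both snapshots are compared; added or removed
    models would need separate handling.
    """
    changes = []
    for key in sorted(old.keys() & new.keys()):
        if old[key] != new[key]:
            model, field = key
            changes.append(
                {"model": model, "field": field, "old": old[key], "new": new[key]}
            )
    return changes


# Placeholder model names and prices.
yesterday = {("model-a", "input"): 3.0, ("model-a", "output"): 15.0}
today = {("model-a", "input"): 2.5, ("model-a", "output"): 15.0}
print(diff_snapshots(yesterday, today))
```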