The Cheapest LLM API in 2026 — Full Cost Breakdown by Workload

"Which LLM API is cheapest?" is the wrong question. The right one is cheapest for what you're actually doing. The model that wins a retrieval workload loses a content-generation workload, because input and output tokens are priced differently — output costs roughly 5× input on every frontier model. This guide ranks the cheapest option for each common workload, using live per-token pricing. Want an exact number for your own prompt? Drop it into the calculator — it shows the bill before you ship.

The absolute cheapest models right now

At the raw-price floor, three models are effectively tied:

Gemini 2.5 Flash-Lite — $0.10 input / $0.40 output per million tokens
GPT-4.1 Nano — $0.10 / $0.40 per million
DeepSeek V3.2 — $0.14 / $0.28 per million

DeepSeek is the cheapest of the three for output-heavy work because its output rate ($0.28) is the lowest on the market. For input-heavy work, the two $0.10 models edge ahead. See every rate side by side on the all-models page.

Cheapest for chat / assistants

Chat is a balanced workload: a system prompt plus history on input, a few hundred tokens on output. Two levers decide the winner here — the output rate and prompt caching, because a chat app sends the same system prefix on every turn.

Winner: Claude Haiku 4.5 ($1 / $5) or Gemini 3.5 Flash ($1.50 / $9) once caching is on. Caching a static system prompt bills the cached prefix at ~10% of the input rate, which routinely cuts a chat app's bill 40–60%. The slightly higher headline rate buys noticeably better answer quality than the sub-$0.20 floor models, which matters when a user is reading every reply. Compare them directly: Gemini 3.5 Flash vs Claude Haiku 4.5.

Cheapest for RAG / document Q&A

RAG has a lopsided cost profile: long input (retrieved chunks + question), short output (the answer). That means the input rate dominates and the output rate barely matters.

Winner: Gemini 2.5 Flash-Lite or GPT-4.1 Nano ($0.10 input). At ten cents per million input tokens, stuffing 8K tokens of context into a query costs a fraction of a cent. One caveat from real testing: on hard questions the very cheapest model sometimes misses, and a slightly pricier model (Gemini 3.5 Flash) becomes the better economic choice once you count re-asks and wrong answers. Budget for quality, not just the sticker price.

Cheapest for content generation

Content generation flips RAG on its head: short input (a brief), long output (the article). Here the output rate is everything, and output is the expensive direction.

Winner: DeepSeek V3.2 ($0.28 output). It has the lowest output rate on the market by a wide margin. If you need stronger writing quality, Gemini 2.5 Flash ($2.50 output) is the next step up. Whatever you pick, cap max_tokens and prompt for terse drafts — a 50% cut in output length is a ~40% cut in total cost.

Cheapest for batch / overnight jobs

If a job can wait 24 hours — nightly summaries, data enrichment, eval runs, embedding refresh — every major vendor offers a flat 50% discount through their Batch API. That discount stacks on top of the rates above.

Winner: whatever you'd use for the synchronous version, halved. Run DeepSeek V3.2 output-heavy batch jobs and you are paying $0.14 per million output tokens. There is no cheaper way to process tokens at scale in 2026.

Cheapest for agents and coding

Autonomous agents burn tokens in long loops, so per-token price matters — but a cheap model that fails the task and loops again is not cheap. The economical pattern is two-tier routing: a budget model (Haiku, Flash-Lite, Nano) triages 70–80% of steps, and a flagship (GPT-5.4, Claude Sonnet 4.6) handles the hard 20%. This typically saves 60–85% versus running everything through the flagship, with no meaningful quality loss. For the hardest long-horizon coding, the Mythos-class models (Claude Fable 5) cost more per run but save engineer-hours — see the model list for current rates.

The five rules that beat any model choice

Cache your static prefix. ~10% billing on cached input. Biggest single lever for chat.
Cap output. Output is 5× input; max_tokens is free money.
Batch what can wait. Flat 50% off.
Route by difficulty. Cheap model first, flagship only for the hard cases.
Mind the tokenizer. Non-English text (Cyrillic, CJK) costs 2–4× more tokens — a hidden multiplier for multilingual apps.

FAQ

What is the cheapest LLM API overall in 2026?

By raw price, Gemini 2.5 Flash-Lite and GPT-4.1 Nano at $0.10/$0.40 per million tokens, with DeepSeek V3.2 ($0.14/$0.28) cheapest for output-heavy work. The best choice depends on your input/output ratio.

Is DeepSeek really cheaper than GPT and Claude?

For output-heavy workloads, yes — DeepSeek V3.2's $0.28 output rate is the lowest among production models. For input-heavy RAG, the $0.10-input models are cheaper.

Does prompt caching work on every provider?

Anthropic, OpenAI and Google all support it, with different syntax. Cached input is billed at roughly 10% of the standard rate.

How do I calculate my own cost?

Paste your real prompt into the gpt-cost.com calculator, pick a model, and it shows per-call and at-scale cost with input and output priced separately.

Prices synced from each vendor's official pricing page, June 2026. For exact billing, use each provider's official tokenizer.

The cheapest LLM API in 2026, by workload.