The cheapest LLM API in 2026, by workload.

Updated June 2026. Real per-token prices for GPT, Claude, Gemini, DeepSeek and Llama — ranked by what you are actually doing: chat, RAG, content, batch and agents.

"Which LLM API is cheapest?" is the wrong question. The right one is: cheapest for what you're actually doing? The model that wins a retrieval workload can lose a content-generation workload, because input and output tokens are priced differently. On the models quoted in this guide, output costs between 2× and 6× the input rate, and around 5× is typical for frontier models. This guide ranks the cheapest option for each common workload, using live per-token pricing. Want an exact number for your own prompt? Drop it into the calculator, which shows the bill before you ship.

The absolute cheapest models right now

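At the raw-price floor, three models are effectively tied (input / output, USD per million tokens):

  Gemini 2.5 Flash-Lite: $0.10 / $0.40
  GPT-4.1 Nano: $0.10 / $0.40
  DeepSeek V3.2: $0.14 / $0.28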

DeepSeek is the cheapest of the three for output-heavy work because its output rate ($0.28) is the lowest on the market. For input-heavy work, the two $0.10 models edge ahead. See every rate side by side on the all-models page.
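
To see why the ranking flips with the input/output mix, a minimal Python sketch can price one call on each floor model. The rates are the ones quoted in this guide; the function names and token counts are illustrative assumptions, not part of any vendor SDK.

```python
# Rates in USD per million tokens (input, output), as quoted in this guide.
RATES = {
    "Gemini 2.5 Flash-Lite": (0.10, 0.40),
    "GPT-4.1 Nano": (0.10, 0.40),
    "DeepSeek V3.2": (0.14, 0.28),
}

def call_cost(model, input_tokens, output_tokens):
    """Cost in USD of one call, pricing input and output separately."""
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

def cheapest(input_tokens, output_tokens):
    """Name of the cheapest model for this token mix (first listed wins a tie)."""
    return min(RATES, key=lambda m: call_cost(m, input_tokens, output_tokens))

# Hypothetical workloads: an input-heavy query and an output-heavy draft.
print(cheapest(8000, 200))   # long context, short answer
print(cheapest(200, 2000))   # short brief, long article
```

With these rates the break-even sits at one output token per three input tokens: above that ratio DeepSeek V3.2 is cheaper, below it the $0.10-input models are.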

Cheapest for chat / assistants

Chat is a balanced workload: a system prompt plus history on input, a few hundred tokens on output. Two levers decide the winner here — the output rate and prompt caching, because a chat app sends the same system prefix on every turn.

Winner: Claude Haiku 4.5 ($1 / $5) or Gemini 3.5 Flash ($1.50 / $9) once caching is on. Caching a static system prompt bills the cached prefix at ~10% of the input rate, which routinely cuts a chat app's bill 40–60%. The slightly higher headline rate buys noticeably better answer quality than the sub-$0.20 floor models, which matters when a user is reading every reply. Compare them directly: Gemini 3.5 Flash vs Claude Haiku 4.5.
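
The caching arithmetic can be sketched as follows. This is a rough model under stated assumptions: the cached prefix is billed at 10% of the input rate (the figure quoted above), any cache-write surcharge or cache expiry is ignored, and the token counts describe a hypothetical chat turn rather than measured traffic.

```python
def chat_turn_cost(prefix_tokens, fresh_tokens, output_tokens,
                   in_rate, out_rate, cached=False, cached_fraction=0.10):
    """Cost in USD of one chat turn; rates are USD per million tokens."""
    # Only the static prefix gets the cached rate; new history and the
    # user message are billed at the full input rate.
    prefix_rate = in_rate * cached_fraction if cached else in_rate
    total = (prefix_tokens * prefix_rate
             + fresh_tokens * in_rate
             + output_tokens * out_rate)
    return total / 1_000_000

# Hypothetical turn on Claude Haiku 4.5 ($1 / $5):
# 3,000-token static prefix, 500 fresh input tokens, 300 output tokens.
uncached = chat_turn_cost(3000, 500, 300, in_rate=1.00, out_rate=5.00)
cached = chat_turn_cost(3000, 500, 300, in_rate=1.00, out_rate=5.00, cached=True)
saving = 1 - cached / uncached
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}, saving {saving:.0%}")
```

For this mix the turn drops from $0.0050 to $0.0023, a saving of about 54%. The saving shrinks as the fresh input and the output grow relative to the static prefix.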

Cheapest for RAG / document Q&A

RAG has a lopsided cost profile: long input (retrieved chunks + question), short output (the answer). That means the input rate dominates and the output rate barely matters.

Winner: Gemini 2.5 Flash-Lite or GPT-4.1 Nano ($0.10 input). At ten cents per million input tokens, stuffing 8K tokens of context into a query costs a fraction of a cent. One caveat from real testing: on hard questions the very cheapest model sometimes misses, and a slightly pricier model (Gemini 3.5 Flash) becomes the better economic choice once you count re-asks and wrong answers. Budget for quality, not just the sticker price.
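
One way to put a number on "budget for quality" is to add the expected cost of a wrong answer to the token cost. The sketch below uses the rates quoted in this guide, but the accuracy figures and the cost of a wrong answer are purely hypothetical placeholders; substitute values measured on your own questions.

```python
def query_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Token cost in USD of one RAG query; rates are USD per million tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

def expected_cost(call_cost, accuracy, wrong_answer_cost):
    """Token cost plus the expected cost of a wrong answer."""
    return call_cost + (1 - accuracy) * wrong_answer_cost

# 8K tokens of retrieved context plus a 200-token answer.
cheap_call = query_cost(8000, 200, 0.10, 0.40)    # Gemini 2.5 Flash-Lite rates
strong_call = query_cost(8000, 200, 1.50, 9.00)   # Gemini 3.5 Flash rates

# HYPOTHETICAL inputs: a wrong answer costs $0.10 (re-ask, support time),
# and the two models answer 80% and 97% of hard questions correctly.
WRONG_ANSWER_COST = 0.10
cheap_total = expected_cost(cheap_call, accuracy=0.80,
                            wrong_answer_cost=WRONG_ANSWER_COST)
strong_total = expected_cost(strong_call, accuracy=0.97,
                             wrong_answer_cost=WRONG_ANSWER_COST)
print(f"cheap ${cheap_total:.4f} vs strong ${strong_total:.4f} per query")
```

On token cost alone the cheaper model is more than 15× cheaper per call, so the pricier model only wins once a wrong answer carries a real cost of its own.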

Cheapest for content generation

Content generation flips RAG on its head: short input (a brief), long output (the article). Here the output rate is everything, and output is the expensive direction.

Winner: DeepSeek V3.2 ($0.28 output). It has the lowest output rate on the market by a wide margin. If you need stronger writing quality, Gemini 2.5 Flash ($2.50 output) is the next step up. Whatever you pick, cap max_tokens and prompt for terse drafts: when output makes up about 80% of the bill, a 50% cut in output length is a ~40% cut in total cost.
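
How much a shorter draft saves depends on the share of the bill that comes from output. A minimal sketch, assuming the DeepSeek V3.2 rates quoted in this guide and a hypothetical brief and draft length:

```python
def saving_from_output_cut(input_tokens, output_tokens,
                           in_rate, out_rate, cut=0.5):
    """Fraction of total cost saved when output length shrinks by `cut`."""
    input_cost = input_tokens * in_rate
    output_cost = output_tokens * out_rate
    return cut * output_cost / (input_cost + output_cost)

# DeepSeek V3.2 at $0.14 / $0.28: 1,000-token brief, 2,000-token draft.
# Output is 80% of the bill here, so halving it saves 40% overall.
print(f"{saving_from_output_cut(1000, 2000, 0.14, 0.28):.0%}")
```

With a shorter brief the output share rises and the saving approaches the full 50%. With a long brief it falls well below 40%.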

Cheapest for batch / overnight jobs

If a job can wait 24 hours — nightly summaries, data enrichment, eval runs, embedding refresh — every major vendor offers a flat 50% discount through their Batch API. That discount stacks on top of the rates above.

Winner: whatever you'd use for the synchronous version, halved. Run DeepSeek V3.2 output-heavy batch jobs and you are paying $0.14 per million output tokens. There is no cheaper way to process tokens at scale in 2026.

Cheapest for agents and coding

Autonomous agents burn tokens in long loops, so per-token price matters, but a cheap model that fails the task and loops again is not cheap. The economical pattern is two-tier routing: a budget model (Haiku, Flash-Lite, Nano) handles the routine 70–80% of steps, and a flagship (GPT-5.4, Claude Sonnet 4.6) handles the hard 20–30%. This typically saves 60–85% versus running everything through the flagship, with no meaningful quality loss. For the hardest long-horizon coding, the Mythos-class models (Claude Fable 5) cost more per run but save engineer-hours; see the model list for current rates.
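
The routing saving can be estimated with a small blended-cost sketch. It assumes every step uses roughly the same number of tokens, and it describes the two tiers only by a cost ratio. The ratios used here are hypothetical, since flagship rates are not quoted in this guide.

```python
def routing_saving(cheap_fraction, cost_ratio, cheap_first=False):
    """Fraction saved versus sending every step to the flagship.

    cheap_fraction: share of steps the budget model resolves (0..1)
    cost_ratio:     budget-model cost per step / flagship cost per step
    cheap_first:    if True, every step is tried on the budget model and
                    the unresolved ones are rerun on the flagship
    """
    hard_fraction = 1 - cheap_fraction
    if cheap_first:
        # The budget model runs on all steps; hard steps are paid for twice.
        blended = cost_ratio + hard_fraction
    else:
        # A router sends each step to exactly one tier.
        blended = cheap_fraction * cost_ratio + hard_fraction
    return 1 - blended

# Hypothetical: budget tier resolves 75% of steps at 5% of the flagship cost.
print(f"{routing_saving(0.75, 0.05):.1%}")
print(f"{routing_saving(0.75, 0.05, cheap_first=True):.1%}")
```

Under these assumptions the saving can never exceed `cheap_fraction`, so the upper end of the quoted range requires the budget tier to absorb more of the token volume than its share of steps alone suggests.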

The five rules that beat any model choice

  1. Cache your static prefix. ~10% billing on cached input. Biggest single lever for chat.
  2. Cap output. Output typically costs about 5× input; a max_tokens cap is free money.
  3. Batch what can wait. Flat 50% off.
  4. Route by difficulty. Cheap model first, flagship only for the hard cases.
  5. Mind the tokenizer. Non-English text (Cyrillic, CJK) costs 2–4× more tokens — a hidden multiplier for multilingual apps.

FAQ

What is the cheapest LLM API overall in 2026?

By raw price, Gemini 2.5 Flash-Lite and GPT-4.1 Nano at $0.10/$0.40 per million tokens, with DeepSeek V3.2 ($0.14/$0.28) cheapest for output-heavy work. The best choice depends on your input/output ratio.

Is DeepSeek really cheaper than GPT and Claude?

For output-heavy workloads, yes — DeepSeek V3.2's $0.28 output rate is the lowest among production models. For input-heavy RAG, the $0.10-input models are cheaper.

Does prompt caching work on every provider?

Anthropic, OpenAI and Google all support it, with different syntax. Cached input is billed at roughly 10% of the standard rate.

How do I calculate my own cost?

Paste your real prompt into the gpt-cost.com calculator, pick a model, and it shows per-call and at-scale cost with input and output priced separately.

Prices synced from each vendor's official pricing page, June 2026. For exact billing, use each provider's official tokenizer.
