THE TOP 30 AI MODELS — A COMPREHENSIVE THESIS



Compiled June 25, 2026

Tier 1 — The Closed Frontier (General-Purpose Flagships)
The four labs that trade the overall lead month to month. As of June 2026, Claude Opus 4.8 leads the Artificial Analysis Intelligence Index at 61.4, ahead of GPT-5.5 (60.2), Gemini 3.1 Pro (57) and Grok 4.3 (53). No single model wins every axis.

  1. Claude Opus 4.8 (Anthropic (USA), 28 May 2026)
    Type: Closed · Reasoning + agentic flagship | Context: Long-context (200K+) | Price: Premium tier (~$15 in / $75 out per 1M tokens).
    Unique: Currently the #1 overall model on the Artificial Analysis Intelligence Index (61.4). Built for complex reasoning, long-horizon agentic coding and high-autonomy work such as computer use, browser agents and financial analysis. Runs adaptive thinking with a configurable effort level (default ‘high’) plus an optional Fast mode that raises output speed roughly 2.5x at a premium price. Tops the hardest general-reasoning benchmark in rotation (Humanity’s Last Exam: 57.9% with tools, 49.8% without) and leads GDPval-AA on economically valuable knowledge work.
    Coding: Leads the available field on repository-scale software engineering: 88.6% SWE-bench Verified, 69.2% SWE-bench Pro (vendor harness). About 4x less likely than Opus 4.7 to let flaws slip through in its own code, while using ~35% fewer tokens per task.
    Latest: Successor to the Opus 4.5/4.6/4.7 line. Pairs with Claude Code (the developer-favourite agentic harness) and ships across Claude apps, the API, Claude Cowork and Claude in Chrome/Excel/PowerPoint.
  2. GPT-5.5 (OpenAI (USA), April 2026)
    Type: Closed · Unified general-purpose flagship | Context: Large | Price: Premium end (~$2.50 in / $15 out per 1M for the standard tier).
    Unique: Powers ChatGPT — the most popular AI product by far. Designed as a ‘unified system’ that decides how much compute and reasoning to spend on each query. Leads creative writing among the four flagships and posts the strongest math scores (≈95.2% AIME 2025). Neck-and-neck with Opus 4.8 at the top for coding; leads Terminal-Bench 2.1 for terminal-heavy work.
    Coding: Tops the standardized SWE-bench Pro leaderboard in head-to-head coding-arena play; follows the rapid GPT-5.1/5.2/5.4 release cadence.
    Latest: Became OpenAI’s default model in the run-up to mid-June 2026. Ships GPT Image 2 natively for image generation.
  3. Gemini 3.1 Pro (Google DeepMind (USA), February 2026)
    Type: Closed · Multimodal reasoning flagship | Context: 1M tokens (2M on 3.5 Pro) | Price: Cheapest closed frontier for short prompts (~$2 in / $12 out); pricing roughly doubles above 200K-token contexts.
    Unique: Leads on reasoning and data analysis among the four flagships, and is the multimodal leader — native video, audio and a 1M-token window. Cheapest of the closed frontier models for short prompts. Its dedicated ultra-tier reasoning mode, Gemini 3 Deep Think, is the pick for the hardest maths, science and logic (gold-medal IMO-level performance).
    Coding: Competitive (≈63.8% SWE-bench class), strong in Google’s Docs/Sheets ecosystem and via Gemini CLI.
    Latest: Gemini 3.5 Pro began limited rollout in mid-2026, pairing a 2M-token window with frontier-class quality — the best balance of context and capability. Nano Banana 2 / Nano Banana Pro handle image generation.
  4. Grok 4.3 (xAI (USA), End of April 2026)
    Type: Closed · Reasoning-first flagship | Context: 1M tokens (2M on Grok 4 Fast) | Price: Among the most affordable of the four flagships (~$2 in / $15 out).
    Unique: Always-on (‘continuous’) reasoning with configurable reasoning-effort levels (none/low/medium/high). The most aggressively priced of the four flagships and especially strong at agentic tool use, instruction-following and high-factual-accuracy work — currently #1 on Artificial Analysis’s CaseLaw legal-reasoning benchmark. Real-time access to X data is a unique differentiator.
    Coding: Leads raw SWE-bench scores in some 2026 snapshots (~75%); strong agentic and tool-use profile.
    Latest: Shipped a separate Custom Voices voice-cloning suite alongside the model. Grok Imagine handles fast experimental image/video.

Tier analysis: The defining feature of the closed-flagship tier in 2026 is convergence at the top. A year ago a single model could claim a clear overall lead; today the four leaders are separated by single-digit index points and trade positions with every release. This has three consequences. First, brand loyalty is a poor strategy — the rational approach is task-specific selection, choosing Opus 4.8 or GPT-5.5 for demanding code, Gemini 3.1 Pro for analytical reasoning over large documents, and Grok 4.3 where cost and tool-use throughput dominate. Second, the labs have stopped competing purely on raw intelligence, which has largely saturated the easier benchmarks, and now differentiate on reliability, token-efficiency, agentic autonomy and price. Opus 4.8’s headline claim is not that it is smarter than its predecessor but that it is roughly four times less likely to ship a flaw in its own code while using about a third fewer tokens. Third, each flagship now anchors an ecosystem rather than a single chat box: Claude spans Code, Cowork and the Chrome/Excel/PowerPoint agents; GPT-5.5 powers ChatGPT and GPT Image 2; Gemini reaches into Workspace and Vertex; Grok taps real-time X data. Choosing a flagship increasingly means choosing an ecosystem.
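
The task-specific selection argued for above depends partly on per-request cost, which follows directly from the per-million-token prices listed in this tier. The sketch below is a minimal illustration rather than vendor billing logic. The prices are the approximate figures quoted in the entries above; the names `PRICES_PER_MILLION` and `request_cost` are hypothetical; and treating Gemini's long-context surcharge as exactly 2x on both input and output above 200K prompt tokens is an assumption that simplifies "roughly doubles". It also ignores whether a given model's context window would accept the prompt.

```python
# Approximate list prices quoted in this tier, in USD per 1M tokens (input, output).
# These are the article's rounded figures, not authoritative vendor pricing.
PRICES_PER_MILLION = {
    "Claude Opus 4.8": (15.00, 75.00),
    "GPT-5.5": (2.50, 15.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "Grok 4.3": (2.00, 15.00),
}

# Assumption: the "roughly doubles above 200K-token contexts" rule is modelled
# as exactly 2x on both input and output once the prompt exceeds 200K tokens.
LONG_CONTEXT_THRESHOLD = 200_000
LONG_CONTEXT_MULTIPLIER = 2.0


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request from per-million-token prices."""
    price_in, price_out = PRICES_PER_MILLION[model]
    if model == "Gemini 3.1 Pro" and input_tokens > LONG_CONTEXT_THRESHOLD:
        price_in *= LONG_CONTEXT_MULTIPLIER
        price_out *= LONG_CONTEXT_MULTIPLIER
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # A short prompt versus a long-document prompt, each with a 2K-token answer.
    for label, n_in in (("short (10K in)", 10_000), ("long (400K in)", 400_000)):
        print(label)
        for model in PRICES_PER_MILLION:
            print(f"  {model:<16} ${request_cost(model, n_in, 2_000):.4f}")
```

Under these assumptions Gemini 3.1 Pro is the cheapest of the four on the short prompt, but the long-context surcharge makes it dearer than both Grok 4.3 and GPT-5.5 on the 400K-token prompt, which is why the cheapest flagship depends on prompt length.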

———

Tier 2 — The Suspended Apex (Mythos / Fable)
Anthropic’s first public frontier-tier models posted the highest benchmark scores any model has ever recorded — then went dark. A US export-control directive on 12 June 2026 ordered access disabled for all foreign nationals; Anthropic complied by suspending both worldwide for every customer.

  1. Claude Fable 5 (Anthropic (USA), 9 June 2026 (suspended 12 June))
    Type: Closed · Frontier-tier (with extra safety classifiers) | Context: Frontier | Price: Not currently available.
    Unique: Anthropic’s first public frontier-tier (‘Mythos-class’) model. Posted the highest scores any model has recorded: 80.3% SWE-bench Pro and 95.0% SWE-bench Verified (vendor). Adds safety classifiers for biology, cybersecurity and LLM R&D that its sibling lacks. It marks the true coding ceiling of 2026, yet nobody can currently run it.
    Coding: Highest coding scores on record (80.3% SWE-bench Pro vendor).
    Latest: Suspended worldwide on the evening of 12 June 2026 under a US export-control directive; see Anthropic’s Project Glasswing / Fable-Mythos access notices.
  2. Claude Mythos 5 / Mythos Preview (Anthropic (USA), June 2026 (suspended / preview only))
    Type: Closed · Frontier-tier (no extra classifier) | Context: Frontier | Price: Not publicly available.
    Unique: The restricted, no-classifier sibling of Fable 5. Claude Mythos Preview currently leads GPQA Diamond (94.6%), the most discriminating reasoning benchmark at the frontier, on the LLM-Stats leaderboard. In limited use by a small number of trusted organisations under Project Glasswing.
    Coding: Frontier-class; shares Fable 5’s underlying model.
    Latest: Access temporarily suspended in response to the same export-control directive.

Tier analysis: The Mythos tier is the most important model story of 2026 precisely because almost nobody can use it. Anthropic’s decision to ship a genuine frontier-tier system and then suspend it within seventy-two hours, in compliance with a US export-control directive, crystallised a new reality: capability and access have decoupled. The strongest models on every leaderboard are, as of this writing, unavailable to the public. For strategists this is clarifying rather than frustrating — it means the practical planning ceiling is the available closed frontier, and the apex tier should be treated as a signal of trajectory (how fast the absolute boundary is moving) rather than a procurable resource. The existence of a paired ‘Fable’ variant, identical in capability but carrying additional safety classifiers for biology, cybersecurity and AI R&D, also signals where the industry is heading: at the apex, safety tooling is shipped as a differentiated product line, not an afterthought.

———

Tier 3 — The Open-Weight Frontier (Chinese Labs)
Open weights stopped being the budget alternative in 2026 and became a default. Chinese labs hold four of the top five open-weight positions, pricing API access 5–30x below Western equivalents and offering downloadable weights for self-hosting.

  1. GLM-5.2 (Z.ai / Zhipu AI (China), 13 June 2026)
    Type: Open-weight (MIT) · 744B params | Context: 1M tokens | Price: ~1/6 of GPT-5.5 pricing; needs ~800GB VRAM (≈8 H200s) to self-host at full precision.
    Unique: The new open-weight leader on the Artificial Analysis Intelligence Index (scores 51, ranks 5th overall). Built for long-horizon coding agents — can work independently for up to 8 hours in a single agentic run, beating GPT-5.5 on several long-horizon coding benchmarks at roughly one-sixth the price. Zhipu was the first to train a frontier model entirely on Huawei Ascend chips with no Nvidia hardware.
    Coding: SOTA-class open-weight coding; strong judgment on ambiguous problems and sustained iteration over thousands of tool calls.
    Latest: Successor to GLM-5 / GLM-5.1. GLM-4.7-Flash (30B/3B active) runs on consumer GPUs.
  2. DeepSeek V4 (Pro / Flash) (DeepSeek (China), 24 April 2026)
    Type: Open-weight (MIT) · MoE, up to 1.6T | Context: 1M tokens | Price: Cheapest credible option — V4 Flash ~$0.14 in / $0.28 out per 1M (cache-hit input as low as ~$0.003).
    Unique: The price/performance throne and the strongest free, self-hostable frontier-adjacent model (80.6% SWE-bench Verified). Pro and Flash variants reset the cost floor, and V4 leads competitive-programming benchmarks (LiveCodeBench 93.5, Codeforces 3206), outscoring even some closed frontier APIs. The 2025 ‘DeepSeek moment’ shifted the global narrative from ‘compute is everything’ to ‘architecture and training efficiency matter as much’.
    Coding: Leads all evaluated models on LiveCodeBench and Codeforces; V4 Flash (13B active) fits a small multi-GPU box.
    Latest: V4 Preview shipped with Pro, Flash, 1M context, API access and open weights.
  3. Kimi K2.6 (Moonshot AI (China), 20 April 2026)
    Type: Open-weight · Agentic multimodal | Context: Long | Price: Cache-hit input as low as ~$0.16/M; 4–17x cheaper than GPT-5.4.
    Unique: The agent-swarm pick: an ‘Agent Swarm’ architecture coordinating up to ~100 parallel sub-agents, built for long-horizon coding, design, visual reasoning and autonomous execution. Became the first open-weight model to beat GPT-5.4 (xhigh) on SWE-bench Pro. Cost-efficient sub-agent work is its standout — running many parallel instances without the inference bill becoming the constraint.
    Coding: 58.6% open-weight SWE-bench Pro (April); 88.7 on BenchLM’s coding category; excellent at multi-file editing inside an agent harness.
    Latest: Builds on Kimi K2.5 (notable for the Cursor Composer 2 attribution episode). K2.7 Code adds a coding-specialised variant.
  4. MiniMax M3 (MiniMax (China), 1 June 2026)
    Type: Open-weight (modified MIT) · Native multimodal | Context: 1M tokens | Price: Competitive open-weight pricing; check modified-MIT terms before redistributing.
    Unique: Bet on a cheap 1M-token context plus native multimodality in a single open-weight model — if you feed whole images or documents, its native multimodal window beats bolting retrieval onto a text-only model. Tops the open-weight SWE-bench Pro at 59.0% (June 2026), edging past Kimi K2.6.
    Coding: Highest open-weight SWE-bench Pro at launch (59.0%).
    Latest: Weights rolled out mid-June 2026; successor to the M2.5/M2.7 line.
  5. Qwen 3.6 Plus / Qwen3-Coder-Next (Alibaba Cloud (China), 2026 (rolling))
    Type: Open-weight (Apache 2.0) · MoE | Context: 1M tokens | Price: Open weights free to self-host; hosted variants among the cheapest.
    Unique: The accessibility and multilingual champion. Compact Mixture-of-Experts designs run on a single GPU (Qwen 3.6 ≈3B active) with strong tool calling and vision; unmatched Chinese/Japanese/Korean processing. Qwen3-Coder-Next (80B total, 3B active) is the best practical local coding-agent pick; Qwen 3.6 Plus offers a frontier-competitive 1M-token window for demanding agentic coding.
    Coding: Qwen3-Coder-480B-A35B trained on 7.5T tokens (70% code), SOTA among open coding models; Qwen 3.6 Plus close to closed frontier on coding benchmarks.
    Latest: Widest size range of any family (9B–397B). Qwen 4 family expected Q3/Q4 2026.

Tier analysis: The Chinese open-weight wave is the single biggest structural change in the 2026 market. Restricted from the most advanced Nvidia hardware since late 2022, these labs were forced to compete on architectural and training efficiency — and the resulting breakthroughs, from DeepSeek’s low-cost training to Zhipu training GLM-5.2 entirely on Huawei Ascend chips, now benefit the whole industry. The practical upshot is that frontier-adjacent quality is available with downloadable weights, permissive MIT or Apache licences, and API pricing five to thirty times below Western equivalents. Each lab has carved a distinct niche: Zhipu (GLM) leads open-weight intelligence and long-horizon coding; DeepSeek owns price and competitive-programming reasoning; Moonshot (Kimi) leads agent-swarm parallelism; MiniMax bets on cheap native multimodality at million-token scale; and Alibaba (Qwen) offers the widest size range and the strongest multilingual coverage. The one genuine caveat is the data path: routing sensitive data through Chinese-hosted APIs raises jurisdictional concerns, but self-hosting the open weights inside your own boundary removes that issue while keeping the cost and control advantages.
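
Self-hosting claims like the GLM-5.2 figure above (744B parameters, ~800GB VRAM, ≈8 H200s) can be sanity-checked with weights-only arithmetic: memory is roughly parameter count times bytes per parameter. The sketch below assumes 141 GB of memory per H200 and a hypothetical 10% overhead for runtime buffers; the function names and the overhead value are illustrative, and real deployments also need room for the KV cache, which grows with context length.

```python
import math

H200_MEMORY_GB = 141  # memory capacity of one NVIDIA H200, in GB

# Bytes needed to store one parameter at common serving precisions.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 params * bytes_per_param / 1e9 bytes-per-GB
    return params_billion * BYTES_PER_PARAM[precision]


def gpus_needed(
    params_billion: float,
    precision: str,
    gpu_memory_gb: float = H200_MEMORY_GB,
    overhead: float = 1.1,  # hypothetical 10% allowance for runtime buffers
) -> int:
    """Smallest GPU count whose combined memory holds weights plus overhead."""
    total_gb = weight_memory_gb(params_billion, precision) * overhead
    return math.ceil(total_gb / gpu_memory_gb)


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(744, precision)
        print(f"744B @ {precision}: {gb:.0f} GB weights, "
              f"{gpus_needed(744, precision)} x H200 minimum")
```

On this arithmetic, eight H200s (1,128 GB in total) hold 744B parameters at 8 bits per weight with headroom left for the KV cache, but not at 16 bits per weight, so the quoted ~800GB figure looks closer to an 8-bit deployment. That reading is an inference from the numbers, not a statement from the vendor.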

———

Tier 4 — Western Open-Weight & Specialist Labs
Open-weight options outside China — strongest on permissive licensing, data residency, local deployment and reasoning. The picks for air-gapped, EU-resident or consumer-GPU workloads.

  1. Llama 4 (Scout / Maverick) (Meta (USA), 2025–2026)
    Type: Open-weight (Llama 4 Community License) · Multimodal MoE | Context: Up to 10M (Scout) | Price: Free weights; community-license restrictions on very large commercial users.
    Unique: The open-weight multimodal workhorse and the longest-context model on the board — Llama 4 Scout advertises a 10M-token window, Maverick 1M. Meta’s scale and ecosystem make Llama the default starting point for open multimodal fine-tuning and on-prem deployment.
    Coding: Solid general coding; rounds out the self-host options alongside GLM and Qwen.
    Latest: Scout and Maverick are the current open-weight multimodal releases under Meta’s community license.
  2. Mistral Large 3 / Medium 3.5 / Small 4 (Mistral AI (France/EU), 2025–2026)
    Type: Open-weight (Apache 2.0 on Small) + commercial | Context: Large | Price: Apache 2.0 on Small (fully permissive commercial); competitive API tiers.
    Unique: The European pick for EU data residency and compliance-bound deployments. A full lineup — Large 3 and Medium 3.5 for capability, Small 4 (Apache 2.0) and Devstral 2 for local/coding. Mistral co-leads NVIDIA’s open Nemotron Coalition. The strongest enterprise-ecosystem case in the EU.
    Coding: Devstral 2 (123B dense) and Devstral Small 2 are purpose-built coding models for local agents.
    Latest: Small 4 and Large 3 priced competitively for open-weight multimodal models; part of the Nemotron Coalition.
  3. NVIDIA Nemotron 3 (NVIDIA (USA), March 2026 (GTC))
    Type: Open-weight (Nemotron Open Model License) · Hybrid Mamba-Transformer | Context: 1M tokens | Price: Open weights; engineered for throughput-per-dollar.
    Unique: A hybrid Mamba-2-Transformer MoE with Multi-Token Prediction — Mamba-2 layers process most of the sequence in linear time, making a 1M-token window practical and cheap. Nemotron 3 Nano delivers ~3.3x higher throughput than Qwen3-30B-A3B on an H200 at comparable quality, with a 1M context that runs on consumer GPUs. Anchors NVIDIA’s Nemotron Coalition (with Mistral, Perplexity, Cursor, LangChain).
    Coding: Nemotron 3 Super for free hosted coding use; strong instruction-following after the ‘thinking’ release.
    Latest: Trained on 25T tokens; Nano variant is a standout consumer-GPU option.
  4. Google Gemma 4 (Google DeepMind (USA), 2 April 2026)
    Type: Open-weight (Apache 2.0) · Multimodal | Context: Large | Price: Free weights, fully permissive commercial use.
    Unique: The cleanest local and commercial-license open pick, relicensed to Apache 2.0 (replacing the more restrictive Gemma Terms). Configurable thinking modes, all sizes multimodal with variable aspect-ratio/resolution support, built-in function calling. The 26B MoE activates only ~4B params per token, which suits high-throughput reasoning on consumer GPUs.
    Coding: Capable general-purpose coder; strong for on-device and edge inference.
    Latest: April 2026 Apache 2.0 relicense was the headline change.
  5. DeepSeek R1 (DeepSeek (China), Jan 2025 (still widely used))
    Type: Open-weight (MIT) · Reasoning | Context: Large | Price: Open weights; very cheap hosted.
    Unique: The model that proved a ~$6M training run could briefly match the top US model (Feb 2025) and became the most-downloaded free app in the US. Still one of the strongest open reasoning models under MIT, and a key historical inflection point — it reframed the entire industry around training efficiency.
    Coding: Strong algorithmic reasoning; the foundation the V3/V4 line built on.
    Latest: Largely superseded by V4 for production, but historically pivotal and still deployed.

Tier analysis: Outside China, open-weight development is led by a different set of motivations: permissive licensing, data residency, local deployment and architectural experimentation. Meta’s Llama line made open-weight a serious category and still offers the longest context windows on the board. Mistral is the European anchor, the natural pick for EU data-residency requirements and a co-leader of NVIDIA’s open Nemotron Coalition. NVIDIA’s own Nemotron 3 demonstrates that hybrid Mamba-Transformer architectures can make million-token context cheap enough to run on consumer GPUs. Google’s Gemma 4, relicensed to Apache 2.0, is the cleanest local-and-commercial pick. And DeepSeek R1 belongs here as a historical hinge — the model that proved a few-million-dollar training run could briefly match the best of the West and reframed the entire industry around efficiency. Together these models give organisations genuine choice in deployment model, licence and hardware footprint that simply did not exist eighteen months earlier.
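
Several entries in this tier quote both total and active parameter counts, and the gap between the two is what makes Mixture-of-Experts models cheap to run. The sketch below is a rough illustration using the figures quoted above (Gemma 4's 26B MoE with ~4B active, Qwen3-30B-A3B, and the dense 123B Devstral 2). The `ModelSize` class is hypothetical, and the "2 FLOPs per active parameter per token" figure is a common rule of thumb that ignores attention cost over long contexts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSize:
    name: str
    total_params_b: float   # total parameters, in billions
    active_params_b: float  # parameters used per token, in billions

    @property
    def active_fraction(self) -> float:
        """Share of the weights that participate in each token's forward pass."""
        return self.active_params_b / self.total_params_b

    @property
    def gflops_per_token(self) -> float:
        """Rule-of-thumb forward cost: ~2 FLOPs per active parameter per token."""
        return 2.0 * self.active_params_b


MODELS = [
    ModelSize("Gemma 4 26B MoE", 26, 4),
    ModelSize("Qwen3-30B-A3B", 30, 3),
    ModelSize("Devstral 2 (dense)", 123, 123),
]

if __name__ == "__main__":
    for m in MODELS:
        print(f"{m.name:<20} active {m.active_fraction:6.1%}  "
              f"~{m.gflops_per_token:.0f} GFLOPs/token")
```

The design point this illustrates: per-token compute scales with active parameters, while memory footprint still scales with total parameters, because every expert must be resident even though only a few are used for each token.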

———

Tier 5 — Image, Video & Audio Generation
Multimodal generation fractured into specialised tools in 2026: aesthetics, prompt-accuracy, photorealism, text-in-image and video each have a different leader. Multimodal input is now a floor, not a differentiator — these models compete on output quality.

  1. Midjourney V8 / V8.1 (Midjourney (USA), V8 March 2026, V8.1 alpha April 2026)
    Type: Closed · Text-to-image | Context: — | Price: $10/mo Standard, $30/mo Pro. No public API (a hard blocker for product integration).
    Unique: The undisputed leader for artistic and aesthetic quality — gallery-worthy portraits, cinematic concept art, unmatched stylistic depth. V8 adds native 2K resolution, HD Mode and ~5x faster rendering. The style-reference (sref) and moodboard system locks a visual direction across an entire project like no competitor.
    Latest: V8.1 entered alpha 14 April 2026 with sharper textures and better prompt adherence.
  2. OpenAI GPT Image 2 (OpenAI (USA), April 2026)
    Type: Closed · Native multimodal image | Context: — | Price: ~$0.04–0.08 per image via the OpenAI API; full REST access.
    Unique: The prompt-accuracy and text-rendering leader (~99% accuracy, multilingual). Generates natively inside the language model rather than via diffusion, so plain-English instructions execute precisely; excellent at complex multi-element prompts and consistent multi-image series. Carries Content Authenticity tags.
    Latest: Second successor to DALL-E 3 in the lineage (DALL-E 3 → GPT Image 1.5 → GPT Image 2). DALL-E API support ended May 2026. Built natively into the GPT-5.x line.
  3. FLUX.2 (Pro / dev / klein) (Black Forest Labs (Germany), 2026)
    Type: Open-weight (dev) + API · Image | Context: — | Price: ~$0.03–0.05 per image (Replicate/fal.ai/BFL); self-hostable.
    Unique: The photorealism leader — skin textures, lighting physics and natural depth of field that fool professional photographers. The pragmatic API choice: predictable per-image pricing, no cold starts, LoRA support. FLUX.2 dev is open-weight on HuggingFace; FLUX.2 klein does sub-second generation on consumer GPUs.
    Latest: FLUX.2 succeeded the 1.1 Pro line; Schnell and klein variants target speed and local use.
  4. Google Veo 3 / 3.1 (Google DeepMind (USA), 2025–2026)
    Type: Closed · Text/image-to-video | Context: — | Price: Via Google’s AI subscriptions and Vertex AI.
    Unique: A leading video generator with emergent zero-shot abilities — tested across 18,000+ generated videos, it simulated buoyancy and solved mazes without being trained on those tasks (‘video models are zero-shot learners’). Reference mode keeps characters consistent across shots.
    Latest: Veo 3.1 is the current release; integrated into Google’s creative stack and partner platforms.
  5. Stable Diffusion 3.5 (Stability AI (UK), 2024–2026)
    Type: Open-weight · Image | Context: — | Price: Effectively free per image (hardware required).
    Unique: The most open and customizable platform — full model weights, unlimited fine-tuning, the largest ecosystem of community checkpoints and extensions. The 4-step Flash variant runs on mobile hardware, making on-device generation viable for privacy-sensitive work. The choice for maximum control and zero per-image cost at volume.
    Latest: Version 3.5 is sustained by open-weight community models; no longer a top-five mainstream default but irreplaceable for local pipelines.
  6. Ideogram 3.0 (Ideogram (USA), 2026)
    Type: Closed · Image (typography specialist) | Context: — | Price: ~$20/mo tiers; API available.
    Unique: Solves the text-in-image problem that plagued every earlier generator — 90–95% text-rendering accuracy versus 30–40% for Midjourney. The specialist pick for logos, signs, labels, posters and any branded content where typography must be legible.
    Latest: Notably improved photorealism alongside its core text strength.

Tier analysis: Generative media has fractured into specialists, and that fragmentation is the key insight. Unlike the language-model frontier, where a single flagship is broadly excellent, the image and video market rewards different leaders on different axes: Midjourney for aesthetic and artistic output, GPT Image 2 for precise prompt-following and text rendering, FLUX.2 for photorealism and API-first integration, Ideogram for typography, Stable Diffusion for customization and on-device privacy, and Veo for video with emergent physical reasoning. Professional studios therefore run a stack rather than a single tool, routing each job to the model that wins its category. Two deeper shifts underpin this: native multimodal generation inside language models (GPT Image 2) is displacing pure diffusion for instruction-heavy work, and per-image API pricing around three to five US cents has turned generation from a standalone subscription into something embedded directly in products and pipelines.
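
The "stack rather than a single tool" pattern described above amounts to a lookup from job category to the category leader, with a fallback when the leader cannot be called programmatically. The sketch below is one hypothetical way to express it. The category keys, the `pick_model` function and the choice of FLUX.2 as the fallback are illustrative assumptions; the model-to-category assignments and Midjourney's lack of a public API are taken from the entries in this tier.

```python
# Category leaders as named in this tier's analysis.
MEDIA_STACK = {
    "aesthetic": "Midjourney V8",
    "prompt_accuracy": "GPT Image 2",
    "photorealism": "FLUX.2",
    "typography": "Ideogram 3.0",
    "local_custom": "Stable Diffusion 3.5",
    "video": "Veo 3.1",
}

# Per the article, Midjourney has no public API, so automated pipelines
# cannot call it directly and need a substitute.
NO_PUBLIC_API = {"Midjourney V8"}


def pick_model(job_category: str, needs_api: bool = False,
               fallback: str = "FLUX.2") -> str:
    """Return the category leader, or the fallback if an API is required
    and the leader does not offer one. Raises KeyError for unknown categories."""
    model = MEDIA_STACK[job_category]
    if needs_api and model in NO_PUBLIC_API:
        return fallback
    return model


if __name__ == "__main__":
    print(pick_model("typography"))
    print(pick_model("aesthetic"))
    print(pick_model("aesthetic", needs_api=True))
```

A human designer working interactively would get Midjourney for an aesthetic job, while the same job submitted by an automated product pipeline would be redirected to an API-accessible model.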

———

Tier 6 — Legacy & Foundational Models (Historical Context)
The models that defined earlier eras and are now mid-tier or retired — but whose architectures and ‘moments’ shaped everything above. Understanding them frames how fast the frontier has moved.

  1. GPT-4 / GPT-4o (OpenAI (USA), 2023–2024)
    Type: Closed · Legacy flagship | Context: Up to 128K | Price: Largely retired in favour of GPT-5.x.
    Unique: The model that brought generative AI into the mainstream and set the template for the modern assistant — multimodal ‘omni’ input (text, vision, audio), tool use and the ChatGPT product that became one of the fastest-growing apps in history. Every flagship since is, in part, a response to GPT-4.
    Coding: Strong for its era; set the baseline SWE-bench numbers that later models quadrupled.
    Latest: Superseded by the GPT-5 family; historically the reference point for ‘frontier’.
  2. Claude 3 / 3.5 (Opus & Sonnet) (Anthropic (USA), 2024)
    Type: Closed · Legacy flagship | Context: 200K | Price: Superseded by Claude 4.x.
    Unique: Established Anthropic’s long-context, safety-forward, writing-strong identity and the Opus/Sonnet/Haiku tiering that persists today. Claude 3.5 Sonnet was the model that made Claude the developer-tooling favourite, seeding the Cursor/Windsurf/Claude Code ecosystem.
    Coding: First Claude generation to dominate developer tooling.
    Latest: Direct ancestor of the Opus 4.x and Mythos line.
  3. Gemini 1.5 Pro (Google DeepMind (USA), 2024)
    Type: Closed · Legacy multimodal | Context: 1M (then 2M) | Price: Superseded.
    Unique: Pioneered the ultra-long context window at consumer scale — first to ship 1M then 2M tokens — proving that whole books, codebases and hours of video could fit in a single prompt. Set the long-context bar the whole industry then chased.
    Coding: Solid; notable for repository-scale context.
    Latest: Ancestor of the Gemini 3.x line.
  4. Llama 2 / Llama 3 (Meta (USA), 2023–2024)
    Type: Open-weight · Legacy | Context: Up to 128K (Llama 3) | Price: Free weights.
    Unique: The releases that made open-weight a serious category. Llama 2’s commercially usable license and Llama 3’s quality catalysed the entire local-LLM and fine-tuning ecosystem, the foundation on which today’s open-weight frontier (and most academic work) was built.
    Coding: Baseline open coding capability for its era.
    Latest: Superseded by Llama 4.
  5. DALL-E 3 (OpenAI (USA), Sept 2023 (deprecated May 2026))
    Type: Closed · Legacy image | Context: — | Price: Retired.
    Unique: Brought high-quality, prompt-faithful image generation to a mass audience inside ChatGPT and made ‘type a sentence, get a picture’ a mainstream expectation. Its deprecation in May 2026 — replaced by GPT Image 2 — marks the end of the first mainstream text-to-image era.
    Latest: Deprecated 12 May 2026; superseded by GPT Image 1.5 then GPT Image 2.
  6. GPT-3.5 / InstructGPT (OpenAI (USA), 2022)
    Type: Closed · Legacy | Context: 4K–16K | Price: Retired.
    Unique: The model behind the original ChatGPT launch (Nov 2022) that started the entire consumer-AI wave. Demonstrated that RLHF-tuned instruction-following could turn a raw language model into a usable assistant — the single most consequential product release in the field’s history.
    Coding: Modest by today’s standards; historically transformative.
    Latest: Long superseded; the origin point of the current era.

Tier analysis: The legacy tier exists to calibrate just how fast the frontier has moved. GPT-3.5 and the original ChatGPT launch in late 2022 began the consumer-AI era by proving that RLHF-tuned instruction-following could turn a raw model into a usable assistant. GPT-4 set the template for the modern multimodal assistant. Claude 3.5 Sonnet seeded the developer-tooling ecosystem that Claude Code now dominates. Gemini 1.5 Pro pioneered the million-token context window that the whole industry then chased. Llama 2 and 3 made open-weight viable. DALL-E 3 made text-to-image mainstream before being retired in 2026. None of these is a sensible production choice today, but every model in the earlier tiers is, in a real sense, a response to one of them — and the benchmark numbers they posted are the ones their successors have since multiplied several-fold.

———

Tier 7 — Notable Specialists & Rising Contenders
Models that don’t top the overall index but lead a specific axis — cost floors, speed records, legal reasoning, or fast-rising open challengers worth tracking.

  1. Mercury 2 / Step 3.5 Flash (Inception Labs / StepFun (USA/China), 2026)
    Type: Speed & cost specialists | Context: Varies | Price: Among the cheapest and fastest available.
    Unique: The throughput and price-floor record-holders. Mercury 2 (a diffusion-based LLM) posts the highest output throughput measured — ~729 tokens/sec — making it ideal for streaming UIs and tight agentic loops. Step 3.5 Flash set a ~$0.10/M price floor with math reasoning comparable to far pricier models, pushing the whole market toward sub-$0.05/M.
    Coding: Step 3.5 Flash strong on math; Mercury optimised for latency-critical serving.
    Latest: Part of the 2026 price-and-speed compression wave.
  2. GPT-5 nano / Grok 4 Fast / Gemini Flash (OpenAI / xAI / Google, 2025–2026)
    Type: Small fast tiers of the flagships | Context: Up to 2M (Grok 4 Fast) | Price: From $0.05/M (GPT-5 nano).
    Unique: The high-volume routing tier that quietly runs most production AI. GPT-5 nano ($0.05/M) and DeepSeek V3 ($0.27/M) reset what capable inference costs; Grok 4 Fast exposes the largest practical context window at 2M tokens; Gemini Flash variants produce images and text in 3–5 seconds. The strategic move in 2026 is tiered routing — cheap-and-fast for everyday calls, premium models for final-draft quality.
    Coding: Good-enough coding for high-volume, low-stakes calls; the workhorses of model-routing gateways.
    Latest: Continuous price compression; routing gateways can cut spend ~50% without quality loss.

Tier analysis: The final tier collects models that do not top the overall index but own a single axis, and these are often the most commercially relevant choices. Speed and price specialists — Mercury 2 with record throughput, Step 3.5 Flash setting price floors — make latency-critical and high-volume workloads economical. The small, fast tiers of the flagships (GPT-5 nano, Grok 4 Fast, Gemini Flash) are the quiet workhorses that actually run most production traffic, and they are the foundation of the tiered-routing strategy that defines 2026 cost management. The lesson of this tier is that ‘best overall’ is frequently the wrong question; for a given workload, the right question is which model clears the necessary quality bar at the lowest cost and latency.
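
The closing question of this tier, which model clears the necessary quality bar at the lowest cost and latency, can be written as a small constrained selection. The sketch below is a minimal illustration of that idea, not a production router. The `Candidate` class, the tier names, and all quality and latency numbers are hypothetical placeholders that you would replace with measurements from your own evaluation set; only the $0.05, $0.27 and $15 per-million prices echo figures quoted in this article.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float          # hypothetical 0-100 score on your own eval set
    usd_per_million: float  # price in USD per 1M tokens
    latency_s: float        # hypothetical typical response latency, seconds


def cheapest_adequate(
    candidates: Sequence[Candidate],
    min_quality: float,
    max_latency_s: float,
) -> Optional[Candidate]:
    """Pick the cheapest candidate that meets both constraints.

    Ties on price are broken by lower latency. Returns None if nothing qualifies.
    """
    eligible = [
        c for c in candidates
        if c.quality >= min_quality and c.latency_s <= max_latency_s
    ]
    return min(eligible, key=lambda c: (c.usd_per_million, c.latency_s), default=None)


# Illustrative tiers only; quality and latency values are made up.
CANDIDATES = [
    Candidate("nano-tier", quality=55, usd_per_million=0.05, latency_s=0.4),
    Candidate("mid-tier", quality=70, usd_per_million=0.27, latency_s=1.0),
    Candidate("flagship-tier", quality=90, usd_per_million=15.0, latency_s=4.0),
]

if __name__ == "__main__":
    for bar, limit in ((50, 2.0), (65, 2.0), (85, 5.0), (85, 2.0)):
        choice = cheapest_adequate(CANDIDATES, bar, limit)
        print(f"quality>={bar}, latency<={limit}s ->",
              choice.name if choice else "no model qualifies")
```

Framing the choice this way makes the trade-off explicit: raising the quality bar moves traffic up a tier, and a request that needs flagship quality under a tight latency budget may have no valid option at all.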
