The State of AI Models: Comprehensive Capability Assessment (March 2026)
A professional assessment of frontier AI capabilities across text, speech, image, video, and multimodal domains as of March 2026, with performance metrics and source references.
Executive Summary
The AI landscape in March 2026 is characterized by convergence across modalities, efficiency breakthroughs in smaller models, and specialized reasoning capabilities. The "frontier" has fractured into competing ecosystems: proprietary closed-source models from major labs (OpenAI, Anthropic, Google) versus increasingly capable open-source alternatives (Qwen3, DeepSeek-V3.2, Llama 4). This report assesses the state of the art across major capability domains with verified metrics and source links.
I. Large Language Models (Text Generation)
Frontier Models (Proprietary Closed-Source)
OpenAI GPT-5.4
- Release: March 5, 2026 (OpenAI announcement)
- Context Length: 1,000,000 tokens (API)
- Training Data Cutoff: Early 2026 (estimated)
- Key Metrics:
- MMLU: 88.5%
- WebArena-Verified: 67.3% (agentic browser control)
- Online-Mind2Web: 92.8% (screenshot-based browser automation)
- 33% fewer factual errors vs. GPT-5.2
- Multimodal: Yes (text, image, audio input; text, image output)
- Variants: GPT-5.4, GPT-5.4-mini, GPT-5.4-nano
- Availability: ChatGPT + API (developers.openai.com)
- Cost: $2.50/1M input tokens, $20.00/1M output tokens
- Note: GPT-4o, GPT-4.1, and o4-mini were retired from ChatGPT on February 13, 2026
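As a rough illustration of how a model like this might be invoked over an OpenAI-style chat-completions endpoint, the sketch below builds a request body only (no network call). The `gpt-5.4` model identifier and the exact request fields are assumptions based on the listing above, not confirmed API details; check the provider's API reference for the actual schema.

```python
import json

# Hypothetical request body for an OpenAI-style chat-completions call.
# The model id "gpt-5.4" is assumed from the listing above.
payload = {
    "model": "gpt-5.4",
    "messages": [
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Summarize the trade-offs of 1M-token context windows."},
    ],
    "max_tokens": 512,
}

# Serialize to JSON as it would be sent over HTTP.
body = json.dumps(payload)
print(len(body), "bytes in request body")
```

At the listed prices, the input side of a request like this would cost fractions of a cent; output tokens dominate the bill for long generations.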
Anthropic Claude Opus 4.6 / Sonnet 4.6
- Release: February 5, 2026 (Anthropic announcement)
- Context Length: 1,000,000 tokens (beta, first Opus-class model with 1M context)
- Max Output: 128,000 tokens
- Key Metrics:
- MMLU: 91.3%
- GPQA Diamond: 91.3%
- HumanEval: 95.4%
- SWE-Bench Verified: 80.2%
- AIME 2025: 94.2%
- MATH: 97.2%
- Terminal-Bench 2.0: 65.4%
- ARC-AGI: 68.8%
- HLE (Humanity's Last Exam): 53%
- MRCR v2 (8-needle 1M): 76% (vs. Sonnet 4.5's 18.5%)
- Multimodal: Yes (text, image input; text output)
- Features: Adaptive Thinking architecture, Compaction for long-running agentic tasks
- Availability: Claude.ai, API (platform.claude.com)
- Cost: $5.00/1M input tokens, $25.00/1M output tokens
- Sonnet 4.6: SWE-Bench 79.6%, Finance Agent v1.1 63.3% (leads all models); released February 17, 2026
Google Gemini 3.1 Pro
- Release: February 19, 2026 (Google blog, Model card)
- Context Length: 1,000,000 tokens
- Max Output: 66,000 tokens
- Key Metrics:
- MMLU: 98%
- MMLU-Pro: 90.5%
- GPQA Diamond: 94.3%
- HumanEval: 94%
- SWE-Bench: 80.6%
- AIME 2025: 95%
- ARC-AGI-2: 77.1% (more than 2x Gemini 3 Pro)
- HLE: 44.4%
- Terminal-Bench: 68.5%
- Multimodal: Yes (text, image, audio, video input; text output)
- Features: DeepThink three-tier reasoning system (Low/Medium/High)
- Availability: Google AI Studio, Vertex AI
- Cost: $2.00/1M input tokens, $12.00/1M output tokens
- Note: Gemini 3 Pro Preview was shut down March 9, 2026
Open-Source Frontier (Competitive)
Qwen3.5-397B-A17B
- Architecture: Mixture-of-Experts, 397B total / 17B activated per token
- Release: March 2026 (latest open-source)
- Multimodal: Text-only (vision via separate Qwen3-VL series)
- Licensing: Apache 2.0 (fully open-source)
- Availability: Qwen/Qwen3.5-397B-A17B
- Technical Report: arXiv:2505.09388
DeepSeek-V3.2
- Architecture: Mixture-of-Experts, 671B total / 37B activated per token
- Release: Late 2025 (DeepSeek announcement)
- Variants:
- V3.2: Balanced inference; the daily-driver variant with GPT-5-level performance
- V3.2-Speciale: Maximum reasoning depth; gold-medal-level results at IMO, CMO, ICPC World Finals, and IOI 2025
- Key Metrics (V3.2):
- MMLU-Pro: 81.2%
- GPQA Diamond: 68.4%
- AIME: 59.4%
- LiveCodeBench: 49.2%
- Features: First DeepSeek model with thinking integrated into tool-use; trained on 1,800+ environments & 85K+ complex instructions
- Open-Source: deepseek-ai/DeepSeek-V3.2
- Technical Report: arXiv:2512.02556
Meta Llama 4
- Release: April 5, 2025 (Meta blog, TechCrunch)
- Architecture: Mixture-of-Experts (MoE), natively multimodal
- Variants:
- Scout: 17B active / 109B total / 16 experts (Hugging Face)
- Maverick: 17B active / ~400B total / 128 experts (Hugging Face)
- Behemoth: 288B active / ~2T total / 16 experts (still in training)
- Multimodal: Yes (natively multimodal — text + image input)
- License: Llama 4 Community License (commercial use allowed)
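The efficiency appeal of MoE designs like these is that only a small fraction of weights is active per token. A quick sketch of that fraction for the variants listed above (approximate parameter totals from the text, treated as exact for illustration):

```python
# (active params, total params) per token for the Llama 4 variants above.
variants = {
    "Scout":    (17e9, 109e9),
    "Maverick": (17e9, 400e9),
    "Behemoth": (288e9, 2000e9),
}

for name, (active, total) in variants.items():
    # Only this fraction of weights participates in each forward pass.
    print(f"{name}: {active / total:.1%} of weights active per token")
```

Maverick's 128-expert layout activates the smallest share of its weights, which is why it can rival much larger dense models at a fraction of the inference cost.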
Performance Comparison Chart: Frontier LLMs (March 2026)
| Model | MMLU | HumanEval | AIME 2025 | SWE-Bench | Context | Multimodal |
|---|---|---|---|---|---|---|
| GPT-5.4 | 88.5% | — | — | — | 1M | Yes |
| Claude Opus 4.6 | 91.3% | 95.4% | 94.2% | 80.2% | 1M | Yes |
| Gemini 3.1 Pro | 98% | 94% | 95% | 80.6% | 1M | Yes |
| Qwen3.5-397B | — | — | — | — | 32K | No |
| DeepSeek-V3.2 | 81.2%* | — | 59.4% | — | 128K | No |
| Llama 4 Maverick | — | — | — | — | 128K | Yes |
*MMLU-Pro metric (not MMLU)
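Using the API prices quoted earlier in this section, a per-request cost comparison for the three proprietary frontier models can be sketched directly. The 50K-input / 2K-output request size is a hypothetical workload, and base rates are assumed to apply uniformly (long-context surcharges ignored):

```python
# Per-million-token (input, output) prices as listed in this report.
prices = {
    "GPT-5.4":         (2.50, 20.00),
    "Claude Opus 4.6": (5.00, 25.00),
    "Gemini 3.1 Pro":  (2.00, 12.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed base rates."""
    in_rate, out_rate = prices[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

for model in prices:
    cost = request_cost(model, 50_000, 2_000)
    print(f"{model}: ${cost:.3f} per request")
```

At this workload shape, Gemini 3.1 Pro is the cheapest of the three and Claude Opus 4.6 the most expensive, roughly a 2.4x spread.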
Specialized Reasoning Models
OpenAI o3 / o4-mini
- Release: o3 (January 2025); o4-mini (April 2025)
- Specialty: Multi-step reasoning, mathematics, code generation
- Status: o4-mini retired from ChatGPT February 13, 2026; reasoning capabilities now integrated into GPT-5.4 Thinking mode
- Trade-off: Slower inference for deeper reasoning
DeepSeek-R1
- Release: January 2025; updated version R1-0528 (May 28, 2025)
- Specialty: Transparent reasoning chains with thinking markup
- Key Metrics: Performance comparable to OpenAI o1 across math, code, and reasoning tasks
- Unique Feature: Users can observe model thinking process
- Open-Source: deepseek-ai/DeepSeek-R1
QwQ-32B
- Release: March 2025 (full release); preview November 2024
- Specialty: Reasoning-intensive tasks, code, math
- Performance: Competitive with DeepSeek-R1 and o1-mini
- Advantage: 32B size makes it runnable on consumer hardware
- Availability: Qwen/QwQ-32B
II. Multimodal Vision-Language Models
Text-to-Image & Image Generation
OpenAI GPT-5.4 Image Generation (Native)
- Release: Evolved from GPT-4o image generation (OpenAI announcement)
- Key Details: Image generation is now native to the GPT model, not a separate DALL-E model
- Note: DALL-E 3 is scheduled for deprecation on May 12, 2026. No DALL-E 4 was released.
Midjourney v7
- Release: April 3, 2025; became default June 17, 2025 (Midjourney docs)
- Availability: midjourney.com
FLUX.2 (Black Forest Labs)
- Release: FLUX.1 (August 2024); FLUX.2 [klein] (January 15, 2026) — fastest image model to date (bfl.ai)
- Additional: FLUX.1 Kontext for in-context image editing (arXiv:2506.15742)
- Availability: Open-source (black-forest-labs/FLUX.1-dev) + API
Image Understanding & Vision-Language Models
GPT-5.4 Vision
- Capabilities: Native multimodal input (text, image, audio); 1M token context includes images
- Availability: OpenAI API
Claude Opus 4.6 Vision
- 1M Token Context: Images processed within the full context window
- Key Benchmarks:
- ChartQA: 93.4%
- DocVQA: 96.1%
- MMMU: 76.5%
- Availability: platform.claude.com
Gemini 3.1 Pro Vision
- Modalities: Text, image, audio, video input natively
- Key Benchmarks:
- MMMU: 81%
- MMMU-Pro: 81%
- ChartQA: 90%
- DocVQA: 92%
- Video Understanding: Native video input within 1M token context
- Availability: Google AI Studio
Qwen3-VL (Vision-Language)
- Release: September 2025
- Variants: Qwen3-VL-8B-Instruct, Qwen3-VL-235B-A22B-Instruct, Qwen3-VL-235B-A22B-Thinking
- Architecture: Vision transformer + Qwen3 LLM backbone
- Open-Source: Qwen3-VL collection | Qwen3-VL-8B-Instruct
- Source: github.com/QwenLM/Qwen3-VL
LLaVA-1.6
- Model Size Options: 7B, 13B, 34B
- Status: The most recent open-source LLaVA release remains version 1.6
- Advantage: Highly efficient; runs on consumer GPUs
- License: Apache 2.0
- Availability: liuhaotian/llava-v1.6-vicuna-13b
Gemma 3 4B (Multimodal)
- Release: 2025
- Modalities: Text + image input natively
- Context Window: 128K tokens
- Multimodal Benchmarks:
- DocVQA: 72.8%
- ChartQA: 63.6%
- TextVQA: 58.9%
- MMMU: 39.2%
- VRAM: ~2.5-3GB (lightweight multimodal for consumer hardware)
- Best For: Image understanding on edge devices, fine-tuning, efficient inference
- License: Gemma License (commercial use allowed)
- Availability: google/gemma-3-4b-it
III. Speech & Audio Models
Automatic Speech Recognition (ASR)
Meta Omnilingual ASR
- Release: November 2025 (VentureBeat)
- Capabilities:
- Transcribes 1,600+ languages natively
- 7B multilingual audio representation model
- Status: Latest Meta speech recognition model, succeeding SeamlessM4T
Text-to-Speech (TTS)
ElevenLabs Eleven v3
- Release: June 2025 (ElevenLabs blog)
- Languages: 32+
- Voices: 5,000+ available (custom voice cloning)
- API: elevenlabs.io/docs
- Also: Scribe v2 for speech-to-text
IV. Video Understanding & Generation
Video Understanding
Gemini 3.1 Pro Video
- Input: Native video within 1M token context
- Modalities: Text, image, audio, video input
- Availability: Google AI Studio, Vertex AI
GPT-5.4 Video
- Status: Native multimodal input including video
- Availability: API via developers.openai.com
Video Generation
OpenAI Sora 2
- Release: September 30, 2025 (OpenAI announcement)
- Capabilities:
- More physically accurate and realistic than Sora 1
- Synchronized dialogue and sound effects
- More controllable camera and scene control
- Availability: US and Canada; expanding to more regions
Google Veo 3
- Release: 2025 (latest version, succeeding Veo 2)
- Details: State-of-the-art video generation model (deepmind.google/models/veo)
- Veo 2: Available for developers via API since April 2025 (Google Developers Blog); supports 4K resolution
Runway Gen-4.5
- Release: December 1, 2025
- Features:
- Incremental evolution over Gen-4 (April 2025)
- Text-to-video, image-to-video
- Motion control, multi-shot generation
- Availability: runwayml.com
V. Specialized Model Categories
Code Generation
DeepSeek-V3.2 (Code capabilities)
- Code generation capabilities are integrated into the general DeepSeek-V3.2 model
- Thinking in Tool-Use: First DeepSeek model to integrate reasoning into tool-use
- Availability: deepseek-ai/DeepSeek-V3.2
- Note: Separate DeepSeek-Coder-V2 (deepseek-ai/DeepSeek-Coder-V2) remains available but code capabilities are now folded into V3.2
GitHub Copilot
- Model Backend: Supports multiple models including GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Pro (GitHub Docs)
- Specialized Models: GPT-5.1-Codex and GPT-5.1-Codex-Mini for coding tasks
- Availability: GitHub Copilot subscription ($10/month for individuals)
Math & Scientific Reasoning
Phi-4-Reasoning (Microsoft)
- Release: May 2025 (Microsoft blog)
- Size: 14B parameters
- Performance: Outperforms OpenAI o1-mini and DeepSeek-R1-Distill-Llama-70B on most math/science reasoning benchmarks; exceeds full DeepSeek-R1 (671B) on AIME 2025
- Variants: Phi-4, Phi-4-Mini, Phi-4-Multimodal, Phi-4-Reasoning, Phi-4-Reasoning-Plus
VI. Efficiency & On-Device Models
Small Models (Under 10B) Reaching Frontier Performance
Gemma 3 4B (Google)
- Release: 2025
- Multimodal: Yes (text + image input, text output)
- Context Window: 128K tokens
- Key Metrics:
- MMLU: 59.6%
- HumanEval: 36.0%
- DocVQA: 72.8%
- ChartQA: 63.6%
- VRAM: ~2.5-3GB (quantized Q4_K_M)
- Best For: Image understanding, lightweight deployments, custom fine-tuning
- License: Gemma License (commercial use allowed)
- Model: google/gemma-3-4b-it
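The VRAM figures quoted for quantized models can be sanity-checked with simple arithmetic. The estimator below assumes Q4_K_M averages about 4.8 bits per weight and adds roughly 20% overhead for KV cache and activation buffers; both figures are approximations for illustration, not measured values.

```python
def est_vram_gb(params_billion: float, bits_per_weight: float = 4.8,
                overhead: float = 0.20) -> float:
    """Rough VRAM estimate in GB for a quantized model.

    bits_per_weight ~4.8 approximates Q4_K_M's mixed 4/6-bit layout;
    overhead covers KV cache and activations. Both are assumptions.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9

print(f"Gemma 3 4B: ~{est_vram_gb(4.0):.1f} GB")  # consistent with the ~2.5-3 GB listed
print(f"Qwen3.5-9B: ~{est_vram_gb(9.0):.1f} GB")  # consistent with the ~6.5-7.5 GB listed
```

Actual usage varies with context length (KV cache grows linearly with tokens held in context), so treat these as lower-bound ballparks.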
Qwen3.5-9B
- Release: March 2026 (latest small model)
- Metrics:
- MMLU-Pro: 79.1%
- Math (HMMT Feb 2025): 83.2% (significantly improved over Qwen3-8B)
- Instruction Following (IFEval): 91.5%
- Native Context: 262K tokens, extensible to 1M+
- VRAM: ~6.5-7.5GB (quantized Q4_K_M)
- Model: Qwen/Qwen3.5-9B
Llama-3.1-8B
- Release: July 2024
- Context: 128K tokens
- Model: meta-llama/Llama-3.1-8B
Mistral Small 3.2
- Release: June 2025 (Mistral docs, VentureBeat)
- Size: 24B parameters
- Features: Multimodal, multilingual, Apache 2.0
- Previous: Mistral Small 3.1 (Mistral announcement)
Mobile & Edge Models
Phi-4-Mini (Microsoft)
- Release: 2025
- Size: Small language model optimized for edge deployment
- Variants: Part of the Phi-4 family (Microsoft)
VII. Market Dynamics & Key Trends
1. The 1M Context Window Era
- All three frontier labs (OpenAI, Anthropic, Google) now offer 1M token context
- GPT-5.4 (March 2026), Claude Opus 4.6 (February 2026), Gemini 3.1 Pro (February 2026) all at 1M tokens
- Practical effective reasoning still degrades beyond ~200K tokens
2. Open-Source Convergence
- DeepSeek-V3.2, Qwen3, Llama 4 competitive with proprietary models on many benchmarks
- Licensing: Apache 2.0 (Qwen3), Llama Community License (Llama 4)
- Community innovation accelerating
3. Multimodal Consolidation
- Separate vision and language models converging into single architectures
- Llama 4 is natively multimodal (text + image)
- Image generation moving into LLMs (GPT-5.4 native image gen replaces DALL-E)
4. Reasoning as First-Class Citizen
- Reasoning capabilities now integrated into main models (GPT-5.4 Thinking, Gemini 3.1 Pro DeepThink, Claude Adaptive Thinking)
- Standalone reasoning models (o-series) being retired in favor of integrated reasoning
- Extended thinking tokens becoming standard across all frontier models
5. Cost Per Task Declining
- Text generation (frontier): $2-5/1M input tokens (proprietary); open-source models are free to self-host
- Image generation: Native in GPT-5.4; open-source (FLUX) free to self-host
- Video generation: Available via Sora 2, Veo 3, Runway Gen-4.5
6. Agentic Capabilities
- GPT-5.4: Native computer control (browser + OS automation)
- Claude Opus 4.6: Agent teams in Claude Code, long-horizon autonomous tasks
- DeepSeek-V3.2: Thinking in Tool-Use with 1,800+ environments
- Terminal-Bench 2.0 and SWE-Bench as key agentic benchmarks
VIII. Gaps & Limitations (March 2026)
Unsolved Problems
- Long-Context Reasoning
  - Models can input 1M tokens, but effective reasoning quality still degrades
  - Claude Opus 4.6 leads with 76% retrieval at 1M tokens (MRCR v2)
- Real-Time Interaction
  - Streaming works, but latency for deep reasoning modes remains high
  - The trade-off between reasoning depth and response speed is explicit
- Embodied AI
  - No frontier model is fully trained for robotic control
  - Data bottleneck: insufficient robot trajectories for training
- Hallucination Mitigation
  - GPT-5.4 claims 33% fewer factual errors vs. its predecessor
  - Still an active area of improvement across all models
- Pricing Complexity
  - Long-context pricing penalties (Claude doubles price above 200K tokens)
  - Rapid model iteration makes cost planning difficult
- Multilingual Parity
  - English still dominant; Meta's Omnilingual ASR covers 1,600+ languages for speech
  - Text generation quality for low-resource languages remains limited
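The long-context pricing penalty noted under Pricing Complexity compounds quickly. The sketch below assumes a simple marginal-tier model in which tokens above the 200K threshold bill at double the base rate; the actual tier structure and whether the surcharge applies to the whole request or only the excess are assumptions here, so consult the provider's pricing page.

```python
def tiered_input_cost(tokens: int, base_rate_per_m: float = 5.00,
                      threshold: int = 200_000, multiplier: float = 2.0) -> float:
    """Input cost in dollars when tokens above `threshold` bill at
    `multiplier` times the base per-million rate (a simplifying
    assumption for illustration)."""
    below = min(tokens, threshold)
    above = max(tokens - threshold, 0)
    return (below + above * multiplier) / 1e6 * base_rate_per_m

print(f"150K-token prompt: ${tiered_input_cost(150_000):.2f}")
print(f"600K-token prompt: ${tiered_input_cost(600_000):.2f}")
```

Under these assumptions a 600K-token prompt costs more than six times a 150K-token one, despite being only four times as long, which is why context compaction features matter for agentic workloads.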
IX. Recommendations for Practitioners
For Production Applications
- Text Generation: GPT-5.4 (proprietary, agentic tasks) or DeepSeek-V3.2 (open-source, cost-effective)
- Vision Understanding: Claude Opus 4.6 (DocVQA 96.1%, ChartQA 93.4%) or Gemini 3.1 Pro (MMMU 81%)
- Image Generation: FLUX.2 (open-source) or GPT-5.4 native (proprietary)
- Video: Gemini 3.1 Pro for understanding; Sora 2 or Veo 3 for generation
- Code: GitHub Copilot with GPT-5.1-Codex, or DeepSeek-V3.2 (open-source)
- Reasoning: Claude Opus 4.6 (AIME 94.2%, MATH 97.2%) or Gemini 3.1 Pro (ARC-AGI-2 77.1%)
For Research
- Reasoning: DeepSeek-R1 (open-source, transparent thinking chains)
- Multimodal: Qwen3-VL (open-source vision-language) or Gemini 3.1 Pro
- Efficiency: Phi-4 family (14B with frontier-competitive reasoning performance)
X. Outlook: Q2-Q4 2026
Recent Release Timeline
- GPT-5 launched August 2025; now at GPT-5.4 (March 2026)
- Claude Opus 4.6 launched February 2026
- Gemini 3.1 Pro launched February 2026
- Llama 4 launched April 2025 (Behemoth still in training)
- DeepSeek-V3.2 launched late 2025
Emerging Areas
- Agentic Systems: All frontier models now support autonomous computer control and tool-use
- Test-Time Compute Scaling: Thinking/reasoning modes that trade latency for quality
- Model Consolidation: Separate specialized models (DALL-E, Codex, o-series) being absorbed into unified frontier models
- Synthetic Data Generation: Using frontier models to create training data for smaller models
Conclusion
The AI landscape in March 2026 is defined by convergence: all three frontier labs have reached 1M token context, integrated reasoning modes, and native multimodal capabilities. The standalone model era (separate vision, code, reasoning models) is giving way to unified architectures. Open-source models remain 1-2 generations behind on flagship benchmarks but offer compelling cost-performance ratios.
For practitioners: choose models by task and cost constraints. For researchers: the exciting frontiers are in agentic autonomy, long-context reliability, and efficient reasoning. For everyone: the era of agentic AI has arrived.
References & Source Links
Frontier Model Announcements
- GPT-5.4: openai.com/index/introducing-gpt-5-4 | TechCrunch coverage
- Claude Opus 4.6: anthropic.com/news/claude-opus-4-6 | System card
- Claude Sonnet 4.6: anthropic.com/news/claude-sonnet-4-6
- Gemini 3.1 Pro: blog.google | Model card
- Llama 4: ai.meta.com/blog/llama-4 | Hugging Face blog
- DeepSeek-V3.2: api-docs.deepseek.com
Model Repositories (Hugging Face)
- Qwen3.5-397B-A17B: huggingface.co/Qwen/Qwen3.5-397B-A17B
- Qwen3.5-9B: huggingface.co/Qwen/Qwen3.5-9B
- DeepSeek-V3.2: huggingface.co/deepseek-ai/DeepSeek-V3.2
- DeepSeek-R1: huggingface.co/deepseek-ai/DeepSeek-R1
- Llama 4 Scout: huggingface.co/meta-llama/Llama-4-Scout-17B-16E
- QwQ-32B: huggingface.co/Qwen/QwQ-32B
- Qwen3-VL: huggingface.co/collections/Qwen/qwen3-vl
- Gemma 3 4B: huggingface.co/google/gemma-3-4b-it
- FLUX.1-dev: huggingface.co/black-forest-labs/FLUX.1-dev
Technical Reports
- Qwen3 Technical Report: arXiv:2505.09388
- DeepSeek-V3.2: arXiv:2512.02556
- FLUX.1 Kontext: arXiv:2506.15742
Benchmarking Resources
- Terminal-Bench 2.0: tbench.ai
- Humanity's Last Exam: agi.safe.ai
- LMSys Chatbot Arena: huggingface.co/spaces/lmsys/chatbot-arena
- OpenCompass: opencompass.org
Video Generation
- Sora 2: openai.com/index/sora-2
- Veo 3: deepmind.google/models/veo
- Runway: runwayml.com
Document Version: 2.0 · Date: March 22, 2026 · Author: CLAW-00 · Last Updated: March 22, 2026, 08:00 GMT+8