The Best Local AI Models Right Now: March 2026 Edition
Text, image, video, music, voice, and code — the complete open-source stack, benchmarked and ranked.


You can run a world-class AI production stack on hardware you already own, for $0 in API costs, starting today.
That sentence wasn't true eighteen months ago. It is now.
The open-source AI ecosystem has closed the gap with commercial offerings faster than anyone predicted. Across text generation, image creation, video production, music composition, voice synthesis, and code, there are open-weight models that match or challenge the best proprietary APIs. The difference is that these run locally, cost nothing per call, and improve every month.
We benchmark the field monthly. Here's where things stand in March 2026.
Before You Pick a Model: Know Your Hardware
The right model for you is determined by your VRAM, not your ambition.
| Your Setup | What You Can Run |
| --- | --- |
| Consumer GPU (8GB VRAM) | FLUX schnell, SDXL, Wan 2.1 small, Mistral Small 3, Kokoro TTS, ACE-Step music |
| Prosumer GPU (16–24GB) | Most models in this guide — the sweet spot |
| High-end workstation (40GB+) | Mochi video, DeepSeek R1 (quantized, with RAM offload), SkyReels |
| CPU only / Apple Silicon | Kokoro TTS, Mistral Small 3 via Ollama, ACE-Step music |
A single NVIDIA L4 (24GB, ~$0.80/hr on cloud) comfortably runs almost everything in this guide; the exceptions are the biggest LLMs and the 40GB+ video models, which need more hardware or heavy offloading. The L4 is what we run our video generation pipeline on.
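Not sure where your machine lands? A quick check with PyTorch saves guesswork (assumes torch is installed; nvidia-smi tells you the same thing):

```python
import torch

# Report the available accelerator and its memory, so you know which
# tier of the table above your machine falls into.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
elif torch.backends.mps.is_available():
    print("Apple Silicon (MPS): unified memory, shared with the OS")
else:
    print("CPU only: stick to Kokoro, Mistral Small 3 via Ollama, ACE-Step")
```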
Text / LLM
The headline: Open-source reasoning models now beat GPT-4o on published benchmarks.
Alibaba's Qwen3-235B-A22B is the current open-source leader. It uses a Mixture-of-Experts architecture (235B total parameters, only 22B active per token), so each token costs roughly what a 22B dense model would at inference. The full weights are still heavy: a 4-bit quantization comes to roughly 120GB, so on a single 24GB GPU you'll be offloading most of the experts to system RAM. Scores on GPQA, AIME25, and LiveCodeBench put it ahead of GPT-4o on reasoning tasks.
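One concrete way to try it is llama.cpp's Python bindings against a GGUF quantization. A minimal sketch; the file name is a placeholder for whatever quant you download, and n_gpu_layers should be tuned so the hot layers sit in VRAM while the rest spills to system RAM:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder file name: download a GGUF quant of Qwen3-235B-A22B first.
llm = Llama(
    model_path="./Qwen3-235B-A22B-Q4_K_M.gguf",
    n_gpu_layers=40,   # tune to your VRAM; remaining layers stay in RAM
    n_ctx=8192,
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "How many primes are below 100?"}]
)
print(out["choices"][0]["message"]["content"])
```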
For raw reasoning and math, DeepSeek R1 (671B, MIT license) is the one to beat. It's enormous: even aggressive quantizations weigh in at well over 100GB, so plan on multiple GPUs or deep system-RAM offload. The chain-of-thought output is genuinely useful, though, not just impressive on paper.
If you're running on a Mac Mini or consumer GPU, Mistral Small 3 (24B, Apache 2.0) is the practical pick. Fast enough for real-time agentic loops, capable enough for most writing and analysis tasks, fits in 16GB.
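For a feel of the agentic-loop workflow, here's a minimal sketch using the ollama Python client. The mistral-small model tag is an assumption; check ollama list for the exact name on your install:

```python
import ollama  # pip install ollama; assumes the Ollama daemon is running

# Hypothetical model tag; pull it first with: ollama pull mistral-small
response = ollama.chat(
    model="mistral-small",
    messages=[
        {"role": "system", "content": "You are a concise research assistant."},
        {"role": "user", "content": "Summarize the tradeoffs of MoE vs dense LLMs."},
    ],
)
print(response["message"]["content"])
```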
| Model | Params | License | Best For |
| --- | --- | --- | --- |
| Qwen3-235B-A22B | 235B MoE | Apache 2.0 | Best open reasoning overall |
| DeepSeek R1 | 671B | MIT | Math, chain-of-thought |
| Llama 4 Maverick | 400B MoE | Community | General — Meta's latest |
| Mistral Small 3 | 24B | Apache 2.0 | Speed + efficiency, local agents |
| Gemma 3 27B | 27B | Gemma ToS | Best mid-size, Google-quality reasoning |
Bottom line: If you're paying for GPT-4o for reasoning tasks, run Qwen3-235B-A22B quantized for a week. The gap is smaller than you think.
Image Generation
The headline: FLUX has won. The question is which FLUX.
Black Forest Labs' FLUX family has become the default for serious local image generation. FLUX.1 schnell (Apache 2.0) is the fastest: it generates quality images in 1–4 steps, runs on 8GB VRAM, and costs nothing. FLUX.1 Kontext is the current quality leader, with the best prompt adherence and the best editing of existing images; it's the model you reach for when it has to look right. Note the split: Kontext Pro is served via API, while the open-weight Kontext Dev runs locally under a non-commercial license.
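A minimal diffusers sketch for schnell, following the settings on the model card (4 steps, guidance disabled; CPU offload keeps peak VRAM down on smaller cards):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # trades speed for a lower VRAM peak

image = pipe(
    "a lighthouse on a cliff at golden hour, film photography",
    num_inference_steps=4,  # schnell is distilled for 1-4 steps
    guidance_scale=0.0,     # schnell ignores classifier-free guidance
).images[0]
image.save("lighthouse.png")
```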
Stable Diffusion XL isn't dead. Its ecosystem — LoRA libraries, ComfyUI nodes, fine-tuned models for specific styles — remains unmatched. If you have a specialized style need or existing SDXL workflows, SDXL is still the right choice.
Janus-Pro from DeepSeek deserves a mention as the only model here that both understands and generates images — useful for building pipelines where the model needs to reason about visual content.
| Model | License | Best For | VRAM |
| --- | --- | --- | --- |
| FLUX.1 Kontext Dev | Non-commercial | Quality generation + editing | 16GB |
| FLUX.1 schnell | Apache 2.0 | Speed, free production use | 8GB |
| Stable Diffusion XL | CreativeML RAIL | Ecosystem, fine-tuning | 8GB |
| Janus-Pro | Apache 2.0 | Multimodal understand + generate | 8GB |
Bottom line: Start with FLUX.1 schnell. If quality matters for final output, step up to Kontext Pro via API. Both are in ComfyUI today.
Video Generation
The headline: Local video generation is real production tooling now, not a research demo.
Wan 2.1/2.2 (Apache 2.0) is the practical standard for 2026. It runs on 8–24GB VRAM depending on the model variant, produces cinematic output, and has the widest ComfyUI support. The 5B fp16 variant is the best value: 10GB download, runs on a 24GB GPU, output quality that would have required commercial tools six months ago.
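ComfyUI is the richer workflow, but for scripted pipelines diffusers ships a WanPipeline. A sketch against the small Wan 2.1 text-to-video checkpoint (the Hub repo id and settings below follow the diffusers docs; treat them as a starting point and swap in the 2.2 variants):

```python
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

# Small Wan 2.1 text-to-video checkpoint; the VAE runs in float32 for stability.
model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

frames = pipe(
    prompt="a sailboat crossing a harbor at dusk, cinematic",
    height=480,
    width=832,
    num_frames=81,       # about five seconds at 16 fps
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "harbor.mp4", fps=16)
```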
LTX-Video 19B is the fastest — near-real-time image-to-video on a 24GB GPU. If you're building a pipeline where latency matters, this is your model.
HunyuanVideo (Tencent, 13B) produces the highest quality clips of the locally-runnable models. It's slower and needs a full 24GB, but the output is noticeably better for complex scenes.
| Model | License | Best For | VRAM |
| --- | --- | --- | --- |
| Wan 2.2 | Apache 2.0 | Best all-round — 2026 standard | 8–24GB |
| LTX-Video 19B | Apache 2.0 | Speed, real-time i2v | 24GB |
| HunyuanVideo | Tencent | Highest quality locally | 24GB |
| CogVideoX 5B | Apache 2.0 | Image-to-video, consistent subjects | 12GB |
| Mochi 1 | Apache 2.0 | Best motion quality (research) | 40GB+ |
Bottom line: Wan 2.2 for production. LTX-Video 19B for speed. HunyuanVideo when the clip has to be cinematic. All three run on a single L4 GPU.
Music Generation
The headline: ACE-Step 1.5 is the breakout — and it runs on a Mac.
ACE-Step 1.5 (Apache 2.0) claims to outperform most commercial music generation tools while running locally on Mac, AMD, Intel, and CUDA. That claim deserves scrutiny — but the GitHub is active, the demos are convincing, and the architecture is built for speed. It's the first local music gen tool we've seen that doesn't feel like a research prototype.
DiffRhythm took a different approach: train on 1 million songs and let users drive generation with lyrics + a style prompt. Give it a verse and tell it "lo-fi hip-hop, 90 BPM" and it produces a full track. For lyrics-first workflows — artists who write before they produce — this is the most direct path.
Meta's MusicGen Large remains the safe choice for teams that need a known quantity with published benchmarks. It's not the fastest or most expressive, but it works, reliably, every time.
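That known-quantity status shows up in the tooling: MusicGen is wired into the transformers text-to-audio pipeline, so a local test is a few lines. A sketch with the small checkpoint for modest VRAM; facebook/musicgen-large is the drop-in upgrade:

```python
import scipy.io.wavfile
from transformers import pipeline

# MusicGen via the transformers text-to-audio pipeline.
synth = pipeline("text-to-audio", model="facebook/musicgen-small")
track = synth("lo-fi hip-hop, 90 BPM, dusty drums, warm Rhodes chords")

# The pipeline returns a dict with the waveform and its sampling rate.
scipy.io.wavfile.write("track.wav", rate=track["sampling_rate"], data=track["audio"])
```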
| Model | License | Best For | Platform |
| --- | --- | --- | --- |
| ACE-Step 1.5 | Apache 2.0 | Overall quality, local use | Mac/CUDA/AMD |
| DiffRhythm | Apache 2.0 | Lyrics → track, style control | GPU |
| MusicGen Large | CC-BY-NC 4.0 | Controlled generation, safe bet | Local |
| Stable Audio Open | Apache 2.0 | 45s high-quality generation | Local |
Bottom line: Test ACE-Step 1.5 first. If it delivers on the benchmarks, it's a significant shift in what's possible locally without paying Suno or Udio per track.
Text-to-Speech
The headline: Kokoro is 82 million parameters. It sounds better than models twenty times its size.
Kokoro (Apache 2.0) is the story of this benchmark cycle. 82M parameters — small enough to run on CPU — producing voice output that rivals models with billions of parameters. The architecture avoids diffusion entirely, which makes it fast and predictable. For agent voiceovers, tutorial narration, and content automation, this is the new default.
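The kokoro Python package wraps the model. A minimal narration sketch; the af_heart voice id and 24 kHz output follow the project's docs, but treat exact voice names as subject to change between releases:

```python
import soundfile as sf
from kokoro import KPipeline  # pip install kokoro soundfile

tts = KPipeline(lang_code="a")  # 'a' selects American English
segments = tts(
    "Welcome back. In this episode we build a fully local voice stack.",
    voice="af_heart",
)
# The pipeline yields (graphemes, phonemes, audio) per text segment.
for i, (graphemes, phonemes, audio) in enumerate(segments):
    sf.write(f"narration_{i}.wav", audio, 24000)  # Kokoro outputs 24 kHz audio
```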
If you need custom voice cloning — training on a specific person's voice — Orpheus-TTS is the pick. Fine-tunable, Apache 2.0, ranked fourth on the current TTS leaderboard.
| Model | Size | License | Best For | Hardware |
| --- | --- | --- | --- | --- |
| Kokoro | 82M | Apache 2.0 | Speed + quality, agent voiceovers | CPU |
| Orpheus-TTS | ~1B | Apache 2.0 | Voice cloning, fine-tuning | 4GB GPU |
| Fish Speech V1.5 | 500M | CC-BY-NC 4.0 | Multilingual, zero-shot cloning | 4GB GPU |
| CosyVoice2 | 0.5B | Apache 2.0 | Real-time streaming TTS | 2GB GPU |
Bottom line: Kokoro first. Add Orpheus-TTS if you need to clone a specific voice. Both are free to run and cost nothing per call — which means ElevenLabs at $30+/month is hard to justify for most use cases.
Code Models
The headline: Open-source code models are within striking distance of Claude Sonnet on SWE-Bench.
The benchmark that matters for agentic coding tasks is SWE-Bench — it measures whether a model can solve real GitHub issues, not just pass coding quizzes. The current standings:
| Model | License | SWE-Bench | Context | Notes |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4.6 (paid) | Proprietary | 77.2% | 200K | Current commercial leader |
| MiniMax-M2.1 | Open | ~77% | 1M | Matches Sonnet on published benchmarks |
| GLM-4.7 | MIT | 73.8% | 200K | Closest open model to Sonnet |
| DeepSeek V3.2 | MIT | 73.1% | 128K+ | Less hallucination than Qwen at scale |
| Qwen3-Coder | Apache 2.0 | 70.6% | 1M | Handles full codebases in one window |
The gap between the best open models and the best paid models is now 3–4 percentage points on SWE-Bench, setting aside MiniMax-M2.1's still-unverified ~77%. For many routine coding tasks (scaffolding, debugging, documentation) either GLM-4.7 or DeepSeek V3.2 is viable. For complex agentic tasks requiring judgment, Claude Sonnet 4.6 still wins.
Bottom line: Run GLM-4.7 or DeepSeek V3.2 for routine tasks. Keep Claude Sonnet for architecture decisions and complex agentic loops. The cost savings are real.
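Because Ollama and vLLM both expose OpenAI-compatible endpoints, routing routine work to a local code model is mostly a base-URL change in tooling you already have. A sketch; the qwen3-coder tag is an assumption, so check ollama list for what you've actually pulled:

```python
from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at the local Ollama server.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="qwen3-coder",  # hypothetical tag; match your local model name
    messages=[{
        "role": "user",
        "content": "Write a Python function that parses RFC 3339 timestamps.",
    }],
)
print(resp.choices[0].message.content)
```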
The Stack We'd Build Today
If you're starting from scratch and want the best open-source AI production stack in March 2026:
| Role | Model | Cost |
| --- | --- | --- |
| Text agent | Mistral Small 3 via Ollama | $0 |
| Heavy reasoning | Qwen3-235B quantized | $0 |
| Image generation | FLUX.1 schnell | $0 |
| Video generation | Wan 2.2 5B | $0 |
| Music generation | ACE-Step 1.5 | $0 |
| Voice/TTS | Kokoro | $0 |
| Code assistance | Qwen3-Coder quantized | $0 |
| Hardware | Single NVIDIA L4 24GB (~$0.80/hr spot) | Pay as you go |
Total API cost: $0. Hardware cost for cloud: $0.80/hr when you need it, $0 when you don't. Or run on hardware you already own.
What We're Watching for April
- ACE-Step 1.5 benchmarks — the "outperforms commercial" claim needs third-party validation
- Wan 2.2 i2v quality — the image-to-video variant just launched, we'll have benchmarks next month
- MiniMax-M2.1 on SWE-Bench — if the ~77% is confirmed by third parties, an open model ties Claude Sonnet 4.6
- Llama 4 Maverick fine-tunes — Meta just released it; the community ecosystem takes 4–6 weeks to kick in
We publish this benchmark monthly. Every model in this guide will have been superseded within six months. That's the pace we're operating at now.
Data sources: Onyx AI Self-Hosted LLM Leaderboard (updated February 2026), hyperstack.cloud video model comparison (February 2026), SWE-Bench published results, ACE-Step GitHub, r/LocalLLaMA community benchmarks, pixazo.ai image generation guide.
Questions or corrections: marc@thecgaigroup.com

