Local LLMs in 2026: a hardware-and-model field study

TL;DR

Recall is a solved problem locally. Qwen 3.6 35B-A3B, Qwen 3.6 27B, and Gemma 4 31B all reproduce 20-line code blocks at >95 % accuracy on prompts up to 280 KB.
Code generation is not. Of the three coders I tested on a real PRD, only the dense Qwen 3.6 27B produced a usable app. The MoE 35B-A3B got close but ran out of context. Gemma never opened in a browser.
Apple Silicon's unified memory wins above ~32K context. An RX 9700 XTX chews through small contexts at 99 t/s but collapses to 20 t/s by 96 K. An M5 Max holds steady at 25 t/s from 32 K all the way out to 196 K.
32 GB is the working minimum, 64 GB is comfortable, 128 GB is rarely needed — unless you're running giant dense models overnight.

The lineup

Four machines, four models. I deliberately tested out-of-the-box quantizations with no quant-fiddling — I wanted to know what downloading the popular build off LM Studio gets you, not what's possible with an afternoon of tuning.

Models

Model	Architecture	Type	Quant	On-disk size
`google/gemma-4-31b`	gemma4	Dense · 31B	Q4_K_M	19.9 GB
`zai-org/glm-4.7-flash` excluded	glm4_moe	MoE · 30B	6-bit	24.4 GB
`qwen/qwen3.6-35b-a3b`	qwen35moe	MoE (~3B active) · 35B	Q4_K_M	22.1 GB
`qwen/qwen3.6-27b`	qwen35	Dense · 27B	Q4_K_M	17.5 GB

GLM 4.7 Flash is in the table for completeness but excluded from the results. It scored so poorly on the recall benchmark — long pauses, frequent malformed outputs — that including it would just add noise.

Hardware

Machine	Memory	GPU
MacBook Pro M5 Max	128 GB unified	40-core GPU
MacBook Pro M1 Max	64 GB unified	32-core GPU
Mac Mini M2 Pro	32 GB unified	19-core GPU
PC	32 GB + 24 GB VRAM	AMD Radeon RX 9700 XTX

Study 1 · Positional recall (codeneedle)

Alex Ziskind built a benchmark called codeneedle (video) that stuffs a large source file into a model's context and asks it to reproduce the first 20 lines of named functions verbatim. It measures positional recall — not lookup, not comprehension, but the brute "do you remember exactly what was at line 412?" question that long-context coding really tests.

What I changed

I forked codeneedle and added four things needed to run it usefully against my models:

C# support. The benchmark only spoke JavaScript and Python; I added a .cs extractor (regex + brace-matching) plus the prompt-anchor wording for C# methods. I shipped a Player.cs corpus along with it, ~3,300 lines from a real game-server project.
relax_indent on by default. Several models — Gemma in particular — produce semantically-correct code with normalized indentation. C# methods live two levels deep inside a namespace; if the model emits one tab instead of two, a linter fixes it in milliseconds. Penalizing that as a recall miss flatters strict copy-paste behavior over content correctness, so I made the scorer compare with leading whitespace stripped on both sides.
Auto-credit blank expected lines. Many of the "20-line" fixtures end with a trailing blank line. Every model I tested would happily reproduce the 19 lines of code and stop, costing it a recall point for the missing newline. I changed the scorer to mark blank expected lines as matched.
Default --max-tokens 8000. Gemma 4 31B's reasoning chain alone can eat 4 K tokens before it gets to the answer. With a 1.5 K budget it consistently ran out and returned empty; with 8 K it had room to think. I never let any model fail for lack of token budget.

(One thing I didn't change: quantization. Everything below is at the public LM Studio community quants — Q4_K_M for Gemma and the Qwens, 6-bit for GLM.)

Results — lines matched (out of possible)

Each corpus contains 11–16 functions; the benchmark asks for 20 lines of each. Numerator is total matched lines across all functions; denominator is the total possible.

http_server.py — 1,351 lines, 51 KB prompt

jquery.js — 10,716 lines, 286 KB prompt

Player.cs — 3,295 lines, 142 KB prompt

All three models pass everything they finish. The story isn't accuracy — it's cost-of-recall. Total wall-clock time to complete each corpus tells the real tale:

Total wall-clock per corpus run

The MoE model is roughly 30× faster than Gemma on the same content. Gemma's average per-function latency on Player.cs was 111 seconds; 35B-A3B's was 3.6. On the largest corpus (jquery.js, ~280 KB prompt), Gemma never produced a single result — it just sat there grinding through reasoning tokens. That matches my eyeball impression during the runs: Gemma thinks a lot before it writes. Sometimes that's a virtue. For positional recall it isn't.

The two Qwens are nearly tied on accuracy but very different in shape. The MoE (35B-A3B with ~3 B active) is the speed king — it actually got a perfect 220/220 on http_server. The dense 27B is steadier (zero hallucinations on http_server, only 2 on jQuery vs the MoE's 15) but ~3× slower per token.

Takeaway

Recall on a long context is essentially solved for these model classes. If your local-LLM use case is "answer questions about code I'm pasting into the context," any of the three works. The MoE wins on throughput; Gemma loses on time-to-answer; the dense Qwen is the most disciplined.

Study 2 · Build the Urlist from a PRD

Recall is one thing. Synthesizing a working app is another. For this round I used the PRD from Burke Holland's "Can Open Source Models Beat Opus at a Fraction of the Cost?" video — a complete spec for a link-sharing app (spec gist). Same PRD, same starting prompt that Burke uses in the video, run against each model.

Setup

Pi agent as the harness, specifically because of its very low-token system prompt. Claude Code, Copilot CLI, and Cursor all ship with 10–20 K of system context out of the box. On a local model with a finite window and expensive prompt processing, that overhead is a tax I don't want to pay. Pi assumes the model is competent enough at bash and the filesystem to get by with a thin harness.
No planning mode. I pasted the PRD straight into the prompt, the same way Burke does in the video.
One round of light follow-up to fix obvious gaps (which OAuth provider to use, empty-list state, and a couple of error-handling tweaks).

Results

🏆

Qwen 3.6 27B (dense)

Almost-perfect first pass. Built the entire app, including OAuth, OpenGraph scraping, drag-to-reorder cards, and the public list page. ~1 hour for the initial build, ~30 minutes of follow-up to handle empty-list state on the dashboard and tighten up the OAuth config. The end result was actually usable in a browser.

⚠️

Qwen 3.6 35B-A3B (MoE)

Almost runnable, fundamentally broken. Got far enough that I could npm install and start the server, but key flows were wrong. At one point it ran out of context, the conversation auto-compacted, and after compaction it lost the thread. I never got a clean working version. It felt like a model from 6–12 months ago — flashes of brilliance, can't quite finish.

🛑

Gemma 4 31B

Never opened in a browser. Gemma's deep-thinking habit becomes a liability the moment a task spans multiple files: it kept reasoning about implications of changes it hadn't made yet, then losing track of what was real on disk. Great at general-purpose Q&A, not a code generator.

What it actually looked like

The verdicts above are easier to take on faith with the artifacts in front of you. Same PRD, same prompt, same starter project — these are screenshots of what each model actually shipped after the build run. Click any image to view full size.

Qwen 3.6 27B Urlist homepage — **Qwen 3.6 27B (dense) — homepage.** Branded logo, marketing headline, three feature cards, signed-in user with avatar in the nav, a footer.

Qwen 3.6 35B-A3B Urlist homepage — **Qwen 3.6 35B-A3B (MoE) — homepage.** A title, a subtitle, and a paste box. No marketing copy, no feature cards, no footer.

Qwen 3.6 27B Urlist editor with rich link preview — **27B — list editor.** Pasting `nytimes.com` kicks off an OpenGraph fetch and renders a real preview card with the NYT logo, full headline, description, and source URL — exactly what the PRD asks for.

Qwen 3.6 35B-A3B Urlist editor with broken link preview — **35B-A3B — list editor.** Same paste, but the preview card is a blank thumbnail with `nytimes.com / nytimes.com`. The OG-scraping endpoint exists but doesn't actually populate the metadata.

Qwen 3.6 27B Urlist published list view — **27B — published list.** Vanity slug, dated header, link counter, the rich preview card again, and a "Share this list" panel with a copyable URL. Done.

Qwen 3.6 35B-A3B Urlist published list view — **35B-A3B — published list.** Page title is the raw slug. No date, no link counter, no share panel, generic chain-icon placeholder, and the preview is still just `nytimes.com / nytimes.com`.

Both models read the same PRD, both produced something that ran, but only one of them shipped the spec. The MoE pattern-matched the shape of the deliverable; the dense model actually built it.

Takeaway

Density beats cleverness for end-to-end generation. The MoE is wonderful when you need fast, shallow answers — code review, recall, "what does this regex do" — but a multi-file build benefits from sustained, dense attention.

That insight reshaped my mental model of which model to load:

Qwen 3.6 27B (dense) for actual implementation work, when I have an hour to wait.
Qwen 3.6 35B-A3B (MoE) for fast retrieval, code review, "find me the function that does X."
Gemma 4 31B — I could not find a niche for it that the two Qwens didn't already cover, so I dropped it from my rotation.

Study 3 · Hardware & token speed

Last study: same model, same prompt, run across every machine I had. The prompt is a non-trivial TypeScript task — "build me a robust async job queue with concurrency, rate-limiting, exponential-backoff retries, AbortSignal cancellation, and Vitest tests" — that produces 10–14 K output tokens. I ran it in LM Studio at 64 K context.

Tokens per second at 64 K context

Qwen 3.6 35B-A3B (MoE)

Qwen 3.6 27B (dense)

Two Apple Silicon observations from these:

GPU cores beat generation. The M1 Max (32 GPU cores) is faster than the M2 Pro (19 GPU cores) on both models, even though the M2 Pro is a generation newer. This shows up clearly on the MoE — 42.9 t/s vs 33.9 t/s — and on the dense model the two are nearly tied at ~7 t/s, both crawling. RAM bandwidth and GPU-core count are what matter for inference, not the SoC's marketing label.
The discrete GPU is competitive at this context size. The RX 9700 XTX is right behind the M5 Max on the MoE (95.6 vs 98.9 t/s) and ahead of every Mac on the dense model (40.9 t/s). At 64 K context, with the model sitting in 24 GB of VRAM, the discrete GPU does what discrete GPUs are good at.

The catch — what happens when context grows

Here's the chart that changed how I think about local LLM hardware. Same model (Qwen 3.6 27B dense), same prompt, varying only context size:

Tokens / sec vs context size — RX 9700 XTX vs M5 Max

The XTX starts at 99 tokens/sec at 32 K context, falls to 41 by 64 K, hits 21 at 96 K, and is still at 20 at 128 K — the ceiling I tested it at. The same model on the M5 Max produces a flat line: 25.5 t/s at 32 K, 24.2 t/s at 196 K. I extended only the M5 run out to 196 K to confirm performance stays stable as you approach 200 K context, and it does. That's the unified-memory advantage in one image.

The discrete GPU is faster when it fits. Past 32 K, the cliff in the XTX line is consistent with the model + KV cache no longer fitting in 24 GB of VRAM and the driver paging into system RAM over PCIe. The unified-memory machine doesn't care — it never moves anything. For coding workloads where context naturally grows over a session, the Mac is the more honest performer.

How much RAM do you actually need?

All four models I tested fit comfortably in 32 GB of unified memory with their working context. My Mac Mini (32 GB) ran Qwen 3.6 27B at low t/s but a usable context size — fine for a long-running task you can leave alone. 64 GB is the comfortable answer: any of these models, plenty of headroom for context, and room to keep an editor and browser open. 128 GB is overkill for development, but starts paying off if you want to load 70 B+ dense models for overnight tasks. The denser the model, the more brutally the per-token cost grows — and at that point you may as well use a hosted model anyway.

Operational tips I picked up along the way

Pi agent + the LM Studio plugin

Two things made the local-model workflow tolerable. First, stakira/pi-lmstudio — a Pi agent plugin that lets you switch the active LM Studio model from inside the agent prompt instead of editing models.json. With four models in rotation, constantly hand-editing config got old fast.

Teach the agent to use ripgrep with an APPEND_SYSTEM.md

Pi's system prompt is intentionally lean, which is great for context budget — but it doesn't tell the model anything about preferred tools. Without prompting, every one of these models reaches for plain grep -rn (or worse, a find piped into xargs grep) the moment they need to search a codebase. On a repo with thousands of files that's a one- to two-minute round-trip where rg would have finished in milliseconds. The model isn't slow — the tool is.

The fix that worked was an APPEND_SYSTEM.md in ~/.pi/agent/ with a small ripgrep cheat-sheet that gets prepended to the system prompt:

# Ripgrep (`rg`)

Prefer `rg` over `grep` / `find -name` for searching code.
It's fast and respects `.gitignore` by default.

## Essentials
rg "pattern"              # recursive search
rg -i "p"                 # case-insensitive
rg -F "literal"           # no regex (faster, no escaping)
rg -w "p"                 # whole word
rg -l "p"                 # files with matches only
rg -C 3 "p"               # 3 lines of context

## Scope
rg "p" -t ts              # by file type (--type-list to see all)
rg "p" -g "*.md"          # include glob
rg "p" -g "!**/dist/**"   # exclude glob
rg --files                # list files (find replacement)

## Tips
- Single-quote regexes: `rg 'foo\.bar'`.
- Use `-F` whenever you don't need regex.
- `rg` exits non-zero on no matches; add `|| true` in scripts.

AGENTS.md to stop "everything is JavaScript" assumptions

All four models default to assuming you're in a TypeScript / JavaScript project, even mid-session in a clearly different codebase. They'd happily filter on *.ts files inside a .NET 10 game-server repo. The fix was an AGENTS.md at the project root telling the model what stack it's actually working on — including the subfolder layout, the language, and the build command. Two paragraphs of context saved me from a lot of useless globbing.

When to use what

The three studies converge on a fairly small decision tree:

"I need an offline assistant for retrieval"

Qwen 3.6 35B-A3B. ~3 B active parameters means it's fast on consumer hardware (95+ t/s on an M5 Max or an RX 9700 XTX at 64 K context), and it's an excellent recaller. Ideal for "summarize this codebase," "find the file that does X," "translate this snippet."

"I want to actually build something offline"

Qwen 3.6 27B (dense). Slow — plan on real wall-clock time — but it finishes. The only model in this round that produced a usable end-to-end app from a real PRD.

"I'm picking a Mac to run all of this"

32 GB if budget-bound, 64 GB if you want comfort. GPU-core count matters more than generation. The unified-memory architecture is the real advantage as context windows grow.

"I have a PC with a big GPU"

Great for short-context tasks — it'll outrun any Mac under ~32 K context. As soon as the model + KV cache exceed VRAM (which happens fast on coding tasks), it starts paging and falls off a cliff. Use it; just don't expect it to scale with context.

The simplest takeaway, though: the gap between local and cloud is not what it was twelve months ago. A Qwen 3.6 27B running on a 64 GB MacBook Pro is genuinely usable for a real coding session — slowly, but usably — and that's a state of affairs I would not have believed at the start of 2025.

Notes on methodology

All recall numbers come from codeneedle JSON dumps with my fork's scorer (relax-indent on, blank-line credit on, C# extractor enabled).
All token-speed numbers were collected in LM Studio with KV cache quantization = Q8 and the same TypeScript "build me a job-queue utility" prompt (~280 lines of spec, 10–14 K output).
Each hardware run was the second invocation of the model after a warm-up to keep the JIT context-loading penalty out of the headline number; prompt-processing time is reported separately.
I deliberately did not tune quantization, sampler settings, or system prompts beyond the lightweight Pi defaults — the goal was to characterize the out-of-the-box experience.

Comments, corrections, or "you're holding it wrong" notes welcome via the usual social links on the about page.