
Digital Surface Labs

Running an Open-Source Claude Code on an NVIDIA DGX Spark

The best local models and agent tools for a $4K desktop AI workstation

The NVIDIA DGX Spark is a $3,999 desktop AI workstation with 128GB of unified memory and a Blackwell GPU. The pitch: run large open-source models locally and stop paying per-token. The question: can you actually replicate the Claude Code experience with open-source tools?

Short answer: you can get 70-80% of the way there for routine coding. The gap is real for complex multi-file reasoning. Here are the best options.


The Hardware: What You're Working With

The DGX Spark (originally "Project DIGITS," renamed at GTC March 2025, shipping since October 2025) packs a GB10 Grace Blackwell superchip into a 1.1-liter box:

| Spec | Value |
|---|---|
| GPU | Blackwell, 6,144 CUDA cores, 5th-gen Tensor Cores |
| Memory | 128GB LPDDR5x unified (shared CPU+GPU) |
| Bandwidth | 273 GB/s |
| Compute | 1 PFLOP (FP4 sparse) |
| Storage | 4TB NVMe (Founders Edition) |
| Power | 240W, external USB-C |
| Size | 5.9" x 5.9" x 2" |
| Price | $3,999 |

The 128GB unified pool is the key number — it determines what models fit. The 273 GB/s memory bandwidth is the bottleneck — it's what makes decode speed slower than Apple Silicon (M3 Ultra gets 800+ GB/s) or discrete GPUs (RTX Pro 6000 gets 1.8 TB/s). Two Sparks can link via ConnectX-7 at 200 Gbps, doubling capacity to 256GB.
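Why bandwidth is the bottleneck can be made concrete with a roofline-style estimate: a memory-bound decoder has to stream every active weight byte per generated token. A minimal sketch (the 273 GB/s and 800 GB/s figures are from the text; treating decode as purely bandwidth-bound is a simplifying assumption):

```python
# Back-of-envelope decode ceiling for a memory-bandwidth-bound decoder:
# each generated token streams every active weight byte from memory, so
# tok/s <= bandwidth / active_weight_bytes. Real throughput lands below
# this (KV cache reads, kernel overhead, scheduling).

def decode_ceiling_tok_s(active_params_b: float, bytes_per_param: float,
                         bandwidth_gb_s: float = 273.0) -> float:
    """Upper bound on decode tokens/sec; 273 GB/s is the Spark's bandwidth."""
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

# A 70B dense model at INT4 (~0.5 bytes/param) on one Spark:
print(round(decode_ceiling_tok_s(70, 0.5), 1))         # 7.8 tok/s ceiling
# The same model with M3 Ultra-class bandwidth (800 GB/s):
print(round(decode_ceiling_tok_s(70, 0.5, 800.0), 1))  # 22.9 tok/s ceiling
```

This is also why MoE models punch above their weight here: only the active parameters are streamed per token, so the ceiling depends on active size, not total size.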

Practical model limits on a single Spark:

  • 70B dense models at FP8 or INT4: comfortable
  • 120B dense models at INT4: fits with headroom
  • 235B MoE models at Q3: tight but works (~12 tok/s)
  • 480B+ MoE models: need aggressive quantization, ~5 tok/s, or two units
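These fit/doesn't-fit calls follow from simple arithmetic: weights take roughly params × bytes-per-param, plus runtime overhead. A sketch (the 16GB cushion is an assumed round number for OS, context, and runtime, not a measured figure):

```python
# Rough weight footprint per quantization level (bytes per parameter).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5, "q3": 0.375}

def weight_gb(params_b: float, quant: str) -> float:
    """Weights only; KV cache and activations come on top."""
    return params_b * BYTES_PER_PARAM[quant]

def fits_on_spark(params_b: float, quant: str,
                  memory_gb: float = 128.0, reserve_gb: float = 16.0) -> bool:
    """Assumed ~16GB cushion for OS, context, and runtime."""
    return weight_gb(params_b, quant) <= memory_gb - reserve_gb

print(weight_gb(24, "bf16"))     # 48.0 GB — matches Devstral Small 2 below
print(weight_gb(123, "int4"))    # 61.5 GB — matches Devstral 2 at INT4
print(fits_on_spark(235, "q3"))  # True — ~88GB of weights, tight but in
```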


Top 3 Recommendations

1. Devstral Small 2 (24B) — Best Bang for the Buck

| Field | Value |
|---|---|
| Developer | Mistral AI |
| Parameters | 24B dense |
| License | Apache 2.0 |
| Context | 256K tokens |
| SWE-bench Verified | 68.0% |
| VRAM (BF16) | ~48GB |
| VRAM (INT4) | ~12GB |

This is the standout. A 24B model hitting 68% on SWE-bench Verified — beating many models 5x its size — is remarkable. It was explicitly built as a software engineering agent, designed to work inside the OpenHands framework.

At BF16 it uses only 48GB of the Spark's 128GB, leaving 80GB for context, OS, and other processes. Decode speed will be fast given the small model size relative to available bandwidth. You could even run a second model alongside it (e.g., a reasoning model for "architect mode").

Mistral also released Mistral Vibe, a native CLI built around Devstral that provides a Claude Code-like terminal experience out of the box.

Why it's #1: Best SWE-bench score per GB of VRAM. Apache 2.0 license. Fits entirely in memory at full precision. Purpose-built for agentic coding. Has its own CLI tool.

2. Qwen 2.5 Coder 32B Instruct — The Proven Workhorse

| Field | Value |
|---|---|
| Developer | Alibaba / Qwen Team |
| Parameters | 32B dense |
| License | Apache 2.0 |
| Context | 128K tokens |
| HumanEval | ~92% |
| Aider Polyglot | 73.7% |
| VRAM (BF16) | ~64GB |
| VRAM (INT4) | ~18-20GB |

The community consensus best-in-class for local coding agents through most of 2025. Scores 73.7% on Aider's polyglot benchmark — competitive with GPT-4o. Has the broadest tool support: works well with Aider, OpenCode, Cline, Goose, and every other major agent framework.

At BF16 it uses 64GB — half the Spark's memory, leaving ample room. At INT4 it's only 18-20GB, which means you could run it alongside other models or services.

The OpenHands team fine-tuned this into OpenHands LM 32B using reinforcement learning on successful agent trajectories, specifically optimizing for multi-step coding workflows. That variant hits 37.2% SWE-bench standalone, 53-60% when paired with the full OpenHands scaffold.

Why it's #2: Most mature ecosystem support. Proven across every major agent tool. Strong benchmarks. Apache 2.0. Battle-tested by millions of developers. The safe choice.

3. Devstral 2 (123B) — Maximum Quality on a Single Spark

| Field | Value |
|---|---|
| Developer | Mistral AI |
| Parameters | 123B dense |
| License | Modified MIT |
| Context | 256K tokens |
| SWE-bench Verified | 72.2% |
| VRAM (BF16) | ~246GB |
| VRAM (INT4) | ~62GB |

The highest SWE-bench score among open-weight models that fit on a single Spark. At INT4 quantization (~62GB), it runs comfortably with room to spare. At INT8 (~123GB) it's borderline but feasible.

72.2% SWE-bench Verified puts it in the same tier as top proprietary models. The 256K context window is generous for large codebase understanding.

Why it's #3: Highest coding quality you can run on a single Spark. The tradeoff is decode speed — at 123B even with INT4, you'll get slower token generation than the smaller models. For complex problems where quality matters more than speed, this is the one.


Honorable Mentions

| Model | Params | Benchmark | Fits on Spark? | Notes |
|---|---|---|---|---|
| Qwen 2.5 Coder 72B | 72B dense | Strong | Yes (FP8/INT4) | Bigger sibling of the 32B; diminishing returns |
| DeepSeek-R1-Distill-Qwen-32B | 32B dense | — | Yes (BF16, 64GB) | Best for hard reasoning problems; slow and verbose |
| Qwen3 235B A22B (MoE) | 235B / 22B active | ~91.5% HumanEval | Tight at Q3 (~112GB) | Fast inference per active param; MoE overhead |
| Llama 3.3 70B | 70B dense | 88.4% HumanEval | Yes (INT4/INT8) | Solid generalist, not a coding specialist |
| Qwen3 Coder 480B (MoE) | 480B / 35B active | 69.6% SWE-bench | Q2 only, ~5 tok/s | Needs two Sparks for practical use |
| Codestral 25.01 | 22B dense | 86.6% HumanEval | Yes (BF16) | Not fully open-source (research license) |

The Agent Layer: CLI Tools That Make It Work

A model alone isn't enough — you need an agent framework that handles file editing, terminal commands, git integration, and multi-step task orchestration. Here are the best options for local models:

OpenCode — The Direct Claude Code Replacement

The most direct open-source answer to Claude Code. A Go-based terminal agent with ~95K GitHub stars and an official GitHub partnership. Nearly identical interaction patterns to Claude Code, but provider-agnostic.

  • Works with Ollama out of the box
  • LSP integration for code-aware context
  • Multi-session support, auto-compaction, session sharing
  • Remote Docker container sessions
  • Head-to-head benchmark: Claude Code completed tasks in 9m 9s vs OpenCode's 16m 20s, but OpenCode generated more thorough output

Best local pairing: Qwen 2.5 Coder 32B or Devstral Small 2

Aider — The Mature Terminal Coding Agent

41K stars, 4.9M pip installs, 15 billion tokens/week. The most battle-tested option.

  • Architect mode: separates reasoning (R1) from editing (Coder) — ideal for a two-model setup
  • Git-aware with automatic commits
  • Auto lint/test with error feedback loops
  • Runs its own polyglot benchmark — the standard for measuring local model coding quality
  • 88% of aider's own code was written by aider

Best local pairing: Qwen 2.5 Coder 32B (73.7% on Aider's benchmark)

Mistral Vibe — Purpose-Built for Devstral

Mistral's native CLI, released December 2025 alongside Devstral 2. End-to-end code automation designed specifically for Devstral models.

Best local pairing: Devstral Small 2 (obviously)

Other Notable Options

  • Goose (Block/Square, 30K stars) — broader automation agent, strong MCP integration, donated to Linux Foundation
  • Cline (fastest-growing GitHub project of 2025) — VS Code-based, not CLI, but excellent with local models
  • Crush (Charmbracelet, 12K stars) — beautiful TUI, mid-session model switching, broad platform support
  • OpenHands (65K stars) — research-grade, best for running autonomous agents on issue backlogs at scale

The Honest Gap

Running local models for coding is real and practical for:

  • Routine code edits, boilerplate, simple functions
  • Code explanation and documentation
  • Single-file refactoring
  • Test generation
  • Quick bug fixes with clear error messages

The gap versus Claude Code with Opus/Sonnet remains for:

  • Complex multi-file architectural reasoning
  • Subtle bug diagnosis requiring deep codebase understanding
  • Long chains of autonomous tool use without drift
  • Strict edit format compliance (smaller models break formatting)
  • Knowledge of latest APIs and frameworks (frozen at training cutoff)

The sweet spot is a hybrid approach: use local models for routine work (free, private, fast for small models), fall back to cloud APIs for hard problems. Every agent tool listed above supports this — you can switch models mid-session or configure different models for different task types.


Two Sparks: Is It Worth It?

Two DGX Sparks link via ConnectX-7 (200 GbE over QSFP, using RDMA/RoCE — not NVLink). Combined: 256GB unified memory. Price: $7,998. The question is whether that extra 128GB unlocks meaningfully better models.

What Two Sparks Actually Unlock

| Model | Single Spark | Dual Spark | Verdict |
|---|---|---|---|
| GPT-OSS 120B (MoE) | 36-52 tok/s | 55-75 tok/s | Meaningful gain (+53-108%) |
| Qwen3 235B A22B (MoE) | ~12 tok/s | 21-25 tok/s | Good gain; best dual-Spark use case |
| MiniMax M2.1 230B (MoE) | Doesn't fit well | 36 tok/s (0 ctx), 22 tok/s (32K ctx) | Genuine sweet spot |
| Qwen3-VL-32B (FP8) | 7 tok/s | 12 tok/s | +71%, but 32B fits fine on one |
| Qwen3 Coder 480B (MoE, FP8) | ~43 tok/s (MoE efficient loading) | Framework bugs, no real gain yet | Surprisingly runs on one Spark |
| Llama 3.1 405B (INT4) | ~2 tok/s (IQ2) | 1.76 tok/s | Worse. NVIDIA cites "insufficient memory headroom." |
| DeepSeek V3/R1 671B | Does not fit | Does not fit | Needs 4+ Sparks or cloud |

The sobering finding: the models that most need two Sparks (405B dense, 671B) still don't run well on two. The 405B generates slower on two Sparks than a 70B on one. DeepSeek 671B doesn't fit at any useful quantization level — even at Q2 (~168GB), you're looking at severe quality degradation. Someone tried 8 DGX Sparks with DeepSeek V3 via vLLM+Ray and hit OOM errors.

The models that actually benefit are MoE models in the 120B-235B total parameter range. Those are good models, but they also run on a single Spark (just slower or at lower quantization).

The Interconnect Reality

The ConnectX-7 link is 200 Gbps aggregate — but ServeTheHome found each QSFP port is limited to PCIe Gen 5 x4, so real-world bandwidth is ~90-100 Gbps per direction. Compare to NVLink in datacenter systems at 900 GB/s bidirectional. This means:

  • Larger models scale better — all_reduce overhead is small relative to compute
  • Smaller models barely benefit — Qwen3-30B MoE gained only 17% on dual Spark
  • Software maturity is rough — users report triton allocator errors, speculative decoding crashes at high concurrency, and significant performance differences between container versions
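The scaling bullets above can be sanity-checked with a rough tensor-parallel traffic model. All numbers below are illustrative assumptions (layer counts, hidden sizes, two all_reduces per layer), not measurements of any specific model:

```python
# Per-token all_reduce traffic in 2-way tensor parallelism scales with
# layers * hidden_size, while per-token decode time scales with total
# weight bytes. So the roughly fixed comm cost is a smaller share of a
# bigger model's (longer) step time. This ignores per-message latency,
# which is what actually hurts small models most.

def comm_ms_per_token(layers: int, hidden: int, bytes_per_elem: int = 2,
                      link_gbps: float = 100.0,
                      reduces_per_layer: int = 2) -> float:
    bytes_per_token = layers * reduces_per_layer * hidden * bytes_per_elem
    return bytes_per_token / (link_gbps * 1e9 / 8) * 1e3

# Hypothetical 30B-class vs 120B-class configs over the ~100 Gbps link:
small = comm_ms_per_token(layers=48, hidden=6144)  # ~0.09 ms/token
large = comm_ms_per_token(layers=64, hidden=8192)  # ~0.17 ms/token
# At 12 tok/s (83 ms/token) that's ~0.2% overhead; at 50 tok/s
# (20 ms/token) it's ~0.5% — plus the unmodeled latency cost.
print(round(small, 2), round(large, 2))
```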

The $400/Day Cloud vs. Two Sparks Decision

At $400/day in cloud compute ($12,000/month, $146,000/year), two Sparks at $7,998 pay for themselves in 20 days — if they can replace the spend. Here's the reality check:

What $400/day buys in cloud tokens:

| Provider | Model | Tokens/day for $400 |
|---|---|---|
| Anthropic | Claude Opus 4.6 | ~50M |
| Anthropic | Claude Sonnet 4.5 | ~89M |
| Together AI | DeepSeek V3 | ~320M |
| Groq | Llama 3.3 70B | ~606M |
| DeepSeek API | DeepSeek V3 | ~600-700M |

For context, Claude Code averages $6/day for a typical developer. $400/day is 67x that — a very heavy workload or multi-agent setup.

What two Sparks can replace:

  • Routine agentic tasks (file reads, code edits, boilerplate) — a 70B model locally is roughly Sonnet-quality for these. If this is 50% of your spend, payback is 40 days.
  • Complex reasoning, architecture, hard debugging — frontier models (Opus, o1) are genuinely better here. Local 70B-120B models miss things and require more back-and-forth.

What two Sparks cannot replace:

  • DeepSeek 671B quality — you'd need 8x H100s in the cloud (~$383/day on RunPod) to run the AWQ-quantized version
  • Claude Opus reasoning quality — no open-source model at any size matches it on hard problems
  • Speed on large models — Llama 70B decodes at 2.7 tok/s on a Spark vs. instant responses from cloud APIs

Electricity is negligible: Two Sparks at full load 24/7 cost ~$460/year. Noise compared to $146K/year in cloud spend.

The hybrid recommendation: Buy two Sparks ($8K), route routine tasks locally, keep a ~$100-150/day cloud budget for frontier-quality work. Or skip the second Spark — a single unit handles most practical models, and the dual-Spark sweet spot (235B MoE models) is a narrow band.
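The break-even arithmetic in this section is simple enough to sketch directly (the $0.11/kWh electricity rate is an assumed round figure, not from the text):

```python
# Hardware payback: days until the Sparks cost less than the cloud
# spend they actually displace.
def payback_days(hardware_cost: float, daily_cloud_spend: float,
                 replaceable_fraction: float = 1.0) -> float:
    return hardware_cost / (daily_cloud_spend * replaceable_fraction)

print(round(payback_days(7998, 400)))       # 20 days if all spend moves local
print(round(payback_days(7998, 400, 0.5)))  # 40 days if only half does

# Electricity sanity check: two 240W units, 24/7, at an assumed $0.11/kWh.
kwh_year = 2 * 240 / 1000 * 24 * 365  # ~4,205 kWh/year
print(round(kwh_year * 0.11))         # ~$460/year — noise vs. cloud spend
```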

Blog Posts and Analyses Worth Reading

  1. LMSYS DGX Spark In-Depth Review — The most rigorous public benchmark. Finds the RTX Pro 6000 is 4x faster on token generation, and that three used RTX 3090s beat the Spark on 120B decode throughput.
  2. EXO Labs: DGX Spark + Mac Studio — Hybrid cluster (2x Sparks + M3 Ultra Mac Studio) achieves 2.8x speedup via disaggregated prefill/decode. Creative but only tested on 8B models.
  3. On-Premise LLM Deployment Cost-Benefit Analysis — Academic paper with break-even tables. A 32B model breaks even against Claude Opus in 0.3 months on a $2K GPU at 10M tokens/month.
  4. How Should We Buy Compute? — Interactive calculator. Shows a $9,900 Mac Studio M3 Ultra takes 6.4 years to pay off vs. Together AI pricing — but flips much faster vs. Claude/GPT-4o pricing.
  5. Simon Willison on DGX Spark — Honest practitioner notes: ARM64+CUDA creates ecosystem friction, software maturity was rough at launch.
  6. NVIDIA Forums: Dual Spark Performance — Real user benchmarks across multiple models on linked Sparks. The most useful primary source.

Recommended Setup

For a single DGX Spark aiming at Claude Code-like workflows:

  1. Install Ollama — the universal local model runtime
  2. Pull Devstral Small 2 (ollama pull devstral-small:24b) — your daily driver at 48GB BF16
  3. Pull Qwen 2.5 Coder 32B (ollama pull qwen2.5-coder:32b) — alternative with broader ecosystem support
  4. Install OpenCode or Aider — your agent framework
  5. Set num_ctx to at least 16384 in Ollama — the default 4K is inadequate for agentic workflows
  6. Keep a cloud API key handy — for the 20-30% of tasks where local models fall short
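If you'd rather set the context per request than globally, Ollama's HTTP API accepts an options object on /api/generate; a minimal sketch (the num_ctx option and default port are from Ollama's API, the model name and prompt are just examples):

```python
import json

# Build a /api/generate request body that overrides Ollama's 4K default
# context window via options.num_ctx.
def generate_request(model: str, prompt: str, num_ctx: int = 16384) -> dict:
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # per-request context override
    }

payload = generate_request("qwen2.5-coder:32b", "Explain this stack trace.")
print(json.dumps(payload, indent=2))
# POST this to http://localhost:11434/api/generate (Ollama's default port).
```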

Total cost: $3,999 for the hardware, $0/month for inference. At Claude Code's ~$200/month heavy usage, the Spark pays for itself in ~20 months — assuming you can tolerate the quality gap for most tasks.


Sources