
Digital Surface Labs

Running an Open-Source Claude Code on an NVIDIA DGX Spark

The best local models and agent tools for a $4K desktop AI workstation

The NVIDIA DGX Spark is a $3,999 desktop AI workstation with 128GB of unified memory and a Blackwell GPU. The pitch: run large open-source models locally and stop paying per-token. The question: can you actually replicate the Claude Code experience with open-source tools?

Short answer: you can get 70-80% of the way there for routine coding. The gap is real for complex multi-file reasoning. Here are the best options.


The Hardware: What You're Working With

The DGX Spark (originally "Project DIGITS," renamed at GTC March 2025, shipping since October 2025) packs a GB10 Grace Blackwell superchip into a 1.1-liter box:

| Spec | Value |
|---|---|
| GPU | Blackwell, 6,144 CUDA cores, 5th-gen Tensor Cores |
| Memory | 128GB LPDDR5x unified (shared CPU+GPU) |
| Bandwidth | 273 GB/s |
| Compute | 1 PFLOP (FP4 sparse) |
| Storage | 4TB NVMe (Founders Edition) |
| Power | 240W, external USB-C |
| Size | 5.9" x 5.9" x 2" |
| Price | $3,999 |

The 128GB unified pool is the key number — it determines what models fit. The 273 GB/s memory bandwidth is the bottleneck — it's what makes decode speed slower than Apple Silicon (M3 Ultra gets 800+ GB/s) or discrete GPUs (RTX Pro 6000 gets 1.8 TB/s). Two Sparks can link via ConnectX-7 at 200 Gbps, doubling capacity to 256GB.
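Why bandwidth is the bottleneck can be made concrete with a roofline-style estimate: a memory-bound decoder has to stream every active weight byte per generated token. A minimal sketch (the 273 GB/s and 800 GB/s figures are from the text; treating decode as purely bandwidth-bound is a simplifying assumption):

```python
# Back-of-envelope decode ceiling for a memory-bandwidth-bound decoder:
# each generated token streams every active weight byte from memory, so
# tok/s <= bandwidth / active_weight_bytes. Real throughput lands below
# this (KV cache reads, kernel overhead, scheduling).

def decode_ceiling_tok_s(active_params_b: float, bytes_per_param: float,
                         bandwidth_gb_s: float = 273.0) -> float:
    """Upper bound on decode tokens/sec; 273 GB/s is the Spark's bandwidth."""
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

# A 70B dense model at INT4 (~0.5 bytes/param) on one Spark:
print(round(decode_ceiling_tok_s(70, 0.5), 1))         # 7.8 tok/s ceiling
# The same model with M3 Ultra-class bandwidth (800 GB/s):
print(round(decode_ceiling_tok_s(70, 0.5, 800.0), 1))  # 22.9 tok/s ceiling
```

This is also why MoE models punch above their weight here: only the active parameters are streamed per token, so the ceiling depends on active size, not total size.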

Practical model limits on a single Spark:

  • 70B dense models at FP8 or INT4: comfortable
  • 120B dense models at INT4: fits with headroom
  • 235B MoE models at Q3: tight but works (~12 tok/s)
  • 480B+ MoE models: need aggressive quantization, ~5 tok/s, or two units
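These fit/doesn't-fit calls follow from simple arithmetic: weights take roughly params × bytes-per-param, plus runtime overhead. A sketch (the 16GB cushion is an assumed round number for OS, context, and runtime, not a measured figure):

```python
# Rough weight footprint per quantization level (bytes per parameter).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5, "q3": 0.375}

def weight_gb(params_b: float, quant: str) -> float:
    """Weights only; KV cache and activations come on top."""
    return params_b * BYTES_PER_PARAM[quant]

def fits_on_spark(params_b: float, quant: str,
                  memory_gb: float = 128.0, reserve_gb: float = 16.0) -> bool:
    """Assumed ~16GB cushion for OS, context, and runtime."""
    return weight_gb(params_b, quant) <= memory_gb - reserve_gb

print(weight_gb(24, "bf16"))     # 48.0 GB — matches Devstral Small 2 below
print(weight_gb(123, "int4"))    # 61.5 GB — matches Devstral 2 at INT4
print(fits_on_spark(235, "q3"))  # True — ~88GB of weights, tight but in
```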


Top 3 Recommendations

1. Devstral Small 2 (24B) — Best Bang for the Buck

| Field | Value |
|---|---|
| Developer | Mistral AI |
| Parameters | 24B dense |
| License | Apache 2.0 |
| Context | 256K tokens |
| SWE-bench Verified | 68.0% |
| VRAM (BF16) | ~48GB |
| VRAM (INT4) | ~12GB |

This is the standout. A 24B model hitting 68% on SWE-bench Verified — beating many models 5x its size — is remarkable. It was explicitly built as a software engineering agent, designed to work inside the OpenHands framework.

At BF16 it uses only 48GB of the Spark's 128GB, leaving 80GB for context, OS, and other processes. Decode speed will be fast given the small model size relative to available bandwidth. You could even run a second model alongside it (e.g., a reasoning model for "architect mode").

Mistral also released Mistral Vibe, a native CLI built around Devstral that provides a Claude Code-like terminal experience out of the box.

Why it's #1: Best SWE-bench score per GB of VRAM. Apache 2.0 license. Fits entirely in memory at full precision. Purpose-built for agentic coding. Has its own CLI tool.

2. Qwen 2.5 Coder 32B Instruct — The Proven Workhorse

| Field | Value |
|---|---|
| Developer | Alibaba / Qwen Team |
| Parameters | 32B dense |
| License | Apache 2.0 |
| Context | 128K tokens |
| HumanEval | ~92% |
| Aider Polyglot | 73.7% |
| VRAM (BF16) | ~64GB |
| VRAM (INT4) | ~18-20GB |

The community consensus best-in-class for local coding agents through most of 2025. Scores 73.7% on Aider's polyglot benchmark — competitive with GPT-4o. Has the broadest tool support: works well with Aider, OpenCode, Cline, Goose, and every other major agent framework.

At BF16 it uses 64GB — half the Spark's memory, leaving ample room. At INT4 it's only 18-20GB, which means you could run it alongside other models or services.

The OpenHands team fine-tuned this into OpenHands LM 32B using reinforcement learning on successful agent trajectories, specifically optimizing for multi-step coding workflows. That variant hits 37.2% SWE-bench standalone, 53-60% when paired with the full OpenHands scaffold.

Why it's #2: Most mature ecosystem support. Proven across every major agent tool. Strong benchmarks. Apache 2.0. Battle-tested by millions of developers. The safe choice.

3. Devstral 2 (123B) — Maximum Quality on a Single Spark

| Field | Value |
|---|---|
| Developer | Mistral AI |
| Parameters | 123B dense |
| License | Modified MIT |
| Context | 256K tokens |
| SWE-bench Verified | 72.2% |
| VRAM (BF16) | ~246GB |
| VRAM (INT4) | ~62GB |

The highest SWE-bench score among open-weight models that fit on a single Spark. At INT4 quantization (~62GB), it runs comfortably with room to spare. At INT8 (~123GB) it's borderline but feasible.

72.2% SWE-bench Verified puts it in the same tier as top proprietary models. The 256K context window is generous for large codebase understanding.

Why it's #3: Highest coding quality you can run on a single Spark. The tradeoff is decode speed — at 123B even with INT4, you'll get slower token generation than the smaller models. For complex problems where quality matters more than speed, this is the one.


Honorable Mentions

| Model | Params | Benchmark | Fits on Spark? | Notes |
|---|---|---|---|---|
| Qwen 2.5 Coder 72B | 72B dense | Strong | Yes (FP8/INT4) | Bigger sibling of the 32B; diminishing returns |
| DeepSeek-R1-Distill-Qwen-32B | 32B dense | — | Yes (BF16, 64GB) | Best for hard reasoning problems; slow and verbose |
| Qwen3 235B A22B (MoE) | 235B / 22B active | ~91.5% HumanEval | Tight at Q3 (~112GB) | Fast inference per active param; MoE overhead |
| Llama 3.3 70B | 70B dense | 88.4% HumanEval | Yes (INT4/INT8) | Solid generalist, not a coding specialist |
| Qwen3 Coder 480B (MoE) | 480B / 35B active | 69.6% SWE-bench | Q2 only, ~5 tok/s | Needs two Sparks for practical use |
| Codestral 25.01 | 22B dense | 86.6% HumanEval | Yes (BF16) | Not fully open-source (research license) |

The Agent Layer: CLI Tools That Make It Work

A model alone isn't enough — you need an agent framework that handles file editing, terminal commands, git integration, and multi-step task orchestration. Here are the best options for local models:

OpenCode — The Direct Claude Code Replacement

The most direct open-source answer to Claude Code. A Go-based terminal agent with ~95K GitHub stars and an official GitHub partnership. Nearly identical interaction patterns to Claude Code, but provider-agnostic.

  • Works with Ollama out of the box
  • LSP integration for code-aware context
  • Multi-session support, auto-compaction, session sharing
  • Remote Docker container sessions
  • Head-to-head benchmark: Claude Code completed tasks in 9m 9s vs OpenCode's 16m 20s, but OpenCode generated more thorough output

Best local pairing: Qwen 2.5 Coder 32B or Devstral Small 2

Aider — The Mature Terminal Coding Agent

41K stars, 4.9M pip installs, 15 billion tokens/week. The most battle-tested option.

  • Architect mode: separates reasoning (R1) from editing (Coder) — ideal for a two-model setup
  • Git-aware with automatic commits
  • Auto lint/test with error feedback loops
  • Runs its own polyglot benchmark — the standard for measuring local model coding quality
  • 88% of aider's own code was written by aider

Best local pairing: Qwen 2.5 Coder 32B (73.7% on Aider's benchmark)

Mistral Vibe — Purpose-Built for Devstral

Mistral's native CLI, released December 2025 alongside Devstral 2. End-to-end code automation designed specifically for Devstral models.

Best local pairing: Devstral Small 2 (obviously)

Other Notable Options

  • Goose (Block/Square, 30K stars) — broader automation agent, strong MCP integration, donated to Linux Foundation
  • Cline (fastest-growing GitHub project of 2025) — VS Code-based, not CLI, but excellent with local models
  • Crush (Charmbracelet, 12K stars) — beautiful TUI, mid-session model switching, broad platform support
  • OpenHands (65K stars) — research-grade, best for running autonomous agents on issue backlogs at scale

The Honest Gap

Running local models for coding is real and practical for:

  • Routine code edits, boilerplate, simple functions
  • Code explanation and documentation
  • Single-file refactoring
  • Test generation
  • Quick bug fixes with clear error messages

The gap versus Claude Code with Opus/Sonnet remains for:

  • Complex multi-file architectural reasoning
  • Subtle bug diagnosis requiring deep codebase understanding
  • Long chains of autonomous tool use without drift
  • Strict edit format compliance (smaller models break formatting)
  • Knowledge of latest APIs and frameworks (frozen at training cutoff)

The sweet spot is a hybrid approach: use local models for routine work (free, private, fast for small models), fall back to cloud APIs for hard problems. Every agent tool listed above supports this — you can switch models mid-session or configure different models for different task types.


Two Sparks: Is It Worth It?

Two DGX Sparks link via ConnectX-7 (200 GbE over QSFP, using RDMA/RoCE — not NVLink). Combined: 256GB unified memory. Price: $7,998. The question is whether that extra 128GB unlocks meaningfully better models.

What Two Sparks Actually Unlock

| Model | Single Spark | Dual Spark | Verdict |
|---|---|---|---|
| GPT-OSS 120B (MoE) | 36-52 tok/s | 55-75 tok/s | Meaningful gain (+53-108%) |
| Qwen3 235B A22B (MoE) | ~12 tok/s | 21-25 tok/s | Good gain; best dual-Spark use case |
| MiniMax M2.1 230B (MoE) | Doesn't fit well | 36 tok/s (0 ctx), 22 tok/s (32K ctx) | Genuine sweet spot |
| Qwen3-VL-32B (FP8) | 7 tok/s | 12 tok/s | +71%, but 32B fits fine on one |
| Qwen3 Coder 480B (MoE, FP8) | ~43 tok/s (MoE efficient loading) | Framework bugs, no real gain yet | Surprisingly runs on one Spark |
| Llama 3.1 405B (INT4) | ~2 tok/s (IQ2) | 1.76 tok/s | Worse. NVIDIA cites "insufficient memory headroom." |
| DeepSeek V3/R1 671B | Does not fit | Does not fit | Needs 4+ Sparks or cloud |

The sobering finding: the models that most need two Sparks (405B dense, 671B) still don't run well on two. The 405B generates slower on two Sparks than a 70B on one. DeepSeek 671B doesn't fit at any useful quantization level — even at Q2 (~168GB), you're looking at severe quality degradation. Someone tried 8 DGX Sparks with DeepSeek V3 via vLLM+Ray and hit OOM errors.

The models that actually benefit are MoE models in the 120B-235B total parameter range. Those are good models, but they also run on a single Spark (just slower or at lower quantization).

The Interconnect Reality

The ConnectX-7 link is 200 Gbps aggregate — but ServeTheHome found each QSFP port is limited to PCIe Gen 5 x4, so real-world bandwidth is ~90-100 Gbps per direction. Compare to NVLink in datacenter systems at 900 GB/s bidirectional. This means:

  • Larger models scale better — all_reduce overhead is small relative to compute
  • Smaller models barely benefit — Qwen3-30B MoE gained only 17% on dual Spark
  • Software maturity is rough — users report triton allocator errors, speculative decoding crashes at high concurrency, and significant performance differences between container versions
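The scaling bullets above can be sanity-checked with a rough tensor-parallel traffic model. All numbers below are illustrative assumptions (layer counts, hidden sizes, two all_reduces per layer), not measurements of any specific model:

```python
# Per-token all_reduce traffic in 2-way tensor parallelism scales with
# layers * hidden_size, while per-token decode time scales with total
# weight bytes. So the roughly fixed comm cost is a smaller share of a
# bigger model's (longer) step time. This ignores per-message latency,
# which is what actually hurts small models most.

def comm_ms_per_token(layers: int, hidden: int, bytes_per_elem: int = 2,
                      link_gbps: float = 100.0,
                      reduces_per_layer: int = 2) -> float:
    bytes_per_token = layers * reduces_per_layer * hidden * bytes_per_elem
    return bytes_per_token / (link_gbps * 1e9 / 8) * 1e3

# Hypothetical 30B-class vs 120B-class configs over the ~100 Gbps link:
small = comm_ms_per_token(layers=48, hidden=6144)  # ~0.09 ms/token
large = comm_ms_per_token(layers=64, hidden=8192)  # ~0.17 ms/token
# At 12 tok/s (83 ms/token) that's ~0.2% overhead; at 50 tok/s
# (20 ms/token) it's ~0.5% — plus the unmodeled latency cost.
print(round(small, 2), round(large, 2))
```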

The $400/Day Cloud vs. Two Sparks Decision

At $400/day in cloud compute ($12,000/month, $146,000/year), two Sparks at $7,998 pay for themselves in 20 days — if they can replace the spend. Here's the reality check:

What $400/day buys in cloud tokens:

| Provider | Model | Tokens/day for $400 |
|---|---|---|
| Anthropic | Claude Opus 4.6 | ~50M |
| Anthropic | Claude Sonnet 4.5 | ~89M |
| Together AI | DeepSeek V3 | ~320M |
| Groq | Llama 3.3 70B | ~606M |
| DeepSeek API | DeepSeek V3 | ~600-700M |

For context, Claude Code averages $6/day for a typical developer. $400/day is 67x that — a very heavy workload or multi-agent setup.

What two Sparks can replace:

  • Routine agentic tasks (file reads, code edits, boilerplate) — a 70B model locally is roughly Sonnet-quality for these. If this is 50% of your spend, payback is 40 days.
  • Complex reasoning, architecture, hard debugging — frontier models (Opus, o1) are genuinely better here. Local 70B-120B models miss things and require more back-and-forth.

What two Sparks cannot replace:

  • DeepSeek 671B quality — you'd need 8x H100s in the cloud (~$383/day on RunPod) to run the AWQ-quantized version
  • Claude Opus reasoning quality — no open-source model at any size matches it on hard problems
  • Speed on large models — Llama 70B decodes at 2.7 tok/s on a Spark vs. instant responses from cloud APIs

Electricity is negligible: Two Sparks at full load 24/7 cost ~$460/year. Noise compared to $146K/year in cloud spend.

The hybrid recommendation: Buy two Sparks ($8K), route routine tasks locally, keep a ~$100-150/day cloud budget for frontier-quality work. Or skip the second Spark — a single unit handles most practical models, and the dual-Spark sweet spot (235B MoE models) is a narrow band.
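The break-even arithmetic in this section is simple enough to sketch directly (the $0.11/kWh electricity rate is an assumed round figure, not from the text):

```python
# Hardware payback: days until the Sparks cost less than the cloud
# spend they actually displace.
def payback_days(hardware_cost: float, daily_cloud_spend: float,
                 replaceable_fraction: float = 1.0) -> float:
    return hardware_cost / (daily_cloud_spend * replaceable_fraction)

print(round(payback_days(7998, 400)))       # 20 days if all spend moves local
print(round(payback_days(7998, 400, 0.5)))  # 40 days if only half does

# Electricity sanity check: two 240W units, 24/7, at an assumed $0.11/kWh.
kwh_year = 2 * 240 / 1000 * 24 * 365  # ~4,205 kWh/year
print(round(kwh_year * 0.11))         # ~$460/year — noise vs. cloud spend
```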

Blog Posts and Analyses Worth Reading

  1. LMSYS DGX Spark In-Depth Review — The most rigorous public benchmark. Finds the RTX Pro 6000 is 4x faster on token generation, and that three used RTX 3090s beat the Spark on 120B decode throughput.
  2. EXO Labs: DGX Spark + Mac Studio — Hybrid cluster (2x Sparks + M3 Ultra Mac Studio) achieves 2.8x speedup via disaggregated prefill/decode. Creative but only tested on 8B models.
  3. On-Premise LLM Deployment Cost-Benefit Analysis — Academic paper with break-even tables. A 32B model breaks even against Claude Opus in 0.3 months on a $2K GPU at 10M tokens/month.
  4. How Should We Buy Compute? — Interactive calculator. Shows a $9,900 Mac Studio M3 Ultra takes 6.4 years to pay off vs. Together AI pricing — but flips much faster vs. Claude/GPT-4o pricing.
  5. Simon Willison on DGX Spark — Honest practitioner notes: ARM64+CUDA creates ecosystem friction, software maturity was rough at launch.
  6. NVIDIA Forums: Dual Spark Performance — Real user benchmarks across multiple models on linked Sparks. The most useful primary source.

Recommended Setup

For a single DGX Spark aiming at Claude Code-like workflows:

  1. Install Ollama — the universal local model runtime
  2. Pull Devstral Small 2 (ollama pull devstral-small:24b) — your daily driver at 48GB BF16
  3. Pull Qwen 2.5 Coder 32B (ollama pull qwen2.5-coder:32b) — alternative with broader ecosystem support
  4. Install OpenCode or Aider — your agent framework
  5. Set num_ctx to at least 16384 in Ollama — the default 4K is inadequate for agentic workflows
  6. Keep a cloud API key handy — for the 20-30% of tasks where local models fall short
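If you'd rather set the context per request than globally, Ollama's HTTP API accepts an options object on /api/generate; a minimal sketch (the num_ctx option and default port are from Ollama's API, the model name and prompt are just examples):

```python
import json

# Build a /api/generate request body that overrides Ollama's 4K default
# context window via options.num_ctx.
def generate_request(model: str, prompt: str, num_ctx: int = 16384) -> dict:
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # per-request context override
    }

payload = generate_request("qwen2.5-coder:32b", "Explain this stack trace.")
print(json.dumps(payload, indent=2))
# POST this to http://localhost:11434/api/generate (Ollama's default port).
```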

Total cost: $3,999 for the hardware, $0/month for inference. At Claude Code's ~$200/month heavy usage, the Spark pays for itself in ~20 months — assuming you can tolerate the quality gap for most tasks.


Sources