Digital Surface Labs

DGX Spark GB10: Resource Limits & Workload Scheduling

Working reference for deciding what can run simultaneously on the DGX Spark, with a focus on the audio-lab pipeline and Tetris training workloads.


1. Hardware Specs

Component      Value
-------------  ----------------------------------------------------------
Superchip      NVIDIA GB10 (Grace Blackwell)
GPU            Blackwell, 20 Streaming Multiprocessors (1,792 CUDA cores)
Tensor Cores   5th-gen, 20 (one per SM)
CPU            20-core Arm (10 Cortex-X925 + 10 Cortex-A725)
Memory         128 GB LPDDR5x unified (shared CPU + GPU pool)
Bandwidth      273 GB/s
Interconnect   NVLink-C2C between CPU and GPU (coherent shared memory)
Compute        1 PFLOP FP4 (with sparsity)
Storage        4 TB NVMe (Founders Edition)
Power          240 W via USB-C

Key takeaways for scheduling

  • 128 GB unified means CPU and GPU draw from the same pool. Docker, the OS, and KV caches eat into GPU-available memory.
  • 20 SMs is modest. For comparison, a desktop RTX 4090 has 128 SMs. This means the Spark is a memory-rich, compute-constrained device. You can load many models, but running them concurrently will fight over SM time.
  • NVLink-C2C means zero-copy CPU-GPU transfers within the unified pool — no PCIe bottleneck for model weights. This is why large models that fit in memory still decode reasonably.
  • 273 GB/s bandwidth determines token generation speed. It is the bottleneck for autoregressive decode (each token needs to read all model weights once).
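The bandwidth bound is easy to sanity-check: each decoded token has to stream every weight once, so tok/s can be at most bandwidth divided by model size. A minimal sketch (the 50% efficiency factor is an assumed fudge for real-world overhead, not a measured number):

```python
def decode_tokens_per_sec(model_gb: float, bandwidth_gbs: float = 273.0,
                          efficiency: float = 0.5) -> float:
    """Rough decode-speed ceiling: each autoregressive token streams all
    model weights from memory once, so tok/s <= bandwidth / model size."""
    return bandwidth_gbs * efficiency / model_gb

print(decode_tokens_per_sec(40.0))  # 70B Q4 (~40 GB): a few tok/s
print(decode_tokens_per_sec(5.0))   # 8B Q4 (~5 GB): tens of tok/s
```

These ballpark figures line up with the decode speeds quoted in the quick reference card below.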

2. Memory Budget: What Each Workload Needs

Model VRAM Footprints

Workload                              Approx. VRAM  Notes
------------------------------------  ------------  ------------------------------------------
Qwen2.5-Omni-3B (audio multimodal)    ~7-8 GB       Includes audio encoder + LM + audio decoder
faster-whisper large-v3 (STT)         ~3 GB         CTranslate2 int8/fp16
F5-TTS (text-to-speech)               ~2-3 GB       Lightweight flow-matching model
Qwen3-TTS 1.7B                        ~3-4 GB       Currently in audio-lab docker-compose
Chatterbox TTS                        ~4 GB         Resemble AI voice cloning
XTTS v2                               ~2 GB         Coqui multi-speaker/multilingual
Ollama — llama3.1:8b (Q4)             ~5 GB         Lightweight general LLM
Ollama — llama3.1:70b (Q4)            ~40 GB        Full-size general LLM
Ollama — Devstral Small 2 (24B BF16)  ~48 GB        Coding agent model
Ollama — Qwen 2.5 Coder 32B (BF16)    ~64 GB        Coding agent model
Screen-Self-Driving Tetris (SSD)      ~8-15 GB      Depends on batch size, replay buffer, CNN

System Overhead

Consumer                     Approx. Memory
---------------------------  --------------
Linux + Docker daemon        ~2-4 GB
CUDA context (per process)   ~0.5-1 GB
KV cache (per active model)  ~1-8 GB
Safe headroom target         ~20 GB

Rule of thumb: keep total loaded model memory under ~80 GB to leave headroom for OS, Docker, CUDA contexts, and KV caches. You can push to ~100 GB if you are careful, but expect OOM under bursty workloads.
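That rule of thumb can be encoded as a pre-flight check before bringing up a new service. A sketch (the workload names and sizes are the estimates from the footprint table above):

```python
TOTAL_GB = 128      # unified memory pool
HEADROOM_GB = 20    # reserve for OS, Docker, CUDA contexts, KV caches

def fits(workloads: dict, headroom_gb: float = HEADROOM_GB) -> bool:
    """True if the resident model weights plus headroom stay inside the pool."""
    return sum(workloads.values()) + headroom_gb <= TOTAL_GB

# Audio pipeline + Devstral Small 2 (Scenario C below): 59 GB loaded -> fits
audio_plus_agent = {"faster-whisper": 3, "llama3.1:8b": 5,
                    "F5-TTS": 3, "devstral-small-2": 48}

# Everything at once (Scenario G below): 112 GB loaded -> over budget
everything = {"tetris": 12, "omni": 8, "chatterbox": 4,
              "llama3.1:70b": 40, "devstral-small-2": 48}
```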


3. What Fits Simultaneously? (ASCII Budget Table)

128 GB Total Unified Memory
================================================================

Scenario A: Audio Pipeline (small models, sequential inference)
----------------------------------------------------------------
| faster-whisper large-v3  |  3 GB  |
| llama3.1:8b (Ollama)     |  5 GB  |
| F5-TTS                   |  3 GB  |
| System overhead          | ~4 GB  |
|                          |--------|
| TOTAL LOADED             | ~15 GB |  <-- FITS EASILY
| Remaining                | 113 GB |
================================================================

Scenario B: Omni all-in-one (STT+LLM+TTS in single model)
----------------------------------------------------------------
| Qwen2.5-Omni-3B          |  8 GB  |
| System overhead          | ~4 GB  |
|                          |--------|
| TOTAL LOADED             | ~12 GB |  <-- FITS EASILY
| Remaining                | 116 GB |
================================================================

Scenario C: Audio Pipeline + Coding Agent
----------------------------------------------------------------
| faster-whisper large-v3   |  3 GB  |
| llama3.1:8b (Ollama)      |  5 GB  |
| F5-TTS                    |  3 GB  |
| Devstral Small 2 (BF16)   | 48 GB  |
| System overhead           | ~6 GB  |
|                           |--------|
| TOTAL LOADED              | ~65 GB |  <-- FITS
| Remaining                 |  63 GB |
================================================================

Scenario D: Audio Pipeline + Big LLM
----------------------------------------------------------------
| faster-whisper large-v3   |  3 GB  |
| llama3.1:70b (Q4, Ollama) | 40 GB  |
| F5-TTS                    |  3 GB  |
| System overhead           | ~6 GB  |
|                           |--------|
| TOTAL LOADED              | ~52 GB |  <-- FITS
| Remaining                 |  76 GB |
================================================================

Scenario E: Tetris Training + Audio Pipeline
----------------------------------------------------------------
| SSD Tetris training       | 12 GB  |  (mid batch size)
| faster-whisper large-v3   |  3 GB  |
| llama3.1:8b (Ollama)      |  5 GB  |
| F5-TTS                    |  3 GB  |
| System overhead           | ~6 GB  |
|                           |--------|
| TOTAL LOADED              | ~29 GB |  <-- FITS IN MEMORY
| BUT: Training saturates SMs — audio inference will crawl
================================================================

Scenario F: Tetris Training + Big Coding Model
----------------------------------------------------------------
| SSD Tetris training       | 15 GB  |  (large batch)
| Qwen 2.5 Coder 32B (BF16) | 64 GB  |
| System overhead           | ~6 GB  |
|                           |--------|
| TOTAL LOADED              | ~85 GB |  <-- TIGHT BUT FITS
| Training + inference will thrash SMs badly
================================================================

Scenario G: Everything at once (DON'T DO THIS)
----------------------------------------------------------------
| SSD Tetris training       | 12 GB  |
| Qwen2.5-Omni-3B           |  8 GB  |
| Chatterbox TTS            |  4 GB  |
| llama3.1:70b (Q4)         | 40 GB  |
| Devstral Small 2          | 48 GB  |
| System overhead           | ~8 GB  |
|                           |--------|
| TOTAL LOADED              |~120 GB |  <-- OOM RISK
| Remaining                 |   8 GB |  <-- no headroom for KV cache
================================================================

4. Can STT + LLM + TTS Run Simultaneously?

Short answer: yes in memory, but think about compute.

Memory: Not a problem

With smaller models (whisper + llama3.1:8b + F5-TTS), total loaded weight is ~10-11 GB out of 128 GB. Even with a 70B LLM in the middle, you are at ~46 GB. Memory is not the constraint.

Compute: The real bottleneck

20 SMs is the limiting factor. Without MPS, the GPU time-slices kernels from different processes rather than running them truly concurrently, and MPS sharing adds its own overhead. When two models try to run GPU kernels at the same time:

  • CUDA default behavior: kernels from different processes queue on the same SMs. One runs while the other waits (or they interleave with context-switch overhead).
  • MPS (Multi-Process Service): allows true SM sharing, but with 20 SMs total, splitting them (e.g., 10/10) cuts each model's throughput roughly in half.
  • Practical effect: if whisper is transcribing while the LLM is generating, both take ~2x longer than running alone.

The right approach: sequential pipeline

The audio pipeline is naturally sequential:

User speaks → [STT] → text → [LLM] → response text → [TTS] → audio

Only one model needs GPU at any given moment. This is the ideal pattern for 20 SMs:

  1. Load all three models into memory (they fit easily).
  2. Run inference sequentially — whisper finishes before LLM starts, LLM finishes before TTS starts.
  3. Each model gets all 20 SMs during its turn, maximizing throughput.
  4. No SM contention, no MPS overhead, no kernel queueing.

This is already how the omni-service (Qwen2.5-Omni-3B) works — it handles STT, reasoning, and TTS in a single forward pass, naturally sequential.
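The sequential pattern can be sketched as follows (the stage classes and lambdas are hypothetical stand-ins; in the real pipeline they would wrap faster-whisper, an Ollama call, and F5-TTS):

```python
class Stage:
    """One pipeline stage. The model stays resident in unified memory;
    it only touches the GPU while run() is executing."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn

    def run(self, x):
        # While this runs, the stage has all 20 SMs to itself.
        return self.fn(x)

def audio_turn(stages, audio):
    """STT -> LLM -> TTS, strictly one stage on the GPU at a time."""
    out = audio
    for stage in stages:
        out = stage.run(out)
    return out

# Stand-in models; swap in real inference calls.
pipeline = [
    Stage("stt", lambda audio: f"transcript({audio})"),
    Stage("llm", lambda text: f"reply({text})"),
    Stage("tts", lambda text: f"speech({text})"),
]
```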

When concurrent matters

The only time concurrent GPU inference matters is if you want to handle multiple users simultaneously (e.g., user A is in the TTS stage while user B is in the STT stage). For a single-user audio pipeline, sequential is strictly better on this hardware.


5. Tetris Training vs. Audio: Don't Mix

GPU training workloads (like SSD Tetris) behave differently from inference:

Property         Inference (audio pipeline)                                Training (SSD Tetris)
---------------  --------------------------------------------------------  --------------------------------------------------
GPU utilization  Bursty (high during forward pass, idle between requests)  Sustained 90-100%
SM usage         All SMs during burst, then releases                       All SMs continuously
Memory pattern   Static (model weights) + dynamic (KV cache)               Static + large gradient buffers + optimizer states
Duration         Milliseconds to seconds                                   Hours to days

The problem: training saturates all 20 SMs continuously. Any inference request during training must wait for SM availability, adding seconds of latency to what should be millisecond operations.

Recommendation: do not run training and audio inference simultaneously. Use Docker profiles to ensure only one workload class is active:

# Audio mode: start audio services, stop training
docker compose --profile omni up -d
# or
docker compose --profile chatterbox up -d

# Training mode: stop audio services, run training
docker compose --profile omni stop
# then start SSD training container

6. Recommendations

Memory management

Guideline                  Value
-------------------------  ------------------------------------
Max loaded model weight    ~80 GB (safe) / ~100 GB (aggressive)
Always reserve for system  ~20 GB minimum
KV cache budget per model  ~1-8 GB depending on context length

For the audio pipeline (audio-lab repo)

  1. Load all models at startup, run inference sequentially. The pipeline is naturally sequential (STT then LLM then TTS). All three small models fit in ~10-15 GB.
  2. Use Qwen2.5-Omni-3B when possible. It handles STT+LLM+TTS in a single model (~8 GB), eliminating inter-model latency.
  3. Keep one TTS model active at a time. The docker-compose already uses profiles (xtts, chatterbox, qwen, omni) — stick with this pattern. Running all TTS models simultaneously wastes memory for no benefit.
  4. Pair with a coding agent if desired. Audio pipeline (~10-15 GB) + Devstral Small 2 (~48 GB) = ~65 GB total. Fits fine. Just avoid running both under heavy concurrent load.

For Ollama LLMs

  1. Use Ollama's automatic model unloading. By default, Ollama keeps a model in memory for 5 minutes after last use, then unloads. This works well for workload switching.
  2. Set OLLAMA_MAX_LOADED_MODELS=1 if memory is tight. Forces unload before loading a new model.
  3. Prefer Q4 quantization for 70B models. Q4_K_M at ~40 GB leaves room for other services. FP16 at ~140 GB does not fit.
  4. For coding: 8B models for quick tasks, 24-32B for quality. The larger models (Qwen 2.5 Coder 32B, Devstral Small 2 24B) are dramatically better but use roughly 10x the memory of an 8B Q4 model.
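The sizing arithmetic behind those numbers: resident weight size is roughly parameter count times bits per weight. A sketch (the ~4.8 bits/weight average for Q4_K_M is an assumption; actual GGUF files vary a little with the quant mix):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate resident weight footprint in GB (ignores KV cache)."""
    return params_billion * bits_per_weight / 8

print(weight_gb(70, 4.8))  # 70B at Q4_K_M: ~42 GB, fits with room to spare
print(weight_gb(70, 16))   # 70B at FP16: 140 GB, exceeds the 128 GB pool
print(weight_gb(24, 16))   # Devstral Small 2 at BF16: 48 GB
```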

For Tetris training (SSD)

  1. Run training in isolation. Stop all audio services and Ollama models before starting a training run. Training needs both memory (8-15 GB for model + gradients + replay buffer) and continuous SM access.
  2. Use smaller batch sizes to cap memory. On 20 SMs, very large batches don't help — the compute can't keep up. Find the batch size where SMs are saturated and stop there.
  3. Monitor with nvidia-smi. Watch for memory creep (Python/PyTorch memory fragmentation) and SM utilization (should be 90%+ during training).
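A minimal polling sketch around nvidia-smi's query mode (the CSV parsing matches the query flags shown; the alert thresholds and loop cadence are left as assumptions to tune):

```python
import subprocess

# One CSV line per GPU: "<MiB used>, <SM utilization %>"
QUERY = ["nvidia-smi",
         "--query-gpu=memory.used,utilization.gpu",
         "--format=csv,noheader,nounits"]

def parse_smi(line: str):
    """Parse one line of the query output into (mib_used, gpu_util_pct)."""
    used, util = (field.strip() for field in line.split(","))
    return int(used), int(util)

def snapshot():
    # Poll once; call in a loop during training to watch for memory creep
    # (rising mib_used) and confirm SM saturation (util near 100).
    return parse_smi(subprocess.check_output(QUERY, text=True))
```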

Docker profiles cheat sheet

# See what's running
docker compose ps

# Audio: Omni mode (STT+LLM+TTS in one model)
docker compose --profile omni up -d

# Audio: Chatterbox TTS + external STT/LLM
docker compose --profile chatterbox up -d

# Audio: XTTS v2
docker compose --profile xtts up -d

# Audio: Qwen3-TTS
docker compose --profile qwen up -d

# Stop everything (before training)
docker compose --profile omni --profile chatterbox --profile xtts --profile qwen down

# Then start training container separately

7. Quick Reference Card

┌─────────────────────────────────────────────────────────────┐
│                  DGX SPARK GB10 RESOURCE MAP                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  MEMORY (128 GB unified)                                    │
│  ├── System/Docker/CUDA overhead .......... ~4-6 GB         │
│  ├── KV cache headroom .................... ~10-15 GB       │
│  ├── Available for models ................. ~107-114 GB     │
│  │                                                          │
│  │   Small audio pipeline ................. ~10-15 GB       │
│  │   + Coding agent (24-32B) .............. +48-64 GB       │
│  │   = Combined ........................... ~60-80 GB  OK   │
│  │                                                          │
│  │   Big LLM (70B Q4) ..................... ~40 GB          │
│  │   + Audio pipeline ..................... +10-15 GB       │
│  │   = Combined ........................... ~55 GB     OK   │
│  │                                                          │
│  │   Tetris training ...................... ~8-15 GB        │
│  │   (run alone for best perf)                              │
│  │                                                          │
│  COMPUTE (20 SMs / 1,792 CUDA cores)                        │
│  ├── Audio pipeline: sequential = full 20 SMs per stage     │
│  ├── Concurrent inference: SMs split, ~2x slowdown each     │
│  ├── Training: saturates all 20 SMs continuously            │
│  └── Training + inference: DON'T — inference stalls         │
│                                                             │
│  BANDWIDTH (273 GB/s)                                       │
│  └── Limits decode speed for large models                   │
│      70B Q4: ~2-3 tok/s  |  8B: ~20-30 tok/s                │
│                                                             │
└─────────────────────────────────────────────────────────────┘

8. Decision Matrix

I want to...                 Memory OK?   Compute OK?  Do it?
---------------------------  -----------  -----------  -------------------------------------
Run audio pipeline alone     Yes (15 GB)  Yes          Yes, ideal workload
Run Omni all-in-one          Yes (8 GB)   Yes          Yes, simplest setup
Audio + coding agent (24B)   Yes (65 GB)  Careful      Yes, but avoid simultaneous inference
Audio + 70B LLM              Yes (55 GB)  Careful      Yes, run LLM inference between audio
Tetris training alone        Yes (15 GB)  Yes          Yes, stop everything else first
Tetris + audio               Yes (29 GB)  No           No — training starves audio SMs
Tetris + coding agent        Tight        No           No — both need sustained SMs
Two big LLMs simultaneously  Maybe        No           Load both, run one at a time
Everything at once           No           No           No — OOM + SM starvation

Last updated: 2026-02-25. Specs based on NVIDIA DGX Spark GB10 with 20-SM Blackwell GPU configuration. Memory estimates are approximate and vary with quantization, context length, and batch size.