DGX Spark GB10: Resource Limits & Workload Scheduling
Working reference for deciding what can run simultaneously on the DGX Spark, with a focus on the audio-lab pipeline and Tetris training workloads.
1. Hardware Specs
| Component | Value |
|---|---|
| Superchip | NVIDIA GB10 (Grace Blackwell) |
| GPU | Blackwell, 20 Streaming Multiprocessors (1,792 CUDA cores) |
| Tensor Cores | 5th-gen, 20 (one per SM) |
| CPU | 10-core ARM Neoverse V2 (Grace) |
| Memory | 128 GB LPDDR5x unified (shared CPU + GPU pool) |
| Bandwidth | 273 GB/s |
| Interconnect | NVLink-C2C between CPU and GPU (coherent shared memory) |
| Compute | 1 PFLOPS (FP4, sparse) |
| Storage | 4 TB NVMe (Founders Edition) |
| Power | 240W USB-C |
Key takeaways for scheduling
- 128 GB unified means CPU and GPU draw from the same pool. Docker, the OS, and KV caches eat into GPU-available memory.
- 20 SMs is modest. For comparison, a desktop RTX 4090 has 128 SMs. This means the Spark is a memory-rich, compute-constrained device. You can load many models, but running them concurrently will fight over SM time.
- NVLink-C2C means zero-copy CPU-GPU transfers within the unified pool — no PCIe bottleneck for model weights. This is why large models that fit in memory still decode reasonably.
- 273 GB/s bandwidth determines token generation speed. It is the bottleneck for autoregressive decode (each token needs to read all model weights once).
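The bandwidth bullet above can be turned into a back-of-envelope ceiling: each generated token must stream the full set of model weights through memory once, so bandwidth divided by weight size bounds tokens per second. A minimal sketch (figures are illustrative estimates, not measurements):

```python
# Back-of-envelope decode ceiling: every autoregressive token reads all
# model weights once, so memory bandwidth caps the token rate.

BANDWIDTH_GB_S = 273  # DGX Spark LPDDR5x bandwidth

def decode_ceiling_tok_s(weights_gb: float, bandwidth_gb_s: float = BANDWIDTH_GB_S) -> float:
    """Theoretical upper bound on autoregressive decode speed (tokens/sec)."""
    return bandwidth_gb_s / weights_gb

# 70B at Q4 (~40 GB of weights): ceiling ~6.8 tok/s. Real throughput lands
# well below the ceiling (KV cache reads, kernel launch overhead, etc.).
print(f"70B Q4 ceiling: {decode_ceiling_tok_s(40):.1f} tok/s")
print(f"8B Q4 ceiling:  {decode_ceiling_tok_s(5):.1f} tok/s")
```

This is why the reference card's real-world numbers (~2-3 tok/s for 70B Q4) sit below the theoretical ceiling rather than near it.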
2. Memory Budget: What Each Workload Needs
Model VRAM Footprints
| Workload | Approx. VRAM | Notes |
|---|---|---|
| Qwen2.5-Omni-3B (audio multimodal) | ~7-8 GB | Includes audio encoder + LM + audio decoder |
| faster-whisper large-v3 (STT) | ~3 GB | CTranslate2 int8/fp16 |
| F5-TTS (text-to-speech) | ~2-3 GB | Lightweight flow-matching model |
| Qwen3-TTS 1.7B | ~3-4 GB | Currently in audio-lab docker-compose |
| Chatterbox TTS | ~4 GB | Resemble AI voice cloning |
| XTTS v2 | ~2 GB | Coqui multi-speaker/multilingual |
| Ollama — llama3.1:8b (Q4) | ~5 GB | Lightweight general LLM |
| Ollama — llama3.1:70b (Q4) | ~40 GB | Full-size general LLM |
| Ollama — Devstral Small 2 (24B BF16) | ~48 GB | Coding agent model |
| Ollama — Qwen 2.5 Coder 32B (BF16) | ~64 GB | Coding agent model |
| Screen-Self-Driving Tetris (SSD) | ~8-15 GB | Depends on batch size, replay buffer, CNN |
System Overhead
| Consumer | Approx. Memory |
|---|---|
| Linux + Docker daemon | ~2-4 GB |
| CUDA context (per process) | ~0.5-1 GB |
| KV cache (per active model) | 1-8 GB |
| Safe headroom target | ~20 GB |
Rule of thumb: keep total loaded model memory under ~80 GB to leave headroom for OS, Docker, CUDA contexts, and KV caches. You can push to ~100 GB if you are careful, but expect OOM under bursty workloads.
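The rule of thumb above is easy to script. A minimal budget checker (model names and sizes are examples from the tables above; the 6 GB overhead default is an assumption, adjust to your stack):

```python
# Check a proposed set of loaded models against the ~80 GB safe /
# ~100 GB aggressive limits from the rule of thumb.

TOTAL_GB = 128
SAFE_LIMIT_GB = 80
AGGRESSIVE_LIMIT_GB = 100

def check_budget(models: dict[str, float], overhead_gb: float = 6.0) -> str:
    loaded = sum(models.values())
    total = loaded + overhead_gb
    if loaded <= SAFE_LIMIT_GB:
        verdict = "safe"
    elif loaded <= AGGRESSIVE_LIMIT_GB:
        verdict = "aggressive: expect OOM under bursty load"
    else:
        verdict = "over budget"
    return f"{loaded:.0f} GB loaded + {overhead_gb:.0f} GB overhead = {total:.0f}/{TOTAL_GB} GB ({verdict})"

# Scenario C (audio pipeline + Devstral Small 2):
print(check_budget({"faster-whisper": 3, "llama3.1:8b": 5,
                    "F5-TTS": 3, "Devstral Small 2": 48}))
```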
3. What Fits Simultaneously? (ASCII Budget Table)
128 GB Total Unified Memory
================================================================
Scenario A: Audio Pipeline (small models, sequential inference)
----------------------------------------------------------------
| faster-whisper large-v3 | 3 GB |
| llama3.1:8b (Ollama) | 5 GB |
| F5-TTS | 3 GB |
| System overhead | ~4 GB |
| |--------|
| TOTAL LOADED | ~15 GB | <-- FITS EASILY
| Remaining | 113 GB |
================================================================
Scenario B: Omni all-in-one (STT+LLM+TTS in single model)
----------------------------------------------------------------
| Qwen2.5-Omni-3B | 8 GB |
| System overhead | ~4 GB |
| |--------|
| TOTAL LOADED | ~12 GB | <-- FITS EASILY
| Remaining | 116 GB |
================================================================
Scenario C: Audio Pipeline + Coding Agent
----------------------------------------------------------------
| faster-whisper large-v3 | 3 GB |
| llama3.1:8b (Ollama) | 5 GB |
| F5-TTS | 3 GB |
| Devstral Small 2 (BF16) | 48 GB |
| System overhead | ~6 GB |
| |--------|
| TOTAL LOADED | ~65 GB | <-- FITS
| Remaining | 63 GB |
================================================================
Scenario D: Audio Pipeline + Big LLM
----------------------------------------------------------------
| faster-whisper large-v3 | 3 GB |
| llama3.1:70b (Q4, Ollama) | 40 GB |
| F5-TTS | 3 GB |
| System overhead | ~6 GB |
| |--------|
| TOTAL LOADED | ~52 GB | <-- FITS
| Remaining | 76 GB |
================================================================
Scenario E: Tetris Training + Audio Pipeline
----------------------------------------------------------------
| SSD Tetris training | 12 GB | (mid batch size)
| faster-whisper large-v3 | 3 GB |
| llama3.1:8b (Ollama) | 5 GB |
| F5-TTS | 3 GB |
| System overhead | ~6 GB |
| |--------|
| TOTAL LOADED | ~29 GB | <-- FITS IN MEMORY
| BUT: Training saturates SMs — audio inference will crawl
================================================================
Scenario F: Tetris Training + Big Coding Model
----------------------------------------------------------------
| SSD Tetris training | 15 GB | (large batch)
| Qwen 2.5 Coder 32B (BF16)| 64 GB |
| System overhead | ~6 GB |
| |--------|
| TOTAL LOADED | ~85 GB | <-- TIGHT BUT FITS
| Training + inference will thrash SMs badly
================================================================
Scenario G: Everything at once (DON'T DO THIS)
----------------------------------------------------------------
| SSD Tetris training | 12 GB |
| Qwen2.5-Omni-3B | 8 GB |
| Chatterbox TTS | 4 GB |
| llama3.1:70b (Q4) | 40 GB |
| Devstral Small 2 | 48 GB |
| System overhead | ~8 GB |
| |--------|
| TOTAL LOADED |~120 GB | <-- OOM RISK
| Remaining | 8 GB | <-- no headroom for KV cache
================================================================
4. Can STT + LLM + TTS Run Simultaneously?
Short answer: they fit in memory easily, but compute is the real constraint.
Memory: Not a problem
With the smaller models (whisper + llama3.1:8b + F5-TTS), total loaded weights come to ~11 GB out of 128 GB. Even with a 70B LLM in the middle, you are at ~46 GB. Memory is not the constraint.
Compute: The real bottleneck
20 SMs is the limiting factor, and they cannot be freely shared across processes: by default, kernels from different CUDA contexts time-slice on the GPU rather than running side by side (MPS can enable true sharing, at some overhead). When two models try to run GPU kernels concurrently:
- CUDA default behavior: kernels from different processes queue on the same SMs. One runs while the other waits (or they interleave with context-switch overhead).
- MPS (Multi-Process Service): allows true SM sharing, but with 20 SMs total, splitting them (e.g., 10/10) cuts each model's throughput roughly in half.
- Practical effect: if whisper is transcribing while the LLM is generating, both take ~2x longer than running alone.
The right approach: sequential pipeline
The audio pipeline is naturally sequential:
User speaks → [STT] → text → [LLM] → response text → [TTS] → audio
Only one model needs GPU at any given moment. This is the ideal pattern for 20 SMs:
- Load all three models into memory (they fit easily).
- Run inference sequentially — whisper finishes before LLM starts, LLM finishes before TTS starts.
- Each model gets all 20 SMs during its turn, maximizing throughput.
- No SM contention, no MPS overhead, no kernel queueing.
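The sequential pattern above can be sketched as a simple turn loop. The `transcribe`/`generate`/`synthesize` bodies below are placeholders, not the real audio-lab service calls; the point is the structure, in which exactly one stage owns the GPU at a time:

```python
# Sketch of the sequential STT -> LLM -> TTS turn loop. The three stage
# functions are stubs standing in for the real model calls (faster-whisper,
# Ollama, F5-TTS); each stage would get all 20 SMs during its turn.

def transcribe(audio: bytes) -> str:
    # placeholder for faster-whisper inference
    return "user utterance"

def generate(prompt: str) -> str:
    # placeholder for the LLM call (e.g. Ollama)
    return f"response to: {prompt}"

def synthesize(text: str) -> bytes:
    # placeholder for TTS synthesis
    return text.encode()

def handle_turn(audio: bytes) -> bytes:
    # Stages run strictly one after another: no concurrent kernels,
    # no SM contention, no MPS overhead.
    text = transcribe(audio)   # STT gets the full GPU
    reply = generate(text)     # then the LLM
    return synthesize(reply)   # then TTS

print(handle_turn(b"...").decode())  # -> response to: user utterance
```

All three models stay resident in the 128 GB pool the whole time; only the kernel execution is serialized.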
This is already how the omni-service (Qwen2.5-Omni-3B) works — it handles STT, reasoning, and TTS in a single forward pass, naturally sequential.
When concurrent matters
The only time concurrent GPU inference matters is if you want to handle multiple users simultaneously (e.g., user A is in the TTS stage while user B is in the STT stage). For a single-user audio pipeline, sequential is strictly better on this hardware.
5. Tetris Training vs. Audio: Don't Mix
GPU training workloads (like SSD Tetris) behave differently from inference:
| Property | Inference (audio pipeline) | Training (SSD Tetris) |
|---|---|---|
| GPU utilization | Bursty (high during forward pass, idle between requests) | Sustained 90-100% |
| SM usage | All SMs during burst, then releases | All SMs continuously |
| Memory pattern | Static (model weights) + dynamic (KV cache) | Static + large gradient buffers + optimizer states |
| Duration | Milliseconds to seconds | Hours to days |
The problem: training saturates all 20 SMs continuously. Any inference request during training must wait for SM availability, adding seconds of latency to what should be millisecond operations.
Recommendation: do not run training and audio inference simultaneously. Use Docker profiles to ensure only one workload class is active:
# Audio mode: start audio services, stop training
docker compose --profile omni up -d
# or
docker compose --profile chatterbox up -d
# Training mode: stop audio services, run training
docker compose --profile omni stop
# then start SSD training container
6. Recommendations
Memory management
| Guideline | Value |
|---|---|
| Max loaded model weight | ~80 GB (safe) / ~100 GB (aggressive) |
| Always reserve for system | ~20 GB minimum |
| KV cache budget per model | 1-8 GB depending on context length |
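The 1-8 GB KV cache range can be estimated from model shape: per token, the cache stores one key and one value vector for every layer and KV head. A sketch using Llama-3.1-8B's published shape (32 layers, 8 KV heads via GQA, head dim 128) as the worked example:

```python
# Rough KV-cache sizing. Per token: 2 (K and V) x layers x kv_heads x
# head_dim x dtype_bytes; multiply by context length and batch size.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, dtype_bytes: int = 2, batch: int = 1) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes  # K + V
    return per_token * seq_len * batch / 1024**3

# llama3.1:8b shape, fp16 cache, 8k context: exactly 1 GiB
print(f"{kv_cache_gb(32, 8, 128, 8192):.2f} GiB")
```

Longer contexts and larger batches scale this linearly, which is why the budget table gives a range rather than a single number.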
For the audio pipeline (audio-lab repo)
- Load all models at startup, run inference sequentially. The pipeline is naturally sequential (STT then LLM then TTS). All three small models fit in ~10-15 GB.
- Use Qwen2.5-Omni-3B when possible. It handles STT+LLM+TTS in a single model (~8 GB), eliminating inter-model latency.
- Keep one TTS model active at a time. The docker-compose already uses profiles (xtts, chatterbox, qwen, omni); stick with this pattern. Running all TTS models simultaneously wastes memory for no benefit.
- Pair with a coding agent if desired. Audio pipeline (~10-15 GB) + Devstral Small 2 (~48 GB) = ~65 GB total. Fits fine. Just avoid running both under heavy concurrent load.
For Ollama LLMs
- Use Ollama's automatic model unloading. By default, Ollama keeps a model in memory for 5 minutes after last use, then unloads. This works well for workload switching.
- Set OLLAMA_MAX_LOADED_MODELS=1 if memory is tight. Forces unload before loading a new model.
- Prefer Q4 quantization for 70B models. Q4_K_M at ~40 GB leaves room for other services. FP16 at ~140 GB does not fit.
- For coding: 8B models for quick tasks, 32B-class for quality. The bigger models (Qwen 2.5 Coder 32B, Devstral Small 2) are dramatically better but use roughly 10x the memory.
For Tetris training (SSD)
- Run training in isolation. Stop all audio services and Ollama models before starting a training run. Training needs both memory (8-15 GB for model + gradients + replay buffer) and continuous SM access.
- Use smaller batch sizes to cap memory. On 20 SMs, very large batches don't help — the compute can't keep up. Find the batch size where SMs are saturated and stop there.
- Monitor with nvidia-smi. Watch for memory creep (Python/PyTorch memory fragmentation) and SM utilization (should be 90%+ during training).
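Memory creep is easier to catch programmatically than by eyeballing nvidia-smi. How you collect samples is up to you (e.g. parsing `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` once a minute, or `torch.cuda.memory_allocated()` in-process); the sketch below only does the trend check, on hypothetical sample data:

```python
# Flag memory creep from a series of used-memory samples (MB).
# Sampling is left to the caller; this only checks for a sustained rise.

def is_creeping(samples_mb: list[float], window: int = 10,
                threshold_mb: float = 500.0) -> bool:
    """True if used memory grew by more than threshold over the last window."""
    if len(samples_mb) < window:
        return False
    recent = samples_mb[-window:]
    return recent[-1] - recent[0] > threshold_mb

steady = [11_000 + (i % 3) * 40 for i in range(20)]  # noise, no trend
leaking = [11_000 + i * 80 for i in range(20)]       # +80 MB per sample
print(is_creeping(steady), is_creeping(leaking))     # -> False True
```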
Docker profiles cheat sheet
# See what's running
docker compose ps
# Audio: Omni mode (STT+LLM+TTS in one model)
docker compose --profile omni up -d
# Audio: Chatterbox TTS + external STT/LLM
docker compose --profile chatterbox up -d
# Audio: XTTS v2
docker compose --profile xtts up -d
# Audio: Qwen3-TTS
docker compose --profile qwen up -d
# Stop everything (before training)
docker compose --profile omni --profile chatterbox --profile xtts --profile qwen down
# Then start training container separately
7. Quick Reference Card
┌─────────────────────────────────────────────────────────────┐
│ DGX SPARK GB10 RESOURCE MAP │
├─────────────────────────────────────────────────────────────┤
│ │
│ MEMORY (128 GB unified) │
│ ├── System/Docker/CUDA overhead .......... ~4-6 GB │
│ ├── KV cache headroom ................... ~10-15 GB │
│ ├── Available for models ................ ~107-114 GB │
│ │ │
│ │ Small audio pipeline ................ ~10-15 GB │
│ │ + Coding agent (24-32B) ............. +48-64 GB │
│ │ = Combined .......................... ~60-80 GB OK │
│ │ │
│ │ Big LLM (70B Q4) ................... ~40 GB │
│ │ + Audio pipeline .................... +10-15 GB │
│ │ = Combined .......................... ~55 GB OK │
│ │ │
│ │ Tetris training ..................... ~8-15 GB │
│ │ (run alone for best perf) │
│ │ │
│ COMPUTE (20 SMs / 1,792 CUDA cores) │
│ ├── Audio pipeline: sequential = full 20 SMs per stage │
│ ├── Concurrent inference: SMs split, ~2x slowdown each │
│ ├── Training: saturates all 20 SMs continuously │
│ └── Training + inference: DON'T — inference stalls │
│ │
│ BANDWIDTH (273 GB/s) │
│ └── Limits decode speed for large models │
│ 70B Q4: ~2-3 tok/s | 8B: ~20-30 tok/s │
│ │
└─────────────────────────────────────────────────────────────┘
8. Decision Matrix
| I want to... | Memory OK? | Compute OK? | Do it? |
|---|---|---|---|
| Run audio pipeline alone | Yes (15 GB) | Yes | Yes, ideal workload |
| Run Omni all-in-one | Yes (8 GB) | Yes | Yes, simplest setup |
| Audio + coding agent (24B) | Yes (65 GB) | Careful | Yes, but avoid simultaneous inference |
| Audio + 70B LLM | Yes (55 GB) | Careful | Yes, run LLM inference between audio |
| Tetris training alone | Yes (15 GB) | Yes | Yes, stop everything else first |
| Tetris + audio | Yes (29 GB) | No | No — training starves audio SMs |
| Tetris + coding agent | Tight | No | No — both need sustained SMs |
| Two big LLMs simultaneously | Maybe | No | Load both, run one at a time |
| Everything at once | No | No | No — OOM + SM starvation |
Last updated: 2026-02-25. Specs based on NVIDIA DGX Spark GB10 with 20-SM Blackwell GPU configuration. Memory estimates are approximate and vary with quantization, context length, and batch size.