
DGX Spark: Use It for Fine-Tuning

Hi Everyone,

In this edition of The Weekly Kaitchup, I’ll discuss only one topic: The DGX Spark.


NVIDIA’s DGX Spark Isn’t an Inference Box

Earlier this year, NVIDIA announced the “DIGITS” project, now commercialized as the DGX Spark: a compact, all-in-one “AI” box built around a GB10 Grace Blackwell (arm64) chip with 128 GB of unified LPDDR5x memory, aimed at local AI workloads.

The “GPU” performance is comparable to an RTX 5070/5070 Ti, which sounds limited. The generous 128 GB of unified memory helps fit larger models, but the ~273 GB/s memory bandwidth is obviously the main bottleneck.
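
To see why the bandwidth matters so much, here is a back-of-envelope sketch (a rough roofline estimate, not a benchmark). During decoding, every generated token has to stream the active model weights through memory, so decode speed is capped at bandwidth divided by bytes read per token. The model footprints below are illustrative assumptions, not measurements:

```python
# Rough upper bound on decode throughput for a memory-bound LLM:
# each generated token streams the (active) weights from memory, so
# tokens/s <= bandwidth / bytes_per_token. Real throughput is lower
# (KV-cache reads, activations, and kernel overheads are ignored).

BANDWIDTH_GB_S = 273.0  # DGX Spark's unified LPDDR5x, ~273 GB/s

def max_decode_tps(weights_gb: float, bandwidth_gb_s: float = BANDWIDTH_GB_S) -> float:
    """Bandwidth-bound ceiling on tokens/second."""
    return bandwidth_gb_s / weights_gb

# Hypothetical weight footprints, for illustration only:
for label, gb in [("8B params, FP16 (~16 GB)", 16.0),
                  ("8B params, 4-bit (~4.5 GB)", 4.5),
                  ("70B params, 4-bit (~40 GB)", 40.0)]:
    print(f"{label}: <= {max_decode_tps(gb):.0f} tok/s")
```

Even these optimistic ceilings show that large dense models will decode slowly on this box, whatever the compute figures say.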

NVIDIA and partners highlight “1 PFLOP of sparse FP4 tensor performance,” a marketing figure that depends on low-precision FP4 (MXFP4/NVFP4). FP4 is still niche in practice, so that metric won’t map cleanly to most real-world workloads today, though that could change over the next year as FP4 support matures.
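
To put the headline figure in perspective, here is a hedged back-of-envelope conversion. It assumes the 2:1 structured-sparsity doubling and the roughly 2× throughput drop per wider precision step that are typical of recent NVIDIA tensor cores; NVIDIA hasn’t published all of these numbers for the GB10, so treat the result as an estimate:

```python
# What "1 PFLOP sparse FP4" might imply at other precisions, assuming
# (a) 2:1 structured sparsity doubles the quoted rate, and
# (b) throughput roughly halves at each wider precision step.
# Both are typical of recent NVIDIA tensor cores, but are assumptions here.

sparse_fp4_tflops = 1000.0            # the marketing figure: 1 PFLOP
dense_fp4 = sparse_fp4_tflops / 2     # strip the sparsity doubling
dense_fp8 = dense_fp4 / 2             # one precision step wider
dense_bf16 = dense_fp8 / 2            # what most workloads use today

print(f"dense FP4:  ~{dense_fp4:.0f} TFLOPS")
print(f"dense FP8:  ~{dense_fp8:.0f} TFLOPS")
print(f"dense BF16: ~{dense_bf16:.0f} TFLOPS")
```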

NVIDIA sent early-access units to teams behind the most-used inference engines, including Ollama, LMSYS, llama.cpp, LM Studio, and vLLM, among others. Most of them have already published their reviews.

Some notes before we dive in:

  • Most inference stacks depend on PyTorch, whose arm64 support is still inconsistent. As someone who uses the GH200 (also arm64) a lot, I can attest that there are still clear gaps. The release of PyTorch 2.9 this week should improve support (see the sanity-check snippet after this list).

  • Key frameworks, vLLM included, only recently began publishing arm64 wheels and documentation, so the ecosystem is still maturing. Unsloth now provides a Docker container.

  • Published inference results should improve as kernels, compilers, and runtimes (PyTorch, Triton, CUDA/Transformer Engine) are optimized for arm64/Grace Blackwell; I expect the published numbers to keep climbing over the coming months.
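
If you want to check what your own arm64 stack reports, the minimal snippet below uses only standard-library and core PyTorch calls to print the architecture, the installed PyTorch build, and whether CUDA is actually usable:

```python
# Quick sanity check of a PyTorch install on an arm64 + CUDA machine.
import platform

import torch

print("machine:", platform.machine())          # expect 'aarch64' on GB10/GH200
print("torch:", torch.__version__)             # e.g., a 2.9.x wheel
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("cuda runtime:", torch.version.cuda)
```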

Let’s start with the negative points.

This review by LMSYS (the people behind SGLang) is one of the earliest and most complete looks at inference on the DGX Spark:

NVIDIA DGX Spark In-Depth Review: A New Standard for Local AI Inference

LMSYS measured the following throughput, reported as prefill/decode in tokens per second (tps):

For example, running GPT-OSS 20B (MXFP4) in Ollama, the Spark achieved 2,053 tps prefill / 49.7 tps decode, whereas the RTX Pro 6000 Blackwell reached 10,108 tps / 215 tps, roughly 4-5× faster. Even the GeForce RTX 5090 delivered 8,519 tps / 205 tps, confirming that the Spark’s unified LPDDR5x memory bandwidth is the main limiting factor.
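
To make the gap concrete, the speedups implied by those numbers can be computed directly (only the figures quoted above are used):

```python
# Speedups over the DGX Spark implied by the LMSYS numbers quoted above
# (GPT-OSS 20B, MXFP4, running in Ollama).
spark = {"prefill": 2053, "decode": 49.7}
others = {
    "RTX Pro 6000 Blackwell": {"prefill": 10108, "decode": 215},
    "GeForce RTX 5090": {"prefill": 8519, "decode": 205},
}

for name, tps in others.items():
    prefill_x = tps["prefill"] / spark["prefill"]
    decode_x = tps["decode"] / spark["decode"]
    print(f"{name}: {prefill_x:.1f}x prefill, {decode_x:.1f}x decode")
# -> ~4.9x/4.3x and ~4.1x/4.1x: the gap tracks memory bandwidth.
```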

GPT-OSS seems like a good target model for the DGX Spark as it is “natively” MXFP4, so it is hardware accelerated… Yet, it’s also a

...
Read full article on The Kaitchup →