<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Running LLMs Locally: Practical LLM Performance on DGX Spark — Mozhgan Kabiri Chimeh, NVIDIA</title>
        <link>https://video.ut0pia.org/videos/watch/45b940e3-be8e-4b1c-9bb6-1f46fe36cd14</link>
        <description>Moving LLM workloads from the cloud to local infrastructure requires a shift in engineering strategy. In this talk, I share my journey of serving and benchmarking open-source models (1.5B to 14B parameters) on an NVIDIA DGX Spark workstation. Using a reproducible methodology built on vLLM, I analyze real-world trade-offs in throughput and latency, as well as the benefits of the 128GB Grace Blackwell unified memory architecture. You will leave with a clear framework for sizing local models, an understanding of the performance of quantization formats such as NVFP4, and a guide for deciding when local compute is the right choice for your AI stack. Speaker info: LinkedIn https://www.linkedin.com/in/mozhgankch/</description>
        <lastBuildDate>Fri, 10 Apr 2026 16:14:48 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>PeerTube - https://video.ut0pia.org</generator>
        <image>
            <title>Running LLMs Locally: Practical LLM Performance on DGX Spark — Mozhgan Kabiri Chimeh, NVIDIA</title>
            <url>https://video.ut0pia.org/lazy-static/avatars/0287a09a-aae7-4840-9843-b416426e7046.webp</url>
            <link>https://video.ut0pia.org/videos/watch/45b940e3-be8e-4b1c-9bb6-1f46fe36cd14</link>
        </image>
        <copyright>All rights reserved, unless otherwise stated in the terms at https://video.ut0pia.org/about or in any licenses granted by each content's rights holder.</copyright>
        <atom:link href="https://video.ut0pia.org/feeds/video-comments.xml?videoId=45b940e3-be8e-4b1c-9bb6-1f46fe36cd14" rel="self" type="application/rss+xml"/>
    </channel>
</rss>