<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Running LLMs Locally: Practical LLM Performance on DGX Spark — Mozhgan Kabiri Chimeh, NVIDIA</title>
        <link>https://video.ut0pia.org/videos/watch/45b940e3-be8e-4b1c-9bb6-1f46fe36cd14</link>
        <description>Moving LLM workloads from the cloud to local infrastructure requires a shift in engineering strategy. In this talk, I share my journey of serving and benchmarking open-source models (1.5B to 14B parameters) on an NVIDIA DGX Spark workstation. Using a reproducible methodology built on vLLM, I analyze real-world trade-offs in throughput and latency, as well as the benefits of the 128GB Grace Blackwell unified memory architecture. You will leave with a clear framework for sizing local models, an understanding of the performance of quantization formats such as NVFP4, and a guide for deciding when local compute is the right choice for your AI stack. Speaker info: LinkedIn https://www.linkedin.com/in/mozhgankch/</description>
        <lastBuildDate>Fri, 10 Apr 2026 16:14:48 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>PeerTube - https://video.ut0pia.org</generator>
        <image>
            <title>Running LLMs Locally: Practical LLM Performance on DGX Spark — Mozhgan Kabiri Chimeh, NVIDIA</title>
            <url>https://video.ut0pia.org/lazy-static/avatars/0287a09a-aae7-4840-9843-b416426e7046.webp</url>
            <link>https://video.ut0pia.org/videos/watch/45b940e3-be8e-4b1c-9bb6-1f46fe36cd14</link>
        </image>
        <copyright>All rights reserved, unless otherwise stated in the terms at https://video.ut0pia.org/about or in any licenses granted by each content's rights holder.</copyright>
        <atom:link href="https://video.ut0pia.org/feeds/video-comments.xml?videoId=45b940e3-be8e-4b1c-9bb6-1f46fe36cd14" rel="self" type="application/rss+xml"/>
    </channel>
</rss>