<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Agentic Evaluations at Scale, For Everybody — Nicholas Kang &amp; Michael Aaron, Google DeepMind</title>
        <link>https://video.ut0pia.org/videos/watch/28244658-29c3-48ce-af8c-8e20ec88b56d</link>
        <description>On SWE-Bench Pro, six frontier models land within a couple of percentage points of each other, while the harness they run inside shifts performance by 22%. A competing lab once took a Kaggle benchmark, reran it with its own compaction settings, and published much better results. Neither number was wrong. Both were useless. In this talk, Nicholas Kang and Michael Aaron of Google DeepMind's Kaggle team describe the infrastructure they are building to fix evals at the community level: an open benchmark platform anyone can contribute to, a PvP Game Arena where models play poker and chess for an Elo rating that cannot saturate, and a standardized agent exam that drew more than 500 submissions in its first week without any promotion. The use case they keep coming back to is a wastewater treatment plant engineer from Turkey who built a novel safety benchmark from 20 years of field experience, data that does not exist anywhere else. Speaker info: https://www.linkedin.com/in/nicholaskangjj</description>
        <lastBuildDate>Tue, 26 May 2026 08:31:47 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>PeerTube - https://video.ut0pia.org</generator>
        <image>
            <title>Agentic Evaluations at Scale, For Everybody — Nicholas Kang &amp; Michael Aaron, Google DeepMind</title>
            <url>https://video.ut0pia.org/lazy-static/avatars/0287a09a-aae7-4840-9843-b416426e7046.webp</url>
            <link>https://video.ut0pia.org/videos/watch/28244658-29c3-48ce-af8c-8e20ec88b56d</link>
        </image>
        <copyright>All rights reserved, unless otherwise specified in the terms at https://video.ut0pia.org/about or in licenses granted by each content's rightsholder.</copyright>
        <atom:link href="https://video.ut0pia.org/feeds/video-comments.xml?videoId=28244658-29c3-48ce-af8c-8e20ec88b56d" rel="self" type="application/rss+xml"/>
    </channel>
</rss>