<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Evals Are Broken, Use Them Anyway — Ara Khan, Cline</title>
        <link>https://video.ut0pia.org/videos/watch/2e537227-fd85-4f65-93b8-3656ed3a4814</link>
        <description>Cline started at 43% on Terminal Bench. The improvements came from container CPU and memory settings, raised timeouts, and prompt engineering techniques specific to Anthropic model families that do not transfer to Codex or Gemini, not from switching to a better model. Ara Khan's argument is that benchmark numbers are not gospel and vibes are not a system, and that the truth is inconveniently in between. The practical framework: after a run, treat the failures as a portfolio to allocate by sending another agent through all the failure traces to find which small levers actually move the score. Zone one is obvious bugs. Zone two is the nuanced improvements that explain why a model everyone calls great somehow does not work for your specific harness. Zone three is overfitting to the benchmark, which people do, and which Ara is explicitly telling you not to do. Speaker info: https://x.com/arafatkatze, https://www.linkedin.com/in/arafatkatze/, https://github.com/arafatkatze</description>
        <lastBuildDate>Sun, 07 Jun 2026 09:20:48 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>PeerTube - https://video.ut0pia.org</generator>
        <image>
            <title>Evals Are Broken, Use Them Anyway — Ara Khan, Cline</title>
            <url>https://video.ut0pia.org/lazy-static/avatars/0287a09a-aae7-4840-9843-b416426e7046.webp</url>
            <link>https://video.ut0pia.org/videos/watch/2e537227-fd85-4f65-93b8-3656ed3a4814</link>
        </image>
        <copyright>All rights reserved, unless otherwise specified in the terms at https://video.ut0pia.org/about or in licenses granted by each content's rightsholder.</copyright>
        <atom:link href="https://video.ut0pia.org/feeds/video-comments.xml?videoId=2e537227-fd85-4f65-93b8-3656ed3a4814" rel="self" type="application/rss+xml"/>
    </channel>
</rss>