Everyone is writing skills, and many of them are AI-generated. Almost nobody is testing them: at best, a skill gets "vibe-checked" with a handful of manual runs, then shipped.
## You wouldn't ship code without tests, so why ship skills without evals?
As we move from simple chat interfaces to autonomous AI agents, equipping LLMs with tools (APIs, functions) has become the new standard.
This talk tackles the critical missing piece in Augmented Development: Skill Engineering and its evaluation. We will move past the "vibe check" and dive into the industry best practices and methodologies required to build robust, measurable agents.
## What you will learn in this 30-minute session:
- LLM-friendly design: How to write semantic schemas and tool descriptions that models actually understand, reducing baseline errors (see the schema sketch after this list).
- TDD for AI agents: How to define success criteria and build automated tests for non-deterministic systems.
- The evals playbook: Measuring what matters by focusing on routing accuracy (did it pick the right tool?) and argument accuracy (are the parameters valid?), including how to leverage "LLM-as-a-Judge" (both sketched below).
- Continuous refinement: Using failed evals and production telemetry to iteratively improve your skill prompts without touching the underlying business logic.
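To make the first point concrete, here is a minimal sketch of an LLM-friendly tool definition, assuming an OpenAI-style function-calling schema. The `get_invoice` skill, its fields, and the ID pattern are all hypothetical, for illustration only.

```python
# Illustrative only: an OpenAI-style function-calling schema for a
# hypothetical "get_invoice" skill. The model routes and fills
# arguments based on these strings alone, so semantic clarity matters.
GET_INVOICE_TOOL = {
    "name": "get_invoice",
    "description": (
        "Retrieve a single invoice by its ID. Use this when the user "
        "asks about the status, amount, or details of one specific "
        "invoice. Do NOT use this for searching or listing invoices."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "invoice_id": {
                "type": "string",
                "description": "Invoice identifier, e.g. 'INV-2024-0042'.",
                "pattern": "^INV-[0-9]{4}-[0-9]{4}$",
            },
        },
        "required": ["invoice_id"],
    },
}
```

Note how the description states both when to use the tool and when not to; in practice, routing errors often trace back to vague or overlapping descriptions.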
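The second and third points come together in an automated eval loop. Below is a sketch of one that measures routing accuracy and argument accuracy; `call_agent` is a hypothetical helper that replays one prompt through your agent and returns the tool call it chose, and the test cases are invented examples.

```python
# A minimal eval harness sketch. `call_agent` is a hypothetical helper
# that sends one prompt to your agent and returns the (tool_name, args)
# it selected; swap in your own client code.
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str            # user input to replay
    expected_tool: str     # which skill should be routed to
    expected_args: dict    # the arguments we expect it to extract

CASES = [
    EvalCase("What's the status of invoice INV-2024-0042?",
             "get_invoice", {"invoice_id": "INV-2024-0042"}),
    EvalCase("Show me all unpaid invoices from March",
             "search_invoices", {"status": "unpaid", "month": 3}),
]

def run_evals(call_agent, cases=CASES):
    routed, args_ok = 0, 0
    for case in cases:
        tool_name, args = call_agent(case.prompt)
        if tool_name == case.expected_tool:
            routed += 1
            # Argument accuracy only counts when routing was correct.
            if args == case.expected_args:
                args_ok += 1
    print(f"routing accuracy:  {routed}/{len(cases)}")
    print(f"argument accuracy: {args_ok}/{len(cases)}")
```

Because agent outputs are non-deterministic, a common refinement is to run each case several times and report a pass rate rather than a single pass/fail.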
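For free-text outputs that exact-match assertions can't grade, "LLM-as-a-Judge" substitutes a model for the assertion. A sketch follows, assuming the OpenAI Python SDK (v1+); the rubric, model choice, and pass/fail protocol are illustrative, not prescribed.

```python
# "LLM-as-a-Judge" sketch: grade an agent's free-text answer against a
# reference answer. Assumes the OpenAI Python SDK; rubric is illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI agent's answer.
Question: {question}
Agent answer: {answer}
Reference answer: {reference}
Reply with exactly one word: PASS if the agent answer is factually
consistent with the reference, FAIL otherwise."""

def judge(question: str, answer: str, reference: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # keep the judge as repeatable as possible
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer=answer, reference=reference)}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```

Pinning temperature to 0 and constraining the verdict to a single word keeps the judge cheap and reasonably consistent across runs.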
## Stop guessing whether your agents work.
Join this talk to learn how to test, measure, and refine your AI skills with the same rigor as traditional software engineering.