Evals Are Unit Tests for Vibes
You can't assert on 'sounds helpful but not condescending.' After thousands of evals, here's what actually works — and the trap that wastes everyone's time.
I've been writing tests for twenty years. Unit tests, integration tests, end-to-end tests, regression tests, load tests. I've maintained test suites with thousands of assertions. I've debugged flaky CI pipelines at midnight. I've argued about code coverage thresholds in more standups than I care to remember.
None of it prepared me for evaluating LLM outputs.
The fundamental problem is this: you can't write assertEquals("helpful but not condescending", response.tone). There is no assertion library for vibes. And yet vibes are exactly what determine whether your AI feature works or doesn't. A response can be factually correct, properly formatted, delivered in under 200ms, and still feel completely wrong. The user won't file a bug report. They'll just stop using it.
The three tiers of eval difficulty
After building eval suites for about a dozen AI features over the past year, I've landed on a mental model that saves me from wasting time. There are three tiers of difficulty, and you need to be honest about which tier you're operating in.
Tier 1: Factual extraction. Did the model pull the right date from the document? Is the extracted email address valid? Does the summary contain the three key points from the source material? This is the closest thing to traditional testing. You have ground truth, you can compare against it, and you can automate the whole thing. If your AI feature lives in this tier, count your blessings. Write your evals, run them in CI, sleep at night.
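A Tier 1 eval can look almost exactly like a unit test. Here is a minimal sketch, where `extract_invoice_date` is a hypothetical stand-in for the model call and the ground-truth pairs are made up:

```python
# Tier 1: compare model extractions against known ground truth.
# `extract_invoice_date` is a hypothetical stand-in for the model call.
def extract_invoice_date(document: str) -> str:
    # Placeholder logic; in practice this would invoke the model.
    return document.split("Date: ")[1].split("\n")[0]

GROUND_TRUTH = [
    ("Invoice #12\nDate: 2024-03-01\nTotal: $40", "2024-03-01"),
    ("Invoice #13\nDate: 2024-03-05\nTotal: $12", "2024-03-05"),
]

def run_tier1_evals() -> float:
    # Fraction of documents where the extraction matches exactly.
    passed = sum(
        extract_invoice_date(doc) == expected
        for doc, expected in GROUND_TRUTH
    )
    return passed / len(GROUND_TRUTH)

print(run_tier1_evals())  # 1.0 on this toy set
```

Because there is unambiguous ground truth, this runs in CI like any other test.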
Tier 2: Classification and structured decisions. Did the model categorise this support ticket correctly? Did it assign the right sentiment label? Is the priority score in the right ballpark? This is harder because there's often legitimate disagreement — a ticket can be both a "billing issue" and an "account access" problem. But you can still build eval sets with human-labelled ground truth and measure agreement rates. You won't get 100% and that's fine. You're aiming for "at least as good as the average human doing this task." I run these weekly and track trends. A 2% drop in classification accuracy tells me something changed — maybe the model updated, maybe the input distribution shifted, maybe I broke something.
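The weekly agreement-rate check is simple arithmetic once you have human labels. A sketch with illustrative ticket categories:

```python
# Tier 2: measure agreement between model labels and human labels.
# The labels below are made up for illustration.
human_labels = ["billing", "access", "billing", "bug", "access", "billing"]
model_labels = ["billing", "access", "bug", "bug", "access", "billing"]

def agreement_rate(a: list[str], b: list[str]) -> float:
    assert len(a) == len(b)
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)

rate = agreement_rate(human_labels, model_labels)
print(f"{rate:.1%}")  # 83.3% on this toy set
```

Track this number over time rather than in isolation; a sudden drop is the signal, whatever the absolute level.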
Tier 3: Open-ended generation. Is this email draft good? Is this product description compelling? Does this conversational response feel natural? Welcome to hell. This is where most teams waste months building eval systems that measure the wrong thing entirely.
The trap
Here's the trap, and I fell straight into it: when you can't easily measure quality, you measure what you can. Format compliance. Response length. Keyword presence. JSON validity. Latency. You build a dashboard full of green checkmarks and convince yourself your AI feature is working great.
Meanwhile, users are quietly churning because the responses sound like they were written by a corporate chatbot from 2019.
I did this with a content generation feature last year. My eval suite had forty-seven assertions. Every one checked something measurable — word count within range, no markdown in plain text fields, required sections present, reading level between grades 8 and 12. The suite passed at 94%. I was proud of that number.
47 assertions · 94% pass rate · 15/50 actually bad on human review
Then I sat down and actually read fifty outputs end to end. Like a human. Like a user would. About fifteen of them were genuinely bad. Not bad in a way any of my assertions caught — bad in a way that made me wince. Repetitive phrasing. Awkward transitions. That unmistakable AI cadence where every paragraph starts with a different conjunction. Technically compliant, substantively hollow.
Forty-seven assertions measuring format. Zero measuring whether anyone would actually want to read this stuff.
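For the record, those format-only checks looked roughly like this (a simplified sketch with made-up thresholds). Note how a hollow, repetitive output sails straight through:

```python
def passes_format_checks(output: str) -> bool:
    # Proxy metrics: all measurable, none about quality.
    word_count_ok = 50 <= len(output.split()) <= 500
    no_markdown = "##" not in output and "**" not in output
    has_required_section = "Summary:" in output
    return word_count_ok and no_markdown and has_required_section

# A hollow, repetitive output that satisfies every assertion.
hollow = "Summary: " + "This product is great and useful. " * 20
print(passes_format_checks(hollow))  # True: green checkmark, bad output
```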
What actually works
After that wake-up call, I rebuilt my approach. Here's where I've landed.
Every eval suite needs at least one metric tied to a real user outcome. Not a proxy. An actual outcome. For the content generation feature, that became "percentage of outputs that the user published without editing more than two sentences." For a support response feature, it was "percentage of conversations resolved without escalation." These are hard to measure and slow to collect. I don't care. They're the only numbers that tell you whether the thing is working.
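Computing an outcome metric is trivial once the events exist; the hard part is instrumenting the product to collect them. A sketch with illustrative event fields:

```python
# Outcome metric: share of outputs published with at most 2 edited
# sentences. The event records and field names are illustrative.
events = [
    {"published": True, "sentences_edited": 0},
    {"published": True, "sentences_edited": 5},
    {"published": False, "sentences_edited": 0},
    {"published": True, "sentences_edited": 2},
]

def outcome_rate(events: list[dict], max_edits: int = 2) -> float:
    wins = sum(e["published"] and e["sentences_edited"] <= max_edits
               for e in events)
    return wins / len(events)

print(outcome_rate(events))  # 0.5
```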
Use LLM-as-judge for tier 3, but calibrate it against humans first. I run a panel of human raters on a sample set, establish what "good" looks like, then prompt a judge model to approximate that standard. It's not perfect. It disagrees with humans about 20% of the time. But it disagrees with humans less than humans disagree with each other, which was a humbling discovery. The key is recalibrating regularly — the judge drifts, the humans' expectations shift, the inputs change.
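The calibration check itself is straightforward: compare the judge's agreement with each human rater against the humans' agreement with each other. A sketch with made-up binary verdicts on ten sample outputs:

```python
from itertools import combinations

# Binary good/bad verdicts on the same 10 outputs; all ratings are
# made up for illustration.
humans = {
    "rater_a": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "rater_b": [1, 0, 0, 1, 0, 1, 1, 1, 1, 1],
    "rater_c": [1, 1, 0, 1, 1, 1, 0, 0, 1, 1],
}
judge = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]

def agreement(a: list[int], b: list[int]) -> float:
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Judge vs each human, and humans vs each other.
judge_vs_humans = [agreement(judge, h) for h in humans.values()]
human_vs_human = [agreement(a, b)
                  for a, b in combinations(humans.values(), 2)]

print(f"{sum(judge_vs_humans) / len(judge_vs_humans):.1%}")  # 76.7%
print(f"{sum(human_vs_human) / len(human_vs_human):.1%}")    # 73.3%
```

On this toy data the judge agrees with humans slightly more than humans agree with each other, which is the bar worth checking for.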
Track distributions, not averages. An average quality score of 7.2 out of 10 is meaningless. What I want to know is: what percentage of outputs scored below 5? That's where the damage lives. One terrible response erases the goodwill of ten great ones. I learned this in marketing years ago — your NPS score doesn't matter if 8% of customers are having a catastrophic experience. Same principle.
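A toy illustration of why the average hides the damage: two batches with identical mean scores, one of which is quietly hurting users.

```python
# Two batches of quality scores with the same mean but different tails.
batch_a = [7, 7, 7, 7, 8, 7, 7, 7, 7, 8]      # steady, no disasters
batch_b = [10, 10, 10, 9, 2, 10, 3, 9, 9, 0]  # brilliant and terrible

def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)

def frac_below(xs: list[float], threshold: float = 5) -> float:
    # The number that matters: how often are we badly failing someone?
    return sum(x < threshold for x in xs) / len(xs)

print(mean(batch_a), frac_below(batch_a))  # 7.2 0.0
print(mean(batch_b), frac_below(batch_b))  # 7.2 0.3
```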
Build adversarial eval sets and update them constantly. I maintain a collection of inputs specifically designed to break things. Edge cases, unusual formatting, ambiguous requests, inputs in the style of actual confused users (not the sanitised examples we all write when we're building). Every time a user reports a bad output, that input goes into the adversarial set. It only grows. It never shrinks.
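The mechanics can be as simple as an append-only collection; the discipline is never deleting from it. A sketch with hypothetical case records:

```python
class AdversarialSet:
    """Grow-only collection of inputs known to break the feature."""

    def __init__(self) -> None:
        self._cases: list[dict] = []

    def add(self, user_input: str, failure_note: str) -> None:
        # Cases are only ever appended, never removed.
        self._cases.append({"input": user_input, "note": failure_note})

    def cases(self) -> list[dict]:
        # Return a copy so callers can't shrink the set.
        return list(self._cases)

adv = AdversarialSet()
adv.add("pls fix my acct its broke??? billing = blank page",
        "model invented an order-number field")
adv.add("cancel subscription but keep subscription",
        "model confidently did both and neither")
print(len(adv.cases()))  # 2
```

In practice this would persist to disk and feed the eval runner, but the invariant is the point: the set only grows.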
The QA parallel
Early in my career I worked on a project where the QA lead insisted on testing every feature against the spec document. Checkbox by checkbox. The software passed QA every sprint. Users hated it. The spec was wrong — not factually wrong, but wrong in the sense that it described a product nobody actually wanted to use. We were testing compliance with a flawed document instead of testing whether the thing worked for real people.
LLM evals have the exact same failure mode. You can build an elaborate eval framework that tests compliance with your assumptions about what "good" looks like, and those assumptions can be completely disconnected from what users actually need.
The fix is the same one that QA figured out decades ago: get real users involved. Not as an afterthought. As a primary signal. Automated evals tell you if something broke. User outcomes tell you if something works. You need both.
Where I am now
My current eval setup for any AI feature looks like this:
Tier 1: Factual/structural evals (fully automated, run in CI)
Tier 2: Classification/decision evals (agreement against human-labelled ground truth, trends tracked weekly)
Tier 3: Quality/vibes evals (LLM-as-judge, calibrated against human raters and recalibrated regularly)
Real user outcome metrics (slow to collect, but the only numbers that prove the feature works)
Adversarial set (grow-only, fed by every reported bad output)
It's not elegant. It's not fully automated. It requires actual human judgment on a regular basis. But it catches the things that matter — not just "did the model return valid JSON" but "did the user get what they needed."
You can't assert on vibes. But you can build systems that approximate the judgment of someone who cares. That's what good evals are. That's what good QA always was.