PJ Onori’s blog

Testing agents on design systems

A halftone image of a robot taking a test

Look, I’m no AI expert. Meaning I have at least one thing in common with 99.9% of AI “experts” on LinkedIn. What I do know is that more and more people are using agents to make software. Meaning a design system needs to play nice with them.

It’s really easy to say agents are able to use a design system. See, I just said it–that’s how easy it is. It’s another thing to show that agents are able to use it well–with receipts to prove it.

Now, don’t get all excited–I haven’t done that. But what I’ve done is built a tool that’s helping us get there. And, as much as it doesn’t benefit me to say this, it’s not rocket science.

So, I don’t want anyone to get it twisted that I’m touting this as some groundbreaking innovation. I mentioned this offhand in a thread and people asked me to share more. So, here we are.

How it works, basically

I’m not a smart guy–to my benefit. Being not smart, I don’t try to get fancy. So, when pondering on this problem, I landed on something very not fancy.

Two groups of agents get spun up, and both are given the same prompt to make an interface. One group’s given the old design system. The other is given our new one.

Each agent provides feedback on problems faced after it’s done. Once all agents finish, the builds are evaluated on a bunch of crap and a report is generated.

That’s it. Like I said, not rocket science.
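For the sake of illustration, the two-group setup could be sketched like this. Everything here is hypothetical–`run_agent` and `run_test` are names I made up, and a real run would shell out to actual coding agents rather than return a dict:

```python
# Hypothetical sketch of the two-group setup; the real tool drives
# actual coding agents. run_agent here is a stand-in, not a real API.
def run_agent(prompt: str, system: str) -> dict:
    # A real agent would build an interface; we just record the inputs.
    return {"system": system, "prompt": prompt}

def run_test(prompt: str, agents_per_group: int = 3) -> dict:
    """Spin up two groups of agents: same prompt, different design system."""
    groups = {"old": [], "new": []}
    for system in groups:
        for _ in range(agents_per_group):
            groups[system].append(run_agent(prompt, system))
    return groups
```

The important part is that both groups get the identical prompt, so the only variable is which design system they’re handed.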

How it works, in gory(ish) detail

OK, I’ll admit there’s more that goes into this. I wouldn’t say it’s important to know unless you want to replicate this.

The tool is relatively configurable. I don’t know what I’m doing, so I need to turn knobs and levers.

All agents are run independently to avoid sharing context. That’s obviously pretty important. Each agent is instructed to confirm that the app builds and runs. If an error occurs, the error message is logged and the agent is prompted to fix. They get a finite number of chances before being told to sit in a corner and feel bad. That feature avoids the nightmare “4,568th time’s a charm” scenario along with the $10,000 bill.
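The retry-with-a-cap loop is the part worth copying. A minimal sketch, with a toy agent standing in for the real thing (none of these names come from the actual tool):

```python
MAX_ATTEMPTS = 3  # finite chances before the agent gets benched

class ToyAgent:
    """Stand-in agent that fails a fixed number of times, then builds."""
    def __init__(self, failures: int):
        self.failures = failures

    def try_build(self):
        if self.failures > 0:
            self.failures -= 1
            return False, "TypeError: Button expects 'variant' prop"
        return True, None

    def prompt_fix(self, error: str):
        pass  # the real tool re-prompts the agent with the error text

def build_with_retries(agent, log: list) -> bool:
    """Confirm the app builds; on failure, log the error and re-prompt."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        ok, error = agent.try_build()
        if ok:
            return True
        log.append(f"attempt {attempt}: {error}")
        agent.prompt_fix(error)
    return False  # out of chances; go sit in the corner and feel bad
```

The cap is what prevents the “4,568th time’s a charm” scenario–an agent either converges in a few attempts or gets cut off.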

Each test is locally stored and self-contained. Below is the basic directory structure, with different names to (hopefully) be clearer.

test/
├── old/
│   ├── agent-1/
│   ├── agent-2/
│   └── agent-n/
└── new/
    ├── agent-1/
    ├── agent-2/
    └── agent-n/

(No, I didn’t use AI to make that diagram. Yes, it took an embarrassingly long amount of time to make it.)

Agent output (generated code, error messages, screenshots, etc.) is stored in its own agent-n folder. It’s basically everything that came out of the work. That’s helpful for when you want to run the code to dig into details.

All analysis happens after all agents have finished the work or died trying. The report compiles all the results into group averages. So, the more agents you spin up, the larger the sample size (duh).
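The averaging step is as boring as it sounds. A sketch, assuming each agent’s results come back as a dict of numeric metrics (the metric names below are invented examples, not the tool’s actual ones):

```python
def group_averages(results: dict) -> dict:
    """Average every numeric metric across the agents in each group."""
    report = {}
    for group, runs in results.items():
        metrics = runs[0].keys()
        report[group] = {m: sum(r[m] for r in runs) / len(runs) for m in metrics}
    return report
```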

What it measures

A lot–but there’s still a lot of work to be done. This is what’s tracked so far:

There’s always more to track. Responsive screenshots seems like an obvious one. I’ll be adding more as they are needed.

How I use the tool

I’m still learning how to make the most use of this tool. In general, I use a basic flywheel approach. I run tests, collect data, synthesize it, then make improvements. Rinse, repeat. What this process lacks in fanciness, it makes up for in actually working.

Sometimes I approach improvements from a broad angle. That usually means combing through agent feedback like API issues, or install woes. Other times, I use it to improve on specific facets. Like the time I went to war on inline style usage.

I’ve gotten in the habit of running a test with a decent number of agents when a new release ships. Sometimes, nothing new shows up. Other times, the test catches bugs, or inconsistencies with documentation.

I also try to run tests when new models come out. To be honest, I should spend more time doing this.

The results are promising

But way too early to get giddy about. In a nutshell, every tracked metric improved. What I find most promising is the consistent trajectory of improvement. Results vary, but they’re not thrashy.

It’s worth noting the tests have only run on Claude 4.5/4.6 models. I still need to give the 4.7 models a spin. I also have zero usage of any other providers. So, again, promising, but too early to get cocky.

What I learned

Now, I’ll admit–“learned” is doing some heavy lifting. Every day, people “learn” the world is flat on YouTube. I can’t say what I’ve learned is right. All I can share is where my head’s at based on this experience. Your mileage will vary.

This isn’t a replacement for talking to people

OK, confession, I already knew that. But I want to lead with this so people don’t get wild (and stupid) ideas in their heads. This tool won’t save you from having to (god forbid) talk to actual people. Agents are a whole other animal–figuratively. Believe it or not, an agent evaluation tool is used to evaluate agents.

This can be pretty cheap

My instructions always explicitly state to only write code for the front-end. And that turns out to be a pretty light lift. Tests with 3 agents per group cost about the same as a cheap coffee.

Super big sample sizes aren’t super necessary

Most of the time the results from 3 agents versus 30 look about the same. There are cases where you get that out-of-nowhere result. Those show up more often with a larger set of agents (duh). Those can be helpful because they expose the edge cases. But the common issues tend to show up with a low number of agents.

It’s easy to start teaching to the test

I was using the same template prompt for tests when I first got started. I was making solid progress, but it began to feel too easy. That’s when I added the ability to randomize prompts. And, yeah, all sorts of new crap started popping up.

Randomizing prompts emulates how a design system is actually used–unpredictably.
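Randomization doesn’t need to be clever. Here’s one way it could look, assuming a template with randomized slots (the prompt pieces below are made-up examples, not my actual pool):

```python
import random

# Hypothetical prompt pieces; the real pool is whatever interfaces
# matter to your product. The point is variety, not these exact strings.
PAGES = ["settings page", "dashboard", "login screen"]
DETAILS = ["a data table", "a modal dialog", "inline form validation"]

def random_prompt(rng: random.Random) -> str:
    """Vary the ask so agents can't be tuned to one canned test."""
    return f"Build a {rng.choice(PAGES)} that includes {rng.choice(DETAILS)}."
```

Passing in a seeded `random.Random` keeps a given test run reproducible while still varying prompts between runs.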

Documentation makes a huge difference

I was starting to question if documentation was making things better. Maybe component improvements were doing the heavy lifting–who knows? So, I ran a couple tests without documentation… The documentation was clearly the heavy lifter. This seems like a no brainer–but it was important for me to verify. Documentation is essential for systems that agents don’t have a lot of reps with.

Agents are somehow more stubborn than I am

Agent muscle memory is a complete pain in the ass. It can be overcome, but it’s going to piss you off the entire way there. Agents often straight-up ignore guidelines. I’ve combatted this with weapons-grade repetition. I’ll often repeat the same damned point five or six times in critical points of the documentation.

Agents (probably) need their own docs

Repeating points for agents may be fine, but it can be a pain in the ass for people. I’ve started to add a “For agents” section in the docs. That section is the dumpster for “get it in your silicon head” training. I remain a strong advocate for human-written documentation. But I couldn’t care less if documentation for agents is written by agents. As long as people don’t see that content, I’m all for it.

MCPs are hard

I’m really struggling with getting good results with MCPs. My best results come from agents getting a big context dump of all documentation. Agents that use the MCP ask all the wrong questions. They make bad assumptions. A lot more tokens get spent for worse results. All I’ve learned so far is that I need to learn more.

Documentation probably won’t cut it

Yes, I said documentation makes a big difference. But I don’t think it’s going to be enough. Agents still do dumb things for dumb reasons–or no reason. I’m pretty sure systems will have to ship with linters and other tools to unscrew basic agent stupidity.
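To make the linter idea concrete: a toy check for the inline-style problem mentioned earlier might look like this. This is only the shape of the idea–a real system would ship proper lint rules, and the regex here is a deliberately crude assumption:

```python
import re

# Toy lint pass for the rule "no inline styles". Crude on purpose:
# matches style="..." (HTML) and style={...} (JSX-style) attributes.
INLINE_STYLE = re.compile(r'style=("|\{)')

def check_no_inline_styles(source: str) -> list:
    """Return 1-based line numbers that use inline style attributes."""
    return [i for i, line in enumerate(source.splitlines(), 1)
            if INLINE_STYLE.search(line)]
```

The value isn’t the rule itself–it’s that a mechanical check catches the dumb thing every time, instead of hoping the agent read paragraph four of the docs.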

Feedback needs to go through a BS filter

The feedback part of the analysis is helpful. Make no mistake. But it needs heavy interpretation. A lot of the feedback is half true, or framed around a poor conclusion. There’s usually a nugget of value, but it can take a decent amount of polishing to extract it.

Don’t let that flywheel spin on its own

Now, I know what you’re thinking. You’re thinking it’d be great for agents to run the entire flywheel. Have them kick off tests, analyze, and then apply updates to the system. Over and over.

That’s a bad idea.

There’s a lot of noise in the output, feedback, and analysis–otherwise known as everything. That noise compounds fast. Think of the telephone game–then think about what that’d do to a design system.

I haven’t made something worth sharing

At least not yet. “Working for me” isn’t actually working. This was made on the company dime and so it’s not even my decision to begin with.

I’m not saying this is the right way to do this

I’m saying it’s how my dumb ass did it. Just like 99.9% of AI “experts”, I have no idea what I’m doing. But that’s not going to stop me from trying. If I learn something worth sharing, I will. Maybe that’ll be a better way to do this. Maybe it’ll be a cautionary tale against the whole ordeal. Either way, you should give something like this a go. Then you too can join the world’s least exclusive club of AI “experts”.