Testing agents on design systems
Look, I’m no AI expert. Meaning I have at least one thing in common with 99.9% of AI “experts” on LinkedIn. What I do know is that more and more people are using agents to make software. Meaning a design system needs to play nice with them.
It’s really easy to say agents are able to use a design system. See, I just said it–that’s how easy it is. It’s another thing to show that agents are able to use it well–with receipts to prove it.
Now, don’t get all excited–I haven’t done that. But what I’ve done is built a tool that’s helping us get there. And, as much as it doesn’t benefit me to say this, it’s not rocket science.
So, I don’t want anyone to get it twisted that I’m touting this as some groundbreaking innovation. I mentioned this offhand in a thread and people asked me to share more. So, here we are.
How it works, basically
I’m not a smart guy–to my benefit. Being not smart, I don’t try to get fancy. So, when pondering on this problem, I landed on something very not fancy.
Two groups of agents get spun up, and both are given the same prompt to make an interface. One group’s given the old design system. The other is given our new one.
Each agent provides feedback on problems faced after it’s done. Once all agents finish, the builds are evaluated on a bunch of crap and a report is generated.
That’s it. Like I said, not rocket science.
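For the code-brained, here’s the whole loop boiled down to a sketch. The names are made up and the model-calling bits are stubbed out, so treat it as the gist rather than the actual tool:

```ts
// The gist of the harness: spin up N independent agents per group, give each
// the same prompt, and collect what they produce. Names and stubs are illustrative.
type Group = "old" | "new";

interface AgentResult {
  group: Group;
  agent: number;
  feedback: string;   // problems the agent reported after it finished
  outputDir: string;  // where its code, logs, and screenshots landed
}

async function runAgent(group: Group, agent: number, prompt: string): Promise<AgentResult> {
  // Placeholder: the real step hands the prompt plus that group's design system
  // docs to a model and lets it build the interface in its own folder.
  return { group, agent, feedback: "", outputDir: `test/${group}/agent-${agent}` };
}

async function runTest(prompt: string, agentsPerGroup = 3): Promise<AgentResult[]> {
  const jobs: Promise<AgentResult>[] = [];
  for (const group of ["old", "new"] as const) {
    for (let agent = 1; agent <= agentsPerGroup; agent++) {
      jobs.push(runAgent(group, agent, prompt)); // each agent runs on its own
    }
  }
  return Promise.all(jobs); // evaluation and the report happen after this
}
```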
How it works, in gory(ish) detail
OK, I’ll admit there’s more that goes into this. I wouldn’t say it’s important to know unless you want to replicate this.
The tool is relatively configurable. I don’t know what I’m doing, so I need to turn knobs and levers. (There’s a rough sketch of the config right after this list.)
- You can choose how many agents get spun up per group. I typically go with three per group–mainly because three is an awesome number.
- The model is also configurable. It only supports Claude models, for a number of reasons. But there’s nothing technically stopping it from using other providers.
- There’s a flag for accessing training through a static markdown file or MCP. The results are pretty dramatically different. I’m still working out why.
- It allows you to add your own prompt to test from or let Robot Jesus take the wheel. In that case, the prompt is generated by an agent in PRD format.
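To make those knobs concrete, the config boils down to something like this. Field names here are simplified for the sake of the example, so don’t treat it as a schema:

```ts
// Roughly the knobs described above. Field names are illustrative.
interface TestConfig {
  agentsPerGroup: number;       // I usually go with 3 per group
  model: string;                // Claude-only today, but nothing sacred about that
  docsMode: "markdown" | "mcp"; // static markdown dump vs. MCP server
  prompt?: string;              // bring your own, or omit it and an agent
                                // generates a PRD-style prompt instead
}

const config: TestConfig = {
  agentsPerGroup: 3,
  model: "claude-sonnet-4-5",
  docsMode: "markdown",
};
```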
All agents are run independently to avoid sharing context. That’s obviously pretty important. Each agent is instructed to confirm that the app builds and runs. If an error occurs, the error message is logged and the agent is prompted to fix it. They get a finite number of chances before being told to sit in a corner and feel bad. That feature avoids the nightmare “4,568th time’s a charm” scenario along with the $10,000 bill.
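The build-check loop itself is nothing special. Stubbed out and renamed, it’s basically this:

```ts
// Sketch of the build-and-fix loop. buildApp and askAgentToFix are stand-ins
// for "run the generated project" and "feed the error back to the agent".
async function buildApp(dir: string): Promise<{ ok: boolean; error?: string }> {
  return { ok: true }; // placeholder: really installs, builds, and boots the app
}

async function askAgentToFix(dir: string, error: string): Promise<void> {
  // placeholder: sends the error message back to the agent as a new turn
}

async function verifyBuild(dir: string, maxAttempts = 5): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await buildApp(dir);
    if (result.ok) return true;
    console.log(`build failed (attempt ${attempt}): ${result.error}`); // logged for the report
    await askAgentToFix(dir, result.error ?? "unknown error");
  }
  return false; // out of chances; time to sit in the corner
}
```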
Each test is locally stored and self-contained. Below is the basic directory structure, with different names to (hopefully) be clearer.
```
test/
├── old/
│   ├── agent-1/
│   ├── agent-2/
│   └── agent-n/
└── new/
    ├── agent-1/
    ├── agent-2/
    └── agent-n/
```
(No, I didn’t use AI to make that diagram. Yes, it took an embarrassingly long time to make.)
Agent output (generated code, error messages, screenshots, etc.) is stored in its own agent-n folder. It’s basically everything that came out of the work. That’s helpful for when you want to run the code to dig into details.
All analysis happens after all agents have finished the work or died trying. The report compiles all the results into group averages. So, the more agents you spin up, the larger the sample size (duh).
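The averaging is as boring as it sounds. A trimmed-down version, with made-up metric names:

```ts
// Collapse per-agent numbers into one averaged row per group. Only a few metrics shown.
interface AgentMetrics {
  group: "old" | "new";
  durationMs: number;
  linesOfCode: number;
  fixAttempts: number;
}

function groupAverages(all: AgentMetrics[]): Record<"old" | "new", Partial<AgentMetrics>> {
  const out = { old: {}, new: {} } as Record<"old" | "new", Partial<AgentMetrics>>;
  for (const group of ["old", "new"] as const) {
    const rows = all.filter((m) => m.group === group);
    const avg = (pick: (m: AgentMetrics) => number) =>
      rows.length ? rows.reduce((sum, m) => sum + pick(m), 0) / rows.length : 0;
    out[group] = {
      durationMs: avg((m) => m.durationMs),
      linesOfCode: avg((m) => m.linesOfCode),
      fixAttempts: avg((m) => m.fixAttempts),
    };
  }
  return out;
}
```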
What it measures
A lot–but there’s still a lot of work to be done. This is what’s tracked so far:
- Timing: How long it takes to complete the task. Ideally, a well-trained agent will be able to get the job done faster.
- Lines of code: How much code it produced to complete the job. My hope is to see lines of code decrease as we build more higher-order components.
- Code variance: How different the code is between each iteration. This measure shows how consistent and predictable code output is across agents.
- Fix attempts: How many times the agent had to go back and fix an error. Ideally, the design system allows agents to one-shot an interface.
- Components used: A breakdown of all the components used, and the number of times each is used.
- Accessibility: A fairly basic set of axe accessibility tests. At best, the tests are assessing airquote-accessibility. Admittedly not great, but a start.
- Performance: Baseline render time, LCP, FCP, and OPP (that last one isn’t real, but it was a good song)
- Inline styles: Tracks component instances with inline styles. This one is a legit nightmare–agents love to slap inline styles on pretty much everything. (There’s a toy sketch of this check right after the list.)
- Visual diff: Just your run-of-the-mill pixel diffing for each screenshot. This is also used to assess how consistent the output is from agent to agent.
- Token usage: No, not design tokens–the robot kind. Input and output token spend is recorded to measure efficiency. Results should be good–and not break the bank.
- Feedback: Last, but not least, feedback from every agent is collected and broken down by general theme.
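If you’re wondering how unfancy some of these checks can be: here’s a toy version of the component-usage and inline-style counters. The real pass should lean on an AST, but regexes get the idea across:

```ts
// Walk an agent's output folder and tally JSX component usage and inline styles.
// This is a simplification; an AST-based check would be far less gullible.
import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";

function* sourceFiles(dir: string): Generator<string> {
  for (const name of readdirSync(dir)) {
    const full = join(dir, name);
    if (statSync(full).isDirectory()) yield* sourceFiles(full);
    else if (/\.(tsx|jsx)$/.test(name)) yield full;
  }
}

function tally(dir: string) {
  const componentUses: Record<string, number> = {};
  let inlineStyles = 0;
  for (const file of sourceFiles(dir)) {
    const src = readFileSync(file, "utf8");
    for (const match of src.matchAll(/<([A-Z][A-Za-z0-9]*)\b/g)) {
      componentUses[match[1]] = (componentUses[match[1]] ?? 0) + 1; // e.g. <Button>
    }
    inlineStyles += (src.match(/style=\{\{/g) ?? []).length; // agents love these
  }
  return { componentUses, inlineStyles };
}
```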
There’s always more to track. Responsive screenshots seem like an obvious one. I’ll be adding more as they are needed.
How I use the tool
I’m still learning how to get the most out of this tool. In general, I use a basic flywheel approach. I run tests, collect data, synthesize it, then make improvements. Rinse, repeat. What this process lacks in fanciness, it makes up for in actually working.
Sometimes I approach improvements from a broad angle. That usually means combing through agent feedback for things like API issues or install woes. Other times, I use it to improve on specific facets. Like the time I went to war on inline style usage.
I’ve gotten in the habit of running a test with a decent number of agents when a new release ships. Sometimes, nothing new shows up. Other times, the test catches bugs or inconsistencies in the documentation.
I also try to run tests when new models come out. To be honest, I should spend more time doing this.
The results are promising
But way too early to get giddy about. In a nutshell, every tracked metric improved. What I find most promising is the consistent trajectory of improvement. Results vary, but they’re not thrashy.
It’s worth noting the tests have only run on Claude 4.5/4.6 models. I still need to give the 4.7 models a spin. I also have zero data from any other providers. So, again, promising, but too early to get cocky.
What I learned
Now, I’ll admit–“learned” is doing some heavy lifting. Every day, people “learn” the world is flat on YouTube. I can’t say what I’ve learned is right. All I can share is where my head’s at based on this experience. Your mileage will vary.
This isn’t a replacement for talking to people
OK, confession, I already knew that. But I want to lead with this so people don’t get wild (and stupid) ideas in their heads. This tool won’t save you from having to (god forbid) talk to actual people. Agents are a whole other animal–figuratively. Believe it or not, an agent evaluation tool is used to evaluate agents.
This can be pretty cheap
My instructions always explicitly state to only write code for the front-end. And that turns out to be a pretty light lift. Tests with 3 agents per group cost about the same as a cheap coffee.
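The napkin math is simple enough. The prices and token counts below are placeholders, so plug in whatever your provider actually charges:

```ts
// Back-of-napkin cost estimate. The per-million-token prices are assumptions.
const PRICE_PER_MILLION_USD = { input: 3, output: 15 };

function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * PRICE_PER_MILLION_USD.input +
    (outputTokens / 1_000_000) * PRICE_PER_MILLION_USD.output
  );
}

// Six agents (3 per group) at roughly 100k input / 10k output tokens each
// lands around a few bucks at these assumed rates.
console.log((6 * estimateCostUSD(100_000, 10_000)).toFixed(2)); // "2.70"
```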
Super big sample sizes aren’t super necessary
Most of the time the results from 3 agents versus 30 look about the same. There are cases where you get that out-of-nowhere result. Those show up more often with a larger set of agents (duh). They can be helpful because they expose the edge cases. But the common issues tend to show up with a low number of agents.
It’s easy to start teaching to the test
I was using the same template prompt for tests when I first got started. I was making solid progress, but it began to feel too easy. That’s when I added the ability to randomize prompts. And, yeah, all sorts of new crap started popping up.
Randomizing prompts emulates how a design system is used–which is to say, unpredictably.
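The randomization itself is nothing clever. Something like this, with a made-up scenario list and an agent turning the seed into a full PRD:

```ts
// Pick a random scenario seed; an agent later expands it into a PRD-style brief.
const scenarios = [
  "a billing settings page",
  "an onboarding checklist",
  "a support ticket triage view",
  "an analytics dashboard with filters",
];

function randomPromptSeed(): string {
  const pick = scenarios[Math.floor(Math.random() * scenarios.length)];
  return `Write a short PRD for ${pick}, then build it using the design system.`;
}
```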
Documentation makes a huge difference
I was starting to question if documentation was making things better. Maybe component improvements were doing the heavy lifting–who knows? So, I ran a couple tests without documentation… The documentation was clearly the heavy lifter. This seems like a no-brainer–but it was important for me to verify. Documentation is essential for systems that agents don’t have a lot of reps with.
Agents are somehow more stubborn than I am
Agent muscle memory is a complete pain in the ass. It can be overcome, but it’s going to piss you off the entire way there. Agents often straight-up ignore guidelines. I’ve combatted this with weapons-grade repetition. I’ll often repeat the same damned point five or six times at critical points in the documentation.
Agents (probably) need their own docs
Repeating points for agents may be fine, but it can be a pain in the ass for people. I’ve started to add a “For agents” section in the docs. That section is the dumpster for “get it in your silicon head” training. I remain a strong advocate for human-written documentation. But I couldn’t care less if documentation for agents is written by agents. As long as people don’t see that content, I’m all for it.
MCPs are hard
I’m really struggling with getting good results with MCPs. My best results come from agents getting a big context dump of all documentation. Agents that use the MCP ask all the wrong questions. They make bad assumptions. A lot more tokens get spent for worse results. All I’ve learned so far is that I need to learn more.
Documentation probably won’t cut it
Yes, I said documentation makes a big difference. But I don’t think it’s going to be enough. Agents still do dumb things for dumb reasons–or no reason. I’m pretty sure systems will have to ship with linters and other tools to unscrew basic agent stupidity.
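As a taste of what I mean, here’s a sketch of an ESLint-style rule that yells about inline style props. It follows ESLint’s custom-rule shape as I understand it; treat it as a starting point, not a drop-in:

```ts
// Flags `style={...}` props on JSX elements so agents get yelled at by tooling
// instead of by me. Types are loosened so the sketch stands on its own.
export const noInlineStyles = {
  meta: {
    type: "problem" as const,
    messages: {
      noInline: "Use the design system's props and tokens instead of inline styles.",
    },
  },
  create(context: { report: (descriptor: { node: unknown; messageId: string }) => void }) {
    return {
      // Runs for every JSX attribute; flag the ones literally named `style`.
      JSXAttribute(node: { name: { name?: string } }) {
        if (node.name.name === "style") {
          context.report({ node, messageId: "noInline" });
        }
      },
    };
  },
};
```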
Feedback needs to go through a BS filter
The feedback part of the analysis is helpful. Make no mistake. But it needs heavy interpretation. A lot of the feedback is half true, or framed around a poor conclusion. There’s usually a nugget of value, but it can take a decent amount of polishing to extract it.
Don’t let that flywheel spin on its own
Now, I know what you’re thinking. You’re thinking it’d be great for agents to run the entire flywheel. Have them kick off tests, analyze, and then apply updates to the system. Over and over.
That’s a bad idea.
There’s a lot of noise in the output, feedback, and analysis–otherwise known as everything. That noise compounds fast. Think of the telephone game–then think about what that’d do to a design system.
I haven’t made something worth sharing
At least not yet. “Works for me” isn’t the same as actually working. Besides, this was made on the company dime, so it’s not even my decision to begin with.
I’m not saying this is the right way to do this
I’m saying it’s how my dumb ass did it. Just like 99.9% of AI “experts”, I have no idea what I’m doing. But that’s not going to stop me from trying. If I learn something worth sharing, I will. Maybe that’ll be a better way to do this. Maybe it’ll be a cautionary tale against the whole ordeal. Either way, you should give something like this a go. Then you too can join the world’s least exclusive club of AI “experts”.