
High Scores, Low Trust: Rethinking AI Evaluation

Q: When do you need light the most? 

A: When you’re working inside a black box.

With any technology, especially those that are emerging and unfamiliar, decision-makers rely on benchmarks to evaluate solutions. But when it comes to AI benchmarks, we see a lot of hype and uncertainty. And while that might not matter to your average consumer, it’s a very big problem for enterprise decision-makers considering AI solutions.

How are they supposed to proceed?

  • AI models still feel very black box, making everything from information security to objective evaluation difficult.
  • So much of AI development is early stage that many of these benchmarks can’t possibly reflect real-world usage.
  • The benchmarks are often created by model makers and their partners, putting objectivity in doubt.
  • Combine these factors and we end up in a place where business decision-makers don’t feel informed or empowered enough to really evaluate performance.

This gives these benchmarks even more power, because there’s simply no alternative.

This article looks at the challenges of finding AI benchmarks that go beyond abstractions to help decision-makers really understand performance choices.

A brief history of precision (and peril) in measurement

In an era of sensitive, down-to-the-milli-whatever measurement, it’s worth remembering that testing hasn’t always been so precise. Galen, the 2nd-century physician credited with one of the earliest attempts at a temperature scale, worked with just four degrees of hot and four of cold. Imagine those conversations:

“What’s the weather like?”

“It says ‘mildly hot’.”
“What does that even mean?”
“Wear shorts.”

This is what happens when measurement, in some way, defines reality.  And while that might seem slightly annoying when you’re trying to decide what to wear, the stakes get significantly higher inside business and technology environments.

It’s such a big deal that experts even have a name for it: Goodhart’s Law, the observation that when a measure becomes a target, it stops being a good measure.

Imagine you lead a team of developers and measure performance by the number of lines of code checked in. Code production shoots way up, but you soon notice that not all of it is very good. You’ve incentivized the wrong thing by turning a measurement into a target.

When you control the metric, you control perception (especially around AI)

This puts a lot of power into the hands of those who decide what gets measured and how. Say you’re a great employee who never complains and does outstanding work. Unfortunately, if your manager is still simply measuring time spent on task X, that’s all that matters. Bummer!

This isn’t to say all benchmarks are inherently biased; that all depends on who creates them and how they’re built. And ultimately, common benchmarks and standards are very useful, even when they’re not perfect. The world’s dueling temperature standards, Fahrenheit and Celsius, are a perfect example.

But measuring temperature is relatively straightforward compared to measuring the performance of ultracomplex AI models across their many use cases.

So, we definitely need benchmarks, but they just need to be better.

Bad benchmarks are holding AI trust back

Benchmarks are important because they give customers “apples to apples” comparisons between different options. You don’t have to understand the architecture or engineering of a new chipset if you have a meaningful metric to evaluate performance. If it does well on the test, buy it; if not, no thanks.

But we’re back to Goodhart’s Law again. AI model builders can become too focused on narrow metrics while losing sight of the outcomes that AI is supposed to be helping us achieve. There’s already an inherent conflict of interest when model makers also build the benchmarks, but that’s been the case since well before the AI race started.

For example, after years of allegations that CPU maker Intel was effectively gaming key benchmarks, rival AMD sued them in 2005 over anti-competitive practices, including the alleged benchmark scheme, a suit Intel later settled. Today’s builders, model makers specifically, must avoid these problems and deliver much-needed confidence.

Three AI Benchmark Challenges

We have already discussed the problem with vendors creating the benchmarks used to measure the performance of their solutions.  The additional problem with AI models is that they are largely opaque.  So vendor-built benchmarks, plus that opacity, don’t necessarily yield the certainty decision-makers really want.

But who creates the benchmark is only part of the problem here.  A lot of the current crop of benchmarks also share three other shortcomings.

Problem 1: Limited coverage and benchmark saturation

Most widely used benchmarks focus on narrowly defined tasks, leaving lots of real-world complexity untested. As a result, models optimized for these benchmarks might perform well on narrow tests and then struggle with work that requires broader contextual understanding, adaptability, or creativity. One thing humans handle really well is ambiguity; AI, not so much.

Additionally, as benchmarks become widely used, models can be tuned specifically to maximize scores until the test no longer differentiates them, a problem known as benchmark saturation. Teams may focus on outcompeting old results on rigid tests rather than advancing genuine productivity. This slows progress and gives a false sense of achievement, which can impress investors but not decision-makers.

Problem 2: Data contamination and memorization

When AI models unknowingly encounter benchmark data during training, they often memorize correct answers. This boosts benchmark scores, but clearly doesn’t reflect real comprehension. This kind of contamination makes performance metrics unreliable, distorting understanding of a model’s actual reasoning or generalization skills in situations that are truly novel.

At core, memorization undermines the intended purpose of benchmarks: measuring an AI’s ability to handle unseen data. With contaminated benchmarks, incremental progress only proves that models have been exposed to more test data, not that they are necessarily improving. This can lead to misplaced trust, especially if real-world deployments don’t measure up.
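
To make this concrete, here is a minimal sketch of one common screening idea: flag benchmark items that share long word n-grams with the training corpus. The 13-word window and the helper functions are illustrative choices, not an established standard, and real contamination audits go much deeper.

```python
# Rough contamination screen: flag benchmark items whose text shares a long
# word n-gram with the training corpus. The 13-word window and helper names
# are illustrative, not an established standard.

def ngrams(text, n=13):
    """Return the set of word n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_items, training_ngrams, n=13):
    """Fraction of benchmark items sharing at least one n-gram with training data."""
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & training_ngrams)
    return flagged / max(len(benchmark_items), 1)

# Toy usage: one "training document" and two benchmark questions.
training_docs = ["the quick brown fox jumps over the lazy dog near the quiet riverbank at dawn"]
training_ngrams = set().union(*(ngrams(doc) for doc in training_docs))

benchmark = [
    "the quick brown fox jumps over the lazy dog near the quiet riverbank at dawn today",
    "a completely unrelated question about contract law in three jurisdictions",
]
print(f"Contamination rate: {contamination_rate(benchmark, training_ngrams):.0%}")  # 50%
```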

Problem 3: Lack of transparency and poor generalizability

We already talked about the black box nature of AI models and benchmarks. Whether your concern is on or off this list, model opacity can make it difficult to assess whether contamination, saturation, or memorization is having an effect. This lack of transparency erodes trust in both the benchmarking process and reported progress.

We saw a version of this back in 2018, when reports surfaced about an internal AI recruiting tool Amazon had built to quickly find top development talent. It performed well in early tests, but the team eventually realized the algorithm was penalizing female applicants. Investigation revealed the model had been trained largely on resumes submitted by men, which skewed its selection criteria, and the tool was ultimately scrapped.

Additionally, benchmarks often fail to generalize, meaning a model performing well in a single test environment may fail in another, or perform worse when simple configuration changes are made. Without clear documentation of what is being measured and how, stakeholders cannot reliably judge a model’s true performance, especially in high-stakes applications.

For AI builders: three paths to better AI benchmarks

If today’s AI benchmarks aren’t delivering the clarity decision-makers need, thanks to model opacity and problems with the standards themselves, what’s the way forward?

Model makers and AI companies are likely to keep resisting efforts at real transparency due to competitive pressures. That puts even greater emphasis on the need for benchmarks everybody trusts. No benchmark can deliver 100% transparency, but we can certainly do better than we are today.

Here are three ideas for helping decision-makers evaluate AI solutions with something approaching consistency and objectivity.

Get real: broaden benchmark scope to capture real-world tasks and diversity

Most benchmarks measure narrow use cases, not the complexity of real-world tasks users might actually do. Incorporating multimodal data, such as text, images, and audio, plus cross-domain scenarios, allows evaluation of models’ capacities for multi-step reasoning. It also gets closer to measuring contextual understanding and knowledge transfer between diverse, practical applications.

Broadening the scope also helps benchmarks better represent different languages, cultures, and task domains. Benchmarks like MMMU and MULTIBENCH are built around the idea that real learning is more than just memorization, requiring flexible perception and creative reasoning inside a complex, unpredictable environment.

Stay true: build greater transparency and documentation into your benchmarks

A major limitation in current benchmarks is the lack of clear documentation, which makes it hard for researchers to verify results or even understand what’s being measured. Improved transparency (publishing data sources, evaluation protocols, and scoring code) would let experts and interested parties audit, reproduce, and compare results openly, beyond just marketing claims.

Documented, transparent benchmarks allow for identification and correction of hidden flaws, such as test contamination or bias. The Stanford BetterBench project, for instance, provides checklists to ensure benchmarking methodology is clear and verifiable, making it easier for the community to collectively improve standards and ensure robust, fair comparisons between models.
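
As a thought experiment, here is a minimal sketch of what machine-readable benchmark documentation could look like: a small “benchmark card” recording data sources, the evaluation protocol, and where the scoring code lives. The field names, the ExampleQA benchmark, and the URL are hypothetical, and this is not BetterBench’s actual checklist.

```python
# Minimal sketch of a machine-readable "benchmark card". The field names, the
# ExampleQA benchmark, and the URL are hypothetical; this is not BetterBench's
# actual checklist or any standard schema.

from dataclasses import dataclass, field, asdict
import json

@dataclass
class BenchmarkCard:
    name: str
    version: str
    task_domains: list          # what kinds of work the benchmark covers
    data_sources: list          # where the test items come from
    evaluation_protocol: str    # how answers are scored (exact match, rubric, etc.)
    scoring_code_url: str       # link to the code that computes the score
    known_limitations: list = field(default_factory=list)

card = BenchmarkCard(
    name="ExampleQA",           # hypothetical benchmark
    version="1.2",
    task_domains=["multi-step reasoning", "document QA"],
    data_sources=["licensed news archive (2023-2024)", "expert-written questions"],
    evaluation_protocol="exact match plus rubric-scored free-form answers",
    scoring_code_url="https://example.com/exampleqa-scoring",
    known_limitations=["English only", "no audio or image inputs"],
)

# Publishing the card alongside results lets outside reviewers see exactly
# what was measured and reproduce the scoring.
print(json.dumps(asdict(card), indent=2))
```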

Be surprising: use hidden and unexpected tests

As we already covered, traditional benchmarks can quickly become vulnerable to ‘overfitting’ as model developers train specifically for them. Dynamic, adaptive, and hidden test sets, such as those deployed in ForecastBench, enforce relevance by regularly introducing new, unseen evaluation data. This approach encourages true generalization and discourages memorization-driven performance gains.

Models evaluated on undisclosed or frequently changing benchmarks produce results that more closely reflect a genuine ability to handle brand-new tasks. Such benchmarks also give model makers fewer opportunities to game or memorize test content, so researchers and decision-makers can assess which models are capable of robust, adaptive reasoning in unpredictable, real-world scenarios.
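
One way to implement the “surprise” element is to keep the question pool private and evaluate each round on a slice that changes with the date. Here is a minimal sketch of that idea; the hash-based selection scheme is illustrative and not how ForecastBench or any specific benchmark actually works.

```python
# Minimal sketch of a rotating hidden test set: the question pool stays
# private, and each evaluation round uses only a date-keyed slice of it, so
# scores can't be tuned against a fixed public test. The selection scheme is
# illustrative, not how ForecastBench or any specific benchmark works.

import hashlib
from datetime import date

def rotating_split(question_ids, round_date, fraction=0.2):
    """Deterministically pick this round's subset of the hidden question pool."""
    selected = []
    for qid in question_ids:
        digest = hashlib.sha256(f"{qid}:{round_date.isoformat()}".encode()).hexdigest()
        if int(digest, 16) % 1000 < fraction * 1000:
            selected.append(qid)
    return selected

# Each round draws a different, hard-to-predict slice of the pool.
pool = [f"q{i:04d}" for i in range(500)]
print(len(rotating_split(pool, date(2025, 1, 6))))   # roughly 100 questions
print(len(rotating_split(pool, date(2025, 4, 7))))   # a different ~100
```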

One last thing: what about “agentic AI”?

If you thought testing AI models was complicated, testing AI agents is even messier.

When it comes to AI agent testing, we’re not dealing with a simple “yes or no” checklist. Agents do more than just spit out predictions—they make decisions on the fly, talk to real-world APIs, and react to an ever-changing digital environment. That makes them unpredictable: run the same test twice, and you might get two different outcomes. Traditional, static test cases don’t cut it.

You’ve also got the problem of connections that break at exactly the wrong moment. Agents depend on everything from web tools and cloud databases to other helpers behind the scenes. If any piece flakes out—or if an outside service changes overnight—failures can ripple through unexpectedly. And because agents are made up of logic chains and multi-step plans, bugs don’t always show themselves until everything’s live.

How can we actually solve this? We need scenario-based tests: end-to-end runs that mirror messy, real-world problems and track how agents reason, adapt, and recover from surprises. These benchmarks should score safety, transparency, robustness, and even how well agents work with humans. Only then will we know if our digital co-workers can truly handle the unpredictable mess of the everyday.
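
To show what “scenario-based” could mean in practice, here is a minimal sketch of a single test: a toy agent must finish a small task while an external service fails intermittently, and the run is scored on completion, recovery, and whether it left an explainable trace. The agent, the flaky API, and the scoring dimensions are hypothetical placeholders, not any standard framework.

```python
# Minimal sketch of a scenario-based agent test. The agent, the flaky API,
# and the scoring dimensions are toy placeholders, not a standard framework.

import random
from dataclasses import dataclass, field

class FlakyWeatherAPI:
    """Simulated external service that times out intermittently."""
    def __init__(self, failure_rate=0.5, seed=7):
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)

    def get_forecast(self, city):
        if self.rng.random() < self.failure_rate:
            raise TimeoutError("weather service unavailable")
        return f"Sunny in {city}"

@dataclass
class AgentRun:
    answer: str = ""
    retries: int = 0
    trace: list = field(default_factory=list)

class ToyTripAgent:
    """Stand-in agent that retries the API and records its steps."""
    def plan_trip(self, city, api):
        run = AgentRun()
        for attempt in range(3):
            try:
                forecast = api.get_forecast(city)
                run.answer = f"Visit {city}: {forecast}"
                run.trace.append(f"attempt {attempt + 1}: got forecast")
                break
            except TimeoutError:
                run.retries += 1
                run.trace.append(f"attempt {attempt + 1}: timeout, retrying")
        return run

def score_scenario(agent, api):
    """Score one end-to-end run on several dimensions, not just the answer."""
    run = agent.plan_trip("Lisbon", api)
    return {
        "task_completed": int("Lisbon" in run.answer),
        "recovered_from_failure": int(run.retries > 0 and run.answer != ""),
        "explained_reasoning": int(len(run.trace) > 0),
    }

print(score_scenario(ToyTripAgent(), FlakyWeatherAPI()))
```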

AI model measurement shouldn’t feel like magic

In the search for clarity inside AI’s black box, better benchmarks act like flashlights—not just illuminating what models can do, but also showing where they fall short of real-world expectations. Too many of today’s benchmarks obsess over narrow tasks, hide underlying problems, and incentivize dramatic numbers over dependable AI.

A better strategy is building benchmarks that reflect the true messiness and variety of the environments where AI operates, with strong documentation, open evaluation processes, and surprise tests that keep models honest. This will help model builders deliver what AI and Gen AI can’t: human confidence in a solution.

Sean M. Dineen

He has spent over 20 years as a technical and marketing communicator with a strong focus on compliance and security. For the last ten years he has helped leading B2B technology and security companies, from AMD and AT&T to NVIDIA and Palo Alto Networks, bring their solutions to market.
