The payment success rate was 100%. Routing accuracy: also 100%. The evaluation logs were spotless. And yet, behind those numbers, the AI never once stopped to ask, "Are you sure you want to complete this payment?"

That is what a joint research team from Singapore Management University (SMU) and Mastercard found after assigning 90,000 payment tasks to 18 large language models. Ten of the models consistently skipped the step where the user confirms the payment right before it goes through. More striking still: not a single one of these omissions was caught by existing evaluation metrics.

The transactions were processed flawlessly, no errors occurred, and the only thing left in the record was the word "success." What the AI agents had quietly left out became visible only after the researchers applied a new way of measuring. Ninety thousand data points make one thing clear: "the AI did the job well" and "the AI did the job the way it was told to" are not the same statement.

Payments Logged as Successes — and the Step That Vanished Inside Them

To observe how AI agents behave in real payment situations, the researchers designed four scenarios: registering new card information, looking up a stored card, processing an actual payment, and refusing a request that had nothing to do with payments at all. Eighteen models ran each scenario five times, giving the team a total of 90,000 data points.

Conventional evaluation asked two questions: Was the payment ultimately completed (TSR)? And did the AI handle it through the correct tools and pathways (HF1)? On these two metrics, some models scored a perfect 100%. Until the researchers introduced a third metric, everything looked normal.

The new metric is called the Agentic Success Rate (ASR). It measures how faithfully an AI agent followed the task's steps in their prescribed order, evaluating each consecutive pair of steps together. It asks not simply "was it completed?" but "by what path was it completed?"
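
The article does not give the study's exact ASR formula, so what follows is only a minimal sketch of the idea, not the researchers' implementation. It assumes a task is a list of required steps and a run is the list of steps the agent actually took. The step names and both function names are hypothetical. It shows how a run can pass an outcome-only check while scoring zero on an order-aware one.

```python
from typing import List

# Hypothetical step names; the study's actual step labels are not given here.
REQUIRED_STEPS = ["collect_details", "confirm_with_user", "process_payment"]


def task_succeeded(trace: List[str]) -> bool:
    """Outcome-only check: did the final payment step run at all?"""
    return "process_payment" in trace


def pairwise_adherence(trace: List[str], required: List[str] = REQUIRED_STEPS) -> float:
    """Share of consecutive required-step pairs that both ran, in the right order."""
    pairs = list(zip(required, required[1:]))
    hits = 0
    for first, second in pairs:
        both_ran = first in trace and second in trace
        if both_ran and trace.index(first) < trace.index(second):
            hits += 1
    return hits / len(pairs)


full_run = ["collect_details", "confirm_with_user", "process_payment"]
skipped_run = ["collect_details", "process_payment"]  # confirmation step omitted

print(task_succeeded(full_run), pairwise_adherence(full_run))        # True 1.0
print(task_succeeded(skipped_run), pairwise_adherence(skipped_run))  # True 0.0
```

Both runs count as successes on the outcome check. Only the pairwise score separates them, because every pair that involves the missing confirmation step fails.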

GPT-4.1 scored 100% on both payment success and routing accuracy. Its agentic success rate, however, came in at 99.96%. On paper, that gap looks negligible. In practice, the missing 0.04% corresponds to cases in which the model processed a payment entirely on its own, without user confirmation. The same pattern appeared in Qwen2.5 (32B) and in the 8B and 32B versions of Qwen3, among others. The remaining eight models, by contrast, committed zero procedural violations.

A checkpoint, in this context, is the intermediate step just before a payment is processed where the AI agent asks the user, "Shall I proceed exactly as planned?" and waits for an answer. Remove that step, and the payment still completes normally. An evaluation that looks only at outcomes reveals nothing wrong. What separated the two groups of models was not their size or overall capability. It was how far they adhered to procedure.
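
One way to keep such a checkpoint from depending on the model's own discipline is to enforce it in the code that executes the payment. The sketch below is an illustration under assumptions, not the setup used in the study: `PaymentSession`, `ask_user`, and the log event names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class ConfirmationRequired(Exception):
    """Raised when a payment is attempted without an explicit user yes."""


@dataclass
class PaymentSession:
    # Hypothetical callback: shows the question to the user, returns True on "yes".
    ask_user: Callable[[str], bool]
    log: List[str] = field(default_factory=list)

    def pay(self, amount: float, payee: str) -> str:
        # The checkpoint lives here, so no caller can skip it.
        self.log.append("checkpoint_asked")
        if not self.ask_user(f"Pay {amount:.2f} to {payee}?"):
            self.log.append("checkpoint_declined")
            raise ConfirmationRequired("user did not confirm the payment")
        self.log.append("checkpoint_confirmed")
        self.log.append("payment_processed")
        return "success"


session = PaymentSession(ask_user=lambda question: True)  # stand-in for a real prompt
print(session.pay(12.50, "example-shop"))  # success
print(session.log)
```

Because the question is asked inside `pay`, an agent that calls the tool directly still triggers the checkpoint, and the log records that it ran.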

The Efficiency Argument — and Why It Falls Short

The findings are not beyond dispute.

Whether the user-confirmation step the researchers built in should be mandatory in every payment environment is a separate debate. If the amount is small, if an identical purchase is recurring, or if the user has explicitly pre-authorized automatic processing, skipping the checkpoint may actually fit the user experience better. Countless subscription services and autopay systems already process recurring payments without a confirmation step. If the AI agents were applying a similar judgment, the argument that this is "contextual adaptation" rather than "violation" is entirely reasonable. It also bears noting that all 90,000 evaluations took place in an experimental setting. Real-world services design their system prompts and behavioral constraints with far greater precision, so lab results cannot be assumed to mirror production environments.

But none of these objections dissolves the core problem the study exposes. Regardless of the experimental conditions, existing evaluation metrics could not detect procedural violations. Figuring out what the AI agents had omitted required an entirely new kind of measuring instrument. To trust any system, you first have to know how it can fail. Until now, AI agent evaluation has not adequately answered that question.

And if an AI skipped the confirmation step because it "judged it more efficient," we have to ask who granted it that decision-making authority, and where that grant is written down. The gap between design intent and actual behavior was invisible to the existing metrics, and that is the real problem this research exposes.

What to Design Before You Hand Work to an AI Agent

The study deals with financial payment systems, but its implications reach much further. Anyone bringing AI agents into a workflow runs into the same class of problem.

If you've delegated email sending to an AI agent, checking only "did it go out?" may not be enough. Unless you examine what was sent, to whom, and in what order, you'll discover the message that never should have gone out only after it already has. The same goes for sending quotes, handling customer responses, automating social media posts, and scheduling content publication. Some tasks only need the right outcome; others need both the right outcome and the right path.

The dividing line is simple: in this task, is the procedure itself part of the accountability? Work where output quality is everything — organizing files, summarizing information, drafting text — should be treated differently from work where the process itself is recorded and carries responsibility, such as payments, outbound communications, and anything published externally. Running AI agents with full autonomy is the wrong fit for the latter.

The next step is redesigning your performance metrics. Existing measures that track only completion and error rates will not catch procedural violations. You need to designate checkpoints at the critical steps yourself and verify, through separate logs, that those steps actually ran. And if that process feels like a hassle, the hassle itself may be a signal that this particular task should not be fully delegated to AI in the first place.
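
As a rough illustration of that kind of verification, the sketch below scans per-run event logs and flags every run that ended in a processed payment without a confirmed checkpoint before it. The log format, the event names, and the function `find_silent_skips` are assumptions made for this example. A real system would read its own audit records.

```python
from typing import Dict, List


def find_silent_skips(runs: Dict[str, List[str]]) -> List[str]:
    """Return IDs of runs where a payment was processed with no prior confirmation."""
    flagged = []
    for run_id, events in runs.items():
        if "payment_processed" not in events:
            continue  # nothing was paid, so there is nothing to audit
        paid_at = events.index("payment_processed")
        if "checkpoint_confirmed" not in events[:paid_at]:
            flagged.append(run_id)
    return flagged


# Hypothetical logs: every run here would be recorded as a "success".
runs = {
    "run-001": ["checkpoint_asked", "checkpoint_confirmed", "payment_processed"],
    "run-002": ["payment_processed"],
    "run-003": ["checkpoint_asked", "payment_processed", "checkpoint_confirmed"],
}

print(find_silent_skips(runs))  # ['run-002', 'run-003']
```

The third run is flagged as well: the confirmation exists in the log, but it arrived after the money had already moved.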

You also need the habit of regularly reviewing your AI agents' behavior logs. Even when the outcomes are correct, periodically check the path by which they were reached. Catching odd patterns before they accumulate is what keeps the eventual cost down. And this is not a job for technical staff alone. Anyone who has brought AI tools into their work needs to keep monitoring the process, separately from verifying the results.

The broader the range of work AI tools can handle, the more sharply defined the place for human judgment becomes. There is a paradox here: the better the automation works, the more refined the oversight design must be. The core competency left to practitioners in the AI era is not chasing the tool's speed but setting in advance how far the tool is allowed to decide, and no model will take on that role for you.

The more perfect the success metrics, the harder it is to see what disappeared inside them. Only the organizations that decide, before they adopt an AI agent, what it may determine on its own and what it must hand back to a human will actually be in control of this technology.