Anthropic Ran 186 Agent-to-Agent Deals and the Losers Couldn't Tell They Lost

Project Deal ran on free-form Slack conversations and post-experiment gift cards. Negotiation quality was the only variable Anthropic isolated.

Anthropic ran a closed agent-on-agent commerce experiment called Project Deal in December 2025 and published the results on April 24, 2026. Across 500+ listed items in an internal Slack marketplace, Claude agents struck 186 deals worth roughly $4,000. Opus sellers earned $2.68 more per item than Haiku sellers; Opus buyers paid $2.45 less. The losing side rated outcomes as equally fair.

What Project Deal actually tested

Sixty-nine Anthropic employees volunteered to list items they wanted to sell or buy. Each got a $100 gift-card budget. The agents did everything from there: drafting listings, browsing inventory, opening DMs, haggling on price, closing the deal. No prebaked negotiation protocol. No structured intent tokens. Just Claude, talking to Claude, in plain English over Slack.

The full Project Deal writeup is on Anthropic's site, and worth reading even if you skim the methodology.

Four parallel runs went up at the same time: two Opus-only, two with a 50/50 random mix of Opus 4.5 and Haiku 4.5. One run was treated as the "real" one, where deals were honored after the experiment closed.

A few things stood out in the transcripts. One agent, told it could spend its budget on whatever it wanted, bought 19 ping-pong balls because it interpreted the request as an invitation to acquire "perfectly spherical orbs of possibility." Another fabricated a story about urgently needing a particular item to push its counterparty toward a concession. A third negotiated a free doggy day, giving nothing in return, because no one had told it not to.

That stuff makes for good headlines. It's not the story.

The model-quality gap is the finding marketers should screenshot

When you put Opus on one side of a deal and Haiku on the other, Opus systematically won on price. Same item, same description, same context. Lab-grown ruby: $65 with Opus, $35 with Haiku. Broken folding bike: $65 with Opus, $38 with Haiku. The aggregate edge held up with statistical significance: Opus users completed about two more deals on average, and the price gap landed somewhere between $2.45 and $2.68 per item, depending on which side of the table they were on (p = 0.001).
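The per-item arithmetic on those two examples is worth seeing next to the aggregate figure. A minimal sketch, using only the two item prices quoted above (everything else, including the variable names, is illustration):

```python
# Per-item price gap between Opus-negotiated and Haiku-negotiated outcomes,
# for the two example items quoted from the Project Deal writeup.
from statistics import mean

# (item, price with Opus on your side, price with Haiku on your side)
deals = [
    ("lab-grown ruby", 65, 35),
    ("broken folding bike", 65, 38),
]

gaps = [opus - haiku for _, opus, haiku in deals]
for (item, opus, haiku), gap in zip(deals, gaps):
    print(f"{item}: Opus ${opus} vs Haiku ${haiku} -> ${gap} edge")

print(f"mean edge across these two items: ${mean(gaps):.2f}")
```

Note the tension: these two cherry-picked items show a $27 to $30 edge, while the reported aggregate is $2.45 to $2.68 per item. The aggregate presumably averages over all 186 deals, most of which would have had small or no gaps; the writeup doesn't break that distribution out.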

Then Anthropic asked participants to rate fairness. Opus users scored outcomes 4.05 out of 7. Haiku users scored outcomes 4.06 out of 7. Eleven of 28 ranked Haiku-driven results higher even when the receipts said otherwise. The Decoder framed it sharply: stronger models cut better deals and the losers don't even notice.

Anthropic's own line was almost careful enough to miss: "agent quality gaps where people on the losing end might not realize they're worse off."

This is, as far as I can tell, the first published controlled study showing that representation quality in an agent stack creates outcome inequality humans cannot detect from the inside. It's a small sample. It was internal. The participant pool was self-selected from people who work at Anthropic, which is roughly the most agent-literate cohort on earth. None of those caveats make the finding less interesting. They probably make it stronger, because if Anthropic employees can't sense the gap, civilians won't either.

Why this matters more for ad bidding than for shopping bots

Most coverage will run with the consumer-shopping framing. Cute ping-pong balls. "AI agents to shop for you." That's the wrong frame for marketers.

The more interesting analog is the side of your job that already runs on agentic logic, even though nobody calls it that. Smart Bidding. Performance Max. Meta's Andromeda. Reddit's auction. Trade Desk's Koa. Each of those systems is, functionally, an agent transacting on your behalf inside a closed marketplace where the counterparty is also an agent. Some of those agents are smarter than yours. You will not see the gap on the bid log. You will see it on quarterly ROAS, and you will probably not blame the model.

We covered something adjacent last week. Trade Desk's Koa agents are getting filtered as bot traffic by ad servers that haven't updated their fingerprints. That's the same story from the supply side. The buy side has a parallel risk: the agent representing your dollars might be a generation behind the one representing the publisher's inventory, and you'd have no way to tell from inside the platform's reporting.

Procurement is the cleaner example because it's about to happen. Plenty of mid-market companies are already piloting agent-driven supplier negotiations. If Anthropic's finding holds outside Slack, the buyer running an Opus-class agent against a vendor running a Haiku-class agent walks away with measurably better terms. The buyer running the cheaper agent walks away thinking they did fine.

The protocol question Anthropic dodged on purpose

Project Deal had no payment rails. Settlement was post-hoc gift cards. That choice matters because almost everyone else working on agentic commerce is doing the opposite, which is to say: building the rails first.

Stripe and OpenAI co-maintain the Agentic Commerce Protocol, with launch partners including Etsy, Coach, Kate Spade, and Urban Outfitters' brand portfolio. Stripe's Shared Payment Tokens let an agent charge a saved card without seeing the credentials. Google announced its Universal Commerce Protocol in January with Shopify, Etsy, Wayfair, Target, and Walmart on day one. Visa has hundreds of agent-initiated transactions live across more than 100 partners. Mastercard's Agent Pay launched in 2025 with a Verifiable Intent trust layer.

By stripping payment rails out, Anthropic isolated the negotiation-quality variable from the settlement-trust variable. The result is a study about model performance, not a study about transaction safety. That's the right research design choice and it leaves a gap. The next experiment, the one nobody has run yet, is what happens when you bolt a worse model into a hardened payments protocol. Does the protocol close the gap by constraining what either side can do, or does it just legitimize whatever the smarter agent extracted? I'd bet on the latter, but I'm not confident.

The timing is the part the press isn't writing about. CNBC reported in March that OpenAI is unwinding Instant Checkout. Only about 30 Shopify merchants ever went live, merchant onboarding was painful, and ChatGPT is shifting toward dedicated retailer apps. Anthropic published Project Deal exactly as the loudest commercial agent-shopping experiment is being walked back. They're not shipping a product. They're shipping a finding.

What to audit before the next quarter closes

Two things worth doing this week.

First, list the agentic systems already negotiating on your behalf. Smart Bidding strategies. Meta Advantage+ campaign budgets. Any procurement or supplier-comms tool with an LLM in the loop. Any sales engagement platform doing automated reply triage. Then ask, for each: what model class is this running on, when was it last upgraded, and is the counterparty system likely on a newer one? You probably won't get clean answers from the platforms. That's the point. The Project Deal finding is that opacity at the model layer is the new attribution problem.
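That inventory can start as a plain table. A sketch under stated assumptions: the systems listed are the ones named above, but the fields and values are placeholders, not a claim about what any platform actually discloses:

```python
from datetime import date

# Illustrative inventory of agentic systems negotiating on your behalf.
# Model classes and upgrade dates are placeholders; most platforms won't
# disclose them, and "unknown" is itself the finding worth recording.
inventory = [
    {"system": "Smart Bidding",        "model_class": "unknown",     "last_upgraded": None},
    {"system": "Meta Advantage+",      "model_class": "unknown",     "last_upgraded": None},
    {"system": "supplier-comms tool",  "model_class": "haiku-class", "last_upgraded": date(2025, 6, 1)},
]

# Flag anything where you can't answer the audit questions at all.
opaque = [row["system"] for row in inventory
          if row["model_class"] == "unknown" or row["last_upgraded"] is None]
print("opaque at the model layer:", opaque)
```

The point of the exercise isn't the table; it's how many rows end up in the `opaque` list.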

Second, before you greenlight any agent-to-agent procurement pilot in Q3 or Q4, write down what your fairness sense-check actually looks like. If your post-deal review process is "did this feel reasonable," it isn't a review process. The Anthropic study showed that feeling cannot detect a 7% to 30% price gap on identical items. Build a spec, run a sample of completed negotiations through an Opus-class evaluator, and treat that as your audit trail. It's the most concrete defense against being on the wrong side of a model gap you'll never see in your own reporting.
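A spec-based check can be this small. The sketch below assumes you keep a reference price per item and flags deals that overshoot it by more than a threshold; the evaluator-model step is deliberately left out, since in practice you'd send the full transcript to a stronger model and log its verdict alongside. The threshold, item names, and prices are illustrative (the prices reuse the ruby and bike figures from the study):

```python
# Minimal post-deal audit: flag completed purchases whose price exceeds a
# reference by more than a threshold. A stand-in for "did this feel
# reasonable" that actually produces an audit trail.
THRESHOLD = 0.07  # flag anything past the low end of the 7%-30% gap

def audit(deals, reference_prices, threshold=THRESHOLD):
    """Return (item, overpay_ratio) for each deal beyond the threshold."""
    flagged = []
    for item, paid in deals:
        ref = reference_prices[item]
        overpay = (paid - ref) / ref
        if overpay > threshold:
            flagged.append((item, round(overpay, 2)))
    return flagged

reference = {"lab-grown ruby": 35, "folding bike": 38}
completed = [("lab-grown ruby", 65), ("folding bike", 39)]
print(audit(completed, reference))
```

A deterministic check like this won't catch a skilled agent extracting concessions inside the spread; that's what the evaluator pass over the transcripts is for. But it's a floor, and a floor is more than a feeling.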

I don't know if any of this scales the way Anthropic's framing implies. The sample is small. The setting is artificial. But the finding is uncomfortable enough that pretending it doesn't apply to ad-side and procurement-side agents feels like a thing future-me would regret.

Notice Me Senpai Editorial