Mondelez Spent $40M Unblocking the AI Crawlers It Spent Years Blocking

Mondelez committed $40M to AI marketing infrastructure after the same crawlers it once blocked stopped citing Oreo in cookie queries.

Mondelez originally blocked AI crawlers from indexing its content, then watched Oreo get cited in roughly 10% of AI chatbot cookie answers despite owning the most famous cookie brand on earth. The company has since reversed course, unblocked GPTBot and friends, and committed $40 million to an AI marketing platform built with Accenture and Publicis to claw the visibility back. Cookie-query citation is now closer to 70%.

That arc is the protection paradox in one company. Bill Hunt's piece in Search Engine Journal this week put a name on the pattern, but Mondelez gave it the receipts.

The block-then-pay loop is now a documented playbook

Roughly 79% of major news publishers currently block AI training bots in robots.txt. About 25% of the top 1,000 websites do the same, up from 5% in early 2023. Most of those blocks were written in a 2023-era panic about model training, scraping, and copyright. The strategy was: starve the bots, force OpenAI and Google to come to the table, protect the IP.

What actually happened is more annoying. The bots got fed by everybody else: Reddit threads, secondary news coverage, affiliate roundups, retailer pages. Brands that blocked their own .com lost the chance to be the cited source on questions about themselves. Mondelez told Digiday that its $3.5 billion digital commerce strategy hit a wall when the team realized Oreo's product pages weren't even in the training set most LLMs were drawing from. So it had to spend more money to be visible in a place its own robots.txt was telling crawlers not to look.

The other half of the paradox is the marketing dollar going somewhere else. WPP, Omnicom, and Dentsu all committed budget to OpenAI's first ad test, and the brands inside those holdcos are now paying for placement in answers their own content could have populated for free. Adthena's early read on that test had ChatGPT ad CTR running at roughly one-seventh of Google search benchmarks, which is the kind of return curve you accept only when there is no organic alternative. Blocking the crawler is what removes the organic alternative.

The IP logic was right. The application was wrong.

The original case for blocking GPTBot was reasonable. Models were ingesting copyrighted content with no compensation, scraping was happening at scale, and litigation was mostly theoretical at the time. I think most legal teams made the right call given what they knew. The mistake was treating "block AI crawlers" as a single switch instead of a question about which crawlers, which content, and what trade-off you accept on visibility.

The teams that have done this well now run a hybrid robots.txt: open to public marketing pages, closed on gated content libraries, login walls, customer data, and IP-sensitive technical docs. That is the right calibration. The wrong one is the all-or-nothing block that turns your homepage into a cited source for nobody. Search Engine Land has tracked the broader trend and put the hard block at roughly 7% of the top 1,000 sites in its earliest counts, climbing every quarter since.

And to be fair, this isn't entirely the marketers' fault. Most robots.txt files were written or updated by web infrastructure teams under pressure from legal, often without anybody from SEO in the room. The decision logic was framed as risk mitigation, not as a discoverability trade. By the time AI Overviews and ChatGPT search were citing brand pages as authoritative, the block was already in place and nobody owned the question of whether it should still be.

The TechTarget tell

Hunt's article points to a smaller version of the same paradox that has been running for years in B2B. Companies gate a whitepaper behind a 12-field form. Lead aggregators like TechTarget then repackage the same content with a lighter form and sell the resulting leads back at $15 to $30 a head. Brands pay for leads on their own ideas because their own version of the page was undiscoverable.

This part surprised me when I first saw the math. A lot of B2B teams know they pay for syndication. Fewer realize they're paying for a wrapper around content they already paid to produce. AI search is the consumer-side version of the same loop: somebody else summarizes your idea, somebody else gets the citation, you pay for the impression.

The reason it persists is that the cost of the gate is invisible on the marketing P&L. The lead form generates a pipeline number that goes in a deck. The undiscoverable page generates nothing measurable, so it doesn't show up. Until somebody runs a citation audit and you realize the same content is making money for an aggregator and zero for you, the math looks fine.

What a 30-minute crawler audit actually catches

The fix is not "unblock everything." It's an audit, and on most sites it takes about 30 minutes if you know where to look.

Open your robots.txt. List every user-agent block. For each one, ask: is this content I want to be the cited source on, or content I want to keep private? Most sites end up with three buckets. Public marketing and product pages should be open to GPTBot, Google-Extended, ClaudeBot, PerplexityBot, OAI-SearchBot, and CCBot. Customer data, billing, account areas, and gated docs should stay closed. The middle bucket (whitepapers, gated reports, anything sitting behind a form) is the one worth arguing about. If the goal of the gate is leads, you are paying twice. If the goal is genuine IP protection, the gate stays.
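For concreteness, a hybrid file along those lines might look like the sketch below. The user-agent names are the crawlers listed above; the paths are placeholders you would swap for your own site structure, and the middle-bucket whitepaper path is only there to show where that argument lands in the file.

    # Public marketing and product pages: open to AI crawlers,
    # with customer and gated areas explicitly carved out.
    User-agent: GPTBot
    User-agent: Google-Extended
    User-agent: ClaudeBot
    User-agent: PerplexityBot
    User-agent: OAI-SearchBot
    User-agent: CCBot
    Allow: /
    Disallow: /account/
    Disallow: /billing/
    Disallow: /resources/gated/

    # Default rules for everything else stay as they were.
    User-agent: *
    Disallow: /account/
    Disallow: /billing/

Longest-match rules win under the robots exclusion standard, so the specific Disallow lines override the blanket Allow for those directories while the rest of the public site stays crawlable.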

The benchmark to set: track citation rate before and after. Pull a list of 20 brand-relevant queries (your category, your product, your competitor set), run them in ChatGPT, Perplexity, Gemini, and Google AI Overviews, and log how many cite your domain. If you're under 30% and you've been blocking, expect a real lift in 6-8 weeks once crawlers re-index. Mondelez moved cookie-category citation from ~10% to ~70%, which is the high end, but a 2x lift on category queries is realistic from an unblock + content cleanup pass.
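The queries still have to be run and judged by hand, but the tally is worth automating so the before/after comparison is consistent. A minimal sketch, assuming a hand-logged CSV named citations.csv with one row per query per assistant and a cited column marking whether your domain appeared (the file name and columns are illustrative, not a standard format):

    # citation_rate.py - tally AI-search citation rate from a hand-logged CSV.
    # Assumed (hypothetical) columns: query, assistant, cited ("yes"/"no").
    import csv
    from collections import defaultdict

    def citation_rates(path="citations.csv"):
        totals = defaultdict(int)   # rows logged per assistant
        cited = defaultdict(int)    # rows where our domain was cited
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                assistant = row["assistant"].strip()
                totals[assistant] += 1
                if row["cited"].strip().lower() == "yes":
                    cited[assistant] += 1
        for assistant in sorted(totals):
            rate = cited[assistant] / totals[assistant]
            print(f"{assistant:20s} {cited[assistant]:>3}/{totals[assistant]:<3} ({rate:.0%})")
        overall = sum(cited.values()) / sum(totals.values())
        print(f"{'overall':20s} {sum(cited.values()):>3}/{sum(totals.values()):<3} ({overall:.0%})")

    if __name__ == "__main__":
        citation_rates()

Run it once before the unblock and again 6-8 weeks after, and the per-assistant breakdown tells you whether the lift is broad or concentrated in one surface.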

One nuance worth flagging. OpenAI shipped a separate ad-serving crawler this year, OAI-AdsBot, which doesn't honor a generic GPTBot block. If your goal is to block training but allow ad-side indexing, that distinction matters. If your goal is to block everything, you now need to maintain a longer list, and the list grows every quarter.
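Keeping that longer list honest is easier if you check it against the live file rather than your memory of it. Python's standard-library robotparser can do that; the bot list and test URL below are examples under assumed names, not an exhaustive registry, and you would extend them as new crawlers ship.

    # crawler_audit.py - check which AI user agents your robots.txt currently allows.
    # The bot list and test path are illustrative; swap in your own domain and pages.
    import urllib.robotparser

    BOTS = ["GPTBot", "Google-Extended", "ClaudeBot",
            "PerplexityBot", "OAI-SearchBot", "CCBot"]

    def audit(site="https://www.example.com", test_path="/products/"):
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(site.rstrip("/") + "/robots.txt")
        rp.read()  # fetches and parses the live robots.txt
        for bot in BOTS:
            allowed = rp.can_fetch(bot, site.rstrip("/") + test_path)
            print(f"{bot:18s} {'ALLOWED' if allowed else 'blocked'} on {test_path}")

    if __name__ == "__main__":
        audit()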

The cleanup before the next budget cycle

From what I've seen, the brands that figure this out earliest aren't necessarily the ones with the biggest AI budgets. They're the ones who get a single owner assigned to discoverability across organic search, AI search, and paid ad surfaces, instead of leaving robots.txt as a legal artifact and chatbot visibility as a CMO problem.

Mondelez's $40 million is the splashy number, but the cheaper version of the same move is sitting in most marketing orgs already. The audit costs nothing. The hard part is getting legal, web infra, and SEO in the same room long enough to agree on which pages are IP and which pages are inventory. Do that before your next paid-AI test goes live, and you won't be paying for placement in answers your own pages could have populated for free.

Personally, I'd put a six-week deadline on the audit and tie the citation-rate baseline to whatever AI advertising line item is already approved for next quarter. If the paid number is going up, the organic number ought to be moving with it, and the only way that happens is if the crawlers can actually read the pages you keep telling investors are best-in-class.
