Data, the fuel

Models are built from data, and almost everything surprising about how they behave traces back to what they were fed. This page covers where that data came from, why its diversity matters, and the strange problem of the internet starting to feed on itself.

If there's one word that sits underneath everything in this section, it's data. A model isn't programmed; it's grown from examples, so what you feed it is what you get. This is familiar territory for anyone at Salesforce: we've all heard "data is the new gold" for years, and there's a whole product line (Data Cloud / Data 360) built on the idea that the company with the best data wins. The same idea is the engine of modern AI, just at a scale that's hard to picture.

A quick distinction first, because it prevents a lot of confusion later. There are two completely different ways data shows up in AI:

Data as training fuel (this page): the enormous pile of examples a model learns from when it's built. You never see it; it's baked in.
Data as context (later, in Working with AI): the information you hand a model while you use it, like pasting in a document or grounding it in your records.

Right now we're talking about fuel.

Why it took this much data

Earlier AI could get by on modest, carefully labeled datasets. Large language models needed something different: mass. To learn the patterns of language well enough to be genuinely useful, they had to read a staggering amount of what humans have written down. That's a big part of why this moment arrived when it did: for the first time in history, the internet had quietly assembled a pile of examples that large.

So the race became a race for data. AI companies scraped enormous amounts of the public internet, fast, to feed their models. That land-grab is the source of most of the legal and ethical fights you read about now: who owns the text and images these models learned from, and whether they should have been used at all. You don't need a position on those fights to understand the mechanic underneath them: more data, gathered faster, meant a more capable model, so everyone grabbed as much as they could.

Diversity of data changes the outcome

Here's the part that's easy to miss, and it matters more than raw volume. Where the data comes from and how varied it is both shape what the model becomes. A model trained mostly on English text struggles in other languages. A model fed lots of code is better at code. Skew the diet and you skew the model. This is why labs guard their exact data mix like a recipe: it's one of the biggest reasons two models of similar size behave differently. We come back to this directly when we look at how a model is built, because curating that mix is literally the first step.
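For readers who like to see the idea in code, here is a minimal sketch of what a "data mix" means in practice. It is an illustration only: the source names, the percentages, and the function name are made up, and real training pipelines are far more involved. It simply picks a source for each training example in proportion to the mix, so you can see how a skewed recipe becomes a skewed diet.

```python
import random
from collections import Counter

# Hypothetical data mix: the share of training examples drawn from each source.
# These names and numbers are invented for illustration.
DATA_MIX = {"english_web": 0.70, "code": 0.20, "other_languages": 0.10}


def sample_training_sources(mix, n_examples, seed=0):
    """Pick a source for each training example in proportion to the mix."""
    rng = random.Random(seed)
    sources = list(mix)
    weights = [mix[name] for name in sources]
    picks = rng.choices(sources, weights=weights, k=n_examples)
    return Counter(picks)


if __name__ == "__main__":
    counts = sample_training_sources(DATA_MIX, n_examples=10_000)
    for source, count in counts.most_common():
        print(f"{source}: {count} examples")
```

With this made-up mix, the model would see roughly seven English web examples for every one in another language, which is the kind of imbalance that leaves it weaker outside English.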

Hold onto this idea, "the model becomes what it eats", because it's the thread that runs through the next problem.

The internet starting to feed on itself

Now the strange part. As AI got good at writing, more and more of the internet started being written by AI: articles, comments, reviews, even accounts posing as people. So the next generation of models is increasingly trained on text the last generation produced. The internet is, in a small but growing way, beginning to feed on itself.

It's tempting to dramatize this ("soon there'll be more AI content than human content online!"), and you'll see that claim made confidently. Be careful with it. The honest version is narrower: by late 2024, roughly half of newly published web articles showed signs of being AI-generated, according to one analysis, and that figure has since leveled off. That's new articles, not the whole internet, and it comes from a single vendor study using an imperfect detector, so treat it as a rough estimate, not gospel.

The reason it actually matters isn't the headline number. It's a real, measured effect called model collapse. When researchers trained models repeatedly on the output of previous models, the models got worse over generations, losing the rare and unusual cases first, then drifting toward bland, repetitive sameness. The finding was published in Nature in 2024. The crucial nuance, and the reason "AI will inevitably poison itself" is too strong: collapse happened when AI output replaced human data. Keeping real human data in the mix largely prevented it.
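The mechanism can be shown with a toy simulation. This is a sketch under loud assumptions, not the method from the Nature paper: the "model" here is just a table of word frequencies, the "human" data follows a simple made-up distribution where a few words are common and most are rare, and all names and parameters are hypothetical. Each generation, the model is rebuilt from a fresh sample. When the sample comes only from the previous model, any rare word that happens to be missed is gone for good. When part of the sample still comes from human data, rare words keep getting reintroduced.

```python
import random
from collections import Counter


def zipf_weights(vocab_size):
    """Toy 'human' distribution: word i has weight 1/(i+1), so most words are rare."""
    return [1.0 / (rank + 1) for rank in range(vocab_size)]


def run_generations(human_weights, generations, sample_size, human_share, seed=0):
    """Retrain a frequency-table 'model' generation after generation.

    Returns the number of distinct words the model can still produce,
    before any retraining (index 0) and after each generation.
    """
    rng = random.Random(seed)
    words = list(range(len(human_weights)))
    model_weights = list(human_weights)  # generation 0 learned from human data
    support = [sum(1 for w in model_weights if w > 0)]

    n_human = round(sample_size * human_share)
    n_model = sample_size - n_human

    for _ in range(generations):
        corpus = []
        if n_model:
            # Text written by the previous model.
            corpus += rng.choices(words, weights=model_weights, k=n_model)
        if n_human:
            # Fresh text from the original human distribution.
            corpus += rng.choices(words, weights=human_weights, k=n_human)
        counts = Counter(corpus)
        # "Training" is just counting: unseen words get weight 0.
        model_weights = [counts.get(word, 0) for word in words]
        support.append(sum(1 for w in model_weights if w > 0))
    return support


if __name__ == "__main__":
    human = zipf_weights(200)
    ai_only = run_generations(human, generations=30, sample_size=300, human_share=0.0)
    mixed = run_generations(human, generations=30, sample_size=300, human_share=0.5)
    print("AI output only, distinct words per generation:", ai_only)
    print("Half human data, distinct words per generation:", mixed)
```

In the AI-only run, the count of distinct words can only go down, because a word with zero weight can never be sampled again. That is the "losing the rare cases first" pattern in miniature.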

Which brings us right back to the thread: diversity and quality of data decide the outcome. A model fed a narrowing diet of its own output degrades; a model kept on rich, varied, genuinely human data stays healthy. The data isn't a detail of how these systems work. It's the thing itself.

The one-line version: a model becomes what it's trained on. Volume got these models off the ground, but the mix and quality of the data are what make one model better than another, and that's why feeding AI nothing but AI output slowly breaks it.

📝 Practice

Think about a place your team keeps data (a knowledge base, past cases, closed deals). If you trained an assistant only on that, what would it be unusually good at, and where would its blind spots be? That's the same question AI labs ask about the whole internet, just smaller. Noticing the blind spots is the skill.
