Transformers & attention

The 2017 breakthrough that made today's AI possible. What "attention" means, kept to the intuition.

This is the one piece of "how it actually works" that's worth knowing by name, because it's the design behind essentially every modern AI system, and because knowing it is what lets you say, with confidence, that the thing everyone calls intelligent is not intelligent. It's a remarkably advanced math function. Let's see why.

The transformer (not the Michael Bay movie)

In 2017, Google researchers published a paper with the now-famous title "Attention Is All You Need." It introduced a model design called the transformer. That's literally the T in GPT (Generative Pre-trained Transformer). When people say "the breakthrough that started all this," this is the thing they mean.

We're going to stay at the level of intuition on purpose. No matrices, no math. The goal is for you to understand what it does, not to build one.

The one idea: attention

Here's the problem the transformer solved. To predict the next word in a sentence, a model needs to know which earlier words actually matter. Not all words carry equal weight.

Take: "The customer who filed the case last week is still waiting for a ___." To guess the next word, the important words are "customer," "case," and "waiting," not "the" or "last." Attention is the mechanism that lets the model weigh which earlier words matter most for what it's trying to predict right now. That's the whole idea. The model learns where to "pay attention."
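If it helps to see that weighing as numbers, here is a toy sketch in Python. The scores are invented by hand for this one sentence (a real model learns how to produce them), and softmax is the standard function for turning raw scores into weights that add up to 1. Treat it as an illustration of the intuition, not as how any particular model is implemented.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# The example sentence, one entry per word.
words = ["The", "customer", "who", "filed", "the", "case",
         "last", "week", "is", "still", "waiting", "for", "a"]

# Hypothetical relevance scores, made up by hand for this illustration.
# In a real model these come from learned parameters.
scores = [0.1, 3.0, 0.2, 1.5, 0.1, 2.8, 0.3, 0.8, 0.2, 0.9, 2.9, 0.4, 0.3]

weights = softmax(scores)

# Rank the words by how much attention each one receives.
ranked = sorted(zip(words, weights), key=lambda pair: pair[1], reverse=True)
top_three = [word for word, _ in ranked[:3]]

for word, weight in ranked[:3]:
    print(f"{word:10s} {weight:.2f}")
```

With these made-up scores, "customer," "waiting," and "case" end up with the largest weights, and the little words get almost none.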

That sounds simple, but it was the unlock. Earlier designs read text strictly one word at a time, in order, which made them slow to train and forgetful over long passages. Attention let models look at a whole passage at once and train in parallel across enormous amounts of data. That parallelism is what let AI scale up to the sizes that made it suddenly useful: the "enough compute, enough data" story from "why now," finally with a design that could use both.

Why you actually need to know this

Because it demystifies the magic. A transformer, underneath, is weighing which tokens matter and predicting a likely next token. It is not reasoning, not understanding, not conscious. There's a name for this, which we met in "how a model is built": a stochastic parrot, something that produces fluent language by statistics without grasping meaning. The model mimics intelligence astonishingly well. But once you know it's attention plus prediction, you can stop being dazzled and start using it well.
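The "prediction" half can be sketched the same way. Assume the model has already produced a probability for each candidate next token. The tokens and numbers below are invented for illustration. Picking the next token is then either "take the most likely one" or "roll weighted dice," and the second option is where the word "stochastic" comes from.

```python
import random

# Hypothetical probabilities for the token after "...is still waiting for a".
# Both the candidate tokens and the numbers are invented for illustration.
next_token_probs = {
    "response": 0.55,
    "refund": 0.25,
    "callback": 0.15,
    "banana": 0.05,
}

def most_likely(probs):
    """Greedy choice: always take the single most probable token."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Stochastic choice: pick a token at random, weighted by probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
greedy = most_likely(next_token_probs)
draws = [sample(next_token_probs, rng) for _ in range(5)]

print("greedy:", greedy)
print("sampled:", draws)
```

Neither function knows what a refund is. Each one only works with the numbers it was handed.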

If you want to go deeper

This is the edge of what's worth explaining in words. The actual mechanics of attention are genuinely mathematical, and the best explanation I know of is visual: 3Blue1Brown's "Attention in transformers" walks through it step by step, with the matrices drawn out. If you're curious past the intuition above, start there. (A transformer, for the record, is one particular kind of neural network, originally designed for language and since applied to images, audio, and more.)

📝 Practice

Take any sentence and underline the two or three words that matter most for guessing what comes next. That's the job attention does, automatically, for every token. Notice how often the little words ("the," "of," "is") aren't the ones you underlined.
