Transformers & attention
The 2017 breakthrough that made today's AI possible. What "attention" means, kept to the intuition.
This is the one piece of "how it actually works" that's worth knowing by name, because it's the design behind essentially every modern AI system, and because knowing it is what lets you say, with confidence, that the thing everyone calls intelligent is not intelligent. It's a remarkably advanced math function. Let's see why.
The transformer (not the Michael Bay movie)
In 2017, Google researchers published a paper with the now-famous title "Attention Is All You Need." It introduced a model design called the transformer. That's literally the T in GPT (Generative Pre-trained Transformer). When people say "the breakthrough that started all this," this is the thing they mean.
We're going to stay at the level of intuition on purpose. No matrices, no math. The goal is for you to understand what it does, not to build one.
The one idea: attention
Here's the problem the transformer solved. To predict the next word in a sentence, a model needs to know which earlier words actually matter. Not all words carry equal weight.
Take: "The customer who filed the case last week is still waiting for a ___." To guess the next word, the important words are "customer," "case," and "waiting," not "the" or "last." Attention is the mechanism that lets the model weigh which earlier words matter most for what it's trying to predict right now. That's the whole idea. The model learns where to "pay attention."
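If you'd like to see that weighing as something concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the scores are hand-picked rather than learned, and the names (SCORED_WORDS, attention_weights, top_words) are hypothetical. It only shows the final step of the idea, turning per-word relevance scores into weights that add up to 1. It is not how a real transformer computes its scores.

```python
import math

# Hypothetical relevance scores for each earlier word, hand-picked for
# illustration. A real transformer learns its scores from data.
SCORED_WORDS = [
    ("The", 0.1), ("customer", 3.0), ("who", 0.2), ("filed", 1.0),
    ("the", 0.1), ("case", 2.5), ("last", 0.3), ("week", 0.5),
    ("is", 0.2), ("still", 0.6), ("waiting", 2.8), ("for", 0.2), ("a", 0.1),
]


def softmax(values):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(values)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scored_words):
    """Pair each word with its share of the 'attention'."""
    words = [word for word, _ in scored_words]
    weights = softmax([score for _, score in scored_words])
    return list(zip(words, weights))


def top_words(scored_words, k=3):
    """Return the k words with the largest weights, biggest first."""
    ranked = sorted(
        attention_weights(scored_words),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [word for word, _ in ranked[:k]]


if __name__ == "__main__":
    for word, weight in attention_weights(SCORED_WORDS):
        print(f"{word:>10}  {weight:.2f}")
    print("Most attended:", top_words(SCORED_WORDS))
```

With the scores chosen here, the three largest weights land on "customer," "waiting," and "case," while the little words get almost nothing. Change the scores and the weights shift with them, which is the part a real model learns on its own.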
That sounds simple, but it was the unlock. Earlier designs read text strictly one word at a time, in order, which made them slow to train and forgetful over long passages. Attention let models look at a whole passage at once and train in parallel across enormous amounts of data. That parallelism is what let AI scale up to the sizes that made it suddenly useful: the "enough compute, enough data" story from "Why now," finally with a design that could use both.
Why you actually need to know this
Because it demystifies the magic. A transformer, underneath, is weighing which tokens matter and predicting a likely next token. It is not reasoning, not understanding, not conscious. There's a name for this, which we met in "How a model is built": a stochastic parrot, something that produces fluent language by statistics without grasping meaning. The model mimics intelligence astonishingly well. But once you know it's attention plus prediction, you can stop being dazzled and start using it well.
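The "prediction" half can be sketched the same way. This is a toy under stated assumptions, not a real model: the probabilities in NEXT_WORD_PROBS are invented for the customer sentence from earlier, and the function names are hypothetical. A real model produces a probability for every token in its vocabulary; the sketch only shows what happens once those numbers exist.

```python
import random

# Hypothetical probabilities for the word that fills the blank, invented
# for illustration. A real model computes these over its whole vocabulary.
NEXT_WORD_PROBS = {
    "response": 0.55,
    "refund": 0.25,
    "callback": 0.15,
    "banana": 0.05,
}


def most_likely(probs):
    """Always pick the single highest-probability word."""
    return max(probs, key=probs.get)


def sample_next(probs, rng):
    """Pick a word at random, in proportion to its probability."""
    words = list(probs)
    weights = list(probs.values())
    return rng.choices(words, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random()
    print("Most likely:", most_likely(NEXT_WORD_PROBS))
    print("Five samples:", [sample_next(NEXT_WORD_PROBS, rng) for _ in range(5)])
```

Run it a few times and the samples change, mostly landing on the likely words and occasionally on an unlikely one. That randomness is the "stochastic" in stochastic parrot: the output is a draw from a probability table, not a conclusion the model reasoned its way to.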
If you want to go deeper
This is the edge of what's worth explaining in words. The actual mechanics of attention are genuinely mathematical, and the best explanation I know of is visual: 3Blue1Brown's "Attention in transformers" walks through it step by step, with the matrices drawn out. If you're curious past the intuition above, start there. (A transformer, for the record, is one particular kind of neural network, one originally designed for language.)
Practice
Take any sentence and underline the two or three words that matter most for guessing what comes next. That's the job attention does, automatically, for every token. Notice how often the little words ("the," "of," "is") aren't the ones you underlined.