You type a question into ChatGPT. A few seconds later, a coherent, useful answer streams back. What just happened? What is the machine actually doing?
It's doing the same thing at every scale, from the simplest possible version all the way up to the real thing. Let me show you.
Start with one letter
Take the letter "t." If you had to guess what letter comes after "t" in English, what would you pick?
You could actually figure this out. Grab a big pile of English text — a few books, say — and count every pair of letters. Every time you see "t" followed by "h," make a tally mark next to "th." Every time you see "t" followed by "o," tally "to." Do this for every letter that follows "t."
When you're done, you'd find something like: "h" follows "t" about 35% of the time. "o" about 15%. "e" about 12%. "i" about 10%. And so on, trailing off into the rare ones.
That table of percentages is a model. A tiny, bad one. But it's a model. Given one letter of input, it produces a probability distribution over what letter comes next. Text in, text out.
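If you like seeing things in code, here's roughly what that counting looks like in Python. (The file name "books.txt" is just a stand-in for whatever pile of English text you have on hand.)

```python
from collections import Counter, defaultdict

text = open("books.txt").read().lower()   # any big pile of English text

# Tally every pair of adjacent letters: counts["t"]["h"] += 1, and so on.
counts = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    if a.isalpha() and b.isalpha():
        counts[a][b] += 1

# Turn the tallies for "t" into percentages -- that's the model.
total = sum(counts["t"].values())
for letter, n in counts["t"].most_common(5):
    print(f"t -> {letter}: {100 * n / total:.0f}%")
```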
🔤 Interactive Demo: Single-Letter Prediction
Given the letter "t", see the probability distribution of what comes next. Generate text one letter at a time.
Coming soon
Now use two letters
One letter of context isn't enough to produce anything useful. But what about two?
Given "th," what comes next? Probably "e" (as in "the"), or "a" (as in "that"), or "i" (as in "this"). The predictions get a lot sharper when you have more context.
Given "qu"? Almost always "i" or "e." The model barely needs to guess.
You build this model the same way: count every three-letter sequence in a big pile of text, and turn the counts into percentages. Now your model takes two letters as input and predicts the third.
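The same counting trick generalizes to any amount of context, and once you have the table you can generate text by sampling from it, one letter at a time. A quick sketch, again assuming a "books.txt" full of English text:

```python
import random
from collections import Counter, defaultdict

def build_model(text, context_len):
    """Count which letter follows each chunk of `context_len` letters."""
    counts = defaultdict(Counter)
    for i in range(len(text) - context_len):
        context = text[i:i + context_len]
        counts[context][text[i + context_len]] += 1
    return counts

def generate(model, seed, length, context_len):
    """Generate text one letter at a time by sampling from the counts."""
    out = seed
    for _ in range(length):
        options = model.get(out[-context_len:])
        if not options:
            break
        letters, weights = zip(*options.items())
        out += random.choices(letters, weights=weights)[0]
    return out

text = open("books.txt").read().lower()
model = build_model(text, context_len=2)
print(generate(model, seed="th", length=200, context_len=2))
```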
🔤🔤 Interactive Demo: Two-Letter Prediction
Given two letters like "th", see how much sharper the predictions get. Compare with single-letter output.
Coming soon
See the pattern? More context makes better predictions. Three letters of context is better than two. Four is better than three. A whole sentence is better than four. A whole paragraph, better still.
But there's a problem. With one letter of context, you need to store 26 × 26 = 676 probabilities. With two letters, 26 × 26 × 26 = 17,576. With three, 456,976. The table grows exponentially with every letter of context you add. By the time your context is a full sentence, the table would need more entries than there are atoms in the observable universe. You can't just count every possible sequence.
This is where the "large" in "large language model" comes in.
From counting to learning
Instead of storing an impossibly large table of counts, you train a neural network to approximate the table. The network takes in a sequence of text and outputs a probability distribution over what comes next — the same thing our letter-counting model did, but without needing to have seen that exact sequence before.
Remember gradient descent from the What is AI? article? Same deal here. The model has billions of internal dials (parameters). You show it a chunk of text with the last word hidden, ask it to predict the hidden word, compare its guess to the real answer, measure how wrong it was, and nudge all the dials slightly to make it less wrong next time. Do this trillions of times with trillions of words of text, and the model gets remarkably good at predicting what comes next.
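To make that concrete, here's a deliberately tiny version of the idea: not a real neural network, just a single grid of dials trained with gradient descent to predict the next letter from the previous one. (Same placeholder "books.txt" as before; a real LLM does this with billions of dials and whole documents of context, but the nudge-the-dials loop is the same shape.)

```python
import numpy as np

text = open("books.txt").read().lower()
letters = sorted(set(c for c in text if c.isalpha()))
idx = {c: i for i, c in enumerate(letters)}
pairs = [(idx[a], idx[b]) for a, b in zip(text, text[1:])
         if a in idx and b in idx]

V = len(letters)
W = np.zeros((V, V))   # the "dials": one weight per (current letter, next letter)
lr = 0.5

for step in range(50_000):
    a, b = pairs[np.random.randint(len(pairs))]    # one training example
    logits = W[a]                                   # scores for every possible next letter
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                            # the model's prediction
    # Measure how wrong it was, and nudge the dials toward the right answer.
    grad = probs.copy()
    grad[b] -= 1.0
    W[a] -= lr * grad

# After training, the row for "t" approximates the counted percentages.
p = np.exp(W[idx["t"]]); p /= p.sum()
for i in np.argsort(-p)[:5]:
    print(f"t -> {letters[i]}: {100 * p[i]:.0f}%")
```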
The key insight is that to get good at predicting the next word, the model has to develop something like understanding. To predict the word after "The capital of France is," it has to have encoded the fact that the capital of France is Paris somewhere in its billions of parameters. To predict the next word in a logic puzzle, it has to have encoded something like logical reasoning. The model isn't programmed with facts or rules. It develops them as a side effect of learning to predict text.
What the "transformer" does
The specific type of neural network that made all of this work is called a transformer, invented by researchers at Google in 2017. Its key innovation is an attention mechanism: a way for the model to look at all the words in its input and figure out which ones matter most for predicting the next word.
If the input is "The dog chased the cat up the tree, and then it got stuck," the model needs to figure out that "it" refers to "the cat," not "the dog" or "the tree." The attention mechanism lets the model weigh each earlier word's relevance to the current prediction. For the word after "it," the model pays a lot of attention to "cat" and very little to "chased."
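The arithmetic at the heart of attention is surprisingly small. Here's the core step in Python, stripped of everything a real transformer adds around it (learned query/key/value projections, multiple heads, many stacked layers):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each position blends information from
    every position, weighted by how relevant each one is to it."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # relevance of every word to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V                               # weighted mix of the values

# 5 "words", each represented as a vector of 8 numbers (random, just for shape)
x = np.random.randn(5, 8)
out = attention(x, x, x)   # self-attention: the words attend to each other
print(out.shape)           # (5, 8)
```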
Stack many layers of attention on top of each other and the model develops increasingly abstract representations. Early layers might notice letter patterns and word boundaries. Middle layers might pick up grammar and sentence structure. Deep layers might encode facts, reasoning patterns, and something that looks a lot like common sense.
Nobody programmed any of this. It emerged from the training process — from billions of rounds of "predict the next word, check how wrong you were, adjust the dials."
👁️ Interactive Demo: Attention
Click a word to see which other words the model pays attention to when predicting the next one.
Coming soon
The shape stays the same
Here's what I want you to take away: from our toy letter-counting model all the way up to GPT-4 and Claude, the fundamental shape of the thing is the same.
Text goes in. Probabilities come out. Pick the next word. Repeat.
That's it. The difference between our two-letter model and a frontier LLM like GPT-4 or Claude is scale and architecture — more context, more parameters, more training data, more layers, cleverer ways of combining the information. But the basic operation hasn't changed. It's next-token prediction all the way down.
This also tells you what an LLM is not:
- It's not a database. It doesn't look things up in a table. Knowledge is smeared across billions of parameters in ways that are hard to inspect or correct.
- It's not searching the internet (unless someone gives it a search tool — more on that in a moment).
- It's not thinking the way you do. It has no inner monologue, no goals, no experience. It produces text that looks like thinking because it was trained on text written by people who were thinking.
The best analogy I've found: an LLM is like a very well-read intern who has read everything but experienced nothing. They can synthesize, summarize, draft, and brainstorm at a high level. But they can also confidently make things up, because producing plausible-sounding text is literally what they were optimized to do.
From language model to agent
An LLM by itself can only produce text. You ask it a question, it generates an answer. That's useful, but limited. It can't check today's stock price, because its training data has a cutoff date. It can't send an email, because it has no hands. It can't run a calculation it isn't sure about, because it has no calculator.
But what if you gave it tools?
This is what an agent is: an LLM with the ability to take actions. You give it access to tools — a web browser, a code interpreter, a calculator, your email, your calendar, your company's database — and instead of just answering your question, it can go do things to find or create the answer.
The loop looks like this:
- Think: "The user wants to know last quarter's revenue by region. I should query the database."
- Act: Call the database tool with a SQL query.
- Observe: Look at the results that came back.
- Think: "Got the data. Now I should format it as a table and highlight the regions that grew."
- Respond: Present the formatted answer to the user.
The LLM is still just predicting text. But some of that predicted text happens to be tool calls, and the system around the LLM actually executes those calls and feeds the results back in. The LLM predicts "I should search for X," the system performs the search, pastes the results into the conversation, and the LLM continues from there.
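Here's a skeletal version of that loop in Python. The `llm` function, the reply format, and the calculator tool are all made up for illustration; it's the shape of the loop that matters, not any particular vendor's API.

```python
def run_agent(llm, tools, user_message):
    """Think -> act -> observe -> repeat, until the model answers directly."""
    conversation = [{"role": "user", "content": user_message}]
    while True:
        reply = llm(conversation)                  # the LLM only ever predicts text
        conversation.append(reply)
        if "tool_call" in reply:                   # ...but some of that text is a tool call
            name, args = reply["tool_call"]
            observation = tools[name](**args)      # the system around the LLM runs the tool
            conversation.append({"role": "tool", "content": observation})
        else:
            return reply["content"]                # plain text: that's the final answer

# A toy stand-in "LLM" that calls the calculator once, then answers.
def fake_llm(conversation):
    if not any(m["role"] == "tool" for m in conversation):
        return {"role": "assistant", "tool_call": ("calculator", {"expr": "19 * 23"})}
    return {"role": "assistant", "content": f"The answer is {conversation[-1]['content']}."}

tools = {"calculator": lambda expr: str(eval(expr))}
print(run_agent(fake_llm, tools, "What is 19 * 23?"))   # The answer is 437.
```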
🔄 Interactive Demo: The Agent Loop
Watch an LLM think, pick a tool, use it, observe the result, and respond. Text prediction + real-world actions.
Coming soon
This is why agents feel like such a leap. An LLM alone is a very smart person locked in a room with no phone, no computer, and no internet. An agent is that same person with a full office setup. The intelligence didn't change — the access did.
Where this leaves you
You now have a mental model for the entire stack:
- A computer is billions of tiny switches flipping on and off (What is a computer?)
- AI is the broad idea of putting intelligence outside a human brain
- Machine learning is the technique of letting data write the rules, using gradient descent to tune millions of dials
- An LLM is a machine learning model trained to predict the next word, at a scale where useful capabilities emerge
- An agent is an LLM with tools, able to take actions in the real world
Each layer builds on the one below it. No magic at any level — just the next abstraction up.
When you're evaluating AI for your business, this stack is your cheat sheet. Most of the value right now is in that top layer: agents that can do real work with real tools. That's where things get practical, and that's where the biggest opportunities are.
If you want help figuring out where agents could make the biggest difference in your company, let's talk.