Lesson 4

How Large Language Models Actually Work

AI-generated

The Core Mechanism: Next Token Prediction

Large language models work by predicting the next token in a sequence, where a token is a word or a piece of a word. Given "The cat sat on the", the model assigns a probability to every token in its vocabulary, and "mat" comes out as one of the most likely continuations. That is the entire pre-training objective: predict the next token, repeated across trillions of tokens of text.
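To make the prediction step concrete, here is a minimal sketch. The four candidate tokens and their scores (logits) are invented for illustration; a real model scores every token in a vocabulary of tens of thousands of entries. The sketch turns the scores into probabilities with a softmax, picks the most likely token, and computes the loss the model would receive if the text really continued with "mat".

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for the token after "The cat sat on the".
candidates = ["mat", "floor", "roof", "moon"]
logits = [4.0, 2.5, 1.0, -1.0]

probs = softmax(logits)
prediction = candidates[probs.index(max(probs))]

# Training loss if the text actually continued with "mat":
# the negative log of the probability assigned to the correct token.
loss = -math.log(probs[candidates.index("mat")])

print(prediction, round(probs[0], 2), round(loss, 2))
```

With these made-up scores, "mat" receives a probability of roughly 0.78. The loss shrinks toward zero as the probability of the correct token approaches 1, which is what training pushes toward.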

What makes this remarkable is that, to predict what comes next well, the model has to pick up grammar, facts, reasoning, coding patterns, mathematical relationships, and even social dynamics. All of this emerges from the training process without being explicitly programmed.

Training: The Three Phases

Phase 1 - Pre-training: The model reads trillions of words from the internet, books, code, and other sources. It adjusts its billions of parameters to get better at predicting the next token. This phase costs millions of dollars in compute and takes weeks on thousands of GPUs. The result is a base model that can complete text but is not yet a useful assistant.
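The idea of "adjusting parameters to get better at predicting the next token" can be sketched at toy scale. The example below is an assumption-heavy simplification: it trains a bigram model (one row of scores per previous token) on a three-sentence corpus with plain gradient descent. Real pre-training uses a neural network with billions of parameters, but the loop has the same shape: predict, measure the error, nudge the parameters. The corpus, learning rate, and function names are all made up for illustration.

```python
import math

corpus = ("the cat sat on the mat . the cat sat on the sofa . "
          "the cat sat on the mat .").split()
vocab = sorted(set(corpus))
index = {tok: i for i, tok in enumerate(vocab)}
# Training examples: (previous token, next token) pairs.
pairs = [(index[a], index[b]) for a, b in zip(corpus, corpus[1:])]

# Parameters: one row of logits per context token, all starting at zero.
weights = [[0.0] * len(vocab) for _ in vocab]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def average_loss():
    """Mean negative log-probability of the correct next token."""
    return sum(-math.log(softmax(weights[c])[t]) for c, t in pairs) / len(pairs)

loss_before = average_loss()
learning_rate = 0.1
for _ in range(200):
    for context, target in pairs:
        probs = softmax(weights[context])
        for j in range(len(vocab)):
            # Gradient of the loss w.r.t. logit j is (prob_j - 1 if j is the target).
            grad = probs[j] - (1.0 if j == target else 0.0)
            weights[context][j] -= learning_rate * grad
loss_after = average_loss()

def predict_next(token):
    """Return the highest-scoring next token after training."""
    row = weights[index[token]]
    return vocab[row.index(max(row))]

print(loss_before > loss_after, predict_next("cat"), predict_next("on"))
```

After training, the loss is lower than at the start and the model has learned that "sat" follows "cat" and "the" follows "on" in this corpus. Nothing told it those rules; they came from repeatedly reducing prediction error.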

Phase 2 - Fine-tuning: The base model is trained on curated examples of helpful conversations. Human trainers write both the prompts and ideal responses. This teaches the model to behave like an assistant rather than a text completion engine.

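Phase 3 - Alignment: Techniques like RLHF (Reinforcement Learning from Human Feedback) and Constitutional AI refine the model's behavior. In RLHF, human raters compare pairs of responses and indicate which is better. Constitutional AI instead has a model critique and compare responses against a written set of principles, which reduces the amount of human labeling needed. In both cases, the feedback is used to steer the model toward being helpful, honest, and safe.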
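One common way to turn "response A is better than response B" into a training signal is to train a separate reward model with a pairwise (Bradley-Terry style) loss. The lesson does not say which method any particular system uses, so treat the sketch below as one plausible setup. The function name and the example scores are hypothetical.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Small when the response the rater preferred scores higher,
    large when the reward model ranks the pair the wrong way round.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical reward-model scores for two responses to the same prompt.
agrees_with_rater = preference_loss(reward_chosen=2.0, reward_rejected=0.5)
disagrees_with_rater = preference_loss(reward_chosen=0.5, reward_rejected=2.0)

print(agrees_with_rater < disagrees_with_rater)
```

Minimizing this loss over many rated pairs teaches the reward model to score preferred responses higher. That score can then guide further training of the language model.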

What the Model "Knows"

An LLM does not store facts in a database. Its knowledge is encoded implicitly in the billions of numerical parameters (weights) that were adjusted during training. When you ask "What is the capital of France?" the model does not look up the answer - it generates "Paris" because that token is statistically likely given the pattern of tokens in your question and the patterns it learned during training.

This has important implications:

On its own, the model cannot reliably cite its sources, because its weights hold no record of which documents a given piece of knowledge came from.

It can confidently state incorrect information (hallucinate) because it is generating statistically likely text, not looking up verified facts.

Its knowledge has a cutoff date - it only knows what was in its training data.
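The hallucination point follows directly from how text is generated: the model samples from a probability distribution, and a wrong token with nonzero probability will sometimes be picked. The sketch below uses an invented distribution for the token after "The capital of France is". The numbers, the function name, and the `temperature` parameter as written here are illustrative assumptions, not any specific model's API.

```python
import math
import random

def sample_next_token(probabilities, temperature=1.0, rng=random):
    """Sample one token after rescaling the distribution by temperature."""
    tokens = list(probabilities)
    scaled = [math.log(probabilities[t]) / temperature for t in tokens]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical distribution for the token after "The capital of France is".
next_token_probs = {" Paris": 0.90, " Lyon": 0.07, " Marseille": 0.03}

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_next_token(next_token_probs, rng=rng) for _ in range(1000)]
wrong = sum(token != " Paris" for token in samples)

print(f"{wrong} of {len(samples)} samples were not ' Paris'")
```

In this toy setup roughly one sample in ten is a wrong city, and nothing in the output marks it as less reliable than a correct answer. Lowering the temperature concentrates the samples on the most likely token, which reduces this kind of error but does not help when the most likely token is itself wrong.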

Why Scale Matters

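Research has shown that larger models trained on more data tend to develop qualitatively new capabilities. As a rough illustration, a model with around 1 billion parameters typically struggles to write good code, models in the 100-billion range handle it far better, and the largest models can work through multi-step problems. These thresholds are not fixed: the quality of the training data and the training method matter alongside parameter count.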

These sudden capability jumps at certain scales are called emergent capabilities, and they are one of the most fascinating (and debated) findings in AI research.

Common misconception: that LLMs "understand" text the way humans do. What they perform is extraordinarily sophisticated pattern matching. Whether this constitutes understanding is a philosophical debate, but for practical purposes, treat them as powerful but potentially unreliable text generation engines.

tokenization-example.py

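# Simple illustration of how tokenization works.
# "Artificial intelligence" might be split into subword tokens like these:
tokens = ["Art", "ificial", " intelligence"]
# Each token maps to an integer ID that the model processes.
# These IDs are illustrative; real values depend on the tokenizer's vocabulary.
token_ids = [9683, 14803, 11478]

# Show which ID stands for which piece of text.
for token, token_id in zip(tokens, token_ids):
    print(f"{token!r} -> {token_id}")

# The model assigns a probability to every possible next token.
# Given the context "The capital of France is",
# "Paris" would receive the highest probability.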
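The split shown above can be reproduced with a toy tokenizer. This is a deliberately simplified sketch: it uses a tiny hand-written vocabulary and greedy longest-match lookup, whereas production tokenizers typically learn their vocabulary from data with methods such as byte-pair encoding. The `VOCAB` table, its IDs, and the `tokenize` function are hypothetical.

```python
# Hypothetical vocabulary; real tokenizers learn tens of thousands of entries.
VOCAB = {"Art": 0, "ificial": 1, " intelligence": 2, " ": 3, "i": 4, "n": 5, "t": 6}

def tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    position = 0
    while position < len(text):
        # Try the longest possible piece first, then shorter ones.
        for end in range(len(text), position, -1):
            piece = text[position:end]
            if piece in vocab:
                tokens.append(piece)
                position = end
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {position}")
    return tokens

tokens = tokenize("Artificial intelligence", VOCAB)
token_ids = [VOCAB[t] for t in tokens]

print(tokens, token_ids)
```

Note that the leading space is part of the token " intelligence". Many real tokenizers attach whitespace to the following word in a similar way, which is why the same word can map to different tokens at the start of a sentence and in the middle of one.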

Sources & Further Reading

Andrej Karpathy: "Intro to Large Language Models" (1hr video) - https://www.youtube.com/watch?v=zjkBMFhNj_g

3Blue1Brown: "How LLMs Work" - https://www.youtube.com/watch?v=wjZofJX0v4M

Stephen Wolfram: "What Is ChatGPT Doing and Why Does It Work?" - https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/

Anthropic: Claude model card - https://docs.anthropic.com/en/docs/about-claude/models