HELGE SVERRE  |  All-stack Developer
Bergen, Norway
est. 2012  |  300+ repos  |  4000+ contributions
How LLMs Actually Work
What large language models are, how they generate text, and why they behave the way they do.

What is a language model?

A language model is a program that predicts what text comes next. Given the words "The cat sat on the," it assigns probabilities to every possible next word -- "mat" might get 12%, "floor" might get 8%, "roof" might get 3%, and so on.

A large language model (LLM) is a language model with billions of internal parameters, trained on massive amounts of text. The "large" refers to the size of the model, not the size of its output.

At its core, an LLM does one thing: given a sequence of text, predict the most likely continuation. Everything else -- conversations, code generation, summarization, translation -- emerges from this single capability applied at scale.
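
That single capability can be sketched in a few lines. The probability table below is invented for illustration -- a real model assigns probabilities across a vocabulary of tens of thousands of tokens:

```python
import random

# Hypothetical next-token probabilities for "The cat sat on the"
# (made-up numbers, not from any real model).
next_token_probs = {
    "mat": 0.12,
    "floor": 0.08,
    "roof": 0.03,
    "couch": 0.02,
}

def most_likely(probs):
    """Greedy decoding: always pick the highest-probability token."""
    return max(probs, key=probs.get)

def sample(probs):
    """Sampling: pick a token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return random.choices(tokens, weights=weights, k=1)[0]

print(most_likely(next_token_probs))  # mat
```

Everything a chat assistant does is this step, repeated: predict one token, append it, predict the next.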

Tokens

LLMs don't read words or characters. They work with tokens, which are chunks of text that fall somewhere between characters and words. A common word like "hello" might be one token. A less common word like "tokenization" might be split into "token" + "ization." A rare or technical word might be split into even smaller pieces.

  • Simple sentence -- 6 tokens: The cat sat on the mat
  • Subword splitting -- 6 tokens: unbelievably impressive tokenization
  • Code -- 6 tokens: console.log('hello')
  • Numbers & symbols -- 9 tokens: The price is $42.99 per unit

Tokenization is the process of converting raw text into these chunks. Each model has its own tokenizer -- a fixed vocabulary of tokens it was trained with. GPT-4 has roughly 100,000 tokens in its vocabulary. The tokenizer splits any input text into a sequence of tokens from this vocabulary, and those token sequences are what the model actually operates on.

This is why models sometimes struggle with character-level tasks like counting the letters in a word or reversing a string. They don't see individual characters -- they see tokens.

A rough rule of thumb: one token is approximately 3/4 of a word in English. So 1,000 tokens is roughly 750 words.
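
That rule of thumb is easy to turn into a back-of-the-envelope estimator. This is English-only and only approximate -- real tokenizers vary by model:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate using the ~3/4-word-per-token rule of thumb.
    Real tokenizers differ per model; use this only for ballpark sizing."""
    words = len(text.split())
    return round(words / 0.75)

# 750 words of English should land near 1,000 tokens.
print(estimate_tokens(" ".join(["word"] * 750)))  # 1000
```

A real tokenizer would report 6 tokens for "The cat sat on the mat"; the estimator says 8. Good enough for budgeting, not for billing.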

Training

Training an LLM happens in stages.

Pre-training is the expensive part. The model is shown enormous amounts of text -- books, websites, code, articles, forum posts -- and learns to predict the next token. It does this trillions of times, gradually adjusting its billions of internal parameters to get better at prediction. This process takes weeks or months on thousands of specialized chips (GPUs or TPUs) and costs tens to hundreds of millions of dollars.

What the model learns during pre-training is not a database of facts. It learns patterns: grammar, reasoning structures, factual associations, coding conventions, rhetorical styles, and much more. These patterns are encoded as numerical weights distributed across the model's parameters. No single parameter stores a fact. Knowledge is spread across the network in ways that are not directly interpretable.

The training data typically includes a significant portion of the publicly available internet, books, academic papers, and code repositories. The exact composition varies by model and is usually not fully disclosed.

Context window

When you interact with an LLM, every piece of text in the conversation -- your messages, the model's replies, any system instructions -- gets converted into tokens and fed to the model as one long sequence. The context window is the maximum number of tokens the model can process at once.

Early models had context windows of 2,048 or 4,096 tokens. Modern models range from 128,000 to over 1,000,000 tokens.

The context window is the model's entire working memory for a conversation. It has no memory of previous conversations. If something isn't in the current context window, the model doesn't know about it. When a conversation exceeds the context window, the oldest tokens get dropped or the conversation has to be summarized.

This is why long conversations can feel like the model "forgot" something you said earlier. It literally did -- that text is no longer in the window.
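
A minimal sketch of how an application might keep a conversation inside the window, using a stand-in token counter (real applications use the model's actual tokenizer, and often summarize old messages instead of dropping them):

```python
def fit_to_window(messages, max_tokens, count_tokens):
    """Drop the oldest messages until the conversation fits the window.
    `count_tokens` is a stand-in for a real tokenizer's counting function."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # the oldest message falls out of "memory"
    return kept

# Toy counter: one token per word (real tokenizers differ).
count = lambda msg: len(msg.split())
history = ["my name is Ada", "what is an LLM?", "an LLM predicts tokens"]
print(fit_to_window(history, max_tokens=9, count_tokens=count))
```

In this toy run the first message is dropped -- which is exactly why the model no longer "knows" your name late in a long conversation.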

Temperature

When the model predicts the next token, it produces a probability distribution over its entire vocabulary. Temperature is a setting that controls how this distribution gets used.

  • Temperature 0 (or very low): The model almost always picks the highest-probability token. Output is close to deterministic -- the same prompt gives nearly the same answer every time. Good for factual tasks where you want consistent answers.
  • Temperature 1: The model samples from the distribution as-is. Output is varied and sometimes surprising.
  • High temperature (above 1): The distribution gets flattened, giving lower-probability tokens a better chance. Output becomes more random and creative, but also more likely to be incoherent.

Think of it like a dial between "always pick the safe answer" and "take more risks." Most applications use a temperature between 0 and 1.
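
Under the hood, temperature divides the model's raw scores (logits) before they are turned into probabilities with a softmax. A small sketch with invented scores for three candidate tokens:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities. Dividing by a small
    temperature sharpens the distribution; a large one flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # invented scores, not from a real model
cold = softmax_with_temperature(logits, 0.2)  # top token dominates
hot = softmax_with_temperature(logits, 2.0)   # much more even spread
```

At temperature 0.2 the top token gets nearly all the probability mass; at 2.0 the alternatives have a real chance of being sampled.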

Prompts and system prompts

A prompt is the text you send to the model. It is the entirety of what the model has to work with. The quality and specificity of your prompt directly affect the quality of the output.

A system prompt is a special message placed at the beginning of the context window, before the user's message. It sets the model's behavior, tone, and constraints. For example, a system prompt might say "You are a helpful customer support agent. Only answer questions about our product. Be concise."

The model doesn't have a personality or preferences of its own. The system prompt is how product builders steer the model to behave in a specific way. Without a system prompt, the model defaults to whatever general patterns it learned during training.

System prompts are not magical -- they are just text that the model processes along with everything else. They work because the model was trained (and fine-tuned) to follow instructions.
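
Most chat APIs express this as a list of role-tagged messages, with the system prompt placed first. The exact field names vary by provider; the shape below follows the common role/content convention:

```python
def build_messages(system_prompt, user_message):
    """Assemble a chat request in the role-tagged shape most chat APIs use.
    The system message goes first so it sits at the start of the context."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]

messages = build_messages(
    "You are a helpful customer support agent. Be concise.",
    "How do I reset my password?",
)
```

To the model this is all just tokens in one sequence -- the "system" role matters only because the model was fine-tuned to treat it as authoritative.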

Why LLMs hallucinate

Hallucination is when a model generates text that sounds plausible but is factually wrong. This is not a bug that will be fixed -- it is a natural consequence of how these models work.

An LLM does not look up facts. It predicts text that is statistically likely to follow from the input. If you ask "Who wrote the novel Blueberry Fields?" and no such novel exists, the model will still produce a confident-sounding answer, because it has learned the pattern of answering questions with author names.

The model has no mechanism for distinguishing between "I know this" and "this sounds right." It generates plausible continuations of text. Sometimes plausible and correct overlap. Sometimes they don't.

This is why LLMs should not be trusted as authoritative sources without verification, especially for specific facts, dates, URLs, citations, or legal/medical claims.

Why they feel human

LLMs produce text that reads as natural, conversational, and sometimes emotionally resonant. This happens because they were trained on human-written text -- billions of examples of how people write, argue, explain, joke, and express themselves.

The model learned to mirror human patterns of communication. When it says "I think" or "I'm not sure," it is reproducing a linguistic pattern, not reporting an internal state. When it writes a compelling argument, it is generating text that matches the statistical patterns of compelling arguments it saw during training.

This can be useful -- the output is readable and natural. But it can also be misleading, because human readers instinctively attribute understanding, intention, and awareness to text that reads as if it was written by a thinking person.

Why they are not sentient

LLMs have no subjective experience, no awareness, no desires, and no continuity between conversations.

Each time you start a new conversation, the model has no memory of previous interactions. There is no persistent self. The "personality" you experience is reconstructed from scratch each time, based on the system prompt and conversation history in the context window.

The model does not "want" anything. It does not "try" to help you. It executes a mathematical function that maps input tokens to output token probabilities. The appearance of intention is a product of training on human text that expresses intention.

This is a meaningful distinction. A thermostat "wants" the room to be 72 degrees in the same way an LLM "wants" to help you -- which is to say, not at all. Both are systems that produce outputs based on inputs, with no inner experience of the process.

Fine-tuning and RLHF

The pre-trained model is a powerful text predictor, but it is not a good assistant out of the box. It might complete your question with another question, or continue your text in an unexpected direction, because it was trained to predict text, not to be helpful.

Fine-tuning is the process of further training the model on a smaller, curated dataset to adjust its behavior. For example, training it on thousands of examples of helpful question-answer pairs teaches it to respond in a conversational format.

RLHF (Reinforcement Learning from Human Feedback) is a specific fine-tuning technique. Human raters compare pairs of model outputs and indicate which one is better. The model is then trained to produce outputs more like the preferred ones. This is how models learn to be helpful, refuse harmful requests, follow instructions, and maintain a consistent tone.

RLHF is what turns a raw text predictor into something that feels like an assistant. It is also what introduces the model's safety behaviors -- its tendency to decline certain requests or add caveats.

Inference

Inference is what happens when you send a prompt and get a response. It is the process of running input through the trained model to generate output.

The model processes all input tokens in parallel to build an internal representation of the context. Then it generates output tokens one at a time, left to right. Each new token is predicted based on the entire input plus all previously generated output tokens. This is why responses appear to stream in word by word -- the model is literally producing them one token at a time.
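
The generation loop can be sketched like this, with a toy stand-in for the model's forward pass:

```python
def generate(predict_next, prompt_tokens, max_new_tokens, stop_token=None):
    """Autoregressive decoding sketch: each new token is predicted from
    the full sequence so far, then appended. `predict_next` stands in
    for a real model's forward pass."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = predict_next(tokens)
        if nxt == stop_token:
            break
        tokens.append(nxt)
    return tokens

# Toy "model" that replays a canned continuation, ignoring its input.
canned = iter(["sat", "on", "the", "mat", "<end>"])
out = generate(lambda toks: next(canned), ["The", "cat"], 10, stop_token="<end>")
print(out)  # ['The', 'cat', 'sat', 'on', 'the', 'mat']
```

Each loop iteration is one step of the streaming output you see in a chat interface.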

Inference is computationally expensive but far cheaper than training. Training might cost $100 million and take months. A single inference call typically costs a fraction of a cent to a few cents and takes seconds.

The cost of inference scales with two factors: the number of input tokens (the length of the prompt and context) and the number of output tokens (the length of the response). This is why API pricing is measured per token.
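
A back-of-the-envelope cost calculation under that pricing model (the per-million-token rates here are made up for illustration; every provider sets its own):

```python
def inference_cost(input_tokens, output_tokens,
                   price_in_per_1m, price_out_per_1m):
    """Typical API pricing: separate per-million-token rates for
    input and output tokens."""
    return ((input_tokens / 1_000_000) * price_in_per_1m
            + (output_tokens / 1_000_000) * price_out_per_1m)

# A 2,000-token prompt and a 500-token reply at hypothetical
# $3 / $15 per million input / output tokens:
cost = inference_cost(2_000, 500, price_in_per_1m=3.0, price_out_per_1m=15.0)
print(f"${cost:.4f}")  # $0.0135
```

Note that output tokens usually cost several times more than input tokens, which is why verbose responses dominate the bill.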

Embeddings

An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. The model converts text into these numerical representations as part of its internal processing, and this capability can be extracted and used directly.

Two pieces of text with similar meaning will have similar embeddings -- their vectors will point in roughly the same direction. "How do I reset my password?" and "I forgot my login credentials" will have embeddings that are close together, even though they share almost no words.

This makes embeddings useful for similarity search. You can convert a library of documents into embeddings, then convert a user's question into an embedding, and find the documents whose embeddings are closest to the question's embedding. This is the foundation of retrieval-augmented generation (RAG), where relevant documents are fetched and inserted into the context window to give the model specific knowledge.

Embeddings are typically generated by specialized models or by a specific component of a larger model. They are much cheaper to compute than full text generation.
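
Similarity search over embeddings boils down to comparing vector directions, usually with cosine similarity. A sketch with tiny made-up vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """How closely two vectors point in the same direction (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec, doc_vecs):
    """Index of the document embedding closest to the query embedding."""
    return max(range(len(doc_vecs)),
               key=lambda i: cosine_similarity(query_vec, doc_vecs[i]))

# Invented 3-dimensional embeddings for two documents.
docs = [[1.0, 0.1, 0.0],   # "reset your password"
        [0.0, 1.0, 0.2]]   # "shipping times"
query = [0.9, 0.2, 0.0]    # "I forgot my login credentials"
print(nearest(query, docs))  # 0
```

In a RAG pipeline, this nearest-neighbor step is what decides which documents get pasted into the context window.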

The difference between a model and a product

A model is the trained neural network -- the mathematical function that takes tokens in and produces token probabilities out. Examples: GPT-4, Claude, Llama 3, Gemini.

A product is the application built around a model. It includes the user interface, the system prompt, safety filters, conversation management, tool integrations, file handling, memory features, and everything else that turns a raw model into something useful. Examples: ChatGPT (built on GPT-4), Claude.ai (built on Claude), Claude Code (built on Claude).

The same model can power very different products. GPT-4 powers ChatGPT (a chat interface), GitHub Copilot (a code assistant), and thousands of custom applications via API. Claude powers Claude.ai (a web chat), Claude Code (a terminal-based coding tool), and countless applications built by other companies.

When people say "ChatGPT hallucinated" or "Claude is good at coding," they are often conflating the model and the product. The model provides the core capability. The product shapes how that capability is presented, constrained, and augmented. A model might be capable of something that the product intentionally restricts, or a product might compensate for model weaknesses through clever prompting, tool use, or retrieval.

Understanding this distinction matters because it changes what questions you ask. "Why did ChatGPT give a wrong answer?" might be a model problem (hallucination) or a product problem (bad system prompt, missing context, outdated information).



