🧩 What are Tokens?

Tokens are the “Lego bricks” of AI: the building blocks of data (not bits or bytes, which are units of raw data storage). Instead of reading full words, AI models break text into fragments.

  • Definition: Tokens are the fundamental units that AI models use to process data, whether that data is text, images, or audio.
  • Analogy: One blog compares tokenization to turning language into Lego blocks, where each block represents a small piece of a whole that can be snapped together to build meaning.
  • Tokenizer Variations: Different models use different “dictionaries.” A sentence might be 10 tokens for GPT-4 but 12 tokens for Llama 3 because their tokenizers are built differently (you can verify counts yourself with the sketch after this list).
  • Why fragments? It is the “Goldilocks” zone: more efficient than reading letter by letter, but more flexible than a rigid dictionary of whole words.
  • The Rule of Thumb: In English, 1,000 tokens ≈ 750 words.
  • Typo Handling: If you type “helo,” the OpenAI Tokenizer chops it into he + lo.
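
You can watch this fragmentation happen. Below is a minimal sketch, assuming OpenAI’s open-source tiktoken library is installed (pip install tiktoken); the exact fragments and counts will vary with the encoding you load.

```python
# A minimal sketch, assuming the `tiktoken` package is installed
# (pip install tiktoken). "cl100k_base" is the encoding used by
# GPT-4-era models; other models load other "dictionaries".
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokens are the Lego bricks of AI."
token_ids = enc.encode(text)

print(token_ids)                                # the integer token IDs
print(len(token_ids), "tokens")                 # the count you are billed for
print([enc.decode([t]) for t in token_ids])     # the fragment behind each ID
```

Swap in another model’s encoding and the same sentence will usually produce a different count, which is exactly the “dictionary” variation described above.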

⚙️ The Conversion Pipeline

How your text becomes “intelligence”:

  1. Tokenization: Text is split into fragments.
  2. Token IDs: Each fragment is assigned a unique integer from the model’s vocabulary (e.g., “Hello” = 15496 in GPT-2’s tokenizer).
  3. Embeddings: Each ID is swapped for a vector (a list of numbers). These vectors place words on a “map” where similar meanings sit physically close together.
    • Vector Math: This mathematical map allows the AI to understand relationships. For example: king − man + woman ≈ queen.
  4. Contextualization: This is where the Transformer takes over (steps 1–3 are sketched in code after this list).
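
To make steps 1–3 concrete, here is a toy sketch. The mini-vocabulary and the 4-dimensional vectors are invented for illustration; real models learn embeddings with hundreds or thousands of dimensions.

```python
# A toy sketch of steps 1-3, assuming numpy. Vocabulary IDs and vector
# values are hypothetical placeholders, not real model weights.
import numpy as np

# Steps 1-2: fragments mapped to unique integer IDs.
vocab = {"king": 0, "man": 1, "woman": 2, "queen": 3}

# Step 3: an embedding table, one row (vector) per token ID.
embeddings = np.array([
    [0.9, 0.8, 0.1, 0.7],  # king
    [0.9, 0.1, 0.1, 0.2],  # man
    [0.1, 0.1, 0.9, 0.2],  # woman
    [0.1, 0.8, 0.9, 0.7],  # queen
])

# Vector math on the "map": king - man + woman should land near queen.
result = (embeddings[vocab["king"]]
          - embeddings[vocab["man"]]
          + embeddings[vocab["woman"]])

def cosine(a, b):
    """Similarity of two vectors: 1.0 means pointing the same way."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for word, idx in vocab.items():
    print(word, round(cosine(result, embeddings[idx]), 3))
# "queen" scores highest: the map encodes the relationship.
```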

⚡ The Power of the Transformer (The “Context Engine”)

While embeddings give a word a “general” meaning, the Transformer Architecture uses Self-Attention to give it a “specific” meaning based on the sentence.

  • Self-Attention: Allows the AI to “look” at every other token in the sentence simultaneously.
  • Positional Encoding: Tells the AI the order of the tokens (so it knows the difference between “Dog bites man” and “Man bites dog”).
  • Contextual Embeddings: Changes the mathematical value of a token based on its neighbors (e.g., “bank” near “river” vs. “bank” near “money”).
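
Self-Attention, the first row above, is the piece that lets every token “look” at every other token at once. Below is a minimal sketch of scaled dot-product attention; the random projection matrices are stand-ins for the weights a real Transformer would learn during training.

```python
# A minimal sketch of scaled dot-product self-attention, assuming numpy.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 3, 4                     # 3 tokens, 4-dim embeddings
x = rng.normal(size=(seq_len, d_model))     # one embedding per token

# Learned query/key/value projections (random stand-ins here).
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Every token scores every other token simultaneously...
scores = Q @ K.T / np.sqrt(d_model)

# ...softmax turns scores into attention weights that sum to 1 per row...
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# ...and each token's contextual embedding is a weighted mix of all tokens.
contextual = weights @ V
print(weights.round(2))   # row i: how much token i "looks at" each token
```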

The Context Window

Think of this as the AI’s “Short-Term Memory.” It is the maximum number of tokens the model can “keep in mind” at one time. Once a conversation exceeds this limit, the AI starts “forgetting” the earliest parts of the chat.
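Here is a sketch of how an application might enforce that “short-term memory,” again assuming tiktoken for counting. The MAX_TOKENS value and the plain-string message format are illustrative, not a real API contract.

```python
# A sketch of trimming a chat to fit the context window. MAX_TOKENS
# and the message format are hypothetical, for demonstration only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
MAX_TOKENS = 50   # a tiny window, for demonstration only

messages = [
    "Hi, my name is Ada and I love puzzles.",
    "Explain tokens to me like I am five.",
    "Now explain embeddings in a bit more depth, please.",
]

def trim_to_window(messages, max_tokens):
    """Drop the oldest messages until the total token count fits."""
    kept = list(messages)
    while kept and sum(len(enc.encode(m)) for m in kept) > max_tokens:
        kept.pop(0)   # the earliest turn is "forgotten" first
    return kept

print(trim_to_window(messages, MAX_TOKENS))
```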


🚀 The New Era: Reasoning Tokens

In standard models (like GPT-4o), the AI predicts the next token instantly. In “Reasoning” models (like OpenAI o1 or DeepSeek-R1), the process includes Hidden Reasoning Tokens.

  • Internal Monologue: The AI generates a “Chain of Thought” (CoT) in its “head” before giving you the final answer.
  • Problem Solving: These tokens are used to double-check facts, correct logic, and try different strategies.
  • Cost Factor: You are billed for these internal tokens even though you do not always see them in the final output (see the cost sketch after this list).
  • The Benefit: These tokens allow the AI to solve complex PhD-level science and coding problems at which standard transformers would fail.
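
The cost factor is easy to underestimate, so here is a hedged back-of-the-envelope sketch. The price and token counts are hypothetical placeholders, not real rates.

```python
# A back-of-the-envelope sketch. The price and token counts below are
# hypothetical placeholders, not real OpenAI rates; the point is that
# hidden reasoning tokens are billed like any other output tokens.
PRICE_PER_1K_OUTPUT = 0.06        # hypothetical dollars per 1K output tokens

visible_answer_tokens = 200       # the answer you actually read
hidden_reasoning_tokens = 1_800   # the internal "chain of thought"

billed = visible_answer_tokens + hidden_reasoning_tokens
cost = billed / 1000 * PRICE_PER_1K_OUTPUT

print(f"Billed for {billed} output tokens -> ${cost:.2f}")
# Here, 90% of the bill pays for tokens that never appeared on screen.
```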

The "Typo Tax"

When you misspell a word, the AI uses more tokens to “build” that word from fragments. This uses up more of the Context Window and costs more if you are using a paid API.
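
You can measure the tax directly. Below is a sketch assuming tiktoken; the exact splits depend on the encoding, so run it to see how your tokenizer of interest handles misspellings.

```python
# A sketch of the "typo tax", assuming tiktoken. Misspellings usually
# fragment into more tokens than the correctly spelled word.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["hello", "helo", "definitely", "definately"]:
    ids = enc.encode(word)
    print(f"{word!r}: {len(ids)} token(s) -> {[enc.decode([i]) for i in ids]}")
```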


🔗 Deep Dive Resources