Decoding AI Jargons with Chai

Understanding AI Transformers in a Simple Way

Updated
12 min read

Transformers:

In general, to transform means changing something from one form to another. In artificial intelligence, Transformers are models that transform raw input data such as text, audio, or images into meaningful representations that machines can understand and use. Transformers became popular with the rapid growth of modern AI.

A Transformer is a neural network architecture that understands sequential data like text or audio by learning how each part of the input relates to every other part through a mechanism called self-attention.

Some common applications of Transformers include machine translation (Language Translation), text generation, summarization, and speech processing.

Most modern AI systems such as ChatGPT, Gemini, Claude, Cursor, GitHub Copilot, and Meta AI are built on top of the Transformer architecture.

The Transformer architecture used in today’s AI era was originally introduced in the research paper “Attention Is All You Need”, published in 2017 by a team of researchers, most of them at Google.

The following figure illustrates how Transformers have evolved over time by stacking additional layers and enhancements on top of the original architecture.

Understanding Transformers with a real-world example:

Imagine you attend a meeting and later want to write clear notes.
You cannot write good notes one sentence at a time without thinking. As you write each line, you recall everything that was discussed in the meeting: who said what, which topic was under discussion, and how the points were connected. For example, when you write “The deadline was moved”, you already know which project, why, and who decided it, because you consider the full meeting context. In the same way, you write every line using the entire meeting as context.

How this maps to a Transformer

  • Meeting → Input sequence

  • Each note line → Token representation

  • Remembering the whole discussion → Self-Attention

  • Focusing on relevant parts → Attention weights

  • Clear, contextual notes → Output embeddings / predictions

Why Transformers?
Before Transformers, most language models (such as RNNs and LSTMs) processed text one word at a time, which made them slow to train and prone to forgetting important information in long sentences. This made it hard for models to capture long-range relationships between words and difficult to train them efficiently.

Transformers solved this by introducing self-attention, allowing the model to look at all words in a sentence at once and understand how they relate to each other. This made language understanding faster, more accurate, and scalable, which is why Transformers became the foundation of modern AI models.

Transformer Architecture:

Left side is the encoder

Right side is the decoder

To understand the Transformer architecture, we need to know the following:

  1. Tokenization

  2. Vector Embedding

  3. Positional Encoding

  4. Self Attention

  5. Multi-head Attention

  6. Feed forward neural network

  7. Linear

  8. Soft Max

  9. Semantic meaning

  10. Add & Norm

  11. Encoder

  12. Decoder


  1. Tokenization:

    Tokenization is the process of breaking input text into smaller units called tokens and converting those tokens into numerical identifiers that a model can process.

    A token can be a word, a subword, a character, or even part of a word, depending on the tokenization strategy used by the model. Different models use different tokenizers.

    Let's consider the sentence “You are so lucky to read this.” It is split into the tokens “You”, “are”, “so”, “lucky”, “to”, “read”, “this”, and “.”, and each token is then converted into a number.

    Here is an example:
    The sentence is split into tokens. Each token is shown in a different color to mark where the splits occur, and each token is then mapped to a unique number called a token ID.
    Internally, those numbers are stored in binary (1s and 0s).

  • If you want to experiment and see how tokens are generated for individual words, try the Tiktokenizer tool. You can also select different models in the top-right corner; each model may generate a different token count for the same text.
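To make the text-to-ID step concrete, here is a minimal sketch of a toy word-level tokenizer. The vocabulary, the IDs, and the function names `tokenize` and `encode` are made up for illustration; real models use learned subword tokenizers with far larger vocabularies.

```python
import re

# Hypothetical toy vocabulary: these IDs are invented for this example.
TOY_VOCAB = {"<unk>": 0, "you": 1, "are": 2, "so": 3, "lucky": 4,
             "to": 5, "read": 6, "this": 7, ".": 8}

def tokenize(text):
    """Split text into lowercase words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def encode(text, vocab=TOY_VOCAB):
    """Map each token to its ID, using <unk> for tokens not in the vocabulary."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokenize(text)]

tokens = tokenize("You are so lucky to read this.")
ids = encode("You are so lucky to read this.")
print(tokens)  # ['you', 'are', 'so', 'lucky', 'to', 'read', 'this', '.']
print(ids)     # [1, 2, 3, 4, 5, 6, 7, 8]
```

A word that is missing from the toy vocabulary, such as "love", falls back to the `<unk>` ID 0. Subword tokenizers avoid most of these cases by splitting rare words into smaller known pieces.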
  2. Vector Embeddings:

    Vector embeddings give meaning to tokens by converting their token IDs into numerical vectors. These vectors place tokens in a high-dimensional space, where the distance between two vectors reflects how closely their meanings are related.

    • Each token ID is converted into a vector (list of numbers)

    • The vector represents the meaning of the token

    • Similar words have similar vectors

    • Example: "lucky" → [23767] → [0.42, -0.18, 0.77, ...]
      
    • Closer vectors → more related meanings

    • Farther vectors → less related meanings

    • Examples:

    • king ↔ queen → close

    • dog ↔ puppy → close

    • dog ↔ car → far

    • Click here to check the Visual Vector Embeddings
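The idea of "close" and "far" vectors can be sketched with cosine similarity, one common way to compare embeddings. The 3-dimensional vectors below are hand-picked assumptions, not values from any real model, which would typically use hundreds or thousands of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration.
TOY_EMBEDDINGS = {
    "dog":   [0.90, 0.80, 0.10],
    "puppy": [0.85, 0.90, 0.15],
    "car":   [0.10, 0.20, 0.95],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_dog_puppy = cosine_similarity(TOY_EMBEDDINGS["dog"], TOY_EMBEDDINGS["puppy"])
sim_dog_car = cosine_similarity(TOY_EMBEDDINGS["dog"], TOY_EMBEDDINGS["car"])

print(f"dog vs puppy: {sim_dog_puppy:.2f}")  # high: related meanings
print(f"dog vs car:   {sim_dog_car:.2f}")    # low: unrelated meanings
```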

  3. Positional Encoding:

    Positional encoding helps the Transformer understand meaning by telling it the order of words in a sentence. A Transformer sees all words at once, so without positional encoding it would know which words exist, but not the order in which they appear.

    Positional encoding adds position information to each token so that the sentence meaning remains accurate during further processing. Each token receives a positional vector that tells the model where the word appears in the sentence.

    Example 1:

    1. She helped her friend move

    2. Her friend helped move her

      Without positions → same tokens

      With positions → very different meaning

      Same words, different order, different meaning.
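One way to build the positional vector is the sinusoidal scheme from the original “Attention Is All You Need” paper; many newer models use learned or other position schemes instead. The sketch below assumes that sinusoidal formula, and the 4-dimensional embedding is a made-up example.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal position vector for one position (d_model assumed even)."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))  # even index
        encoding.append(math.cos(angle))  # odd index
    return encoding

def add_position(embedding, position):
    """Add the position vector to a token embedding, element by element."""
    pe = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

# Hypothetical embedding for the word "her".
her = [0.2, -0.1, 0.4, 0.3]

her_at_2 = add_position(her, 2)  # third word of a sentence
her_at_4 = add_position(her, 4)  # fifth word of a sentence

print(her_at_2)
print(her_at_4)
```

The same word ends up with a different vector at each position, which is how the model can tell the two example sentences apart.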

  4. Self Attention:

    Self-attention means each word looks at every other word to work out which ones matter for its meaning, and assigns each of them an importance score (attention weight) accordingly.

    Example

    Sentence:

    “She put the book on the table because it was heavy.”

    What does “it” refer to?

    To understand “it”, the model checks all the words in the sentence:

    • book ✔️

    • table ✖️

    • Self-attention is responsible for this whole process.

    • When the model tries to understand “it”, it looks at all the other words and assigns attention weights based on relevance (computed internally).

Attention weights for “it” (example)

| Word | Attention weight | Why |
| --- | --- | --- |
| book | 0.60 | A book can be heavy |
| table | 0.10 | A table is usually not described this way |
| put | 0.05 | Verb, not an object |
| because | 0.05 | Connector word |
| she | 0.05 | Person, not relevant |
| was heavy | 0.15 | Describes the reason |

  • The highest attention weight goes to “book”, so the model understands:

  • “it” = “book”
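The way attention weights are computed can be sketched with scaled dot-product attention. This is a simplified version: the word vectors are hand-picked assumptions, and the learned query, key, and value matrices of a real Transformer are left out, so each vector plays all three roles.

```python
import math

def softmax(scores):
    """Turn raw scores into weights between 0 and 1 that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product attention with Q = K = V = vectors.

    Simplification: a real Transformer first multiplies each vector by
    learned query, key, and value matrices.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * value[j] for w, value in zip(weights, vectors))
            for j in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-d vectors: "it" points in nearly the same direction as "book".
words = ["book", "table", "it"]
vectors = [[2.0, 0.0], [0.0, 2.0], [1.8, 0.2]]

outputs, all_weights = self_attention(vectors)
it_weights = dict(zip(words, all_weights[2]))
for word, weight in it_weights.items():
    print(f"it -> {word}: {weight:.2f}")
```

In this toy setup “it” gives its largest weight to “book” and its smallest to “table”. It also attends to itself, which is normal in self-attention.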

  5. Multi-Head Attention:

    Multi-head attention means looking at the same sentence in multiple ways at the same time. Each head focuses on a different type of relationship between the words.

    Real-world Example:

    Imagine you are hosting an event for a talk. One person is put in charge of the audio system. A second looks after the attendees and their needs. A third handles the food and the venue. All of them share one goal: making the event a success. Likewise, multi-head attention focuses on different aspects of the same sentence.

    What different heads focus on

| Attention head | What it focuses on | Example |
| --- | --- | --- |
| Head 1 | Reference | “it” → book |
| Head 2 | Action | “put” → book, table |
| Head 3 | Reason / cause | “because” → was heavy |
| Head n | Grammar / structure | subject–verb–object |

    Each head looks at the same sentence, but notices different relationships.
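A rough sketch of the multi-head idea: split every token vector into equal slices, run attention separately on each slice (one slice per head), and join the results back together. This is a simplification under stated assumptions: real models use learned projection matrices for each head plus a final output projection, and the input vectors here are made up.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(vectors):
    """Scaled dot-product attention with Q = K = V = vectors (no learned weights)."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, vectors))
            for j in range(dim)
        ])
    return outputs

def multi_head_attention(vectors, num_heads):
    """Run attention on each slice of the vectors, then concatenate per token."""
    dim = len(vectors[0])
    if dim % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    head_dim = dim // num_heads
    head_outputs = []
    for head in range(num_heads):
        start, end = head * head_dim, (head + 1) * head_dim
        head_outputs.append(attention([v[start:end] for v in vectors]))
    # Join the slices from every head back into one vector per token.
    return [
        sum((head_outputs[head][t] for head in range(num_heads)), [])
        for t in range(len(vectors))
    ]

# Hypothetical 4-d token vectors, processed by 2 heads of size 2.
vectors = [[2.0, 0.0, 1.0, 0.5], [0.0, 2.0, 0.5, 1.0], [1.8, 0.2, 0.9, 0.6]]
output = multi_head_attention(vectors, num_heads=2)
print(output)
```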

  6. Feed Forward Neural Network:

    The Feed-Forward Neural Network refines the output produced by multi-head attention by processing each token individually. It transforms and improves the attended information before passing it to the next block.

    Example:

    Think of self-attention as the chef who cooks the food by combining all ingredients together.
    The feed-forward neural network is like the waiter who takes each dish and arranges it neatly on a plate or bowl before serving it to the customer.

    The waiter does not change the recipe or mix dishes together again. Instead, they refine and present each dish individually.
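In code, "processing each token individually" can be sketched as two linear steps with a ReLU in between, applied to one token vector at a time (ReLU is the activation used in the original paper). The weights below are small hand-picked numbers, not learned values.

```python
def relu(x):
    """Keep positive values, replace negative values with 0."""
    return max(0.0, x)

def linear(vector, weights, bias):
    """weights holds one row per output value."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]

def feed_forward(vector, w1, b1, w2, b2):
    """Expand the vector, apply ReLU, then project back to the original size."""
    hidden = [relu(h) for h in linear(vector, w1, b1)]
    return linear(hidden, w2, b2)

# Hypothetical weights: 2 inputs -> 4 hidden values -> 2 outputs.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
B1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
B2 = [0.0, 0.0]

tokens = [[1.0, 2.0], [3.0, -1.0]]
# Each token goes through the same network on its own; tokens are not mixed.
outputs = [feed_forward(token, W1, B1, W2, B2) for token in tokens]
print(outputs)  # [[2.0, 1.5], [2.5, 0.0]]
```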

  7. Linear:

    To understand the linear layer, let's go back to vector embeddings. Every word has a vector embedding. A linear layer holds learned weights and a bias: it multiplies the embedding vector by the weights and then adds the bias. This transformation is applied independently to every word.

    Think of it like this

    A linear layer re-expresses the same information in a different numeric form, like converting currency.
    Imagine you have money written in one currency, say USD, and you want to express the same value in EUR.

    Example:

     USD → EUR = (USD × exchange_rate) + fee
    

    Internal process:

    Step 1:
    You already have a word represented as a vector.

    Example: consider the following word
    "book" → [2341] → [0.6, -0.2, 0.9] (an example of how a word becomes a vector embedding)

    Step 2:
    Linear layer comes in

    A linear layer has:

    • Weights (numbers the model learned)

    • Bias (another number the model learned)

    Example: consider weights = [2, 1, -1] and bias = 0.5

    What linear actually does

    It does only two things:

    • Multiply the input numbers by the weights

    • Add the bias

    Example: we have the vector embedding [0.6, -0.2, 0.9] from Step 1, and weights = [2, 1, -1] and bias = 0.5 from Step 2.

    Linear = (vector embedding × weights) + bias

    (0.6 × 2) + (-0.2 × 1) + (0.9 × -1) + 0.5 = 0.6

    This produces a single output number. A real linear layer has one such set of weights (and one bias) for every output value it produces.
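The arithmetic above can be checked with a few lines of Python. The function name `linear_output` is made up for this sketch; the numbers are the ones from the example.

```python
def linear_output(vector, weights, bias):
    """Multiply each input by its weight, sum the results, then add the bias."""
    return sum(x * w for x, w in zip(vector, weights)) + bias

book = [0.6, -0.2, 0.9]   # vector embedding from Step 1
weights = [2, 1, -1]      # weights from Step 2
bias = 0.5                # bias from Step 2

result = linear_output(book, weights, bias)
# Rounded to hide tiny floating-point error.
print(round(result, 6))  # 0.6
```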

  8. Soft Max:

    Softmax helps decide the next predicted word based on the raw scores (logits) produced by the model’s final linear layer. It converts these raw scores into values between 0 and 1 that sum to 1, forming a probability distribution. The model then uses these probabilities to select the next word (either by choosing the highest one or by sampling).

    Sampling: Sampling is how the model chooses the next word from the probability distribution instead of always picking the top one.

    Example: search engines like Google
    A search engine assigns relevance scores to pages:

    • Result 1 → very high

    • Result 2 → medium

    • Result 3 → low

    • These scores are converted into relative weights that show which result deserves the most attention

    • Ranking + weighting attention = Softmax
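Here is a minimal sketch of softmax followed by the two selection strategies mentioned above: picking the top word and sampling. The three-word vocabulary and the logits are invented for illustration.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and raw scores from the final linear layer.
VOCAB = ["book", "table", "chair"]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
print([round(p, 2) for p in probs])  # roughly [0.66, 0.24, 0.1]

# Option 1: always pick the most likely word.
greedy_word = VOCAB[probs.index(max(probs))]

# Option 2: sample, so less likely words are still chosen sometimes.
rng = random.Random(0)  # fixed seed so the run is repeatable
sampled_word = rng.choices(VOCAB, weights=probs, k=1)[0]

print(greedy_word, sampled_word)
```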

  9. Semantic Meaning:

    Semantic meaning is the actual meaning of words and sentences based on context. The same word can convey different meanings depending on how it is used in a sentence.

    For Example:

    River bank → refers to the land beside a river

    Bank of America → refers to a financial institution

    This bag is light → low weight

    Turn on the light → illumination

  10. Add and Norm:

    Add: The Add step keeps both the original information and the processed information by adding a layer's input to its output (a residual connection).
    A model has many layers, and passing information through all of them should not scramble it.
    If the layer learns something useful → great

    If the layer messes up → original info is still there

    The layer only needs to learn small improvements

    Its main aim is not to lose any information.
    It provides a safe fallback path so mistakes don’t erase useful information.

    Norm: Normalization works on the numeric values inside token vectors to keep them stable as they pass through multiple layers. After tokenization and vector embedding, words are represented as numeric vectors such as:

     "lucky" → [23767] → [0.42, -0.18, 0.77, ...]
    

    As these vectors are repeatedly transformed by attention and feed-forward layers, their values can grow too large or uneven. Normalization rescales these values at each layer so the representations remain balanced and easy for the next layer to process.
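Both steps fit in a few lines. This sketch uses layer normalization, the kind used in the original Transformer, and leaves out the learned scale and shift parameters that real layers include. The sublayer output values are made up.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Rescale a vector to mean 0 and variance close to 1."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

def add_and_norm(original, sublayer_output):
    """Add: keep the original vector. Norm: rescale the sum."""
    added = [a + b for a, b in zip(original, sublayer_output)]
    return layer_norm(added)

x = [0.42, -0.18, 0.77]            # token vector entering the layer
sublayer_output = [0.10, 0.05, -0.20]  # hypothetical attention output

result = add_and_norm(x, sublayer_output)
print([round(v, 3) for v in result])
```

Even if the sublayer output were all zeros, the original vector would still flow through, which is the "safe fallback path" described above.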

  11. Encoder:

    The encoder reads the input and understands it. It takes the input sentence, lets all the words look at each other, builds a context-aware meaning for every word, and passes this information to the decoder.

    After the encoder: each word knows what it means in this sentence, so ambiguities are resolved, such as whether the word “bank” refers to a river bank or a financial institution.

    Example: “She put the book on the table because it was heavy.”
    After the encoder: “it” clearly refers to book

  12. Decoder:

    The decoder uses that understanding to generate output step by step. It looks at the encoder’s output together with the words generated so far, predicts the next word, and generates text one token at a time.

    The decoder decides what word comes next, and in what order.
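The "one token at a time" loop can be sketched as follows. The lookup table is a hypothetical stand-in for a trained decoder: a real decoder scores the next token using all previous tokens and the encoder output, not just the last token.

```python
# Hypothetical scores for the next token, keyed by the previous token.
TOY_NEXT_TOKEN_SCORES = {
    "<start>": {"the": 2.0, "book": 0.5},
    "the": {"book": 2.5, "table": 1.0},
    "book": {"is": 2.0, "<end>": 0.5},
    "is": {"heavy": 2.0, "<end>": 0.3},
    "heavy": {"<end>": 3.0},
}

def generate(max_tokens=10):
    """Greedy decoding: repeatedly append the highest-scoring next token."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        scores = TOY_NEXT_TOKEN_SCORES[tokens[-1]]
        next_token = max(scores, key=scores.get)
        if next_token == "<end>":
            break  # the model signals that the output is complete
        tokens.append(next_token)
    return tokens[1:]  # drop the <start> marker

print(generate())  # ['the', 'book', 'is', 'heavy']
```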

Additional AI Jargons:

Loss Calculation:
Loss measures the difference between the expected (correct) answer and the predicted answer. A model needs feedback to learn. Loss answers one question: How far was my prediction from the truth?

Small loss → prediction was good

Large loss → prediction was bad

Example:
Suppose the correct word is “book”, but the model predicts “table”. The loss calculation compares the predicted output with the expected output and produces a number that represents the error.

If the prediction is close → low loss

If the prediction is far → high loss
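The text does not name a specific loss function, so this sketch assumes cross-entropy, a common choice for next-word prediction. The probabilities below are invented: one prediction is confident in “book”, the other in “table”.

```python
import math

def cross_entropy(probs, correct_index):
    """Loss is small when the correct word gets a high probability."""
    return -math.log(probs[correct_index])

VOCAB = ["book", "table", "chair"]
correct_index = VOCAB.index("book")

good_prediction = [0.90, 0.07, 0.03]  # model is confident in "book"
bad_prediction = [0.10, 0.85, 0.05]   # model is confident in "table"

loss_good = cross_entropy(good_prediction, correct_index)
loss_bad = cross_entropy(bad_prediction, correct_index)

print(f"good prediction loss: {loss_good:.3f}")  # 0.105
print(f"bad prediction loss:  {loss_bad:.3f}")   # 2.303
```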

Back Propagation:
Backpropagation is the process of working out how much each weight contributed to the loss, so the weights can be updated and the model makes better predictions next time.

After the loss is calculated, the error signal moves backward through the neural network, works out which weights contributed to the error, and those weights are adjusted slightly to reduce the loss.

From this sentence: She put the book on the table because it was heavy.
The correct word is “book”, but if the model predicts “table”, the loss is high.

Here, backpropagation reduces the weights that favored “table” and increases the weights that favor “book”. Next time, the model is less likely to repeat the same mistake.
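A single update step can be sketched on a tiny example. To keep it short, this sketch treats the two raw scores (logits) as the things being adjusted and assumes a softmax with cross-entropy loss; in a real model the same gradient is passed backward through every layer to reach the actual weights. The starting scores and learning rate are made up.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss(logits, correct_index):
    """Cross-entropy loss for the correct word."""
    return -math.log(softmax(logits)[correct_index])

VOCAB = ["book", "table"]
correct_index = 0          # the right answer is "book"
logits = [1.0, 2.0]        # hypothetical scores: the model prefers "table"
learning_rate = 0.5

# For softmax + cross-entropy, the gradient of the loss with respect to
# each logit is (predicted probability - 1 if correct else 0).
probs = softmax(logits)
grads = [p - (1.0 if i == correct_index else 0.0) for i, p in enumerate(probs)]

# Move each score a small step in the direction that reduces the loss.
new_logits = [z - learning_rate * g for z, g in zip(logits, grads)]

print(f"loss before: {loss(logits, correct_index):.3f}")
print(f"loss after:  {loss(new_logits, correct_index):.3f}")
```

After the step, the score for “book” has gone up and the score for “table” has gone down, so the loss is lower.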

Knowledge Cutoff:
Knowledge cutoff is the point in time after which a model has no information about new events or updates.

Example: A textbook published in 2022 doesn’t include information from 2023 or later; it has to be published again to cover it. A model works the same way: refreshing its knowledge means training it on newer data. The knowledge cutoff tells you up to which point in time the model's knowledge extends.

Vocab Size:
Vocab size is the total number of unique tokens a model can recognize. It includes words, sub-words, punctuation, and special tokens.

Example:
["I", "love", "read", "reading", "book", "##ing", ".", ","]
The model knows only these tokens; it cannot directly represent anything outside this vocabulary.

Example:
Vocab sizes of GPT and Gemini
GPT-based ChatGPT models have a vocabulary of roughly 100,000 tokens (based on earlier versions)
Gemini by Google has a vocabulary of roughly 256,000 tokens (based on the Gemma variant)

Temperature:
Temperature controls the balance between determinism and creativity in model outputs.
Low temperature → safer, more predictable responses
High temperature → more varied and creative responses

Example:
Low to medium temperature
User: “Hi, how are you?”
Model: “I’m fine, thanks for asking. What about you?”

Medium to high temperature
User: “Hi, how are you?”
Model: “I’m absolutely great! I really appreciate you asking. How are things going on your side, and is there anything I can help you with?”
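Under the hood, temperature is commonly applied by dividing the raw scores by the temperature before softmax. The sketch below assumes that approach; the logits are invented for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate words

low = softmax_with_temperature(logits, 0.5)   # sharper: top word dominates
high = softmax_with_temperature(logits, 2.0)  # flatter: more variety when sampling

print("low temperature: ", [round(p, 2) for p in low])
print("high temperature:", [round(p, 2) for p in high])
```

With a low temperature the top word takes most of the probability, so sampling is predictable. With a high temperature the probabilities move closer together, so less likely words are picked more often.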