Decoding AI Jargons with Chai

Transformers:

In general, to transform means changing something from one form to another. In artificial intelligence, Transformers are models that transform raw input data such as text, audio, or images into meaningful representations that machines can understand and use. Transformers became popular with the rapid growth of modern AI.

A Transformer is a neural network architecture that understands sequential data like text or audio by learning how each part of the input relates to every other part through a mechanism called self-attention.

Some common applications of Transformers include machine translation (Language Translation), text generation, summarization, and speech processing.

Most modern AI systems such as ChatGPT, Gemini, Claude, Cursor, GitHub Copilot, and Meta AI are built on top of the Transformer architecture.

The Transformer architecture used in today’s AI era was originally introduced in the research paper “Attention Is All You Need”, published by researchers at Google in 2017.

The following figure illustrates how Transformers have evolved over time by stacking additional layers and enhancements on top of the original architecture.

Understanding Transformers with a real-world example:

Imagine you attend a meeting and later want to write clear notes.
You cannot write the notes sentence by sentence without thinking. When you write each line, you remember everything that was discussed in the meeting: who said what, what topic was being discussed, and how the points were connected. Let's say you write “The deadline was moved”. You already know which project, why, and who decided it, because you consider the full meeting context. Likewise, you write every line using the entire meeting as context.

How this maps to a Transformer

Meeting → Input sequence
Each note line → Token representation
Remembering the whole discussion → Self-Attention
Focusing on relevant parts → Attention weights
Clear, contextual notes → Output embeddings / predictions

Why Transformers?
Before Transformers, language models such as RNNs and LSTMs processed text one word at a time, which made them slow and caused them to forget important information in long sentences. This made it hard for models to capture long-range relationships between words and difficult to train them efficiently.

Transformers solved this by introducing self-attention, allowing the model to look at all words in a sentence at once and understand how they relate to each other. This made language understanding faster, more accurate, and scalable, which is why Transformers became the foundation of modern AI models.

Transformer Architecture:

Left side is the encoder

Right side is the decoder

In order to understand Transformer Architecture we must know the following:

Tokenization
Vector Embedding
Positional Encoding
Self Attention
Multi-head Attention
Feed forward neural network
Linear
Softmax
Semantic meaning
Add & Norm
Encoder
Decoder

Tokenization:

Tokenization is the process of breaking input text into smaller units called tokens and converting those tokens into numerical identifiers that a model can process.

A token can be a word, subword, character, or even part of a word depending on the tokenization strategy used by the model. Different models use different tokenizers.

Let's consider the sentence “You are so lucky to read this.” It is split into “You”, “are”, “so”, “lucky”, “to”, “read”, “this”, and “.”, and each of these tokens is then converted into a number.

Here is an example:
The sentence is split into words. Each word is highlighted in a different color to show where the splits occurred. Finally, each token is mapped to a unique number called a token ID.
Internally, those numbers are stored in binary (1s and 0s).

If you want to experiment and understand how tokens are generated for individual words, try the Tiktokenizer tool. You can also select different models in the top-right corner; each model may generate a different token count for the same text.
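To make the idea concrete, here is a minimal sketch of word-level tokenization in Python. The vocabulary and the token IDs are invented for illustration; real models use learned subword tokenizers with far larger vocabularies, so their splits and IDs will differ.

```python
import re

# Hypothetical toy vocabulary: these IDs are made up for illustration only.
TOY_VOCAB = {"You": 0, "are": 1, "so": 2, "lucky": 3, "to": 4, "read": 5, "this": 6, ".": 7}
UNKNOWN_ID = -1  # placeholder ID for anything outside the toy vocabulary


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def encode(text):
    """Map each token to its ID in the toy vocabulary."""
    return [TOY_VOCAB.get(token, UNKNOWN_ID) for token in tokenize(text)]


sentence = "You are so lucky to read this."
tokens = tokenize(sentence)
token_ids = encode(sentence)
print(tokens)
print(token_ids)
```

A word that is missing from the toy vocabulary maps to the placeholder ID here, whereas a subword tokenizer would typically break it into smaller known pieces instead.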

Vector Embeddings:

Vector embeddings give meaning to tokens by converting their token IDs into numerical vectors. These vectors place words in a high-dimensional space (picture it in 3D), where the distance between them shows how closely their meanings are related.
- Each token ID is converted into a vector (list of numbers)
- The vector represents the meaning of the token
- Similar words have similar vectors
```
"lucky" → [23767] → [0.42, -0.18, 0.77, ...]
```
- Closer vectors → more related meanings
- Farther vectors → less related meanings
- Examples:
  - king ↔ queen → close
  - dog ↔ puppy → close
  - dog ↔ car → far
- Click here to check the Visual Vector Embeddings
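The "close" and "far" comparisons above can be sketched with cosine similarity, one common way to measure how closely two vectors point in the same direction. The three-dimensional vectors below are hypothetical values chosen for illustration, not embeddings from a real model.

```python
import math

# Hypothetical 3-dimensional embeddings; real models learn vectors with
# hundreds or thousands of dimensions.
TOY_EMBEDDINGS = {
    "dog": [0.90, 0.80, 0.10],
    "puppy": [0.85, 0.75, 0.20],
    "car": [0.10, 0.20, 0.95],
}


def cosine_similarity(a, b):
    """Return a score near 1 for similar directions and near 0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


dog_puppy = cosine_similarity(TOY_EMBEDDINGS["dog"], TOY_EMBEDDINGS["puppy"])
dog_car = cosine_similarity(TOY_EMBEDDINGS["dog"], TOY_EMBEDDINGS["car"])
print(f"dog vs puppy: {dog_puppy:.3f}")
print(f"dog vs car:   {dog_car:.3f}")
```

With these made-up vectors, "dog" and "puppy" score much higher than "dog" and "car", which mirrors the close and far pairs in the list.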

Positional Encoding:

Positional encoding helps the Transformer understand meaning by telling it the order of words in a sentence. A Transformer sees all words at once, so without positional encoding it would know which words exist but not the order in which they appear.

Positional encoding adds position information to each token so that the sentence meaning remains accurate during further processing. Each token receives a positional vector that tells the model where the word appears in the sentence.

Example:
1. She helped her friend move
2. Her friend helped move her

Without positions → almost the same set of tokens

With positions → very different meanings

Nearly the same words, different order, different meaning.
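One way to build these positional vectors is the sinusoidal scheme described in “Attention Is All You Need”. The sketch below assumes that scheme; many newer models use learned or rotary position information instead. The token embeddings are hypothetical placeholder values.

```python
import math


def positional_encoding(num_positions, d_model):
    """Return one d_model-sized position vector per position (sinusoidal scheme)."""
    encoding = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        encoding.append(row)
    return encoding


# Hypothetical token embeddings: three identical tokens, 4 dimensions each.
embeddings = [[0.1, 0.2, 0.3, 0.4]] * 3
positions = positional_encoding(num_positions=3, d_model=4)

# The position vector is added element-wise to the token embedding.
with_position = [
    [e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, positions)
]
for row in with_position:
    print([round(v, 3) for v in row])
```

Even though the three token embeddings start out identical, each row differs after the position vector is added, which is how the model can tell the same word apart at different positions.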

Self Attention:

Self-attention means each word looks at every other word in the sentence to identify which ones are important to it, and importance scores (attention weights) are assigned accordingly.

Example

Sentence:

“She put the book on the table because it was heavy.”

What does “it” refer to?

To understand “it”, the model checks all the words in the sentence:
- book ✔️
- table ✖️

Self-attention is responsible for this whole process. When the model tries to understand “it”, it looks at all the other words and assigns attention weights based on relevance, which it calculates internally.

Attention weights for “it” (example)

| Word | Attention weight | Why |
| --- | --- | --- |
| book | 0.60 ✅ | A book can be heavy |
| table | 0.10 | A table is usually not described this way |
| put | 0.05 | Verb, not an object |
| because | 0.05 | Connector word |
| she | 0.05 | Person, not relevant |
| was heavy | 0.15 | Describes the reason |

The highest attention weight goes to “book”, so the model understands:
“it” = “book”
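The calculation behind attention weights can be sketched as scaled dot-product attention: score the query against every key, turn the scores into weights with softmax, and mix the values using those weights. The two-dimensional vectors below are hypothetical, and the learned projections that real models use to produce queries, keys, and values are left out for brevity.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output


# Hypothetical 2-dimensional vectors, chosen so "book" matches the query best.
words = ["book", "table", "heavy"]
keys = [[1.0, 0.9], [0.2, 0.1], [0.7, 0.6]]
values = keys  # simplification: reuse the keys as values
query_it = [1.0, 1.0]  # stands in for the word "it"

weights, output = attention(query_it, keys, values)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

With these made-up numbers the largest weight lands on "book", matching the intuition in the table.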

Multi-Head Attention:

Multi-head attention means looking at the same sentence in multiple ways at the same time. The model runs several attention heads in parallel, and each head focuses on a different type of relationship.

Real-world Example:

Imagine you are hosting an event for a talk. One person is given the task of handling the audio system. A second is asked to focus on the attendees and their needs. A third takes care of the food and the environment. All of these people share one goal: making the event the best it can be. Likewise, multi-head attention focuses on different things in the same sentence.

What different heads focus on

| Attention Head | What it focuses on | Example |
| --- | --- | --- |
| Head 1 | Reference | “it” → book |
| Head 2 | Action | “put” → book, table |
| Head 3 | Reason / cause | “because” → was heavy |
| Head n | Grammar / structure | subject–verb–object |

Each head looks at the same sentence, but notices different relationships.
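As a rough sketch of the mechanics, the code below splits each vector into equal slices, runs attention separately on each slice (one slice per head), and concatenates the results. It is a simplification: real models also apply learned projections per head and a final output projection, which are omitted here, and all vectors are hypothetical.

```python
import math


def softmax(scores):
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]


def split_heads(vector, num_heads):
    """Cut one vector into num_heads equal slices, one per head."""
    head_size = len(vector) // num_heads
    return [vector[h * head_size:(h + 1) * head_size] for h in range(num_heads)]


def multi_head_attention(query, keys, values, num_heads):
    query_heads = split_heads(query, num_heads)
    key_heads = [split_heads(key, num_heads) for key in keys]
    value_heads = [split_heads(value, num_heads) for value in values]
    combined = []
    for h in range(num_heads):
        head_output = attend(
            query_heads[h],
            [key[h] for key in key_heads],
            [value[h] for value in value_heads],
        )
        combined.extend(head_output)  # concatenate the heads' outputs
    return combined


# Hypothetical 4-dimensional vectors for three words, split across 2 heads.
keys = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.2, 0.8], [0.5, 0.5, 1.0, 0.0]]
values = keys
query = [1.0, 0.0, 0.0, 1.0]

result = multi_head_attention(query, keys, values, num_heads=2)
print([round(v, 3) for v in result])
```

Because each head only sees its own slice of the vectors, the two heads can weight the three words differently before their outputs are joined back into one vector.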

Feed Forward Neural Network:

The Feed-Forward Neural Network refines the output produced by multi-head attention by processing each token individually. It transforms and improves the attended information before passing it to the next block.

Example:

Think of self-attention as the chef who cooks the food by combining all ingredients together.
The feed-forward neural network is like the waiter who takes each dish and arranges it neatly on a plate or bowl before serving it to the customer.

The waiter does not change the recipe or mix dishes together again. Instead, they refine and present each dish individually.
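A minimal sketch of this step, assuming the form used in the original Transformer paper: a linear layer, a ReLU activation, and a second linear layer, applied to each token separately. The weights and token vectors are hypothetical values picked to keep the arithmetic easy to follow.

```python
def linear(vector, weights, bias):
    """Multiply a vector by a weight matrix (one row per output) and add a bias."""
    return [
        sum(x * w for x, w in zip(vector, row)) + b
        for row, b in zip(weights, bias)
    ]


def relu(vector):
    """Replace negative values with zero."""
    return [max(0.0, x) for x in vector]


def feed_forward(token_vector, w1, b1, w2, b2):
    """Expand the token vector, apply ReLU, then project back down."""
    hidden = relu(linear(token_vector, w1, b1))
    return linear(hidden, w2, b2)


# Hypothetical weights: expand 2 dimensions to 3 hidden units, then back to 2.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
B1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
B2 = [0.1, 0.1]

# Hypothetical outputs of the attention step, one vector per token.
attended_tokens = [[0.6, -0.2], [0.3, 0.9]]

# Each token is processed on its own; tokens are not mixed together here.
refined_tokens = [feed_forward(t, W1, B1, W2, B2) for t in attended_tokens]
for token in refined_tokens:
    print([round(v, 3) for v in token])
```

The same weights are reused for every token, which matches the waiter in the analogy handling each dish individually with the same routine.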

Linear:

In order to understand the linear layer, let's go back to vector embeddings. We have a vector embedding for every word. The linear layer contains weights and a bias: the embedded vector is multiplied by the weights, and the bias is added at the end. This transformation is applied independently to every word.

Think of it like this

A linear layer re-expresses the same information in a different numeric form, like converting currency.
Imagine you have money written in one currency, say USD, and you want to express the same value in EUR.

Example:
```
 USD → EUR = (USD × exchange_rate) + fee
```
Internal process:

Step 1:
You already have a word represented as a vector.

Example: consider the following word
"book" → [2341] → [0.6, -0.2, 0.9] (an example of how a word becomes a token ID and then a vector embedding)

Step 2:
The linear layer comes in.

A linear layer has:
- Weights (numbers the model learned)
- Bias (another number the model learned)

Example: consider weights = [2, 1, -1] and bias = 0.5

What the linear layer actually does

It does only two things:
- Multiply the input numbers by the weights
- Add the bias

Example: from Step 1 we have the vector embedding [0.6, -0.2, 0.9], and from Step 2 the weights [2, 1, -1] and bias = 0.5.

Linear = (vector embedding × weights) + bias
(0.6 × 2) + (-0.2 × 1) + (0.9 × -1) + 0.5 = 0.6

This example produces a single output number. A real linear layer repeats the same calculation with a different set of weights for each output dimension, so the result is again a vector.
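The arithmetic above can be checked with a few lines of Python. The function name is a hypothetical helper for this example, and it computes a single output value just like the worked calculation.

```python
def linear_single_output(vector, weights, bias):
    """Multiply each input by its weight, sum the products, then add the bias."""
    return sum(x * w for x, w in zip(vector, weights)) + bias


book_embedding = [0.6, -0.2, 0.9]  # example embedding from Step 1
weights = [2, 1, -1]               # example weights from Step 2
bias = 0.5

result = linear_single_output(book_embedding, weights, bias)
print(round(result, 2))  # 0.6
```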

Softmax:

Softmax helps decide the next predicted word based on the raw scores (logits) produced by the model’s final linear layer. It converts these raw scores into values between 0 and 1 that sum to 1, forming a probability distribution. The model then uses these probabilities to select the next word (either by choosing the highest one or by sampling).

Sampling: Sampling is how the model chooses the next word from the probability distribution instead of always picking the top one.

Example: search engines like Google
A search engine assigns relevance scores to pages:
- Result 1 → very high
- Result 2 → medium
- Result 3 → low

These scores are then converted into a decision about which result deserves the most attention.

Ranking + weighting attention = Softmax
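Here is a small sketch of softmax turning raw scores into probabilities, followed by the two selection strategies mentioned earlier: picking the top word and sampling. The candidate words and logits are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities between 0 and 1 that sum to 1."""
    max_logit = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate words and raw scores from the final linear layer.
vocab = ["book", "table", "chair"]
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

# Greedy choice: always take the most probable word.
greedy_choice = vocab[probs.index(max(probs))]

# Sampling: draw a word at random, weighted by its probability.
rng = random.Random(0)  # fixed seed so repeated runs give the same draw
sampled_choice = rng.choices(vocab, weights=probs, k=1)[0]

print([round(p, 3) for p in probs])
print("greedy:", greedy_choice)
print("sampled:", sampled_choice)
```

Greedy selection always returns the same word for the same scores, while sampling can occasionally return a lower-probability word.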

Semantic Meaning:

Semantic meaning is the actual meaning of words and sentences based on context. The same word can convey different meanings depending on how it is used in a sentence.

For Example:

River bank → refers to the land beside a river

Bank of America → refers to a financial institution

This bag is light → low weight

Turn on the light → illumination

Add and Norm:

Add: The job of Add is to keep both the original information and the processed information. It does this by adding a layer's input back to that layer's output (a residual connection).
A model has many layers, and information should not get distorted or lost as it passes through them:
- If the layer learns something useful → great
- If the layer messes up → the original info is still there
- The layer only needs to learn small improvements

Its main aim is not to lose any information.
It provides a safe fallback path so mistakes don’t erase useful information.

Norm: Normalization works on the numeric values inside token vectors to keep them stable as they pass through multiple layers. After tokenization and vector embedding, words are represented as numeric vectors such as:
```
 "lucky" → [23767] → [0.42, -0.18, 0.77, ...]
```
As these vectors are repeatedly transformed by attention and feed-forward layers, their values can grow too large or uneven. Normalization rescales these values at each layer so the representations remain balanced and easy for the next layer to process.
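Both steps can be sketched together. The code below adds a layer's input to its output and then applies layer normalization, which rescales the vector to roughly zero mean and unit variance. The learned scale and shift parameters that real layer normalization includes are omitted, and the vectors are hypothetical.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Rescale a vector to roughly zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]


def add_and_norm(layer_input, layer_output):
    """Add the original input back to the layer's output, then normalize."""
    residual = [x + y for x, y in zip(layer_input, layer_output)]
    return layer_norm(residual)


# Hypothetical values: the token vector before a layer and what the layer produced.
token_vector = [0.42, -0.18, 0.77]
layer_output = [5.0, -3.0, 8.0]  # deliberately large to show the rescaling

normalized = add_and_norm(token_vector, layer_output)
print([round(v, 3) for v in normalized])
```

Even though the layer's output here is much larger than the input, the normalized result stays in a small, balanced range for the next layer.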

Encoder:

The encoder reads the input and understands it. It takes the input sentence, lets all the words look at (or talk to) each other, builds a context-aware meaning for every word, and passes that information to the decoder.

After the encoder, each word knows what it means in this sentence, so ambiguities are resolved. For example, the model can tell whether the word “bank” refers to a river bank or a financial institution.

Example: “She put the book on the table because it was heavy.”
After the encoder: “it” clearly refers to book

Decoder:

The decoder uses that understanding to generate output step by step. It looks at the encoder’s output, predicts the next word, and generates text one token at a time.

The decoder decides what word comes next and in what order.
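The "one token at a time" loop can be sketched as follows. The probability table is a made-up stand-in for a trained model and looks only at the previous token; a real decoder conditions on every token generated so far and on the encoder's output.

```python
# Hypothetical stand-in for a trained model: for each previous token, a
# made-up probability table over possible next tokens.
TOY_NEXT_TOKEN_PROBS = {
    "<start>": {"the": 0.9, "book": 0.1},
    "the": {"book": 0.7, "table": 0.3},
    "book": {"was": 0.8, "<end>": 0.2},
    "was": {"heavy": 0.9, "<end>": 0.1},
    "heavy": {"<end>": 1.0},
}


def generate(max_tokens=10):
    """Generate tokens one at a time, always picking the most probable next one."""
    tokens = ["<start>"]
    while len(tokens) <= max_tokens:
        candidates = TOY_NEXT_TOKEN_PROBS[tokens[-1]]
        next_token = max(candidates, key=candidates.get)
        if next_token == "<end>":
            break  # the model signalled that the output is complete
        tokens.append(next_token)
    return tokens[1:]  # drop the <start> marker


generated = generate()
print(" ".join(generated))
```

Each pass through the loop appends exactly one token, and the newly extended sequence becomes the input for the next prediction.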

Additional AI Jargons:

Loss Calculation:
Loss calculation measures the difference between the expected (correct) answer and the predicted answer. A model needs feedback to learn. Loss answers one question: How far was my prediction from the truth?

Small loss → prediction was good

Large loss → prediction was bad

Example:
Suppose the correct word is “book”, but the model predicts “table”. Loss calculation compares the predicted output with the expected output and produces a number that represents the error.

If the prediction is close → low loss

If the prediction is far → high loss
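One widely used loss for next-word prediction is cross-entropy, which is the negative log of the probability the model gave to the correct word. The sketch below assumes that loss; the probability values are hypothetical.

```python
import math


def cross_entropy_loss(predicted_probs, correct_word):
    """Return the negative log of the probability assigned to the correct word."""
    return -math.log(predicted_probs[correct_word])


# Hypothetical model outputs for the same position in a sentence.
good_prediction = {"book": 0.9, "table": 0.1}  # model favors the correct word
bad_prediction = {"book": 0.1, "table": 0.9}   # model favors the wrong word

low_loss = cross_entropy_loss(good_prediction, "book")
high_loss = cross_entropy_loss(bad_prediction, "book")
print(f"good prediction loss: {low_loss:.3f}")
print(f"bad prediction loss:  {high_loss:.3f}")
```

The more probability the model puts on the correct word, the closer the loss gets to zero.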

Back Propagation:
Backpropagation is the process of working out how the model’s weights should change, based on the loss, so the model makes better predictions next time.

After the loss is calculated, backpropagation moves backward through the neural network, figures out how much each weight contributed to the error, and the weights are then adjusted slightly to reduce the loss.

Take this sentence: She put the book on the table because it was heavy.
The correct word is “book”, but if the model predicts “table”, the loss is high.

Here, backpropagation reduces the weights that favored “table” and increases the weights that favor “book”. Next time, the model is less likely to repeat the same mistake.
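The idea of nudging weights to reduce the loss can be shown with a deliberately tiny model that has a single weight and a squared-error loss. This is only a sketch: in a real network, backpropagation uses the chain rule to compute a gradient like this for every weight in every layer. All numbers are hypothetical.

```python
def train_single_weight(x, target, weight=0.0, learning_rate=0.1, steps=20):
    """Repeatedly adjust one weight so that weight * x moves toward the target."""
    losses = []
    for _ in range(steps):
        prediction = weight * x
        error = prediction - target
        losses.append(error ** 2)            # squared-error loss
        gradient = 2 * error * x             # how the loss changes with the weight
        weight -= learning_rate * gradient   # small step that reduces the loss
    return weight, losses


final_weight, losses = train_single_weight(x=1.0, target=3.0)
print(f"first loss: {losses[0]:.3f}")
print(f"last loss:  {losses[-1]:.5f}")
print(f"final weight: {final_weight:.3f}")
```

The loss shrinks at every step because each update moves the weight in the direction that reduces the error.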

Knowledge Cutoff:
Knowledge cutoff is the point in time after which a model has no information about new events or updates.

Example: A textbook published in 2022 doesn’t include information from 2023 or later; it needs to be published again to cover it. Updating a model’s data works the same way: the model holds knowledge only up to a specific point in time.

Vocab Size:
Vocab size is the total number of unique tokens a model can recognize. It includes words, subwords, punctuation, and special tokens.

Example:
["I", "love", "read", "reading", "book", "##ing", ".", ","]
The model knows only these tokens; it cannot represent anything outside this vocabulary.

Example:
Vocab size of GPT and Gemini
Earlier GPT-based ChatGPT models have a vocabulary of roughly 100,000 tokens.
Gemini by Google has roughly 256,000 tokens, a figure taken from its Gemma variant.

Temperature:
Temperature controls the balance between determinism and creativity in model outputs.
Low temperature → safer, more predictable responses
High temperature → more varied and creative responses

Example:
Low to medium temperature
User: “Hi, how are you?”
Model: “I’m fine, thanks for asking. What about you?”

Medium to high temperature
User: “Hi, how are you?”
Model: “I’m absolutely great! I really appreciate you asking. How are things going on your side, and is there anything I can help you with?”
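Under the hood, temperature is commonly applied by dividing the raw scores (logits) by the temperature before softmax. The sketch below assumes that approach; the logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical raw scores for three candidate words

low_temp = softmax_with_temperature(logits, temperature=0.5)
high_temp = softmax_with_temperature(logits, temperature=2.0)
print("low temperature: ", [round(p, 3) for p in low_temp])
print("high temperature:", [round(p, 3) for p in high_temp])
```

A low temperature concentrates probability on the top word, so outputs are more predictable. A high temperature flattens the distribution, so less likely words are sampled more often.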
