Important Keywords
Token
A token is a piece of text — could be a word, part of a word, or even punctuation.
Example:
The sentence:
"ChatGPT is smart!"
Breaks down into tokens like:
- "Chat", "G", "PT", " is", " smart", "!"
Each model uses its own tokenizer. GPT usually breaks words into sub-words.
Why does it matter?
- Models process tokens, not raw text.
- More tokens = higher cost and slower response.
- There are limits (e.g., GPT-4 Turbo supports a context window of ~128k tokens).
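To make the sub-word idea concrete, here is a toy greedy longest-match tokenizer. The vocabulary is invented purely for illustration; real tokenizers (like GPT's byte-pair encoding) learn their vocabulary from data.

```python
# Toy greedy longest-match sub-word tokenizer.
# VOCAB is hand-picked for this example; real models learn theirs.
VOCAB = {"Chat", "G", "PT", " is", " smart", "!"}

def tokenize(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest piece that matches at position i.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                tokens.append(piece)
                i += length
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("ChatGPT is smart!"))
# ['Chat', 'G', 'PT', ' is', ' smart', '!']
```

Note how "ChatGPT" splits into three tokens while " is" keeps its leading space, just like in the example above.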
Prompt
A prompt is the input or question you give to the AI.
Example:
“Write a story about a robot who makes coffee.”
The AI takes your prompt and generates a response.
Completion / Output
The AI’s response to your prompt.
Example:
If prompt is:
“Tell me a joke”
The completion might be:
“Why did the computer go to therapy? Because it had too many bugs!”
Temperature
Temperature is a float value (typically 0.0 to 2.0) that adjusts the probability distribution over possible next words when the model is generating text.
Simple Explanation: Think of temperature like a creativity knob:
- Low temperature → the model plays it safe (predictable, accurate).
- High temperature → the model becomes more creative (or chaotic!).
How It Works (in simple terms):
Language models generate the next word by picking from many possible words, each with a probability.
Temperature changes how sharp or flat that probability curve is.
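The sharpen-or-flatten effect can be sketched in a few lines (the logits below are invented for illustration; a real model produces one logit per vocabulary word):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by T before softmax:
    # T < 1 sharpens the distribution (top word dominates),
    # T > 1 flattens it toward uniform. T must be > 0;
    # T -> 0 approaches greedy argmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for "mat", "sofa", "floor"
cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 1.5)
print(cold[0])  # near 1.0: "mat" almost always wins
print(hot[0])   # much smaller: "sofa" and "floor" get real chances
```

With the same logits, low temperature pushes nearly all probability mass onto "mat", while high temperature spreads it across all three candidates.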
Examples:
Let’s say the model is trying to generate the next word after:
"The cat sat on the"
1. temperature = 0.0 (deterministic)
- Always picks the highest-probability word.
- 👉 Output: "mat"
2. temperature = 0.7 (balanced)
- A bit of randomness.
- 👉 Output: "mat", "sofa", or "floor"
3. temperature = 1.5 (high creativity)
- Very random.
- 👉 Output: "rocket", "cloud", or "spoon"
Common Settings
Temperature | Behavior | Use Case |
---|---|---|
0.0 | Deterministic | Facts, math, code generation |
0.5 | Balanced | General-purpose conversation |
1.0 | Creative | Storytelling, poem generation |
>1.2 | Very creative | Wild ideas, brainstorming |
In Short:
Temperature controls how boring or bold your AI's response is.
Top-k Sampling
Top-k sampling is a method where the model:
Only considers the top k most likely next words, and randomly picks one from them based on their probabilities.
Why do we use it?
To control randomness and reduce weird outputs by not letting the model choose from all possible words (some of which have tiny, junky probabilities).
How It Works:
Imagine the model predicts the next word in a sentence, and it gives probabilities for 50,000 possible words.
- Without Top-k: it can choose from all 50,000, even if some are very unlikely.
- With Top-k = 5: it picks only from the top 5 most likely words, and samples randomly among those.
Example:
The model is generating the next word for:
"The pizza tastes"
Top predicted probabilities:
Word | Probability |
---|---|
delicious | 0.45 |
great | 0.20 |
amazing | 0.15 |
awful | 0.10 |
burnt | 0.05 |
wooden | 0.01 |
spicy | 0.01 |
... | ... |
- Top-k = 3 → consider only: delicious, great, amazing
- Pick one of them randomly, weighted by their probabilities.
- "awful" or "burnt" will not be considered.
When to use:
Top-k Value | Behavior |
---|---|
k = 1 | Always picks the top choice (deterministic) |
k = 10 | Balanced randomness |
k = 50+ | More creative or surprising |
Bonus: Often used with Temperature
- First apply Top-k to get a shortlist.
- Then apply Temperature to adjust randomness within that shortlist.
In Simple Words:
Top-k sampling = “Only pick from the top k best options”, then choose one based on probability.
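The pizza example can be sketched like this. The probabilities are copied from the table above; applying temperature within the shortlist (the "bonus" step) is done here by raising each probability to the power 1/T and renormalizing, which is one common way to combine the two:

```python
import random

def top_k_sample(probs: dict, k: int, temperature: float = 1.0) -> str:
    # 1. Keep only the k most likely words.
    shortlist = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    words, weights = zip(*shortlist)
    # 2. Apply temperature within the shortlist, then renormalize.
    weights = [w ** (1.0 / temperature) for w in weights]
    total = sum(weights)
    weights = [w / total for w in weights]
    # 3. Sample one word, weighted by its renormalized probability.
    return random.choices(words, weights=weights, k=1)[0]

probs = {"delicious": 0.45, "great": 0.20, "amazing": 0.15,
         "awful": 0.10, "burnt": 0.05, "wooden": 0.01, "spicy": 0.01}
print(top_k_sample(probs, k=3))  # always one of: delicious, great, amazing
```

With k = 3, "awful", "burnt", "wooden", and "spicy" can never be chosen, no matter how many times you sample.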
Max Tokens
Max tokens controls how long the output from a language model like GPT can be.
In Simple Terms:
“Max tokens” = the maximum number of tokens (pieces of text) the model is allowed to generate.
What is a Token?
A token is not exactly a word — it's a piece of text.
Text | Tokens |
---|---|
Hello | 1 |
ChatGPT | 1 |
unbelievable | 2 |
I love pizza. | 4 |
😊 (emoji) | 1 |
2025-07-02 | 4 |
(Counts are illustrative; the exact split varies by tokenizer.)
So max_tokens limits the number of tokens, not characters or full words.
How It Works
If you set max_tokens = 50, the model will stop generating after 50 tokens, even if it hasn’t finished its sentence.
This helps:
- 🚫 Avoid super long or endless outputs
- 💰 Control costs (API pricing is often token-based)
- 📦 Fit within token limits (e.g., 4096 or 8192 total)
Important:
The input + output tokens together must stay within the model’s total token limit:
Model | Token Limit |
---|---|
GPT-3.5 | ~4,096 tokens |
GPT-4 | ~8,192 to 32,768 |
GPT-4 Turbo | ~128,000 |
Example
Prompt:
"Write a short poem about cats."
If you set max_tokens = 20, the output might be:
Possible Output:
"Cats in sunbeams play,
Softly purring through the day..."
It stops there, even if the poem isn’t finished, because it hit the 20-token limit.
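The cutoff behavior can be mimicked with a dummy token stream. Here fake_poem is a hand-made stand-in for a real model's token-by-token output:

```python
def generate_with_limit(token_stream, max_tokens: int) -> str:
    # Stop after max_tokens tokens, even mid-sentence.
    out = []
    for i, token in enumerate(token_stream):
        if i >= max_tokens:
            break
        out.append(token)
    return "".join(out)

# A stand-in "model" that just yields pre-made tokens.
fake_poem = ["Cats", " in", " sun", "beams", " play", ",", " softly",
             " purr", "ing", " through", " the", " day", "..."]
print(generate_with_limit(fake_poem, max_tokens=5))
# Cats in sunbeams play
```

The output is cut after exactly 5 tokens, mid-poem, just like a real max_tokens limit.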
Use Cases
Use Case | Recommended Max Tokens |
---|---|
Short answers (FAQs) | 10–50 |
Chatbots | 50–200 |
Story/essay generation | 200–1000+ |
Code generation | Depends, usually 100–800 |
Stop Sequence
A stop sequence is a custom string or token that tells the language model:
"Stop generating text once you see this."
It’s like saying:
“As soon as you see this word/phrase, cut off the output!”
Why Use Stop Sequences?
- To control where the output ends
- To avoid unnecessary or repeated text
- To simulate structured conversation (like ending after one message)
Example 1: Chatbot Message
Suppose your prompt is a chat transcript that ends mid-conversation, and you set stop = ["User:"].
The model generates the assistant’s reply and stops before printing "User:" again, avoiding generating the user’s next turn in the conversation.
Example 2: Multiple-Choice Question
Ask a multiple-choice question and set stop = ["\n"].
The model stops as soon as it hits the first newline (\n), giving a short, single-line answer ✅
Example 3: JSON Completion
Give the model a partially written JSON object to complete and set stop = ["}"].
It stops right before the closing brace, which is useful for structured outputs.
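API backends implement this by truncating the output at the first match of any stop string; a minimal sketch:

```python
def apply_stop_sequences(text: str, stops: list) -> str:
    # Cut the output at the earliest occurrence of any stop string.
    # The stop string itself is not included in the result.
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

raw = "Assistant: Sure, here you go!\nUser: thanks"
print(apply_stop_sequences(raw, ["User:"]))
# prints the assistant turn only; "User:" and everything after is cut
```

The same function covers all three examples: pass ["User:"] for the chatbot, ["\n"] for the one-line answer, or ["}"] for the JSON completion.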
Summary Table
Feature | What it Does |
---|---|
Stop Sequence | Halts generation when the model outputs a match |
Type | String or list of strings (e.g., ["User:", "\n"]) |
Common Use | Chatbots, JSON, code, Q&A, structured text |
A stop sequence tells the model: “Stop writing when you hit this word or phrase.”
Fine-tuning
Fine-tuning is the process of training a pre-trained language model on your own custom dataset, so it learns to give more specific, domain-relevant, or personalized responses.
In Simple Words:
You're teaching a smart AI a special skill or style, on top of what it already knows.
Analogy
Imagine GPT is like a chef who can cook all kinds of food.
With fine-tuning, you're teaching the chef to cook your grandma’s secret recipes perfectly.
Now, the chef (GPT) still knows everything but becomes super good at your specific style.
Why Fine-tune a Model?
To make it:
- Talk in your brand voice
- Answer in domain-specific knowledge (e.g., medicine, law, finance)
- Follow specific response formats
- Speak in a different language or tone
- Act like a custom assistant or bot
How Fine-tuning Works (Step-by-Step)
- Start with a base model (like GPT-3.5 or LLaMA)
- Prepare a dataset of input-output pairs (called prompts and completions)
- Train the model on this data using a few passes (called epochs)
- The model updates its internal weights slightly to favor your examples
Example Dataset
A fine-tuning dataset is a collection of prompt-completion pairs written in your target style. After fine-tuning on such examples, the model tends to respond in that style and tone, even to similar but not identical questions.
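As an illustrative sketch, here is a tiny two-example dataset written in the JSONL chat format that the OpenAI fine-tuning API accepts. The brand voice ("PizzaBot") and the questions are invented:

```python
import json

# Each line of the JSONL file is one training example:
# a short conversation ending with the completion we want to teach.
examples = [
    {"messages": [
        {"role": "system", "content": "You are PizzaBot, a cheerful pizzeria assistant."},
        {"role": "user", "content": "Do you deliver on Sundays?"},
        {"role": "assistant", "content": "We sure do! Sunday is our busiest slice day 🍕"},
    ]},
    {"messages": [
        {"role": "system", "content": "You are PizzaBot, a cheerful pizzeria assistant."},
        {"role": "user", "content": "What's your most popular pizza?"},
        {"role": "assistant", "content": "Our wood-fired Margherita wins every time!"},
    ]},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```

A real dataset would contain dozens to thousands of such lines; the more consistent the assistant answers are, the more reliably the fine-tuned model picks up the style.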
Fine-tuning vs Prompt Engineering
Feature | Fine-tuning | Prompt Engineering |
---|---|---|
Changes model? | Yes (updates internal weights) | No (just changes the prompt) |
Custom training? | Needs your dataset | Just uses clever wording |
Cost? | Higher (training & hosting) | Lower (just inference) |
Flexibility | More control over behavior | Limited, but easier |
When to NOT Fine-tune
- If you just want minor tweaks → use prompt engineering or function calling
- If data is confidential → be careful about what you upload
- If your use case is simple or short-lived