Fine-Tuning Llama 2 with QLoRA
If you want to fine-tune Llama 2 efficiently, you should be familiar with the following parameters.
Model and Dataset Parameters
model_name = "NousResearch/Llama-2-7b-chat-hf"
dataset_name = "mlabonne/guanaco-llama2-1k"
new_model = "llama-2-7b-miniguanaco"
- model_name: Specifies the base model from the Hugging Face Hub (LLaMA-2-7B).
- dataset_name: Instruction-tuning dataset used for fine-tuning.
- new_model: Name of the fine-tuned model.
QLoRA Parameters
QLoRA (Quantized LoRA) is a memory-efficient fine-tuning method.
lora_r = 64
lora_alpha = 16
lora_dropout = 0.1
- lora_r: LoRA rank; controls the number of trainable parameters. Higher r increases expressiveness but uses more memory.
- lora_alpha: Scaling factor that adjusts the LoRA weights.
- lora_dropout: Dropout probability in LoRA layers to prevent overfitting.
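For reference, here is a minimal sketch of how these three values are typically passed to a PEFT LoraConfig (the bias and task_type settings are extra assumptions, not part of the list above):

from peft import LoraConfig

peft_config = LoraConfig(
    r=64,                   # lora_r: rank of the low-rank update matrices
    lora_alpha=16,          # lora_alpha: scaling factor for the LoRA update
    lora_dropout=0.1,       # lora_dropout: dropout applied to the LoRA layers
    bias="none",            # assumption: leave bias parameters untouched
    task_type="CAUSAL_LM",  # assumption: causal language modeling task
)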
bitsandbytes Parameters (4-bit Quantization)
Reduces memory usage by quantizing weights to 4-bit precision.
use_4bit = True
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"
use_nested_quant = False
- use_4bit: Enables loading the model in 4-bit precision.
- bnb_4bit_compute_dtype: Data type used for computation (float16 for efficiency).
- bnb_4bit_quant_type: "fp4" is standard 4-bit quantization; "nf4" is Normalized Float 4 (better for LLMs, retains more information).
- use_nested_quant: Applies additional quantization (usually False).
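As a rough sketch, these four flags map onto a transformers BitsAndBytesConfig like this (assuming recent transformers and bitsandbytes versions):

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # use_4bit
    bnb_4bit_compute_dtype=torch.float16,  # bnb_4bit_compute_dtype
    bnb_4bit_quant_type="nf4",             # bnb_4bit_quant_type
    bnb_4bit_use_double_quant=False,       # use_nested_quant
)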
Training Arguments
Defines how the training process runs.
output_dir = "./results"
num_train_epochs = 1
fp16 = False
bf16 = False
per_device_train_batch_size = 4
per_device_eval_batch_size = 4
gradient_accumulation_steps = 1
gradient_checkpointing = True
max_grad_norm = 0.3
learning_rate = 2e-4
weight_decay = 0.001
optim = "paged_adamw_32bit"
lr_scheduler_type = "cosine"
max_steps = -1
warmup_ratio = 0.03
group_by_length = True
save_steps = 0
logging_steps = 25
Key Training Parameters
- output_dir: Where trained models and logs are stored.
- num_train_epochs: Total number of passes over the dataset.
- fp16 / bf16: Mixed-precision training (use bf16=True on A100 GPUs).
- per_device_train_batch_size: Batch size per GPU during training.
- per_device_eval_batch_size: Batch size for evaluation.
Optimization and Efficiency
- gradient_accumulation_steps: Updates weights after accumulating gradients over n steps (useful for small batch sizes).
- gradient_checkpointing: Saves GPU memory by recomputing activations.
- max_grad_norm: Clips gradients to prevent large updates.
- learning_rate: Controls how fast the model learns (2e-4 is standard).
- weight_decay: Prevents overfitting by penalizing large weights.
- optim: Uses the Paged AdamW optimizer (memory-efficient).
- lr_scheduler_type: "cosine" decays the learning rate smoothly.
Training Steps and Logging
- max_steps: If -1, training length is determined by num_train_epochs.
- warmup_ratio: Linearly increases the learning rate at the start of training.
- group_by_length: Groups inputs of similar length to improve training speed.
- save_steps: Controls checkpoint saving (set to 0 to disable).
- logging_steps: Logs progress every 25 steps.
Supervised Fine-Tuning (SFT) Parameters
max_seq_length = None
packing = False
device_map = {"": 0}
- max_seq_length: Maximum token length (defaults to the model's setting).
- packing: Packs multiple short examples together (set to False here).
- device_map: Loads the model on GPU 0.
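Putting all of these pieces together, a minimal end-to-end sketch could look like the following. It assumes the transformers, peft, trl, datasets, and bitsandbytes libraries and follows the older SFTTrainer signature (where tokenizer, max_seq_length, and packing are passed directly); newer trl versions move some of these into an SFTConfig.

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig
from trl import SFTTrainer

model_name = "NousResearch/Llama-2-7b-chat-hf"
dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")

# 4-bit NF4 quantization (bitsandbytes parameters above)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map={"": 0}
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as the padding token

# QLoRA parameters above
peft_config = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.1,
                         bias="none", task_type="CAUSAL_LM")

# Abridged training arguments (see the full list above)
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    max_grad_norm=0.3,
    learning_rate=2e-4,
    weight_decay=0.001,
    optim="paged_adamw_32bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    group_by_length=True,
    logging_steps=25,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",  # assumption: the dataset stores prompts in a "text" column
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_args,
    packing=False,
)
trainer.train()
trainer.model.save_pretrained("llama-2-7b-miniguanaco")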
Summary
Your setup: ✅ Uses QLoRA for efficient fine-tuning.
✅ Loads 4-bit quantized LLaMA-2-7B model with NF4 quantization.
✅ Uses gradient checkpointing to save memory.
✅ Uses Paged AdamW optimizer with a cosine learning rate schedule.
Let’s break these down with hands-on, simple calculations for better intuition.
1. LoRA Rank (lora_r)
LoRA (Low-Rank Adaptation) reduces the number of trainable parameters by decomposing weight matrices into low-rank matrices. Instead of updating the full weight matrix $W$ , we introduce two small matrices $A$ and $B$ such that:
\[W' = W + A B\]where:
- $A$ has shape (original_dim, r)
- $B$ has shape (r, original_dim)
- $r$ is the LoRA rank (lora_r).
Example: LoRA Rank Calculation
Assume the original weight matrix $W$ is 1000 × 1000.
Without LoRA:
- The total trainable parameters = $1000 \times 1000 = 1,000,000$ .
With LoRA (lora_r = 64):
- Matrix $A$ (1000 × 64) → 64,000 parameters
- Matrix $B$ (64 × 1000) → 64,000 parameters
- Total parameters = 64,000 + 64,000 = 128,000 (much smaller!)
Thus, instead of updating 1,000,000 parameters, we update only 128,000, reducing the number of trainable parameters by 87.2%.
👉 Higher lora_r increases the number of trainable parameters, making the model more expressive but requiring more memory.
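A quick way to sanity-check this arithmetic for different ranks (plain Python, no libraries needed):

def lora_trainable_params(d_in, d_out, r):
    # Parameters in A (d_in x r) plus B (r x d_out)
    return d_in * r + r * d_out

full = 1000 * 1000  # 1,000,000 parameters in the original matrix
for r in (8, 16, 64, 128):
    lora = lora_trainable_params(1000, 1000, r)
    print(f"r={r}: {lora:,} trainable params ({100 * (1 - lora / full):.1f}% reduction)")
# r=64 gives 128,000 params, an 87.2% reduction, matching the example above.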
2. LoRA Scaling Factor (lora_alpha)
LoRA scales the newly learned adaptation matrices using lora_alpha to control their impact on the model.
- lora_alpha controls how much influence the new LoRA weights have.
- Higher lora_alpha → more weight on LoRA updates → the model adapts more aggressively.
Example: LoRA Alpha Calculation
Let’s say:
- $W = 1.0$ (original weight value)
- $A B = 0.5$ (LoRA update)
- lora_r = 64
- lora_alpha = 16
LoRA scaling factor:
\[\frac{\alpha}{r} = \frac{16}{64} = 0.25\]Final weight:
\[W' = 1.0 + (0.25 \times 0.5) = 1.0 + 0.125 = 1.125\]
👉 If lora_alpha were 32 instead of 16, the scaling factor would be 32/64 = 0.5, giving W' = 1.0 + (0.5 × 0.5) = 1.25.
🔹 Higher lora_alpha makes the model adapt faster but might cause instability.
🔹 Lower lora_alpha makes adaptation slower but more stable.
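The same calculation in a few lines of Python (illustrative scalars standing in for W and the LoRA update A·B):

W = 1.0          # original weight value
AB = 0.5         # LoRA update
lora_r, lora_alpha = 64, 16

scaling = lora_alpha / lora_r      # 16 / 64 = 0.25
print(W + scaling * AB)            # 1.0 + 0.25 * 0.5 = 1.125

# With lora_alpha = 32 the scaling factor doubles to 0.5, giving 1.25.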
3. LoRA Dropout (lora_dropout)
LoRA applies dropout to its low-rank updates to prevent overfitting.
- If lora_dropout = 0.1, then 10% of the values in A B are randomly set to zero during training.
- This prevents LoRA updates from dominating and helps generalization.
Example: LoRA Dropout Calculation
Assume LoRA produces:
\[A B = [0.4, 0.2, 0.3, 0.6, 0.8]\]
With lora_dropout = 0.4, we randomly drop 40% of the values (2 of the 5):
\[A B = [0.4, 0, 0.3, 0.6, 0]\]
👉 Higher dropout = less risk of overfitting but slower adaptation.
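A small NumPy illustration of the same idea (random mask, so the dropped positions vary from run to run):

import numpy as np

rng = np.random.default_rng(0)
AB = np.array([0.4, 0.2, 0.3, 0.6, 0.8])
p = 0.4                              # lora_dropout

mask = rng.random(AB.shape) >= p     # keep each value with probability 1 - p
print(AB * mask)

# Note: real dropout layers (e.g., torch.nn.Dropout) also rescale the surviving
# values by 1 / (1 - p) during training so the expected magnitude is unchanged.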
🚀 Summary
Parameter | Meaning | Effect |
---|---|---|
lora_r = 64 | LoRA rank (size of low-rank matrices) | Higher r → More trainable params → More memory usage |
lora_alpha = 16 | Scaling factor for LoRA updates | Higher alpha → LoRA updates affect model more |
lora_dropout = 0.1 | Dropout rate for LoRA layers | Higher dropout → Prevents overfitting but slows adaptation |
The 1000×1000 matrix in the example above is a weight matrix in a neural network, representing model parameters. Let’s break it down further.
Where Does 1000×1000 Come From?
In a transformer-based model like LLaMA, weight matrices appear in layers like attention heads, MLPs, and embedding layers.
Each weight matrix $W$ maps input features (dimensionality) to output features (next layer). The size of $W$ depends on:
- Input dimension (e.g., hidden size in transformers)
- Output dimension (e.g., next layer’s size)
Example: Transformer Layer Weight Matrix
A transformer layer has query, key, and value projections for attention:
\[Q = X W_Q, \quad K = X W_K, \quad V = X W_V\]If:
- Hidden size = 1000 (feature size of each token representation)
- Weight matrix = 1000×1000 (mapping from 1000 input features to 1000 output features)
Then, each of $W_Q, W_K, W_V$ is 1000×1000.
Similarly, the feedforward layer (MLP block) has:
- Weight matrices of shape 1000 × 4000 (expansion)
- 4000 × 1000 (projection back)
Each transformer layer contains multiple such weight matrices.
How Does LoRA Reduce Parameters?
Instead of updating all weights in these matrices, LoRA replaces a full update with **two smaller matrices $A$ and $B$ **:
\[W' = W + A B\]where:
- $A$ is 1000×r (e.g., r=64, so 1000×64)
- $B$ is r×1000 (e.g., r=64, so 64×1000)
Thus:
- Original parameters per matrix: 1000×1000 = 1,000,000
- LoRA trainable parameters: (1000×64) + (64×1000) = 128,000
- Parameter reduction: 87.2% fewer trainable parameters!
Why Use LoRA?
- Fewer trainable parameters → Saves memory.
- Efficient fine-tuning → You don’t need to modify the entire model.
- Works with large models → Ideal for LLaMA-2, GPT, etc.
A common point of confusion is how LoRA modifies weight matrices in large models like LLaMA-2, so let’s clarify.
LoRA Breakdown with Example
In a neural network, weights are usually stored as matrices. Instead of modifying the full weight matrix $W$ , LoRA adds a low-rank decomposition:
\[W' = W + A B\]where:
- ** $A$ (d × r)**: Small matrix (reduces dimensionality)
- ** $B$ (r × d)**: Small matrix (projects back)
- ** $r$ (LoRA rank)**: Controls how much of $W$ can be learned.
1B Parameters Model Example
A model with 1 billion (1B) parameters means:
- The model has multiple layers, each with several weight matrices.
- Not all 1B parameters belong to a single weight matrix!
Instead, let’s assume a **single layer weight matrix $W$ ** has:
- Input size = 4096
- Output size = 4096 (same hidden size as LLaMA-2)
Thus, the full weight matrix $W$ is:
\[W \in \mathbb{R}^{4096 \times 4096}\]which contains:
\[4096 \times 4096 = 16,777,216 \text{ (16.8M parameters)}\]
**Applying LoRA with $r = 5$**
LoRA replaces the update with two low-rank matrices:
- $A$ is ** $4096 \times 5$ **
- $B$ is ** $5 \times 4096$ **
Total trainable parameters in LoRA:
\[(4096 \times 5) + (5 \times 4096) = 40,960\]This is 99.75% fewer trainable parameters compared to 16.8M!
**A mistaken setup: $A = 5 \times 1B$ and $B = 1B \times 5$**
That setup would mean:
- ** $A$ ** has 5 billion parameters
- ** $B$ ** has 5 billion parameters
- Together: 10 billion trainable parameters, which defeats LoRA’s efficiency.
Instead, LoRA is applied to individual weight matrices inside the model, not the entire model at once.
Summary
- Each model has multiple weight matrices (not a single giant matrix).
- LoRA is applied per matrix, reducing trainable parameters.
- ** $r$ (LoRA rank) is much smaller than full model size** to save memory.
How LoRA is Applied to Attention Layers in LLaMA-2
LoRA is mainly applied to the query, key, and value projection matrices inside the self-attention mechanism of transformer models like LLaMA-2.
1️⃣ Self-Attention Overview
In LLaMA-2 (and other transformers), the attention mechanism transforms an input sequence into three matrices:
\[Q = X W_Q, \quad K = X W_K, \quad V = X W_V\]where:
- $X$ = input embeddings (batch size × sequence length × hidden dimension)
- $W_Q, W_K, W_V$ = weight matrices for query, key, and value projections.
- Each weight matrix size is typically ** $d \times d$ ** (e.g., 4096 × 4096 for LLaMA-2-7B).
Thus, the total parameters in the Q, K, and V projections of a single attention layer (without LoRA) is:
\[3 \times (4096 \times 4096) \approx 50.3M \text{ parameters}\]Across the 32 transformer layers of LLaMA-2-7B, this adds up quickly!
2️⃣ How LoRA Modifies These Layers
Instead of fine-tuning all 49M parameters in attention layers, LoRA only fine-tunes a low-rank update:
\[W'_Q = W_Q + A_Q B_Q\] \[W'_K = W_K + A_K B_K\] \[W'_V = W_V + A_V B_V\]Where:
- $A$ is $d \times r$ (e.g., 4096 × 64 for LoRA rank $r=64$)
- $B$ is $r \times d$ (e.g., 64 × 4096)
Total trainable parameters per weight matrix:
\[(4096 \times 64) + (64 \times 4096) = 524,288 \text{ parameters}\]Compared to full tuning (16.8M per matrix), this is a huge reduction!
For all three matrices (Q, K, V): about 1.57M trainable parameters instead of roughly 50M.
3️⃣ Why LoRA Works for Attention
- Freezes the pre-trained model – Only the small $A$ and $B$ matrices get trained.
- Saves memory & compute – Much smaller updates compared to full fine-tuning.
- Scales well to large models – Fine-tunes large models efficiently without high GPU costs.
4️⃣ Visualization of LoRA Applied to Self-Attention
Without LoRA: Full Fine-Tuning
\[W_Q = \begin{bmatrix} w_{11} & w_{12} & ... & w_{1d} \\ w_{21} & w_{22} & ... & w_{2d} \\ ... & ... & ... & ... \\ w_{d1} & w_{d2} & ... & w_{dd} \end{bmatrix}\]✅ All ** $d \times d$ ** parameters are trainable → High memory, slow training.
With LoRA: Low-Rank Updates
\[A_Q = \begin{bmatrix} a_{11} & a_{12} & ... & a_{1r} \\ a_{21} & a_{22} & ... & a_{2r} \\ ... & ... & ... & ... \\ a_{d1} & a_{d2} & ... & a_{dr} \end{bmatrix}\] \[B_Q = \begin{bmatrix} b_{11} & b_{12} & ... & b_{1d} \\ b_{21} & b_{22} & ... & b_{2d} \\ ... & ... & ... & ... \\ b_{r1} & b_{r2} & ... & b_{rd} \end{bmatrix}\]✅ Only ** $d \times r + r \times d$ ** trainable parameters → Efficient fine-tuning!
5️⃣ Summary: Why LoRA is Powerful
Method | Trainable Parameters (per matrix) | Total for Q, K, V (per layer) |
---|---|---|
Full Fine-Tuning | 16.8M | 50.3M |
LoRA (r=64) | 524K | 1.57M |
Reduction | 96.9% fewer parameters | 96.9% fewer parameters |
🚀 LoRA makes fine-tuning large models practical while maintaining performance!
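In code, this per-matrix adaptation is configured by listing the attention projections as LoRA target modules. A minimal sketch with peft; the module names q_proj / k_proj / v_proj follow the Hugging Face LLaMA implementation, and the model loading itself is omitted here:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj"],  # attention projections in HF LLaMA
    task_type="CAUSAL_LM",
)

# Assuming `model` is an already-loaded LLaMA-2 model:
# model = get_peft_model(model, lora_config)
# model.print_trainable_parameters()  # reports the small trainable fraction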
Next Steps
- A hands-on code example applying LoRA to a LLaMA-2 model? 🔥
- A deeper dive into LoRA hyperparameters like lora_alpha and lora_dropout? 🎯
4-bit Quantization in LLaMA-2 Fine-Tuning
Quantization is crucial for reducing memory usage when fine-tuning large models like LLaMA-2-7B. bitsandbytes (bnb) enables efficient 4-bit quantization, significantly reducing GPU requirements.
1️⃣ What is use_4bit?
🚀 use_4bit = True enables 4-bit quantization, meaning each weight in the model is stored using only 4 bits instead of 16 or 32 bits.
Memory Reduction Example
Suppose your model has 7 billion parameters (LLaMA-2-7B).
Normally, in fp16 (16-bit precision):
\[7B \times 16\text{-bit} = 14GB\]In 4-bit quantization:
\[7B \times 4\text{-bit} = 3.5GB\]
🔹 ✅ 4× memory reduction → Fits on smaller GPUs (e.g., a single A100).
2️⃣ What is bnb_4bit_compute_dtype?
📌 bnb_4bit_compute_dtype specifies the data type for actual calculations (even if weights are stored in 4-bit).
- Common options:
  - "float16" → fast and memory-efficient (default, best for most GPUs)
  - "bfloat16" → more stable on newer GPUs like the A100
  - "float32" → more precise but high memory usage
💡 Why is this needed?
- Weights are stored in 4-bit precision, but computations need a higher precision.
- Example:
- Stored 4-bit weights are converted to float16 or bfloat16 before being used in matrix multiplications.
3️⃣ What is bnb_4bit_quant_type?
🚀 4-bit quantization means fewer bits, but how they are stored affects model quality.
There are two main formats:
(1) FP4 (Float 4)
- Standard 4-bit quantization using a small floating-point representation.
- Has lower precision and can lead to some performance loss.
(2) NF4 (Normalized Float 4)
- Better for LLMs!
- NF4 preserves more information by distributing values better across 4-bit representation.
- Improves model performance and reduces loss of expressiveness compared to FP4.
📌 Example Intuition:
Imagine representing the numbers [0.1, 0.5, 1.0, 2.0, 4.0] with just 4 bits.
- FP4: spreads its few representable levels across the whole range, so the small values near zero are stored coarsely (less optimized).
- NF4: normalizes the values and concentrates its levels where most weights live, capturing more useful information.
✅ bnb_4bit_quant_type = "nf4" is recommended for LLaMA-2 fine-tuning.
4️⃣ What is use_nested_quant?
📌 use_nested_quant = True applies double quantization: the constants produced by the first quantization step are themselves quantized to save additional memory.
- Default is False, because:
  - Nested quantization saves a bit more memory but can reduce precision.
  - It is usually not needed unless working in very low-memory environments.
🔹 Example:
- Standard 4-bit quantization converts weights from 16-bit → 4-bit.
- Nested quantization additionally compresses the per-block quantization constants, saving a little more memory.
🚀 Recommendation: Keep use_nested_quant = False unless GPU memory is extremely limited.
5️⃣ Why 4-bit Quantization Matters
Precision | Storage Size (LLaMA-2-7B) | Benefits |
---|---|---|
FP32 (default training) | 28GB | Best accuracy, highest memory usage |
FP16 (half-precision) | 14GB | Saves memory, faster computation |
4-bit FP4 | 3.5GB | 4× smaller, some performance loss |
4-bit NF4 | 3.5GB | 4× smaller, better accuracy than FP4 |
📌 Using bnb_4bit_compute_dtype="float16" with nf4 provides the best balance between memory and model quality.
🚀 TL;DR - Recommended Settings for LoRA Fine-Tuning
use_4bit = True # Enable 4-bit quantization
bnb_4bit_compute_dtype = "float16" # Efficient computation
bnb_4bit_quant_type = "nf4" # Normalized Float 4 (better than FP4)
use_nested_quant = False # Avoid extra quantization unless needed
✅ This setup allows fine-tuning LLaMA-2-7B on a single A100 (40GB) GPU! 🚀
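To verify the memory savings in practice, you can load the model with this config and check its footprint (a rough sketch; the reported number will vary slightly with model revision and library versions):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
)

model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map={"": 0},
)

# get_memory_footprint() reports parameter and buffer memory in bytes
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")  # roughly 3.5-4 GB for a 7B model in 4-bit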
How NF4 (Normalized Float 4) Represents Values
NF4 (Normalized Float 4) is a specialized 4-bit floating-point quantization format designed to retain more useful information compared to standard FP4 (Float 4). It does this by using a non-uniform distribution of quantized values.
1️⃣ Why Standard 4-bit FP4 Fails?
4-bit floating-point quantization (FP4) evenly distributes numbers in the range [min, max], but most weights in LLMs are centered around small values (near 0).
Example Problem in FP4:
- Let’s say the actual weights range from -1 to +1.
- Standard FP4 might evenly assign 16 possible values linearly.
- But in real LLMs, most values are concentrated around 0.0 (small magnitudes).
2️⃣ How NF4 Fixes This?
Instead of using linear spacing like FP4, NF4 clusters more values around 0, making it more adaptive to LLM weight distributions.
💡 NF4 uses 16 pre-selected floating-point values, chosen based on LLM weight statistics.
NF4 16 Discrete Values
Index | NF4 Value |
---|---|
0000 | -1.00 |
0001 | -0.696 |
0010 | -0.525 |
0011 | -0.394 |
0100 | -0.274 |
0101 | -0.160 |
0110 | -0.082 |
0111 | -0.020 |
1000 | +0.020 |
1001 | +0.082 |
1010 | +0.160 |
1011 | +0.274 |
1100 | +0.394 |
1101 | +0.525 |
1110 | +0.696 |
1111 | +1.00 |
🔹 Why is this better?
- The smallest values are closer to zero (capturing fine details).
- Weights close to zero get better precision, unlike FP4, which wastes resolution on large values.
- This better represents LLM weights, improving accuracy over FP4.
3️⃣ NF4 vs. FP4 Intuition
Imagine compressing these values into 4-bit precision:
📌 Example Weight Distribution (Before Quantization)
[-0.9, -0.5, -0.3, -0.1, 0.0, +0.1, +0.3, +0.5, +0.9]
🔴 FP4 (Linear Quantization)
[-1.0, -0.75, -0.5, -0.25, 0.0, +0.25, +0.5, +0.75, +1.0]
❌ Problem: Most LLM weights are near zero, but FP4 wastes precision on large numbers.
🟢 NF4 (Adaptive Quantization)
[-1.0, -0.696, -0.525, -0.394, -0.274, -0.160, -0.082, -0.020,
+0.020, +0.082, +0.160, +0.274, +0.394, +0.525, +0.696, +1.0]
✅ Advantage: NF4 prioritizes small values, making it better suited for LLMs! 🚀
4️⃣ TL;DR - Why NF4 is Better?
Feature | FP4 (Float 4-bit) | NF4 (Normalized Float 4-bit) |
---|---|---|
Spacing | Linear | More values near 0 |
Adapted to LLMs? | ❌ No | ✅ Yes |
Preserves small weights? | ❌ No | ✅ Yes |
Accuracy vs. FP16? | Lower | Higher |
📌 NF4 is specifically designed for LLM quantization, keeping accuracy high even with aggressive compression.
How NF4 Quantization is Calculated?
NF4 (Normalized Float 4) maps full-precision values (e.g., FP16 or FP32) to one of 16 predefined NF4 values. Unlike standard 4-bit floating point, which distributes values linearly, NF4 assigns more precision to small magnitudes.
1️⃣ Steps in NF4 Quantization Process
Step 1: Define the NF4 Quantization Table
NF4 only supports 16 fixed values, selected based on real LLM weight distributions:
\[\text{NF4 Values} = \{-1.00, -0.696, -0.525, -0.394, -0.274, -0.160, -0.082, -0.020, 0.020, 0.082, 0.160, 0.274, 0.394, 0.525, 0.696, 1.00\}\]Each number is assigned a 4-bit code (0000 to 1111).
Step 2: Map Each Weight to the Closest NF4 Value
For each weight $w$ in the model:
- Find the nearest NF4 value from the table.
- Replace $w$ with the corresponding NF4 value.
- Store only the 4-bit code instead of the full number.
2️⃣ Example Calculation (Manually)
Given a set of real model weights:
[-0.85, -0.52, -0.1, 0.0, 0.15, 0.7]
Step 1: Find the Closest NF4 Value for Each Weight
Original Weight | Closest NF4 Value | 4-bit Encoding |
---|---|---|
-0.85 | -1.00 | 0000 |
-0.52 | -0.525 | 0010 |
-0.10 | -0.082 | 0110 |
0.00 | -0.020 | 0111 |
0.15 | 0.160 | 1010 |
0.70 | 0.696 | 1110 |
3️⃣ Python Code for NF4 Quantization
Now, let’s write a Python function to perform NF4 quantization.
import numpy as np

# NF4 predefined values
NF4_VALUES = np.array([
    -1.00, -0.696, -0.525, -0.394, -0.274, -0.160, -0.082, -0.020,
    0.020, 0.082, 0.160, 0.274, 0.394, 0.525, 0.696, 1.00
])

def quantize_to_nf4(weights):
    """
    Quantizes an array of weights to NF4 format by mapping each weight
    to the closest NF4 predefined value.
    """
    quantized_weights = []
    quantized_codes = []
    for w in weights:
        # Find the closest NF4 value
        closest_index = np.argmin(np.abs(NF4_VALUES - w))
        closest_value = NF4_VALUES[closest_index]
        quantized_weights.append(closest_value)
        quantized_codes.append(closest_index)  # Store the 4-bit code (0-15)
    return np.array(quantized_weights), np.array(quantized_codes)

# Example weights
weights = np.array([-0.85, -0.52, -0.1, 0.0, 0.15, 0.7])

# Apply NF4 quantization
quantized_weights, quantized_codes = quantize_to_nf4(weights)

print("Original Weights:", weights)
print("Quantized Weights:", quantized_weights)
print("4-bit Codes:", quantized_codes)
4️⃣ How is NF4 Dequantized?
Once the model is quantized using NF4, dequantization simply replaces the stored 4-bit code with the original NF4 value.
def dequantize_nf4(quantized_codes):
    """
    Dequantizes NF4 codes back to NF4 values.
    """
    return NF4_VALUES[quantized_codes]

# Dequantize to get back NF4 values
dequantized_weights = dequantize_nf4(quantized_codes)
print("Dequantized Weights:", dequantized_weights)
This restores the weights using the NF4 representation.
5️⃣ TL;DR: Key Takeaways
✅ NF4 is non-uniform: More precision around small values (close to zero).
✅ Quantization process: Replace full-precision weights with the nearest NF4 value.
✅ Dequantization: Convert back using stored 4-bit codes.
✅ NF4 reduces memory by 4x while keeping accuracy high for LLMs.
Understanding per_device_train_batch_size & per_device_eval_batch_size
These parameters define how many samples (or examples) are processed per GPU during training and evaluation.
1️⃣ What is Batch Size?
Batch size determines how many examples from the dataset are processed at once before updating model weights.
- Larger batch sizes → Faster training but more GPU memory usage.
- Smaller batch sizes → More stable training but slower convergence.
💡 Example:
If you have a dataset with 10,000 examples and set batch size = 4, the model will process 4 examples at a time before updating weights.
2️⃣ How per_device_train_batch_size Works
This is the number of samples per GPU per training step.
🔹 Example with 1 GPU:
per_device_train_batch_size = 4
num_gpus = 1
effective_batch_size = per_device_train_batch_size * num_gpus
# 4 * 1 = 4
👉 The model processes 4 samples per step.
🔹 Example with 4 GPUs:
per_device_train_batch_size = 4
num_gpus = 4
effective_batch_size = per_device_train_batch_size * num_gpus
# 4 * 4 = 16
👉 Each GPU processes 4 samples, so the total batch size is 16.
3️⃣ How per_device_eval_batch_size Works
This is the number of samples per GPU per evaluation step.
🔹 Example:
If you set:
per_device_eval_batch_size = 8
num_gpus = 2
👉 The evaluation batch size will be:
\[8 (\text{per GPU}) \times 2 (\text{GPUs}) = 16\]The model processes 16 examples at a time during evaluation.
4️⃣ Key Takeaways
✅ Training batch size (per_device_train_batch_size) defines how many samples are used per GPU during training.
✅ Evaluation batch size (per_device_eval_batch_size) defines how many samples are used per GPU during model validation.
✅ Total effective batch size = per_device_batch_size × num_gpus × gradient_accumulation_steps.
5️⃣ Example Calculation With Gradient Accumulation
Scenario:
- 2 GPUs
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
🔹 Total batch size per step:
\[4 (\text{per GPU}) \times 2 (\text{GPUs}) = 8\]🔹 Effective batch size (after accumulation):
\[8 \times 2 (\text{grad accumulation}) = 16\]👉 The model updates after seeing 16 examples instead of 8.
Optimization and Efficiency in Fine-Tuning LLaMA 2
When fine-tuning a large language model, optimizing memory and training efficiency is critical. Let’s break down these parameters with detailed explanations and examples.
1️⃣ gradient_accumulation_steps
👉 What it does:
Normally, the model updates its weights after each batch.
With gradient accumulation, instead of updating after every batch, it accumulates gradients over multiple steps before updating.
🔹 Why is it useful?
- Allows training with smaller batch sizes (helpful when GPU memory is limited).
- Effectively increases the batch size without requiring more GPU memory.
🔹 Example Calculation:
Imagine:
- per_device_train_batch_size = 4
- gradient_accumulation_steps = 2
- 2 GPUs available
Total effective batch size:
\[4 (\text{per GPU}) \times 2 (\text{GPUs}) \times 2 (\text{grad accumulation}) = 16\]👉 The model sees 16 samples before updating weights, even though the actual batch size is just 4 per GPU.
🔹 Effect:
- ✅ Allows training larger models with limited memory.
- ✅ Reduces variance in updates, leading to smoother training.
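Under the hood, this is what the Trainer does. A minimal PyTorch sketch with a stand-in model and random data:

import torch

model = torch.nn.Linear(10, 2)                     # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
loss_fn = torch.nn.MSELoss()
accumulation_steps = 2

for step in range(100):
    inputs, targets = torch.randn(4, 10), torch.randn(4, 2)  # per-device batch of 4
    loss = loss_fn(model(inputs), targets)
    (loss / accumulation_steps).backward()         # scale so accumulated gradients average out

    if (step + 1) % accumulation_steps == 0:       # update only every N micro-batches
        optimizer.step()
        optimizer.zero_grad()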
2️⃣ gradient_checkpointing
👉 What it does:
Instead of storing all intermediate activations in memory, it recomputes them on demand to save GPU memory.
🔹 Why is it useful?
- Saves memory at the cost of extra computation.
- Useful when fine-tuning very large models (e.g., LLaMA-2-13B, LLaMA-2-70B).
🔹 Example: Without gradient checkpointing:
- All intermediate activations are stored → High memory usage.
With gradient checkpointing:
- Only a subset of activations is stored → Memory savings but extra computation.
💡 Trade-off
- ✅ More memory-efficient.
- ❌ Slightly slower training due to recomputation.
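With Hugging Face models this is a single switch: the Trainer enables it automatically when gradient_checkpointing=True, or you can turn it on directly on the model (a sketch; disabling the KV cache is a common companion step during training):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
model.gradient_checkpointing_enable()  # recompute activations during the backward pass
model.config.use_cache = False         # the generation KV cache conflicts with checkpointing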
3️⃣ max_grad_norm (Gradient Clipping)
👉 What it does:
Clips the gradient if it gets too large, preventing unstable updates.
🔹 Why is it useful?
- Prevents exploding gradients (which can cause training to fail).
- Stabilizes training, especially for large models.
🔹 Example Calculation:
- Suppose a model computes a gradient norm of 5.0, but max_grad_norm = 0.3. The gradient is scaled down proportionally: \(\text{new gradient} = \frac{0.3}{5.0} \times \text{original gradient}\)
- This prevents large updates that can destabilize training.
4️⃣ learning_rate
👉 What it does:
Controls how much the model updates its weights after each step.
🔹 Why is it important?
- Too high (e.g., 1e-2) → Model diverges (fails to learn).
- Too low (e.g., 1e-6) → Training is too slow.
🔹 Example:
If learning_rate = 2e-4:
- The model updates its weights by 0.0002 of the computed gradient each step.
💡 Tip:
- ✅ Start with 2e-4 (standard for LLaMA 2 fine-tuning).
- ✅ Use learning rate scheduling for gradual adjustments.
5️⃣ weight_decay
👉 What it does:
Adds a penalty to large weights to prevent overfitting.
🔹 Why is it useful?
- Encourages smaller weight values, which generalize better.
🔹 Example:
- Suppose weight decay = 0.001.
- At each step, the optimizer shrinks the weights slightly (ignoring the learning-rate factor that AdamW also applies): \(\text{updated weight} = \text{original weight} - 0.001 \times \text{original weight}\)
💡 Tip:
- ✅ Use 0.001 for general fine-tuning.
- ✅ Set lower values (0.0001) if the dataset is small.
6️⃣ optim (Optimizer)
👉 What it does:
Defines how gradients are used to update model weights.
🔹 Why is paged_adamw_32bit used?
- Memory-efficient AdamW optimizer (optimized for large models).
- “Paged” version helps with offloading unused data to CPU to save GPU memory.
🔹 Example: A regular AdamW optimizer keeps all of its optimizer states in GPU memory.
Paged AdamW can page those states out to CPU memory when GPU memory runs short → fewer out-of-memory failures.
💡 Tip:
- ✅ Best optimizer for large LLM fine-tuning.
- ✅ Works well with 4-bit quantization.
7️⃣ lr_scheduler_type (Learning Rate Scheduler)
👉 What it does:
Controls how the learning rate changes during training.
🔹 Why is “cosine” used?
- Gradually reduces learning rate over time for better convergence.
🔹 Example:
- At step 0, learning rate = 2e-4.
- Midway through training → the learning rate is reduced.
- At the end of training, learning rate → close to 0.
📉 Cosine Decay Formula:
\[\eta_t = \eta_0 \times 0.5 \times \left(1 + \cos\left(\frac{t}{T} \pi\right)\right)\]where:
- $\eta_t$ = learning rate at step $t$
- $\eta_0$ = initial learning rate
- $T$ = total training steps
🔹 Graph of Cosine Decay:
📈 Starts high → 📉 gradually decreases → 📉📉 flattens near zero.
💡 Why use cosine?
- ✅ Helps the model converge smoothly.
- ✅ Avoids sudden drops in learning rate.
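The decay formula above is easy to tabulate yourself (warmup not included in this sketch):

import math

def cosine_lr(step, total_steps, base_lr=2e-4):
    # Cosine decay from base_lr at step 0 down to 0 at total_steps
    return base_lr * 0.5 * (1 + math.cos(math.pi * step / total_steps))

for step in (0, 2500, 5000, 7500, 10000):
    print(step, f"{cosine_lr(step, 10000):.2e}")
# step 0 -> 2.00e-04, step 5000 -> 1.00e-04, step 10000 -> 0.00e+00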
🔑 Summary Table
Parameter | Purpose | Example Effect |
---|---|---|
gradient_accumulation_steps | Updates weights after multiple steps instead of every batch | Allows using smaller batch sizes |
gradient_checkpointing | Saves memory by recomputing activations | Uses less memory but increases computation |
max_grad_norm | Clips large gradients to prevent unstable updates | Stops exploding gradients |
learning_rate | Controls how fast the model learns | Higher = faster training, but unstable |
weight_decay | Penalizes large weights to prevent overfitting | Regularizes model weights |
optim | Optimizer type (Paged AdamW) | Memory-efficient training |
lr_scheduler_type | Adjusts learning rate over time | Cosine decay = smooth reduction |
🚀 Key Takeaways
- Gradient accumulation helps train models with small batch sizes.
- Gradient checkpointing saves memory but slows down training slightly.
- Gradient clipping (max_grad_norm) prevents unstable training.
- Learning rate scheduling (cosine) helps smooth training.
- The Paged AdamW optimizer is efficient for large-scale fine-tuning.
Gradient Norm and Gradient Clipping (max_grad_norm)
Gradient norms play a critical role in stabilizing the training of deep learning models, especially large-scale models like LLaMA 2. Let’s go deep into what the gradient norm is, why it matters, and how gradient clipping (via max_grad_norm) helps.
1️⃣ What is the Gradient Norm?
👉 Definition:
The gradient norm is the magnitude of the gradient vector (L2 norm) across all parameters of the model. It tells us how large the overall update to the model weights will be in a given training step.
Mathematically, if $\theta$ represents all model parameters and ** $\nabla L(\theta)$ ** is the gradient of the loss function $L$ , then the L2 norm (Euclidean norm) of the gradient is:
\[\|\nabla L(\theta)\|_2 = \sqrt{\sum_{i} g_i^2}\]where $g_i$ represents the gradient of the $i$ -th parameter.
2️⃣ Why is the Gradient Norm Important?
During training, very large gradients can cause instability:
- Exploding gradients: If the gradient norm becomes too large, the model diverges (weights change too drastically, and loss may become NaN).
- Vanishing gradients: If gradients become too small, the model learns too slowly or stops learning.
💡 Example:
- If the gradient norm = 1000, the model makes huge updates, potentially overshooting the optimal solution.
- If the gradient norm = 0.0001, updates are too small, making training inefficient.
Exploding Gradient Example
Consider training a deep neural network where the gradients keep growing:
Training Step | Gradient Norm |
---|---|
Step 1 | 0.5 |
Step 10 | 5.0 |
Step 20 | 500.0 |
Step 30 | 50,000.0 (exploding!) |
Here, the model diverges, and training becomes unstable.
3️⃣ What is Gradient Clipping (max_grad_norm)?
👉 Gradient clipping limits the gradient norm to a fixed value (max_grad_norm), preventing extreme updates.
🔹 How it works:
- If the gradient norm is below max_grad_norm, do nothing.
- If it exceeds max_grad_norm, scale all gradients down proportionally.
Mathematically, if the gradient norm exceeds a given threshold $c$ , the gradient is rescaled:
\[\nabla L(\theta) = \nabla L(\theta) \times \frac{c}{\|\nabla L(\theta)\|_2}\]where $c$ is max_grad_norm.
4️⃣ Example of Gradient Clipping
Let’s assume:
- Gradients for three parameters: $g = (4.0, 3.0, 1.0)$
- Gradient norm (before clipping): $\sqrt{4^2 + 3^2 + 1^2} = \sqrt{26} \approx 5.1$
- max_grad_norm: 2.0 (we want to cap the norm at 2)
The gradients are scaled down by:
\[\frac{2.0}{5.1} \approx 0.39\]So each gradient component is multiplied by about 0.39, reducing the total norm to 2.0.
💡 Before vs. After Clipping
Parameter | Original Gradient | After Clipping |
---|---|---|
$g_1$ | 4.0 | 1.57 |
$g_2$ | 3.0 | 1.18 |
$g_3$ | 1.0 | 0.39 |
👉 Now, the new gradient norm is 2.0, preventing an unstable update.
5️⃣ Implementing Gradient Clipping in PyTorch
In PyTorch, torch.nn.utils.clip_grad_norm_()
is used to apply gradient clipping:
import torch

# Dummy model
model = torch.nn.Linear(10, 2)

# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

# Sample loss function
loss_fn = torch.nn.MSELoss()

# Simulated training step
for step in range(100):
    optimizer.zero_grad()  # Reset gradients

    # Dummy input and target
    input_data = torch.randn(32, 10)  # Batch size = 32
    target = torch.randn(32, 2)

    # Forward pass
    output = model(input_data)
    loss = loss_fn(output, target)

    # Backward pass
    loss.backward()

    # Apply gradient clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Clips gradients

    # Optimizer step
    optimizer.step()
6️⃣ Effect of Different max_grad_norm Values
Choosing the right max_grad_norm is critical:
max_grad_norm | Effect |
---|---|
No clipping | Risk of exploding gradients, unstable training |
1.0 | Conservative updates, stable learning |
5.0 | Faster learning, but risk of instability |
0.1 | Very slow learning, might hinder progress |
💡 Best Practice:
- ✅ 1.0 - 2.0 is safe for large models (e.g., LLaMA 2).
- ✅ If you notice instability, reduce max_grad_norm.
- ✅ If training is too slow, increase it slightly, but not above 5.
7️⃣ Gradient Clipping in Hugging Face Transformers
In the Hugging Face Trainer, gradient clipping is set using max_grad_norm:
training_args = TrainingArguments(
output_dir="./results",
per_device_train_batch_size=4,
gradient_accumulation_steps=2,
learning_rate=2e-4,
max_grad_norm=1.0, # Gradient clipping
num_train_epochs=3,
)
🔑 Key Takeaways
- Gradient norm measures update magnitude: too high leads to instability, too low slows learning.
- Gradient clipping (max_grad_norm) caps large gradients, preventing training divergence.
- Scaling is proportional: all gradients are adjusted uniformly to maintain direction.
- Common values: 1.0 - 2.0 is ideal for LLaMA 2 fine-tuning.
- PyTorch & Hugging Face provide built-in methods to clip gradients easily.
Training Steps and Logging – Detailed Explanation 🚀
When fine-tuning large models like LLaMA 2, understanding training steps and logging settings helps optimize training efficiency and monitoring. Let’s break each setting down with examples and intuition.
🔹 max_steps: Maximum Training Steps
📌 Definition:
- Controls the total number of training steps (batches) before training stops.
- If max_steps = -1, training is controlled by num_train_epochs instead.
📌 Formula for Total Training Steps:
\[\text{Total Steps} = \frac{\text{Dataset Size} \times \text{Num Epochs}}{\text{Batch Size} \times \text{Gradient Accumulation Steps}}\]
💡 Example 1: Using num_train_epochs (default behavior)
- Dataset size: 10,000 samples
- Batch size: 4
- Epochs: 3
- Gradient Accumulation: 2
\[\text{Total Steps} = \frac{10{,}000 \times 3}{4 \times 2} = 3{,}750\]
💡 Example 2: Setting a fixed max_steps
training_args = TrainingArguments(
max_steps=5000 # Stops after 5000 steps, ignoring num_train_epochs
)
👉 Use Case: If your dataset is too large, you can limit steps instead of epochs.
🔹 warmup_ratio: Learning Rate Warmup
📌 Definition:
- Gradually increases the learning rate from 0 to the target learning rate over a small fraction of training steps.
- Prevents the model from making large updates too early, which can cause instability.
📌 Formula:
\[\text{Warmup Steps} = \text{warmup\_ratio} \times \text{Total Training Steps}\]
💡 Example:
- warmup_ratio = 0.1
- max_steps = 10,000
For the first 1,000 steps, the learning rate gradually increases, avoiding sudden large updates.
training_args = TrainingArguments(
learning_rate=2e-4,
warmup_ratio=0.1, # 10% of steps used for warmup
)
📊 Why is warmup important?
- Without warmup: the model starts training at the full learning rate → unstable updates 🚨
- With warmup: the model slowly adjusts to the learning rate → stable training ✅
🔹 group_by_length: Efficient Batching
📌 Definition:
- Groups input samples of similar length into the same batch.
- Reduces the need for padding, making training faster and more memory-efficient.
💡 Example:
Consider training a model on sentences of different lengths:
Sentence | Length (Tokens) |
---|---|
“Hi” | 2 |
“Hello, how are you?” | 10 |
“The quick brown fox jumps over the lazy dog.” | 12 |
- Without group_by_length: a batch might contain short and long sentences together, leading to a lot of padding (wasting memory).
- With group_by_length: similar-length sentences are grouped together → less padding → more efficient training.
training_args = TrainingArguments(
group_by_length=True # Enables length-based batching
)
📊 When to Use? ✅ If your dataset has variable-length sequences (e.g., text, conversations).
❌ Not needed if your input size is fixed.
🔹 save_steps: Checkpoint Saving Frequency
📌 Definition:
- Saves the model every save_steps training steps to avoid losing progress.
- Setting save_steps=0 disables checkpoint saving.
💡 Example:
If save_steps=500, the model saves a checkpoint every 500 steps.
training_args = TrainingArguments(
save_steps=500 # Save model every 500 steps
)
📊 Why is this useful?
- Prevents data loss if training crashes.
- Allows resuming training from a checkpoint.
- But too frequent saving consumes disk space.
👉 Best Practice:
- Small models: Save every few hundred steps.
- Large models (LLaMA 2, GPT-3, etc.): Save every few thousand steps to avoid disk overflow.
🔹 logging_steps: Log Training Progress
📌 Definition:
- Controls how often the training logs are printed.
- In this setup, logging_steps=25 → logs every 25 steps.
💡 Example Output:
If logging_steps=25, you’ll see logs like this:
Step 25 | Loss: 1.234 | Learning Rate: 1.5e-4
Step 50 | Loss: 1.201 | Learning Rate: 1.8e-4
Step 75 | Loss: 1.190 | Learning Rate: 2.0e-4
training_args = TrainingArguments(
logging_steps=25 # Log progress every 25 steps
)
📊 Why is this useful?
- Helps monitor training progress in real-time.
- If loss starts increasing, you can stop training early.
- If training crashes, you have logs to debug issues.
🔑 Key Takeaways
Setting | What It Does | Best Practice |
---|---|---|
max_steps | Controls training duration (use -1 for epochs) | Use fixed steps for large datasets |
warmup_ratio | Slowly increases learning rate at start | Use 0.05 - 0.1 for stable training |
group_by_length | Groups similar-length inputs for efficiency | Use for NLP/text tasks |
save_steps | Saves model checkpoints every N steps | Use 500-2000 for large models |
logging_steps | Logs progress every N steps | Use 10-50 for quick monitoring |
If your training crashes and you want to resume from the last saved checkpoint, follow these steps:
🔹 1. Locate Your Last Checkpoint
By default, Hugging Face saves checkpoints in the output_dir folder (e.g., ./checkpoint-500, ./checkpoint-1000).
- Find the latest saved checkpoint:
ls -d output_dir/checkpoint-*
Example output:
output_dir/checkpoint-500
output_dir/checkpoint-1000
output_dir/checkpoint-1500 <-- Most recent checkpoint
💡 Note: If training crashes, always resume from the latest checkpoint.
🔹 2. Resume Training from Checkpoint
Modify your training script to load from the latest checkpoint using resume_from_checkpoint:
training_args = TrainingArguments(
    output_dir="./output_dir",  # Where checkpoints are saved
    save_steps=500,             # Save every 500 steps
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
# Passing True tells the Trainer to pick up the latest checkpoint in output_dir
trainer.train(resume_from_checkpoint=True)
✅ This automatically detects the latest checkpoint and resumes training.
🔹 3. Manually Resume from a Specific Checkpoint
If you want to resume from a specific checkpoint (e.g., checkpoint-1500), use:
trainer.train(resume_from_checkpoint="./output_dir/checkpoint-1500")
💡 This is useful if you suspect the last checkpoint was corrupted.
🔹 4. Ensure Consistency in Hyperparameters
When resuming training, do not change critical settings like:
✅ learning_rate
✅ batch_size
✅ gradient_accumulation_steps
Changing them may lead to unexpected behavior.
🔹 5. Verify Resume is Working
Check the logs when restarting training. You should see something like:
Resuming training from checkpoint ./output_dir/checkpoint-1500
This confirms that training continues from where it stopped instead of restarting.
🔹 6. Use save_total_limit to Manage Checkpoints
Checkpoints take up a lot of disk space. To keep only the latest N checkpoints, use:
training_args = TrainingArguments(
save_total_limit=3 # Keeps only the last 3 checkpoints
)
💡 This prevents your disk from filling up while still allowing resumption.
🚀 Summary
Scenario | Solution |
---|---|
Resume from the latest checkpoint | resume_from_checkpoint=True |
Resume from a specific checkpoint | trainer.train(resume_from_checkpoint="path/to/checkpoint") |
Prevent disk overflow | save_total_limit=3 |
Ensure settings are consistent | Keep batch_size , lr , etc. the same |
Let’s go through these parameters in detail with examples and intuition:
1️⃣ max_seq_length: Maximum Token Length
- Defines the maximum number of tokens the model can process per input.
- If None, it defaults to the model’s predefined limit (e.g., 4096 tokens for Llama-2-7B).
- If a sequence is too long, it gets truncated; if too short, it gets padded.
Example:
max_seq_length = 512
- If an input sentence has 700 tokens, it gets truncated to 512.
- If an input sentence has 300 tokens, it may be padded to 512 (depending on the tokenizer).
Why is this important?
- Higher max_seq_length → more context but higher memory usage.
- Lower max_seq_length → faster training but less context available.
💡 Practical Tip: If using group_by_length=True, try reducing max_seq_length to fit more examples per batch.
2️⃣ packing: Packs Multiple Short Examples Together
- If False (default), each example is processed separately.
- If True, multiple short examples are concatenated into a single long sequence to increase efficiency.
Example:
packing = True
If max_seq_length = 512, and we have three short examples:
- Example 1 → 150 tokens
- Example 2 → 200 tokens
- Example 3 → 162 tokens
Without packing (default):
- Each example is processed separately, wasting space on padding.
With packing=True:
- The three examples combine into a single 512-token sequence (150 + 200 + 162 = 512 tokens).
- This reduces padding and wastes less memory.
When to use packing?
- When working with many short text examples (e.g., question-answer pairs).
- When you want to improve training efficiency.
💡 Practical Tip: Only use packing=True if your dataset has many short sequences.
3️⃣ device_map: Loads Model on Specific GPUs
- Controls which GPU(s) the model is loaded on.
- Helpful when using multiple GPUs or limited memory.
Example (Single GPU on GPU 0):
device_map = {"": 0} # Load model on GPU 0
Example (Automatically Distribute Across GPUs):
device_map = "auto" # Let HF automatically distribute across GPUs
- If you have multiple GPUs, it spreads layers across available GPUs.
Example (Manually Assign GPUs):
device_map = {"transformer.wte": 0, "transformer.h.0": 1, "transformer.h.1": 2}
- Embedding layer → GPU 0
- Transformer block 0 → GPU 1
- Transformer block 1 → GPU 2
- Useful when managing memory manually (the module names depend on the model architecture).
💡 Practical Tip:
- If using a single GPU, {"": 0} is fine.
- If using multiple GPUs, device_map="auto" simplifies training.
🚀 Summary
Parameter | Meaning | Example | When to Use? |
---|---|---|---|
max_seq_length | Limits tokens per input | max_seq_length = 512 | Large values retain more context but use more memory |
packing | Packs short examples together | packing = True | If dataset has many short examples, improves efficiency |
device_map | Controls GPU usage | device_map="auto" | Distributes model across GPUs automatically |