Fine-Tuning Llama 2 with QLoRA
If you want to fine-tune Llama 2 efficiently, you should be familiar with the following parameters.
Model and Dataset Parameters
model_name = "NousResearch/Llama-2-7b-chat-hf"
dataset_name = "mlabonne/guanaco-llama2-1k"
new_model = "llama-2-7b-miniguanaco"
- model_name: Specifies the base model from the Hugging Face Hub (LLaMA-2-7B).
- dataset_name: Instruction-tuning dataset used for fine-tuning.
- new_model: Name of the fine-tuned model.
QLoRA Parameters
QLoRA (Quantized LoRA) is a memory-efficient fine-tuning method.
lora_r = 64
lora_alpha = 16
lora_dropout = 0.1
- lora_r: LoRA rank; controls the number of trainable parameters. Higher r increases expressiveness but uses more memory.
- lora_alpha: Scaling factor that adjusts the LoRA weights.
- lora_dropout: Dropout probability in LoRA layers to prevent overfitting.
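For reference, here is a minimal sketch of how these three values are typically passed to a PEFT LoraConfig (the bias and task_type settings are extra assumptions, not part of the list above):

from peft import LoraConfig

peft_config = LoraConfig(
    r=64,                   # lora_r: rank of the low-rank update matrices
    lora_alpha=16,          # lora_alpha: scaling factor for the LoRA update
    lora_dropout=0.1,       # lora_dropout: dropout applied to the LoRA layers
    bias="none",            # assumption: leave bias parameters untouched
    task_type="CAUSAL_LM",  # assumption: causal language modeling task
)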
bitsandbytes Parameters (4-bit Quantization)
Reduces memory usage by quantizing weights to 4-bit precision.
use_4bit = True
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"
use_nested_quant = False
- use_4bit: Enables loading the model in 4-bit precision.
- bnb_4bit_compute_dtype: Data type used for computation (float16 for efficiency).
- bnb_4bit_quant_type: "fp4" is standard 4-bit quantization; "nf4" is Normalized Float 4 (better for LLMs, retains more information).
- use_nested_quant: Applies additional quantization (usually False).
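As a rough sketch, these four flags map onto a transformers BitsAndBytesConfig like this (assuming recent transformers and bitsandbytes versions):

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # use_4bit
    bnb_4bit_compute_dtype=torch.float16,  # bnb_4bit_compute_dtype
    bnb_4bit_quant_type="nf4",             # bnb_4bit_quant_type
    bnb_4bit_use_double_quant=False,       # use_nested_quant
)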
Training Arguments
Defines how the training process runs.
output_dir = "./results"
num_train_epochs = 1
fp16 = False
bf16 = False
per_device_train_batch_size = 4
per_device_eval_batch_size = 4
gradient_accumulation_steps = 1
gradient_checkpointing = True
max_grad_norm = 0.3
learning_rate = 2e-4
weight_decay = 0.001
optim = "paged_adamw_32bit"
lr_scheduler_type = "cosine"
max_steps = -1
warmup_ratio = 0.03
group_by_length = True
save_steps = 0
logging_steps = 25
Key Training Parameters
- output_dir: Where trained models and logs are stored.
- num_train_epochs: Total number of passes over the dataset.
- fp16 / bf16: Mixed-precision training (use bf16=True on A100 GPUs).
- per_device_train_batch_size: Batch size per GPU during training.
- per_device_eval_batch_size: Batch size for evaluation.
Optimization and Efficiency
- gradient_accumulation_steps: Updates weights after accumulating gradients over n steps (useful for small batch sizes).
- gradient_checkpointing: Saves GPU memory by recomputing activations.
- max_grad_norm: Clips gradients to prevent large updates.
- learning_rate: Controls how fast the model learns (2e-4 is standard).
- weight_decay: Prevents overfitting by penalizing large weights.
- optim: Uses the Paged AdamW optimizer (memory-efficient).
- lr_scheduler_type: "cosine" decays the learning rate smoothly.
Training Steps and Logging
- max_steps: If -1, training length is determined by num_train_epochs.
- warmup_ratio: Linearly increases the learning rate at the start of training.
- group_by_length: Groups inputs of similar length to improve training speed.
- save_steps: Controls checkpoint saving (set to 0 to disable).
- logging_steps: Logs progress every 25 steps.
Supervised Fine-Tuning (SFT) Parameters
max_seq_length = None
packing = False
device_map = {"": 0}
- max_seq_length: Maximum token length (defaults to the model's setting).
- packing: Packs multiple short examples together (set to False here).
- device_map: Loads the model on GPU 0.
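Putting all of these pieces together, a minimal end-to-end sketch could look like the following. It assumes the transformers, peft, trl, datasets, and bitsandbytes libraries and follows the older SFTTrainer signature (where tokenizer, max_seq_length, and packing are passed directly); newer trl versions move some of these into an SFTConfig.

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig
from trl import SFTTrainer

model_name = "NousResearch/Llama-2-7b-chat-hf"
dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")

# 4-bit NF4 quantization (bitsandbytes parameters above)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map={"": 0}
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as the padding token

# QLoRA parameters above
peft_config = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.1,
                         bias="none", task_type="CAUSAL_LM")

# Abridged training arguments (see the full list above)
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    max_grad_norm=0.3,
    learning_rate=2e-4,
    weight_decay=0.001,
    optim="paged_adamw_32bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    group_by_length=True,
    logging_steps=25,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",  # assumption: the dataset stores prompts in a "text" column
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_args,
    packing=False,
)
trainer.train()
trainer.model.save_pretrained("llama-2-7b-miniguanaco")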
Summary
Your setup: ✅ Uses QLoRA for efficient fine-tuning.
✅ Loads 4-bit quantized LLaMA-2-7B model with NF4 quantization.
✅ Uses gradient checkpointing to save memory.
✅ Uses Paged AdamW optimizer with a cosine learning rate schedule.
Let’s break these down with hands-on, simple calculations for better intuition.
1. LoRA Rank (lora_r)
LoRA (Low-Rank Adaptation) reduces the number of trainable parameters by decomposing weight matrices into low-rank matrices. Instead of updating the full weight matrix $W$ , we introduce two small matrices $A$ and $B$ such that:
\[W' = W + A B\]where:
- $A$ has shape (original_dim, r)
- $B$ has shape (r, original_dim)
- $r$ is the LoRA rank (lora_r).
Example: LoRA Rank Calculation
Assume the original weight matrix $W$ is 1000 × 1000.
Without LoRA:
- The total trainable parameters = $1000 \times 1000 = 1,000,000$ .
With LoRA (lora_r = 64):
- Matrix $A$ (1000 × 64) → 64,000 parameters
- Matrix $B$ (64 × 1000) → 64,000 parameters
- Total parameters = 64,000 + 64,000 = 128,000 (much smaller!)
Thus, instead of updating 1,000,000 parameters, we update only 128,000, reducing the number of trainable parameters by 87.2%.
👉 Higher lora_r increases the number of trainable parameters, making the model more expressive but requiring more memory.
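A quick way to sanity-check this arithmetic for different ranks (plain Python, no libraries needed):

def lora_trainable_params(d_in, d_out, r):
    # Parameters in A (d_in x r) plus B (r x d_out)
    return d_in * r + r * d_out

full = 1000 * 1000  # 1,000,000 parameters in the original matrix
for r in (8, 16, 64, 128):
    lora = lora_trainable_params(1000, 1000, r)
    print(f"r={r}: {lora:,} trainable params ({100 * (1 - lora / full):.1f}% reduction)")
# r=64 gives 128,000 params, an 87.2% reduction, matching the example above.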
2. LoRA Scaling Factor (lora_alpha)
LoRA scales the newly learned adaptation matrices using lora_alpha to control their impact on the model.
- lora_alpha controls how much influence the new LoRA weights have.
- Higher lora_alpha → more weight on LoRA updates → the model adapts more aggressively.
Example: LoRA Alpha Calculation
Let’s say:
- $W = 1.0$ (original weight value)
- $A B = 0.5$ (LoRA update)
- lora_r = 64
- lora_alpha = 16
LoRA scaling factor:
\[\frac{\alpha}{r} = \frac{16}{64} = 0.25\]Final weight:
\[W' = 1.0 + (0.25 \times 0.5) = 1.0 + 0.125 = 1.125\]
👉 If lora_alpha were 32 instead of 16, the scaling factor would be 32/64 = 0.5, giving W' = 1.0 + (0.5 × 0.5) = 1.25.
🔹 Higher lora_alpha makes the model adapt faster but might cause instability.
🔹 Lower lora_alpha makes adaptation slower but more stable.
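The same calculation in a few lines of Python (illustrative scalars standing in for W and the LoRA update A·B):

W = 1.0          # original weight value
AB = 0.5         # LoRA update
lora_r, lora_alpha = 64, 16

scaling = lora_alpha / lora_r      # 16 / 64 = 0.25
print(W + scaling * AB)            # 1.0 + 0.25 * 0.5 = 1.125

# With lora_alpha = 32 the scaling factor doubles to 0.5, giving 1.25.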
3. LoRA Dropout (lora_dropout)
LoRA applies dropout to its low-rank updates to prevent overfitting.
- If lora_dropout = 0.1, then 10% of the values in A B are randomly set to zero during training.
- This prevents LoRA updates from dominating and helps generalization.
Example: LoRA Dropout Calculation
Assume LoRA produces:
\[A B = [0.4, 0.2, 0.3, 0.6, 0.8]\]
With lora_dropout = 0.4, we randomly drop 40% of the values (2 of the 5):
\[A B = [0.4, 0, 0.3, 0.6, 0]\]
👉 Higher dropout = less risk of overfitting but slower adaptation.
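A small NumPy illustration of the same idea (random mask, so the dropped positions vary from run to run):

import numpy as np

rng = np.random.default_rng(0)
AB = np.array([0.4, 0.2, 0.3, 0.6, 0.8])
p = 0.4                              # lora_dropout

mask = rng.random(AB.shape) >= p     # keep each value with probability 1 - p
print(AB * mask)

# Note: real dropout layers (e.g., torch.nn.Dropout) also rescale the surviving
# values by 1 / (1 - p) during training so the expected magnitude is unchanged.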
🚀 Summary
Parameter | Meaning | Effect |
---|---|---|
lora_r = 64 | LoRA rank (size of low-rank matrices) | Higher r → More trainable params → More memory usage |
lora_alpha = 16 | Scaling factor for LoRA updates | Higher alpha → LoRA updates affect model more |
lora_dropout = 0.1 | Dropout rate for LoRA layers | Higher dropout → Prevents overfitting but slows adaptation |
The 1000×1000 matrix in the example above is a weight matrix in a neural network, representing model parameters. Let’s break it down further.
Where Does 1000×1000 Come From?
In a transformer-based model like LLaMA, weight matrices appear in layers like attention heads, MLPs, and embedding layers.
Each weight matrix $W$ maps input features (dimensionality) to output features (next layer). The size of $W$ depends on:
- Input dimension (e.g., hidden size in transformers)
- Output dimension (e.g., next layer’s size)
Example: Transformer Layer Weight Matrix
A transformer layer has query, key, and value projections for attention:
\[Q = X W_Q, \quad K = X W_K, \quad V = X W_V\]If:
- Hidden size = 1000 (feature size of each token representation)
- Weight matrix = 1000×1000 (mapping from 1000 input features to 1000 output features)
Then, each of $W_Q, W_K, W_V$ is 1000×1000.
Similarly, the feedforward layer (MLP block) has:
- Weight matrices of shape 1000 × 4000 (expansion)
- 4000 × 1000 (projection back)
Each transformer layer contains multiple such weight matrices.
How Does LoRA Reduce Parameters?
Instead of updating all weights in these matrices, LoRA replaces a full update with **two smaller matrices $A$ and $B$ **:
\[W' = W + A B\]where:
- $A$ is 1000×r (e.g., r=64, so 1000×64)
- $B$ is r×1000 (e.g., r=64, so 64×1000)
Thus:
- Original parameters per matrix: 1000×1000 = 1,000,000
- LoRA trainable parameters: (1000×64) + (64×1000) = 128,000
- Parameter reduction: 87.2% fewer trainable parameters!
Why Use LoRA?
- Fewer trainable parameters → Saves memory.
- Efficient fine-tuning → You don’t need to modify the entire model.
- Works with large models → Ideal for LLaMA-2, GPT, etc.
A common point of confusion is how LoRA modifies weight matrices in large models like LLaMA-2, so let’s clarify.
LoRA Breakdown with Example
In a neural network, weights are usually stored as matrices. Instead of modifying the full weight matrix $W$ , LoRA adds a low-rank decomposition:
\[W' = W + A B\]where:
- ** $A$ (d × r)**: Small matrix (reduces dimensionality)
- ** $B$ (r × d)**: Small matrix (projects back)
- ** $r$ (LoRA rank)**: Controls how much of $W$ can be learned.
1B Parameters Model Example
A model with 1 billion (1B) parameters means:
- The model has multiple layers, each with several weight matrices.
- Not all 1B parameters belong to a single weight matrix!
Instead, let’s assume a **single layer weight matrix $W$ ** has:
- Input size = 4096
- Output size = 4096 (same hidden size as LLaMA-2)
Thus, the full weight matrix $W$ is:
\[W \in \mathbb{R}^{4096 \times 4096}\]which contains:
\[4096 \times 4096 = 16,777,216 \text{ (16.8M parameters)}\]
**Applying LoRA with $r = 5$**
LoRA replaces the update with two low-rank matrices:
- $A$ is ** $4096 \times 5$ **
- $B$ is ** $5 \times 4096$ **
Total trainable parameters in LoRA:
\[(4096 \times 5) + (5 \times 4096) = 40,960\]This is 99.75% fewer trainable parameters compared to 16.8M!
**A mistaken setup: $A = 5 \times 1B$ and $B = 1B \times 5$**
That setup would mean:
- ** $A$ ** has 5 billion parameters
- ** $B$ ** has 5 billion parameters
- Together: 10 billion trainable parameters, which defeats LoRA’s efficiency.
Instead, LoRA is applied to individual weight matrices inside the model, not the entire model at once.
Summary
- Each model has multiple weight matrices (not a single giant matrix).
- LoRA is applied per matrix, reducing trainable parameters.
- ** $r$ (LoRA rank) is much smaller than full model size** to save memory.
How LoRA is Applied to Attention Layers in LLaMA-2
LoRA is mainly applied to the query, key, and value projection matrices inside the self-attention mechanism of transformer models like LLaMA-2.
1️⃣ Self-Attention Overview
In LLaMA-2 (and other transformers), the attention mechanism transforms an input sequence into three matrices:
\[Q = X W_Q, \quad K = X W_K, \quad V = X W_V\]where:
- $X$ = input embeddings (batch size × sequence length × hidden dimension)
- $W_Q, W_K, W_V$ = weight matrices for query, key, and value projections.
- Each weight matrix size is typically ** $d \times d$ ** (e.g., 4096 × 4096 for LLaMA-2-7B).
Thus, the total parameters in the Q, K, and V projections of a single attention layer (without LoRA) is:
\[3 \times (4096 \times 4096) \approx 50.3M \text{ parameters}\]Across the 32 transformer layers of LLaMA-2-7B, this adds up quickly!
2️⃣ How LoRA Modifies These Layers
Instead of fine-tuning all 49M parameters in attention layers, LoRA only fine-tunes a low-rank update:
\[W'_Q = W_Q + A_Q B_Q\] \[W'_K = W_K + A_K B_K\] \[W'_V = W_V + A_V B_V\]Where:
- $A$ is $d \times r$ (e.g., 4096 × 64 for LoRA rank $r=64$)
- $B$ is $r \times d$ (e.g., 64 × 4096)
Total trainable parameters per weight matrix:
\[(4096 \times 64) + (64 \times 4096) = 524,288 \text{ parameters}\]Compared to full tuning (16.8M per matrix), this is a huge reduction!
For all three matrices (Q, K, V): about 1.57M trainable parameters instead of roughly 50M.
3️⃣ Why LoRA Works for Attention
- Freezes the pre-trained model – Only the small $A$ and $B$ matrices get trained.
- Saves memory & compute – Much smaller updates compared to full fine-tuning.
- Scales well to large models – Fine-tunes large models efficiently without high GPU costs.
4️⃣ Visualization of LoRA Applied to Self-Attention
Without LoRA: Full Fine-Tuning
\[W_Q = \begin{bmatrix} w_{11} & w_{12} & ... & w_{1d} \\ w_{21} & w_{22} & ... & w_{2d} \\ ... & ... & ... & ... \\ w_{d1} & w_{d2} & ... & w_{dd} \end{bmatrix}\]✅ All ** $d \times d$ ** parameters are trainable → High memory, slow training.
With LoRA: Low-Rank Updates
\[A_Q = \begin{bmatrix} a_{11} & a_{12} & ... & a_{1r} \\ a_{21} & a_{22} & ... & a_{2r} \\ ... & ... & ... & ... \\ a_{d1} & a_{d2} & ... & a_{dr} \end{bmatrix}\] \[B_Q = \begin{bmatrix} b_{11} & b_{12} & ... & b_{1d} \\ b_{21} & b_{22} & ... & b_{2d} \\ ... & ... & ... & ... \\ b_{r1} & b_{r2} & ... & b_{rd} \end{bmatrix}\]✅ Only ** $d \times r + r \times d$ ** trainable parameters → Efficient fine-tuning!
5️⃣ Summary: Why LoRA is Powerful
Method | Trainable Parameters (per matrix) | Total for Q, K, V (per layer) |
---|---|---|
Full Fine-Tuning | 16.8M | 50.3M |
LoRA (r=64) | 524K | 1.57M |
Reduction | 96.9% fewer parameters | 96.9% fewer parameters |
🚀 LoRA makes fine-tuning large models practical while maintaining performance!
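In code, this per-matrix adaptation is configured by listing the attention projections as LoRA target modules. A minimal sketch with peft; the module names q_proj / k_proj / v_proj follow the Hugging Face LLaMA implementation, and the model loading itself is omitted here:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj"],  # attention projections in HF LLaMA
    task_type="CAUSAL_LM",
)

# Assuming `model` is an already-loaded LLaMA-2 model:
# model = get_peft_model(model, lora_config)
# model.print_trainable_parameters()  # reports the small trainable fraction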
Next Steps
- A hands-on code example applying LoRA to a LLaMA-2 model? 🔥
- A deeper dive into LoRA hyperparameters like lora_alpha and lora_dropout? 🎯
4-bit Quantization in LLaMA-2 Fine-Tuning
Quantization is crucial for reducing memory usage when fine-tuning large models like LLaMA-2-7B. bitsandbytes (bnb) enables efficient 4-bit quantization, significantly reducing GPU requirements.
1️⃣ What is use_4bit?
🚀 use_4bit = True enables 4-bit quantization, meaning each weight in the model is stored using only 4 bits instead of 16 or 32 bits.
Memory Reduction Example
Suppose your model has 7 billion parameters (LLaMA-2-7B).
Normally, in fp16 (16-bit precision):
\[7B \times 16\text{-bit} = 14GB\]In 4-bit quantization:
\[7B \times 4\text{-bit} = 3.5GB\]
🔹 ✅ 4× memory reduction → Fits on smaller GPUs (e.g., a single A100).
2️⃣ What is bnb_4bit_compute_dtype?
📌 bnb_4bit_compute_dtype specifies the data type for actual calculations (even if weights are stored in 4-bit).
- Common options:
  - "float16" → fast and memory-efficient (default, best for most GPUs)
  - "bfloat16" → more stable on newer GPUs like the A100
  - "float32" → more precise but high memory usage
💡 Why is this needed?
- Weights are stored in 4-bit precision, but computations need a higher precision.
- Example:
- Stored 4-bit weights are converted to float16 or bfloat16 before being used in matrix multiplications.
3️⃣ What is bnb_4bit_quant_type?
🚀 4-bit quantization means fewer bits, but how they are stored affects model quality.
There are two main formats:
(1) FP4 (Float 4)
- Standard 4-bit quantization using a small floating-point representation.
- Has lower precision and can lead to some performance loss.
(2) NF4 (Normalized Float 4)
- Better for LLMs!
- NF4 preserves more information by distributing values better across 4-bit representation.
- Improves model performance and reduces loss of expressiveness compared to FP4.
📌 Example Intuition:
Imagine representing the numbers [0.1, 0.5, 1.0, 2.0, 4.0] with just 4 bits.
- FP4: spreads its few representable levels across the whole range, so the small values near zero are stored coarsely (less optimized).
- NF4: normalizes the values and concentrates its levels where most weights live, capturing more useful information.
✅ bnb_4bit_quant_type = "nf4" is recommended for LLaMA-2 fine-tuning.
4️⃣ What is use_nested_quant?
📌 use_nested_quant = True applies double quantization: the constants produced by the first quantization step are themselves quantized to save additional memory.
- Default is False, because:
  - Nested quantization saves a bit more memory but can reduce precision.
  - It is usually not needed unless working in very low-memory environments.
🔹 Example:
- Standard 4-bit quantization converts weights from 16-bit → 4-bit.
- Nested quantization additionally compresses the per-block quantization constants, saving a little more memory.
🚀 Recommendation: Keep use_nested_quant = False unless GPU memory is extremely limited.
5️⃣ Why 4-bit Quantization Matters
Precision | Storage Size (LLaMA-2-7B) | Benefits |
---|---|---|
FP32 (default training) | 28GB | Best accuracy, highest memory usage |
FP16 (half-precision) | 14GB | Saves memory, faster computation |
4-bit FP4 | 3.5GB | 4× smaller, some performance loss |
4-bit NF4 | 3.5GB | 4× smaller, better accuracy than FP4 |
📌 Using bnb_4bit_compute_dtype="float16" with nf4 provides the best balance between memory and model quality.
🚀 TL;DR - Recommended Settings for LoRA Fine-Tuning
use_4bit = True # Enable 4-bit quantization
bnb_4bit_compute_dtype = "float16" # Efficient computation
bnb_4bit_quant_type = "nf4" # Normalized Float 4 (better than FP4)
use_nested_quant = False # Avoid extra quantization unless needed
✅ This setup allows fine-tuning LLaMA-2-7B on a single A100 (40GB) GPU! 🚀
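To verify the memory savings in practice, you can load the model with this config and check its footprint (a rough sketch; the reported number will vary slightly with model revision and library versions):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
)

model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map={"": 0},
)

# get_memory_footprint() reports parameter and buffer memory in bytes
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")  # roughly 3.5-4 GB for a 7B model in 4-bit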
How NF4 (Normalized Float 4) Represents Values
NF4 (Normalized Float 4) is a specialized 4-bit floating-point quantization format designed to retain more useful information compared to standard FP4 (Float 4). It does this by using a non-uniform distribution of quantized values.
1️⃣ Why Standard 4-bit FP4 Fails?
4-bit floating-point quantization (FP4) evenly distributes numbers in the range [min, max], but most weights in LLMs are centered around small values (near 0).
Example Problem in FP4:
- Let’s say the actual weights range from -1 to +1.
- Standard FP4 might evenly assign 16 possible values linearly.
- But in real LLMs, most values are concentrated around 0.0 (small magnitudes).
2️⃣ How NF4 Fixes This?
Instead of using linear spacing like FP4, NF4 clusters more values around 0, making it more adaptive to LLM weight distributions.
💡 NF4 uses 16 pre-selected floating-point values, chosen based on LLM weight statistics.
NF4 16 Discrete Values
Index | NF4 Value |
---|---|
0000 | -1.00 |
0001 | -0.696 |
0010 | -0.525 |
0011 | -0.394 |
0100 | -0.274 |
0101 | -0.160 |
0110 | -0.082 |
0111 | -0.020 |
1000 | +0.020 |
1001 | +0.082 |
1010 | +0.160 |
1011 | +0.274 |
1100 | +0.394 |
1101 | +0.525 |
1110 | +0.696 |
1111 | +1.00 |
🔹 Why is this better?
- The smallest values are closer to zero (capturing fine details).
- Weights close to zero get better precision, unlike FP4, which wastes resolution on large values.
- This better represents LLM weights, improving accuracy over FP4.
3️⃣ NF4 vs. FP4 Intuition
Imagine compressing these values into 4-bit precision:
📌 Example Weight Distribution (Before Quantization)
[-0.9, -0.5, -0.3, -0.1, 0.0, +0.1, +0.3, +0.5, +0.9]
🔴 FP4 (Linear Quantization)
[-1.0, -0.75, -0.5, -0.25, 0.0, +0.25, +0.5, +0.75, +1.0]
❌ Problem: Most LLM weights are near zero, but FP4 wastes precision on large numbers.
🟢 NF4 (Adaptive Quantization)
[-1.0, -0.696, -0.525, -0.394, -0.274, -0.160, -0.082, -0.020,
+0.020, +0.082, +0.160, +0.274, +0.394, +0.525, +0.696, +1.0]
✅ Advantage: NF4 prioritizes small values, making it better suited for LLMs! 🚀
4️⃣ TL;DR - Why NF4 is Better?
Feature | FP4 (Float 4-bit) | NF4 (Normalized Float 4-bit) |
---|---|---|
Spacing | Linear | More values near 0 |
Adapted to LLMs? | ❌ No | ✅ Yes |
Preserves small weights? | ❌ No | ✅ Yes |
Accuracy vs. FP16? | Lower | Higher |
📌 NF4 is specifically designed for LLM quantization, keeping accuracy high even with aggressive compression.
How NF4 Quantization is Calculated?
NF4 (Normalized Float 4) maps full-precision values (e.g., FP16 or FP32) to one of 16 predefined NF4 values. Unlike standard 4-bit floating point, which distributes values linearly, NF4 assigns more precision to small magnitudes.
1️⃣ Steps in NF4 Quantization Process
Step 1: Define the NF4 Quantization Table
NF4 only supports 16 fixed values, selected based on real LLM weight distributions:
\[\text{NF4 Values} = \{-1.00, -0.696, -0.525, -0.394, -0.274, -0.160, -0.082, -0.020, 0.020, 0.082, 0.160, 0.274, 0.394, 0.525, 0.696, 1.00\}\]Each number is assigned a 4-bit code (0000 to 1111).
Step 2: Map Each Weight to the Closest NF4 Value
For each weight $w$ in the model:
- Find the nearest NF4 value from the table.
- Replace $w$ with the corresponding NF4 value.
- Store only the 4-bit code instead of the full number.
2️⃣ Example Calculation (Manually)
Given a set of real model weights:
[-0.85, -0.52, -0.1, 0.0, 0.15, 0.7]
Step 1: Find the Closest NF4 Value for Each Weight
Original Weight | Closest NF4 Value | 4-bit Encoding |
---|---|---|
-0.85 | -1.00 | 0000 |
-0.52 | -0.525 | 0010 |
-0.10 | -0.082 | 0110 |
0.00 | -0.020 | 0111 |
0.15 | 0.160 | 1010 |
0.70 | 0.696 | 1110 |
3️⃣ Python Code for NF4 Quantization
Now, let’s write a Python function to perform NF4 quantization.
import numpy as np

# NF4 predefined values
NF4_VALUES = np.array([
    -1.00, -0.696, -0.525, -0.394, -0.274, -0.160, -0.082, -0.020,
    0.020, 0.082, 0.160, 0.274, 0.394, 0.525, 0.696, 1.00
])

def quantize_to_nf4(weights):
    """
    Quantizes an array of weights to NF4 format by mapping each weight
    to the closest NF4 predefined value.
    """
    quantized_weights = []
    quantized_codes = []
    for w in weights:
        # Find the closest NF4 value
        closest_index = np.argmin(np.abs(NF4_VALUES - w))
        closest_value = NF4_VALUES[closest_index]
        quantized_weights.append(closest_value)
        quantized_codes.append(closest_index)  # Store the 4-bit code (0-15)
    return np.array(quantized_weights), np.array(quantized_codes)

# Example weights
weights = np.array([-0.85, -0.52, -0.1, 0.0, 0.15, 0.7])

# Apply NF4 quantization
quantized_weights, quantized_codes = quantize_to_nf4(weights)

print("Original Weights:", weights)
print("Quantized Weights:", quantized_weights)
print("4-bit Codes:", quantized_codes)
4️⃣ How is NF4 Dequantized?
Once the model is quantized using NF4, dequantization simply replaces the stored 4-bit code with the original NF4 value.
def dequantize_nf4(quantized_codes):
    """
    Dequantizes NF4 codes back to NF4 values.
    """
    return NF4_VALUES[quantized_codes]

# Dequantize to get back NF4 values
dequantized_weights = dequantize_nf4(quantized_codes)
print("Dequantized Weights:", dequantized_weights)
This restores the weights using the NF4 representation.
5️⃣ TL;DR: Key Takeaways
✅ NF4 is non-uniform: More precision around small values (close to zero).
✅ Quantization process: Replace full-precision weights with the nearest NF4 value.
✅ Dequantization: Convert back using stored 4-bit codes.
✅ NF4 reduces memory by 4x while keeping accuracy high for LLMs.
Understanding per_device_train_batch_size & per_device_eval_batch_size
These parameters define how many samples (or examples) are processed per GPU during training and evaluation.
1️⃣ What is Batch Size?
Batch size determines how many examples from the dataset are processed at once before updating model weights.
- Larger batch sizes → Faster training but more GPU memory usage.
- Smaller batch sizes → More stable training but slower convergence.
💡 Example:
If you have a dataset with 10,000 examples and set batch size = 4, the model will process 4 examples at a time before updating weights.
2️⃣ How per_device_train_batch_size Works
This is the number of samples per GPU per training step.
🔹 Example with 1 GPU:
per_device_train_batch_size = 4
num_gpus = 1
effective_batch_size = per_device_train_batch_size * num_gpus
# 4 * 1 = 4
👉 The model processes 4 samples per step.
🔹 Example with 4 GPUs:
per_device_train_batch_size = 4
num_gpus = 4
effective_batch_size = per_device_train_batch_size * num_gpus
# 4 * 4 = 16
👉 Each GPU processes 4 samples, so the total batch size is 16.
3️⃣ How per_device_eval_batch_size Works
This is the number of samples per GPU per evaluation step.
🔹 Example:
If you set:
per_device_eval_batch_size = 8
num_gpus = 2
👉 The evaluation batch size will be:
\[8 (\text{per GPU}) \times 2 (\text{GPUs}) = 16\]The model processes 16 examples at a time during evaluation.
4️⃣ Key Takeaways
✅ Training batch size (per_device_train_batch_size) defines how many samples are used per GPU during training.
✅ Evaluation batch size (per_device_eval_batch_size) defines how many samples are used per GPU during model validation.
✅ Total effective batch size = per_device_batch_size × num_gpus × gradient_accumulation_steps.
5️⃣ Example Calculation With Gradient Accumulation
Scenario:
- 2 GPUs
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
🔹 Total batch size per step:
\[4 (\text{per GPU}) \times 2 (\text{GPUs}) = 8\]🔹 Effective batch size (after accumulation):
\[8 \times 2 (\text{grad accumulation}) = 16\]👉 The model updates after seeing 16 examples instead of 8.
Optimization and Efficiency in Fine-Tuning LLaMA 2
When fine-tuning a large language model, optimizing memory and training efficiency is critical. Let’s break down these parameters with detailed explanations and examples.
1️⃣ gradient_accumulation_steps
👉 What it does:
Normally, the model updates its weights after each batch.
With gradient accumulation, instead of updating after every batch, it accumulates gradients over multiple steps before updating.
🔹 Why is it useful?
- Allows training with smaller batch sizes (helpful when GPU memory is limited).
- Effectively increases the batch size without requiring more GPU memory.
🔹 Example Calculation:
Imagine:
- per_device_train_batch_size = 4
- gradient_accumulation_steps = 2
- 2 GPUs available
Total effective batch size:
\[4 (\text{per GPU}) \times 2 (\text{GPUs}) \times 2 (\text{grad accumulation}) = 16\]👉 The model sees 16 samples before updating weights, even though the actual batch size is just 4 per GPU.
🔹 Effect:
- ✅ Allows training larger models with limited memory.
- ✅ Reduces variance in updates, leading to smoother training.
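Under the hood, this is what the Trainer does. A minimal PyTorch sketch with a stand-in model and random data:

import torch

model = torch.nn.Linear(10, 2)                     # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
loss_fn = torch.nn.MSELoss()
accumulation_steps = 2

for step in range(100):
    inputs, targets = torch.randn(4, 10), torch.randn(4, 2)  # per-device batch of 4
    loss = loss_fn(model(inputs), targets)
    (loss / accumulation_steps).backward()         # scale so accumulated gradients average out

    if (step + 1) % accumulation_steps == 0:       # update only every N micro-batches
        optimizer.step()
        optimizer.zero_grad()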
2️⃣ gradient_checkpointing
👉 What it does:
Instead of storing all intermediate activations in memory, it recomputes them on demand to save GPU memory.
🔹 Why is it useful?
- Saves memory at the cost of extra computation.
- Useful when fine-tuning very large models (e.g., LLaMA-2-13B, LLaMA-2-70B).
🔹 Example: Without gradient checkpointing:
- All intermediate activations are stored → High memory usage.
With gradient checkpointing:
- Only a subset of activations is stored → Memory savings but extra computation.
💡 Trade-off
- ✅ More memory-efficient.
- ❌ Slightly slower training due to recomputation.
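With Hugging Face models this is a single switch: the Trainer enables it automatically when gradient_checkpointing=True, or you can turn it on directly on the model (a sketch; disabling the KV cache is a common companion step during training):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
model.gradient_checkpointing_enable()  # recompute activations during the backward pass
model.config.use_cache = False         # the generation KV cache conflicts with checkpointing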
3️⃣ max_grad_norm (Gradient Clipping)
👉 What it does:
Clips the gradient if it gets too large, preventing unstable updates.
🔹 Why is it useful?
- Prevents exploding gradients (which can cause training to fail).
- Stabilizes training, especially for large models.
🔹 Example Calculation:
- Suppose a model computes a gradient norm of 5.0, but max_grad_norm = 0.3. The gradient is scaled down proportionally: \(\text{new gradient} = \frac{0.3}{5.0} \times \text{original gradient}\)
- This prevents large updates that can destabilize training.
4️⃣ learning_rate
👉 What it does:
Controls how much the model updates its weights after each step.
🔹 Why is it important?
- Too high (e.g., 1e-2) → Model diverges (fails to learn).
- Too low (e.g., 1e-6) → Training is too slow.
🔹 Example:
If learning_rate = 2e-4:
- The model updates its weights by 0.0002 of the computed gradient each step.
💡 Tip:
- ✅ Start with 2e-4 (standard for LLaMA 2 fine-tuning).
- ✅ Use learning rate scheduling for gradual adjustments.
5️⃣ weight_decay
👉 What it does:
Adds a penalty to large weights to prevent overfitting.
🔹 Why is it useful?
- Encourages smaller weight values, which generalize better.
🔹 Example:
- Suppose weight decay = 0.001.
- At each step, the optimizer shrinks the weights slightly (ignoring the learning-rate factor that AdamW also applies): \(\text{updated weight} = \text{original weight} - 0.001 \times \text{original weight}\)
💡 Tip:
- ✅ Use 0.001 for general fine-tuning.
- ✅ Set lower values (0.0001) if the dataset is small.
6️⃣ optim (Optimizer)
👉 What it does:
Defines how gradients are used to update model weights.
🔹 Why is paged_adamw_32bit used?
- Memory-efficient AdamW optimizer (optimized for large models).
- “Paged” version helps with offloading unused data to CPU to save GPU memory.
🔹 Example: A regular AdamW optimizer keeps all of its optimizer states in GPU memory.
Paged AdamW can page those states out to CPU memory when GPU memory runs short → fewer out-of-memory failures.
💡 Tip:
- ✅ Best optimizer for large LLM fine-tuning.
- ✅ Works well with 4-bit quantization.
7️⃣ lr_scheduler_type (Learning Rate Scheduler)
👉 What it does:
Controls how the learning rate changes during training.
🔹 Why is “cosine” used?
- Gradually reduces learning rate over time for better convergence.
🔹 Example:
- At step 0, learning rate = 2e-4.
- Midway through training → the learning rate is reduced.
- At the end of training, learning rate → close to 0.
📉 Cosine Decay Formula:
\[\eta_t = \eta_0 \times 0.5 \times \left(1 + \cos\left(\frac{t}{T} \pi\right)\right)\]where:
- $\eta_t$ = learning rate at step $t$
- $\eta_0$ = initial learning rate
- $T$ = total training steps
🔹 Graph of Cosine Decay:
📈 Starts high → 📉 gradually decreases → 📉📉 flattens near zero.
💡 Why use cosine?
- ✅ Helps the model converge smoothly.
- ✅ Avoids sudden drops in learning rate.
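The decay formula above is easy to tabulate yourself (warmup not included in this sketch):

import math

def cosine_lr(step, total_steps, base_lr=2e-4):
    # Cosine decay from base_lr at step 0 down to 0 at total_steps
    return base_lr * 0.5 * (1 + math.cos(math.pi * step / total_steps))

for step in (0, 2500, 5000, 7500, 10000):
    print(step, f"{cosine_lr(step, 10000):.2e}")
# step 0 -> 2.00e-04, step 5000 -> 1.00e-04, step 10000 -> 0.00e+00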
🔑 Summary Table
Parameter | Purpose | Example Effect |
---|---|---|
gradient_accumulation_steps | Updates weights after multiple steps instead of every batch | Allows using smaller batch sizes |
gradient_checkpointing | Saves memory by recomputing activations | Uses less memory but increases computation |
max_grad_norm | Clips large gradients to prevent unstable updates | Stops exploding gradients |
learning_rate | Controls how fast the model learns | Higher = faster training, but unstable |
weight_decay | Penalizes large weights to prevent overfitting | Regularizes model weights |
optim | Optimizer type (Paged AdamW) | Memory-efficient training |
lr_scheduler_type | Adjusts learning rate over time | Cosine decay = smooth reduction |
🚀 Key Takeaways
- Gradient accumulation helps train models with small batch sizes.
- Gradient checkpointing saves memory but slows down training slightly.
- Gradient clipping (max_grad_norm) prevents unstable training.
- Learning rate scheduling (cosine) helps smooth training.
- The Paged AdamW optimizer is efficient for large-scale fine-tuning.
Gradient Norm and Gradient Clipping (max_grad_norm)
Gradient norms play a critical role in stabilizing the training of deep learning models, especially large-scale models like LLaMA 2. Let’s go deep into what the gradient norm is, why it matters, and how gradient clipping (via max_grad_norm) helps.
1️⃣ What is the Gradient Norm?
👉 Definition:
The gradient norm is the magnitude of the gradient vector (L2 norm) across all parameters of the model. It tells us how large the overall update to the model weights will be in a given training step.
Mathematically, if $\theta$ represents all model parameters and ** $\nabla L(\theta)$ ** is the gradient of the loss function $L$ , then the L2 norm (Euclidean norm) of the gradient is:
\[\|\nabla L(\theta)\|_2 = \sqrt{\sum_{i} g_i^2}\]where $g_i$ represents the gradient of the $i$ -th parameter.
2️⃣ Why is the Gradient Norm Important?
During training, very large gradients can cause instability:
- Exploding gradients: If the gradient norm becomes too large, the model diverges (weights change too drastically, and loss may become NaN).
- Vanishing gradients: If gradients become too small, the model learns too slowly or stops learning.
💡 Example:
- If the gradient norm = 1000, the model makes huge updates, potentially overshooting the optimal solution.
- If the gradient norm = 0.0001, updates are too small, making training inefficient.
Exploding Gradient Example
Consider training a deep neural network where the gradients keep growing:
Training Step | Gradient Norm |
---|---|
Step 1 | 0.5 |
Step 10 | 5.0 |
Step 20 | 500.0 |
Step 30 | 50,000.0 (exploding!) |
Here, the model diverges, and training becomes unstable.
3️⃣ What is Gradient Clipping (max_grad_norm)?
👉 Gradient clipping limits the gradient norm to a fixed value (max_grad_norm), preventing extreme updates.
🔹 How it works:
- If the gradient norm is below max_grad_norm, do nothing.
- If it exceeds max_grad_norm, scale all gradients down proportionally.
Mathematically, if the gradient norm exceeds a given threshold $c$ , the gradient is rescaled:
\[\nabla L(\theta) = \nabla L(\theta) \times \frac{c}{\|\nabla L(\theta)\|_2}\]where $c$ is max_grad_norm.
4️⃣ Example of Gradient Clipping
Let’s assume:
- Gradients for three parameters: $g = (4.0, 3.0, 1.0)$
- Gradient norm (before clipping): $\sqrt{4^2 + 3^2 + 1^2} = \sqrt{26} \approx 5.1$
- max_grad_norm: 2.0 (we want to cap the norm at 2)
The gradients are scaled down by:
\[\frac{2.0}{5.1} \approx 0.39\]So each gradient component is multiplied by about 0.39, reducing the total norm to 2.0.
💡 Before vs. After Clipping
Parameter | Original Gradient | After Clipping |
---|---|---|
$g_1$ | 4.0 | 1.57 |
$g_2$ | 3.0 | 1.18 |
$g_3$ | 1.0 | 0.39 |
👉 Now, the new gradient norm is 2.0, preventing an unstable update.
5️⃣ Implementing Gradient Clipping in PyTorch
In PyTorch, torch.nn.utils.clip_grad_norm_()
is used to apply gradient clipping:
import torch

# Dummy model
model = torch.nn.Linear(10, 2)

# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

# Sample loss function
loss_fn = torch.nn.MSELoss()

# Simulated training step
for step in range(100):
    optimizer.zero_grad()  # Reset gradients

    # Dummy input and target
    input_data = torch.randn(32, 10)  # Batch size = 32
    target = torch.randn(32, 2)

    # Forward pass
    output = model(input_data)
    loss = loss_fn(output, target)

    # Backward pass
    loss.backward()

    # Apply gradient clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Clips gradients

    # Optimizer step
    optimizer.step()
6️⃣ Effect of Different max_grad_norm Values
Choosing the right max_grad_norm is critical:
max_grad_norm | Effect |
---|---|
No clipping | Risk of exploding gradients, unstable training |
1.0 | Conservative updates, stable learning |
5.0 | Faster learning, but risk of instability |
0.1 | Very slow learning, might hinder progress |
💡 Best Practice:
- ✅ 1.0 - 2.0 is safe for large models (e.g., LLaMA 2).
- ✅ If you notice instability, reduce max_grad_norm.
- ✅ If training is too slow, increase it slightly, but not above 5.
7️⃣ Gradient Clipping in Hugging Face Transformers
In the Hugging Face Trainer, gradient clipping is set using max_grad_norm:
training_args = TrainingArguments(
output_dir="./results",
per_device_train_batch_size=4,
gradient_accumulation_steps=2,
learning_rate=2e-4,
max_grad_norm=1.0, # Gradient clipping
num_train_epochs=3,
)
🔑 Key Takeaways
- Gradient norm measures update magnitude: too high leads to instability, too low slows learning.
- Gradient clipping (max_grad_norm) caps large gradients, preventing training divergence.
- Scaling is proportional: all gradients are adjusted uniformly to maintain direction.
- Common values: 1.0 - 2.0 is ideal for LLaMA 2 fine-tuning.
- PyTorch & Hugging Face provide built-in methods to clip gradients easily.
Training Steps and Logging – Detailed Explanation 🚀
When fine-tuning large models like LLaMA 2, understanding training steps and logging settings helps optimize training efficiency and monitoring. Let’s break each setting down with examples and intuition.
🔹 max_steps: Maximum Training Steps
📌 Definition:
- Controls the total number of training steps (batches) before training stops.
- If max_steps = -1, training is controlled by num_train_epochs instead.
📌 Formula for Total Training Steps:
\[\text{Total Steps} = \frac{\text{Dataset Size} \times \text{Num Epochs}}{\text{Batch Size} \times \text{Gradient Accumulation Steps}}\]
💡 Example 1: Using num_train_epochs (default behavior)
- Dataset size: 10,000 samples
- Batch size: 4
- Epochs: 3
- Gradient Accumulation: 2
\[\text{Total Steps} = \frac{10{,}000 \times 3}{4 \times 2} = 3{,}750\]
💡 Example 2: Setting a fixed max_steps
training_args = TrainingArguments(
max_steps=5000 # Stops after 5000 steps, ignoring num_train_epochs
)
👉 Use Case: If your dataset is too large, you can limit steps instead of epochs.
🔹 warmup_ratio: Learning Rate Warmup
📌 Definition:
- Gradually increases the learning rate from 0 to the target learning rate over a small fraction of training steps.
- Prevents the model from making large updates too early, which can cause instability.
📌 Formula:
\[\text{Warmup Steps} = \text{warmup\_ratio} \times \text{Total Training Steps}\]
💡 Example:
- warmup_ratio = 0.1
- max_steps = 10,000
For the first 1,000 steps, the learning rate gradually increases, avoiding sudden large updates.
training_args = TrainingArguments(
learning_rate=2e-4,
warmup_ratio=0.1, # 10% of steps used for warmup
)
📊 Why is warmup important?
- Without warmup: the model starts training at the full learning rate → unstable updates 🚨
- With warmup: the model slowly adjusts to the learning rate → stable training ✅
🔹 group_by_length: Efficient Batching
📌 Definition:
- Groups input samples of similar length into the same batch.
- Reduces the need for padding, making training faster and more memory-efficient.
💡 Example:
Consider training a model on sentences of different lengths:
Sentence | Length (Tokens) |
---|---|
“Hi” | 2 |
“Hello, how are you?” | 10 |
“The quick brown fox jumps over the lazy dog.” | 12 |
- Without group_by_length: a batch might contain short and long sentences together, leading to a lot of padding (wasting memory).
- With group_by_length: similar-length sentences are grouped together → less padding → more efficient training.
training_args = TrainingArguments(
group_by_length=True # Enables length-based batching
)
📊 When to Use? ✅ If your dataset has variable-length sequences (e.g., text, conversations).
❌ Not needed if your input size is fixed.
🔹 save_steps: Checkpoint Saving Frequency
📌 Definition:
- Saves the model every save_steps training steps to avoid losing progress.
- Setting save_steps=0 disables checkpoint saving.
💡 Example:
If save_steps=500, the model saves a checkpoint every 500 steps.
training_args = TrainingArguments(
save_steps=500 # Save model every 500 steps
)
📊 Why is this useful?
- Prevents data loss if training crashes.
- Allows resuming training from a checkpoint.
- But too frequent saving consumes disk space.
👉 Best Practice:
- Small models: Save every few hundred steps.
- Large models (LLaMA 2, GPT-3, etc.): Save every few thousand steps to avoid disk overflow.
🔹 logging_steps: Log Training Progress
📌 Definition:
- Controls how often the training logs are printed.
- In this setup, logging_steps=25 → logs every 25 steps.
💡 Example Output:
If logging_steps=25, you’ll see logs like this:
Step 25 | Loss: 1.234 | Learning Rate: 1.5e-4
Step 50 | Loss: 1.201 | Learning Rate: 1.8e-4
Step 75 | Loss: 1.190 | Learning Rate: 2.0e-4
training_args = TrainingArguments(
logging_steps=25 # Log progress every 25 steps
)
📊 Why is this useful?
- Helps monitor training progress in real-time.
- If loss starts increasing, you can stop training early.
- If training crashes, you have logs to debug issues.
🔑 Key Takeaways
Setting | What It Does | Best Practice |
---|---|---|
max_steps | Controls training duration (use -1 for epochs) | Use fixed steps for large datasets |
warmup_ratio | Slowly increases learning rate at start | Use 0.05 - 0.1 for stable training |
group_by_length | Groups similar-length inputs for efficiency | Use for NLP/text tasks |
save_steps | Saves model checkpoints every N steps | Use 500-2000 for large models |
logging_steps | Logs progress every N steps | Use 10-50 for quick monitoring |
If your training crashes and you want to resume from the last saved checkpoint, follow these steps:
🔹 1. Locate Your Last Checkpoint
By default, Hugging Face saves checkpoints in the output_dir folder (e.g., ./checkpoint-500, ./checkpoint-1000).
- Find the latest saved checkpoint:
ls -d output_dir/checkpoint-*
Example output:
output_dir/checkpoint-500
output_dir/checkpoint-1000
output_dir/checkpoint-1500 <-- Most recent checkpoint
💡 Note: If training crashes, always resume from the latest checkpoint.
🔹 2. Resume Training from Checkpoint
Modify your training script to load from the latest checkpoint using resume_from_checkpoint:
training_args = TrainingArguments(
    output_dir="./output_dir",  # Where checkpoints are saved
    save_steps=500,             # Save every 500 steps
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
# Passing True tells the Trainer to pick up the latest checkpoint in output_dir
trainer.train(resume_from_checkpoint=True)
✅ This automatically detects the latest checkpoint and resumes training.
🔹 3. Manually Resume from a Specific Checkpoint
If you want to resume from a specific checkpoint (e.g., checkpoint-1500), use:
trainer.train(resume_from_checkpoint="./output_dir/checkpoint-1500")
💡 This is useful if you suspect the last checkpoint was corrupted.
🔹 4. Ensure Consistency in Hyperparameters
When resuming training, do not change critical settings like:
✅ learning_rate
✅ batch_size
✅ gradient_accumulation_steps
Changing them may lead to unexpected behavior.
🔹 5. Verify Resume is Working
Check the logs when restarting training. You should see something like:
Resuming training from checkpoint ./output_dir/checkpoint-1500
This confirms that training continues from where it stopped instead of restarting.
🔹 6. Use save_total_limit to Manage Checkpoints
Checkpoints take up a lot of disk space. To keep only the latest N checkpoints, use:
training_args = TrainingArguments(
save_total_limit=3 # Keeps only the last 3 checkpoints
)
💡 This prevents your disk from filling up while still allowing resumption.
🚀 Summary
Scenario | Solution |
---|---|
Resume from the latest checkpoint | resume_from_checkpoint=True |
Resume from a specific checkpoint | trainer.train(resume_from_checkpoint="path/to/checkpoint") |
Prevent disk overflow | save_total_limit=3 |
Ensure settings are consistent | Keep batch_size , lr , etc. the same |
Let’s go through these parameters in detail with examples and intuition:
1️⃣ max_seq_length: Maximum Token Length
- Defines the maximum number of tokens the model can process per input.
- If None, it defaults to the model’s predefined limit (e.g., 4096 tokens for Llama-2-7B).
- If a sequence is too long, it gets truncated; if too short, it gets padded.
Example:
max_seq_length = 512
- If an input sentence has 700 tokens, it gets truncated to 512.
- If an input sentence has 300 tokens, it may be padded to 512 (depending on the tokenizer).
Why is this important?
- Higher max_seq_length → more context but higher memory usage.
- Lower max_seq_length → faster training but less context available.
💡 Practical Tip: If using group_by_length=True, try reducing max_seq_length to fit more examples per batch.
2️⃣ packing: Packs Multiple Short Examples Together
- If False (default), each example is processed separately.
- If True, multiple short examples are concatenated into a single long sequence to increase efficiency.
Example:
packing = True
If max_seq_length = 512, and we have three short examples:
- Example 1 → 150 tokens
- Example 2 → 200 tokens
- Example 3 → 162 tokens
Without packing (default):
- Each example is processed separately, wasting space on padding.
With packing=True:
- The three examples combine into a single 512-token sequence (150 + 200 + 162 = 512 tokens).
- This reduces padding and wastes less memory.
When to use packing?
- When working with many short text examples (e.g., question-answer pairs).
- When you want to improve training efficiency.
💡 Practical Tip: Only use packing=True if your dataset has many short sequences.
3️⃣ device_map: Loads Model on Specific GPUs
- Controls which GPU(s) the model is loaded on.
- Helpful when using multiple GPUs or limited memory.
Example (Single GPU on GPU 0):
device_map = {"": 0} # Load model on GPU 0
Example (Automatically Distribute Across GPUs):
device_map = "auto" # Let HF automatically distribute across GPUs
- If you have multiple GPUs, it spreads layers across available GPUs.
Example (Manually Assign GPUs):
device_map = {"transformer.wte": 0, "transformer.h.0": 1, "transformer.h.1": 2}
- Embedding layer → GPU 0
- Transformer block 0 → GPU 1
- Transformer block 1 → GPU 2
- Useful when managing memory manually (the module names depend on the model architecture).
💡 Practical Tip:
- If using a single GPU, {"": 0} is fine.
- If using multiple GPUs, device_map="auto" simplifies training.
🚀 Summary
Parameter | Meaning | Example | When to Use? |
---|---|---|---|
max_seq_length | Limits tokens per input | max_seq_length = 512 | Large values retain more context but use more memory |
packing | Packs short examples together | packing = True | If dataset has many short examples, improves efficiency |
device_map | Controls GPU usage | device_map="auto" | Distributes model across GPUs automatically |