Fine-Tuning Qwen2.5 Locally: My Experience on an RTX 4060
A hands-on guide to fine-tuning Qwen2.5 on consumer hardware — LoRA, 4-bit quantization, and practical lessons from training on an NVIDIA 4060.
Fine-tuning large language models used to mean spinning up cloud GPUs and burning through credits. But with the right tools, you can now train models like Qwen2.5 entirely on consumer hardware — including a modest NVIDIA RTX 4060 with 8GB VRAM.
I recently fine-tuned Qwen2.5 for a custom use case on my local machine. Here's what I learned and how you can do the same.
Why Fine-Tune Locally?
Cloud fine-tuning has its place: massive models, huge datasets, fast iteration. But local training offers:
- Privacy: Your data never leaves your machine
- Cost control: No per-hour GPU bills — you pay once for the hardware
- Iteration speed: No queueing, no cold starts, instant restarts
- Learning: You understand the full stack end-to-end
The RTX 4060 (8GB VRAM) is a realistic target for 7B-parameter models with modern efficiency tricks: 4-bit quantization, LoRA, and frameworks like Unsloth that squeeze every bit of performance out of the hardware.
The Stack: Unsloth + LoRA + 4-bit Quantization
Three pieces make local fine-tuning on 8GB feasible:
- 4-bit quantization — Load the base model in 4-bit instead of fp16, cutting memory use by ~75%
- LoRA (Low-Rank Adaptation) — Train small adapter matrices instead of full weights; only a tiny fraction of parameters are updated
- Unsloth — Optimized kernels and training loops; roughly 2× faster and 60% less memory than vanilla Hugging Face + Flash Attention 2
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
)
```

On an 8GB GPU, the 7B model in 4-bit with LoRA fits comfortably — typically around 5–6GB during training.
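That 5–6GB figure can be sanity-checked with back-of-envelope arithmetic. The sketch below uses assumed round numbers (7B parameters, ~40M trainable LoRA parameters, a fixed overhead for activations and the CUDA context), not measurements:

```python
# Back-of-envelope VRAM estimate for a 7B model in 4-bit with LoRA.
# All inputs are rough assumptions, not measured values.
def estimate_vram_gb(n_params_b=7.0, lora_params_m=40.0):
    base_weights = n_params_b * 1e9 * 0.5 / 1e9    # 4-bit weights ~ 0.5 bytes/param
    lora = lora_params_m * 1e6 * 2 / 1e9           # LoRA adapters in fp16/bf16
    grads = lora_params_m * 1e6 * 2 / 1e9          # gradients exist only for LoRA params
    optimizer = lora_params_m * 1e6 * 2 / 1e9      # 8-bit Adam: two 1-byte states/param
    overhead = 1.5                                 # activations + CUDA context (assumed)
    return base_weights + lora + grads + optimizer + overhead

print(round(estimate_vram_gb(), 2))  # roughly 5.2 GB under these assumptions
```

The dominant term is the 4-bit base weights (~3.5GB); everything LoRA-related is comparatively tiny, which is exactly why this fits on 8GB.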
Installation and Environment
```bash
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps xformers trl peft accelerate bitsandbytes
```

I used Python 3.10 and PyTorch 2.x with CUDA 12.1. Make sure your NVIDIA drivers and CUDA toolkit match; mismatches cause cryptic crashes.
```bash
# Quick sanity check
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```

Preparing the Training Data
Qwen2.5 expects chat-style data. For instruction fine-tuning, I used a simple format:
```python
from datasets import Dataset

def format_instruction(example):
    return {
        "text": f"""<|im_start|>user
{example["instruction"]}<|im_end|>
<|im_start|>assistant
{example["output"]}<|im_end|>"""
    }

dataset = Dataset.from_dict({
    "instruction": ["Explain recursion in one sentence.", "Write a haiku about coding."],
    "output": ["Recursion is when a function calls itself until a base case stops it.",
               "Logic flows like streams / Bugs hide in the shadows / Debug until dawn"],
})
dataset = dataset.map(format_instruction, remove_columns=dataset.column_names)
```

Important: the base (non-Instruct) Qwen2.5 models weren't trained to follow the <|im_start|>/<|im_end|> chat template. For base-model fine-tuning, you may want to use a simpler format, or switch to a chat-tuned variant that already understands these tokens.
I used around 500–1000 instruction–output pairs for my first run. Quality matters more than quantity — noisy data hurts more than helps.
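"Quality over quantity" is easy to say and easy to skip. A minimal cleaning pass before training catches the most common problems: exact duplicates, empty fields, and trivially short answers. The thresholds below are arbitrary assumptions; tune them for your data:

```python
# Minimal data-cleaning sketch: drop duplicates, empty instructions,
# and suspiciously short outputs. Thresholds are assumptions.
def clean_pairs(pairs, min_output_chars=20):
    seen = set()
    cleaned = []
    for instruction, output in pairs:
        instruction, output = instruction.strip(), output.strip()
        if not instruction or len(output) < min_output_chars:
            continue                         # empty or too short to teach anything
        key = (instruction.lower(), output.lower())
        if key in seen:
            continue                         # exact duplicate
        seen.add(key)
        cleaned.append((instruction, output))
    return cleaned

pairs = [
    ("Explain recursion.", "A function calling itself until a base case stops it."),
    ("Explain recursion.", "A function calling itself until a base case stops it."),  # dup
    ("Say hi.", "Hi."),                                                               # too short
]
print(len(clean_pairs(pairs)))  # 1
```

For real datasets you'd likely add near-duplicate detection and manual spot-checks, but even this level of filtering pays for itself.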
Training Configuration for 8GB VRAM
The key is staying within memory. Here's a config that worked for me:
```python
import torch
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=2,
        learning_rate=2e-5,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        output_dir="./qwen2.5-finetuned",
        save_strategy="epoch",
    ),
)
trainer.train()
```

- Batch size 2 with gradient accumulation 4 → effective batch size 8
- 8-bit Adam to save more VRAM
- bf16 if your GPU supports it (RTX 40 series does)
If you hit OOM (out of memory), reduce per_device_train_batch_size to 1 or max_seq_length to 1024.
What I Ran Into: Practical Lessons
1. VRAM spikes during the first step
The first training step often allocates extra memory for optimizers and gradients. If you're right at the edge, reduce batch size or sequence length slightly so the first step doesn't OOM.
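The spike has a concrete cause: 8-bit optimizers like the one in bitsandbytes allocate their state tensors lazily on the first `step()`, and gradients only materialize during the first backward pass. Assuming ~40M trainable LoRA parameters, the fixed part of that jump is modest (the rest of the spike is peak activation memory):

```python
# Extra memory that appears at the first training step, assuming
# ~40M trainable LoRA parameters (an assumption, not a measurement).
trainable = 40e6
grads_mb = trainable * 2 / 1e6          # fp16/bf16 gradients: 2 bytes per param
adam_states_mb = trainable * 2 * 1 / 1e6  # 8-bit Adam: two 1-byte moment tensors
print(grads_mb + adam_states_mb)        # 160.0 (MB)
```

160MB sounds harmless, but combined with the activation peak of the first full-length batch it's enough to tip a run that was sitting at 7.9GB over the edge.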
2. Sequence length is a big lever
Going from 2048 to 4096 roughly doubles activation memory. For many instruction tasks, 1024–2048 is enough. Start small and only increase if you need longer context.
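Before paying for a longer context, it's worth measuring what your data actually needs. The sketch below uses whitespace word counts scaled by an assumed tokens-per-word factor as a crude proxy — the real tokenizer's counts will differ, so treat the result as a lower-bound estimate and pad it:

```python
# Rough estimate of the max_seq_length your dataset actually needs.
# Word count * tokens_per_word is a crude proxy for real tokenization;
# the 1.5 factor is an assumption — verify with the actual tokenizer.
def longest_needed(texts, percentile=0.95, tokens_per_word=1.5):
    lengths = sorted(int(len(t.split()) * tokens_per_word) for t in texts)
    idx = min(int(len(lengths) * percentile), len(lengths) - 1)
    return lengths[idx]

texts = ["short example " * 10, "a much longer training example " * 50]
print(longest_needed(texts))  # 375
```

If the 95th percentile lands well under 1024, there's no reason to train at 2048 — you're just burning activation memory on padding.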
3. LoRA rank r
I used r=16 as a default. Higher (e.g. 32) gives more capacity but uses more memory and can overfit on small datasets. For fewer than 1k examples, r=8 or r=16 is usually sufficient.
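The memory cost of a given rank is easy to compute: each adapted linear layer of shape (d_in, d_out) adds r × (d_in + d_out) parameters (the A matrix is d_in×r, the B matrix is r×d_out). Plugging in the layer shapes from Qwen2.5-7B's published config (hidden size 3584, intermediate size 18944, GQA with 512-dim k/v projections, 28 layers):

```python
# LoRA trainable-parameter count: r * (d_in + d_out) per adapted linear.
# Shapes below follow Qwen2.5-7B's config (hidden 3584, intermediate
# 18944, 512-dim k/v projections from grouped-query attention, 28 layers).
def lora_param_count(shapes, r, n_layers):
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * n_layers

shapes = [
    (3584, 3584),   # q_proj
    (3584, 512),    # k_proj
    (3584, 512),    # v_proj
    (3584, 3584),   # o_proj
    (3584, 18944),  # gate_proj
    (3584, 18944),  # up_proj
    (18944, 3584),  # down_proj
]
print(lora_param_count(shapes, r=16, n_layers=28) / 1e6)  # ~40.4 (million)
```

That's roughly 40M trainable parameters — about 0.6% of the 7B total — and doubling r to 32 doubles it. Capacity scales linearly with r, which is why small ranks go surprisingly far on small datasets.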
4. Checkpoint often
Training can run for an hour or more. Use save_strategy="epoch" or save_steps=100 so you don't lose progress if something crashes.
5. Evaluation during training
Adding a small eval set and evaluation_strategy="steps" helps catch overfitting early. I added this after my first run showed the model memorizing rather than generalizing.
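Carving out the eval set is the only prerequisite. If you're already using the datasets library, `Dataset.train_test_split` does this in one call; the pure-Python sketch below shows the same idea for a plain list of examples (note that newer transformers versions rename `evaluation_strategy` to `eval_strategy`):

```python
# Simple seeded held-out split (a sketch; datasets.Dataset.train_test_split
# is the equivalent if your data is already in a datasets.Dataset).
import random

def train_eval_split(examples, eval_fraction=0.1, seed=42):
    rng = random.Random(seed)           # fixed seed so the split is reproducible
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

train, evalset = train_eval_split(list(range(100)))
print(len(train), len(evalset))  # 90 10
```

Then pass the held-out portion as `eval_dataset` to the trainer. A training loss that keeps falling while eval loss climbs is the memorization signal I missed on my first run.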
Saving and Loading the Fine-Tuned Model
```python
model.save_pretrained("qwen2.5-lora-adapters")
tokenizer.save_pretrained("qwen2.5-lora-adapters")
```

To load for inference:
```python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model.load_adapter("qwen2.5-lora-adapters")
FastLanguageModel.for_inference(model)
```

You can also merge the LoRA weights into the base model and export to GGUF or other formats for use with llama.cpp, Ollama, or similar tools.
Rough Timings on RTX 4060
For ~500 examples, 2048 tokens max, 2 epochs:
- Training: ~45–60 minutes
- VRAM peak: ~6.5 GB
- Checkpoint size: ~50 MB (LoRA adapters only)
Not bad for a $300 GPU.
Conclusion
Fine-tuning Qwen2.5 locally on an RTX 4060 is entirely feasible. With Unsloth, 4-bit quantization, and LoRA, you can adapt a 7B model to your domain without cloud GPUs or large budgets.
The main constraints are VRAM and patience — you won't train 70B models on 8GB, and iteration is slower than on A100s. But for many practical use cases (custom assistants, domain-specific Q&A, structured output), a locally fine-tuned 7B model is more than enough.
If you have an RTX 4060 or similar and want to own your model and your data, give it a try.
Resources: Unsloth GitHub, Qwen2.5 on Hugging Face