Fine-Tuning Qwen2.5 Locally: My Experience on an RTX 4060
A hands-on guide to fine-tuning Qwen2.5 on consumer hardware — LoRA, 4-bit quantization, and practical lessons from training on an NVIDIA 4060.
Fine-tuning large language models used to mean spinning up cloud GPUs and burning through credits. But with the right tools, you can now train models like Qwen2.5 entirely on consumer hardware — including a modest NVIDIA RTX 4060 with 8GB VRAM.
I recently fine-tuned Qwen2.5 for a custom use case on my local machine. Here's what I learned and how you can do the same.
Why Fine-Tune Locally?
Cloud fine-tuning has its place: massive models, huge datasets, fast iteration. But local training offers:
- Privacy: Your data never leaves your machine
- Cost control: No per-hour GPU bills — you pay once for the hardware
- Iteration speed: No queueing, no cold starts, instant restarts
- Learning: You understand the full stack end-to-end
The RTX 4060 (8GB VRAM) is a realistic target for 7B-parameter models with modern efficiency tricks: 4-bit quantization, LoRA, and frameworks like Unsloth that squeeze every bit of performance out of the hardware.
The Stack: Unsloth + LoRA + 4-bit Quantization
Three pieces make local fine-tuning on 8GB feasible:
- 4-bit quantization — Load the base model in 4-bit instead of fp16, cutting memory use by ~75%
- LoRA (Low-Rank Adaptation) — Train small adapter matrices instead of full weights; only a tiny fraction of parameters are updated
- Unsloth — Optimized kernels and training loops; roughly 2× faster and 60% less memory than vanilla Hugging Face + Flash Attention 2
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
)
```

On an 8GB GPU, the 7B model in 4-bit with LoRA fits comfortably — typically around 5–6GB during training.
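That 5–6GB figure can be sanity-checked with back-of-envelope arithmetic. The sketch below uses assumed round numbers (7B parameters, ~40M trainable LoRA parameters, a fixed overhead for activations and the CUDA context), not measurements:

```python
# Back-of-envelope VRAM estimate for a 7B model in 4-bit with LoRA.
# All inputs are rough assumptions, not measured values.
def estimate_vram_gb(n_params_b=7.0, lora_params_m=40.0):
    base_weights = n_params_b * 1e9 * 0.5 / 1e9    # 4-bit weights ~ 0.5 bytes/param
    lora = lora_params_m * 1e6 * 2 / 1e9           # LoRA adapters in fp16/bf16
    grads = lora_params_m * 1e6 * 2 / 1e9          # gradients exist only for LoRA params
    optimizer = lora_params_m * 1e6 * 2 / 1e9      # 8-bit Adam: two 1-byte states/param
    overhead = 1.5                                 # activations + CUDA context (assumed)
    return base_weights + lora + grads + optimizer + overhead

print(round(estimate_vram_gb(), 2))  # roughly 5.2 GB under these assumptions
```

The dominant term is the 4-bit base weights (~3.5GB); everything LoRA-related is comparatively tiny, which is exactly why this fits on 8GB.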
Installation and Environment
```bash
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps xformers trl peft accelerate bitsandbytes
```

I used Python 3.10 and PyTorch 2.x with CUDA 12.1. Make sure your NVIDIA drivers and CUDA toolkit match; mismatches cause cryptic crashes.
```bash
# Quick sanity check
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```

Preparing the Training Data
Qwen2.5 expects chat-style data. For instruction fine-tuning, I used a simple format:
```python
from datasets import Dataset

def format_instruction(example):
    return {
        "text": f"""<|im_start|>user
{example["instruction"]}<|im_end|>
<|im_start|>assistant
{example["output"]}<|im_end|>"""
    }

dataset = Dataset.from_dict({
    "instruction": ["Explain recursion in one sentence.", "Write a haiku about coding."],
    "output": ["Recursion is when a function calls itself until a base case stops it.",
               "Logic flows like streams / Bugs hide in the shadows / Debug until dawn"],
})
dataset = dataset.map(format_instruction, remove_columns=dataset.column_names)
```

Important: the base (non-Instruct) Qwen2.5 models weren't trained to follow the <|im_start|>/<|im_end|> chat template. For base-model fine-tuning, you may want to use a simpler format, or switch to a chat-tuned variant that already understands these tokens.
I used around 500–1000 instruction–output pairs for my first run. Quality matters more than quantity — noisy data hurts more than helps.
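"Quality over quantity" is easy to say and easy to skip. A minimal cleaning pass before training catches the most common problems: exact duplicates, empty fields, and trivially short answers. The thresholds below are arbitrary assumptions; tune them for your data:

```python
# Minimal data-cleaning sketch: drop duplicates, empty instructions,
# and suspiciously short outputs. Thresholds are assumptions.
def clean_pairs(pairs, min_output_chars=20):
    seen = set()
    cleaned = []
    for instruction, output in pairs:
        instruction, output = instruction.strip(), output.strip()
        if not instruction or len(output) < min_output_chars:
            continue                         # empty or too short to teach anything
        key = (instruction.lower(), output.lower())
        if key in seen:
            continue                         # exact duplicate
        seen.add(key)
        cleaned.append((instruction, output))
    return cleaned

pairs = [
    ("Explain recursion.", "A function calling itself until a base case stops it."),
    ("Explain recursion.", "A function calling itself until a base case stops it."),  # dup
    ("Say hi.", "Hi."),                                                               # too short
]
print(len(clean_pairs(pairs)))  # 1
```

For real datasets you'd likely add near-duplicate detection and manual spot-checks, but even this level of filtering pays for itself.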
Training Configuration for 8GB VRAM
The key is staying within memory. Here's a config that worked for me:
```python
import torch
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=2,
        learning_rate=2e-5,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        output_dir="./qwen2.5-finetuned",
        save_strategy="epoch",
    ),
)
trainer.train()
```

- Batch size 2 with gradient accumulation 4 → effective batch size 8
- 8-bit Adam to save more VRAM
- bf16 if your GPU supports it (RTX 40 series does)
If you hit OOM (out of memory), reduce per_device_train_batch_size to 1 or max_seq_length to 1024.
What I Ran Into: Practical Lessons
1. VRAM spikes during the first step
The first training step often allocates extra memory for optimizers and gradients. If you're right at the edge, reduce batch size or sequence length slightly so the first step doesn't OOM.
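The spike has a concrete cause: 8-bit optimizers like the one in bitsandbytes allocate their state tensors lazily on the first `step()`, and gradients only materialize during the first backward pass. Assuming ~40M trainable LoRA parameters, the fixed part of that jump is modest (the rest of the spike is peak activation memory):

```python
# Extra memory that appears at the first training step, assuming
# ~40M trainable LoRA parameters (an assumption, not a measurement).
trainable = 40e6
grads_mb = trainable * 2 / 1e6          # fp16/bf16 gradients: 2 bytes per param
adam_states_mb = trainable * 2 * 1 / 1e6  # 8-bit Adam: two 1-byte moment tensors
print(grads_mb + adam_states_mb)        # 160.0 (MB)
```

160MB sounds harmless, but combined with the activation peak of the first full-length batch it's enough to tip a run that was sitting at 7.9GB over the edge.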
2. Sequence length is a big lever
Going from 2048 to 4096 roughly doubles activation memory. For many instruction tasks, 1024–2048 is enough. Start small and only increase if you need longer context.
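Before paying for a longer context, it's worth measuring what your data actually needs. The sketch below uses whitespace word counts scaled by an assumed tokens-per-word factor as a crude proxy — the real tokenizer's counts will differ, so treat the result as a lower-bound estimate and pad it:

```python
# Rough estimate of the max_seq_length your dataset actually needs.
# Word count * tokens_per_word is a crude proxy for real tokenization;
# the 1.5 factor is an assumption — verify with the actual tokenizer.
def longest_needed(texts, percentile=0.95, tokens_per_word=1.5):
    lengths = sorted(int(len(t.split()) * tokens_per_word) for t in texts)
    idx = min(int(len(lengths) * percentile), len(lengths) - 1)
    return lengths[idx]

texts = ["short example " * 10, "a much longer training example " * 50]
print(longest_needed(texts))  # 375
```

If the 95th percentile lands well under 1024, there's no reason to train at 2048 — you're just burning activation memory on padding.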
3. LoRA rank r
I used r=16 as a default. Higher (e.g. 32) gives more capacity but uses more memory and can overfit on small datasets. For fewer than 1k examples, r=8 or r=16 is usually sufficient.
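The memory cost of a given rank is easy to compute: each adapted linear layer of shape (d_in, d_out) adds r × (d_in + d_out) parameters (the A matrix is d_in×r, the B matrix is r×d_out). Plugging in the layer shapes from Qwen2.5-7B's published config (hidden size 3584, intermediate size 18944, GQA with 512-dim k/v projections, 28 layers):

```python
# LoRA trainable-parameter count: r * (d_in + d_out) per adapted linear.
# Shapes below follow Qwen2.5-7B's config (hidden 3584, intermediate
# 18944, 512-dim k/v projections from grouped-query attention, 28 layers).
def lora_param_count(shapes, r, n_layers):
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * n_layers

shapes = [
    (3584, 3584),   # q_proj
    (3584, 512),    # k_proj
    (3584, 512),    # v_proj
    (3584, 3584),   # o_proj
    (3584, 18944),  # gate_proj
    (3584, 18944),  # up_proj
    (18944, 3584),  # down_proj
]
print(lora_param_count(shapes, r=16, n_layers=28) / 1e6)  # ~40.4 (million)
```

That's roughly 40M trainable parameters — about 0.6% of the 7B total — and doubling r to 32 doubles it. Capacity scales linearly with r, which is why small ranks go surprisingly far on small datasets.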
4. Checkpoint often
Training can run for an hour or more. Use save_strategy="epoch" or save_steps=100 so you don't lose progress if something crashes.
5. Evaluation during training
Adding a small eval set and evaluation_strategy="steps" helps catch overfitting early. I added this after my first run showed the model memorizing rather than generalizing.
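Carving out the eval set is the only prerequisite. If you're already using the datasets library, `Dataset.train_test_split` does this in one call; the pure-Python sketch below shows the same idea for a plain list of examples (note that newer transformers versions rename `evaluation_strategy` to `eval_strategy`):

```python
# Simple seeded held-out split (a sketch; datasets.Dataset.train_test_split
# is the equivalent if your data is already in a datasets.Dataset).
import random

def train_eval_split(examples, eval_fraction=0.1, seed=42):
    rng = random.Random(seed)           # fixed seed so the split is reproducible
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

train, evalset = train_eval_split(list(range(100)))
print(len(train), len(evalset))  # 90 10
```

Then pass the held-out portion as `eval_dataset` to the trainer. A training loss that keeps falling while eval loss climbs is the memorization signal I missed on my first run.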
Saving and Loading the Fine-Tuned Model
```python
model.save_pretrained("qwen2.5-lora-adapters")
tokenizer.save_pretrained("qwen2.5-lora-adapters")
```

To load for inference:
```python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
model.load_adapter("qwen2.5-lora-adapters")
FastLanguageModel.for_inference(model)
```

You can also merge the LoRA weights into the base model and export to GGUF or other formats for use with llama.cpp, Ollama, or similar tools.
Rough Timings on RTX 4060
For ~500 examples, 2048 tokens max, 2 epochs:
- Training: ~45–60 minutes
- VRAM peak: ~6.5 GB
- Checkpoint size: ~50 MB (LoRA adapters only)
Not bad for a $300 GPU.
Conclusion
Fine-tuning Qwen2.5 locally on an RTX 4060 is entirely feasible. With Unsloth, 4-bit quantization, and LoRA, you can adapt a 7B model to your domain without cloud GPUs or large budgets.
The main constraints are VRAM and patience — you won't train 70B models on 8GB, and iteration is slower than on A100s. But for many practical use cases (custom assistants, domain-specific Q&A, structured output), a locally fine-tuned 7B model is more than enough.
If you have an RTX 4060 or similar and want to own your model and your data, give it a try.
Resources: Unsloth GitHub, Qwen2.5 on Hugging Face