LLM Fine-Tuning Guide in 2026: LoRA, QLoRA, DPO, GRPO, RLHF, and How to Evaluate Fine-Tuned Models
Fine-tune LLMs in 2026 with LoRA, QLoRA, GRPO, RLHF, DPO, IPO. Compare trl, unsloth, axolotl, DeepSpeed and learn how to evaluate fine-tuned models.
TL;DR: How to Pick a Fine-Tuning Method in 2026
| Goal | Default method | Framework | When to switch |
|---|---|---|---|
| Style and format adaptation | LoRA or QLoRA SFT | unsloth or trl | Tiny dataset → prompting + RAG instead |
| Instruction following | SFT then DPO | trl or axolotl | Need deep alignment → RLHF |
| Preference alignment | DPO or IPO | trl | Need on-policy RL → GRPO or PPO |
| Reasoning (math, code) | GRPO | trl or open-r1 | Reward is not verifiable → DPO |
| Domain adaptation | Continued pretraining then SFT | axolotl + DeepSpeed | Single niche → just SFT |
| Eval the result | 50+ eval templates + held-out set | Future AGI | Always |
Verdict for 2026: start with QLoRA SFT in unsloth on a single H100, layer DPO if you have preference pairs, switch to GRPO when your reward function is verifiable, and evaluate every candidate in Future AGI before promoting it to production.
Why General-Purpose LLMs Fall Short and Where Fine-Tuning Still Earns Its Compute
Frontier base models like GPT-5, Claude Opus 4.7, and Llama 4.x are powerful but they often lack the specific tone, format, or domain vocabulary a production product needs. Three places where fine-tuning still pays off in 2026:
- Style and format pinning. A prompt can get you 80% of the way to a brand voice; fine-tuning closes the last 20% without burning prompt tokens at every call.
- Closed tasks with verifiable rewards. Math, code, structured extraction, and tool-call accuracy respond well to reinforcement learning on top of a strong SFT base. GRPO is the lever.
- Small-model parity on narrow tasks. A 7B-13B model fine-tuned on a narrow workload can match a 70B-class frontier model at 5-10x lower latency and cost. This is the main reason fine-tuning has survived the rise of capable base models.
Where fine-tuning is the wrong answer in 2026: when the task is open-ended, when ground-truth data is scarce or noisy, when the base model is changing every quarter, or when retrieval can solve the problem more cheaply.
What Changed Since 2025: GRPO, Faster QLoRA, and the Pretrain-Free Reasoning Recipe
The 2025-2026 changes that matter:
- GRPO went mainstream. DeepSeek-R1 in early 2025 popularized Group Relative Policy Optimization, which trains reasoning capabilities directly via reinforcement learning without a separate reward model. The trl library shipped a stable GRPOTrainer in 2025 and open-r1 published reproducible recipes. Source: arxiv.org/abs/2402.03300 (the DeepSeekMath paper introducing GRPO).
- QLoRA tooling matured. unsloth’s optimized kernels deliver 2-5x faster training and ~50-70% lower VRAM than naive HF + bitsandbytes, on a single GPU. axolotl shipped declarative GRPO recipes by mid-2025.
- DPO variants stabilized. IPO (Identity Preference Optimization), KTO (Kahneman-Tversky Optimization), and SimPO landed in trl as alternatives to vanilla DPO for cases where the BT-model assumption fails.
- Synthetic data became standard. Frontier-model judges generate high-quality preference pairs and instruction-response pairs at scale. The catch: you need to evaluate the synthetic data before it goes into training. Future AGI’s eval templates and dataset curation primitives are designed for this loop.
- Pretrain-free reasoning. Several open recipes show you can get strong reasoning out of a base model with SFT + GRPO alone, no separate reward model required. The implication: most teams will never need full RLHF in 2026.
Why Fine-Tuning Lets You Achieve More with Less: Control, Data Efficiency, Resource Savings
Open-source LLMs like the Llama 4 family or Mistral’s 2025-2026 releases are excellent for general-purpose applications. For tasks that demand high accuracy and domain-specific behavior, fine-tuning is the standard approach.
Fine-tuning is the process of taking a pre-trained, generalist model and further training it on a smaller, curated dataset. This method adapts the model to your specific requirements without the prohibitive cost of training from scratch.
The primary benefits:
- Greater control and accuracy. The model learns the specific behaviour, terminology, and patterns of your domain.
- Data efficiency. Since the model already possesses a broad foundation, fine-tuning requires significantly less data than training from the ground up.
- Resource savings. Modern PEFT methods focus on updating a small set of weights, so QLoRA fine-tuning of a 70B-class model fits in a single 48-80GB GPU.
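A back-of-envelope estimate makes the single-GPU claim concrete. The sketch below is approximate and assumes a LoRA adapter fraction of roughly 0.5% of base parameters (a typical order of magnitude, not a fixed rule); it ignores activations and KV cache, which add several more GB depending on batch size and sequence length.

```python
def qlora_vram_gb(n_params_b: float, lora_frac: float = 0.005) -> float:
    """Rough QLoRA VRAM estimate: 4-bit base weights plus 16-bit LoRA
    adapters with fp32 Adam optimizer state. Approximate by design."""
    base = n_params_b * 0.5            # 4-bit weights: 0.5 bytes per param
    n_lora_b = n_params_b * lora_frac  # trainable adapter params (assumed ~0.5%)
    # bf16 weights + bf16 grads (2 + 2 bytes) + two fp32 Adam moments (4 + 4 bytes)
    adapters = n_lora_b * (2 + 2 + 4 + 4)
    return base + adapters             # excludes activations and KV cache

print(round(qlora_vram_gb(70), 1))  # roughly 39 GB for a 70B base, before activations
```

Under these assumptions a 70B base lands around 39 GB of weights and optimizer state, which is why a single 48-80GB GPU is plausible once activation memory is budgeted on top.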
Fine-Tuning Taxonomy: How to Select the Right Method for Your Use Case and Data Availability
Supervised, Semi-Supervised, and Unsupervised LLM Fine-Tuning
Fine-tuning methodologies are categorized by the type of data used for training.
- Supervised Fine-Tuning (SFT). The most common method for adapting an LLM to a specific downstream task. SFT requires a high-quality labeled dataset where each data point consists of an input and its desired output. Training minimizes a cross-entropy loss between predictions and ground truth. This optimizes the model for behaviors like instruction following, classification, or summarization.
- Unsupervised Fine-Tuning. Often called domain-adaptive pre-training. The process continues the model’s original pre-training objective (next-token prediction) on a large domain-specific corpus such as legal documents or medical research. This helps the model learn vocabulary, syntax, and statistical patterns of the target domain before it is fine-tuned for a specific task.
- Semi-Supervised Fine-Tuning. A hybrid approach when labeled data is limited but unlabeled data is abundant. A common strategy: unsupervised continued pretraining on the unlabeled corpus, then SFT on the small labeled set. Pseudo-labeling is another option, where a partially trained model generates labels for unlabeled data.
Feature Extraction vs Full Fine-Tuning vs PEFT
When adapting a pre-trained model, you choose between updating all parameters or only a small subset.
- Full Fine-Tuning. Unfreezes all layers and updates all parameters during training. Highest performance ceiling, highest cost. Risk of catastrophic forgetting and overfitting on small datasets. Reserve for cases where you have multi-node compute and a clean dataset that justifies it.
- Feature Extraction. Keeps the base model frozen as a fixed encoder and trains a small classification head on top. Fast and resource-efficient, but the performance ceiling is bounded by the base representations.
- Parameter-Efficient Fine-Tuning (PEFT). Trains a small fraction of parameters while keeping the base frozen. LoRA, QLoRA, adapters, prefix-tuning, and P-tuning v2 are the main families. PEFT is the default for most teams in 2026.
LoRA, QLoRA, and Adapters: The 2026 PEFT Stack
- LoRA (Low-Rank Adaptation). Hypothesizes that the change in weight matrices during fine-tuning has low intrinsic rank. Instead of updating the full weight matrix, LoRA injects a pair of trainable low-rank matrices A and B such that the effective update is BA. Trainable parameters drop by 100-1000x. Source: arxiv.org/abs/2106.09685.
- QLoRA. Quantizes the frozen base model to 4-bit (NF4) and trains LoRA adapters in 16-bit on top. ~4x VRAM reduction vs vanilla LoRA. The default starting point for single-GPU fine-tuning in 2026. Source: arxiv.org/abs/2305.14314.
- Adapters. Inject small bottleneck networks between transformer layers. Modular; you can swap adapter sets per task without touching the base model.
- Prefix-Tuning and P-Tuning v2. Learn task-specific continuous “virtual tokens” that steer the frozen base model’s attention pattern. Best when you cannot modify model weights at all (black-box deployment).
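The LoRA update is simple enough to sketch directly. This is a minimal illustration of the BA decomposition from the paper, not the `peft` library's implementation; the layer sizes and rank are arbitrary.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update BA,
    scaled by alpha / r as in the LoRA paper."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # BA starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 16 * 4096 = 131,072 trainable params vs ~16.8M in the dense layer
```

Initializing B to zero means the model starts exactly at the pretrained weights, and the rank-16 adapter trains about 128x fewer parameters than the dense layer it wraps.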
Selection guide for 2026:
- Default to QLoRA. It is the sweet spot of cost and performance.
- Move to full fine-tuning only when QLoRA leaves measurable quality on the table and you have the compute.
- Use adapters when you need to swap multiple task-specific behaviors at inference time.
Instruction Fine-Tuning, RLHF, DPO, IPO, KTO, SimPO, and GRPO
After SFT, models are further refined to follow instructions and align with human preferences.
Instruction Fine-Tuning (IFT). A form of SFT trained on (instruction, response) pairs to teach the model to follow commands.
Reinforcement Learning from Human Feedback (RLHF). A multi-stage process to align an LLM with subjective human values:
- SFT: initial fine-tune on a high-quality instruction dataset.
- Reward Model (RM) Training: human annotators rank multiple responses per prompt, and a separate model is trained to predict a scalar reward.
- RL Optimization: PPO (arxiv.org/abs/1707.06347) typically, with a KL penalty to keep the policy close to the SFT model.
Direct Preference Optimization (DPO). Bypasses the explicit reward model and the RL loop by deriving a closed-form policy update from preference pairs. Simpler to implement and more stable. Source: arxiv.org/abs/2305.18290.
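The closed-form objective is compact enough to show. This is the loss DPO optimizes, sketched for a single pair with scalar log-probs; it is not trl's implementation, and the summed log-prob inputs are assumed to be computed elsewhere.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * margin), where the margin
    is the chosen log-ratio minus the rejected log-ratio (policy vs reference)."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already prefers the chosen response, so the loss drops below log(2)
print(round(dpo_loss(-5.0, -9.0, -6.0, -6.0), 3))  # 0.513
```

The reference model only enters through the log-ratios, which is what lets trl skip an explicit `ref_model` when the policy is a PEFT model: disabling the adapters recovers the reference.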
DPO variants that matter in 2026:
- IPO (Identity Preference Optimization) softens DPO’s BT-model assumption.
- KTO (Kahneman-Tversky Optimization) uses single labeled examples instead of pairs, which is useful when preferences are unary.
- SimPO removes the reference-model KL term for simpler training.
Group Relative Policy Optimization (GRPO). Introduced in the DeepSeekMath paper and later popularized by DeepSeek-R1. Samples a group of K completions per prompt, computes a verifiable reward per completion (math correctness, code unit tests, format match), and uses the within-group advantage to update the policy. No separate reward model needed. The recipe that drove the open-r1 line in 2025-2026. Source: arxiv.org/abs/2402.03300.
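The within-group advantage is the core of the method and fits in a few lines. A minimal sketch, assuming a binary verifiable reward has already been computed per completion:

```python
from statistics import mean, stdev

def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """GRPO-style advantages: normalize each completion's reward against
    the mean and std of its own group of K sampled completions."""
    mu = mean(rewards)
    sigma = stdev(rewards)  # sample std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# K=4 completions for one prompt; reward = 1.0 if unit tests pass, else 0.0
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
```

Completions that beat their own group get positive advantage and are reinforced; the group baseline replaces the learned value or reward model that PPO-style RLHF would need.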
Selection guide:
- Most teams: SFT then DPO. Stable, well-understood.
- Reasoning tasks with verifiable rewards: SFT then GRPO.
- Subjective alignment (helpfulness, harmlessness) where DPO underperforms: full RLHF.
- Unary preference signals (thumbs up only): KTO.
Mixture of Experts for Modular and Scalable LLM Adaptation
Mixture of Experts (MoE) replaces dense feed-forward layers with a set of expert sub-networks plus a learned router. Only a small number of experts activate per token, so MoE models scale to trillion-parameter sizes at constant per-token FLOPs.
In 2026 MoE matters for fine-tuning in two ways:
- Targeted expert fine-tuning. You can fine-tune only the experts most relevant to a domain, reducing the trainable parameter count by another order of magnitude.
- Mixture-of-Agents (at the system level). A controller routes tasks across specialized models or agents. This is an orchestration pattern, not a single-model architecture. The agent-level case is covered in Multi-Agent Systems in 2026.
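The routing step described above can be sketched in a few lines. This is a generic top-k router, not any particular MoE model's implementation; the token count, hidden size, and expert count are arbitrary.

```python
import torch
import torch.nn.functional as F

def route_top_k(x: torch.Tensor, router_w: torch.Tensor, k: int = 2):
    """Pick the top-k experts per token from router logits and return
    their indices plus gate weights renormalized over the chosen experts."""
    logits = x @ router_w                 # [tokens, n_experts]
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    gates = F.softmax(topk_vals, dim=-1)  # softmax over the k selected experts only
    return topk_idx, gates

tokens = torch.randn(8, 512)   # 8 tokens, hidden size 512
router = torch.randn(512, 16)  # 16 experts
idx, gates = route_top_k(tokens, router)
print(idx.shape, gates.sum(dim=-1))  # [8, 2]; gates sum to 1 per token
```

Because only k of the 16 expert FFNs run per token, compute per token stays constant as experts are added, which is the scaling property the section describes.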
The 2026 Fine-Tuning Framework Landscape: trl, unsloth, axolotl, DeepSpeed, Llama Factory
| Framework | License | Strengths | Best for |
|---|---|---|---|
| trl (Hugging Face) | Apache 2.0 | Canonical SFT, DPO, GRPO trainers. Tight HF ecosystem | Standard recipes, research workflows |
| unsloth | Apache 2.0 | 2-5x faster training, 50-70% lower VRAM, single-GPU focus | Cost-sensitive single-GPU fine-tuning |
| axolotl | Apache 2.0 | Declarative YAML configs, DeepSpeed/FSDP integration | Reproducible recipes at scale |
| DeepSpeed | Apache 2.0 | ZeRO sharding, MoE training, multi-node distributed | Full fine-tuning of 70B+ models |
| Llama Factory | Apache 2.0 | Web UI, broad method coverage | Teams that want a UI workflow |
Source: github.com/huggingface/trl, github.com/unslothai/unsloth, github.com/OpenAccess-AI-Collective/axolotl, github.com/deepspeedai/DeepSpeed, github.com/hiyouga/LLaMA-Factory.
Practical recommendation: use unsloth for prototyping on a single GPU. Move to axolotl + DeepSpeed when you need multi-node or FSDP. Use trl directly when you want the canonical research code path for DPO, GRPO, or KTO.
Strategic Data Preparation and Management for LLM Fine-Tuning
Task Definition: Align Use-Case Requirements with the Right Metrics Before Training
The first step in any fine-tuning project is to clearly define the task and pick evaluation metrics that match the practical objective.
- Classification. Accuracy, F1, macro/micro F1, confusion matrix.
- Generation (translation, summarization). BLEU, ROUGE, BERTScore, plus LLM-judge scores for quality.
- Reasoning. Pass@k on held-out math or code benchmarks, plus task-completion judges.
- Domain-language understanding. Perplexity on held-out domain text, plus downstream task accuracy.
- RAG. Faithfulness, context relevance, answer relevance, citation precision.
The metric you pick before training determines the data you collect and the validation set you reserve. Get this wrong and the rest of the pipeline drifts.
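For the reasoning row above, pass@k is usually computed with the unbiased estimator from the Codex/HumanEval paper rather than by literally sampling k completions:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    (without replacement) from n generations, c of them correct, passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k samples: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=1))  # 0.25: 5 correct out of 20 generations
```

Generating n > k samples per problem and applying this estimator gives a much lower-variance score than sampling exactly k, which matters when comparing a fine-tuned candidate against its base.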
Advanced Data Collection and Curation
- Domain adaptation data. A corpus of text representative of the target domain. Legal contracts, medical notes, internal docs. Used for continued pretraining.
- Instruction datasets. Structured (prompt, response) pairs. Quality beats quantity. 1,000 clean examples often beat 10,000 noisy ones.
- Preference data. Pairs of (chosen, rejected) responses per prompt for DPO. Or scalar reward labels for GRPO when the reward is verifiable.
- Data synthesis. Use a strong judge model (GPT-5, Claude Opus 4.7) to generate instructions, responses, and preference labels at scale. Always evaluate synthetic data before training; see Synthetic Data for Fine-Tuning LLMs.
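For reference, one preference record in the prompt/chosen/rejected shape that trl's DPOTrainer documents as its pairwise dataset format; the field contents here are illustrative.

```python
import json

# One DPO preference record; a dataset is one such JSON object per JSONL line
record = {
    "prompt": "Summarize this contract clause in plain English: ...",
    "chosen": "The tenant must give 30 days' written notice before ending the lease.",
    "rejected": "This clause is about the lease.",
}
print(json.dumps(record))
```

The same (chosen, rejected) pairs can be produced by human annotators or by a judge model ranking sampled responses; either way, the record shape is what the trainer consumes.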
Cleaning, Augmentation, and Anonymization
- Data cleaning. Remove duplicates, correct errors, strip artifacts, standardize encoding.
- Augmentation. Back-translation, synonym replacement, paraphrasing. Used to balance representation across demographic groups and reduce overfitting on small datasets.
- Anonymization and compliance. Strip PII from training data. Audit for copyrighted material. Verify dataset license. Future AGI’s PII scanner can run as a preprocessing pass to flag risky records before training.
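The dedup-then-redact steps above can be sketched as a single preprocessing pass. The regexes below are hypothetical illustrations (an SSN-like pattern and a simple email pattern); a production pipeline would use a dedicated PII scanner rather than hand-rolled patterns.

```python
import hashlib
import re

# Hypothetical PII patterns for illustration only
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US-SSN-like number
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]

def clean(records: list[str]) -> list[str]:
    seen, out = set(), []
    for text in records:
        key = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if key in seen:
            continue  # drop exact (case/whitespace-insensitive) duplicates
        seen.add(key)
        for pat in PII_PATTERNS:
            text = pat.sub("[REDACTED]", text)  # mask PII matches
        out.append(text)
    return out

print(clean(["Contact jo@acme.com", "Contact jo@acme.com", "SSN 123-45-6789"]))
```

Running a pass like this before training is cheap insurance: duplicates inflate effective epoch counts on those records, and unredacted PII in training data is a compliance liability that is hard to remove after the fact.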
A Minimal 2026 Fine-Tuning Recipe: QLoRA + DPO + Eval
```python
# Step 1: QLoRA SFT with unsloth (single H100)
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    load_in_4bit=True,  # QLoRA: 4-bit NF4 base, 16-bit LoRA adapters on top
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # `tokenizer=` in older trl releases
    train_dataset=sft_dataset,
    args=SFTConfig(output_dir="./sft-out", num_train_epochs=3),  # config object, not a dict
)
trainer.train()
```

```python
# Step 2: DPO on top of SFT with trl
from trl import DPOConfig, DPOTrainer

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a PEFT model, trl derives the frozen reference automatically
    train_dataset=preference_dataset,
    processing_class=tokenizer,
    args=DPOConfig(output_dir="./dpo-out", num_train_epochs=1, beta=0.1),
)
dpo_trainer.train()
```

```python
# Step 3: Evaluate the candidate against the base model with Future AGI
import os

from fi.evals import evaluate
from fi.evals.llm import LiteLLMProvider
from fi.evals.metrics import CustomLLMJudge

os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."

eval_prompts = ["Summarize this contract clause...", "Draft a response to..."]
base_outputs = ["base model output 1", "base model output 2"]
candidate_outputs = ["fine-tuned output 1", "fine-tuned output 2"]

domain_judge = CustomLLMJudge(
    name="domain_quality",
    provider=LiteLLMProvider(model="gpt-5-2025-08-07"),
    prompt="Is the response on-tone, factual, and complete for the domain? YES or NO.",
)

for label, outputs in [("base", base_outputs), ("candidate", candidate_outputs)]:
    for prompt, output in zip(eval_prompts, outputs):
        print(label, "faithfulness:", evaluate("faithfulness", output=output, context=prompt))
        print(label, "task_completion:", evaluate("task_completion", input=prompt, output=output))
        print(label, "domain_judge:", domain_judge(output=output))
```
Promote the fine-tuned candidate only after the eval pass-rate beats the base model on your real workload and the regression suite stays green. See How to Build an LLM Evaluation Framework for the full eval design pattern.
Where Future AGI Fits in the Fine-Tuning Loop
Future AGI is the evaluation, simulation, and observability layer for fine-tuned models, not a fine-tuning provider. The fit:
- Dataset curation. Run candidate training data through 50+ eval templates (safety, PII, factuality, redundancy) before it reaches the trainer.
- Pre-deployment replay. Use the prototype harness to A/B test the fine-tuned candidate against the base model on real production prompts.
- Production guardrails. Wrap the deployed model in Agent Command Center (/platform/monitor/command-center) to apply PII redaction, prompt injection screening, and brand-tone enforcement on routed calls.
- Drift detection. Continuous eval pass-rates on production traces flag when the fine-tuned model degrades against fresh data.
The fine-tune itself runs in trl, unsloth, axolotl, or DeepSpeed. Future AGI is the layer that tells you whether the fine-tune was worth shipping.
Bottom Line: How to Run a Fine-Tuning Project in 2026
- Start with prompting and RAG. Only fine-tune if these fail.
- Pick QLoRA with unsloth on a single H100 as the default first attempt.
- Layer DPO if you have preference pairs and need alignment.
- Use GRPO if your reward is verifiable (math, code, format).
- Evaluate every candidate in Future AGI before promotion. Hold a private regression set the model never sees.
- Wrap the deployed model in Agent Command Center for guardrails and drift detection.
The fine-tuning landscape moves fast, but the evaluation discipline does not. The teams that ship reliable fine-tuned models in 2026 are the teams that treat eval as the bottleneck, not training.
Frequently asked questions
What is the right fine-tuning method for most teams in 2026?
How is GRPO different from RLHF and DPO?
Which fine-tuning framework should I use, trl, unsloth, or axolotl?
Do I need to fine-tune in 2026 or can I get away with prompting and RAG?
How do I evaluate a fine-tuned model?
What is QLoRA and why is it the default in 2026?
Can Future AGI fine-tune models for me?
How big does my fine-tuning dataset need to be?