LLM Fine-Tuning Guide in 2026: LoRA, QLoRA, DPO, GRPO, RLHF, and How to Evaluate Fine-Tuned Models

Fine-tune LLMs in 2026 with LoRA, QLoRA, GRPO, RLHF, DPO, IPO. Compare trl, unsloth, axolotl, DeepSpeed and learn how to evaluate fine-tuned models.

TL;DR: How to Pick a Fine-Tuning Method in 2026

| Goal | Default method | Framework | When to switch |
| --- | --- | --- | --- |
| Style and format adaptation | LoRA or QLoRA SFT | unsloth or trl | Tiny dataset → prompting + RAG instead |
| Instruction following | SFT then DPO | trl or axolotl | Need deep alignment → RLHF |
| Preference alignment | DPO or IPO | trl | Need on-policy RL → GRPO or PPO |
| Reasoning (math, code) | GRPO | trl or open-r1 | Reward is not verifiable → DPO |
| Domain adaptation | Continued pretraining then SFT | axolotl + DeepSpeed | Single niche → just SFT |
| Eval the result | 50+ templates + held-out set | Future AGI | Always |

Verdict for 2026: start with QLoRA SFT in unsloth on a single H100, layer DPO if you have preference pairs, switch to GRPO when your reward function is verifiable, and evaluate every candidate in Future AGI before promoting it to production.

Why General-Purpose LLMs Fall Short and Where Fine-Tuning Still Earns Its Compute

Frontier base models like GPT-5, Claude Opus 4.7, and Llama 4.x are powerful, but they often lack the specific tone, format, or domain vocabulary a production product needs. Three places where fine-tuning still pays off in 2026:

  • Style and format pinning. A prompt can get you 80% of the way to a brand voice; fine-tuning closes the last 20% without burning prompt tokens at every call.
  • Closed tasks with verifiable rewards. Math, code, structured extraction, and tool-call accuracy respond well to reinforcement learning on top of a strong SFT base. GRPO is the lever.
  • Small-model parity on narrow tasks. A 7B-13B model fine-tuned on a narrow workload can match a 70B-class frontier model at 5-10x lower latency and cost. This is the main reason fine-tuning has survived the rise of capable base models.

Where fine-tuning is the wrong answer in 2026: when the task is open-ended, when ground-truth data is scarce or noisy, when the base model is changing every quarter, or when retrieval can solve the problem more cheaply.

What Changed Since 2025: GRPO, Faster QLoRA, and the Pretrain-Free Reasoning Recipe

The 2025-2026 changes that matter:

  • GRPO went mainstream. DeepSeek-R1 in early 2025 popularized Group Relative Policy Optimization, which trains reasoning capabilities directly via reinforcement learning without a separate reward model. The trl library shipped a stable GRPOTrainer in 2025 and open-r1 published reproducible recipes. Source: arxiv.org/abs/2402.03300 (the DeepSeekMath paper introducing GRPO).
  • QLoRA tooling matured. unsloth’s optimized kernels deliver 2-5x faster training and ~50-70% lower VRAM than naive HF + bitsandbytes on a single GPU. axolotl shipped declarative GRPO recipes by mid-2025.
  • DPO variants stabilized. IPO (Identity Preference Optimization), KTO (Kahneman-Tversky Optimization), and SimPO landed in trl as alternatives to vanilla DPO for cases where the BT-model assumption fails.
  • Synthetic data became standard. Frontier-model judges generate high-quality preference pairs and instruction-response pairs at scale. The catch: you need to evaluate the synthetic data before it goes into training. Future AGI’s eval templates and dataset curation primitives are designed for this loop.
  • Pretrain-free reasoning. Several open recipes show you can get strong reasoning out of a base model with SFT + GRPO alone, no separate reward model required. The implication: most teams will never need full RLHF in 2026.

Why Fine-Tuning Lets You Achieve More with Less: Control, Data Efficiency, Resource Savings

Open-source LLMs like the Llama 4 family or Mistral’s 2025-2026 releases are excellent for general-purpose applications. For tasks that demand high accuracy and domain-specific behavior, fine-tuning is the standard approach.

Fine-tuning is the process of taking a pre-trained, generalist model and further training it on a smaller, curated dataset. This method adapts the model to your specific requirements without the prohibitive cost of training from scratch.

The primary benefits:

  • Greater control and accuracy. The model learns the specific behavior, terminology, and patterns of your domain.
  • Data efficiency. Since the model already possesses a broad foundation, fine-tuning requires significantly less data than training from the ground up.
  • Resource savings. Modern PEFT methods focus on updating a small set of weights, so QLoRA fine-tuning of a 70B-class model fits in a single 48-80GB GPU (a rough memory estimate follows this list).
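
A back-of-envelope estimate shows why that single-GPU claim holds. The adapter fraction (~0.5% of weights) and 32-bit Adam states below are assumptions for illustration, not measurements:

# Rough VRAM estimate for QLoRA on a 70B model (order-of-magnitude only)
params = 70e9
base_weights_gb = params * 0.5 / 1e9   # 4-bit NF4 base weights: ~35 GB
lora_params = 0.005 * params           # assume ~0.5% of parameters carry LoRA adapters
adapters_gb = lora_params * 2 / 1e9    # 16-bit adapter weights: ~0.7 GB
optimizer_gb = lora_params * 8 / 1e9   # Adam m and v states in 32-bit: ~2.8 GB
total_gb = base_weights_gb + adapters_gb + optimizer_gb
print(f"{total_gb:.0f} GB before activations and KV cache")  # ~38-39 GB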

Fine-Tuning Taxonomy: How to Select the Right Method for Your Use Case and Data Availability

Supervised, Semi-Supervised, and Unsupervised LLM Fine-Tuning

Fine-tuning methodologies are categorized by the type of data used for training.

  • Supervised Fine-Tuning (SFT). The most common method for adapting an LLM to a specific downstream task. SFT requires a high-quality labeled dataset where each data point consists of an input and its desired output. Training minimizes a cross-entropy loss between predictions and ground truth (a minimal loss sketch follows this list). This optimizes the model for behaviors like instruction following, classification, or summarization.
  • Unsupervised Fine-Tuning. Often called domain-adaptive pre-training. The process continues the model’s original pre-training objective (next-token prediction) on a large domain-specific corpus such as legal documents or medical research. This helps the model learn vocabulary, syntax, and statistical patterns of the target domain before it is fine-tuned for a specific task.
  • Semi-Supervised Fine-Tuning. A hybrid approach when labeled data is limited but unlabeled data is abundant. A common strategy: unsupervised continued pretraining on the unlabeled corpus, then SFT on the small labeled set. Pseudo-labeling is another option, where a partially trained model generates labels for unlabeled data.
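
For reference, the SFT objective is plain next-token cross-entropy on the labeled pairs. A minimal sketch, assuming logits of shape (batch, seq, vocab) and labels where prompt tokens are masked with -100:

import torch.nn.functional as F

def sft_loss(logits, labels):
    # Shift by one so each position predicts the next token;
    # positions labeled -100 (the prompt) are ignored by the loss.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )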

Feature Extraction vs Full Fine-Tuning vs PEFT

When adapting a pre-trained model, you choose between updating all parameters or only a small subset.

  • Full Fine-Tuning. Unfreezes all layers and updates all parameters during training. Highest performance ceiling, highest cost. Risk of catastrophic forgetting and overfitting on small datasets. Reserve for cases where you have multi-node compute and a clean dataset that justifies it.
  • Feature Extraction. Keeps the base model frozen as a fixed encoder and trains a small classification head on top. Fast and resource-efficient, but the performance ceiling is bounded by the base representations.
  • Parameter-Efficient Fine-Tuning (PEFT). Trains a small fraction of parameters while keeping the base frozen. LoRA, QLoRA, adapters, prefix-tuning, and P-tuning v2 are the main families. PEFT is the default for most teams in 2026.

LoRA, QLoRA, and Adapters: The 2026 PEFT Stack

  • LoRA (Low-Rank Adaptation). Hypothesizes that the change in weight matrices during fine-tuning has low intrinsic rank. Instead of updating the full weight matrix, LoRA injects a pair of trainable low-rank matrices A and B such that the effective update is BA (a minimal sketch follows this list). Trainable parameters drop by 100-1000x. Source: arxiv.org/abs/2106.09685.
  • QLoRA. Quantizes the frozen base model to 4-bit (NF4) and trains LoRA adapters in 16-bit on top. ~4x VRAM reduction vs vanilla LoRA. The default starting point for single-GPU fine-tuning in 2026. Source: arxiv.org/abs/2305.14314.
  • Adapters. Inject small bottleneck networks between transformer layers. Modular; you can swap adapter sets per task without touching the base model.
  • Prefix-Tuning and P-Tuning v2. Learn task-specific continuous “virtual tokens” that steer the frozen base model’s attention pattern. Best when you cannot modify model weights at all (black-box deployment).
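
To make the LoRA bullet concrete, here is a minimal PyTorch sketch of the idea. It is illustrative only, not the peft library's implementation:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Wraps a frozen nn.Linear; only A and B train, so the effective update is BA.
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # BA is zero at init
        self.scaling = alpha / r

    def forward(self, x):
        # Base output plus the low-rank correction x (BA)^T, scaled by alpha/r
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling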

Selection guide for 2026:

  • Default to QLoRA. It is the sweet spot of cost and performance.
  • Move to full fine-tuning only when QLoRA leaves measurable quality on the table and you have the compute.
  • Use adapters when you need to swap multiple task-specific behaviors at inference time.

Instruction Fine-Tuning, RLHF, DPO, IPO, KTO, SimPO, and GRPO

After SFT, models are further refined to follow instructions and align with human preferences.

Instruction Fine-Tuning (IFT). A form of SFT trained on (instruction, response) pairs to teach the model to follow commands.

Reinforcement Learning from Human Feedback (RLHF). A multi-stage process to align an LLM with subjective human values:

  1. SFT: initial fine-tune on a high-quality instruction dataset.
  2. Reward Model (RM) Training: human annotators rank multiple responses per prompt, and a separate model is trained to predict a scalar reward.
  3. RL Optimization: PPO (arxiv.org/abs/1707.06347) typically, with a KL penalty to keep the policy close to the SFT model.

Direct Preference Optimization (DPO). Bypasses the explicit reward model and the RL loop by deriving a closed-form policy update from preference pairs. Simpler to implement and more stable. Source: arxiv.org/abs/2305.18290.
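
The core of DPO is a single logistic loss over policy-vs-reference log-probability margins, which is roughly what trl's DPOTrainer optimizes internally. A minimal sketch, assuming you already have summed per-response log-probabilities:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Margin between the chosen and rejected responses, each measured
    # relative to the frozen reference model, scaled by beta.
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(logits).mean()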

DPO variants that matter in 2026:

  • IPO (Identity Preference Optimization) softens DPO’s BT-model assumption.
  • KTO (Kahneman-Tversky Optimization) uses single labeled examples instead of pairs, which is useful when preferences are unary.
  • SimPO removes the reference-model KL term for simpler training.

Group Relative Policy Optimization (GRPO). Introduced in the DeepSeekMath paper and later popularized by DeepSeek-R1. Samples a group of K completions per prompt, computes a verifiable reward per completion (math correctness, code unit tests, format match), and uses the within-group advantage to update the policy. No separate reward model needed. The recipe that drove the open-r1 line in 2025-2026. Source: arxiv.org/abs/2402.03300.
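
The within-group advantage is the heart of the method. A minimal sketch, assuming a (num_prompts, K) tensor of verifiable rewards; the full objective adds a clipped policy-ratio term and a KL penalty as in the paper:

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # rewards: (num_prompts, K) verifiable rewards for K sampled completions per prompt.
    # Each completion is scored against its own group, so no value network
    # or learned reward model is needed.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)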

Selection guide:

  • Most teams: SFT then DPO. Stable, well-understood.
  • Reasoning tasks with verifiable rewards: SFT then GRPO.
  • Subjective alignment (helpfulness, harmlessness) where DPO underperforms: full RLHF.
  • Unary preference signals (thumbs up only): KTO.

Mixture of Experts for Modular and Scalable LLM Adaptation

Mixture of Experts (MoE) replaces dense feed-forward layers with a set of expert sub-networks plus a learned router. Only a small number of experts activate per token, so MoE models scale to trillion-parameter sizes at constant per-token FLOPs.
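
A minimal sketch of the routing step only; production MoE layers add load-balancing losses and capacity constraints on top of this:

import torch
import torch.nn as nn

class TopKRouter(nn.Module):
    # Learned gate that sends each token to its top-k experts.
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.k = k

    def forward(self, x):                                   # x: (num_tokens, d_model)
        scores = self.gate(x)                               # (num_tokens, num_experts)
        weights, expert_ids = scores.topk(self.k, dim=-1)   # pick k experts per token
        return weights.softmax(dim=-1), expert_ids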

In 2026 MoE matters for fine-tuning in two ways:

  • Targeted expert fine-tuning. You can fine-tune only the experts most relevant to a domain, reducing the trainable parameter count by another order of magnitude.
  • Mixture-of-Agents (at the system level). A controller routes tasks across specialized models or agents. This is an orchestration pattern, not a single-model architecture. The agent-level case is covered in Multi-Agent Systems in 2026.

The 2026 Fine-Tuning Framework Landscape: trl, unsloth, axolotl, DeepSpeed, Llama Factory

| Framework | License | Strengths | Best for |
| --- | --- | --- | --- |
| trl (Hugging Face) | Apache 2.0 | Canonical SFT, DPO, GRPO trainers; tight HF ecosystem | Standard recipes, research workflows |
| unsloth | Apache 2.0 | 2-5x faster training, 50-70% lower VRAM, single-GPU focus | Cost-sensitive single-GPU fine-tuning |
| axolotl | Apache 2.0 | Declarative YAML configs, DeepSpeed/FSDP integration | Reproducible recipes at scale |
| DeepSpeed | Apache 2.0 | ZeRO sharding, MoE training, multi-node distributed | Full fine-tuning of 70B+ models |
| Llama Factory | Apache 2.0 | Web UI, broad method coverage | Teams that want a UI workflow |

Source: github.com/huggingface/trl, github.com/unslothai/unsloth, github.com/OpenAccess-AI-Collective/axolotl, github.com/deepspeedai/DeepSpeed, github.com/hiyouga/LLaMA-Factory.

Practical recommendation: use unsloth for prototyping on a single GPU. Move to axolotl + DeepSpeed when you need multi-node or FSDP. Use trl directly when you want the canonical research code path for DPO, GRPO, or KTO.
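
As an example of that canonical path, here is a minimal GRPOTrainer sketch with a toy verifiable reward. The dataset and reward function are illustrative only, and it assumes a plain-text prompt dataset (trl passes extra columns such as "answer" to reward functions as keyword arguments):

from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Hypothetical verifiable reward: 1.0 when the completion contains the reference answer.
def correctness_reward(completions, answer, **kwargs):
    return [1.0 if ans in completion else 0.0 for completion, ans in zip(completions, answer)]

train_dataset = Dataset.from_dict({
    "prompt": ["What is 17 * 24? Answer with just the number."],
    "answer": ["408"],
})

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="./grpo-out", num_generations=8),
    train_dataset=train_dataset,
)
trainer.train()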

Strategic Data Preparation and Management for LLM Fine-Tuning

Task Definition: Align Use-Case Requirements with the Right Metrics Before Training

The first step in any fine-tuning project is to clearly define the task and pick evaluation metrics that match the practical objective.

  • Classification. Accuracy, F1, macro/micro F1, confusion matrix.
  • Generation (translation, summarization). BLEU, ROUGE, BERTScore, plus LLM-judge scores for quality.
  • Reasoning. Pass@k on held-out math or code benchmarks (an estimator sketch follows this list), plus task-completion judges.
  • Domain-language understanding. Perplexity on held-out domain text, plus downstream task accuracy.
  • RAG. Faithfulness, context relevance, answer relevance, citation precision.
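
For pass@k, the commonly used unbiased estimator (n samples drawn per problem, c of them correct, evaluated at k) is short enough to inline:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k draws (without replacement) is correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)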

The metric you pick before training determines the data you collect and the validation set you reserve. Get this wrong and the rest of the pipeline drifts.

Advanced Data Collection and Curation

  • Domain adaptation data. A corpus of text representative of the target domain. Legal contracts, medical notes, internal docs. Used for continued pretraining.
  • Instruction datasets. Structured (prompt, response) pairs. Quality beats quantity. 1,000 clean examples often beat 10,000 noisy ones.
  • Preference data. Pairs of (chosen, rejected) responses per prompt for DPO (an example record follows this list). Or scalar reward labels for GRPO when the reward is verifiable.
  • Data synthesis. Use a strong judge model (GPT-5, Claude Opus 4.7) to generate instructions, responses, and preference labels at scale. Always evaluate synthetic data before training; see Synthetic Data for Fine-Tuning LLMs.
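
For reference, a single preference record in the (prompt, chosen, rejected) layout that trl's DPOTrainer expects; the content is illustrative only:

preference_example = {
    "prompt": "Summarize this indemnification clause in plain English.",
    "chosen": "The supplier covers losses caused by its own negligence, capped at the contract value.",
    "rejected": "The clause is about indemnification.",
}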

Cleaning, Augmentation, and Anonymization

  • Data cleaning. Remove duplicates, correct errors, strip artifacts, standardize encoding (a minimal dedupe sketch follows this list).
  • Augmentation. Back-translation, synonym replacement, paraphrasing. Used to balance representation across demographic groups and reduce overfitting on small datasets.
  • Anonymization and compliance. Strip PII from training data. Audit for copyrighted material. Verify dataset license. Future AGI’s PII scanner can run as a preprocessing pass to flag risky records before training.
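
A minimal sketch of the exact-duplicate pass, assuming each training record carries a "text" field; near-duplicate detection (for example MinHash) is a common follow-up:

import hashlib

def dedupe(records):
    # Drop exact duplicates by content hash.
    seen, kept = set(), []
    for record in records:
        digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(record)
    return kept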

A Minimal 2026 Fine-Tuning Recipe: QLoRA + DPO + Eval

# Step 1: QLoRA SFT with unsloth (single H100)
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=sft_dataset,  # your curated (instruction, response) dataset
    args=SFTConfig(output_dir="./sft-out", num_train_epochs=3),
)
trainer.train()

# Step 2: DPO on top of the SFT adapters with trl
from trl import DPOConfig, DPOTrainer

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,                    # with a PEFT model, trl derives the frozen reference automatically
    train_dataset=preference_dataset,  # (prompt, chosen, rejected) records
    tokenizer=tokenizer,
    args=DPOConfig(output_dir="./dpo-out", num_train_epochs=1, beta=0.1),
)
dpo_trainer.train()

# Step 3: Evaluate the candidate against the base model with Future AGI
import os
from fi.evals import evaluate
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."

eval_prompts = ["Summarize this contract clause...", "Draft a response to..."]
base_outputs = ["base model output 1", "base model output 2"]
candidate_outputs = ["fine-tuned output 1", "fine-tuned output 2"]

domain_judge = CustomLLMJudge(
    name="domain_quality",
    provider=LiteLLMProvider(model="gpt-5-2025-08-07"),
    prompt="Is the response on-tone, factual, and complete for the domain? YES or NO.",
)

for label, outputs in [("base", base_outputs), ("candidate", candidate_outputs)]:
    for prompt, output in zip(eval_prompts, outputs):
        print(label, "faithfulness:", evaluate("faithfulness", output=output, context=prompt))
        print(label, "task_completion:", evaluate("task_completion", input=prompt, output=output))
        print(label, "domain_judge:", domain_judge(output=output))

Promote the fine-tuned candidate only after the eval pass-rate beats the base model on your real workload and the regression suite stays green. See How to Build an LLM Evaluation Framework for the full eval design pattern.

Where Future AGI Fits in the Fine-Tuning Loop

Future AGI is the evaluation, simulation, and observability layer for fine-tuned models, not a fine-tuning provider. The fit:

  • Dataset curation. Run candidate training data through 50+ eval templates (safety, PII, factuality, redundancy) before it reaches the trainer.
  • Pre-deployment replay. Use the prototype harness to A/B test the fine-tuned candidate against the base model on real production prompts.
  • Production guardrails. Wrap the deployed model in Agent Command Center (/platform/monitor/command-center) to apply PII redaction, prompt injection screening, and brand-tone enforcement on routed calls.
  • Drift detection. Continuous eval pass-rates on production traces flag when the fine-tuned model degrades against fresh data.

The fine-tune itself runs in trl, unsloth, axolotl, or DeepSpeed. Future AGI is the layer that tells you whether the fine-tune was worth shipping.

Bottom Line: How to Run a Fine-Tuning Project in 2026

  1. Start with prompting and RAG. Only fine-tune if these fail.
  2. Pick QLoRA with unsloth on a single H100 as the default first attempt.
  3. Layer DPO if you have preference pairs and need alignment.
  4. Use GRPO if your reward is verifiable (math, code, format).
  5. Evaluate every candidate in Future AGI before promotion. Hold a private regression set the model never sees.
  6. Wrap the deployed model in Agent Command Center for guardrails and drift detection.

The fine-tuning landscape moves fast, but the evaluation discipline does not. The teams that ship reliable fine-tuned models in 2026 are the teams that treat eval as the bottleneck, not training.

Frequently asked questions

What is the right fine-tuning method for most teams in 2026?
QLoRA on a single H100 or A100 is the default starting point for most teams in 2026. It freezes the base model in 4-bit precision and trains a small set of low-rank adapters, which lets a 70B-class model fit on a single 48-80GB GPU. Use full fine-tuning only when you have multi-node compute and a large clean dataset. Reach for DPO when you need preference alignment on top of an SFT base, or GRPO when you have a verifiable reward for reasoning, code, or math tasks.
How is GRPO different from RLHF and DPO?
GRPO (Group Relative Policy Optimization) was introduced in the DeepSeekMath paper (arxiv.org/abs/2402.03300) and popularized by the DeepSeek-R1 line in early 2025. Instead of training a separate reward model the way classical RLHF does, GRPO samples a group of completions per prompt and uses the within-group reward distribution to compute a relative advantage. DPO removes the RL stage entirely by deriving a closed-form policy update from pairwise preferences. RLHF still wins on subjective alignment depth, DPO wins on stability and simplicity, and GRPO wins on reasoning tasks where you have a verifiable reward.
Which fine-tuning framework should I use: trl, unsloth, or axolotl?
Use trl from Hugging Face if you want the canonical SFT, DPO, and GRPO trainers and you are comfortable on the HF stack. Use unsloth when you want 2-5x faster training and 50-70% lower VRAM on a single GPU, especially for LoRA and QLoRA. Use axolotl when you want a declarative YAML config and recipe library that supports DeepSpeed, FSDP, and most current methods. All three are actively maintained in 2026.
Do I need to fine-tune in 2026 or can I get away with prompting and RAG?
Most teams should try prompting plus RAG first. Fine-tune when you need style, format, or domain-language consistency that prompts cannot pin down, when you have a closed task with thousands of high-quality labels, or when you need a small model to match a large model on a narrow workload at lower latency and cost. Reasoning RL (GRPO) is the one place where fine-tuning shows a step change vs prompting.
How do I evaluate a fine-tuned model?
Treat the fine-tuned model as a candidate, not a winner. Run it through the same eval harness as your base model with held-out test sets, automated judges (factuality, task completion, faithfulness on RAG), human review for top-K samples, adversarial probes for safety, and regression tests against the previous production model. Future AGI provides 50+ eval templates plus simulation for agentic replay, which is the evaluation layer this guide assumes.
What is QLoRA and why is it the default in 2026?
QLoRA (Dettmers et al, 2023) loads the base model weights in 4-bit precision using NF4 quantization, freezes them, and trains a small set of low-rank LoRA adapters on top in 16-bit. The result is fine-tuning quality close to full 16-bit LoRA at roughly a quarter of the GPU memory. Every major framework supports it natively in 2026, and a single H100 can fine-tune a 70B-class model without distributed training.
Can Future AGI fine-tune models for me?
No. Future AGI is the evaluation, simulation, and observability layer for fine-tuned models, not a fine-tuning provider. Use trl, unsloth, axolotl, or DeepSpeed to run the fine-tune itself, then pipe outputs into Future AGI's eval templates and prototype harness to compare candidates against your production model. Once a candidate wins, deploy behind Agent Command Center guardrails to catch regressions in production.
How big does my fine-tuning dataset need to be?
For style and format adaptation 1,000-5,000 carefully curated examples is often enough. For instruction following 10K-100K examples is typical. For preference alignment via DPO you want 5K-50K preference pairs. The quality bar is much higher than the size bar: 1,000 clean labels almost always beat 10,000 noisy ones.