Fine-Tuning LLMs in 2026: LoRA, QLoRA, DPO, GRPO Compared
2026 guide to fine-tuning LLMs: LoRA vs QLoRA, DPO vs RLHF vs GRPO, and when to fine-tune open-weight models instead of prompting alone.
TL;DR: Which Fine-Tuning Method to Pick in 2026
| Need | 2026 Recipe | Compute | Notes |
|---|---|---|---|
| Tone or style adaptation | LoRA SFT | 1x24GB GPU | 500 to 2,000 examples |
| Domain adaptation on 70B model | QLoRA SFT | 1x80GB H100 | 4-bit NF4 base, rank 16 to 64 |
| Align to human preferences | DPO | Same as LoRA | 1,000 to 5,000 pairs |
| Reasoning fine-tune with verifier | GRPO | Multi-GPU | DeepSeek-R1 style |
| Safety-critical alignment | RLHF with PPO | Multi-GPU | Reward model required |
| Quick task adaptation, no weights | Prompt tuning | CPU OK | Limited expressiveness |
| Frontier API model | Vendor-hosted fine-tune API | Vendor-hosted | Check OpenAI fine-tune docs for supported snapshots |
When you go to production, evaluate with a held-out set on task metrics plus LLM-as-judge on faithfulness, tool-call correctness, and instruction following. FAGI’s ai-evaluation and traceAI (both Apache 2.0) are designed for exactly this companion role: score the fine-tuned model and the base model on identical prompts and ship only when the delta is real.
Why Frontier LLMs Still Need Fine-Tuning in 2026
Frontier LLMs in 2026 are excellent at general instruction following. They still fail at three things that fine-tuning fixes: exact output schema, narrow domain vocabulary, and brand voice. A base model can describe a SOC 2 control in plausible English, but it will not consistently emit the JSON your downstream parser expects, will not use your internal product nicknames, and will drift away from your support team’s voice.
Fine-tuning is also the answer when you need to compress capability. A 70B-class open-weight model fine-tuned on your task often matches a frontier API model at a fraction of the inference cost. The 2026 cost-per-quality sweet spot for many production agents is a QLoRA-tuned 8B- or 70B-class open-weight model served on your own infra, with RAG on top for fresh knowledge.
This guide covers the four methods that actually matter in 2026: LoRA, QLoRA, DPO, and GRPO. We also cover where full RLHF and prompt tuning still fit, how to estimate cost, and how to evaluate before shipping.
LoRA vs QLoRA: The 2026 PEFT Baseline
LoRA (Hu et al., 2021, arXiv:2106.09685) trains two low-rank matrices A and B so that the adapted weight becomes W + BA, with A and B much smaller than W. The base W stays frozen. You typically train 0.1 to 1 percent of the parameters and ship adapters of 50 to 500 MB.
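As a quick sanity check on that trainable-parameter claim, here is a back-of-envelope count for a single d x d projection (the values of d and r below are illustrative, not prescribed by the paper):
d, r = 4096, 16                 # hidden size and LoRA rank
frozen = d * d                  # base weight W, never updated
trainable = r * d + d * r       # A (r x d) and B (d x r)
print(f"trainable fraction: {trainable / frozen:.2%}")  # 0.78% at these settings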
QLoRA (Dettmers et al., 2023, arXiv:2305.14314) adds two tricks: 4-bit NF4 quantization on the frozen base and paged optimizers. The result is that you can fine-tune a 70B-class model on a single 80GB H100, or a 65B model on a single 48GB GPU. The base never leaves 4-bit during training; gradients flow through the dequantized weights.
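In practice, the only change versus plain LoRA is how the frozen base is loaded. A minimal sketch with Hugging Face Transformers and bitsandbytes (the model name and compute dtype are illustrative):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the frozen base in 4-bit NF4; LoRA adapters are attached on top as usual.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 from the QLoRA paper
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",         # swap in your base model
    quantization_config=bnb_config,
    device_map="auto",
)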
What changed in 2026: most open-source LLM stacks (Hugging Face PEFT, Axolotl, Unsloth, TRL) treat QLoRA as the default and LoRA as the option you choose when you have GPU headroom. Unsloth’s kernels brought QLoRA throughput within roughly 10 to 20 percent of LoRA on common hardware. Pick LoRA when batch size and throughput matter; pick QLoRA when memory is the binding constraint.
A Minimal LoRA Recipe with Hugging Face TRL
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer
from datasets import Dataset

base = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Minimal placeholder dataset. Replace with your real Hugging Face Dataset.
train_dataset = Dataset.from_list([
    {"prompt": "Translate to French: hello", "completion": "bonjour"},
    {"prompt": "Translate to French: goodbye", "completion": "au revoir"},
])

# Rank-16 LoRA adapter on the attention projections; the base weights stay frozen.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="./llama-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()
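After training, ship only the adapter; if your serving stack cannot load PEFT adapters, merge them back into the base weights first. A short follow-up sketch (paths are placeholders):
# Save only the LoRA adapter weights, not the full base model.
trainer.save_model("./llama-lora/adapter")

# Optional: merge the adapter into the base for adapter-unaware serving stacks.
from peft import PeftModel
merged = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(base),
    "./llama-lora/adapter",
).merge_and_unload()
merged.save_pretrained("./llama-lora/merged")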
Refer to the Hugging Face PEFT docs for the current set of supported target modules per architecture.
DPO vs RLHF vs GRPO: Which Alignment Method in 2026
DPO is the Default
DPO (Rafailov et al., 2023, arXiv:2305.18290) reformulates RLHF as a single classification-style loss on pairs of chosen and rejected responses. No reward model. No online sampling. One pass through your preference data.
In 2026, DPO is the first thing you try after SFT. Hugging Face TRL ships a DPOTrainer, and the recipe is stable across the major open-weight families (Llama, Qwen, Mistral, and others). You need 1,000 to 5,000 pairs to see real improvement; more is better up to about 50K, then you start fighting noise.
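A minimal DPO pass with TRL's DPOTrainer, reusing the model, tokenizer, and LoRA config from the SFT recipe above (the preference rows and hyperparameters are placeholders for your real data):
from trl import DPOConfig, DPOTrainer
from datasets import Dataset

# Each row pairs one prompt with a chosen and a rejected response.
pref_dataset = Dataset.from_list([
    {
        "prompt": "Summarize our refund policy in one sentence.",
        "chosen": "Refunds are issued within 14 days of purchase on request.",
        "rejected": "We have a refund policy. It is probably fine. Contact us maybe.",
    },
])

dpo_args = DPOConfig(
    output_dir="./llama-dpo",
    beta=0.1,                      # strength of the KL pull toward the reference model
    per_device_train_batch_size=2,
    learning_rate=5e-6,
)
dpo_trainer = DPOTrainer(
    model=model,                   # SFT checkpoint from the recipe above
    args=dpo_args,
    train_dataset=pref_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,       # with a PEFT config, no separate reference model is needed
)
dpo_trainer.train()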
Use GRPO for Reasoning Fine-Tunes
GRPO was popularized by the DeepSeek-R1 reasoning training pipeline (arXiv:2501.12948) and is the right method when you have a verifier rather than a preference labeler. The model emits a group of N rollouts per prompt; each rollout is scored by the verifier (correctness for math, pass/fail for unit tests, valid JSON for tool calls); rewards are normalized within the group; the policy is updated to favor above-mean rollouts. There is no value model; the group itself is the baseline.
GRPO is the 2026 standard for math, code, and structured-output reasoning fine-tunes. TRL added a GRPOTrainer and Unsloth supports it on consumer GPUs.
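A sketch of the same idea with TRL's GRPOTrainer, using a toy JSON-validity verifier as the reward (the prompt, model name, and hyperparameters are illustrative):
import json
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Verifier-style reward: 1.0 if a completion parses as JSON, else 0.0.
# GRPOTrainer normalizes these rewards within each group of rollouts.
def json_validity_reward(completions, **kwargs):
    rewards = []
    for completion in completions:
        try:
            json.loads(completion)
            rewards.append(1.0)
        except (json.JSONDecodeError, TypeError):
            rewards.append(0.0)
    return rewards

train_dataset = Dataset.from_list([
    {"prompt": "Return a JSON object with keys 'city' and 'country' for Paris."},
])

grpo_args = GRPOConfig(
    output_dir="./llama-grpo",
    num_generations=8,             # group size: rollouts sampled per prompt
    max_completion_length=256,
)
grpo_trainer = GRPOTrainer(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    reward_funcs=json_validity_reward,
    args=grpo_args,
    train_dataset=train_dataset,
)
grpo_trainer.train()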
Keep PPO RLHF for Safety-Critical Alignment
Full RLHF with PPO (Schulman et al., 2017) still has a place: when you need a learned reward model that captures multi-dimensional human preferences, when you want online exploration during training, or when your safety team requires reward shaping for refusal behavior. Anthropic’s Constitutional AI work is in this family. The cost is operational: you need to train and maintain the reward model, and PPO is more sensitive to hyperparameters than DPO or GRPO.
When to Fine-Tune vs RAG vs Prompt
The 2026 decision tree:
- Prompting alone if the base model already nails the task on your eval set.
- RAG plus prompting if the task needs fresh or proprietary knowledge.
- Fine-tuning if you have a stable schema, a real eval, and 500+ in-distribution examples.
- Fine-tuning plus RAG for the cost-per-quality sweet spot on a domain agent.
- GRPO or RLHF only if you have a real reward signal (verifier or preferences).
Skipping the eval step is the most common failure mode. Fine-tuning a model with no held-out task metric is how teams ship regressions and discover them three weeks later in production.
Cost and Compute Budget
Single-GPU QLoRA on an 8B-class open-weight model with 10K examples typically runs overnight on a 24GB consumer card or 2 to 4 hours on a single H100. Cloud spot cost is in the 5 to 30 USD range. QLoRA on a 70B-class model needs an 80GB H100 or 2x48GB and runs 50 to 200 USD per epoch on the same dataset. DPO on a LoRA adapter roughly doubles those numbers because you score two responses per step.
At a fixed base model size, cost is dominated by sequence length, LoRA rank, batch size, and hardware. The honest answer: rent the smallest box that fits, profile one epoch, then decide. Vendor rates change month to month; verify on your provider before budgeting.
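A back-of-envelope budget check before renting anything (every number below is an assumption; replace it with your own one-epoch profiling run):
examples = 10_000
epochs = 3
sec_per_example = 0.5          # measured from a short profiling run on your hardware
gpu_hourly_usd = 2.50          # example spot rate for one GPU; check your provider

gpu_hours = examples * epochs * sec_per_example / 3600
print(f"~{gpu_hours:.1f} GPU-hours, ~${gpu_hours * gpu_hourly_usd:.0f} per full run")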
Evaluating a Fine-Tuned Model Before Shipping
Three layers:
- Task metrics: exact match, BLEU, ROUGE, tool-call accuracy, JSON schema validity, latency, cost per call. Hard floors.
- LLM-as-judge on the qualitative axes you tuned for: faithfulness, instruction following, tone, brand voice.
- Regression check on a held-out general capability set (MMLU subset, your prior production prompts). Make sure you did not catastrophically forget.
Compare against the base model on identical prompts. If the delta is not meaningful on your eval, the fine-tune is not worth shipping.
FAGI’s ai-evaluation library (Apache 2.0) provides the LLM-as-judge layer. A minimal pattern:
import os
from fi.evals import evaluate
os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."
base_output = "..." # response from the base open-weight model
ft_output = "..." # response from your QLoRA fine-tune
context = "..." # source documents or task spec
base_score = evaluate("faithfulness", output=base_output, context=context)
ft_score = evaluate("faithfulness", output=ft_output, context=context)
For agent and tool-call traces, traceAI (Apache 2.0) instruments OpenTelemetry spans on Hugging Face, OpenAI, Anthropic, LangChain, and LlamaIndex calls so you can replay the same prompts on the base and the fine-tuned model side by side. Both repositories ship with an Apache 2.0 license verified on GitHub.
Where Future AGI Fits in a Fine-Tuning Stack
Future AGI does not train models. It is the evaluation and observability companion that tells you whether your fine-tune was worth shipping and catches regressions in production:
- ai-evaluation (Apache 2.0): LLM-as-judge metrics including faithfulness, instruction following, tool-use accuracy, and custom judges via CustomLLMJudge and LiteLLMProvider.
- traceAI (Apache 2.0): OpenTelemetry instrumentation for OpenAI, Anthropic, Hugging Face, LangChain, LlamaIndex, and 30+ frameworks; replay base vs fine-tune on the same trace.
- Agent Command Center at /platform/monitor/command-center: BYOK gateway for production agents with guardrails on inputs and outputs.
For the broader fine-tuning workflow, see our LLM Fine-Tuning Guide for 2025, the deeper Continued LLM Pretraining post, and Synthetic Data for Fine-Tuning LLMs for dataset construction.
Frequently Asked Questions
What is the difference between LoRA and QLoRA in 2026?
When should I use DPO instead of RLHF?
What is GRPO and when is it useful?
Should I fine-tune or use RAG plus prompting in 2026?
How many examples do I need to fine-tune an LLM?
What is the cost of fine-tuning an open-weight LLM with LoRA in 2026?
Can I fine-tune frontier API models in 2026?
How do I evaluate a fine-tuned model before shipping?