
LLM Fine-Tuning Techniques in 2026: From Feature-Based to LoRA, QLoRA, SFT, and DPO

LLM fine-tuning techniques in 2026: feature-based, full fine-tune, LoRA, QLoRA, BitFit, SFT, DPO, RLHF, multi-task. When to use each and how to evaluate.

LLM fine-tuning techniques in 2026: the landscape in one read

Fine-tuning in 2026 is no longer a single decision. It is a stack: pick a base model, pick how many parameters you will touch (feature-based, head-only, full, or PEFT), pick the supervision shape (SFT, DPO, RLHF), and pick how you will evaluate the result. The methods below are the ones that show up in real production pipelines, and the trade-offs that decide which one you reach for.

TL;DR

| Method | Trainable params | Best for | Typical cost |
| --- | --- | --- | --- |
| Feature-based | 0 (base) + small head | Quick classification baselines on top of an encoder | Lowest |
| Head-only | Just the classifier head | Encoder LLM classification, retrieval rerankers | Low |
| Full fine-tuning (full-parameter SFT or DPO) | 100% | Highest quality on a narrow domain when you can afford it | Highest |
| LoRA | ~0.1 to 1% | The default PEFT in 2026; adapters per tenant or task | Low |
| QLoRA | ~0.1 to 1% on 4-bit base | Fine-tuning big base models on a single GPU | Lowest of large-model options |
| BitFit | Bias-only | Tiny budget, surprisingly strong | Very low |
| Adapters / IA^3 / Prefix | Small inserted modules | Hugging Face PEFT alternatives to LoRA | Low |
| Instruction tuning (SFT) | Depends | Teaching a base to follow instructions | Medium |
| DPO / IPO / KTO | Depends | Preference tuning without a reward model | Medium |
| RLHF / RLAIF | Depends | Final alignment for objectives DPO cannot capture | Highest |
| Multi-task / sequential | Depends | Generalization across related tasks | Medium |

What changed since 2025

Three forces reshape the 2026 fine-tuning playbook:

  1. DPO is the new default for preference tuning. Direct Preference Optimization (Rafailov et al., 2023, arXiv:2305.18290) and its cousins (IPO, KTO, ORPO) replace the SFT-then-PPO pipeline for most teams. They are simpler to implement and produce competitive quality.
  2. QLoRA-class methods make big-model adaptation accessible. With 4-bit NF4 quantization, paged optimizers, and double quantization (Dettmers et al., 2023, arXiv:2305.14314), a single 48GB GPU can fine-tune a 65B-class base model.
  3. Prompting and retrieval close more of the gap than they used to. Prompt optimization libraries (DSPy, GEPA, ProTeGi), long-context windows, and structured tool use mean that many problems that once required fine-tuning are now solved by better prompts and better retrieval. Fine-tune when prompting cannot close the gap, not by default.

Foundational fine-tuning techniques

Feature-based approach

A pre-trained LLM produces contextual embeddings for each input. The base is usually an encoder such as BERT or DeBERTa, but it can also be a decoder-only chat model whose hidden states you expose through a provider embeddings API or by mean-pooling its last hidden state. Those embeddings feed a downstream classifier (SVM, logistic regression, a small MLP). The base model is frozen, so training cost is whatever you spend on the head.

  • Pros: cheapest method, deterministic at inference if the base is fixed, easy to swap heads per task.
  • Cons: ceiling is bounded by what is encoded in the embeddings. Modern decoder-only chat models often expose embeddings less cleanly than encoders.
  • When to use: quick classification baselines, retrieval rerankers, telemetry classifiers, anything where you need a working model in an afternoon.
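
As a concrete illustration, here is a minimal sketch of the feature-based recipe with a frozen Hugging Face encoder and a scikit-learn head; the checkpoint name, mean-pooling choice, and toy labels are illustrative, not a recommendation.

import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

# Frozen encoder: no gradients ever flow into the base model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed(texts):
    # Mean-pool the last hidden state into one vector per input.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state        # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)           # (batch, tokens, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Toy training data; in practice this is your labeled task data.
texts = ["refund my order please", "what are your opening hours"]
labels = [1, 0]  # 1 = billing, 0 = general

head = LogisticRegression(max_iter=1000).fit(embed(texts), labels)
print(head.predict(embed(["please cancel and refund my purchase"])))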

Partial / head-only fine-tuning

Freeze the base model body and train only a task-specific head on top. This sits between feature-based (no body gradients at all) and full fine-tuning (every layer learns).

  • Pros: small compute footprint, much harder to forget existing capabilities.
  • Cons: still bottlenecked by the frozen body; can plateau on harder tasks.
  • When to use: classification on top of an encoder LLM, span extraction, scoring heads.
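
A minimal sketch of the freeze-the-body pattern with a Hugging Face sequence-classification model; the checkpoint and label count are illustrative, and the frozen/trainable split is the point.

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)

# Freeze the body; leave only the pooler and classification head trainable.
for name, param in model.named_parameters():
    param.requires_grad = "classifier" in name or "pooler" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} of {total:,} parameters")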

Full fine-tuning (full-parameter training)

Unfreeze every parameter and train end-to-end. “Full fine-tuning” is the parameter-scope choice (every weight is trainable); whether you train it with supervised data (full-parameter SFT) or preference data (full-parameter DPO) is a separate axis. This is the maximum-expressivity option and still the gold standard when you can afford the compute and the data.

  • Pros: highest ceiling; the model can change its behavior anywhere in the network.
  • Cons: GPU memory and time are dominant; catastrophic forgetting is real and needs to be managed; weights are large to ship and version.
  • When to use: large supervised data sets, narrow domains where quality matters more than cost, model distillation into a smaller target.

Parameter-efficient fine-tuning (PEFT)

In 2026, parameter-efficient fine-tuning is the default for most product teams. The Hugging Face PEFT library (github.com/huggingface/peft) packages the common methods.

LoRA (Low-Rank Adaptation)

LoRA (Hu et al., 2021, arXiv:2106.09685) injects a pair of trainable low-rank matrices A and B alongside the attention and MLP projection weights of a transformer. The base weights stay frozen; only A and B are trained, and their product A @ B forms the weight update. The result is an adapter that is typically 0.1 to 1 percent the size of the base model.

  • Pros: small memory footprint, fast training, adapters are cheap to version and swap, easy to merge into base weights at deploy.
  • Cons: the rank is a real bottleneck on very large tasks; very high-rank updates can approach full fine-tuning costs.
  • When to use: the default starting point for 2026 fine-tuning workflows.
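
A minimal LoRA setup with the Hugging Face PEFT library; the base checkpoint, rank, alpha, and target modules below are illustrative starting values rather than a tuned recipe.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,                      # rank of the low-rank factors A and B
    lora_alpha=32,             # scaling applied to the low-rank update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically ~0.1 to 1% of the base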

QLoRA

QLoRA (Dettmers et al., 2023, arXiv:2305.14314) keeps the base model in 4-bit NormalFloat (NF4) precision, uses double quantization for the quantization constants, and pages optimizer states to CPU memory. LoRA matrices are still trained in higher precision on top.

  • Pros: lets you fine-tune very large base models (up to 65B-class) on a single 48GB GPU.
  • Cons: training is slower than plain LoRA on fp16/bf16; numerics require care for stability.
  • When to use: GPU memory is the binding constraint and you want a big base.
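
A minimal QLoRA setup, assuming bitsandbytes and accelerate are installed; the checkpoint and hyperparameters are illustrative, and the QLoRA ingredients (NF4 weights, double quantization, higher-precision compute) map directly onto the config fields.

import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat base weights
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)  # gradient checkpointing, norm casts

model = get_peft_model(
    base,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)
# Paged optimizer states come from the trainer config, e.g. optim="paged_adamw_8bit".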

BitFit

BitFit (Ben Zaken et al., 2022, arXiv:2106.10199) updates only the bias terms of the model. Trainable parameter count drops by another order of magnitude versus LoRA.

  • Pros: smallest memory footprint of any non-trivial fine-tune.
  • Cons: ceiling is lower than LoRA on most generation tasks.
  • When to use: lightweight classification, adapter-budget-constrained settings, ablation baselines.
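
The whole trick fits in a few lines: freeze everything, then re-enable gradients only for bias terms. The checkpoint is illustrative; any transformer with bias parameters works.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# BitFit: only bias terms receive gradients.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"bias-only fraction: {trainable / total:.4%}")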

Adapters, IA^3, Prefix-Tuning

  • Adapters (Houlsby et al., 2019) insert small trainable feed-forward modules between transformer layers.
  • IA^3 (Liu et al., 2022) scales attention and feed-forward activations with learned vectors.
  • Prefix-tuning (Li and Liang, 2021) prepends trainable continuous vectors to the attention key/value cache.

All three are PEFT alternatives to LoRA available through Hugging Face PEFT. LoRA is the default in 2026, but these remain useful for specific architectures or budget regimes.
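
Swapping methods is mostly a config change in PEFT. The sketch below shows IA^3 and prefix-tuning configs that drop into the same get_peft_model call used for LoRA; the target module names follow Llama-style naming and are illustrative.

from peft import IA3Config, PrefixTuningConfig, get_peft_model

ia3_config = IA3Config(
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "v_proj", "down_proj"],  # where the learned scaling vectors apply
    feedforward_modules=["down_proj"],
)

prefix_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=30,  # length of the trainable prefix prepended to the K/V cache
)

# model = get_peft_model(base, ia3_config)   # or prefix_config; base is a loaded causal LM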

Supervision regimes: SFT, DPO, RLHF

Parameter scope decides which parameters move. The supervision regime decides what objective they move toward.

Instruction tuning (SFT)

Supervised fine-tuning on instruction-response pairs. The base learns to follow natural-language commands. Open recipes like Allen AI’s Tulu 3 (arxiv.org/abs/2411.15124) document curated data mixtures spanning math, code, instruction-following, and conversational skills (targeting evaluations such as GSM8K and IFEval), and are a good reference for both data composition and hyperparameters.
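
A minimal SFT loop with TRL's SFTTrainer; the checkpoint, data file, and column layout are illustrative, and argument names shift between TRL versions, so treat this as the shape of the job rather than a drop-in script.

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Rows like {"text": "<instruction and response rendered into one training string>"}.
dataset = load_dataset("json", data_files="instructions.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",   # TRL loads the checkpoint and tokenizer
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-out", num_train_epochs=2, learning_rate=2e-5),
)
trainer.train()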

Direct preference optimization (DPO)

DPO (Rafailov et al., 2023, arXiv:2305.18290) trains directly on triples of (prompt, preferred response, rejected response) using a classification-style loss derived from the Bradley-Terry preference model. No separate reward model, no PPO loop.

  • Pros: simpler training stack than RLHF, competitive quality on most benchmarks, plays well with LoRA.
  • Cons: very sensitive to data quality; near-duplicate preferred and rejected pairs collapse the signal.
  • Variants: IPO (Azar et al., 2023), KTO (Ethayarajh et al., 2024), ORPO (Hong et al., 2024).
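
A minimal DPO run with TRL's DPOTrainer on (prompt, chosen, rejected) triples, starting from the SFT checkpoint; paths and hyperparameters are illustrative, and the tokenizer argument is named processing_class in recent TRL versions (tokenizer in older ones).

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("sft-out")   # start from the SFT model
tokenizer = AutoTokenizer.from_pretrained("sft-out")

# Rows like {"prompt": ..., "chosen": ..., "rejected": ...}.
pairs = load_dataset("json", data_files="preferences.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-out", beta=0.1),  # beta scales the implicit KL penalty
    train_dataset=pairs,
    processing_class=tokenizer,
)
trainer.train()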

RLHF and RLAIF

Train a reward model from human preferences (or AI preferences for RLAIF), then optimize the LLM against the reward with PPO or another RL algorithm. Frontier labs still run RLHF for final alignment because it can express objectives that pure preference triples cannot. For most teams, DPO is the cheaper substitute.

Practical considerations

Data preparation

  • Quality over volume. A 5k well-curated SFT set usually outperforms a 50k noisy set on the same task.
  • Deduplication and contamination checks. Make sure evaluation data is not in the training set; a minimal dedup sketch follows this list.
  • Privacy. When fine-tuning on user data, follow your retention and consent policies. Techniques like differential privacy and PII redaction belong in the preprocessing step.
  • Mix some base data. Including 5 to 20 percent of broader instruction data during domain SFT reduces catastrophic forgetting on adjacent tasks.
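
A small sketch of the exact-match version of deduplication and contamination checking; the normalization and record shape ({"prompt": ..., "response": ...}) are illustrative, and fuzzy or n-gram overlap checks are the natural next step.

import hashlib

def fingerprint(text):
    # Case-fold and collapse whitespace before hashing so trivial edits still match.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def dedupe(examples):
    seen, unique = set(), []
    for ex in examples:
        key = fingerprint(ex["prompt"] + ex["response"])
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

def contamination_count(train, evalset):
    # How many training prompts also appear in the evaluation set.
    eval_keys = {fingerprint(ex["prompt"]) for ex in evalset}
    return sum(fingerprint(ex["prompt"]) in eval_keys for ex in train)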

Hyperparameter optimization

The hyperparameters that matter most:

  • Learning rate (1e-5 to 5e-4 for LoRA; 1e-6 to 5e-5 for full SFT, depending on model size).
  • Batch size and gradient accumulation.
  • Number of epochs (usually 1 to 3 for SFT, 1 for DPO).
  • For LoRA: rank r, alpha, and which modules to target (often q_proj, k_proj, v_proj, o_proj, sometimes MLP projections).

Use a held-out validation set, not the training loss, to decide when to stop. Bayesian optimization tools or simple sweeps work fine; the sample efficiency of LoRA makes coarse sweeps tractable.

Avoiding overfitting and catastrophic forgetting

  • Prefer LoRA so the original weights are untouched.
  • Use early stopping on a validation set.
  • Mix in baseline data (replay).
  • Use elastic-weight consolidation or L2 regularization on critical parameters for full SFT.
  • Run capability benchmarks (MMLU, IFEval, HumanEval) before and after to quantify regressions.

How to evaluate a fine-tuned model in 2026

The hardest part of fine-tuning is knowing whether the result is actually better. The 2026 standard:

  1. A frozen domain-specific golden set you control, scored with deterministic metrics (exact match, BLEU, ROUGE, code execution, structured-output parsing).
  2. Public capability benchmarks to detect regression (MMLU, IFEval, GSM8K, HumanEval, MT-Bench).
  3. LLM-as-a-judge evaluation for open-ended outputs (faithfulness, answer relevance, tool correctness, helpfulness).
  4. A live shadow trace evaluation on captured user requests once the model is in staging.

Future AGI’s evaluation library is one option for points 3 and 4: fi.evals.evaluate runs cloud judges like turing_flash (about 1 to 2 seconds), turing_small (about 2 to 3 seconds), or turing_large (about 3 to 5 seconds) on the same traces you collect from observability, with metrics for faithfulness and answer relevance baked in.

from fi.evals import evaluate

score = evaluate(
    "faithfulness",
    output="Eiffel Tower is 330 meters tall.",
    context="The Eiffel Tower in Paris is 330 meters tall.",
)
print(score)

For traces, the open-source traceAI library (Apache 2.0) ships OpenTelemetry-native instrumentors for OpenAI, Anthropic, LangChain, LlamaIndex, OpenAI Agents, and MCP servers. The same run_id that flows through fi_instrumentation.register and FITracer is the join key for evaluation downstream.

Choosing the right technique

Use this short decision tree:

  • You want a classification baseline this afternoon. Feature-based or head-only on an encoder LLM.
  • You want to teach a small open-weight base to follow your prompts. SFT + LoRA. QLoRA if the base is large.
  • Your SFT model is correct but is not aligned to user preferences. Add DPO on top of LoRA.
  • You need maximum quality on a narrow domain and can afford the compute. Full SFT, then DPO.
  • You only have bias-budget compute. BitFit; treat it as a ceiling experiment.
  • You are exhausting prompting and retrieval and still missing a measured target. Fine-tune. Otherwise, prompt optimize and instrument first.

Where Future AGI fits

Fine-tuning is owned by Hugging Face Transformers + PEFT, TRL, axolotl, Unsloth, and similar frameworks; Future AGI is not a fine-tuning framework. Future AGI’s role around fine-tuning is the evaluation and observability layer:

  • fi.evals.evaluate and fi.evals.Evaluator (with fi.evals.metrics.CustomLLMJudge and fi.evals.llm.LiteLLMProvider) score the model before and after tuning.
  • traceAI + fi_instrumentation.register + FITracer capture production traces of the fine-tuned model so regressions surface in live traffic.
  • The Agent Command Center BYOK gateway at /platform/monitor/command-center routes between your fine-tuned model and frontier baselines so you can A/B test in production.

A fine-tuning framework on one side and evaluation + observability on the other is the split most 2026 teams settle on.

Summary

LLM fine-tuning in 2026 is a stack of choices: scope (feature-based, head-only, PEFT, full), regime (SFT, DPO, RLHF), and evaluation (golden set, benchmarks, LLM-as-judge, live trace). The default starting point is LoRA + SFT, with QLoRA when memory is tight and DPO on top for preference alignment. RLHF and full fine-tuning are escalation paths, not defaults. The decision that matters more than which method to pick is whether you should be fine-tuning at all; prompt-optimize and instrument first, fine-tune when the gap is real and measured.

Frequently asked questions

What is LLM fine-tuning in 2026?
Fine-tuning takes a base or chat-tuned LLM and continues training it on data that represents the task or behavior you want. In 2026, full-parameter SFT is still the highest-ceiling option, but parameter-efficient methods like LoRA and QLoRA dominate practical workflows because they cost an order of magnitude less compute and produce adapters that are easy to version, serve, and merge.
Should I use full fine-tuning or LoRA?
Use LoRA (and QLoRA on top of it) when you want fast iteration, low GPU cost, and an adapter you can swap or merge per tenant. Use full fine-tuning when you need maximum quality on a narrow domain and you can afford the compute and memory, or when you plan to continue training on a large data volume where adapter rank becomes a bottleneck. For most product teams, LoRA is the default and full fine-tuning is the escalation path.
What is QLoRA and when should I pick it over plain LoRA?
QLoRA (Dettmers et al., 2023) is LoRA applied on top of a base model quantized to 4-bit NormalFloat (NF4) with paged optimizers and double quantization. It lets you fine-tune a 65B-class model on a single 48 GB GPU, which is the regime that makes adapter creation accessible to small teams. Pick QLoRA when GPU memory is the binding constraint; plain LoRA when you have enough memory and want slightly cleaner numerics.
What is the difference between SFT, DPO, and RLHF?
SFT (supervised fine-tuning) teaches a model to produce the right answer from labeled examples. RLHF (reinforcement learning from human feedback) trains a reward model from pairwise human preferences and then optimizes the LLM against that reward, classically with PPO. DPO (direct preference optimization, Rafailov et al., 2023) skips the explicit reward model and trains directly on preference pairs with a simple classification-style loss. In 2026, the typical stack is base → SFT → DPO; RLHF remains the gold standard for some objectives but is more expensive to run.
How do I evaluate a fine-tuned model?
Hold out a representative test set before you start, then evaluate on capability benchmarks (e.g., MMLU, HumanEval, IFEval where relevant) plus a domain-specific golden set you control. Track regression on out-of-domain tasks to catch catastrophic forgetting. For production behavior, run LLM-as-a-judge evaluations against captured user traces; Future AGI's `fi.evals.evaluate` and `fi.evals.Evaluator` cover faithfulness, answer relevance, and tool correctness on top of the same traces.
What is catastrophic forgetting and how do I mitigate it?
Catastrophic forgetting is the regression in unrelated abilities you see after fine-tuning on a narrow slice of data. Mitigations include: keep base data in the mix at a small percentage (replay), use a low learning rate and short training, prefer LoRA over full fine-tuning so the original weights are untouched, and use elastic-weight consolidation or regularization on critical parameters when full-tune is required.
How much data do I need to fine-tune a base LLM in 2026?
For instruction-following SFT, a few thousand high-quality examples can produce a measurable lift on a narrow domain. For preference tuning (DPO), 1k to 10k well-curated preference pairs is a typical floor. Open recipes like Tulu 3 (Allen AI, 2024) document the curated mixtures at scale. Quality matters more than volume; spending time on deduplication, contamination checks, and clean labels usually outperforms doubling raw count.
Do I need to fine-tune at all in 2026?
Often no. Long-context windows, tool use, retrieval-augmented generation, prompt optimization (DSPy, GEPA, ProTeGi), and prompt caching solve a large share of the problems that historically required fine-tuning. Fine-tune when (a) prompting cannot close a measured quality gap, (b) you need lower per-request cost than a frontier model, or (c) you need a smaller model footprint for latency, on-prem, or unit economics reasons. Evaluate first; tune second.