
Continued LLM Pretraining in 2026: Frameworks, Strategies, and Evaluation


TL;DR: Continued LLM Pretraining at a Glance

| Choice | When to pick | Token budget | GPU footprint |
| --- | --- | --- | --- |
| Full continued pretraining | Hard domain shift, deep knowledge injection | 10 to 200 billion+ | 32 to 512+ H100/H200 |
| LoRA / DoRA / LoRA-XS | Light to moderate domain adaptation, preserve general capability | 1 to 20 billion | 4 to 16 H100 |
| Replay-augmented continued pretraining | Strong adaptation with low forgetting risk | 5 to 100 billion | 16 to 256 H100 |
| Instruction tuning over a CPT checkpoint | Always do this if you want a usable assistant | 50 thousand to 5 million SFT examples | 1 to 8 H100 |

For evaluation across all four paths, Future AGI evaluators can score capability retention, domain gain, and downstream task usefulness; see the documentation at docs.futureagi.com.

What changed since 2025: PyTorch FSDP2 has become a common sharding option across many training stacks. NVIDIA NeMo Framework 2.0 streamlined the configuration model. The Axolotl config schema converged on a community standard for parameter-efficient continued pretraining. Unsloth pushed single-node continued pretraining throughput to noticeably lower cost. Model-based data quality filters (perplexity scoring, embedding clustering, n-gram dedup) became standard data prep, not optional.

What Continued LLM Pretraining Is

Continued LLM pretraining, sometimes called continual pretraining (CPT) or domain-adaptive pretraining (DAPT), is the process of taking an already pretrained model and training it further on additional text. The objective stays the same as in original pretraining: next-token prediction over raw text. The volume is high (typically billions of tokens), the supervision is implicit, and the change to the model is broad rather than task-specific.
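
Concretely, the objective just shifts labels by one position: every prefix predicts the next token, and continued pretraining reuses this unchanged on new text. A toy sketch in plain Python (the token IDs are invented for illustration):

```python
# Next-token prediction: the label at position i is the token at
# position i + 1. Continued pretraining keeps this objective and
# changes only the text fed in.
def make_lm_examples(token_ids):
    """Return (prefix, next_token) pairs for a causal LM objective."""
    return [(token_ids[:i + 1], token_ids[i + 1]) for i in range(len(token_ids) - 1)]

tokens = [101, 7, 42, 9, 102]  # toy token IDs, not from a real tokenizer
pairs = make_lm_examples(tokens)
# Each prefix predicts the next token:
# ([101], 7), ([101, 7], 42), ([101, 7, 42], 9), ([101, 7, 42, 9], 102)
```

In a real run the framework builds these shifted labels for you; the point is that no labeled pairs are needed, only raw domain text.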

It sits between the pretrained base and the instruction-tuned chat model in the standard pipeline:

Base model (pretrained on web-scale data)
        |
        v
Continued pretraining (domain or fresh text)   <- you are here
        |
        v
Supervised fine-tuning (instruction or task data)
        |
        v
RLHF or DPO (preference alignment)
        |
        v
Deployed assistant

Skip continued pretraining and you can still get a usable assistant on a generic domain. Include it and you get noticeably better domain performance with a smaller fine-tuning budget downstream.

Why Continued Pretraining Matters in 2026

Three drivers keep continued pretraining relevant even as base models get bigger:

  • Knowledge cutoff drift. Frontier base models still ship with cutoffs that are 6 to 18 months stale. Domain-specific knowledge moves faster, especially in regulated industries.
  • Vocabulary and structure. Legal citations, medical codes, and finance instruments have grammar that web data underrepresents. Continued pretraining fixes this faster than RAG alone.
  • Pre-fine-tune scaffolding. A continued-pretrained checkpoint is a stronger starting point for instruction tuning. Domain SFT runs converge faster and reach higher quality when they start from a CPT checkpoint.

The trade-off has always been catastrophic forgetting: train too aggressively on new domain text and the model loses general capability. The 2026 toolchain (replay, LoRA, low LR schedules, NeMo’s data blending) has narrowed this risk substantially.

Continued Pretraining vs Fine-Tuning vs RAG

| Method | Supervision | Volume | Updates | Best for |
| --- | --- | --- | --- | --- |
| Continued pretraining | Next-token prediction on raw text | 1 to 200 billion tokens | All weights or LoRA | Domain knowledge, vocabulary, reasoning |
| Supervised fine-tuning | Input-output pairs | 1 thousand to 5 million examples | All weights or LoRA | Task formats, instruction following |
| RLHF / DPO | Preference pairs | 10 thousand to 1 million pairs | All weights | Alignment, tone, safety |
| RAG | None (retrieval at inference) | Index size | Index only | Recent facts, citations, source-grounded answers |

All four compose. The 2026 default for a domain-heavy assistant is: continued pretrain on domain text, instruction-tune on domain-task data, DPO for alignment, then RAG for recency. Future AGI evaluators score the assistant at each stage.

Frameworks for Continued LLM Pretraining in 2026

NVIDIA NeMo Framework 2.0

NeMo is a common enterprise choice for large NVIDIA-centred pretraining and continued pretraining runs. It ships data blending recipes, FP8 mixed precision, FSDP and tensor-parallel sharding, and integrates with NVIDIA NIM for inference.

  • Repo: github.com/NVIDIA/NeMo
  • License: Apache 2.0
  • Strength: production-grade, FP8-ready, integrated data pipeline
  • Trade-off: heavier configuration, NVIDIA hardware-first

Megatron-LM

Megatron-LM is NVIDIA’s research-scale training library, the substrate underneath NeMo and many production frontier-model trainers. Use it when you need custom kernels or maximum throughput on H100/H200 fleets.

  • Repo: github.com/NVIDIA/Megatron-LM
  • License: BSD-style (Megatron core), Apache 2.0 dependencies
  • Strength: highest throughput, tensor and pipeline parallelism, FP8 support
  • Trade-off: research-grade ergonomics

Microsoft DeepSpeed

DeepSpeed remains the go-to for memory-tight training. ZeRO-3 sharding lets you run continued pretraining on large models with smaller GPU memory budgets, and DeepSpeed-Chat covers downstream RLHF.

  • Repo: github.com/deepspeedai/DeepSpeed
  • License: Apache 2.0
  • Strength: ZeRO sharding, CPU offload, MoE training
  • Trade-off: heavier configuration than HuggingFace defaults

Axolotl

Axolotl is the open-source ergonomics layer that most independent labs reach for. Single YAML config, support for full continued pretraining and PEFT, and a community of recipes.

  • Repo: github.com/axolotl-ai-cloud/axolotl
  • License: Apache 2.0
  • Strength: config-driven, opinionated defaults, large recipe catalog
  • Trade-off: not the absolute fastest, leans on FSDP and DeepSpeed under the hood

Unsloth

Unsloth pushed the cost of single-node LoRA and full continued pretraining noticeably lower. It rewrites the attention and gradient kernels in Triton for a claimed 2 to 5x speedup on consumer and prosumer hardware.

  • Repo: github.com/unslothai/unsloth
  • License: Apache 2.0
  • Strength: fastest LoRA / QLoRA on 1 to 8 GPUs, low memory footprint
  • Trade-off: single-node focus, multi-node capabilities still maturing

HuggingFace TRL and PyTorch FSDP2

For research and small to mid-scale runs, the HuggingFace TRL library plus PyTorch FSDP2 is a clean and well-maintained stack. Most other frameworks layer on top of it.

  • Repo: github.com/huggingface/trl
  • License: Apache 2.0
  • Strength: easy to read, easy to debug, deep HuggingFace integration
  • Trade-off: less out-of-the-box for very large multi-node runs

Strategies That Work in 2026

Data preparation: deduplicate, decontaminate, balance

Three steps come first, before any training code runs:

  1. Deduplicate aggressively. Use n-gram dedup (5-gram or 10-gram) plus near-duplicate detection. Most public corpora benefit from 30 to 50 percent reduction.
  2. Decontaminate against your evaluation benchmarks. Filter out any document that overlaps with MMLU, MedQA, FinQA, or whatever you plan to score on. Skipping this inflates your eval numbers without inflating real-world performance.
  3. Balance domain to general at a 4:1 to 20:1 ratio depending on how aggressive the domain shift is. Aggressive shifts (legal-only, code-only) need a higher general-data fraction to fight forgetting.
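
Step 1 can be sketched in a few lines. This is a simplified exact-match 5-gram filter; production pipelines layer near-duplicate detection (MinHash or embedding clustering) on top, and the 50 percent overlap threshold here is illustrative:

```python
# Simplified exact n-gram dedup: drop a document if too many of its
# n-grams were already seen in earlier kept documents.
def ngrams(text, n=5):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def dedup(docs, n=5, max_overlap=0.5):
    """Keep a doc only if under max_overlap of its n-grams were seen before."""
    seen, kept = set(), []
    for doc in docs:
        grams = ngrams(doc, n)
        if not grams:
            continue  # shorter than n words: nothing to compare, drop
        overlap = len(grams & seen) / len(grams)
        if overlap < max_overlap:
            kept.append(doc)
            seen |= grams
    return kept
```

The same n-gram machinery doubles for decontamination by comparing against benchmark prompts instead of previously seen documents.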

Learning rate: small, with warmup, and decay

Continued pretraining is not pretraining from scratch. The base model already sits near a local minimum. A typical schedule:

  • Peak LR: 5e-6 to 5e-5, often 5 to 10x lower than original pretraining
  • Warmup: 1 to 5 percent of total steps
  • Decay: cosine to 10 percent of peak
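
The schedule above can be written down directly. A sketch with illustrative values (peak 2e-5, 2 percent warmup, decay to 10 percent of peak); tune them to your run rather than copying them:

```python
import math

# Warmup-then-cosine schedule matching the shape described above:
# linear warmup to the peak, then cosine decay to a floor at 10
# percent of peak. Values are illustrative, not a recommendation.
def lr_at(step, total_steps, peak=2e-5, warmup_frac=0.02, floor_frac=0.10):
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return peak * (step + 1) / warmup          # linear warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    floor = peak * floor_frac
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```

Most frameworks ship an equivalent scheduler; the point is that the peak sits well below the original pretraining LR.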

Replay: mix general data into every batch

Replay buffers are one of the most effective anti-forgetting techniques: make 5 to 20 percent of every batch general-purpose text (FineWeb-Edu, RedPajama, Dolma). For many domain adaptations this materially reduces the risk of general benchmark regression, though deep shifts still need a held-out general validation set.
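
The mix can be enforced at the batch sampler. A minimal sketch, assuming in-memory document lists and a 10 percent replay fraction:

```python
import random

# Replay-mixed batch sampler: each batch draws a fixed fraction from
# the general corpus and fills the rest from the domain corpus.
# replay_frac=0.1 sits inside the 5 to 20 percent range discussed above.
def mixed_batch(domain_docs, general_docs, batch_size=32, replay_frac=0.1, rng=None):
    rng = rng or random.Random(0)
    n_general = max(1, int(batch_size * replay_frac))
    batch = rng.choices(general_docs, k=n_general)
    batch += rng.choices(domain_docs, k=batch_size - n_general)
    rng.shuffle(batch)  # avoid a fixed general-then-domain ordering
    return batch
```

Real pipelines do this at the token-stream level with weighted dataset blending (NeMo and Axolotl both expose mixture weights), but the invariant is the same: every batch carries general text.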

PEFT first, full continued pretraining if needed

LoRA, DoRA, and LoRA-XS update only adapter matrices, leaving the base weights frozen. This helps preserve general capability and runs 5 to 20 times faster, though adapters can still degrade general behaviour at inference if the domain shift is too aggressive. The 2026 default is: try LoRA continued pretraining first, validate domain gain on a held-out test set with Future AGI evaluators, graduate to full continued pretraining only if the LoRA result is not strong enough.
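
The speed and memory advantage falls out of the parameter counts: for a d x d weight matrix, LoRA trains two rank-r factors instead of the full matrix. Plain arithmetic, no framework assumed:

```python
# LoRA replaces a trainable d x d update with B (d x r) @ A (r x d),
# so trainable parameters drop from d*d to 2*d*r per adapted matrix.
def lora_param_ratio(d, r):
    full = d * d
    lora = 2 * d * r
    return lora / full

# A 4096-wide projection with rank 16 trains 32/4096 of the full
# parameter count, under 1 percent per adapted matrix.
```

The same arithmetic explains why LoRA absorbs less of a deep domain shift: the update is constrained to a rank-r subspace of the full weight space.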

Monitoring during training

Three signals to watch live:

  • Training loss on domain data. Smoothed loss should trend downward; plateaus can signal data exhaustion, learning-rate issues, or optimization limits.
  • Validation loss on a held-out general corpus. Should stay flat or improve slightly; rising indicates forgetting.
  • Spot-check generation on a curated prompt set. Catches qualitative issues numeric losses miss.
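
The first two signals reduce to simple checks over logged losses. A sketch with illustrative window sizes and tolerances; real monitoring stacks wire these into alerts:

```python
# Plateau check on smoothed domain loss, and a forgetting alarm on
# held-out general validation loss. Window and slack values are
# illustrative defaults, not tuned recommendations.
def is_plateaued(losses, window=50, tol=1e-3):
    """True if mean loss stopped improving over the last window."""
    if len(losses) < 2 * window:
        return False
    prev = sum(losses[-2 * window:-window]) / window
    last = sum(losses[-window:]) / window
    return (prev - last) < tol

def forgetting_alarm(general_val_losses, baseline, slack=0.02):
    """True if general validation loss rose more than slack above baseline."""
    return general_val_losses[-1] > baseline * (1 + slack)
```

The third signal, spot-check generation, resists automation by design: it exists to catch what the numbers miss.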

Continued Pretraining Across Industries

Healthcare

The BioMistral and Meditron checkpoints used continued pretraining on PubMed, MIMIC-IV, and clinical guidelines to outperform general-purpose models on clinical QA. (Med-PaLM is a separate medical QA and alignment effort from Google rather than a canonical CPT example.) The pattern is now standard: domain-adapt the base, instruction-tune for clinical phrasing, evaluate with MedQA, MedMCQA, and Future AGI safety templates for harm avoidance.

Finance

BloombergGPT (2023) demonstrated the value of domain-specific continued pretraining on financial text. The 2026 pattern has moved to LoRA-based continued pretraining on SEC filings, earnings transcripts, and finance news, paired with retrieval for time-sensitive numbers. Evaluate with FinQA, ConvFinQA, and Future AGI faithfulness templates.

Legal

Legal-domain models such as the open-source CaseLaw-BERT successors and Saul-7B use continued pretraining on case law and statutes (LegalBench is an evaluation benchmark, not a training corpus) to absorb citation grammar. Pair with retrieval over a current case database for production. Evaluate with LegalBench tasks plus Future AGI citation-accuracy templates.

Code

Code continued pretraining now uses The Stack v2 plus filtered pull requests, paired with execution-grounded evaluation. StarCoder2 and the DeepSeek-Coder family illustrate the open-source pattern. Evaluate with HumanEval, MBPP, and execution-based templates.

Education

Education-domain continued pretraining covers curriculum text, textbooks, and pedagogical content. Pair with synthetic-data generation for adversarial student prompts, run Future AGI evaluators for pedagogical correctness, age-appropriate language, and refusal of off-topic requests.

Evaluation: Three Layers, No Shortcuts

A continued-pretrained model needs evaluation at three layers. Skip any one and you ship a regression you cannot see.

Layer 1: Capability retention

Run general benchmarks on the checkpoint before any instruction tuning:

  • MMLU and MMLU-Pro for world knowledge
  • HellaSwag for common-sense
  • ARC-Challenge for reasoning
  • HumanEval if the base had any code capability

A small 1 to 2 point drop may be acceptable if it sits within run-to-run benchmark variance and is offset by validated domain gains; pre-declare the threshold before training. A 5+ point drop signals catastrophic forgetting; fall back to replay or LoRA.
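
A pre-declared threshold is easiest to honor when it is enforced mechanically. A sketch of such a gate; the scores and the 2-point tolerance below are invented for illustration:

```python
# Gate a checkpoint on capability retention: fail if any general
# benchmark dropped more than the pre-declared tolerance (in points).
def retention_gate(base_scores, cpt_scores, max_drop=2.0):
    failures = {}
    for bench, base in base_scores.items():
        drop = base - cpt_scores[bench]
        if drop > max_drop:
            failures[bench] = round(drop, 2)
    return failures  # empty dict means the gate passes

base = {"mmlu": 64.1, "hellaswag": 81.0, "arc_challenge": 55.2}
cpt = {"mmlu": 63.0, "hellaswag": 80.6, "arc_challenge": 49.8}
# arc_challenge dropped 5.4 points: a classic catastrophic-forgetting signal.
```

Committing the gate to CI makes "acceptable drop" a decision made before training, not a rationalization after it.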

Layer 2: Domain gain

Run domain-specific benchmarks:

  • MedQA, MedMCQA for healthcare
  • FinQA, ConvFinQA for finance
  • LegalBench for legal
  • HumanEval+, MBPP for code

You want at least a 5 to 10 point gain to justify the continued pretraining cost.

Layer 3: Downstream task usefulness

Most teams skip this layer and regret it. A model that scores higher on MedQA but worse on agentic tasks is rarely a net win. Steps:

  1. Instruction-tune a checkpoint of the continued-pretrained model on a domain-task dataset.
  2. Wire it into your actual production pipeline (RAG, tools, multi-turn).
  3. Run Future AGI evaluators across task completion, faithfulness, tool-use correctness, and grounding.
  4. Compare against the base model on the same pipeline.

Quick start: score a continued-pretrained model with Future AGI

import os
from fi.evals import evaluate

os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."

# Toy example: score whether the model's answer is faithful to a domain context.
# Do not treat this string as medical guidance; it is for demonstration only.
context = "Internal reference text for the domain you are evaluating."
output = "The model's answer goes here."

score = evaluate(
    "faithfulness",
    output=output,
    context=context,
)
print(score)

Run the same pattern across a held-out domain test set, attach the scores to a regression report, and gate the next training run on that report. For deeper coverage of evaluation patterns see LLM Evaluation in 2026.

Common Pitfalls and How to Avoid Them

Pitfall 1: Skipping decontamination

If your continued pretraining corpus contains your evaluation prompts, your eval numbers are inflated and your shipped model is weaker than the numbers say. Decontaminate before training, every time.
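
A minimal decontamination pass checks each training document for n-gram overlap with the evaluation prompts. A simplified sketch using 8-grams; real pipelines also normalize punctuation and catch fuzzy matches:

```python
# Drop any training document that shares an 8-gram with an evaluation
# prompt. Exact-match only; a deliberately simplified illustration.
def ngram_set(text, n=8):
    w = text.lower().split()
    return {tuple(w[i:i + n]) for i in range(len(w) - n + 1)}

def decontaminate(train_docs, eval_prompts, n=8):
    if not eval_prompts:
        return list(train_docs)
    eval_grams = set().union(*(ngram_set(p, n) for p in eval_prompts))
    return [d for d in train_docs if not (ngram_set(d, n) & eval_grams)]
```

Run it over every corpus shard before tokenization, against every benchmark you intend to report.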

Pitfall 2: Too high a learning rate

Continued pretraining is not from-scratch pretraining. Use a LR 5 to 10 times lower than the original pretraining LR. Most catastrophic forgetting starts with an overshoot from too aggressive an LR.

Pitfall 3: No replay buffer

Replay is cheap and effective. 5 to 20 percent general-data fraction in every batch fights forgetting at almost no extra compute.

Pitfall 4: No evaluation at layer 3

Eval at layers 1 and 2 only tells you the model passed benchmarks. Layer 3 (downstream task usefulness in your actual pipeline) is the only number that matters for production.

Pitfall 5: Continued pretraining when RAG would do

If the gap is “the model does not know recent facts”, RAG fixes that. If the gap is “the model does not know domain vocabulary or reasoning”, continued pretraining fixes that. Most teams reach for continued pretraining when RAG would solve the problem at 1 percent of the cost.

Synthetic Data Plus Continued Pretraining

Two patterns work in 2026:

  • Use synthetic data to fill gaps in the domain corpus. Generate domain documents, case briefs, or technical notes that look like real raw text, then convert them to next-token training data. Prompt-and-answer pairs are usually reserved for the supervised fine-tuning stage; do not drop them into a continued pretraining mix without reformatting them into flowing domain text first.
  • Use synthetic data to harden evaluation. Future AGI fi.simulate produces persona-driven test sets that exercise failure modes the base benchmarks miss.
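
Reformatting prompt-and-answer pairs into flowing text can be as simple as templating them into prose before tokenization. A sketch; the template wording here is invented and should be tuned to your domain register:

```python
# Convert SFT-style (question, answer) pairs into continuous prose so
# they fit the next-token objective instead of leaking instruction
# formatting into the continued pretraining mix.
def qa_to_prose(pairs):
    paragraphs = []
    for question, answer in pairs:
        q = question.rstrip("?")
        paragraphs.append(f"A common question in practice is {q.lower()}. {answer}")
    return "\n\n".join(paragraphs)
```

Varying the template per pair (several phrasings, rotated) avoids teaching the model a single rigid pattern.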

For a deeper look at synthetic data for training see Synthetic Data for Fine-Tuning LLMs.

Cost and Compute Profile in 2026

A rough cost orientation, not a quote:

| Run shape | Tokens | GPUs | Wall time | Spot cost (rough) |
| --- | --- | --- | --- | --- |
| LoRA continued pretraining, 7B | 5B | 4 to 8 H100 | 1 to 3 days | a few hundred to a few thousand USD |
| Full continued pretraining, 7B | 30B | 16 to 32 H100 | 3 to 10 days | low tens of thousands USD |
| Full continued pretraining, 13B | 50B | 32 to 64 H100 | 7 to 20 days | tens of thousands USD |
| Full continued pretraining, 70B | 100B+ | 128 to 512 H100/H200 | 2 to 6 weeks | hundreds of thousands USD+ |

Prices vary considerably by cloud, region, spot vs reserved, and storage egress. Confirm current rates directly with your provider before budgeting.
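
The wall-time figures follow from simple throughput arithmetic. A back-of-envelope sketch; the per-GPU throughput and hourly rate are assumptions to replace with numbers measured on your own stack:

```python
# Back-of-envelope wall time and cost from token budget, GPU count,
# per-GPU throughput, and hourly rate. Every input is an assumption.
def run_estimate(tokens, n_gpus, tok_per_gpu_per_sec, usd_per_gpu_hour):
    gpu_seconds = tokens / tok_per_gpu_per_sec
    wall_hours = gpu_seconds / n_gpus / 3600
    cost = gpu_seconds / 3600 * usd_per_gpu_hour
    return wall_hours / 24, cost  # (days, USD)

# 30B tokens on 16 H100s, at an assumed 3,000 tok/s/GPU and 3 USD/GPU-hour:
days, usd = run_estimate(30e9, 16, 3_000, 3.0)
```

Measured throughput varies several-fold with model size, sequence length, precision, and parallelism layout, which is why the table above spans wide ranges.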

Wrapping Up

Continued LLM pretraining stays relevant in 2026 because base-model knowledge cutoffs lag, regulated domains move fast, and a continued-pretrained checkpoint is a stronger starting point for everything downstream. Pick the framework by scale: NeMo or Megatron for the largest runs, DeepSpeed for memory-tight clusters, Axolotl or Unsloth for ergonomic and parameter-efficient adaptation. Always evaluate at three layers (capability retention, domain gain, downstream usefulness), with the third layer scored against your production pipeline. Future AGI evaluators run that third layer at scale with built-in templates for faithfulness, task completion, tool-use correctness, and safety at docs.futureagi.com.

For deeper reads see Synthetic Data for Fine-Tuning LLMs, LLM Evaluation in 2026, and RAG vs Fine-Tuning.

Frequently asked questions

What is continued LLM pretraining?
Continued LLM pretraining, sometimes called continual pretraining or domain-adaptive pretraining, is the process of taking an existing pretrained model and training it further on additional, often domain-specific text. The objective remains next-token prediction, the same as the original pretraining stage. The result is a model that retains its general capabilities while gaining stronger vocabulary, reasoning patterns, and recall for the new domain. It sits between original pretraining and instruction fine-tuning in the model adaptation pipeline.
How is continued pretraining different from fine-tuning?
Continued pretraining uses next-token prediction on raw text in large volumes, typically billions of tokens, and updates broad model knowledge. Fine-tuning uses supervised input-output pairs or preference data in much smaller volumes, typically thousands to millions of examples, and updates how the model behaves on specific tasks. Continued pretraining changes what the model knows. Fine-tuning changes how the model responds. Both can be combined: continued pretrain first to adapt to a domain, then fine-tune for instructions or specific tasks.
Which frameworks dominate continued pretraining in 2026?
Five frameworks cover most production workloads: NVIDIA NeMo for the largest enterprise runs, Megatron-LM for research-scale custom kernels, Microsoft DeepSpeed for memory-efficient distributed training, Axolotl for the most ergonomic config-driven adaptation, and Unsloth for the fastest single-node QLoRA continued pretraining. PyTorch FSDP2 and the HuggingFace TRL library underpin many of them. Choose by scale: NeMo or Megatron above 70B parameters, DeepSpeed for memory-tight clusters, Axolotl or Unsloth for parameter-efficient continued pretraining.
What is catastrophic forgetting and how do I prevent it?
Catastrophic forgetting happens when a model loses general capability while training on new domain data. Three mitigations work in 2026. Replay buffers mix general-purpose tokens into the continued pretraining batches at a 5 to 20 percent ratio, often using FineWeb-Edu or RedPajama samples. Lower learning rates, typically 1e-5 to 5e-5, prevent overshoot from the pretrained optimum. Parameter-efficient methods like LoRA, DoRA, and the newer LoRA-XS keep the base weights frozen, which helps reduce forgetting risk at the cost of slightly lower domain absorption; aggressive shifts can still degrade general behaviour at inference, so always score capability retention.
How do I evaluate a continued-pretrained model?
Three layers of evaluation. Capability retention: run general benchmarks like MMLU, HellaSwag, MMLU-Pro to confirm general knowledge held up. Domain gain: run domain-specific benchmarks like MedQA for healthcare, FinQA for finance, or LegalBench for legal. Downstream usefulness: instruction-tune a checkpoint and evaluate task completion, faithfulness, and tool-use correctness with templates from Future AGI evaluators. Skipping the third step is the most common mistake; a model that scores higher on MedQA but worse on agentic tasks is rarely a net win.
Should I use full continued pretraining or LoRA-style adaptation?
It depends on scale and risk tolerance. Full continued pretraining updates every weight, which gives the strongest domain absorption and the highest catastrophic forgetting risk. LoRA, DoRA, and LoRA-XS update small adapter matrices, which preserves general knowledge and trains 5 to 20 times faster, at the cost of slightly lower domain absorption. In 2026 most teams default to LoRA for first iteration, validate domain gain against a held-out test set with Future AGI evaluators, and graduate to full continued pretraining only if the LoRA result is not strong enough.
How long does continued pretraining take in 2026?
A typical mid-size continued pretraining job in 2026 looks like this: take a 7B to 13B parameter base model, train on 5 to 50 billion tokens of domain data, on 8 to 64 H100 or H200 GPUs, for 3 to 30 days of wall time. Frontier-scale continued pretraining on a 70B+ model with hundreds of billions of tokens runs for weeks on hundreds of GPUs. LoRA-only continued pretraining on a 7B model with 5B tokens can finish in 1 to 3 days on 4 to 8 H100s.
What datasets work well for continued pretraining?
Use a mix of in-domain text and general-purpose replay. In-domain sources vary by use case: PubMed abstracts and MIMIC for clinical, SEC filings and FNSPID for finance, EDGAR and CaseHOLD for legal, The Stack v2 for code. Pair with 5 to 20 percent general data from FineWeb-Edu, RedPajama, or Dolma for replay. Always deduplicate, decontaminate against your evaluation benchmarks, and screen for PII before training. Future AGI's fi.simulate plus evaluators help generate adversarial domain test sets to harden the resulting model.