Continued LLM Pretraining in 2026: Frameworks, Strategies, and Evaluation
Continued LLM pretraining in 2026: Megatron-LM, DeepSpeed, Axolotl, NeMo, Unsloth. Domain adaptation, catastrophic forgetting, evaluation with Future AGI.
TL;DR: Continued LLM Pretraining at a Glance
| Choice | When to pick | Token budget | GPU footprint |
|---|---|---|---|
| Full continued pretraining | Hard domain shift, deep knowledge injection | 10 to 200 billion+ | 32 to 512+ H100/H200 |
| LoRA / DoRA / LoRA-XS | Light to moderate domain adaptation, preserve general capability | 1 to 20 billion | 4 to 16 H100 |
| Replay-augmented continued pretraining | Strong adaptation with low forgetting risk | 5 to 100 billion | 16 to 256 H100 |
| Instruction tuning over a CPT checkpoint | Always do this if you want a usable assistant | 50 thousand to 5 million SFT examples | 1 to 8 H100 |
For evaluation across all four paths, Future AGI evaluators can score capability retention, domain gain, and downstream task usefulness; see the documentation at docs.futureagi.com.
What changed since 2025: PyTorch FSDP2 has become a common sharding option across many training stacks. NVIDIA NeMo Framework 2.0 streamlined the configuration model. The Axolotl config schema converged on a community standard for parameter-efficient continued pretraining. Unsloth pushed single-node continued pretraining to noticeably lower cost. Data-quality filtering (perplexity scoring, embedding clustering, n-gram dedup) became standard data prep, not optional.
What Continued LLM Pretraining Is
Continued LLM pretraining, sometimes called continual pretraining (CPT) or domain-adaptive pretraining (DAPT), is the process of taking an already pretrained model and training it further on additional text. The objective stays the same as in original pretraining: next-token prediction over raw text. The volume is high (typically billions of tokens), the supervision is implicit, and the change to the model is broad rather than task-specific.
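Concretely, the objective is unchanged from original pretraining. A minimal sketch, assuming a HuggingFace causal LM; the model name and single-batch step are illustrative, not a training recipe:

```python
# Minimal sketch of the CPT objective: next-token prediction on raw domain text.
# Model name is a placeholder for whatever base checkpoint you start from.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # illustrative base checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

batch = tok(["Raw domain text goes here ..."], return_tensors="pt")
# Same objective as original pretraining: when labels == input_ids, transformers
# shifts them internally and computes the next-token cross-entropy loss.
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()
```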
It sits between the pretrained base and the instruction-tuned chat model in the standard pipeline:
```
Base model (pretrained on web-scale data)
                 |
                 v
Continued pretraining (domain or fresh text)   <- you are here
                 |
                 v
Supervised fine-tuning (instruction or task data)
                 |
                 v
RLHF or DPO (preference alignment)
                 |
                 v
Deployed assistant
```
Skip continued pretraining and you can still get a usable assistant on a generic domain. Include it and you get noticeably better domain performance with a smaller fine-tuning budget downstream.
Why Continued Pretraining Matters in 2026
Three drivers keep continued pretraining relevant even as base models get bigger:
- Knowledge cutoff drift. Frontier base models still ship with cutoffs that are 6 to 18 months stale. Domain-specific knowledge moves faster, especially in regulated industries.
- Vocabulary and structure. Legal citations, medical codes, and finance instruments have grammar that web data underrepresents. Continued pretraining fixes this faster than RAG alone.
- Pre-fine-tune scaffolding. A continued-pretrained checkpoint is a stronger starting point for instruction tuning. Domain SFT runs converge faster and reach higher quality when they start from a CPT checkpoint.
The trade-off has always been catastrophic forgetting: train too aggressively on new domain text and the model loses general capability. The 2026 toolchain (replay, LoRA, low LR schedules, NeMo’s data blending) has narrowed this risk substantially.
Continued Pretraining vs Fine-Tuning vs RAG
| Method | Supervision | Volume | Updates | Best for |
|---|---|---|---|---|
| Continued pretraining | Next-token prediction on raw text | 1 to 200 billion tokens | All weights or LoRA | Domain knowledge, vocabulary, reasoning |
| Supervised fine-tuning | Input-output pairs | 1 thousand to 5 million examples | All weights or LoRA | Task formats, instruction following |
| RLHF / DPO | Preference pairs | 10 thousand to 1 million pairs | All weights | Alignment, tone, safety |
| RAG | None (retrieval at inference) | Index size | Index only | Recent facts, citations, source-grounded answers |
All four compose. The 2026 default for a domain-heavy assistant is: continued pretrain on domain text, instruction-tune on domain-task data, DPO for alignment, then RAG for recency. Future AGI evaluators score the assistant at each stage.
Frameworks for Continued LLM Pretraining in 2026
NVIDIA NeMo Framework 2.0
NeMo is a common enterprise choice for large NVIDIA-centred pretraining and continued pretraining runs. It ships data blending recipes, FP8 mixed precision, FSDP and tensor-parallel sharding, and integrates with NVIDIA NIM for inference.
- Repo: github.com/NVIDIA/NeMo
- License: Apache 2.0
- Strength: production-grade, FP8-ready, integrated data pipeline
- Trade-off: heavier configuration, NVIDIA hardware-first
Megatron-LM
Megatron-LM is NVIDIA’s research-scale training library, the substrate underneath NeMo and many production frontier-model trainers. Use it when you need custom kernels or maximum throughput on H100/H200 fleets.
- Repo: github.com/NVIDIA/Megatron-LM
- License: BSD-style (Megatron core), Apache 2.0 dependencies
- Strength: highest throughput, tensor and pipeline parallelism, FP8 support
- Trade-off: research-grade ergonomics
Microsoft DeepSpeed
DeepSpeed remains the go-to for memory-tight training. ZeRO-3 sharding lets you continue pretraining large models on smaller GPU memory budgets, and DeepSpeed-Chat covers downstream RLHF.
- Repo: github.com/deepspeedai/DeepSpeed
- License: Apache 2.0
- Strength: ZeRO sharding, CPU offload, MoE training
- Trade-off: heavier configuration than HuggingFace defaults
Axolotl
Axolotl is the open-source ergonomics layer that most independent labs reach for. Single YAML config, support for full continued pretraining and PEFT, and a community of recipes.
- Repo: github.com/axolotl-ai-cloud/axolotl
- License: Apache 2.0
- Strength: config-driven, opinionated defaults, large recipe catalog
- Trade-off: not the absolute fastest, leans on FSDP and DeepSpeed under the hood
Unsloth
Unsloth pushed single-node LoRA and full continued pretraining to record-low cost. It rewrites the attention and gradient kernels in Triton for a 2 to 5x speedup on consumer and prosumer hardware.
- Repo: github.com/unslothai/unsloth
- License: Apache 2.0
- Strength: fastest LoRA / QLoRA on 1 to 8 GPUs, low memory footprint
- Trade-off: single-node focus, multi-node capabilities still maturing
HuggingFace TRL and PyTorch FSDP2
For research and small to mid-scale runs, the HuggingFace TRL library plus PyTorch FSDP2 is a clean and well-maintained stack, and several higher-level frameworks (Axolotl among them) build on top of it.
- Repo: github.com/huggingface/trl
- License: Apache 2.0
- Strength: easy to read, easy to debug, deep HuggingFace integration
- Trade-off: less out-of-the-box for very large multi-node runs
Strategies That Work in 2026
Data preparation: deduplicate, decontaminate, balance
Three steps come first, before any training code runs:
- Deduplicate aggressively. Use n-gram dedup (5-gram or 10-gram) plus near-duplicate detection; a minimal sketch follows this list. Most public corpora shrink by 30 to 50 percent.
- Decontaminate against your evaluation benchmarks. Filter out any document that overlaps with MMLU, MedQA, FinQA, or whatever you plan to score on. Skipping this inflates your eval numbers without inflating real-world performance.
- Balance domain to general at a 4:1 to 20:1 ratio depending on how aggressive the domain shift is. Aggressive shifts (legal-only, code-only) need a higher general-data fraction to fight forgetting.
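As a reference point for the dedup step above, here is a minimal 5-gram near-duplicate filter. It is a sketch only: at corpus scale you would reach for MinHash/LSH tooling (for example the datasketch library), and the 0.8 overlap threshold is an assumption to tune.

```python
# Minimal 5-gram near-duplicate filter. Threshold and corpus are illustrative;
# production pipelines use MinHash/LSH for scalability.
from hashlib import blake2b

def ngram_hashes(text: str, n: int = 5) -> set[int]:
    toks = text.split()
    return {
        int.from_bytes(blake2b(" ".join(toks[i:i + n]).encode(), digest_size=8).digest(), "big")
        for i in range(max(len(toks) - n + 1, 1))
    }

seen: set[int] = set()

def is_near_duplicate(text: str, threshold: float = 0.8) -> bool:
    """Flag a doc whose 5-grams mostly overlap with already-kept docs."""
    h = ngram_hashes(text)
    dup = len(h & seen) / max(len(h), 1) >= threshold
    if not dup:
        seen.update(h)  # only kept docs contribute to the seen set
    return dup

corpus = ["some domain document ...", "some domain document ..."]
kept = [doc for doc in corpus if not is_near_duplicate(doc)]
```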
Learning rate: small, with warmup, and decay
Continued pretraining is not pretraining from scratch. The base model already sits near a local minimum. A typical schedule (sketched in code after this list):
- Peak LR: 5e-6 to 5e-5, often 5 to 10x lower than original pretraining
- Warmup: 1 to 5 percent of total steps
- Decay: cosine to 10 percent of peak
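A minimal sketch of that schedule as a PyTorch LambdaLR: linear warmup to the peak, then cosine decay to a 10 percent floor. The step count and peak LR in the usage comment are placeholders.

```python
# Warmup + cosine-to-10%-of-peak schedule as a LambdaLR multiplier.
import math
import torch

def make_scheduler(optimizer, total_steps: int, warmup_frac: float = 0.02,
                   floor: float = 0.1):
    warmup_steps = max(int(total_steps * warmup_frac), 1)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / warmup_steps            # linear warmup to peak
        progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
        cosine = 0.5 * (1 + math.cos(math.pi * progress))
        return floor + (1 - floor) * cosine       # decays to 10% of peak

    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# usage (peak LR and step count are placeholders):
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# scheduler = make_scheduler(optimizer, total_steps=10_000)
```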
Replay: mix general data into every batch
Replay buffers are one of the most effective anti-forgetting techniques: make 5 to 20 percent of every batch general-purpose text (FineWeb-Edu, RedPajama, Dolma). For many domain adaptations this materially reduces the risk of general benchmark regression, though deep shifts still need a held-out general validation set.
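A minimal sketch of sample-level replay mixing, assuming two iterable corpora; the 10 percent replay fraction and dataset names are placeholders:

```python
# Replay sketch: draw each sample from the general corpus with probability
# `replay_frac`, otherwise from the domain corpus.
import random

def mixed_stream(domain_docs, general_docs, replay_frac: float = 0.1, seed: int = 0):
    rng = random.Random(seed)
    domain, general = iter(domain_docs), iter(general_docs)
    while True:
        source = general if rng.random() < replay_frac else domain
        try:
            yield next(source)
        except StopIteration:
            return  # one stream exhausted; real pipelines cycle the general set

# usage: ~10% of yielded samples come from FineWeb-Edu-style general text
# stream = mixed_stream(domain_corpus, general_corpus, replay_frac=0.10)
```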
PEFT first, full continued pretraining if needed
LoRA, DoRA, and LoRA-XS update only adapter matrices, leaving the base weights frozen. This helps preserve general capability and runs 5 to 20 times faster, though adapters can still degrade general behaviour at inference if the domain shift is too aggressive. The 2026 default is: try LoRA continued pretraining first, validate domain gain on a held-out test set with Future AGI evaluators, graduate to full continued pretraining only if the LoRA result is not strong enough.
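A minimal LoRA setup sketch, assuming the HuggingFace peft library; the rank, alpha, and target modules are common starting points rather than tuned values:

```python
# LoRA adapters over a frozen base with HuggingFace peft.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
    # use_dora=True,  # DoRA variant, available in recent peft releases
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of base weights
```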
Monitoring during training
Three signals to watch live (a minimal logging sketch follows the list):
- Training loss on domain data. Smoothed loss should trend downward; plateaus can signal data exhaustion, learning-rate issues, or optimization limits.
- Validation loss on a held-out general corpus. Should stay flat or improve slightly; rising indicates forgetting.
- Spot-check generation on a curated prompt set. Catches qualitative issues numeric losses miss.
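A minimal sketch of the second signal, periodic validation loss on a held-out general corpus. The eval interval and the log() call in the usage comment stand in for your training loop and experiment tracker; batches are assumed to carry tokenized input_ids tensors.

```python
# Held-out general validation loss, evaluated every N steps.
# A rising curve here is the earliest quantitative signal of forgetting.
import torch

@torch.no_grad()
def general_val_loss(model, val_batches) -> float:
    model.eval()
    losses = []
    for batch in val_batches:
        out = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
        losses.append(out.loss.item())
    model.train()
    return sum(losses) / len(losses)

# usage inside the training loop (interval is illustrative):
# if step % 500 == 0:
#     log({"general_val_loss": general_val_loss(model, general_val_batches)})
```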
Continued Pretraining Across Industries
Healthcare
The BioMistral and Meditron checkpoints used continued pretraining on PubMed, MIMIC-IV, and clinical guidelines to outperform general-purpose models on clinical QA. (Med-PaLM is a separate medical QA and alignment effort from Google rather than a canonical CPT example.) The pattern is now standard: domain-adapt the base, instruction-tune for clinical phrasing, evaluate with MedQA, MedMCQA, and Future AGI safety templates for harm avoidance.
Finance
BloombergGPT (2023) demonstrated the value of domain-specific continued pretraining on financial text. The 2026 pattern has moved to LoRA-based continued pretraining on SEC filings, earnings transcripts, and finance news, paired with retrieval for time-sensitive numbers. Evaluate with FinQA, ConvFinQA, and Future AGI faithfulness templates.
Legal
Legal-domain models such as the open-source CaseLaw-BERT successors and Saul-7B use continued pretraining on case law and statutes (LegalBench is an evaluation benchmark, not a training corpus) to absorb citation grammar. Pair with retrieval over a current case database for production. Evaluate with LegalBench tasks plus Future AGI citation-accuracy templates.
Code
Code continued pretraining now uses The Stack v2 plus filtered pull requests, paired with execution-grounded evaluation. StarCoder2 and the DeepSeek-Coder family illustrate the open-source pattern. Evaluate with HumanEval, MBPP, and execution-based templates.
Education
Education-domain continued pretraining covers curriculum text, textbooks, and pedagogical content. Pair with synthetic-data generation for adversarial student prompts, run Future AGI evaluators for pedagogical correctness, age-appropriate language, and refusal of off-topic requests.
Evaluation: Three Layers, No Shortcuts
A continued-pretrained model needs evaluation at three layers. Skip any one and you ship a regression you cannot see.
Layer 1: Capability retention
Run general benchmarks on the checkpoint before any instruction tuning:
- MMLU and MMLU-Pro for world knowledge
- HellaSwag for common-sense
- ARC-Challenge for reasoning
- HumanEval if the base had any code capability
A small 1 to 2 point drop may be acceptable if it sits within run-to-run benchmark variance and is offset by validated domain gains; pre-declare the threshold before training. A 5+ point drop signals catastrophic forgetting; fall back to replay or LoRA.
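One way to run the retention suite is EleutherAI's lm-evaluation-harness; a sketch assuming its v0.4+ Python API (verify signatures and task names against your installed version):

```python
# Capability-retention check with lm-evaluation-harness (pip install lm-eval).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/cpt-checkpoint,dtype=bfloat16",
    tasks=["mmlu", "hellaswag", "arc_challenge"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
# Diff these numbers against the base model's run and compare the drop to the
# pre-declared 1-to-2-point tolerance.
```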
Layer 2: Domain gain
Run domain-specific benchmarks:
- MedQA, MedMCQA for healthcare
- FinQA, ConvFinQA for finance
- LegalBench for legal
- HumanEval+, MBPP for code
You want at least a 5 to 10 point gain to justify the continued pretraining cost.
Layer 3: Downstream task usefulness
Most teams skip this layer and regret it. A model that scores higher on MedQA but worse on agentic tasks is rarely a net win. Steps:
- Instruction-tune a checkpoint of the continued-pretrained model on a domain-task dataset.
- Wire it into your actual production pipeline (RAG, tools, multi-turn).
- Run Future AGI evaluators across task completion, faithfulness, tool-use correctness, and grounding.
- Compare against the base model on the same pipeline.
Quick start: score a continued-pretrained model with Future AGI
```python
import os

from fi.evals import evaluate

os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."

# Toy example: score whether the model's answer is faithful to a domain context.
# Do not treat this string as medical guidance; it is for demonstration only.
context = "Internal reference text for the domain you are evaluating."
output = "The model's answer goes here."

score = evaluate(
    "faithfulness",
    output=output,
    context=context,
)
print(score)
```
Run the same pattern across a held-out domain test set, attach the scores to a regression report, and gate the next training run on that report. For deeper coverage of evaluation patterns, see LLM Evaluation in 2026.
Common Pitfalls and How to Avoid Them
Pitfall 1: Skipping decontamination
If your continued pretraining corpus contains your evaluation prompts, your eval numbers are inflated and your shipped model is weaker than the numbers say. Decontaminate before training, every time.
Pitfall 2: Too high a learning rate
Continued pretraining is not from-scratch pretraining. Use an LR 5 to 10 times lower than the original pretraining LR. Most catastrophic forgetting starts with an overshoot from an overly aggressive LR.
Pitfall 3: No replay buffer
Replay is cheap and effective. 5 to 20 percent general-data fraction in every batch fights forgetting at almost no extra compute.
Pitfall 4: No evaluation at layer 3
Eval at layers 1 and 2 only tells you the model passed benchmarks. Layer 3 (downstream task usefulness in your actual pipeline) is the only number that matters for production.
Pitfall 5: Continued pretraining when RAG would do
If the gap is “the model does not know recent facts”, RAG fixes that. If the gap is “the model does not know domain vocabulary or reasoning”, continued pretraining fixes that. Most teams reach for continued pretraining when RAG would solve the problem at 1 percent of the cost.
Synthetic Data Plus Continued Pretraining
Two patterns work in 2026:
- Use synthetic data to fill gaps in the domain corpus. Generate domain documents, case briefs, or technical notes that look like real raw text, then convert them to next-token training data. Prompt-and-answer pairs are usually reserved for the supervised fine-tuning stage; do not drop them into a continued pretraining mix without reformatting them into flowing domain text first (a minimal sketch follows this list).
- Use synthetic data to harden evaluation. Future AGI fi.simulate produces persona-driven test sets that exercise failure modes the base benchmarks miss.
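A minimal sketch of that reformatting step, turning synthetic Q&A pairs into flowing text for the next-token objective. The template is illustrative; in practice an LLM rewrite pass produces more natural prose than string templates.

```python
# Reformat synthetic Q&A pairs into prose documents suitable for CPT.
def qa_to_document(question: str, answer: str) -> str:
    return (
        f"A common question in this domain is: {question} "
        f"The accepted answer is as follows. {answer}"
    )

pairs = [("What does Section 10(b) cover?", "It addresses securities fraud ...")]
cpt_docs = [qa_to_document(q, a) for q, a in pairs]
```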
For a deeper look at synthetic data for training see Synthetic Data for Fine-Tuning LLMs.
Cost and Compute Profile in 2026
A rough cost orientation, not a quote:
| Run shape | Tokens | GPUs | Wall time | Spot cost (rough) |
|---|---|---|---|---|
| LoRA continued pretraining, 7B | 5B | 4 to 8 H100 | 1 to 3 days | a few hundred to a few thousand USD |
| Full continued pretraining, 7B | 30B | 16 to 32 H100 | 3 to 10 days | low tens of thousands USD |
| Full continued pretraining, 13B | 50B | 32 to 64 H100 | 7 to 20 days | tens of thousands USD |
| Full continued pretraining, 70B | 100B+ | 128 to 512 H100/H200 | 2 to 6 weeks | hundreds of thousands USD+ |
Prices vary considerably by cloud, region, spot vs reserved, and storage egress. Confirm current rates directly with your provider before budgeting.
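For budgeting, a back-of-envelope estimator consistent with the table above; every number here is an assumption to replace with your measured end-to-end throughput and negotiated rates:

```python
# Back-of-envelope wall-time and cost estimator. All inputs are assumptions.
tokens = 30e9                  # target token budget (7B full CPT row above)
gpus = 32                      # H100 count
tok_per_sec_per_gpu = 3_000    # assumed effective throughput incl. stalls/restarts
usd_per_gpu_hour = 4.00        # assumed spot rate; confirm with your provider

hours = tokens / (tok_per_sec_per_gpu * gpus) / 3600
print(f"wall time: {hours / 24:.1f} days, cost: ${hours * gpus * usd_per_gpu_hour:,.0f}")
# -> roughly 3.6 days and ~$11k under these assumptions
```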
Wrapping Up
Continued LLM pretraining stays relevant in 2026 because base-model knowledge cutoffs lag, regulated domains move fast, and a continued-pretrained checkpoint is a stronger starting point for everything downstream. Pick the framework by scale: NeMo or Megatron for the largest runs, DeepSpeed for memory-tight clusters, Axolotl or Unsloth for ergonomic and parameter-efficient adaptation. Always evaluate at three layers (capability retention, domain gain, downstream usefulness), with the third layer scored against your production pipeline. Future AGI evaluators run that third layer at scale, with built-in templates for faithfulness, task completion, tool-use correctness, and safety; see docs.futureagi.com.
For deeper reads see Synthetic Data for Fine-Tuning LLMs, LLM Evaluation in 2026, and RAG vs Fine-Tuning.