What Is Learning Rate in Machine Learning?
The scalar hyperparameter that controls how far a gradient-descent optimiser moves model weights at each training step.
Learning rate is the scalar hyperparameter that controls how far a gradient-descent optimiser moves the model weights at each training step. It multiplies the gradient before the weight update: too small a value yields slow convergence or stuck-in-saddle-point behaviour; too large blows up the loss. Modern deep-learning training rarely uses a fixed rate — schedulers warm up linearly, cosine-anneal, or step-decay through training, and adaptive methods like Adam, AdamW, and Adafactor scale the rate per parameter. In LLM fine-tuning the choice is decisive; orders of magnitude separate a successful run from a destroyed model.
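The update rule can be sketched on a toy one-dimensional loss — a minimal illustration of how the rate scales each step, not a training recipe:

```python
# Gradient descent on f(w) = w**2, whose gradient is 2w.
# The learning rate multiplies the gradient before the weight update.
def gradient_descent(lr, steps=50, w=1.0):
    for _ in range(steps):
        w = w - lr * 2 * w   # w <- w - lr * grad
    return w

small = gradient_descent(lr=0.001)  # too small: w barely moves toward 0
good = gradient_descent(lr=0.1)     # converges: w ends near 0
large = gradient_descent(lr=1.5)    # too large: |w| blows up each step
print(small, good, large)
```

The same three regimes — slow convergence, healthy convergence, divergence — are what loss curves show at scale.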
Why It Matters in Production LLM and Agent Systems
Most teams running LLMs do not train them from scratch; they fine-tune. And every fine-tune is a learning-rate decision waiting to break things. A LoRA pass at 1e-4 may merge cleanly; the same data at 1e-3 wipes the base model’s capabilities, leaving a fluent-but-useless adapter. RLHF runs are even more delicate — too large a rate sends the policy away from the reference distribution, KL regularisation pulls it back, and the policy collapses to the few actions the reward model loves.
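The KL mechanism can be sketched as a per-token penalty on the reward — a toy calculation where `beta` and the log-probabilities are hypothetical values, not any specific RLHF library's API:

```python
# Sketch of a KL-regularised RLHF reward: the raw reward is penalised by an
# estimate of the policy's divergence from the reference model.
def kl_penalised_reward(reward, logp_policy, logp_ref, beta=0.1):
    # Per-token KL estimate: log pi(a|s) - log pi_ref(a|s)
    kl = logp_policy - logp_ref
    return reward - beta * kl

# A policy that stays near the reference keeps its reward intact...
print(kl_penalised_reward(1.0, logp_policy=-1.0, logp_ref=-1.0))
# ...while one that drifts far from the reference sees it shrink.
print(kl_penalised_reward(1.0, logp_policy=-0.1, logp_ref=-3.0))
```

When the learning rate is too hot, the policy drifts faster than this penalty can correct, which is the collapse mode described above.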
The pain shows up across roles. ML engineers chase eval-fail-rate spikes after a fine-tune, only to trace them back to a too-high learning rate causing catastrophic forgetting. Product leads test a fine-tune that beats the base model on the target task but loses reasoning on out-of-domain prompts because the rate scheduler did not warm up. Compliance leads ask whether the fine-tuned model still satisfies safety evaluations, since a high-rate update can degrade refusal behaviour.
In 2026 stacks the surface widens. Continuous fine-tuning pipelines, online RLHF, and agent-trajectory fine-tunes each have their own rate budget. Without a regression-eval gate, every fine-tune is a regression risk.
How FutureAGI Handles Fine-Tune Quality Regressions
FutureAGI does not implement gradient descent — we sit downstream of any training run. The connection point is the candidate model coming out of fine-tune at each learning-rate setting. The team versions a Dataset covering core capabilities, attaches AnswerRelevancy, Faithfulness, HallucinationScore, JSONValidation, and any task-specific evaluators, and runs Dataset.add_evaluation(...) against each candidate model.
A concrete workflow: a team runs four fine-tune jobs sweeping learning rate [5e-5, 1e-4, 5e-4, 1e-3] over the same data. They evaluate each candidate on Dataset v3 and surface results in the dashboard. 5e-5 underfits — AnswerRelevancy is unchanged from base. 1e-4 lifts target-task scores by 9 points with no out-of-domain regression. 5e-4 lifts the target by 12 points but drops HallucinationScore by 8 on out-of-domain prompts (catastrophic forgetting starts). 1e-3 collapses the model. The team picks 1e-4 and ships, with regression-eval configured to run the same battery on every weekly retrain — so the next time the rate or scheduler shifts, the regression surfaces immediately.
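The selection rule in this workflow can be sketched as a gate over the sweep's score deltas. The numbers below echo the illustrative scenario above, with a hypothetical value standing in for the collapsed 1e-3 run:

```python
# Candidate models keyed by learning rate, with target-task lift and
# out-of-domain delta versus the base model (illustrative numbers).
candidates = {
    "lr_5e-5": {"target_delta": 0, "ood_delta": 0},    # underfits
    "lr_1e-4": {"target_delta": 9, "ood_delta": 0},    # clean lift
    "lr_5e-4": {"target_delta": 12, "ood_delta": -8},  # forgetting begins
    "lr_1e-3": {"target_delta": -40, "ood_delta": -40},  # collapsed
}

# Gate: reject any candidate with an out-of-domain regression,
# then pick the best target-task lift among the survivors.
passing = {k: v for k, v in candidates.items() if v["ood_delta"] >= 0}
winner = max(passing, key=lambda k: passing[k]["target_delta"])
print(winner)  # lr_1e-4
```

The same gate, run automatically on every retrain, is what turns a one-off sweep into a regression-eval pipeline.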
For RLHF runs, the same pattern adds ContentSafety and ActionSafety to catch alignment regressions that aggressive learning rates cause when the policy drifts away from the reference.
How to Measure or Detect It
The learning-rate decision is best measured via downstream evaluation, not just training-loss curves:
- Training-loss curve — a flat curve suggests the rate is too small; a rising, oscillating, or NaN loss suggests it is too large.
- Held-out perplexity — sanity check that the model still models held-out text.
- AnswerRelevancy — task-relevance signal on the target capability.
- Faithfulness — for grounded tasks, whether support against the context has degraded.
- HallucinationScore — early warning for out-of-domain regression.
- Per-cohort eval-fail-rate — surfaces uneven degradation across slices.
- Out-of-domain delta — `(out-of-domain-score-after) - (out-of-domain-score-before)` should not be negative.
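The held-out perplexity check in the list above is just the exponentiated mean per-token negative log-likelihood. A minimal sketch, with illustrative token NLLs:

```python
import math

# Perplexity over a held-out set: exp of the mean per-token
# negative log-likelihood. Lower is better.
def perplexity(token_nlls):
    return math.exp(sum(token_nlls) / len(token_nlls))

print(perplexity([2.1, 1.8, 2.4, 2.0]))  # about 7.96
```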
```python
from fi.evals import AnswerRelevancy, HallucinationScore

ar = AnswerRelevancy()
hs = HallucinationScore()

# Run the same evaluators against each candidate learning-rate model
for lr_label in ["base", "lr_5e-5", "lr_1e-4", "lr_5e-4"]:
    # ... call your model variant and capture its output ...
    print(lr_label, ar.evaluate(input="...", output="..."))
    print(lr_label, hs.evaluate(input="...", output="...", context="..."))
```
Common Mistakes
- Picking a learning rate by training-loss alone. Loss going down is necessary but not sufficient; out-of-domain quality can degrade silently.
- No warmup on a high-rate schedule. The first hundred steps with a hot rate destabilise pretrained weights.
- Same rate across LoRA, full fine-tune, and RLHF. Different update mechanics need different rates by orders of magnitude.
- Skipping the regression-eval gate. Without an automated regression-eval run, the bad rate ships and breaks production.
- Optimising for one task and ignoring catastrophic forgetting. Always include out-of-domain prompts in the eval cohort.
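The warmup mistake above can be made concrete with a minimal linear-warmup schedule — a sketch only; real schedules usually decay after the ramp rather than holding flat:

```python
# Linear warmup: ramp the learning rate from 0 to its peak over
# warmup_steps, then hold at the peak.
def lr_at(step, peak_lr=1e-4, warmup_steps=100):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr

# Zero at step 0, half of peak midway through warmup, peak afterwards.
print(lr_at(0), lr_at(50), lr_at(100))
```

Skipping this ramp means the very first updates hit pretrained weights at the full peak rate.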
Frequently Asked Questions
What is learning rate in machine learning?
Learning rate is the scalar that controls how far a gradient-descent optimiser moves model weights at each step. Too small means slow or stuck training; too large means divergent loss. It is often the most consequential hyperparameter.
How is learning rate different from batch size?
Learning rate sets the step size per update; batch size sets how many examples each update averages over. They interact — larger batches usually need larger learning rates — but they control distinct dimensions of the optimisation.
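The batch-size interaction is often approximated by the linear scaling rule — a common heuristic, not a guarantee, and the starting numbers here are purely illustrative:

```python
# Linear scaling heuristic: when batch size grows by a factor k,
# scale the learning rate by the same factor k.
def scale_lr(base_lr, base_batch, new_batch):
    return base_lr * new_batch / base_batch

# 8x larger batch -> 8x larger rate under this heuristic.
print(scale_lr(1e-4, 32, 256))  # 0.0008
```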
How does FutureAGI relate to learning rate?
FutureAGI does not train models, so it does not set the learning rate. We evaluate the resulting models — every fine-tune at a candidate learning rate runs against a versioned dataset with AnswerRelevancy, Faithfulness, and HallucinationScore evaluators, exposing which rate produced which quality trade-off.