
What Is Active Learning?

A data-selection workflow that routes the most informative, uncertain, or high-impact examples to human labeling.


Active learning is a model-development workflow where the system chooses which examples deserve human labeling next. It is applied around training, fine-tuning, and evaluation data, especially when labels are expensive. In production LLM and agent systems, active learning shows up in traces, datasets, and annotation queues: FutureAGI can route uncertain outputs, evaluator disagreements, high-impact failures, and repeated user complaints into fi.queues.AnnotationQueue for targeted review.
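
In loop form, one round of the workflow is just select, review, merge. A minimal sketch, where the three callables are stand-ins for the selection policy, the human review step, and the dataset merge, not SDK functions:

# One active-learning round as a plain function. The three callables are
# stand-ins for the selection policy, human review, and dataset merge.
def active_learning_round(traces, select, review, merge, dataset):
    selected = [t for t in traces if select(t)]  # pick informative examples
    labeled = review(selected)                   # human annotation pass
    return merge(dataset, labeled)               # feed back into evals/data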

Why It Matters in Production LLM and Agent Systems

Random labeling wastes review time on examples the model already handles. The harder failure is false coverage: a team believes it has a good dataset because it has many labels, while the long-tail failures that hurt users remain under-sampled. That leads to silent hallucinations, weak refusals, wrong tool choices, and fine-tunes that improve common cases while missing the defects that triggered the project.

Developers feel the pain as stale evals. A release fails on one workflow, but the labeled set contains mostly easy chat turns. SREs see the same gap as rising retry rate, p99 latency from unnecessary fallbacks, or token-cost-per-trace growth caused by prompts compensating for missing training evidence. Product and support teams see repeated thumbs-down clusters that never become test cases. Compliance teams care when disputed answers, policy exceptions, or safety incidents are not promoted into reviewed data.

Agentic systems make active learning more important because a bad example is rarely just a final answer. It may include a retrieval miss, an ambiguous user intent, a planner step, a tool call, a guardrail decision, and a final response. A useful labeling workflow has to preserve that trace context. In today's multi-step pipelines, the highest-value examples are often disagreement cases: the evaluator says pass but the user escalates; the tool succeeded but the task failed; the answer is grounded but the policy outcome is unacceptable.
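
Those disagreement patterns are easy to state as a filter. A sketch over a plain trace dict, where every field name is an assumption about how the trace is stored, not a FutureAGI schema:

# Flag the three disagreement patterns named above. Field names are
# illustrative assumptions about the stored trace.
def is_disagreement(trace: dict) -> bool:
    pass_but_escalated = trace["evaluator_passed"] and trace["user_escalated"]
    tool_ok_task_failed = trace["tool_succeeded"] and not trace["task_completed"]
    grounded_policy_bad = trace["grounded"] and not trace["policy_acceptable"]
    return pass_but_escalated or tool_ok_task_failed or grounded_policy_bad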

How FutureAGI Handles Active Learning

FutureAGI’s approach is to make active learning operational: select examples from production evidence, review them in a queue, and feed the reviewed data back into evals and datasets. The SDK surface for this is fi.queues.AnnotationQueue, which supports queue creation, labels, item assignment, submitted annotations, scores, progress, analytics, agreement, imports, and exports.

Example: a financial-support agent is instrumented with traceAI-langchain. Each trace records the prompt, retrieved policy snippets, response, route, llm.token_count.prompt, user outcome, and evaluator scores from Groundedness, FactualAccuracy, and ToolSelectionAccuracy. The active-learning policy does not sample every trace. It selects four cohorts: low evaluator confidence, evaluator-human disagreement, high-severity topics such as refunds or account closure, and repeated thumbs-down messages with similar embeddings.
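
A sketch of that four-cohort policy over plain trace dicts; the thresholds and field names are illustrative assumptions, not SDK constants:

HIGH_SEVERITY_TOPICS = {"refunds", "account_closure"}

# Map a trace to one of the four selection cohorts, or None to skip it.
# Thresholds and field names are illustrative assumptions.
def select_cohort(trace: dict) -> str | None:
    if trace["eval_confidence"] < 0.6:
        return "low_confidence"
    if trace["evaluator_passed"] and trace["user_thumbs_down"]:
        return "evaluator_human_disagreement"
    if trace["topic"] in HIGH_SEVERITY_TOPICS:
        return "high_severity"
    if trace["complaint_cluster_size"] >= 3:
        return "repeated_complaints"
    return None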

The engineer sends those selected traces into fi.queues.AnnotationQueue with labels such as unsupported_claim, wrong_tool, missing_policy_context, and acceptable_refusal. Reviewers label the full trace, not just the final answer. If Groundedness passes but reviewers mark unsupported_claim, the evaluator threshold or rubric needs calibration. If ToolSelectionAccuracy fails on a specific planner step, the next action is a regression eval before changing prompts or fine-tuning data.
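
A sketch of that queueing step. The constructor arguments and the add_item call are assumptions about the fi.queues.AnnotationQueue surface, so confirm the exact signatures against the SDK reference:

from fi.queues import AnnotationQueue

# Hypothetical usage: the constructor arguments and add_item are assumed,
# not verified FutureAGI signatures.
queue = AnnotationQueue(
    name="financial-support-active-learning",
    labels=[
        "unsupported_claim", "wrong_tool",
        "missing_policy_context", "acceptable_refusal",
    ],
)
for trace in selected_traces:  # output of the cohort policy above
    queue.add_item(trace)      # assumed item-submission method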

Unlike a generic uncertainty-sampling loop in scikit-learn, this workflow is tied to LLM traces, evaluator outputs, human labels, and product severity. The result is a smaller queue with higher expected impact per label.
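
For reference, the generic loop looks like this least-confidence query in scikit-learn, with no notion of traces, severity, or evaluator disagreement:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled, y_labeled = rng.normal(size=(100, 4)), rng.integers(0, 2, 100)
X_unlabeled = rng.normal(size=(1000, 4))

# Classic least-confidence sampling: rank unlabeled rows by how unsure
# the classifier is, then send the top of the ranking for labeling.
model = LogisticRegression().fit(X_labeled, y_labeled)
uncertainty = 1.0 - model.predict_proba(X_unlabeled).max(axis=1)
query_idx = np.argsort(uncertainty)[-10:]  # 10 most uncertain examples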

How to Measure or Detect Active Learning

Measure active learning by the value of selected labels, not by queue size (a sketch of the first two metrics follows this list):

  • Selection precision: percent of queued items that reviewers confirm as defects, ambiguous cases, or release-gate examples.
  • Queue progress from fi.queues.AnnotationQueue: assigned, reviewed, accepted, rejected, agreement, score distribution, and export count.
  • Evaluator disagreement: cases where Groundedness, FactualAccuracy, or ToolSelectionAccuracy conflicts with human labels.
  • Eval-fail-rate-by-cohort: failure rate before and after reviewed examples are added to the golden dataset.
  • User-feedback proxy: thumbs-down rate, escalation rate, support reopen rate, and complaint clusters for selected cohorts.
  • Trace fields: llm.token_count.prompt, route, model id, tool name, and agent.trajectory.step grouped by selected versus unselected traces.
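
A sketch of the first two metrics computed from an export; the verdict values and field names are assumptions about the export shape, not a fixed schema:

from collections import Counter

# "reviewed" is an exported list of queue items; field names are assumed.
def selection_precision(reviewed: list[dict]) -> float:
    confirmed = sum(
        r["verdict"] in {"defect", "ambiguous", "release_gate"} for r in reviewed
    )
    return confirmed / len(reviewed)

def majority_agreement(labels_per_item: list[list[str]]) -> float:
    # Fraction of reviewer labels matching each item's majority label.
    matches = total = 0
    for labels in labels_per_item:
        matches += Counter(labels).most_common(1)[0][1]
        total += len(labels)
    return matches / total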

Minimal evaluator context:

from fi.evals import Groundedness

# model_output, retrieved_policy_chunks, and trace_id come from the
# instrumented trace; send_to_annotation_queue is a project-level helper
# wrapping fi.queues.AnnotationQueue, not an SDK function.
result = Groundedness().evaluate(
    response=model_output,
    context=retrieved_policy_chunks,
)
if result.score < 0.7:  # low-confidence cohort threshold
    send_to_annotation_queue(trace_id)

The strongest signal is improvement per reviewed label: if 200 new annotations do not reduce regression failures or user escalations, the selection policy is choosing the wrong examples.
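
As a toy version of that check, with assumed before-and-after counts:

# Assumed counts: regression failures before/after adding 200 annotations.
failures_before, failures_after, new_labels = 48, 31, 200
gain_per_label = (failures_before - failures_after) / new_labels
print(f"{gain_per_label:.3f} fewer regression failures per reviewed label")  # 0.085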

Common Mistakes

  • Sampling only low-confidence outputs. Some serious LLM failures are high-confidence hallucinations, policy violations, or tool actions that look syntactically correct.
  • Labeling final answers without trace context. Active learning for agents needs planner, retriever, tool, guardrail, and response spans.
  • Treating reviewer agreement as optional. Low agreement means the rubric is unclear, even if the queue is complete.
  • Feeding every labeled item into training. Some examples belong in evals or policy tests, not fine-tuning data.
  • Ignoring severity. A rare compliance defect can be worth more than hundreds of harmless style disagreements; see the weighting sketch after this list.
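
A severity-weighted priority sketch; the weights and trace fields are illustrative policy choices, not SDK defaults:

# Severity-weighted queue priority; weights and field names are
# illustrative, not SDK defaults.
SEVERITY_WEIGHT = {"compliance": 10.0, "wrong_tool": 3.0, "style": 0.1}

def priority(trace: dict) -> float:
    uncertainty = 1.0 - trace["eval_confidence"]
    return uncertainty * SEVERITY_WEIGHT.get(trace["defect_type"], 1.0)

candidate_traces = [
    {"eval_confidence": 0.9, "defect_type": "compliance"},  # rare but severe
    {"eval_confidence": 0.4, "defect_type": "style"},       # uncertain, harmless
]
# The compliance case outranks the style case despite higher confidence.
ordered = sorted(candidate_traces, key=priority, reverse=True)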

Frequently Asked Questions

What is active learning?

Active learning is a model-development workflow where the system selects the most valuable examples, typically uncertain, high-impact, or disputed ones, for human labeling instead of labeling data at random.

How is active learning different from semi-supervised learning?

Active learning asks humans to label selected examples. Semi-supervised learning uses a small labeled set plus a larger unlabeled set, often with model-generated pseudo-labels.

How do you measure active learning?

FutureAGI measures active learning through `fi.queues.AnnotationQueue` progress, evaluator disagreement from `Groundedness` or `FactualAccuracy`, eval-fail-rate-by-cohort, and improvement after labeled items enter a regression dataset.