What Is CatBoost?

A gradient-boosting library from Yandex optimized for categorical features, using ordered boosting to reduce target leakage.

CatBoost is an open-source gradient-boosting library developed by Yandex, with first-class support for categorical features. It introduces two design choices that distinguish it from XGBoost and LightGBM: ordered boosting, which reduces target leakage by computing residuals on rows the current model has not yet seen, and categorical encoding based on running statistics rather than naive label encoding. CatBoost trains on CPU or GPU, exports to ONNX, and tends to beat or match XGBoost and LightGBM on tabular benchmarks where categorical columns dominate. It does not generate text.

Why It Matters in Production LLM and Agent Systems

CatBoost rarely sits in the hot path of an LLM application — it lives upstream. Common 2026 placements include intent classifiers that route inbound contacts to the right LLM agent, eval-result regressors that predict whether a model output is likely to fail, scoring models that rank retrieved chunks before reranking, and tabular-feature classifiers that fuse user metadata into a routing decision.

The pain shows up where these classical models drift while their LLM consumers stay still. An intent classifier trained six months ago on tabular features now mis-routes 8% of “billing” calls to a “technical-support” agent — the LLM tier dutifully processes each call to a poor outcome, and the resolution-rate dashboard blames the LLM. An eval-result regressor predicts “low risk” on a class of prompts that the new model variant actually fails on; the team trusts the prediction and ships without manual review.

For 2026 multi-tier systems, the lesson is that classical and LLM components share one production surface. CatBoost-style models drift differently from LLMs (concept drift on categorical distributions, feature drift on continuous columns), but the consequences land in the LLM tier — wrong route, wrong rank, wrong risk score. Capacity, accuracy, and cost are joint properties of the whole stack, not the LLM alone.

How FutureAGI Handles CatBoost Models in the Stack

FutureAGI does not train, serve, or directly evaluate CatBoost models. We evaluate the LLM systems they feed and ride alongside, and we surface the joint behavior on the trace.

Concretely: a contact-center team uses CatBoost to classify caller intent from telephony metadata before dispatch. The classifier output is logged as a span_attribute on the routing span. FutureAGI then runs ConversationResolution and TaskCompletion on the LLM voice agent the call lands on. The dashboard slices resolution-rate by predicted-intent — if “billing-dispute” routes resolve at 0.82 but “tech-support” routes resolve at 0.41, the team can ask whether the agent is bad at tech-support, or the classifier is mis-routing tech-support traffic to the wrong agent. Comparing classifier-predicted intent to a sampled human-labeled intent (via the FutureAGI annotation queue) tells you which.
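The span-attribute logging described above can be sketched as follows. This is a minimal illustration, not the FutureAGI SDK's actual tracing API: the span is represented by a plain attributes dict, and `classify_intent` is a hypothetical stand-in for a trained CatBoost classifier.

```python
# Sketch: attach classifier output to the routing span's attributes before
# the call is handed to the LLM agent. Attribute names are illustrative.

def classify_intent(call_metadata: dict) -> tuple[str, float]:
    # Stand-in for a trained CatBoostClassifier; returns (label, probability).
    if call_metadata.get("last_invoice_overdue"):
        return "billing-dispute", 0.91
    return "tech-support", 0.77

def route_call(call_metadata: dict, span_attributes: dict) -> str:
    intent, proba = classify_intent(call_metadata)
    # These are the attributes the dashboard later slices resolution-rate by.
    span_attributes["classifier.predicted_intent"] = intent
    span_attributes["classifier.predicted_proba"] = proba
    return intent

attrs: dict = {}
route = route_call({"last_invoice_overdue": True}, attrs)
```

The point of the pattern is that the classifier's decision travels on the same trace as the LLM turn it triggered, so the two can be joined at analysis time.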

For pre-deployment evaluation of a new CatBoost classifier version, the team versions a Dataset of historical calls with both old-classifier and new-classifier predictions, runs the LLM through both, and compares aggregate TaskCompletion. The result is a single regression-eval number that says whether the new classifier helps or hurts the downstream LLM — closer to what production cares about than the classifier’s offline F1 alone.
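The aggregation step of that regression eval can be sketched as below. Collection of the per-call TaskCompletion scores via fi.evals is elided; the function only shows how the two cohorts reduce to the single delta the team ships on, and the sample scores are made up for illustration.

```python
# Sketch: reduce per-call TaskCompletion scores collected under the old and
# new classifier versions to one regression-eval number.

def regression_delta(old_scores: list[float], new_scores: list[float]) -> dict:
    old_mean = sum(old_scores) / len(old_scores)
    new_mean = sum(new_scores) / len(new_scores)
    return {
        "old_task_completion": round(old_mean, 2),
        "new_task_completion": round(new_mean, 2),
        "delta": round(new_mean - old_mean, 3),  # positive = new classifier helps
    }

report = regression_delta(
    old_scores=[0.8, 0.4, 0.6, 0.8],  # same calls, routed by the old classifier
    new_scores=[0.8, 0.6, 0.6, 0.8],  # routed by the new classifier
)
```

A positive delta says the new classifier version improves downstream LLM outcomes on the versioned Dataset, independent of its offline F1.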

How to Measure or Detect It

Joint health of CatBoost + LLM is measured at five layers:

  • Classifier-level metrics: F1, AUC, log-loss on a labeled holdout — standard for tabular models.
  • Resolution-rate-by-predicted-cohort: dashboard signal showing ConversationResolution sliced by classifier output; surfaces routing failures.
  • fi.evals.TaskCompletion: per-call agent score; the downstream proxy for whether the upstream classifier did its job.
  • Drift diagnostics: PSI or KL-divergence on input features compared to training distribution; classical drift signals catch CatBoost issues early.
  • Annotation-queue agreement: agreement rate between classifier prediction and human label sampled via the FutureAGI annotation queue.
A minimal per-call check looks like:

from fi.evals import TaskCompletion

task = TaskCompletion()
result = task.evaluate(
    input="My internet has been down since this morning.",
    output="I've opened ticket #T-4421 and dispatched a technician for tomorrow morning."
)
print(result.score, result.reason)
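The PSI drift diagnostic from the list above can be computed directly from category shares. This is a minimal sketch for a single categorical feature; the distributions are illustrative, and the usual rule of thumb (PSI < 0.1 stable, 0.1–0.25 moderate shift, > 0.25 investigate) is a convention, not a hard threshold.

```python
# Population Stability Index for a categorical feature: compares the live
# category distribution against the training distribution. A small epsilon
# guards against log(0) when a category is missing from one side.
import math

def psi(expected: dict, actual: dict, eps: float = 1e-4) -> float:
    categories = set(expected) | set(actual)
    total = 0.0
    for c in categories:
        e = max(expected.get(c, 0.0), eps)  # training share of category c
        a = max(actual.get(c, 0.0), eps)    # live share of category c
        total += (a - e) * math.log(a / e)
    return total

train_dist = {"billing": 0.5, "tech-support": 0.4, "sales": 0.1}
live_dist = {"billing": 0.3, "tech-support": 0.4, "sales": 0.2, "new-product": 0.1}
drift = psi(train_dist, live_dist)
```

Note how the unseen "new-product" category dominates the score: exactly the new-category-level failure mode that silently degrades routing.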

Common Mistakes

  • Tuning the classifier on offline F1 alone. Optimize for downstream LLM resolution, not classifier accuracy in isolation.
  • Skipping target-leakage checks. CatBoost reduces leakage in encodings but does not stop you from leaking labels through engineered features.
  • Ignoring categorical drift. A new product line introduces new category levels; without retraining, CatBoost defaults silently and routing degrades.
  • Confusing prediction probability with confidence. A 0.51-vs-0.49 prediction is effectively a coin flip; add an abstain path instead of trusting it.
  • Letting the classifier and the LLM teams own separate dashboards. Joint behavior needs a joint surface.

Frequently Asked Questions

What is CatBoost?

CatBoost is an open-source gradient-boosting library from Yandex, designed to handle categorical features natively using ordered boosting and target-based encodings, with GPU training support.

How is CatBoost different from XGBoost and LightGBM?

All three are gradient-boosting libraries. CatBoost handles categorical features without manual encoding and uses ordered boosting to reduce target leakage. XGBoost is the classic baseline; LightGBM uses leaf-wise growth and tends to be fastest on dense numeric features.

How does FutureAGI relate to CatBoost?

FutureAGI doesn't train CatBoost models. We evaluate the LLM applications they sit upstream of — for example, an intent classifier feeding a routing policy — using AnswerRelevancy and TaskCompletion against the LLM's downstream output.