Deep dive into language model calibration: Platt scaling, isotonic regression, temperature scaling

Share

# Entry

A model that claims to be 90% confident should be right 90% of the time. When this relationship falls apart, you will receive incorrect calibration problem. The model results no longer say anything useful about reliability.

For enormous language models (LLM) miscalibration is common. AND NAACL 2024 Study found that confidence scores deviated from actual correctness rates in QA, code generation, and inference tasks.

Other test biomedical models found average calibration scores ranging from just 23.9% to 46.6% across all models tested. The difference is constant.

Standard solution in classic machine learning is post-hoc recalibration: fit a uncomplicated function to the set aside validation set to map the raw confidence scores to better calibrated probabilities.

Three dominant methods: temperature scaling, Platt scalingAND isotonic regression. All three were designed for discriminative classifiersand applying them to LLM requires caution.

LLM calibration

# Measurement calibration

The dominant metric is Expected calibration error (ECG). Groups predictions into confidence intervals, calculates the difference between mean confidence and observed accuracy within each interval, and averages across intervals weighted by size. ECE = 0 is the perfect calibration.

A reliability diagram shows the relationship between confidence and accuracy. A perfectly calibrated model sits on a diagonal. Below is the overconfidence model: The curve shows high confidence, but accuracy cannot keep up.

LLM calibration

AND Rating 2025 GPT-4o-mini as a text classifier found that 66.7% of errors occurred at confidence levels above 80% – the canonical pattern of overconfidence.

ECE alone is increasingly seen as insufficient. AND research article recommends pairing ECE with Brier scoreoverconfidence factors and reliability diagrams combined. A single number obscures significant differences in where and how the model behaves incorrectly.

# Why LLMs complicate the standard setup

The three methods we discuss assume a constant output space. The classifier generates one probability per class, and calibration maps them to better estimates.

LLM don’t act this way.

There are four complications that matter here.

LLM calibration

The output space is exponentially enormous: confidence cannot be computed at the sequence level. Semantically equivalent outcomes may have very different token-level probabilities. The trust disagrees with the details; AND research article atomic calibration showed that generative models show the lowest average confidence in the middle of the generation, rather than at the beginning or end.

Many LLMs only reveal the probabilities of the highest-k tokens through their APItherefore, classic calibration approaches that rely on full logit access require modification.

LLM calibration

# Applying temperature scaling

Temperature scaling divides the logit vector by the T scalar before applying softmax. When T > 1, the distribution flattens and confidence decreases. When T

T fits the validation set held out by minimizing the negative log-likelihood. The method adds one parameter, preserves prediction rankings, and is economical to compute.

The original formula targeted DenseNet image classifiers. In the case of LLM, temperature controls the probability distribution of the vocabulary at each decoding stage, so the same logic applies.

The problem is this Reinforcement learning from human feedback (RLHF). Post-RLHF models develop input-dependent overconfidence: the degree of miscalibration varies with input, and a single T cannot explain this variation.

Average ECE scores above 0.377 have been documented for models such as GPT-3 on verbalized self-confidence tasks and 2025 study confirms that RLHF-tuned models consistently overestimate confidence in all cases.

Adaptive temperature scaling (ATS) deals with this directly. ATS predicts the temperature for each token based on hidden features at the token level, adapting to a supervised tuning dataset rather than using a single fixed T. Researchers confirmed that ATS improved calibration by 10-50% without compromising task performance. For any RLHF-tuned model, ATS provides a stronger baseline than standard temperature scaling.

Standard temperature scaling still works well for pre-RLHF base models. When miscalibration is approximately uniform across all inputs, a single T is often sufficient to correct for systematic over- or under-confidence.

The problem is specific to post-RLHF models, where input-dependent overconfidence means that a single T cannot correct for all inputs.

# Application of Platt scaling

Platt scaling fits a logistic function based on uncalibrated results: p = σ(A·s + B), where A and B learn from the issued validation set with binary correctness labels.

The sigmoid shape provides a parametric mapping with two free parameters.

Platt scaling was originally developed for SVMs, but can be generalized to any system that produces a scalar confidence score.

LLM calibration

Two-parameter fitting also allows for productive operate of data compared to isotonic regression: it can generate useful estimates from a smaller calibration set, which is crucial in implementation contexts where labeled validity data is circumscribed.

In the context of LLM, Platt scaling works on sequence-level or token-level confidence scores.

AND paper based on the code confidence generated by LLM, it was found that Platt scaling produced better calibrated results than uncalibrated results. Another study on LLM for Text to SQL Conversion has been introduced Platt’s multidimensional scaling (MPS), extending single-variable Platt scaling to combine sub-score frequency scores across multiple generated samples – consistently outperforming single-score baselines.

Two limitations are documented. First, global Platt scaling at the sequence level is too abrasive for tasks where correctness depends on local editing decisions: a single sigmoid mapping is unable to capture sample-dependent miscalibration patterns.

Additionally, Platt scaling may degrade correct scoring performance for robust models.

# Using isotonic regression

Isotonic regression is performed using a non-parametric method.

It learns a piecewise constant, monotone, non-decreasing mapping from uncalibrated outcomes to calibrated probabilities using Neighboring pool violators algorithm (PAWA). There is no assumed shape for the calibration function, which makes it more malleable than Platt scaling when the confidence-accuracy relationship is not sigmoidal.

The piecewise constant output adapts to any monotonous shape: linear, stepped or concave. This adaptability is the main reason why isotonic regression tends to outperform Platt scaling in empirical comparisons.

The cost is the risk of overfitting for miniature calibration sets. Mapping generalizes well only when there is enough data to constrain it.

Empirically, isotonic regression outperforms Platt scaling.

Exacting comparison across multiple datasets and architectures showed that isotonic regression outperforms Platt scaling on ECE and Brier scores with statistical significance, using paired tests with a Bonferroni correction at α = 0.003.

LLM calibration

In this study, the Random Forest baseline improved from a reliability score of 0.8268 in the uncalibrated condition, to 0.9551 in the Platt scale, and to 0.9660 in the isotonic regression condition. Both methods can degrade actual scoring performance for robust models, but the isotonic advantage persists consistently.

For multiclass LLM settings, it has been shown that standard isotonic regression can be further improved with extensions supporting normalization, consistently outperforming both OvR isotonic regression and standard parametric methods on NLL and ECE.

The data requirement is a binding restriction. The advantage of isotonic regression is real, but it does not translate to low-data deployment scenarios.

# What literature leaves open

Three gaps it is worth marking them before implementing any of these methods.

The RLHF the interaction was only examined in terms of temperature scaling. How Platt scaling and isotonic regression in post-RLHF models has not been systematically tested. ATS exists because standard temperature scaling required an explicit solution in this case. It is an open question whether the other two methods require similar extensions.

LLM Calibration

The most direct comparisons all three methods are taken from the general machine learning calibration literature. LLM-specific benchmarks that test all three directly are uncommon. ICSE Code Calibration 2025 paper is one of the few and its scope is circumscribed to code generation.

The size of the calibration set is a real implementation limitation. The isotonic regression results from the articles assume that the datasets are enormous enough to limit mapping. In production with a circumscribed number of labeled examples, the gap between isotonic regression and Platt’s scale may close or reverse.

# Application

Temperature scaling is the right starting point for most teams. For entry-level models without RLHF, a single T is often sufficient.

For RLHF-tuned models, switch to ATS: per-token temperature supports input-dependent overconfidence that the global scalar lacks.

Platt scaling is a practical choice when the calibration set is miniature or when calibration requires inclusion in a larger pipeline. It is productive in data processing and uncomplicated to implement. The limitation is scope: it cannot capture miscalibration, which varies from sample to sample and tends to degrade performance for robust models.

Isotonic regression has the strongest empirical track record of the three. Utilize it when the calibration set is enormous enough to constrain the mapping without overfitting, and combine it with extensions that support normalization in multiclass settings.

The decision that comes before all this is what “trust” means for the task. Symbolic probability, sequence probability, verbalized confidence, and consistency across samples can all produce different values ​​for the same result. A calibration method applied to the wrong signal does not improve reliability. The correct definition is a prerequisite for any of the above methods to work.

Nate Rosidi is a data scientist and product strategist. He is also an adjunct professor of analytics and the founder of StrataScratch, a platform that helps data scientists prepare for job interviews using real interview questions from top companies. Nate writes about the latest career trends, gives interview advice, shares data science projects, and discusses all things SQL.

Latest Posts

More News