
Conversation

@theo-barfoot

Description

Addresses #8505

Overview

This PR adds calibration error metrics and an Ignite handler to MONAI, enabling users to evaluate and monitor model calibration for segmentation and other multi-class probabilistic tasks with inputs of shape (B, C, spatial...).

What's Included

1. Calibration Metrics (monai/metrics/calibration.py)

  • calibration_binning(): Core function to compute calibration bins with mean predictions, mean ground truths, and bin counts. Exported to support research workflows where users need per-bin statistics for plotting reliability diagrams.
  • CalibrationReduction: Enum supporting three reduction methods (defined formally after this list):
    • EXPECTED - Expected Calibration Error (ECE): weighted average by bin count
    • AVERAGE - Average Calibration Error (ACE): simple average across bins
    • MAXIMUM - Maximum Calibration Error (MCE): maximum error across bins
  • CalibrationErrorMetric: A CumulativeIterationMetric subclass supporting:
    • Configurable number of bins
    • Background channel exclusion (include_background)
    • All standard MONAI metric reductions (mean, sum, mean_batch, etc.)
    • Batched, per-class computation
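
For reference, the three reductions follow the standard binned definitions (Guo et al., 2017). Writing $N$ for the total sample count, $n_b$ for the count in bin $b$, and $\mathrm{conf}(b)$, $\mathrm{acc}(b)$ for the per-bin mean prediction and mean ground truth (empty bins, reported as NaN, are skipped; the non-empty-bin handling shown here is inferred from the NaN behavior described below):

$$\mathrm{ECE} = \sum_{b} \frac{n_b}{N}\,\lvert\mathrm{acc}(b)-\mathrm{conf}(b)\rvert, \qquad \mathrm{ACE} = \frac{1}{\lvert\{b : n_b > 0\}\rvert}\sum_{b:\,n_b>0} \lvert\mathrm{acc}(b)-\mathrm{conf}(b)\rvert, \qquad \mathrm{MCE} = \max_{b:\,n_b>0} \lvert\mathrm{acc}(b)-\mathrm{conf}(b)\rvert$$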

2. Ignite Handler (monai/handlers/calibration.py)

  • CalibrationError: An IgniteMetricHandler wrapper that:
    • Attaches to PyTorch Ignite engines for training/validation loops
    • Supports save_details for per-sample/per-channel metric details via the metric buffer
    • Integrates with MONAI's existing handler ecosystem

3. Comprehensive Tests

  • tests/metrics/test_calibration_metric.py: Tests covering:
    • Binning function correctness with NaN handling
    • ECE/ACE/MCE reduction modes
    • Background exclusion
    • Cumulative iteration behavior
    • Input validation (shape mismatch, ndim, num_bins)
  • tests/handlers/test_handler_calibration_error.py: Tests covering:
    • Handler attachment and computation via engine.run()
    • All calibration reduction modes
    • save_details functionality
    • Optional Ignite dependency handling (tests skip if Ignite is not installed; see the sketch after this list)
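
A minimal sketch of the optional-dependency guard used by the handler tests, built on MONAI's optional_import and min_version helpers; the pinned version string and test name here are illustrative assumptions, not copied from the test file:

import unittest

from monai.utils import min_version, optional_import

# "0.4.11" is an assumed minimum version, for illustration only.
Engine, has_ignite = optional_import("ignite.engine", "0.4.11", min_version, "Engine")

@unittest.skipUnless(has_ignite, "Requires pytorch-ignite.")
class TestHandlerCalibrationError(unittest.TestCase):
    def test_import(self):
        # Placeholder test body; the real tests exercise the handler via engine.run().
        self.assertTrue(has_ignite)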

Public API

Exposes the following via monai.metrics:

  • CalibrationErrorMetric
  • CalibrationReduction
  • calibration_binning

Exposes via monai.handlers:

  • CalibrationError

Implementation Notes

  • Uses scatter_add + counts instead of scatter_reduce("mean") for better PyTorch version compatibility (see the sketch after this list)
  • Includes input validation with clear error messages
  • Clamps bin indices to prevent out-of-range errors with slightly out-of-bound probabilities
  • Uses torch.nan_to_num instead of in-place operations for cleaner code
  • Ignite is treated as an optional dependency in tests (skipped if not installed)
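
To illustrate the first and third points, here is a minimal self-contained sketch (not the PR's actual code) of per-bin mean predictions via scatter_add plus counts, including the index clamping and the NaN result for empty bins:

import torch

num_bins = 15
probs = torch.rand(1000)  # flattened probabilities in [0, 1]

# Clamp so values exactly at 1.0 (or slightly above) stay in the last bin.
bin_idx = torch.clamp((probs * num_bins).long(), max=num_bins - 1)

sums = torch.zeros(num_bins).scatter_add_(0, bin_idx, probs)
counts = torch.zeros(num_bins).scatter_add_(0, bin_idx, torch.ones_like(probs))

# Empty bins yield 0/0 = NaN rather than a misleading 0.
mean_pred_per_bin = sums / counts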

Related Work

The algorithmic approach follows the calibration metrics from Average-Calibration-Losses (Barfoot et al., "Average Calibration Losses for Reliable Uncertainty in Medical Image Segmentation," arXiv:2506.03942), with the standard ECE/ACE/MCE formulations tracing back to Guo et al., 2017 (ICML).

Future Work

As discussed in the issue, calibration losses will be added in a separate PR to keep changes focused and easier to review.

Checklist

  • Code follows MONAI style guidelines (ruff passes)
  • All new code has appropriate license headers
  • Public API is exported in __init__.py files
  • Docstrings include examples with proper transforms usage
  • Unit tests cover main functionality
  • Tests handle optional Ignite dependency gracefully
  • No breaking changes to existing API

Example Usage

from monai.metrics import CalibrationErrorMetric
from monai.transforms import Activations, AsDiscrete

# Setup transforms (num_classes is the number of classes for your task)
num_classes = 4  # example value
softmax = Activations(softmax=True)
to_onehot = AsDiscrete(to_onehot=num_classes)

# Create metric
metric = CalibrationErrorMetric(
    num_bins=15,
    include_background=False,
    calibration_reduction="expected"  # ECE
)

# In evaluation loop (assuming dictionary-style batches from the dataloader)
# Note: y_pred should be probabilities in [0, 1], y should be one-hot/binarized
for batch_data in dataloader:
    images, labels = batch_data["image"], batch_data["label"]
    logits = model(images)
    preds = softmax(logits)
    labels_onehot = to_onehot(labels)
    metric(y_pred=preds, y=labels_onehot)

ece = metric.aggregate()
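
For research workflows that need the raw per-bin statistics (e.g. to plot a reliability diagram), calibration_binning can be called directly; the argument and return names below are assumed from the description above rather than copied from the code:

from monai.metrics import calibration_binning

# Assumed to return per-bin mean predictions, mean ground truths, and counts.
mean_pred, mean_gt, counts = calibration_binning(preds, labels_onehot, num_bins=15)
# Plotting mean_gt against mean_pred for non-empty bins gives the reliability
# diagram; the identity line corresponds to perfect calibration.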

With Ignite Handler

from monai.handlers import CalibrationError, from_engine

calibration_handler = CalibrationError(
    num_bins=15,
    include_background=False,
    calibration_reduction="expected",
    output_transform=from_engine(["pred", "label"]),
    save_details=True,
)
calibration_handler.attach(evaluator, name="calibration_error")
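
Once attached, standard Ignite semantics apply: after the evaluator runs, the aggregated value is available under the attachment name (val_loader below is a placeholder for your validation DataLoader):

evaluator.run(val_loader)
ece = evaluator.state.metrics["calibration_error"]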

@coderabbitai
Contributor

coderabbitai bot commented Jan 16, 2026

📝 Walkthrough

Adds end-to-end calibration tooling: new metrics module monai/metrics/calibration.py providing calibration_binning, CalibrationReduction, and CalibrationErrorMetric; new handler monai/handlers/calibration.py exposing a CalibrationError IgniteMetricHandler wrapper; updates monai/metrics/__init__.py and monai/handlers/__init__.py to export the new symbols; adds comprehensive unit tests for metrics and the handler; and adds documentation entries for the new metric and handler.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 40.91%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2)
  • Title check: ✅ Passed. Title clearly summarizes the main addition: two new calibration components (metric and handler).
  • Description check: ✅ Passed. Description covers overview, included components, public API, tests, implementation notes, and examples, though the PR template checklist items are not explicitly filled out.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@monai/metrics/calibration.py`:
- Around line 228-235: In the CalibrationReduction.MAXIMUM branch, don’t convert
NaN to 0 (which hides “no data”); instead use a -inf sentinel when calling
torch.nan_to_num on abs_diff (e.g. nan=-torch.inf), take the max along dim=-1,
then detect buckets that were all-NaN (e.g. all_nan_mask =
torch.isnan(abs_diff).all(dim=-1)) and restore those positions in the result to
NaN; update the method where self.calibration_reduction is checked (the MAXIMUM
branch that uses abs_diff_no_nan) accordingly and add a unit test covering the
“all bins empty” case to prevent regressions.
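
A small sketch of the suggested sentinel pattern (illustrative, not the committed fix): replace NaN with -inf before taking the max so that empty data cannot masquerade as a calibration error of 0, then restore NaN where every bin was empty:

import torch

abs_diff = torch.tensor([[0.1, float("nan"), 0.3], [float("nan")] * 3])

all_nan_mask = torch.isnan(abs_diff).all(dim=-1)
max_vals = torch.nan_to_num(abs_diff, nan=float("-inf")).max(dim=-1).values
max_vals[all_nan_mask] = float("nan")
# max_vals -> tensor([0.3000, nan])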
📜 Review details

Configuration used: .coderabbit.yaml (review profile: CHILL; plan: Pro; cache disabled by an organization data-retention setting; knowledge base disabled via Reviews -> Disable Knowledge Base)

📥 Commits

Reviewing files that changed from the base of the PR and between 57fdd59 and 202b25f.

📒 Files selected for processing (6)
  • monai/handlers/__init__.py
  • monai/handlers/calibration.py
  • monai/metrics/__init__.py
  • monai/metrics/calibration.py
  • tests/handlers/test_handler_calibration_error.py
  • tests/metrics/test_calibration_metric.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

⚙️ CodeRabbit configuration file

Review the Python code for quality and correctness. Ensure variable names adhere to PEP8 style guides and are sensible and informative with regard to their function, though permitting simple names for loop and comprehension variables. Ensure routine names are meaningful with regard to their function and use verbs, adjectives, and nouns in a semantically appropriate way. Docstrings should be present for all definitions, describing each variable, return value, and raised exception in the appropriate section of Google-style docstrings. Examine code for logical errors or inconsistencies, and suggest what may be changed to address these. Suggest any enhancements improving efficiency, maintainability, comprehensibility, and correctness. Ensure new or modified definitions will be covered by existing or new unit tests.

Files:

  • monai/metrics/__init__.py
  • monai/handlers/__init__.py
  • monai/handlers/calibration.py
  • tests/handlers/test_handler_calibration_error.py
  • monai/metrics/calibration.py
  • tests/metrics/test_calibration_metric.py
🧬 Code graph analysis (6)
monai/metrics/__init__.py (1)
monai/metrics/calibration.py (3)
  • CalibrationErrorMetric (139-260)
  • CalibrationReduction (125-136)
  • calibration_binning (30-122)
monai/handlers/__init__.py (1)
monai/handlers/calibration.py (1)
  • CalibrationError (23-71)
monai/handlers/calibration.py (1)
monai/utils/enums.py (1)
  • MetricReduction (239-250)
tests/handlers/test_handler_calibration_error.py (4)
monai/handlers/calibration.py (1)
  • CalibrationError (23-71)
monai/handlers/utils.py (1)
  • from_engine (170-210)
monai/utils/module.py (2)
  • min_version (273-285)
  • optional_import (315-445)
tests/test_utils.py (1)
  • assert_allclose (119-159)
monai/metrics/calibration.py (4)
monai/metrics/metric.py (1)
  • CumulativeIterationMetric (296-353)
monai/metrics/utils.py (2)
  • do_metric_reduction (71-130)
  • ignore_background (54-68)
monai/utils/enums.py (2)
  • MetricReduction (239-250)
  • StrEnum (68-90)
monai/utils/profiling.py (1)
  • end (430-432)
tests/metrics/test_calibration_metric.py (3)
monai/metrics/calibration.py (4)
  • CalibrationErrorMetric (139-260)
  • CalibrationReduction (125-136)
  • calibration_binning (30-122)
  • aggregate (239-260)
monai/utils/enums.py (1)
  • MetricReduction (239-250)
monai/metrics/metric.py (1)
  • get_buffer (282-293)
🪛 Ruff (0.14.11)

tests/handlers/test_handler_calibration_error.py
  • 106, 142, 168: ARG001, unused function argument: engine

monai/metrics/calibration.py
  • 23-27: RUF022, __all__ is not sorted (apply an isort-style sorting to __all__)
  • 71, 73, 75, 237, 256: TRY003, avoid specifying long messages outside the exception class
  • 204: ARG002, unused method argument: kwargs
  • 256: TRY004, prefer TypeError exception for invalid type
⏰ Context from checks skipped due to timeout of 90000ms; the timeout can be raised in the CodeRabbit configuration to a maximum of 15 minutes (900000ms). Skipped GitHub Checks (19): min-dep-py3 (3.9, 3.10, 3.11, 3.12); min-dep-os (ubuntu-latest, macOS-latest, windows-latest); min-dep-pytorch (2.5.1, 2.6.0, 2.7.1, 2.8.0); quick-py3 (ubuntu-latest, macOS-latest, windows-latest); flake8-py3 (pytype, codeformat, mypy); packaging; build-docs
🔇 Additional comments (14)
monai/handlers/__init__.py (1)

15-15: LGTM!

Import is correctly placed alphabetically and aligns with the new CalibrationError handler in monai/handlers/calibration.py.

monai/metrics/__init__.py (1)

16-16: LGTM!

Public API exports correctly added for the new calibration functionality.

tests/metrics/test_calibration_metric.py (4)

14-25: LGTM!

Imports and device setup are appropriate. Good use of _device for CUDA/CPU portability.


142-196: LGTM!

Comprehensive binning tests with proper NaN handling and input validation coverage.


199-253: LGTM!

All three calibration reduction modes tested with proper isolation from metric reduction.


256-354: LGTM!

Good coverage of metric options including include_background, metric_reduction, get_not_nans, cumulative behavior, and reset.

tests/handlers/test_handler_calibration_error.py (3)

19-25: LGTM!

Proper optional import pattern for Ignite with version check and skip decorator.


82-122: LGTM!

Handler tests properly verify metric computation and details shape. The unused engine parameter in _val_func is required by Ignite's callback signature.


124-181: LGTM!

Edge case tests cover single iteration and save_details=False behavior with appropriate defensive checks.

monai/handlers/calibration.py (1)

23-71: LGTM!

Clean handler implementation following MONAI patterns. Docstring adequately documents all parameters. Consider adding a usage example similar to other handlers if desired.

monai/metrics/calibration.py (4)

30-122: calibration_binning looks solid

Validation, binning, and empty-bin NaN handling are clear and consistent with the stated contract.


125-136: Enum values are clear

Naming and values match expected calibration reduction modes.


187-203: Init wiring looks good

Config is stored cleanly and defaults are sensible.


239-260: Aggregate logic is clean

Reduction and get_not_nans behavior are consistent with MONAI patterns.


@ericspod
Member

Hi @theo-barfoot, thanks for the calibration classes! There's one comment from CodeRabbit to consider, and the flake8 issue can be fixed as it suggests. Otherwise I have only had a brief look, but I think it's good, although I would add more information in the docstrings about what the classes are for, perhaps a usage example if that's possible in a few lines, and you can reference your paper and others on the subject (a primary-source paper on calibration would be good). A good few people won't be familiar with calibration, so it would help to explain the context for these classes. We also need to update the docs to refer to the new classes and allow Sphinx to autogenerate things, i.e. here and elsewhere.

theo-barfoot and others added 3 commits January 19, 2026 15:43
- Add calibration_binning() function for hard binning calibration
- Add CalibrationErrorMetric with ECE/ACE/MCE reduction modes
- Add CalibrationError Ignite handler
- Add comprehensive tests for metrics and handler

Addresses Project-MONAI#8505

Signed-off-by: Theo Barfoot <[email protected]>
- Fix MAXIMUM reduction to return NaN (not 0.0) for all-empty bins (CodeRabbit)
- Enhance docstrings with 'Why Calibration Matters' section explaining
  that probabilities should match observed accuracy
- Add paper references: Guo et al. 2017 (ICML primary source),
  MICCAI 2024,
- Add Sphinx autodoc entries to metrics.rst and handlers.rst
- Improve parameter documentation and usage examples

Signed-off-by: Theo Barfoot <[email protected]>
@theo-barfoot theo-barfoot force-pushed the feature/calibration-metrics branch from cb53022 to f3446ce on January 19, 2026 at 15:43
@theo-barfoot
Author

Thanks for the feedback @ericspod . I have implemented your suggested changes. Let me know what you think.

My plan is also to open another PR to implement the calibration auxiliary losses described in #8505. However, I wanted to split the work into multiple PRs for simplicity. Also, the journal paper for these calibration losses is under final review before publication, so I could wait for that to be published first. Alternatively, I can combine the calibration losses into this PR if that is more convenient.

@ericspod
Member

> Thanks for the feedback @ericspod. I have implemented your suggested changes. Let me know what you think.
>
> My plan is also to open another PR to implement the calibration auxiliary losses described in #8505. However, I wanted to split the work into multiple PRs for simplicity. Also, the journal paper for these calibration losses is under final review before publication, so I could wait for that to be published first. Alternatively, I can combine the calibration losses into this PR if that is more convenient.

As we discussed we'll put the losses into another PR and use the arxiv preprint link here which can be updated later.

Signed-off-by: Theo Barfoot <[email protected]>
@theo-barfoot
Author

Great thanks, I have added the arXiv links for my paper.

I'll get another PR underway for the losses and a calibration tutorial.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@monai/metrics/calibration.py`:
- Around line 71-73: Remove the stray "}}} " prefix from the paper title string
occurrences—search for the text snippet "}}} Average Calibration Losses for
Reliable Uncertainty in Medical Image Segmentation." in
monai/metrics/calibration.py and the handler file and delete the leading "}}} "
so the reference begins with "Average Calibration Losses..."; ensure all three
occurrences (the one around lines 71-73, the one around 213-214, and the handler
file occurrences) are updated consistently.
🧹 Nitpick comments (5)
monai/handlers/calibration.py (1)

93-109: Consider exposing get_not_nans and right parameters.

CalibrationErrorMetric accepts get_not_nans and right parameters that aren't exposed here. Users wanting tuple returns or right-inclusive bins must bypass this handler.

Proposed addition
     def __init__(
         self,
         num_bins: int = 20,
         include_background: bool = True,
         calibration_reduction: CalibrationReduction | str = CalibrationReduction.EXPECTED,
         metric_reduction: MetricReduction | str = MetricReduction.MEAN,
         output_transform: Callable = lambda x: x,
         save_details: bool = True,
+        get_not_nans: bool = False,
+        right: bool = False,
     ) -> None:
         metric_fn = CalibrationErrorMetric(
             num_bins=num_bins,
             include_background=include_background,
             calibration_reduction=calibration_reduction,
             metric_reduction=metric_reduction,
+            get_not_nans=get_not_nans,
+            right=right,
         )
monai/metrics/calibration.py (4)

23-23: __all__ not sorted.

Per Ruff RUF022 hint.

Suggested fix
-__all__ = ["calibration_binning", "CalibrationErrorMetric", "CalibrationReduction"]
+__all__ = ["CalibrationErrorMetric", "CalibrationReduction", "calibration_binning"]

108-133: Nested loops could be vectorized, but documented tradeoff.

The note at line 76-77 acknowledges the loop-based approach. For large batch/channel counts this may be slow, but scatter_add compatibility is prioritized.


303-305: Minor: scalar tensor creation in hot path.

torch.tensor(float("nan"), device=max_vals.device) creates a new tensor each call. Pre-allocating or using max_vals.new_tensor(float("nan")) is marginally cleaner but not critical.

Alternative
-            max_vals = torch.where(
-                max_vals == float("-inf"), torch.tensor(float("nan"), device=max_vals.device), max_vals
-            )
+            max_vals = torch.where(
+                max_vals == float("-inf"), max_vals.new_tensor(float("nan")), max_vals
+            )

326-327: Consider TypeError for type mismatch.

Ruff TRY004 suggests TypeError is more appropriate when checking isinstance.

Suggested fix
         if not isinstance(data, torch.Tensor):
-            raise ValueError("the data to aggregate must be PyTorch Tensor.")
+            raise TypeError("the data to aggregate must be PyTorch Tensor.")

Comment on lines +71 to +73
- Barfoot, T., et al. "}}} Average Calibration Losses for Reliable Uncertainty in
Medical Image Segmentation." arXiv:2506.03942v3, 2025.
https://arxiv.org/abs/2506.03942v3
@coderabbitai coderabbitai bot commented Jan 21, 2026

⚠️ Potential issue | 🟡 Minor

Stray characters in reference title.

The }}} prefix in the paper title appears to be a typo or copy-paste artifact.

Suggested fix
-        - Barfoot, T., et al. "}}} Average Calibration Losses for Reliable Uncertainty in
+        - Barfoot, T., et al. "Average Calibration Losses for Reliable Uncertainty in
           Medical Image Segmentation." arXiv:2506.03942v3, 2025.

Note: Same issue appears at lines 213-214 and in the handler file at lines 61-62.


Member

@theo-barfoot This should be fixed for the handler's docstring as well.

@coderabbitai coderabbitai bot

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
