feat: switch evaluation metric to log-odds drop by dhalmazna · Pull Request #18 · RationAI/ciao

dhalmazna · 2026-04-28T18:51:01Z

Context:
This PR changes the logit drop to log-odds drop. Decided on this after consultation with Vojtěch Kůr.

What's Changed / Added:

ciao/scoring/region.py:
- Introduced the log_odds_for_class function, which computes the numerically stable log-odds metric (target_logit - logsumexp(other_logits)).
- Updated _compute_batch_deltas and calculate_region_deltas to compute and return the Log-Odds drop instead of raw logit differences.
ciao/explainer/ciao_explainer.py: Extracted the base original_log_odds computation directly to the CIAOExplainer class. This value is computed exactly once per image and passed down the pipeline, removing redundant baseline forward passes.
ciao/model/predictor.py: Cleaned up the ModelPredictor class by consolidating outputs. get_logits is now the core method for raw tensors, get_predictions returns the Softmax, and get_class_logit_batch was removed in favor of direct tensor slicing.
ciao/algorithm/builder.py, context.py, lookahead.py, & segments.py: Updated typing, constructors, and parameter passing to expect original_log_odds instead of original_logit.
ciao/__main__.py: Updated MLflow logging to track original_log_odds both globally and per-region for easier debugging of model confidence.

Related Task:
XAI-29

Summary by CodeRabbit

Release Notes

New Features
- Added configurable seed selection mode to control explanation behavior
- Included original model probability and log-odds in explanation results
- Added combined metrics for multi-region explanations
Bug Fixes
- Fixed duplicate metric logging issues in explanation runs
- Improved conditional metric handling for multi-region scenarios
Chores
- Updated gitignore entries for test and temporary directories

…on results Co-authored-by: Copilot <copilot@github.com>

Co-authored-by: Copilot <copilot@github.com>

…lanation results

coderabbitai · 2026-04-28T18:51:16Z

Warning

Review limit reached

@dhalmazna, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 40 minutes and 9 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8f5003d2-2ec5-4781-8e3f-874a6cb1a127

📥 Commits

Reviewing files that changed from the base of the PR and between 76cbeb0 and 9b0f026.

📒 Files selected for processing (1)

ciao/algorithm/lookahead.py

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/log-odds

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

This PR updates CIAO’s evaluation/scoring pipeline to use log-odds drop (instead of raw logit drop) as the primary delta metric, and threads a precomputed baseline log-odds through the explainer/search code to avoid redundant forward passes.

Changes:

Added a numerically-stable log-odds helper and switched region delta computation to log-odds drop.
Precomputes baseline logits/probabilities/log-odds once per image in CIAOExplainer and passes the baseline downstream.
Simplified ModelPredictor API (logits as the core output; probabilities derived via softmax) and updated callers + MLflow logging.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`ciao/scoring/segments.py`	Threads `original_log_odds` into surrogate dataset generation and updates target definition to log-odds drop.
`ciao/scoring/region.py`	Adds `log_odds_for_class` and switches `_compute_batch_deltas`/`calculate_region_deltas` to log-odds drop.
`ciao/model/predictor.py`	Makes `get_logits` the core inference method and redefines `get_predictions` as softmax over logits; removes class-logit helper.
`ciao/explainer/ciao_explainer.py`	Computes baseline logits/probabilities/log-odds once and passes baseline values through scoring/builder.
`ciao/algorithm/lookahead.py`	Updates calls into region scoring to pass `original_log_odds`.
`ciao/algorithm/context.py`	Extends `SearchContext` to carry `original_log_odds`.
`ciao/algorithm/builder.py`	Accepts/passes `original_log_odds` and takes `original_prob` as an input (no baseline forward pass here).
`ciao/__main__.py`	Logs `original_log_odds` as an MLflow metric alongside existing per-region metrics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

gemini-code-assist

Code Review

This pull request refactors the scoring mechanism to use log-odds instead of raw logits for evaluating model predictions. It introduces a new utility function log_odds_for_class for numerically stable computation and updates the ModelPredictor and various scoring functions to propagate the pre-computed original_log_odds throughout the explanation pipeline. I have no feedback to provide as there are no review comments.

Co-authored-by: Copilot <copilot@github.com>

…logging

…log_odds device/dtype Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

ciao/explainer/ciao_explainer.py (1)
56-57: ⚡ Quick win

Validate sigma at the explain() boundary.

CIAOExplainer.explain() now exposes sigma, but invalid values are only rejected later in _select_seed_and_sign() after the expensive preprocessing/segmentation/scoring path has already run. Please validate sigma alongside the other public inputs here so config errors fail fast and consistently.
Suggested change
         if max_regions <= 0:
             raise ValueError(f"max_regions must be positive, got {max_regions}")
         if desired_length <= 0:
             raise ValueError(f"desired_length must be positive, got {desired_length}")
         if batch_size <= 0:
             raise ValueError(f"batch_size must be positive, got {batch_size}")
+        if sigma not in (None, -1, 1):
+            raise ValueError(f"sigma must be one of None, -1, or 1, got {sigma!r}")
Based on learnings: Follow the “Boundary vs. Core” principle: parameter validation should be done only at the boundary layer (e.g., CIAOExplainer.explain), while core/internal functions should assume validated inputs.

Also applies to: 70-74
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ciao/explainer/ciao_explainer.py` around lines 56 - 57, Validate the sigma
parameter at the public boundary inside CIAOExplainer.explain: add input checks
for sigma (e.g., non-None allowed types/range/allowed SeedSelectionMode values)
alongside the existing validations for desired_length and other public inputs,
and raise a clear ValueError/TypeError if invalid; then remove or keep only
assertions in the internal _select_seed_and_sign() that assume validated inputs
so expensive preprocessing/segmentation/scoring only runs for valid sigma
values.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ciao/algorithm/lookahead.py`:
- Around line 88-97: The code stores a sign-normalized score
(signed_best/current_best_score) into the trajectory "best_score" field causing
logged values to differ from RegionResult.score when optimization_sign == -1;
change the append logic in the lookahead trajectory code so the comparison still
uses signed_best/current_best_score for decision-making but the trajectory entry
stores the raw best_score (or rename the field if you prefer) instead of
current_best_score; apply the same fix to the second occurrence that mirrors
lines 124-129 so both trajectory logs persist the raw score while retaining the
signed value for comparisons.

---

Nitpick comments:
In `@ciao/explainer/ciao_explainer.py`:
- Around line 56-57: Validate the sigma parameter at the public boundary inside
CIAOExplainer.explain: add input checks for sigma (e.g., non-None allowed
types/range/allowed SeedSelectionMode values) alongside the existing validations
for desired_length and other public inputs, and raise a clear
ValueError/TypeError if invalid; then remove or keep only assertions in the
internal _select_seed_and_sign() that assume validated inputs so expensive
preprocessing/segmentation/scoring only runs for valid sigma values.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 726551b2-936b-47b5-adc7-4538fd4e36af

📥 Commits

Reviewing files that changed from the base of the PR and between d31cf6d and 76cbeb0.

📒 Files selected for processing (10)

.gitignore
ciao/__main__.py
ciao/algorithm/builder.py
ciao/algorithm/context.py
ciao/algorithm/lookahead.py
ciao/explainer/ciao_explainer.py
ciao/model/predictor.py
ciao/scoring/region.py
ciao/scoring/segments.py
configs/base.yaml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

dhalmazna and others added 4 commits April 28, 2026 20:45

feat: enhance logging and computation of original logits in explanati…

e56fdaf

…on results Co-authored-by: Copilot <copilot@github.com>

feat: add original logit parameter to region building functions

7b6449a

Co-authored-by: Copilot <copilot@github.com>

feat: change logit drop to log-odds drop

de963c9

feat: update logging to use original log-odds instead of logit in exp…

72fe86d

…lanation results

dhalmazna self-assigned this Apr 28, 2026

dhalmazna requested review from a team and Copilot April 28, 2026 18:51

Copilot started reviewing on behalf of dhalmazna April 28, 2026 18:51 View session

dhalmazna marked this pull request as draft April 28, 2026 18:52

Copilot AI reviewed Apr 28, 2026

View reviewed changes

Comment thread ciao/scoring/region.py

Comment thread ciao/scoring/region.py Outdated

gemini-code-assist Bot reviewed Apr 28, 2026

View reviewed changes

dhalmazna and others added 8 commits April 29, 2026 07:03

feat: add sigma parameter for seed selection in region building

c5a3cef

Co-authored-by: Copilot <copilot@github.com>

feat: implement time tracking in trajectory

1c204af

fix: deduplicate time tracking

dc4f980

feat: add jobs_testing and .claude to .gitignore

9bc7009

feat: add combined score and probability drop to explanation results …

919dd37

…logging

fix: avoid logits clone in log_odds_for_class and normalize original_…

cdfdaee

…log_odds device/dtype Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

style: apply ruff formatting to region.py

7ee8433

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore: ignore local outputs and external frameworks

76cbeb0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread ciao/algorithm/lookahead.py

fix: store raw score in trajectory instead of sign-normalized objective

9b0f026

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

dhalmazna marked this pull request as ready for review June 3, 2026 10:00

dhalmazna requested review from Adames4, vejtek and vojtech-kur June 3, 2026 10:00

Adames4 approved these changes Jun 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: switch evaluation metric to log-odds drop#18

feat: switch evaluation metric to log-odds drop#18
dhalmazna wants to merge 13 commits into
masterfrom
feat/log-odds

dhalmazna commented Apr 28, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 28, 2026 •

edited

Loading

Review limit reached

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

dhalmazna commented Apr 28, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dhalmazna commented Apr 28, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 28, 2026 •

edited

Loading