Skip to content

feat: implement burstable strategy for mem and cpu#531

Draft
mircea-pavel-anton wants to merge 1 commit into
robusta-dev:mainfrom
mircea-pavel-anton:feat/burstable-strategy
Draft

feat: implement burstable strategy for mem and cpu#531
mircea-pavel-anton wants to merge 1 commit into
robusta-dev:mainfrom
mircea-pavel-anton:feat/burstable-strategy

Conversation

@mircea-pavel-anton

@mircea-pavel-anton mircea-pavel-anton commented Jun 12, 2026

Copy link
Copy Markdown

The purpose of this PR is to add a basic "burstable" recommendation strategy.

As far as I can tell, the current strategies enforce a QoS recommendation (limit == request) for memory, while allowing CPU to be burstable.

While this makes sense in certain scenarios, I'm also running KRR in my homelab to propose resource adjustments. In such cases, apps are most of the time idling, with occasional usage spikes. Setting requests and limits to the same value as to accommodate that spike would result In massive over provisioning in this situation.

Hence I'm working on getting this strategy going that can recommend individual values for requests and limits based on percentiles, just like CPU recommendations work in the other strategies.

For reference, I'm running my fork of KRR (the code in this PR) like so: https://github.com/mirceanton/home-ops/tree/main/apps%2Fjobs%2Fkrr%2Fapp

To generate this "dashboard": mirceanton/home-ops#559

Signed-off-by: Mircea-Pavel ANTON <contact@mirceanton.com>
@CLAassistant

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Review Change Stack

Walkthrough

This PR introduces a new BurstableStrategy for Kubernetes resource recommendation based on percentile Prometheus metrics. It adds a PercentileMemoryLoader metric factory, a configurable strategy with settings for percentiles and buffers, CPU/memory proposal logic with HPA awareness and optional OOMKill memory bumping, and comprehensive tests.

Changes

BurstableStrategy with Percentile Metrics and OOMKill Integration

Layer / File(s) Summary
Percentile Memory Metric Loader
robusta_krr/core/integrations/prometheus/metrics/memory.py, robusta_krr/core/integrations/prometheus/metrics/__init__.py
PercentileMemoryLoader(percentile) factory validates percentile bounds and returns a PrometheusMetric subclass generating PromQL quantile_over_time expressions over container_memory_working_set_bytes.
BurstableStrategy Settings and Module Setup
robusta_krr/strategies/__init__.py, robusta_krr/strategies/burstable.py (module helpers and settings)
BurstableStrategySettings defines configurable percentiles, buffers, history thresholds, and HPA/OOMKill options; _named_loader helper wraps metric factories with class naming; strategy is exported at package level.
BurstableStrategy Core Logic
robusta_krr/strategies/burstable.py (strategy class and proposal methods)
BurstableStrategy computes CPU request from max percentile request (optionally omitting limit if disabled) and memory request from max percentile request; memory limit is the max of percentile-limit-plus-buffer and optional OOMKill-derived bumped value; both proposals validate data presence, enforce points_required, return undefined when HPA is detected (unless allow_hpa enabled), and handle NaN when data is insufficient.
BurstableStrategy Tests
tests/test_burstable.py
CPU tests verify request/limit separation, disabled limit handling, and NaN on missing/insufficient data; memory tests verify buffering behavior, OOMKill limit bumping (including when ignored), and HPA-driven undefined behavior; CLI smoke tests validate command and settings structure.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 6.45% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive No pull request description was provided by the author, making it impossible to assess whether it relates to the changeset. Add a pull request description explaining the purpose, implementation details, and testing approach for the burstable strategy feature.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: implementing a burstable strategy for memory and CPU resource recommendations.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/test_burstable.py (1)

132-199: ⚡ Quick win

Add regression tests for empty limit loaders to prevent np.max([]) crashes.

Current tests don’t cover the case where request loader has data but limit loader is empty. Add one CPU and one memory test to ensure run() returns undefined rather than raising.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_burstable.py` around lines 132 - 199, Add two regression tests to
tests/test_burstable.py that cover the case where the request loader has data
but the corresponding limit loader is empty so run() does not call np.max([])
and instead returns undefined; specifically, add one test for CPU and one for
Memory using make_strategy, make_history_data (populate request values but set
e.g. "BurstableCpuLimitLoader" or "BurstableMemoryLimitLoader" to {}), call
strategy.run(history, make_object_data()), and assert that
result[ResourceType.Cpu].limit (and result[ResourceType.Memory].limit) is NaN
(use math.isnan) or that request is NaN as appropriate to match other tests;
reference the existing helpers make_strategy, make_history_data,
make_object_data, ResourceType, and strategy.run so the new tests follow the
same patterns as the surrounding tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@robusta_krr/core/integrations/prometheus/metrics/memory.py`:
- Line 88: The code is rounding the quantile value before emitting PromQL:
replace the use of round(percentile / 100, 2) with the precise percentile/100
value (or format it as a full-precision float string) so the emitted PromQL uses
the exact quantile; locate the place constructing the query (the expression
containing {round(percentile / 100, 2)}) where the variable percentile is used
and remove the rounding so quantiles like 95.5 and 99.9 become 0.955 and 0.999
in the emitted query.
- Line 83: The parameter name `object` in the method get_query(self, object:
K8sObjectData, duration: str, step: str) shadows a Python builtin and triggers
Ruff A002; rename the parameter (e.g., to `obj` or `k8s_obj`) in the get_query
signature and update all uses inside the method (and any callers in this
module/class) to the new name, keeping the type K8sObjectData unchanged so
behavior is preserved.

In `@robusta_krr/strategies/burstable.py`:
- Around line 137-141: The cpu and memory limit computations call np.max on
history_data["BurstableCPULimitLoader"] / "BurstableMemoryLimitLoader" values
without checking for emptiness; modify the cpu_limit and memory_limit
expressions (the code that computes cpu_limit and memory_limit in burstable.py)
to first check that the respective loader dict exists and has at least one
non-empty value (e.g., len(history_data.get("BurstableCPULimitLoader", {})) > 0
and that any(v.size for v in values) or similar) and only then call np.max,
otherwise set the limit to None (respecting disable_cpu_limit /
disable_memory_limit). Use the existing loader keys ("BurstableCPULimitLoader",
"BurstableMemoryLimitLoader") and history_data variable names so the fix is
local and avoids ValueError from np.max on empty sequences.

---

Nitpick comments:
In `@tests/test_burstable.py`:
- Around line 132-199: Add two regression tests to tests/test_burstable.py that
cover the case where the request loader has data but the corresponding limit
loader is empty so run() does not call np.max([]) and instead returns undefined;
specifically, add one test for CPU and one for Memory using make_strategy,
make_history_data (populate request values but set e.g.
"BurstableCpuLimitLoader" or "BurstableMemoryLimitLoader" to {}), call
strategy.run(history, make_object_data()), and assert that
result[ResourceType.Cpu].limit (and result[ResourceType.Memory].limit) is NaN
(use math.isnan) or that request is NaN as appropriate to match other tests;
reference the existing helpers make_strategy, make_history_data,
make_object_data, ResourceType, and strategy.run so the new tests follow the
same patterns as the surrounding tests.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2c60d03b-e80d-493a-95bf-a2070e82ab16

📥 Commits

Reviewing files that changed from the base of the PR and between 383275a and a9e2b4b.

📒 Files selected for processing (5)
  • robusta_krr/core/integrations/prometheus/metrics/__init__.py
  • robusta_krr/core/integrations/prometheus/metrics/memory.py
  • robusta_krr/strategies/__init__.py
  • robusta_krr/strategies/burstable.py
  • tests/test_burstable.py

raise ValueError("percentile must be between 0 and 100")

class PercentileMemoryLoader(PrometheusMetric):
def get_query(self, object: K8sObjectData, duration: str, step: str) -> str:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Rename the object parameter to avoid lint failure (Ruff A002).

Line 83 introduces a builtin-shadowing argument name (object), which is flagged by Ruff and may break CI lint gates.

💡 Suggested fix
-        def get_query(self, object: K8sObjectData, duration: str, step: str) -> str:
-            pods_selector = "|".join(pod.name for pod in object.pods)
+        def get_query(self, object_data: K8sObjectData, duration: str, step: str) -> str:
+            pods_selector = "|".join(pod.name for pod in object_data.pods)
             cluster_label = self.get_prometheus_cluster_label()
             return f"""
                 quantile_over_time(
                     {percentile / 100},
                     max(
                         container_memory_working_set_bytes{{
-                            namespace="{object.namespace}",
+                            namespace="{object_data.namespace}",
                             pod=~"{pods_selector}",
-                            container="{object.container}"
+                            container="{object_data.container}"
                             {cluster_label}
                         }}
🧰 Tools
🪛 Ruff (0.15.15)

[error] 83-83: Function argument object is shadowing a Python builtin

(A002)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@robusta_krr/core/integrations/prometheus/metrics/memory.py` at line 83, The
parameter name `object` in the method get_query(self, object: K8sObjectData,
duration: str, step: str) shadows a Python builtin and triggers Ruff A002;
rename the parameter (e.g., to `obj` or `k8s_obj`) in the get_query signature
and update all uses inside the method (and any callers in this module/class) to
the new name, keeping the type K8sObjectData unchanged so behavior is preserved.

Source: Linters/SAST tools

cluster_label = self.get_prometheus_cluster_label()
return f"""
quantile_over_time(
{round(percentile / 100, 2)},

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not round the quantile value before emitting PromQL.

Line 88 rounds percentile / 100 to 2 decimals, which silently shifts requested percentiles (e.g., 95.5 → 0.95, 99.9 → 1.0). That changes recommendation behavior.

💡 Suggested fix
-                    {round(percentile / 100, 2)},
+                    {percentile / 100},
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
{round(percentile / 100, 2)},
{percentile / 100},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@robusta_krr/core/integrations/prometheus/metrics/memory.py` at line 88, The
code is rounding the quantile value before emitting PromQL: replace the use of
round(percentile / 100, 2) with the precise percentile/100 value (or format it
as a full-precision float string) so the emitted PromQL uses the exact quantile;
locate the place constructing the query (the expression containing
{round(percentile / 100, 2)}) where the variable percentile is used and remove
the rounding so quantiles like 95.5 and 99.9 become 0.955 and 0.999 in the
emitted query.

Comment on lines +137 to +141
cpu_limit = (
None
if self.settings.disable_cpu_limit
else np.max([v[0, 1] for v in history_data["BurstableCPULimitLoader"].values()])
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Guard limit-series loaders before np.max to avoid runtime exceptions.

In both CPU (Lines 137-141) and memory (Lines 178-182), np.max is called on loader values without checking that the loader dict is non-empty. If the limit series is missing/sparse while request series exists, this throws ValueError and fails the strategy run.

💡 Suggested fix
         cpu_request = np.max([v[0, 1] for v in data_req.values()])
+        data_limit = history_data["BurstableCPULimitLoader"]
+        if not self.settings.disable_cpu_limit and len(data_limit) == 0:
+            return ResourceRecommendation.undefined(info="No data")
         cpu_limit = (
             None
             if self.settings.disable_cpu_limit
-            else np.max([v[0, 1] for v in history_data["BurstableCPULimitLoader"].values()])
+            else np.max([v[0, 1] for v in data_limit.values()])
         )

@@
         memory_request = np.max([v[0, 1] for v in data_req.values()])
-        memory_limit_base = np.max([v[0, 1] for v in history_data["BurstableMemoryLimitLoader"].values()])
+        data_limit = history_data["BurstableMemoryLimitLoader"]
+        if len(data_limit) == 0:
+            return ResourceRecommendation.undefined(info="No data")
+        memory_limit_base = np.max([v[0, 1] for v in data_limit.values()])

Also applies to: 178-182

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@robusta_krr/strategies/burstable.py` around lines 137 - 141, The cpu and
memory limit computations call np.max on history_data["BurstableCPULimitLoader"]
/ "BurstableMemoryLimitLoader" values without checking for emptiness; modify the
cpu_limit and memory_limit expressions (the code that computes cpu_limit and
memory_limit in burstable.py) to first check that the respective loader dict
exists and has at least one non-empty value (e.g.,
len(history_data.get("BurstableCPULimitLoader", {})) > 0 and that any(v.size for
v in values) or similar) and only then call np.max, otherwise set the limit to
None (respecting disable_cpu_limit / disable_memory_limit). Use the existing
loader keys ("BurstableCPULimitLoader", "BurstableMemoryLimitLoader") and
history_data variable names so the fix is local and avoids ValueError from
np.max on empty sequences.

@mircea-pavel-anton mircea-pavel-anton marked this pull request as draft June 12, 2026 19:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants