feat: implement burstable strategy for mem and cpu#531
feat: implement burstable strategy for mem and cpu#531mircea-pavel-anton wants to merge 1 commit into
Conversation
Signed-off-by: Mircea-Pavel ANTON <contact@mirceanton.com>
|
|
WalkthroughThis PR introduces a new ChangesBurstableStrategy with Percentile Metrics and OOMKill Integration
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
tests/test_burstable.py (1)
132-199: ⚡ Quick winAdd regression tests for empty limit loaders to prevent
np.max([])crashes.Current tests don’t cover the case where request loader has data but limit loader is empty. Add one CPU and one memory test to ensure
run()returns undefined rather than raising.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_burstable.py` around lines 132 - 199, Add two regression tests to tests/test_burstable.py that cover the case where the request loader has data but the corresponding limit loader is empty so run() does not call np.max([]) and instead returns undefined; specifically, add one test for CPU and one for Memory using make_strategy, make_history_data (populate request values but set e.g. "BurstableCpuLimitLoader" or "BurstableMemoryLimitLoader" to {}), call strategy.run(history, make_object_data()), and assert that result[ResourceType.Cpu].limit (and result[ResourceType.Memory].limit) is NaN (use math.isnan) or that request is NaN as appropriate to match other tests; reference the existing helpers make_strategy, make_history_data, make_object_data, ResourceType, and strategy.run so the new tests follow the same patterns as the surrounding tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@robusta_krr/core/integrations/prometheus/metrics/memory.py`:
- Line 88: The code is rounding the quantile value before emitting PromQL:
replace the use of round(percentile / 100, 2) with the precise percentile/100
value (or format it as a full-precision float string) so the emitted PromQL uses
the exact quantile; locate the place constructing the query (the expression
containing {round(percentile / 100, 2)}) where the variable percentile is used
and remove the rounding so quantiles like 95.5 and 99.9 become 0.955 and 0.999
in the emitted query.
- Line 83: The parameter name `object` in the method get_query(self, object:
K8sObjectData, duration: str, step: str) shadows a Python builtin and triggers
Ruff A002; rename the parameter (e.g., to `obj` or `k8s_obj`) in the get_query
signature and update all uses inside the method (and any callers in this
module/class) to the new name, keeping the type K8sObjectData unchanged so
behavior is preserved.
In `@robusta_krr/strategies/burstable.py`:
- Around line 137-141: The cpu and memory limit computations call np.max on
history_data["BurstableCPULimitLoader"] / "BurstableMemoryLimitLoader" values
without checking for emptiness; modify the cpu_limit and memory_limit
expressions (the code that computes cpu_limit and memory_limit in burstable.py)
to first check that the respective loader dict exists and has at least one
non-empty value (e.g., len(history_data.get("BurstableCPULimitLoader", {})) > 0
and that any(v.size for v in values) or similar) and only then call np.max,
otherwise set the limit to None (respecting disable_cpu_limit /
disable_memory_limit). Use the existing loader keys ("BurstableCPULimitLoader",
"BurstableMemoryLimitLoader") and history_data variable names so the fix is
local and avoids ValueError from np.max on empty sequences.
---
Nitpick comments:
In `@tests/test_burstable.py`:
- Around line 132-199: Add two regression tests to tests/test_burstable.py that
cover the case where the request loader has data but the corresponding limit
loader is empty so run() does not call np.max([]) and instead returns undefined;
specifically, add one test for CPU and one for Memory using make_strategy,
make_history_data (populate request values but set e.g.
"BurstableCpuLimitLoader" or "BurstableMemoryLimitLoader" to {}), call
strategy.run(history, make_object_data()), and assert that
result[ResourceType.Cpu].limit (and result[ResourceType.Memory].limit) is NaN
(use math.isnan) or that request is NaN as appropriate to match other tests;
reference the existing helpers make_strategy, make_history_data,
make_object_data, ResourceType, and strategy.run so the new tests follow the
same patterns as the surrounding tests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2c60d03b-e80d-493a-95bf-a2070e82ab16
📒 Files selected for processing (5)
robusta_krr/core/integrations/prometheus/metrics/__init__.pyrobusta_krr/core/integrations/prometheus/metrics/memory.pyrobusta_krr/strategies/__init__.pyrobusta_krr/strategies/burstable.pytests/test_burstable.py
| raise ValueError("percentile must be between 0 and 100") | ||
|
|
||
| class PercentileMemoryLoader(PrometheusMetric): | ||
| def get_query(self, object: K8sObjectData, duration: str, step: str) -> str: |
There was a problem hiding this comment.
Rename the object parameter to avoid lint failure (Ruff A002).
Line 83 introduces a builtin-shadowing argument name (object), which is flagged by Ruff and may break CI lint gates.
💡 Suggested fix
- def get_query(self, object: K8sObjectData, duration: str, step: str) -> str:
- pods_selector = "|".join(pod.name for pod in object.pods)
+ def get_query(self, object_data: K8sObjectData, duration: str, step: str) -> str:
+ pods_selector = "|".join(pod.name for pod in object_data.pods)
cluster_label = self.get_prometheus_cluster_label()
return f"""
quantile_over_time(
{percentile / 100},
max(
container_memory_working_set_bytes{{
- namespace="{object.namespace}",
+ namespace="{object_data.namespace}",
pod=~"{pods_selector}",
- container="{object.container}"
+ container="{object_data.container}"
{cluster_label}
}}🧰 Tools
🪛 Ruff (0.15.15)
[error] 83-83: Function argument object is shadowing a Python builtin
(A002)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@robusta_krr/core/integrations/prometheus/metrics/memory.py` at line 83, The
parameter name `object` in the method get_query(self, object: K8sObjectData,
duration: str, step: str) shadows a Python builtin and triggers Ruff A002;
rename the parameter (e.g., to `obj` or `k8s_obj`) in the get_query signature
and update all uses inside the method (and any callers in this module/class) to
the new name, keeping the type K8sObjectData unchanged so behavior is preserved.
Source: Linters/SAST tools
| cluster_label = self.get_prometheus_cluster_label() | ||
| return f""" | ||
| quantile_over_time( | ||
| {round(percentile / 100, 2)}, |
There was a problem hiding this comment.
Do not round the quantile value before emitting PromQL.
Line 88 rounds percentile / 100 to 2 decimals, which silently shifts requested percentiles (e.g., 95.5 → 0.95, 99.9 → 1.0). That changes recommendation behavior.
💡 Suggested fix
- {round(percentile / 100, 2)},
+ {percentile / 100},📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| {round(percentile / 100, 2)}, | |
| {percentile / 100}, |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@robusta_krr/core/integrations/prometheus/metrics/memory.py` at line 88, The
code is rounding the quantile value before emitting PromQL: replace the use of
round(percentile / 100, 2) with the precise percentile/100 value (or format it
as a full-precision float string) so the emitted PromQL uses the exact quantile;
locate the place constructing the query (the expression containing
{round(percentile / 100, 2)}) where the variable percentile is used and remove
the rounding so quantiles like 95.5 and 99.9 become 0.955 and 0.999 in the
emitted query.
| cpu_limit = ( | ||
| None | ||
| if self.settings.disable_cpu_limit | ||
| else np.max([v[0, 1] for v in history_data["BurstableCPULimitLoader"].values()]) | ||
| ) |
There was a problem hiding this comment.
Guard limit-series loaders before np.max to avoid runtime exceptions.
In both CPU (Lines 137-141) and memory (Lines 178-182), np.max is called on loader values without checking that the loader dict is non-empty. If the limit series is missing/sparse while request series exists, this throws ValueError and fails the strategy run.
💡 Suggested fix
cpu_request = np.max([v[0, 1] for v in data_req.values()])
+ data_limit = history_data["BurstableCPULimitLoader"]
+ if not self.settings.disable_cpu_limit and len(data_limit) == 0:
+ return ResourceRecommendation.undefined(info="No data")
cpu_limit = (
None
if self.settings.disable_cpu_limit
- else np.max([v[0, 1] for v in history_data["BurstableCPULimitLoader"].values()])
+ else np.max([v[0, 1] for v in data_limit.values()])
)
@@
memory_request = np.max([v[0, 1] for v in data_req.values()])
- memory_limit_base = np.max([v[0, 1] for v in history_data["BurstableMemoryLimitLoader"].values()])
+ data_limit = history_data["BurstableMemoryLimitLoader"]
+ if len(data_limit) == 0:
+ return ResourceRecommendation.undefined(info="No data")
+ memory_limit_base = np.max([v[0, 1] for v in data_limit.values()])Also applies to: 178-182
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@robusta_krr/strategies/burstable.py` around lines 137 - 141, The cpu and
memory limit computations call np.max on history_data["BurstableCPULimitLoader"]
/ "BurstableMemoryLimitLoader" values without checking for emptiness; modify the
cpu_limit and memory_limit expressions (the code that computes cpu_limit and
memory_limit in burstable.py) to first check that the respective loader dict
exists and has at least one non-empty value (e.g.,
len(history_data.get("BurstableCPULimitLoader", {})) > 0 and that any(v.size for
v in values) or similar) and only then call np.max, otherwise set the limit to
None (respecting disable_cpu_limit / disable_memory_limit). Use the existing
loader keys ("BurstableCPULimitLoader", "BurstableMemoryLimitLoader") and
history_data variable names so the fix is local and avoids ValueError from
np.max on empty sequences.
The purpose of this PR is to add a basic "burstable" recommendation strategy.
As far as I can tell, the current strategies enforce a QoS recommendation (limit == request) for memory, while allowing CPU to be burstable.
While this makes sense in certain scenarios, I'm also running KRR in my homelab to propose resource adjustments. In such cases, apps are most of the time idling, with occasional usage spikes. Setting requests and limits to the same value as to accommodate that spike would result In massive over provisioning in this situation.
Hence I'm working on getting this strategy going that can recommend individual values for requests and limits based on percentiles, just like CPU recommendations work in the other strategies.
For reference, I'm running my fork of KRR (the code in this PR) like so: https://github.com/mirceanton/home-ops/tree/main/apps%2Fjobs%2Fkrr%2Fapp
To generate this "dashboard": mirceanton/home-ops#559