[server] Add disk-usage write protection to TabletServer #3340
swuferhong wants to merge 1 commit into
zuston
left a comment
If the disk usage ratio threshold is exceeded (or the disk is corrupted), do we need to mark this tablet server as offline or unhealthy? I think writer-side fencing is not enough; in many cases, excessive disk usage will not recover automatically.
Hi, @zuston. Writer-side fencing is the minimum-sufficient response for a capacity event; promoting it to node-level offline turns a localized capacity problem into a cluster-wide availability incident and triggers cascading failover. Disk corruption is a separate fault domain (IOException-driven Log Directory Failure) and should be addressed in a dedicated PR. Happy to add a follow-up issue tracking the Log Directory Failure work if that helps.
Purpose
Linked issue: close #3338
Introduce a periodic disk-usage monitoring mechanism that proactively rejects client writes when the TabletServer's data disk usage exceeds a configurable high-water-mark ratio, preventing ENOSPC errors and potential data corruption.
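For readers who want a concrete picture of the mechanism, here is a minimal Java sketch of a periodic disk-usage check that gates writes. It is an illustration under stated assumptions, not the code in this PR: the class `DiskUsageGuard`, its methods, the 0.90 threshold in the usage note, and the choice of `IllegalStateException` are all hypothetical, and the sketch assumes writes are re-admitted as soon as usage drops back to or below the threshold.

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.DoubleSupplier;

/** Hypothetical sketch; none of these names are taken from the PR. */
class DiskUsageGuard implements AutoCloseable {
    private final DoubleSupplier usageRatio;   // supplies used/total in [0, 1]
    private final double highWatermark;        // assumed configurable ratio
    private volatile boolean writesBlocked;    // read on the write path
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    DiskUsageGuard(DoubleSupplier usageRatio, double highWatermark) {
        this.usageRatio = usageRatio;
        this.highWatermark = highWatermark;
    }

    /** Usage ratio of the file system that holds the given directory. */
    static DoubleSupplier forDirectory(File dir) {
        return () -> {
            long total = dir.getTotalSpace();
            // getTotalSpace() returns 0 if the path does not name a partition.
            return total == 0 ? 0.0 : 1.0 - (double) dir.getUsableSpace() / total;
        };
    }

    /** Starts the periodic check; the first check runs immediately. */
    void start(long periodMillis) {
        scheduler.scheduleAtFixedRate(this::checkOnce, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Samples disk usage once and updates the write gate. */
    void checkOnce() {
        writesBlocked = usageRatio.getAsDouble() > highWatermark;
    }

    /** Called before accepting a client write; throws while the gate is closed. */
    void ensureWritable() {
        if (writesBlocked) {
            throw new IllegalStateException("disk usage above high watermark; write rejected");
        }
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

A server could build the guard with `new DiskUsageGuard(DiskUsageGuard.forDirectory(dataDir), 0.90)`, call `start(...)` once at startup, and call `ensureWritable()` at the top of each write handler. Reads and the rest of the node stay available, which matches the writer-side fencing scope discussed in the conversation above.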
Key design decisions:
New configuration:
New metrics:
Brief change log
Tests
API and Format
Documentation