Issue 7462 - Fix SERIAL LOCK and entry cache lock ordering on BDB dea…#7463
Issue 7462 - Fix SERIAL LOCK and entry cache lock ordering on BDB dea…#7463vashirov wants to merge 1 commit into389ds:mainfrom
Conversation
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- The retry paths now contain a lot of duplicated logic for unlocking entry-cache locks, aborting the txn, sleeping, then re-locking (with similar error handling); consider factoring this pattern into small helper functions to reduce complexity and lower the risk of divergence between add/modify/delete/modrdn implementations.
- You’ve introduced several
*_lockedflags that are updated in many branches; it might be worth double-checking and consolidating the ownership/lifecycle handling (e.g., via small helpers that paircache_lock_entrywith flag updates) to make it harder to accidentally get the flag and the actual lock state out of sync in future changes. - In
ldbm_back_modify, the new error log on failed re-lock ("Failed to re-lock entry after deadlock retry") omits theconn_id/op_idcontext that the other new log messages include; consider adding those identifiers for consistency and easier debugging.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The retry paths now contain a lot of duplicated logic for unlocking entry-cache locks, aborting the txn, sleeping, then re-locking (with similar error handling); consider factoring this pattern into small helper functions to reduce complexity and lower the risk of divergence between add/modify/delete/modrdn implementations.
- You’ve introduced several `*_locked` flags that are updated in many branches; it might be worth double-checking and consolidating the ownership/lifecycle handling (e.g., via small helpers that pair `cache_lock_entry` with flag updates) to make it harder to accidentally get the flag and the actual lock state out of sync in future changes.
- In `ldbm_back_modify`, the new error log on failed re-lock (`"Failed to re-lock entry after deadlock retry"`) omits the `conn_id`/`op_id` context that the other new log messages include; consider adding those identifiers for consistency and easier debugging.
## Individual Comments
### Comment 1
<location path="ldap/servers/slapd/back-ldbm/ldbm_modify.c" line_range="708-709" />
<code_context>
+ if (retry_count > 0 && e) {
+ int lock_rc = cache_lock_entry(&inst->inst_cache, e);
+ if (lock_rc != 0) {
+ slapi_log_err(SLAPI_LOG_ERR, "ldbm_back_modify",
+ "Failed to re-lock entry after deadlock "
+ "retry (rc=%d)\n", lock_rc);
+ CACHE_RETURN(&inst->inst_cache, &e);
</code_context>
<issue_to_address>
**suggestion:** Log message for re-lock failure in modify lacks connection/operation context, unlike the other new log messages.
To keep logging consistent and aid correlation in production, please include `conn_id` and `op_id` in this re-lock failure message as well, similar to the re-lock logs in `ldbm_back_add`, `ldbm_back_delete`, and `ldbm_back_modrdn`.
Suggested implementation:
```c
int lock_rc = cache_lock_entry(&inst->inst_cache, e);
if (lock_rc != 0) {
slapi_log_err(SLAPI_LOG_ERR, "ldbm_back_modify",
"conn=%" PRIu64 " op=%d Failed to re-lock entry after deadlock "
"retry (rc=%d)\n",
conn_id, op_id, lock_rc);
```
This change assumes that `conn_id` and `op_id` are already available in this scope as in the other logging sites you referenced, and that `<inttypes.h>`/`PRIu64` are already included for the file (as is common elsewhere in this codebase). If the existing re-lock logs in `ldbm_back_add`, `ldbm_back_delete`, or `ldbm_back_modrdn` use a slightly different format string (e.g. including `ldbm_back_modify -` or different spacing), you should align this message to match that exact pattern for consistency.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
…dlock retry Bug Description: In IPA context, server can freeze under heavy write load (add/mod/del users and hosts, deferred memberof enabled). I found 2 issues: 1. In deadlock retry path, a thread grabs SERIAL LOCK, during deadlock sleeps still holding the SERIAL LOCK. Every other writer thread waits for that sleep to finish. 2. After fixing the first issue, I encountered lock ordering issues with SERIAL LOCK and entry locks. On retry Thread A holds entry lock from the previous attempt, tries to grab SERIAL LOCK. Thread B holds SERIAL LOCK, wants the same entry lock. Neither can proceed, resulting in a deadlock. Fix Description: 1. Release SERIAL LOCK before backoff sleep on BDB deadlock retry and cache lock retry paths. 2. Fix AB-BA deadlock between SERIAL LOCK and entry cache locks by releasing/re-acquiring entry locks around retry. Fixes: 389ds#7462
|
Congratulations! One of the builds has completed. 🍾 You can install the built RPMs by following these steps:
Please note that the RPMs should be used only in a testing environment. |
progier389
left a comment
There was a problem hiding this comment.
Looks good except a minor point (use bool for the flags instead of int32_t)
| Connection *pb_conn; | ||
| int32_t parent_op = 0; | ||
| int32_t betxn_callback_fails = 0; /* if a BETXN fails we need to revert entry cache */ | ||
| int32_t e_locked = 0; /* non-zero when entry 'e' cache lock is held */ |
There was a problem hiding this comment.
Should better use bool and true/false
droideck
left a comment
There was a problem hiding this comment.
Overall, looks great!
Regarding testing, do you think it's possible to reproduce it via stress test?
Something that spawns N writer threads doing add/mod/del under deferred-memberof probably would exercise the retry path. But that's a stretch and it's not a blocker for the PR merge.
| ldap_result_code = -1; | ||
| goto error_return; /* error result sent by find_entry2modify() */ | ||
| } | ||
| e_locked = 1; |
There was a problem hiding this comment.
Do I understand correctly that for MANAGE_ENTRY_BEFORE_DBLOCK case on iter 1 we acquire E first and then SERIAL LOCK? But then on iter 2 -- it's another way around (SERIAL LOCK first and then E)?
IIUC, even though very rarely (and especially because MANAGE_ENTRY_BEFORE_DBLOCK is not a default mode but also experimental mode) but we still might have the AB-BA deadlock here.
…dlock retry
Bug Description:
In IPA context, server can freeze under heavy write load (add/mod/del users and hosts, deferred memberof enabled).
I found 2 issues:
Fix Description:
Fixes: #7462
Summary by Sourcery
Adjust BDB transaction retry handling to avoid deadlocks involving the global SERIAL LOCK and entry cache locks during high-write workloads.
Bug Fixes:
Enhancements: