Skip to content

Issue 7462 - Fix SERIAL LOCK and entry cache lock ordering on BDB dea…#7463

Open
vashirov wants to merge 1 commit into389ds:mainfrom
vashirov:i7462
Open

Issue 7462 - Fix SERIAL LOCK and entry cache lock ordering on BDB dea…#7463
vashirov wants to merge 1 commit into389ds:mainfrom
vashirov:i7462

Conversation

@vashirov
Copy link
Copy Markdown
Member

@vashirov vashirov commented Apr 29, 2026

…dlock retry

Bug Description:
In IPA context, server can freeze under heavy write load (add/mod/del users and hosts, deferred memberof enabled).

I found 2 issues:

  1. In deadlock retry path, a thread grabs SERIAL LOCK, during deadlock sleeps still holding the SERIAL LOCK. Every other writer thread waits for that sleep to finish.
  2. After fixing the first issue, I encountered lock ordering issues with SERIAL LOCK and entry locks. On retry Thread A holds entry lock from the previous attempt, tries to grab SERIAL LOCK. Thread B holds SERIAL LOCK, wants the same entry lock. Neither can proceed, resulting in a deadlock.

Fix Description:

  1. Release SERIAL LOCK before backoff sleep on BDB deadlock retry and cache lock retry paths.
  2. Fix AB-BA deadlock between SERIAL LOCK and entry cache locks by releasing/re-acquiring entry locks around retry.

Fixes: #7462

Summary by Sourcery

Adjust BDB transaction retry handling to avoid deadlocks involving the global SERIAL LOCK and entry cache locks during high-write workloads.

Bug Fixes:

  • Release the SERIAL LOCK before backoff sleeps on BDB deadlock and cache lock retry paths so other writers are not blocked behind a sleeping thread.
  • Resolve AB-BA deadlocks between the SERIAL LOCK and entry cache locks by unlocking entries before aborting on retry and re-locking them after re-establishing the transaction.
  • Prevent double-unlock and inconsistent cache state on error paths by tracking whether entry and parent cache locks are held before unlocking or returning entries to the cache.

Enhancements:

  • Standardize transaction retry behavior in add, delete, modify, and modrdn operations to always acquire the SERIAL LOCK consistently before (re)locking entry cache objects.

Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 1 issue, and left some high level feedback:

  • The retry paths now contain a lot of duplicated logic for unlocking entry-cache locks, aborting the txn, sleeping, then re-locking (with similar error handling); consider factoring this pattern into small helper functions to reduce complexity and lower the risk of divergence between add/modify/delete/modrdn implementations.
  • You’ve introduced several *_locked flags that are updated in many branches; it might be worth double-checking and consolidating the ownership/lifecycle handling (e.g., via small helpers that pair cache_lock_entry with flag updates) to make it harder to accidentally get the flag and the actual lock state out of sync in future changes.
  • In ldbm_back_modify, the new error log on failed re-lock ("Failed to re-lock entry after deadlock retry") omits the conn_id/op_id context that the other new log messages include; consider adding those identifiers for consistency and easier debugging.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The retry paths now contain a lot of duplicated logic for unlocking entry-cache locks, aborting the txn, sleeping, then re-locking (with similar error handling); consider factoring this pattern into small helper functions to reduce complexity and lower the risk of divergence between add/modify/delete/modrdn implementations.
- You’ve introduced several `*_locked` flags that are updated in many branches; it might be worth double-checking and consolidating the ownership/lifecycle handling (e.g., via small helpers that pair `cache_lock_entry` with flag updates) to make it harder to accidentally get the flag and the actual lock state out of sync in future changes.
- In `ldbm_back_modify`, the new error log on failed re-lock (`"Failed to re-lock entry after deadlock retry"`) omits the `conn_id`/`op_id` context that the other new log messages include; consider adding those identifiers for consistency and easier debugging.

## Individual Comments

### Comment 1
<location path="ldap/servers/slapd/back-ldbm/ldbm_modify.c" line_range="708-709" />
<code_context>
+        if (retry_count > 0 && e) {
+            int lock_rc = cache_lock_entry(&inst->inst_cache, e);
+            if (lock_rc != 0) {
+                slapi_log_err(SLAPI_LOG_ERR, "ldbm_back_modify",
+                              "Failed to re-lock entry after deadlock "
+                              "retry (rc=%d)\n", lock_rc);
+                CACHE_RETURN(&inst->inst_cache, &e);
</code_context>
<issue_to_address>
**suggestion:** Log message for re-lock failure in modify lacks connection/operation context, unlike the other new log messages.

To keep logging consistent and aid correlation in production, please include `conn_id` and `op_id` in this re-lock failure message as well, similar to the re-lock logs in `ldbm_back_add`, `ldbm_back_delete`, and `ldbm_back_modrdn`.

Suggested implementation:

```c
            int lock_rc = cache_lock_entry(&inst->inst_cache, e);
            if (lock_rc != 0) {
                slapi_log_err(SLAPI_LOG_ERR, "ldbm_back_modify",
                              "conn=%" PRIu64 " op=%d Failed to re-lock entry after deadlock "
                              "retry (rc=%d)\n",
                              conn_id, op_id, lock_rc);

```

This change assumes that `conn_id` and `op_id` are already available in this scope as in the other logging sites you referenced, and that `<inttypes.h>`/`PRIu64` are already included for the file (as is common elsewhere in this codebase). If the existing re-lock logs in `ldbm_back_add`, `ldbm_back_delete`, or `ldbm_back_modrdn` use a slightly different format string (e.g. including `ldbm_back_modify -` or different spacing), you should align this message to match that exact pattern for consistency.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread ldap/servers/slapd/back-ldbm/ldbm_modify.c
…dlock retry

Bug Description:
In IPA context, server can freeze under heavy write load (add/mod/del
users and hosts, deferred memberof enabled).

I found 2 issues:
1. In deadlock retry path, a thread grabs SERIAL LOCK, during deadlock
sleeps still holding the SERIAL LOCK. Every other writer thread waits
for that sleep to finish.
2. After fixing the first issue, I encountered lock ordering issues with
SERIAL LOCK and entry locks. On retry Thread A holds entry lock from the
previous attempt, tries to grab SERIAL LOCK. Thread B holds SERIAL LOCK,
wants the same entry lock. Neither can proceed, resulting in a deadlock.

Fix Description:
1. Release SERIAL LOCK before backoff sleep on BDB deadlock retry and
cache lock retry paths.
2. Fix AB-BA deadlock between SERIAL LOCK and entry cache locks by
releasing/re-acquiring entry locks around retry.

Fixes: 389ds#7462
@packit-as-a-service
Copy link
Copy Markdown

Congratulations! One of the builds has completed. 🍾

You can install the built RPMs by following these steps:

  • sudo dnf install -y 'dnf*-command(copr)'
  • dnf copr enable packit/389ds-389-ds-base-7463
  • And now you can install the packages.

Please note that the RPMs should be used only in a testing environment.

Copy link
Copy Markdown
Contributor

@progier389 progier389 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good except a minor point (use bool for the flags instead of int32_t)

Connection *pb_conn;
int32_t parent_op = 0;
int32_t betxn_callback_fails = 0; /* if a BETXN fails we need to revert entry cache */
int32_t e_locked = 0; /* non-zero when entry 'e' cache lock is held */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should better use bool and true/false

Copy link
Copy Markdown
Member

@droideck droideck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks great!

Regarding testing, do you think it's possible to reproduce it via stress test?
Something that spawns N writer threads doing add/mod/del under deferred-memberof probably would exercise the retry path. But that's a stretch and it's not a blocker for the PR merge.

ldap_result_code = -1;
goto error_return; /* error result sent by find_entry2modify() */
}
e_locked = 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly that for MANAGE_ENTRY_BEFORE_DBLOCK case on iter 1 we acquire E first and then SERIAL LOCK? But then on iter 2 -- it's another way around (SERIAL LOCK first and then E)?

IIUC, even though very rarely (and especially because MANAGE_ENTRY_BEFORE_DBLOCK is not a default mode but also experimental mode) but we still might have the AB-BA deadlock here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Freeze under heavy write load with BDB backend

3 participants