Issue 7462 - Fix SERIAL LOCK and entry cache lock ordering on BDB dea… by vashirov · Pull Request #7463 · 389ds/389-ds-base

vashirov · 2026-04-29T13:45:22Z

…dlock retry

Bug Description:
In IPA context, server can freeze under heavy write load (add/mod/del users and hosts, deferred memberof enabled).

I found 2 issues:

In deadlock retry path, a thread grabs SERIAL LOCK, during deadlock sleeps still holding the SERIAL LOCK. Every other writer thread waits for that sleep to finish.
After fixing the first issue, I encountered lock ordering issues with SERIAL LOCK and entry locks. On retry Thread A holds entry lock from the previous attempt, tries to grab SERIAL LOCK. Thread B holds SERIAL LOCK, wants the same entry lock. Neither can proceed, resulting in a deadlock.

Fix Description:

Release SERIAL LOCK before backoff sleep on BDB deadlock retry and cache lock retry paths.
Fix AB-BA deadlock between SERIAL LOCK and entry cache locks by releasing/re-acquiring entry locks around retry.

Fixes: #7462

Summary by Sourcery

Adjust BDB transaction retry handling to avoid deadlocks involving the global SERIAL LOCK and entry cache locks during high-write workloads.

Bug Fixes:

Release the SERIAL LOCK before backoff sleeps on BDB deadlock and cache lock retry paths so other writers are not blocked behind a sleeping thread.
Resolve AB-BA deadlocks between the SERIAL LOCK and entry cache locks by unlocking entries before aborting on retry and re-locking them after re-establishing the transaction.
Prevent double-unlock and inconsistent cache state on error paths by tracking whether entry and parent cache locks are held before unlocking or returning entries to the cache.

Enhancements:

Standardize transaction retry behavior in add, delete, modify, and modrdn operations to always acquire the SERIAL LOCK consistently before (re)locking entry cache objects.

sourcery-ai

Hey - I've found 1 issue, and left some high level feedback:

The retry paths now contain a lot of duplicated logic for unlocking entry-cache locks, aborting the txn, sleeping, then re-locking (with similar error handling); consider factoring this pattern into small helper functions to reduce complexity and lower the risk of divergence between add/modify/delete/modrdn implementations.
You’ve introduced several *_locked flags that are updated in many branches; it might be worth double-checking and consolidating the ownership/lifecycle handling (e.g., via small helpers that pair cache_lock_entry with flag updates) to make it harder to accidentally get the flag and the actual lock state out of sync in future changes.
In ldbm_back_modify, the new error log on failed re-lock ("Failed to re-lock entry after deadlock retry") omits the conn_id/op_id context that the other new log messages include; consider adding those identifiers for consistency and easier debugging.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The retry paths now contain a lot of duplicated logic for unlocking entry-cache locks, aborting the txn, sleeping, then re-locking (with similar error handling); consider factoring this pattern into small helper functions to reduce complexity and lower the risk of divergence between add/modify/delete/modrdn implementations.
- You’ve introduced several `*_locked` flags that are updated in many branches; it might be worth double-checking and consolidating the ownership/lifecycle handling (e.g., via small helpers that pair `cache_lock_entry` with flag updates) to make it harder to accidentally get the flag and the actual lock state out of sync in future changes.
- In `ldbm_back_modify`, the new error log on failed re-lock (`"Failed to re-lock entry after deadlock retry"`) omits the `conn_id`/`op_id` context that the other new log messages include; consider adding those identifiers for consistency and easier debugging.

## Individual Comments

### Comment 1
<location path="ldap/servers/slapd/back-ldbm/ldbm_modify.c" line_range="708-709" />
<code_context>
+        if (retry_count > 0 && e) {
+            int lock_rc = cache_lock_entry(&inst->inst_cache, e);
+            if (lock_rc != 0) {
+                slapi_log_err(SLAPI_LOG_ERR, "ldbm_back_modify",
+                              "Failed to re-lock entry after deadlock "
+                              "retry (rc=%d)\n", lock_rc);
+                CACHE_RETURN(&inst->inst_cache, &e);
</code_context>
<issue_to_address>
**suggestion:** Log message for re-lock failure in modify lacks connection/operation context, unlike the other new log messages.

To keep logging consistent and aid correlation in production, please include `conn_id` and `op_id` in this re-lock failure message as well, similar to the re-lock logs in `ldbm_back_add`, `ldbm_back_delete`, and `ldbm_back_modrdn`.

Suggested implementation:

```c
            int lock_rc = cache_lock_entry(&inst->inst_cache, e);
            if (lock_rc != 0) {
                slapi_log_err(SLAPI_LOG_ERR, "ldbm_back_modify",
                              "conn=%" PRIu64 " op=%d Failed to re-lock entry after deadlock "
                              "retry (rc=%d)\n",
                              conn_id, op_id, lock_rc);

```

This change assumes that `conn_id` and `op_id` are already available in this scope as in the other logging sites you referenced, and that `<inttypes.h>`/`PRIu64` are already included for the file (as is common elsewhere in this codebase). If the existing re-lock logs in `ldbm_back_add`, `ldbm_back_delete`, or `ldbm_back_modrdn` use a slightly different format string (e.g. including `ldbm_back_modify -` or different spacing), you should align this message to match that exact pattern for consistency.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

…dlock retry Bug Description: In IPA context, server can freeze under heavy write load (add/mod/del users and hosts, deferred memberof enabled). I found 2 issues: 1. In deadlock retry path, a thread grabs SERIAL LOCK, during deadlock sleeps still holding the SERIAL LOCK. Every other writer thread waits for that sleep to finish. 2. After fixing the first issue, I encountered lock ordering issues with SERIAL LOCK and entry locks. On retry Thread A holds entry lock from the previous attempt, tries to grab SERIAL LOCK. Thread B holds SERIAL LOCK, wants the same entry lock. Neither can proceed, resulting in a deadlock. Fix Description: 1. Release SERIAL LOCK before backoff sleep on BDB deadlock retry and cache lock retry paths. 2. Fix AB-BA deadlock between SERIAL LOCK and entry cache locks by releasing/re-acquiring entry locks around retry. Fixes: 389ds#7462

packit-as-a-service · 2026-04-29T15:05:42Z

Congratulations! One of the builds has completed. 🍾

You can install the built RPMs by following these steps:

sudo dnf install -y 'dnf*-command(copr)'
dnf copr enable packit/389ds-389-ds-base-7463
And now you can install the packages.

Please note that the RPMs should be used only in a testing environment.

progier389

Looks good except a minor point (use bool for the flags instead of int32_t)

progier389 · 2026-04-30T13:02:50Z

    Connection *pb_conn;
    int32_t parent_op = 0;
    int32_t betxn_callback_fails = 0; /* if a BETXN fails we need to revert entry cache */
+    int32_t e_locked = 0;             /* non-zero when entry 'e' cache lock is held */


Should better use bool and true/false

droideck

Overall, looks great!

Regarding testing, do you think it's possible to reproduce it via stress test?
Something that spawns N writer threads doing add/mod/del under deferred-memberof probably would exercise the retry path. But that's a stretch and it's not a blocker for the PR merge.

droideck · 2026-05-06T05:26:57Z

            ldap_result_code = -1;
            goto error_return; /* error result sent by find_entry2modify() */
        }
+        e_locked = 1;


Do I understand correctly that for MANAGE_ENTRY_BEFORE_DBLOCK case on iter 1 we acquire E first and then SERIAL LOCK? But then on iter 2 -- it's another way around (SERIAL LOCK first and then E)?

IIUC, even though very rarely (and especially because MANAGE_ENTRY_BEFORE_DBLOCK is not a default mode but also experimental mode) but we still might have the AB-BA deadlock here.

sourcery-ai Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread ldap/servers/slapd/back-ldbm/ldbm_modify.c

vashirov force-pushed the i7462 branch from 4889c99 to e2a549b Compare April 29, 2026 13:50

vashirov force-pushed the i7462 branch from e2a549b to 8025e13 Compare April 29, 2026 14:53

progier389 approved these changes Apr 30, 2026

View reviewed changes

droideck reviewed May 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue 7462 - Fix SERIAL LOCK and entry cache lock ordering on BDB dea…#7463

Issue 7462 - Fix SERIAL LOCK and entry cache lock ordering on BDB dea…#7463
vashirov wants to merge 1 commit into389ds:mainfrom
vashirov:i7462

vashirov commented Apr 29, 2026 •

edited by sourcery-ai Bot

Loading

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

Uh oh!

packit-as-a-service Bot commented Apr 29, 2026

Uh oh!

progier389 left a comment

Uh oh!

progier389 Apr 30, 2026

Uh oh!

droideck left a comment

Uh oh!

droideck May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

vashirov commented Apr 29, 2026 • edited by sourcery-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by Sourcery

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

packit-as-a-service Bot commented Apr 29, 2026

Uh oh!

progier389 left a comment

Choose a reason for hiding this comment

Uh oh!

progier389 Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

droideck left a comment

Choose a reason for hiding this comment

Uh oh!

droideck May 6, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

vashirov commented Apr 29, 2026 •

edited by sourcery-ai Bot

Loading