Fix cross-test mock leak behind the flaky parallel test failures #388

Merged
cigamit merged 1 commit into ctrliq:main from blaipr:fix/transactiontestcase-flake
Jun 11, 2026

Conversation

@blaipr blaipr commented Jun 11, 2026

Contributor

Merge PR #388 first — it makes full parallel Python test runs deterministic for reviewing everything else.

SUMMARY

Fixes #379 — and the root cause turned out to be none of the suspects in the issue. Six unit tests assigned a Mock directly onto the model manager, with no patch machinery:

task.model.objects.get = mock.Mock(return_value=job)   # task.model is the Job / AdHocCommand class

That permanently replaces Manager.get for the remainder of the pytest-xdist worker process. Whichever functional test files later shared that worker became the 'flaky' victims, and the leaked mock explains every observed failure family:

  • api/test_survey_spec.py: Job.objects.get(...) returned a MagicMock → json.loads(job.extra_vars) failed with 'must be str … not MagicMock' (up to 13 parametrized failures per run).
  • utils/test_update_model.py: update_model() returned the unit test's leftover in-memory Job built with Project(pk=1); the test saved it, inserting a row whose project_id=1 references nothing — which Django's whole-DB check_constraints() at teardown reported as the mysterious 'main_job … invalid foreign key' IntegrityError.
  • commands/test_secret_key_regeneration.py: same frankenjob → garbage decrypted start_args.
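The first failure family can be reproduced without Django at all. The following is a minimal sketch using only the standard library; `leaked_get` is a hypothetical stand-in for the leaked `Manager.get`, and `load_extra_vars` is an illustrative helper, not code from the repository.

```python
import json
from unittest import mock

# Hypothetical stand-in for the leaked Manager.get: a Mock that hands back
# a MagicMock instead of a real Job row.
leaked_get = mock.Mock(return_value=mock.MagicMock())
job = leaked_get(pk=1)

# Attribute access on a MagicMock yields another MagicMock, not a str.
extra_vars = job.extra_vars


def load_extra_vars(value):
    """Return the parsed JSON, or the TypeError message if parsing fails."""
    try:
        return json.loads(value)
    except TypeError as exc:
        return str(exc)


error_message = load_extra_vars(extra_vars)
print(error_message)
```

The printed message names `MagicMock` as the offending type, which matches the shape of the error seen in the survey spec tests.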

The xdist scheduling lottery decided which files shared a worker with the leakers each run — hence the random victim set, why every victim passed in isolation, and why pairwise reproductions failed.
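The leak mechanism itself can be sketched with plain classes. This is an illustration under stated assumptions, not the project's code: `FakeManager` and `FakeJob` are hypothetical stand-ins for a Django manager and model, and the two functions play the roles of a leaking unit test and a later functional test that happen to run in the same worker process.

```python
from unittest import mock


class FakeManager:
    """Hypothetical stand-in for a Django model manager."""

    def get(self, **kwargs):
        return "row from the database"


class FakeJob:
    """Hypothetical stand-in for a model class; `objects` lives for the whole process."""

    objects = FakeManager()


def leaky_unit_test():
    # Direct assignment stores the Mock in the manager's instance __dict__,
    # shadowing the class-level get(). Nothing removes it afterwards.
    FakeJob.objects.get = mock.Mock(return_value="in-memory job")


def later_functional_test():
    # Expects a real lookup but sees whatever an earlier test left behind.
    return FakeJob.objects.get(pk=1)


before = later_functional_test()
leaky_unit_test()
after = later_functional_test()
print(before, "->", after)
```

Run in isolation, `later_functional_test` sees the real `get`; it only misbehaves when it executes after the leaking test in the same process, which is why the victims passed on their own.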

How it was found

A temporary pytest plugin scanned every model manager for Mock instance attributes after each test and logged transitions per worker. Both leak sources (unit/test_tasks.py, unit/tasks/test_jobs.py) were caught within two full parallel runs, attributed to the exact tests.

The fix

All six sites become mocker.patch.object(task.model.objects, 'get', return_value=...) — same behavior inside the test, automatically unwound by pytest-mock at teardown.
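As a sketch of why the scoped form does not leak, the example below uses `unittest.mock.patch.object` as a context manager; pytest-mock's `mocker.patch.object` wraps the same mechanism and undoes the patch at test teardown instead of at the end of a `with` block. `FakeManager` and `FakeJob` are hypothetical stand-ins, not the project's classes.

```python
from unittest import mock


class FakeManager:
    """Hypothetical stand-in for a Django model manager."""

    def get(self, **kwargs):
        return "row from the database"


class FakeJob:
    """Hypothetical stand-in for a model class."""

    objects = FakeManager()


def scoped_unit_test():
    # patch.object records the original state on entry and restores it on
    # exit, so the mock exists only inside this block.
    with mock.patch.object(FakeJob.objects, "get", return_value="in-memory job"):
        return FakeJob.objects.get(pk=1)


inside = scoped_unit_test()
after = FakeJob.objects.get(pk=1)
still_patched = "get" in vars(FakeJob.objects)
print(inside, "->", after, "| still patched:", still_patched)
```

Inside the test the behavior is identical to the direct assignment; the difference is that the manager's instance `__dict__` is clean again once the patch is unwound.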

ISSUE TYPE

  • Bug, Docs Fix or other nominal change

COMPONENT NAME

  • API

ASCENDER VERSION

awx: 25.4.1.dev5+gcda0899.d20260610

ADDITIONAL INFORMATION

Validation: five consecutive full parallel runs (py.test --create-db -n auto --dist=loadfile awx/main/tests/unit awx/main/tests/functional awx/conf/tests awx/sso/tests) all returned 3476 passed, 0 failed, with the leak detector confirming no manager is ever left mocked. Before the fix, roughly two of every three such runs failed.

Note: the TransactionTestCase classes called out in the original issue text are innocent — their teardown flush verifiably leaves all tables empty. Test-only change; production code untouched. Independent of all open PRs.

Six unit tests assigned a Mock directly onto the Job/AdHocCommand
manager (task.model.objects.get = mock.Mock(return_value=job)) with no
patch machinery, permanently replacing Manager.get for the remainder of
the worker process. Under pytest-xdist, whichever functional test files
later shared that worker became the 'flaky' victims:

- api/test_survey_spec.py: Job.objects.get returned a MagicMock, so
  json.loads(job.extra_vars) blew up with 'must be str ... not
  MagicMock' (up to 13 parametrized failures per run)
- utils/test_update_model.py: update_model() returned the unit test's
  leftover in-memory Job (built with Project(pk=1)); saving it inserted
  a job row whose project_id=1 references nothing, which Django's
  whole-DB check_constraints() at teardown reported as the mysterious
  'main_job ... invalid foreign key' IntegrityError
- commands/test_secret_key_regeneration.py: same frankenjob, decrypted
  start_args garbage

The leaks were localized with a temporary pytest plugin that scanned
every model manager for Mock instance attributes after each test and
logged the transition - both leak sources reproduced within two full
parallel runs.

Fix: scope all six with mocker.patch.object(task.model.objects, 'get',
return_value=...), which pytest-mock unwinds at teardown.

Validation: five consecutive full parallel runs
(py.test --create-db -n auto --dist=loadfile awx/main/tests/unit
awx/main/tests/functional awx/conf/tests awx/sso/tests) all came back
3476 passed, 0 failed, with the leak detector confirming no manager is
ever left mocked. Before the fix, roughly two of every three runs
failed.

Fixes ctrliq#379

@cigamit cigamit left a comment

Contributor

Thanks for tracking this down. It was bugging me for a while, but it was so sporadic (it only failed for me 1 in 10 times) that it was hard to pin down. I ran through the tests multiple times and it has not occurred again.

@cigamit cigamit merged commit 73fbea8 into ctrliq:main Jun 11, 2026
@cigamit cigamit self-assigned this Jun 11, 2026
@cigamit cigamit added the bug Something isn't working label Jun 11, 2026
Labels

bug Something isn't working

Development

Successfully merging this pull request may close these issues.

Flaky functional test failures in parallel runs caused by TransactionTestCase state leaking into the per-worker test DB

2 participants