Feat(DLQ): Add DLQ service with retry logic and scheduler integration by Br0wnHammer · Pull Request #3451 · bluewave-labs/Checkmate

Br0wnHammer · 2026-03-30T07:25:23Z

Describe your changes

Adds DLQService with enqueue, exponential backoff retry (30s base, 1h cap, 5 max attempts), and staleness checks that skip "down" notifications if the monitor has since recovered
Replaces fire-and-forget .catch(log) patterns in the heartbeat job with .catch(enqueue to DLQ) for both notification and incident failures
Registers two new scheduler jobs: DLQ retry (every 30s) and DLQ cleanup (daily, 7-day TTL)
Wires DLQ repository and service into the dependency injection chain
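
The retry parameters listed above (30s base, 1h cap, 5 max attempts) could translate into a delay calculation along these lines. This is a minimal sketch, not the PR's code: the name `nextRetryDelayMs`, the doubling factor, and counting attempts from zero are all assumptions, since the description gives only the base, the cap, and the attempt limit.

```typescript
// Hypothetical constants mirroring the values in the PR description.
const BASE_DELAY_MS = 30 * 1000; // 30s base
const MAX_DELAY_MS = 60 * 60 * 1000; // 1h cap
const MAX_ATTEMPTS = 5;

// Returns the delay before the next retry, or null once the item has
// used up its attempts. `attemptsMade` counts attempts already performed.
// The doubling factor is an assumption; the PR only says "exponential".
function nextRetryDelayMs(attemptsMade: number, maxAttempts: number = MAX_ATTEMPTS): number | null {
	if (attemptsMade >= maxAttempts) {
		return null;
	}
	const uncapped = BASE_DELAY_MS * 2 ** attemptsMade;
	return Math.min(uncapped, MAX_DELAY_MS);
}
```

Under these assumptions the five delays are 30s, 60s, 120s, 240s, and 480s, so the 1h cap would only come into play if the attempt limit were raised.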

Please ensure all items are checked off before requesting a review. "Checked off" means you need to add an "x" character between brackets so they turn into checkmarks.

(Do not skip this or your PR will be closed) I deployed the application locally.
(Do not skip this or your PR will be closed) I have performed a self-review and testing of my code.
I have included the issue # in the PR.
I have added i18n support to visible strings (instead of <div>Add</div>, use):

const { t } = useTranslation();
<div>{t('add')}</div>

I have not included any files that are not related to my pull request, including package-lock and package-json if dependencies have not changed
I didn't use any hardcoded values (otherwise it will not scale, and will make it difficult to maintain consistency across the application).
I made sure font sizes, color choices etc are all referenced from the theme. I don't have any hardcoded dimensions.
My PR is granular and targeted to one specific feature.
I ran npm run format in server and client directories, which automatically formats your code.
I took a screenshot or a video and attached to this PR if there is a UI change.

ajhollid

The general idea looks OK to me, but the devil is in the details. There are a couple of things I noticed right off the bat, as well as a critical runtime issue that needs to be fixed.

This is also a very high-frequency job; let's make the whole thing optional so that it is opt-in for performance-conscious users.

How about adding an env flag to enable/disable this service?
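
One way such a flag could be read is sketched below. The variable name `DLQ_ENABLED` and the accepted truthy spellings are hypothetical; neither the PR nor the review names a flag. The helper takes the environment as a parameter so it can be exercised without touching the real process environment.

```typescript
// Hypothetical flag name; not defined anywhere in the PR.
const DLQ_FLAG = "DLQ_ENABLED";

// Opt-in semantics: the service stays off unless the flag is set to an
// explicitly truthy value. Missing or unrecognized values mean "disabled".
function isDlqEnabled(env: Record<string, string | undefined>): boolean {
	const raw = env[DLQ_FLAG];
	if (raw === undefined) {
		return false;
	}
	return ["1", "true", "yes", "on"].includes(raw.trim().toLowerCase());
}
```

In a Node server this would presumably be called as `isDlqEnabled(process.env)` before registering the DLQ retry and cleanup jobs, so that a disabled DLQ adds no scheduler load at all.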

ajhollid · 2026-03-30T16:53:59Z

+				const monitor = payload.monitor as Parameters<INotificationsService["handleNotifications"]>[0];
+				const monitorStatusResponse = payload.monitorStatusResponse as Parameters<INotificationsService["handleNotifications"]>[1];
+				const decision = payload.decision as Parameters<INotificationsService["handleNotifications"]>[2];


Casting is forbidden in this application 😂 We lose all type safety when we cast; all types should be explicit.
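
A cast-free alternative is to validate the stored payload with a type guard before handing it to the service. The sketch below is only illustrative: the `NotificationRetryPayload` shape and its field names are hypothetical stand-ins, not the real `handleNotifications` parameter types.

```typescript
// Hypothetical payload shape, used only for illustration.
interface NotificationRetryPayload {
	monitor: { id: string; teamId: string };
	monitorStatusResponse: { code: number };
	decision: { shouldSendNotification: boolean };
}

// Narrowing helper: a type predicate, not a cast.
function isRecord(value: unknown): value is Record<string, unknown> {
	return typeof value === "object" && value !== null;
}

// Checks every field at runtime so the compiler can narrow `unknown`
// to NotificationRetryPayload without any `as` assertion.
function isNotificationRetryPayload(value: unknown): value is NotificationRetryPayload {
	if (!isRecord(value)) {
		return false;
	}
	const monitor = value["monitor"];
	const response = value["monitorStatusResponse"];
	const decision = value["decision"];
	return (
		isRecord(monitor) &&
		typeof monitor["id"] === "string" &&
		typeof monitor["teamId"] === "string" &&
		isRecord(response) &&
		typeof response["code"] === "number" &&
		isRecord(decision) &&
		typeof decision["shouldSendNotification"] === "boolean"
	);
}
```

A retry handler could then reject or dead-letter any item whose payload fails the guard, instead of passing unchecked data downstream.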

ajhollid · 2026-03-30T16:56:53Z

+		for (const item of items) {
+			try {
+				await this.executeRetry(item);
+				await this.dlqRepository.deleteById(item.id, item.teamId);


What happens if, for some reason, there's a mismatch in the monitor, i.e. it has the wrong teamId?

This will silently fail, and the item will never be removed from the queue. If the retry is successful, it seems to me the item should be removed regardless of whether it belongs to the correct team or not.
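
The difference between the two deletion strategies can be illustrated with an in-memory stand-in for the repository. Everything here (`InMemoryDlqStore`, `deleteByIdAndTeam`, the `DlqItem` fields) is hypothetical and exists only to show the failure mode; it is not the PR's repository.

```typescript
// Hypothetical minimal item shape.
interface DlqItem {
	id: string;
	teamId: string;
}

// In-memory stand-in for a DLQ repository.
class InMemoryDlqStore {
	private items = new Map<string, DlqItem>();

	add(item: DlqItem): void {
		this.items.set(item.id, item);
	}

	// Team-scoped delete: a teamId mismatch makes this a silent no-op.
	deleteByIdAndTeam(id: string, teamId: string): boolean {
		const item = this.items.get(id);
		if (item === undefined || item.teamId !== teamId) {
			return false;
		}
		return this.items.delete(id);
	}

	// Id-only delete: removes the item regardless of which team owns it.
	deleteById(id: string): boolean {
		return this.items.delete(id);
	}

	get size(): number {
		return this.items.size;
	}
}
```

Returning a boolean (or an affected-row count) also gives the retry loop something to log when a delete unexpectedly removes nothing.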

ajhollid · 2026-03-30T17:00:18Z

+			case "incident_create":
+			case "incident_resolve": {
+				const monitor = payload.monitor as Parameters<IIncidentService["handleIncident"]>[0];
+				const code = payload.code as number;
+				const decision = payload.decision as Parameters<IIncidentService["handleIncident"]>[2];
+				const monitorStatusResponse = payload.monitorStatusResponse as Parameters<IIncidentService["handleIncident"]>[3];
+
+				await this.incidentService.handleIncident(monitor, code, decision, monitorStatusResponse);
+				break;
+			}


Shouldn't these also have a staleness check? The logic for creating/resolving incidents is identical to notifications 🤔 We don't want duplicate incidents, in the same way we don't want duplicate notifications.
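
A staleness check for the two incident item types might look like the sketch below. The rule itself is an assumption modelled on the notification check described in the PR (skip "down" work if the monitor has recovered); the `MonitorStatus` values and the function name are hypothetical.

```typescript
// Hypothetical status values for the monitor's current state.
type MonitorStatus = "up" | "down";
type IncidentRetryType = "incident_create" | "incident_resolve";

// Assumed rule: creating an incident is stale once the monitor is back up,
// and resolving one is stale if the monitor has gone down again.
function isIncidentRetryStale(type: IncidentRetryType, currentStatus: MonitorStatus): boolean {
	if (type === "incident_create") {
		return currentStatus === "up";
	}
	return currentStatus === "down";
}
```

This only covers status drift between enqueue and retry. Guarding against a second incident when one is already open for the same monitor would need a separate lookup.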

ajhollid · 2026-03-30T17:02:14Z


 			this.scheduler.addJob({ id: "cleanup-orphaned", template: "cleanup-orphaned", active: true });
 			this.scheduler.addJob({ id: "cleanup-retention", template: "cleanup-retention-job", active: true, repeat: 24 * 60 * 60 * 1000 });
+			this.scheduler.addJob({ id: "dlq-retry", template: "dlq-retry-job", active: true, repeat: 30 * 1000 });


This runs every 30 seconds; what happens if a retry takes longer than 30 seconds? We'll have duplicate notifications then, will we not? 🤔 There should be some sort of concurrency guard here to abort the run if the previous run has not finished.
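
A simple in-process guard of the kind described could be sketched as follows. `SingleFlightGuard` is a hypothetical name, and this only protects a single server process; if several instances share one queue, a database-level lock or claim column would be needed instead.

```typescript
// Skips a run when the previous one has not finished yet.
class SingleFlightGuard {
	private running = false;

	// Resolves to true if the task ran, false if it was skipped because
	// an earlier run is still in flight.
	async run(task: () => Promise<void>): Promise<boolean> {
		if (this.running) {
			return false;
		}
		this.running = true;
		try {
			await task();
			return true;
		} finally {
			// Always release the flag, even if the task throws.
			this.running = false;
		}
	}
}
```

The scheduler's DLQ retry handler could wrap its batch in `guard.run(...)` so that an overlapping 30-second tick returns immediately instead of reprocessing the same items.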

ajhollid · 2026-03-30T17:04:15Z

 		incidentsRepository = new TimescaleIncidentsRepository(pool);
 		teamsRepository = new TimescaleTeamsRepository(pool);
 		maintenanceWindowsRepository = new TimescaleMaintenanceWindowsRepository(pool);
+		dlqRepository = new MongoDLQRepository(); // TODO: Replace with TimescaleDLQRepository(pool) once implemented


Hard no, this will cause a runtime crash. At least stub out the TimescaleDB implementation so it fails gracefully.
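
A stub that fails gracefully might look like the sketch below. The `DlqRepository` interface and its method names are hypothetical, since the PR's repository interface is not shown here; the idea is only that every method becomes a logged no-op instead of throwing at runtime.

```typescript
// Hypothetical item and repository shapes, for illustration only.
interface DlqItem {
	id: string;
	teamId: string;
}

interface DlqRepository {
	enqueue(item: DlqItem): Promise<void>;
	findDue(now: Date): Promise<DlqItem[]>;
	deleteById(id: string): Promise<boolean>;
}

// Placeholder for a not-yet-implemented TimescaleDB repository: it warns
// once, then behaves like an always-empty queue.
class StubTimescaleDlqRepository implements DlqRepository {
	private warned = false;
	private readonly warn: (message: string) => void;

	constructor(warn: (message: string) => void) {
		this.warn = warn;
	}

	private warnOnce(): void {
		if (!this.warned) {
			this.warned = true;
			this.warn("DLQ is not implemented for TimescaleDB yet; items will not be persisted");
		}
	}

	async enqueue(_item: DlqItem): Promise<void> {
		this.warnOnce();
	}

	async findDue(_now: Date): Promise<DlqItem[]> {
		this.warnOnce();
		return [];
	}

	async deleteById(_id: string): Promise<boolean> {
		this.warnOnce();
		return false;
	}
}
```

With a stub like this wired in for TimescaleDB deployments, the DLQ is effectively disabled there rather than crashing the server.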

Br0wnHammer added 4 commits March 30, 2026 12:41

feat(dlq): add DLQService with enqueue, exponential backoff retry, and staleness checks

3d274fe

feat(dlq): integrate DLQ enqueue into heartbeat job failure handlers and add retry/cleanup jobs

cd1f0de

feat(dlq): register DLQ retry and cleanup jobs in scheduler

036ad20

feat(dlq): wire DLQ repository and service into dependency injection

44d100d

Br0wnHammer added this to the 3.5 milestone Mar 30, 2026

Br0wnHammer self-assigned this Mar 30, 2026

Br0wnHammer requested review from Owaiseimdad, ajhollid and karenvicent as code owners March 30, 2026 07:25

Br0wnHammer added the enhancement and backend labels Mar 30, 2026

Format Checks

c93f990

ajhollid requested changes Mar 30, 2026

akashmannil mentioned this pull request Mar 31, 2026

Prevent sending "Recovered" notification when no prior "Down" notification was sent for short outages #3438

Closed

Br0wnHammer wants to merge 5 commits into develop from feat/dlq-service-integration
