Skip to content

Feat(DLQ): Add DLQ service with retry logic and scheduler integration#3451

Open
Br0wnHammer wants to merge 5 commits intodevelopfrom
feat/dlq-service-integration
Open

Feat(DLQ): Add DLQ service with retry logic and scheduler integration#3451
Br0wnHammer wants to merge 5 commits intodevelopfrom
feat/dlq-service-integration

Conversation

@Br0wnHammer
Copy link
Copy Markdown
Member

Describe your changes

  • Adds DLQService with enqueue, exponential backoff retry (30s base, 1h cap, 5 max attempts), and staleness checks that skip "down" notifications if the monitor has since recovered
  • Replaces fire-and-forget .catch(log) patterns in the heartbeat job with .catch(enqueue to DLQ) for both notification and incident failures
  • Registers two new scheduler jobs: DLQ retry (every 30s) and DLQ cleanup (daily, 7-day TTL)
  • Wires DLQ repository and service into the dependency injection chain

Please ensure all items are checked off before requesting a review. "Checked off" means you need to add an "x" character between brackets so they turn into checkmarks.

  • (Do not skip this or your PR will be closed) I deployed the application locally.
  • (Do not skip this or your PR will be closed) I have performed a self-review and testing of my code.
  • I have included the issue # in the PR.
  • I have added i18n support to visible strings (instead of <div>Add</div>, use):
const { t } = useTranslation();
<div>{t('add')}</div>
  • I have not included any files that are not related to my pull request, including package-lock and package-json if dependencies have not changed
  • I didn't use any hardcoded values (otherwise it will not scale, and will make it difficult to maintain consistency across the application).
  • I made sure font sizes, color choices etc are all referenced from the theme. I don't have any hardcoded dimensions.
  • My PR is granular and targeted to one specific feature.
  • I ran npm run format in server and client directories, which automatically formats your code.
  • I took a screenshot or a video and attached to this PR if there is a UI change.

@Br0wnHammer Br0wnHammer added this to the 3.5 milestone Mar 30, 2026
@Br0wnHammer Br0wnHammer self-assigned this Mar 30, 2026
@Br0wnHammer Br0wnHammer added enhancement New feature or request backend labels Mar 30, 2026
Copy link
Copy Markdown
Collaborator

@ajhollid ajhollid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general idea looks OK to me, but the devil is in the details. There's a couple of things I've noticed off the bat, as well as a critical runtime issue that needs to be fixed.

This is also a very high-frequency job; let's make the whole thing optional so this is opt in for performance-conscious users.

How about adding an env flag to enable/disable this service?

Comment on lines +178 to +180
const monitor = payload.monitor as Parameters<INotificationsService["handleNotifications"]>[0];
const monitorStatusResponse = payload.monitorStatusResponse as Parameters<INotificationsService["handleNotifications"]>[1];
const decision = payload.decision as Parameters<INotificationsService["handleNotifications"]>[2];
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Casting is forbidden in this application 😂 We lose all type safety when we cast, all types should be explicit.

for (const item of items) {
try {
await this.executeRetry(item);
await this.dlqRepository.deleteById(item.id, item.teamId);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if for some reason there's a mismatch in the monitor, ie it has the wrong teamId?

This will silently fail, and the item will never be removed from the queue. If the retry is successful, it seems to me the item should be removed regardless of whether it belongs to the correct team or not.

Comment on lines +209 to +218
case "incident_create":
case "incident_resolve": {
const monitor = payload.monitor as Parameters<IIncidentService["handleIncident"]>[0];
const code = payload.code as number;
const decision = payload.decision as Parameters<IIncidentService["handleIncident"]>[2];
const monitorStatusResponse = payload.monitorStatusResponse as Parameters<IIncidentService["handleIncident"]>[3];

await this.incidentService.handleIncident(monitor, code, decision, monitorStatusResponse);
break;
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't these also have a staleness check? The logic for creating/resolving incidents is identical to notifications 🤔 We don't want duplicate incidents in the same way we don't want duplicate notifications


this.scheduler.addJob({ id: "cleanup-orphaned", template: "cleanup-orphaned", active: true });
this.scheduler.addJob({ id: "cleanup-retention", template: "cleanup-retention-job", active: true, repeat: 24 * 60 * 60 * 1000 });
this.scheduler.addJob({ id: "dlq-retry", template: "dlq-retry-job", active: true, repeat: 30 * 1000 });
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This runs every 30 seconds, what happens if a retry takes longer than 30 seconds? We'll have duplicate notifications then will we not? 🤔 There should be some sort of concurrency guard here to abort the run if the previous run is not finished.

incidentsRepository = new TimescaleIncidentsRepository(pool);
teamsRepository = new TimescaleTeamsRepository(pool);
maintenanceWindowsRepository = new TimescaleMaintenanceWindowsRepository(pool);
dlqRepository = new MongoDLQRepository(); // TODO: Replace with TimescaleDLQRepository(pool) once implemented
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard no, this will cause a runtime crash. At least stub out the TimescaleDB implementation so it fails gracefully

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants