Feat(DLQ): Add DLQ service with retry logic and scheduler integration#3451
Feat(DLQ): Add DLQ service with retry logic and scheduler integration#3451Br0wnHammer wants to merge 5 commits intodevelopfrom
Conversation
…d staleness checks
…and add retry/cleanup jobs
ajhollid
left a comment
There was a problem hiding this comment.
The general idea looks OK to me, but the devil is in the details. There's a couple of things I've noticed off the bat, as well as a critical runtime issue that needs to be fixed.
This is also a very high-frequency job; let's make the whole thing optional so this is opt in for performance-conscious users.
How about adding an env flag to enable/disable this service?
| const monitor = payload.monitor as Parameters<INotificationsService["handleNotifications"]>[0]; | ||
| const monitorStatusResponse = payload.monitorStatusResponse as Parameters<INotificationsService["handleNotifications"]>[1]; | ||
| const decision = payload.decision as Parameters<INotificationsService["handleNotifications"]>[2]; |
There was a problem hiding this comment.
Casting is forbidden in this application 😂 We lose all type safety when we cast, all types should be explicit.
| for (const item of items) { | ||
| try { | ||
| await this.executeRetry(item); | ||
| await this.dlqRepository.deleteById(item.id, item.teamId); |
There was a problem hiding this comment.
What happens if for some reason there's a mismatch in the monitor, ie it has the wrong teamId?
This will silently fail, and the item will never be removed from the queue. If the retry is successful, it seems to me the item should be removed regardless of whether it belongs to the correct team or not.
| case "incident_create": | ||
| case "incident_resolve": { | ||
| const monitor = payload.monitor as Parameters<IIncidentService["handleIncident"]>[0]; | ||
| const code = payload.code as number; | ||
| const decision = payload.decision as Parameters<IIncidentService["handleIncident"]>[2]; | ||
| const monitorStatusResponse = payload.monitorStatusResponse as Parameters<IIncidentService["handleIncident"]>[3]; | ||
|
|
||
| await this.incidentService.handleIncident(monitor, code, decision, monitorStatusResponse); | ||
| break; | ||
| } |
There was a problem hiding this comment.
Shouldn't these also have a staleness check? The logic for creating/resolving incidents is identical to notifications 🤔 We don't want duplicate incidents in the same way we don't want duplicate notifications
|
|
||
| this.scheduler.addJob({ id: "cleanup-orphaned", template: "cleanup-orphaned", active: true }); | ||
| this.scheduler.addJob({ id: "cleanup-retention", template: "cleanup-retention-job", active: true, repeat: 24 * 60 * 60 * 1000 }); | ||
| this.scheduler.addJob({ id: "dlq-retry", template: "dlq-retry-job", active: true, repeat: 30 * 1000 }); |
There was a problem hiding this comment.
This runs every 30 seconds, what happens if a retry takes longer than 30 seconds? We'll have duplicate notifications then will we not? 🤔 There should be some sort of concurrency guard here to abort the run if the previous run is not finished.
| incidentsRepository = new TimescaleIncidentsRepository(pool); | ||
| teamsRepository = new TimescaleTeamsRepository(pool); | ||
| maintenanceWindowsRepository = new TimescaleMaintenanceWindowsRepository(pool); | ||
| dlqRepository = new MongoDLQRepository(); // TODO: Replace with TimescaleDLQRepository(pool) once implemented |
There was a problem hiding this comment.
Hard no, this will cause a runtime crash. At least stub out the TimescaleDB implementation so it fails gracefully
Describe your changes
Please ensure all items are checked off before requesting a review. "Checked off" means you need to add an "x" character between brackets so they turn into checkmarks.
<div>Add</div>, use):npm run formatin server and client directories, which automatically formats your code.