Postgres-native simplifications + testcontainer test tier#2
Merged
Conversation
Recurring every()/daily() jobs fired from JobScheduler's in-process timer ran on every instance, so a job with side effects (email, billing, webhooks) ran N times across N instances — and the ops-flush jobs double-counted ops_timeseries. Gate executeJob behind a time-slot claim: each tick computes slot = floor(now / period) and row-locks the job's brace_scheduled_runs row (SELECT ... FOR UPDATE). The first instance to reach a new slot advances last_run_slot and runs; instances still on an already-claimed slot skip. Because every instance derives the same slot from wall-clock, this is exactly-once per interval regardless of tick stagger. Rows are pre-seeded at start() so the hot path never hits an INSERT race. The no-DB path (start(null), e.g. tests) runs unconditionally as before. Uses a row lock (portable to H2 for tests), not pg_try_advisory_lock — a bare advisory lock only stops concurrent dupes, not staggered ones. Adds V5__brace_scheduled_runs.sql (framework Flyway instance) and MultiInstanceSchedulerTest (two schedulers / one DB run once per interval). Full suite 582 green. Also lands the multi-server and shared-cache analysis docs that surfaced this bug, and the B1–B7 bug classification in TODO.md.
…ranch Two zero-tension Postgres-native quick wins from the simplification analysis (docs/2026-06-04-brace-postgres-native.md): B7 — JobPoller claimed a durable job by running UPDATE ... WHERE started_at IS NULL and proceeding on "no exception thrown". Under READ COMMITTED two instances can both commit that UPDATE without either throwing — one affects 1 row, the other 0 — and the 0-row loser would fall through and execute the body too (double-run). The claim now runs through raw JDBC and proceeds only when executeUpdate() == 1. Fully H2-portable; no migration, no dialect branch. The larger FOR UPDATE SKIP LOCKED batch-claim is deferred behind a Postgres testcontainer (H2 doesn't reliably support SKIP LOCKED) and tracked as Tier 1a. MySQL — removed the jdbc:mysql: -> MySQLDialect branch from detectDialect. The bundled migrations are Postgres/H2 SQL, so a MySQL URL would only fail later at migration time; failing loudly up front is clearer. Re-adding MySQL means dialect-specific migrations, not a dialect string. Also captures the Postgres-native simplification analysis as a doc with a TODO pointer, marks B7 done, and notes the MySQL-branch removal on the multi-database TODO line. Full suite 582 green.
Captures the two-tier testing plan (keep fast H2 suite, add a smaller real-Postgres tier) and the researched best practices to adopt: singleton container + withReuse for local speed, tmpfs/fsync-off knobs, truncation (not transaction-rollback) isolation given Brace manages its own transactions and the concurrency tests need committed rows across connections, the two-connection SKIP LOCKED test pattern, running migrations through Brace's own Flyway, and a failsafe *IT tier so `mvn test` stays H2-only and Docker-free. Forcing function: the multi-server concurrency fixes (B1, B7, ErrorStore upsert, durable-job lease) hinge on Postgres locking/isolation that H2 in-memory can't reproduce — they're currently validated on a DB that can't exhibit the bug they fix. Phase 0 is a pilot IT to de-risk the toolchain before migrating any tests.
…ons) Stands up the integration-test tier from docs/2026-06-05-pg-testcontainers.md: - pom: org.testcontainers:postgresql (test scope) + maven-failsafe-plugin bound to integration-test/verify. `mvn test` (surefire/H2) stays H2-only and Docker-free; `mvn verify` additionally runs *IT tests. - PostgresSkipLockedClaimIT: tmpfs/fsync-off postgres:16-alpine singleton, applies the framework V1-V5 migrations on real Postgres via Brace's own Flyway config, asserts they applied, and proves FOR UPDATE SKIP LOCKED hands disjoint batches to two concurrent JDBC connections — the property H2 in-memory can't express and the reason the durable-job work needs a real-Postgres tier. Truncation isolation; assumeTrue(Docker) so it skips cleanly without Docker. Surfaced building it: the test-app fixture db/migration/V1__create_posts.sql uses H2/MySQL-only BIGINT AUTO_INCREMENT, which fails on Postgres — exactly the dialect bug this tier catches, and why the pilot runs framework migrations directly rather than through DatabaseFactory. Verified: compiles; `mvn test` still 582 green and excludes the IT. NOT yet run against a live container — this machine has no Docker; needs `mvn verify` where Docker is present (CI) to confirm the IT itself green.
The Postgres testcontainer pilot caught a latent PRODUCTION bug on its
first real-Postgres run: Flyway 10 split per-database support out of
flyway-core. Core bundles only a few handlers (H2 among them — which is
why the H2 test suite always passed), but PostgreSQL needs its own
flyway-database-postgresql module. Without it, any Brace app deploying
on Postgres (the stated production database) fails at startup when
DatabaseFactory runs migrations:
FlywayException: No database found to handle jdbc:postgresql://...
Add the module at runtime scope (like the JDBC driver) so it propagates
transitively to downstream apps. The H2-only test suite never exercised
the real prod migration path, so this went unseen.
Also fix the test-app fixture db/migration/V1__create_posts.sql, which
used H2/MySQL-only BIGINT AUTO_INCREMENT (fails on Postgres) — switched
to portable BIGINT GENERATED BY DEFAULT AS IDENTITY, keeping Hibernate's
IDENTITY semantics on both H2 and Postgres.
Verified: `mvn verify` green — surefire 582 on H2, failsafe
PostgresSkipLockedClaimIT 2/2 on real Postgres (migrations apply +
SKIP LOCKED hands disjoint batches to two concurrent connections).
…ches Adds .github/workflows/ci.yml running `mvn verify` on pull_request and non-main pushes, so the failsafe Postgres IT tier runs before merge — the coverage that would have caught the missing flyway-database-postgresql module (the H2-only surefire suite never could). main/tags are intentionally excluded: publish.yml's `mvn deploy` already traverses the integration-test/verify phases (failsafe is bound there), so it runs the same IT before publishing — no need to duplicate it here. GitHub ubuntu-latest runners ship Docker, so Testcontainers needs no extra setup. Also records in the plan doc how real Brace+Postgres sites got past the Flyway bug: each app declared flyway-database-postgresql itself (benchmark/pom.xml does); shipping it in Brace makes that workaround unnecessary going forward.
Extracts PostgresTestBase: one container shared across all *IT classes (single static field), Docker-guarded lazy start, tmpfs/fsync-off speed knobs, truncate() isolation, raw connect(). PostgresSkipLockedClaimIT refactored onto it (drops the duplicated container/connection boilerplate). Adds DatabaseFactoryPostgresIT — runs the framework's own DatabaseFactory (Flyway migrations + Hibernate StatelessSession) against real Postgres: migrations apply and an IDENTITY id round-trips. This is the payoff of the portable create_posts fix and the template Phase 2 ports the existing DatabaseFactory-based tests (DurableJobTest, MultiInstanceSchedulerTest, ErrorStoreTest, ...) onto. Verified the singleton container is shared: the second IT class starts in <1s with no new container. Adds the upgrade note docs/migrations/brace-0.1.6-to-0.1.7.md (the agent-facing migration-guide system in BRACE-AGENTS.md, shipped in the dist zip): no breaking changes; projects can drop the manual flyway-database-postgresql (and postgresql JDBC) deps they added to work around the packaging gap, now that Brace bundles them transitively. Verified: mvn verify green — surefire 582 (H2), failsafe 3 across two IT classes on real Postgres.
The two tests H2 could never truly validate now run on the PG tier: - MultiInstanceSchedulerPostgresIT (B1): two JobSchedulers on one DB, recurring job fires once per interval cluster-wide. On real Postgres the slot claim runs over an actual SELECT ... FOR UPDATE row lock that serializes the late tick until the early one commits the advanced slot — the mechanism two in-memory-H2 "instances" can't exercise. - DurableJobConcurrencyPostgresIT (B7): 80 durable jobs drained by 4 concurrent pollers; each must execute exactly once. This is the real READ COMMITTED claim race the row-count fix closes — two pollers can both commit UPDATE ... WHERE started_at IS NULL (1-row and 0-row) without throwing, and the loser must not run the body. H2 single-process can't produce the race; real Postgres can. Both extend PostgresTestBase and share the singleton container. The H2 MultiInstanceSchedulerTest/DurableJobTest stay as fast Docker-free smoke/functional tests. mvn verify green: 582 H2 + 5 across four IT classes on real Postgres.
Convert ErrorStore's check-then-insert to a Postgres-native atomic upsert, closing the duplicate-row/lost-increment race that H2 in-memory can't show. ErrorStore.record now branches on DatabaseFactory.isPostgres(): - Postgres: single INSERT ... ON CONFLICT (error_type, route) WHERE resolved_at IS NULL DO UPDATE ... RETURNING (xmax = 0), via raw JDBC (so Hibernate doesn't misclassify the RETURNING; xmax=0 is the new-vs-repeat signal the regression listener needs). Backed by a partial unique index so resolved errors stay out of the dedupe and a recurrence after resolve still gets a fresh row. - H2: keeps check-then-insert unchanged. Prune + listener logic stays shared. The index ships as a Postgres-only framework migration (migration_pg/V6), added to Flyway's locations only on Postgres -- a separate top-level dir, not a subdir of brace/db/migration, because Flyway scans recursively and the H2 tier would otherwise pick up DDL it can't parse. ErrorStorePostgresIT (5 tests) guards the fix: parity with the H2 branch plus the property H2 can't express -- 50 concurrent writers recording the same (type, route) fold into exactly one row with an exact count. mvn test (H2) green; mvn verify (full PG IT tier) green.
…ers (Tier 1b)
Convert every framework instant column from zone-less TIMESTAMP to
TIMESTAMP WITH TIME ZONE, fixing a latent multi-server correctness bug and
deleting the timestamp parsers that existed only to normalize zone-less reads.
Why: zone-less TIMESTAMP stores a bare wall-clock with no anchor, so an instant
round-tripped through a server whose JVM/OS zone isn't UTC can shift. And
because drivers hand zone-less values back in different string shapes, the code
carried hand-rolled multi-format parsers (ErrorStore.parseFirstSeen,
RegressionTracker.parseInstant) just to guess them back into Instants.
- V7__brace_timestamptz.sql: one shared migration ALTERing all instant columns
(scheduled_jobs, ops_errors, ops_timeseries, ops_profiling_snapshots,
brace_scheduled_runs) via the SQL-standard SET DATA TYPE form, which parses on
both H2 and Postgres. The Postgres-only USING ... AT TIME ZONE clause is
omitted so one file runs on both tiers; correct on the UTC-session
deployments Brace targets. V7 not V6: migration_pg/V6 already holds version 6
in the shared flyway_brace_history sequence on Postgres.
- ErrorStore: list()/resolve() put a typed Instant in the result map via a
toInstant(Object) type-dispatch helper (OffsetDateTime/Timestamp/Instant, and
throws loudly on anything else instead of silently returning null); the since
filter compares Instants directly. parseFirstSeen deleted.
- RegressionTracker.seed: casts the map's firstSeen straight to Instant.
parseInstant + its now-unused imports deleted.
- Json already serializes Instant as ISO-8601, so /ops/errors timestamps are now
canonical ISO instead of driver-shaped strings.
Also fixes a latent test-tier bug surfaced by this work: PostgresSkipLockedClaimIT
hand-rolled Flyway with only the base location, so when a DatabaseFactory-based
IT applied the pg-only migration_pg/V6 into the shared history first, its
migrate() failed validation ("applied migration not resolved locally: 6").
Fixed by mirroring DatabaseFactory's full location set (base + migration_pg).
mvn test (H2) green; mvn verify (full PG IT tier) green.
Fold the durable-job selection + claim into a single FOR UPDATE SKIP LOCKED transaction on Postgres, so concurrent instances get disjoint batches and each job is claimed exactly once by construction — replacing the per-row re-claim dance on that path. JobPoller.pollAndExecute now branches on DatabaseFactory.isPostgres(): - Postgres (claimBatchPostgres): one raw-JDBC transaction runs UPDATE scheduled_jobs SET started_at = CURRENT_TIMESTAMP, attempts = attempts + 1 WHERE id IN (SELECT id ... ORDER BY run_at LIMIT 50 FOR UPDATE SKIP LOCKED) RETURNING ... . SKIP LOCKED makes concurrent pollers step over each other's locked rows, so the rows come back already claimed and runJobBody executes them directly. RETURNING attempts - 1 hands back the pre-increment count so the shared retry math is unchanged. - H2 (selectCandidatesH2 + claimAndExecute): unchanged portable path — unlocked candidate select, then each row defends its own claim with the B7 row-count guard. H2 can't express SKIP LOCKED, hence the dialect branch. executeJob is split into runJobBody (execute + complete/retry/fail, shared) and claimAndExecute (H2 per-row claim then runJobBody). DurableJobConcurrencyPostgresIT now exercises the batch path (80 jobs, 4 concurrent pollers, exactly-once) and its Javadoc updated to describe it; the B7 per-row guard remains the H2 path's defense. mvn test (H2, 582) green; mvn verify (full PG IT tier, 10 tests) green.
…on (Tier 1c) The ops-flush jobs wrote each metric as its own single-row INSERT (up to ~21 per flush, up to 40 for the profiling snapshot), and ops_timeseries / ops_profiling_snapshots had no retention, so they grew unbounded. - Batch inserts: the http/cache/jvm flushes build a LinkedHashMap and go through a new Brace.insertMetrics helper that emits one multi-row INSERT INTO ops_timeseries (ts, metric, val) VALUES (?,?,?), (?,?,?), ... (Database.sql renumbers the ? placeholders). The profiling flush builds one multi-row INSERT for its up-to-40 rows. ~21 round-trips/flush -> 1; 40 -> 1. - Retention: new ops-metrics-prune daily job deletes ops_timeseries and ops_profiling_snapshots rows older than 14 days, closing the unbounded-growth foot-gun. Runs once cluster-wide via the existing B1 coordination. The date_trunc rollup from the plan is deliberately deferred: ops_timeseries has no reader today (dashboard sparklines use in-memory Stats.snapshot()), so a rollup query would be dead code until a historical-metrics consumer exists. insertMetrics is package-private so OpsMetricsFlushTest can exercise the dynamic multi-row SQL directly on H2; OpsIntegrationTest now also asserts the prune job is registered. mvn test (H2, 585) green; mvn verify (full PG IT tier, 10) green.
…omplete Record the empirical findings rather than leave them as open "useful, incremental" items that would be re-litigated: - 3a (INSERT ... RETURNING id in Jobs.schedule): premise is stale — schedule already runs through db.jdbc(), which increments queryCount, so the insert is already counted. And H2 2.3.232 rejects INSERT ... RETURNING (verified), so it would need a Postgres/H2 dialect branch — more code than today — for no gain. The current raw-JDBC getGeneratedKeys is already the portable, counted form. - 3b (JSONB for queries_before/request_headers/job_data): deferred. The value is read-side only and there is no reader; converting write-first is storage churn needing Postgres-only DDL plus dialect-branched ::jsonb casts (text->jsonb isn't an implicit write cast). Land it with the query feature that consumes it. The simplification pass is effectively complete: everything that simplifies or corrects existing code shipped (1a/1b/1c/1d/2a). The remainder is gated on consumers that don't exist yet — 2b is part of the unbuilt shared-cache backend (its own task), 3b waits for a JSONB reader.
megamattron
added a commit
that referenced
this pull request
Jun 14, 2026
…sion.sh) Fills gap #2 in docs/2026-06-11-runtime-performance-review-todos.md: the TFB endpoints never configure sessionSecret, so session decrypt, CSRF validation, and form parsing (H5/M2/M3) were invisible to the existing suite. SessionApp is a separate no-database app (port 8081) so the standard App.java stays byte-identical across checkpoints. run-session.sh primes a session cookie + CSRF token, sanity-checks all three scenarios, then drives wrk: session-read GET, CSRF form POST, and a csrf(false) API POST with the cookie attached. BENCH_JAR override enables before/after framework comparisons.
megamattron
added a commit
that referenced
this pull request
Jun 14, 2026
Follow-ups from the merge-gate code review of the Low batch: #1 Verify ?v= against the current fingerprint before promising immutable. serveStaticFile trusted any ?v= param's presence and emitted 1-year immutable; a stale or hand-rolled ?v= (or one a CDN/client appended) could pin wrong/old bytes for a year. Now Assets.currentVersion(path) returns the current content hash (shared (path,mtime) cache) and only an exact match earns immutable; everything else is revalidate-always. #2 Bundled htmx.min.js now carries an ETag + revalidate Cache-Control and honors conditional GETs (304), so browsers skip the ~50KB re-download each page. Not immutable — a brace upgrade can change the bytes at that fixed URL. #3 serveStaticFile uses one Files.readAttributes (size+mtime+isRegularFile) instead of four separate File stat syscalls (exists/isFile/length/lastModified). #4 Removed the now-dead null-invoker fallback in the request path and the unused null-producing Route(method,pattern,handler) constructor, so 'every Route has a non-null invoker' (L1) is enforced by construction. #8 isNotModified: dropped the unused trimmed-copy var (split ran on the original). StaticFilesTest +2 (stale ?v= -> revalidate, htmx revalidates); suite 846/846.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Lands the Postgres testcontainer test tier and the postgres-native simplifications it unblocks. All changes are proven on both the H2 fast suite (
mvn test) and the real-Postgres IT tier (mvn verify).Test tier (earlier commits)
*ITtier via maven-failsafe:PostgresTestBase(singleton container, tmpfs/fsync-off, truncation isolation), run bymvn verify;mvn teststays H2-only and Docker-free.flyway-database-postgresql(migrations would fail at startup on Postgres). Fixed.ci.ymlrunsmvn verifyon PRs + non-main branches.Postgres-native simplifications (this batch)
ErrorStoreON CONFLICTupsert + partial unique index (migration_pg/V6, Postgres-only). Closes the duplicate-row/lost-increment race;ErrorStorePostgresITproves exactly-once under 50 concurrent writers. H2 keeps check-then-insert.TIMESTAMPTZfor every framework instant column (V7), and deleted both hand-rolled multi-format timestamp parsers (ErrorStore.parseFirstSeen,RegressionTracker.parseInstant) in favor of typedInstantreads. Fixes a latent non-UTC-server correctness bug.UPDATE … WHERE id IN (SELECT … FOR UPDATE SKIP LOCKED) RETURNING …transaction (disjoint batches, exactly-once by construction); H2 keeps the portable per-row B7 claim.ops-metrics-pruneretention job closes the unbounded-growth foot-gun.Assessed and deferred (documented, not built)
INSERT … RETURNING id): not worth it — insert is already counted viadb.jdbc, and H2 2.3.232 rejectsRETURNING(would need a dialect branch for no gain).TEXT[]+GIN): deferred until a consumer exists; 2b is part of the unbuilt shared-cache backend (its own task).See
docs/2026-06-04-brace-postgres-native.mdanddocs/2026-06-05-pg-testcontainers.mdfor the full plan/status.