Skip to content

Postgres-native simplifications + testcontainer test tier#2

Merged
megamattron merged 14 commits into
mainfrom
postgres-native-simplifications
Jun 6, 2026
Merged

Postgres-native simplifications + testcontainer test tier#2
megamattron merged 14 commits into
mainfrom
postgres-native-simplifications

Conversation

@megamattron

Copy link
Copy Markdown
Member

Lands the Postgres testcontainer test tier and the postgres-native simplifications it unblocks. All changes are proven on both the H2 fast suite (mvn test) and the real-Postgres IT tier (mvn verify).

Test tier (earlier commits)

  • Postgres Testcontainers *IT tier via maven-failsafe: PostgresTestBase (singleton container, tmpfs/fsync-off, truncation isolation), run by mvn verify; mvn test stays H2-only and Docker-free.
  • Caught a real prod bug on first run: missing flyway-database-postgresql (migrations would fail at startup on Postgres). Fixed.
  • CI: ci.yml runs mvn verify on PRs + non-main branches.
  • B1 (cluster-wide recurring scheduler) and B7 (durable-job double-run guard) concurrency fixes, ported to real-Postgres ITs.

Postgres-native simplifications (this batch)

  • 2a — ErrorStore ON CONFLICT upsert + partial unique index (migration_pg/V6, Postgres-only). Closes the duplicate-row/lost-increment race; ErrorStorePostgresIT proves exactly-once under 50 concurrent writers. H2 keeps check-then-insert.
  • 1b — TIMESTAMPTZ for every framework instant column (V7), and deleted both hand-rolled multi-format timestamp parsers (ErrorStore.parseFirstSeen, RegressionTracker.parseInstant) in favor of typed Instant reads. Fixes a latent non-UTC-server correctness bug.
  • 1a — SKIP LOCKED durable-job batch claim. Postgres folds select+claim into one UPDATE … WHERE id IN (SELECT … FOR UPDATE SKIP LOCKED) RETURNING … transaction (disjoint batches, exactly-once by construction); H2 keeps the portable per-row B7 claim.
  • 1c — batched ops flush inserts + retention. Per-flush single-row INSERTs collapsed into one multi-row INSERT each (~21→1, 40→1); new daily ops-metrics-prune retention job closes the unbounded-growth foot-gun.

Assessed and deferred (documented, not built)

  • 3a (INSERT … RETURNING id): not worth it — insert is already counted via db.jdbc, and H2 2.3.232 rejects RETURNING (would need a dialect branch for no gain).
  • 3b (JSONB) and 2b (TEXT[]+GIN): deferred until a consumer exists; 2b is part of the unbuilt shared-cache backend (its own task).

See docs/2026-06-04-brace-postgres-native.md and docs/2026-06-05-pg-testcontainers.md for the full plan/status.

Recurring every()/daily() jobs fired from JobScheduler's in-process timer
ran on every instance, so a job with side effects (email, billing,
webhooks) ran N times across N instances — and the ops-flush jobs
double-counted ops_timeseries.

Gate executeJob behind a time-slot claim: each tick computes
slot = floor(now / period) and row-locks the job's brace_scheduled_runs
row (SELECT ... FOR UPDATE). The first instance to reach a new slot
advances last_run_slot and runs; instances still on an already-claimed
slot skip. Because every instance derives the same slot from wall-clock,
this is exactly-once per interval regardless of tick stagger. Rows are
pre-seeded at start() so the hot path never hits an INSERT race. The
no-DB path (start(null), e.g. tests) runs unconditionally as before.

Uses a row lock (portable to H2 for tests), not pg_try_advisory_lock —
a bare advisory lock only stops concurrent dupes, not staggered ones.

Adds V5__brace_scheduled_runs.sql (framework Flyway instance) and
MultiInstanceSchedulerTest (two schedulers / one DB run once per
interval). Full suite 582 green.

Also lands the multi-server and shared-cache analysis docs that surfaced
this bug, and the B1–B7 bug classification in TODO.md.
…ranch

Two zero-tension Postgres-native quick wins from the simplification
analysis (docs/2026-06-04-brace-postgres-native.md):

B7 — JobPoller claimed a durable job by running UPDATE ... WHERE
started_at IS NULL and proceeding on "no exception thrown". Under READ
COMMITTED two instances can both commit that UPDATE without either
throwing — one affects 1 row, the other 0 — and the 0-row loser would
fall through and execute the body too (double-run). The claim now runs
through raw JDBC and proceeds only when executeUpdate() == 1. Fully
H2-portable; no migration, no dialect branch. The larger FOR UPDATE
SKIP LOCKED batch-claim is deferred behind a Postgres testcontainer
(H2 doesn't reliably support SKIP LOCKED) and tracked as Tier 1a.

MySQL — removed the jdbc:mysql: -> MySQLDialect branch from
detectDialect. The bundled migrations are Postgres/H2 SQL, so a MySQL
URL would only fail later at migration time; failing loudly up front is
clearer. Re-adding MySQL means dialect-specific migrations, not a
dialect string.

Also captures the Postgres-native simplification analysis as a doc with
a TODO pointer, marks B7 done, and notes the MySQL-branch removal on the
multi-database TODO line. Full suite 582 green.
Captures the two-tier testing plan (keep fast H2 suite, add a smaller
real-Postgres tier) and the researched best practices to adopt:
singleton container + withReuse for local speed, tmpfs/fsync-off knobs,
truncation (not transaction-rollback) isolation given Brace manages its
own transactions and the concurrency tests need committed rows across
connections, the two-connection SKIP LOCKED test pattern, running
migrations through Brace's own Flyway, and a failsafe *IT tier so
`mvn test` stays H2-only and Docker-free.

Forcing function: the multi-server concurrency fixes (B1, B7, ErrorStore
upsert, durable-job lease) hinge on Postgres locking/isolation that H2
in-memory can't reproduce — they're currently validated on a DB that
can't exhibit the bug they fix. Phase 0 is a pilot IT to de-risk the
toolchain before migrating any tests.
…ons)

Stands up the integration-test tier from docs/2026-06-05-pg-testcontainers.md:

- pom: org.testcontainers:postgresql (test scope) + maven-failsafe-plugin
  bound to integration-test/verify. `mvn test` (surefire/H2) stays
  H2-only and Docker-free; `mvn verify` additionally runs *IT tests.
- PostgresSkipLockedClaimIT: tmpfs/fsync-off postgres:16-alpine singleton,
  applies the framework V1-V5 migrations on real Postgres via Brace's own
  Flyway config, asserts they applied, and proves FOR UPDATE SKIP LOCKED
  hands disjoint batches to two concurrent JDBC connections — the property
  H2 in-memory can't express and the reason the durable-job work needs a
  real-Postgres tier. Truncation isolation; assumeTrue(Docker) so it skips
  cleanly without Docker.

Surfaced building it: the test-app fixture db/migration/V1__create_posts.sql
uses H2/MySQL-only BIGINT AUTO_INCREMENT, which fails on Postgres — exactly
the dialect bug this tier catches, and why the pilot runs framework
migrations directly rather than through DatabaseFactory.

Verified: compiles; `mvn test` still 582 green and excludes the IT. NOT yet
run against a live container — this machine has no Docker; needs `mvn verify`
where Docker is present (CI) to confirm the IT itself green.
The Postgres testcontainer pilot caught a latent PRODUCTION bug on its
first real-Postgres run: Flyway 10 split per-database support out of
flyway-core. Core bundles only a few handlers (H2 among them — which is
why the H2 test suite always passed), but PostgreSQL needs its own
flyway-database-postgresql module. Without it, any Brace app deploying
on Postgres (the stated production database) fails at startup when
DatabaseFactory runs migrations:

    FlywayException: No database found to handle jdbc:postgresql://...

Add the module at runtime scope (like the JDBC driver) so it propagates
transitively to downstream apps. The H2-only test suite never exercised
the real prod migration path, so this went unseen.

Also fix the test-app fixture db/migration/V1__create_posts.sql, which
used H2/MySQL-only BIGINT AUTO_INCREMENT (fails on Postgres) — switched
to portable BIGINT GENERATED BY DEFAULT AS IDENTITY, keeping Hibernate's
IDENTITY semantics on both H2 and Postgres.

Verified: `mvn verify` green — surefire 582 on H2, failsafe
PostgresSkipLockedClaimIT 2/2 on real Postgres (migrations apply +
SKIP LOCKED hands disjoint batches to two concurrent connections).
…ches

Adds .github/workflows/ci.yml running `mvn verify` on pull_request and
non-main pushes, so the failsafe Postgres IT tier runs before merge —
the coverage that would have caught the missing flyway-database-postgresql
module (the H2-only surefire suite never could).

main/tags are intentionally excluded: publish.yml's `mvn deploy` already
traverses the integration-test/verify phases (failsafe is bound there),
so it runs the same IT before publishing — no need to duplicate it here.
GitHub ubuntu-latest runners ship Docker, so Testcontainers needs no extra
setup.

Also records in the plan doc how real Brace+Postgres sites got past the
Flyway bug: each app declared flyway-database-postgresql itself
(benchmark/pom.xml does); shipping it in Brace makes that workaround
unnecessary going forward.
Extracts PostgresTestBase: one container shared across all *IT classes
(single static field), Docker-guarded lazy start, tmpfs/fsync-off speed
knobs, truncate() isolation, raw connect(). PostgresSkipLockedClaimIT
refactored onto it (drops the duplicated container/connection boilerplate).

Adds DatabaseFactoryPostgresIT — runs the framework's own DatabaseFactory
(Flyway migrations + Hibernate StatelessSession) against real Postgres:
migrations apply and an IDENTITY id round-trips. This is the payoff of the
portable create_posts fix and the template Phase 2 ports the existing
DatabaseFactory-based tests (DurableJobTest, MultiInstanceSchedulerTest,
ErrorStoreTest, ...) onto. Verified the singleton container is shared:
the second IT class starts in <1s with no new container.

Adds the upgrade note docs/migrations/brace-0.1.6-to-0.1.7.md (the
agent-facing migration-guide system in BRACE-AGENTS.md, shipped in the
dist zip): no breaking changes; projects can drop the manual
flyway-database-postgresql (and postgresql JDBC) deps they added to work
around the packaging gap, now that Brace bundles them transitively.

Verified: mvn verify green — surefire 582 (H2), failsafe 3 across two IT
classes on real Postgres.
The two tests H2 could never truly validate now run on the PG tier:

- MultiInstanceSchedulerPostgresIT (B1): two JobSchedulers on one DB,
  recurring job fires once per interval cluster-wide. On real Postgres
  the slot claim runs over an actual SELECT ... FOR UPDATE row lock that
  serializes the late tick until the early one commits the advanced slot
  — the mechanism two in-memory-H2 "instances" can't exercise.

- DurableJobConcurrencyPostgresIT (B7): 80 durable jobs drained by 4
  concurrent pollers; each must execute exactly once. This is the real
  READ COMMITTED claim race the row-count fix closes — two pollers can
  both commit UPDATE ... WHERE started_at IS NULL (1-row and 0-row)
  without throwing, and the loser must not run the body. H2 single-process
  can't produce the race; real Postgres can.

Both extend PostgresTestBase and share the singleton container. The H2
MultiInstanceSchedulerTest/DurableJobTest stay as fast Docker-free
smoke/functional tests. mvn verify green: 582 H2 + 5 across four IT
classes on real Postgres.
Convert ErrorStore's check-then-insert to a Postgres-native atomic upsert,
closing the duplicate-row/lost-increment race that H2 in-memory can't show.

ErrorStore.record now branches on DatabaseFactory.isPostgres():
- Postgres: single INSERT ... ON CONFLICT (error_type, route)
  WHERE resolved_at IS NULL DO UPDATE ... RETURNING (xmax = 0), via raw JDBC
  (so Hibernate doesn't misclassify the RETURNING; xmax=0 is the new-vs-repeat
  signal the regression listener needs). Backed by a partial unique index so
  resolved errors stay out of the dedupe and a recurrence after resolve still
  gets a fresh row.
- H2: keeps check-then-insert unchanged. Prune + listener logic stays shared.

The index ships as a Postgres-only framework migration (migration_pg/V6),
added to Flyway's locations only on Postgres -- a separate top-level dir, not a
subdir of brace/db/migration, because Flyway scans recursively and the H2 tier
would otherwise pick up DDL it can't parse.

ErrorStorePostgresIT (5 tests) guards the fix: parity with the H2 branch plus
the property H2 can't express -- 50 concurrent writers recording the same
(type, route) fold into exactly one row with an exact count.

mvn test (H2) green; mvn verify (full PG IT tier) green.
…ers (Tier 1b)

Convert every framework instant column from zone-less TIMESTAMP to
TIMESTAMP WITH TIME ZONE, fixing a latent multi-server correctness bug and
deleting the timestamp parsers that existed only to normalize zone-less reads.

Why: zone-less TIMESTAMP stores a bare wall-clock with no anchor, so an instant
round-tripped through a server whose JVM/OS zone isn't UTC can shift. And
because drivers hand zone-less values back in different string shapes, the code
carried hand-rolled multi-format parsers (ErrorStore.parseFirstSeen,
RegressionTracker.parseInstant) just to guess them back into Instants.

- V7__brace_timestamptz.sql: one shared migration ALTERing all instant columns
  (scheduled_jobs, ops_errors, ops_timeseries, ops_profiling_snapshots,
  brace_scheduled_runs) via the SQL-standard SET DATA TYPE form, which parses on
  both H2 and Postgres. The Postgres-only USING ... AT TIME ZONE clause is
  omitted so one file runs on both tiers; correct on the UTC-session
  deployments Brace targets. V7 not V6: migration_pg/V6 already holds version 6
  in the shared flyway_brace_history sequence on Postgres.
- ErrorStore: list()/resolve() put a typed Instant in the result map via a
  toInstant(Object) type-dispatch helper (OffsetDateTime/Timestamp/Instant, and
  throws loudly on anything else instead of silently returning null); the since
  filter compares Instants directly. parseFirstSeen deleted.
- RegressionTracker.seed: casts the map's firstSeen straight to Instant.
  parseInstant + its now-unused imports deleted.
- Json already serializes Instant as ISO-8601, so /ops/errors timestamps are now
  canonical ISO instead of driver-shaped strings.

Also fixes a latent test-tier bug surfaced by this work: PostgresSkipLockedClaimIT
hand-rolled Flyway with only the base location, so when a DatabaseFactory-based
IT applied the pg-only migration_pg/V6 into the shared history first, its
migrate() failed validation ("applied migration not resolved locally: 6").
Fixed by mirroring DatabaseFactory's full location set (base + migration_pg).

mvn test (H2) green; mvn verify (full PG IT tier) green.
Fold the durable-job selection + claim into a single FOR UPDATE SKIP LOCKED
transaction on Postgres, so concurrent instances get disjoint batches and each
job is claimed exactly once by construction — replacing the per-row re-claim
dance on that path.

JobPoller.pollAndExecute now branches on DatabaseFactory.isPostgres():
- Postgres (claimBatchPostgres): one raw-JDBC transaction runs
  UPDATE scheduled_jobs SET started_at = CURRENT_TIMESTAMP, attempts = attempts + 1
  WHERE id IN (SELECT id ... ORDER BY run_at LIMIT 50 FOR UPDATE SKIP LOCKED)
  RETURNING ... . SKIP LOCKED makes concurrent pollers step over each other's
  locked rows, so the rows come back already claimed and runJobBody executes them
  directly. RETURNING attempts - 1 hands back the pre-increment count so the
  shared retry math is unchanged.
- H2 (selectCandidatesH2 + claimAndExecute): unchanged portable path — unlocked
  candidate select, then each row defends its own claim with the B7 row-count
  guard. H2 can't express SKIP LOCKED, hence the dialect branch.

executeJob is split into runJobBody (execute + complete/retry/fail, shared) and
claimAndExecute (H2 per-row claim then runJobBody).

DurableJobConcurrencyPostgresIT now exercises the batch path (80 jobs, 4
concurrent pollers, exactly-once) and its Javadoc updated to describe it; the
B7 per-row guard remains the H2 path's defense.

mvn test (H2, 582) green; mvn verify (full PG IT tier, 10 tests) green.
…on (Tier 1c)

The ops-flush jobs wrote each metric as its own single-row INSERT (up to ~21
per flush, up to 40 for the profiling snapshot), and ops_timeseries /
ops_profiling_snapshots had no retention, so they grew unbounded.

- Batch inserts: the http/cache/jvm flushes build a LinkedHashMap and go through
  a new Brace.insertMetrics helper that emits one multi-row
  INSERT INTO ops_timeseries (ts, metric, val) VALUES (?,?,?), (?,?,?), ...
  (Database.sql renumbers the ? placeholders). The profiling flush builds one
  multi-row INSERT for its up-to-40 rows. ~21 round-trips/flush -> 1; 40 -> 1.
- Retention: new ops-metrics-prune daily job deletes ops_timeseries and
  ops_profiling_snapshots rows older than 14 days, closing the unbounded-growth
  foot-gun. Runs once cluster-wide via the existing B1 coordination.

The date_trunc rollup from the plan is deliberately deferred: ops_timeseries has
no reader today (dashboard sparklines use in-memory Stats.snapshot()), so a
rollup query would be dead code until a historical-metrics consumer exists.

insertMetrics is package-private so OpsMetricsFlushTest can exercise the
dynamic multi-row SQL directly on H2; OpsIntegrationTest now also asserts the
prune job is registered.

mvn test (H2, 585) green; mvn verify (full PG IT tier, 10) green.
…omplete

Record the empirical findings rather than leave them as open "useful, incremental"
items that would be re-litigated:

- 3a (INSERT ... RETURNING id in Jobs.schedule): premise is stale — schedule
  already runs through db.jdbc(), which increments queryCount, so the insert is
  already counted. And H2 2.3.232 rejects INSERT ... RETURNING (verified), so it
  would need a Postgres/H2 dialect branch — more code than today — for no gain.
  The current raw-JDBC getGeneratedKeys is already the portable, counted form.
- 3b (JSONB for queries_before/request_headers/job_data): deferred. The value is
  read-side only and there is no reader; converting write-first is storage churn
  needing Postgres-only DDL plus dialect-branched ::jsonb casts (text->jsonb isn't
  an implicit write cast). Land it with the query feature that consumes it.

The simplification pass is effectively complete: everything that simplifies or
corrects existing code shipped (1a/1b/1c/1d/2a). The remainder is gated on
consumers that don't exist yet — 2b is part of the unbuilt shared-cache backend
(its own task), 3b waits for a JSONB reader.
@megamattron megamattron merged commit dab7056 into main Jun 6, 2026
2 checks passed
@megamattron megamattron deleted the postgres-native-simplifications branch June 6, 2026 18:23
megamattron added a commit that referenced this pull request Jun 14, 2026
…sion.sh)

Fills gap #2 in docs/2026-06-11-runtime-performance-review-todos.md: the TFB
endpoints never configure sessionSecret, so session decrypt, CSRF validation,
and form parsing (H5/M2/M3) were invisible to the existing suite. SessionApp
is a separate no-database app (port 8081) so the standard App.java stays
byte-identical across checkpoints. run-session.sh primes a session cookie +
CSRF token, sanity-checks all three scenarios, then drives wrk: session-read
GET, CSRF form POST, and a csrf(false) API POST with the cookie attached.
BENCH_JAR override enables before/after framework comparisons.
megamattron added a commit that referenced this pull request Jun 14, 2026
Follow-ups from the merge-gate code review of the Low batch:

#1 Verify ?v= against the current fingerprint before promising immutable.
   serveStaticFile trusted any ?v= param's presence and emitted 1-year
   immutable; a stale or hand-rolled ?v= (or one a CDN/client appended) could
   pin wrong/old bytes for a year. Now Assets.currentVersion(path) returns the
   current content hash (shared (path,mtime) cache) and only an exact match
   earns immutable; everything else is revalidate-always.

#2 Bundled htmx.min.js now carries an ETag + revalidate Cache-Control and
   honors conditional GETs (304), so browsers skip the ~50KB re-download each
   page. Not immutable — a brace upgrade can change the bytes at that fixed URL.

#3 serveStaticFile uses one Files.readAttributes (size+mtime+isRegularFile)
   instead of four separate File stat syscalls (exists/isFile/length/lastModified).

#4 Removed the now-dead null-invoker fallback in the request path and the
   unused null-producing Route(method,pattern,handler) constructor, so 'every
   Route has a non-null invoker' (L1) is enforced by construction.

#8 isNotModified: dropped the unused trimmed-copy var (split ran on the original).

StaticFilesTest +2 (stale ?v= -> revalidate, htmx revalidates); suite 846/846.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant