Parallel Temp Tables - PG18 by danolivo · Pull Request #19 · danolivo/pgdev

danolivo · 2026-03-09T15:16:55Z

No description provided.

Builds PostgreSQL with cassert, debug, TAP tests, OpenSSL, ICU, LZ4, and zstd, then runs the full make check-world suite. On failure, collects regression diffs, logs, and TAP output as downloadable artifacts and inlines diffs in the job summary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

This is a preliminary commit that tracks the state of temp buffers. The main goal of this statistic is to give the optimiser the numbers it needs to estimate the cost of flushing temporary buffers. Such a flush may be necessary if the optimiser decides to build a plan that includes a parallel section of the query, which involves scanning a temporary table.

Change the concept of parallel safety slightly: a query subtree may be parallel-safe even if it includes a temporary table scan, provided that every buffer of this temporary table is flushed to disk. In this case, minor changes within the planner and executor may allow scanning the temporary table in parallel. With this commit, the optimiser uses the 'parallel_safe' flag to indicate that the subtree refers to a source with temporary storage. Path's parallel_safe field may be used in cost-based optimisation; Plan's parallel_safe field indicates whether a Gather or GatherMerge node should flush all temporary buffers before launching any parallel worker. We don't make this flag very selective. If different paths of the same RelOptInfo have various targets, we indicate that each path requires buffer flushing, even if only one of them actually needs it.

This commit adds a flag to Gather and GatherMerge that indicates whether the subtree contains temporary tables. Additionally, to prevent multiple flush attempts, EState has a flag that indicates whether temporary buffers have already been written to disk. Employing these two flags, Gather flushes temporary buffers before launching any parallel worker. Add some checks to detect accidental scanning of a temp table whose buffers have not yet been flushed.
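The interplay of the two flags can be sketched as follows. This is a minimal standalone model, not the patch's code: the struct and field names (SketchEState, temp_buffers_flushed, subtree_has_temp) are hypothetical, and a counter stands in for the actual buffer write-out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the executor-wide state (EState). */
typedef struct SketchEState
{
    bool temp_buffers_flushed;  /* set once the flush has been done */
    int  flush_calls;           /* counts flushes; stands in for real I/O */
} SketchEState;

/* Hypothetical stand-in for a Gather / GatherMerge node. */
typedef struct SketchGather
{
    bool subtree_has_temp;      /* subtree scans a temporary table */
} SketchGather;

/*
 * Called before launching workers: flush only if this subtree touches
 * temp tables and nobody in this query has flushed yet.
 */
void
sketch_gather_prelaunch(SketchEState *estate, const SketchGather *node)
{
    if (node->subtree_has_temp && !estate->temp_buffers_flushed)
    {
        estate->flush_calls++;
        estate->temp_buffers_flushed = true;
    }
}
```

With two Gather nodes over temp tables in one plan, only the first one that runs performs the flush; a Gather whose subtree has no temp table never triggers it.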

Consider the extra cost of flushing temporary tables in partial path comparisons. With this commit, the optimiser gains a basis for a cost-based decision on enabling the parallel scan of subtrees that include temporary tables. This is achieved by adding an extra 'flush buffers' weighting factor to the path comparison routine. The cost is trivial to calculate: track the number of dirtied temporary buffers and multiply it by the write_page_cost parameter. The functions compare_path_costs and compare_fractional_path_costs were modified to account for this additional factor.
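A rough sketch of that arithmetic, assuming a simple per-backend counter of dirty temp buffers. The struct and function names are hypothetical; only write_page_cost is a name taken from the commit message, and it is passed in as a plain argument here.

```c
#include <stdbool.h>

/* Hypothetical per-backend statistic: how many temp buffers are dirty. */
typedef struct SketchTempBufStats
{
    long dirtied;
} SketchTempBufStats;

/* A buffer is being marked dirty; count it only on a clean -> dirty change. */
void
sketch_note_dirty(SketchTempBufStats *s, bool was_dirty)
{
    if (!was_dirty)
        s->dirtied++;
}

/* A buffer was written out or dropped; uncount it if it had been dirty. */
void
sketch_note_flushed(SketchTempBufStats *s, bool was_dirty)
{
    if (was_dirty && s->dirtied > 0)
        s->dirtied--;
}

/* Estimated flush cost: dirty buffer count times the per-page write cost. */
double
sketch_temp_flush_cost(const SketchTempBufStats *s, double write_page_cost)
{
    return (double) s->dirtied * write_page_cost;
}
```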

The temp buffer flush penalty was applied during compare_path_costs() to paths marked NEEDS_TEMP_FLUSH. However, since Gather paths are always PARALLEL_UNSAFE, this penalty never affected the critical Gather-vs-serial comparison in pathlist -- the exact decision point where parallel vs serial plans compete. Move the flush cost accounting into cost_gather() and cost_gather_merge(), where it belongs: the flush is a one-time startup cost performed by the leader before launching workers. This makes the cost visible in the stored path cost, so add_path() correctly weighs it when comparing Gather-wrapped partial paths against serial alternatives. Remove the comparison-time adjustment from compare_path_costs(), which also eliminates false-positive penalties on paths that share a RelOptInfo with temp-touching paths but don't themselves require a flush.
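To illustrate why the placement matters, here is a simplified model of charging the flush as a one-time startup cost of the Gather path. It is a sketch under assumptions: SketchPath and sketch_cost_gather are hypothetical names, and the real cost_gather() also accounts for per-tuple transfer costs that are omitted here.

```c
#include <stdbool.h>

/* Hypothetical, stripped-down path: just the two stored costs. */
typedef struct SketchPath
{
    double startup_cost;
    double total_cost;
} SketchPath;

/*
 * Build the cost of a Gather over a partial subpath. The flush is paid
 * once by the leader before workers start, so it is added to both the
 * startup and the total cost of the Gather path itself.
 */
SketchPath
sketch_cost_gather(SketchPath subpath, double parallel_setup_cost,
                   bool needs_temp_flush, long dirty_temp_buffers,
                   double write_page_cost)
{
    SketchPath gather;
    double     flush_cost = 0.0;

    if (needs_temp_flush)
        flush_cost = (double) dirty_temp_buffers * write_page_cost;

    gather.startup_cost = subpath.startup_cost + parallel_setup_cost + flush_cost;
    gather.total_cost = subpath.total_cost + parallel_setup_cost + flush_cost;
    return gather;
}
```

Because the penalty is now part of the stored cost, an ordinary comparison against a serial path sees it. For instance, with a partial path costing 500, a setup cost of 1000, and 100 dirty buffers at 2.0 each, the Gather totals 1700 instead of 1500, so a serial alternative costing 1600 wins only when the flush is required.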

The hasTempObject field in max_parallel_hazard_context was not initialized in is_parallel_safe() or max_parallel_hazard(). When the walker encountered a SubPlan it read the garbage value, which could falsely set NEEDS_TEMP_FLUSH on paths unrelated to temp tables, penalizing Gather Merge with flush cost and shifting plans to Gather + Sort even for non-temp queries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
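The class of bug is a stack-allocated context struct with one field left unset. A minimal sketch of the fix pattern, using a hypothetical two-field stand-in for the real context (only hasTempObject is a name from the commit message; the 's' marker is an illustrative "safe" value):

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, reduced stand-in for max_parallel_hazard_context. */
typedef struct SketchHazardContext
{
    char max_hazard;     /* worst hazard seen so far */
    bool hasTempObject;  /* walker saw a temp relation */
} SketchHazardContext;

/*
 * Initialize every field before the walker runs, so that no code path
 * (such as the SubPlan case) can read leftover stack contents.
 */
void
sketch_init_hazard_context(SketchHazardContext *ctx)
{
    memset(ctx, 0, sizeof(*ctx));
    ctx->max_hazard = 's';
    ctx->hasTempObject = false;
}
```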

With parallel scanning of temporary tables, the leader and each worker keep their own private local-buffer pool over the same on-disk file. The pre-launch flush makes that file consistent only at the moment workers start; nothing keeps it consistent during the scan.

A "read-only" scan is not write-free. Setting hint bits (MarkBufferDirtyHint -> MarkLocalBufferDirty) and opportunistic pruning (heap_page_prune_opt) both dirty the page. When such a dirty local buffer is evicted it is written back via FlushLocalBuffer -> smgrwrite, and that write races with the other participants reading the same file through their own pools. The result is torn pages and "invalid page in block N of relation base/..." errors.

Suppress these read-time, purely optional modifications on temporary (local-buffer) relations while in a parallel section:

- MarkBufferDirtyHint(): for a local buffer, still set the hint on the page image, but do not mark the buffer dirty, so the change is discarded on eviction rather than written. This also covers index LP_DEAD "kill" hints, which take the same path.
- heap_page_prune_opt(): skip opportunistic pruning for local-buffer relations.

Gate both on IsInParallelMode(), true for the leader and the workers alike, so the leader's own scan participation is covered too. Both operations are optimizations only, so correctness is unaffected; the hints and pruning are re-established by ordinary, non-parallel access later.
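The two suppressions can be modelled in a few lines. This is a simplified standalone sketch, not the patch: the struct and function names are hypothetical, the hint is applied inside the helper for brevity, and the in_parallel_mode argument stands in for a call to IsInParallelMode().

```c
#include <stdbool.h>

/* Hypothetical model of one buffer and the page image it holds. */
typedef struct SketchLocalBuffer
{
    bool     is_local;   /* belongs to a temporary relation */
    bool     dirty;      /* would be written back on eviction */
    unsigned hint_bits;  /* hint bits set on the in-memory page */
} SketchLocalBuffer;

/*
 * Set a hint on the page image. For a local buffer inside a parallel
 * section the buffer is left clean, so the hint is simply lost when the
 * buffer is evicted instead of being written to the shared file.
 */
void
sketch_mark_dirty_hint(SketchLocalBuffer *buf, unsigned hint,
                       bool in_parallel_mode)
{
    buf->hint_bits |= hint;

    if (buf->is_local && in_parallel_mode)
        return;

    buf->dirty = true;
}

/* Opportunistic pruning is skipped under the same condition. */
bool
sketch_should_prune_opt(bool rel_uses_local_buffers, bool in_parallel_mode)
{
    return !(rel_uses_local_buffers && in_parallel_mode);
}
```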

A parallel worker must never dirty or write back a local (temporary-relation) buffer: it shares the on-disk file with the leader and the other workers, so any such write corrupts the file. The previous commit already prevents workers from dirtying buffers by suppressing the read-time hint and pruning paths; this adds defense in depth against any path missed now or introduced later.

Guard the two chokepoints, gated on IsParallelWorker():

- MarkLocalBufferDirty(): the point where a local buffer becomes dirty.
- FlushLocalBuffer(): the point where one is written back. The check is placed before StartLocalBufferIO(), so an early return lets the caller (GetLocalVictimBuffer) invalidate the still-dirty buffer and discard the page rather than write it.

In assert-enabled builds these fail hard. In production they log once per backend (to avoid flooding the log) and leave the buffer clean; refusing the write is the safe action, since a worker has no legitimate temporary write to lose.

The gate is IsParallelWorker(), not IsInParallelMode(), because the leader legitimately writes temporary relations while in a parallel section -- for example CREATE TABLE AS / SELECT INTO a temp table, or the transient temp heap that REFRESH MATERIALIZED VIEW CONCURRENTLY fills -- using the transaction's existing XID while the data-source query runs in parallel. Those destinations are private to the leader and are never scanned by the workers, so dirtying them is safe.
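A sketch of the write-back guard on the production (non-assert) path, under stated assumptions: all names are hypothetical, the is_parallel_worker argument stands in for IsParallelWorker(), and a counter plus fprintf stand in for the real write and server log.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-backend counter: how many warnings were emitted. */
int sketch_warnings_logged = 0;

/*
 * Model of the write-back chokepoint. Returns true if the page was
 * written. A worker refuses the write and returns early, leaving the
 * buffer as it was so the caller can invalidate it and discard the page.
 */
bool
sketch_flush_local_buffer(bool is_parallel_worker, bool *dirty, int *writes)
{
    if (is_parallel_worker)
    {
        /* Log only the first occurrence to avoid flooding the log. */
        if (sketch_warnings_logged == 0)
        {
            fprintf(stderr, "worker attempted to write a temp buffer\n");
            sketch_warnings_logged++;
        }
        return false;
    }

    (*writes)++;     /* stands in for the actual write to disk */
    *dirty = false;
    return true;
}
```

The leader takes the normal branch, which matches the reasoning for gating on the worker check rather than on parallel mode: its writes to leader-private temp destinations must still go through.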

danolivo self-assigned this Mar 9, 2026

danolivo added the enhancement New feature or request label Mar 9, 2026

danolivo force-pushed the ptt-rel-18 branch from efa0723 to dbb594a Compare March 11, 2026 09:37

danolivo changed the title ~~Parallel Temp Tables~~ Parallel Temp Tables - PG18 Mar 11, 2026

danolivo force-pushed the ptt-rel-18 branch 2 times, most recently from 478fc25 to c9244be Compare March 11, 2026 10:36

danolivo force-pushed the REL_18_STABLE branch from eaefeea to 0852643 Compare March 23, 2026 10:52

danolivo and others added 8 commits April 28, 2026 13:02

Arrange code after back-porting from the master to REL_18_STABLE

bae4233

danolivo force-pushed the ptt-rel-18 branch from 0b4ca52 to 6f48de5 Compare April 28, 2026 11:03

danolivo added 2 commits June 4, 2026 22:58

danolivo wants to merge 10 commits into REL_18_STABLE from ptt-rel-18