Parallel Temp Tables - PG18 #19
Open
danolivo wants to merge 10 commits into
Builds PostgreSQL with cassert, debug, TAP tests, OpenSSL, ICU, LZ4, and zstd, then runs the full make check-world suite. On failure, collects regression diffs, logs, and TAP output as downloadable artifacts and inlines diffs in the job summary. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This preliminary commit tracks the state of temporary buffers. The statistics give the optimiser the numbers it needs to estimate the cost of flushing temporary buffers. Such a flush may be necessary if the optimiser builds a plan with a parallel section that scans a temporary table.
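The bookkeeping described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the patch's code: the struct, its `allocated` field, and both function names are hypothetical. The only point it illustrates is that the dirty count is maintained incrementally, so the planner can read it without walking the buffer pool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-backend counters; not the patch's actual structure. */
typedef struct TempBufferStats
{
    size_t      allocated;      /* local buffers currently in use (assumed) */
    size_t      dirtied;        /* how many of those are dirty */
} TempBufferStats;

/* Record a clean -> dirty transition; repeated calls count only once. */
void
temp_buf_note_dirty(TempBufferStats *stats, bool *buf_is_dirty)
{
    if (!*buf_is_dirty)
    {
        *buf_is_dirty = true;
        stats->dirtied++;
    }
}

/* Record that a dirty buffer was written out or dropped. */
void
temp_buf_note_clean(TempBufferStats *stats, bool *buf_is_dirty)
{
    if (*buf_is_dirty)
    {
        assert(stats->dirtied > 0);
        *buf_is_dirty = false;
        stats->dirtied--;
    }
}
```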
Change the concept of parallel safety slightly: a query subtree that scans a temporary table may be parallel-safe, provided every buffer of that temporary table has been flushed to disk. In this case, minor changes within the planner and executor may allow scanning the temporary table in parallel. With this commit, the optimiser uses the 'parallel_safe' flag to indicate that the subtree refers to a source with temporary storage. Path's parallel_safe field may be used in cost-based optimisation; Plan's parallel_safe field indicates whether a Gather or GatherMerge node should flush all temporary buffers before launching any parallel worker. We don't make this flag very selective. If different paths of the same RelOptInfo have different targets, we indicate that each path requires buffer flushing, even if only one of them actually needs it.
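One way to picture the extended flag is as a three-level value rather than a boolean. The sketch below is an assumption for illustration only: NEEDS_TEMP_FLUSH and PARALLEL_UNSAFE are names used elsewhere in this PR, but PARALLEL_SAFE, the numeric ordering, and both helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Ordered from most to least restrictive (ordering is an assumption). */
typedef enum ParallelSafety
{
    PARALLEL_UNSAFE = 0,        /* must not run under Gather */
    NEEDS_TEMP_FLUSH,           /* safe once temp buffers are on disk */
    PARALLEL_SAFE               /* hypothetical name for "fully safe" */
} ParallelSafety;

/* A parent subtree is only as safe as its most restrictive child. */
ParallelSafety
combine_parallel_safety(ParallelSafety a, ParallelSafety b)
{
    assert(a <= PARALLEL_SAFE && b <= PARALLEL_SAFE);
    return (a < b) ? a : b;
}

/* Gather / GatherMerge flush only when the subtree asks for it. */
bool
gather_must_flush(ParallelSafety subtree)
{
    return subtree == NEEDS_TEMP_FLUSH;
}
```

With this ordering, joining a plain table scan to a temp-table scan yields NEEDS_TEMP_FLUSH, and adding any unsafe child makes the whole subtree PARALLEL_UNSAFE.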
This commit adds a flag to Gather and GatherMerge that indicates whether the subtree contains temporary tables. Additionally, to prevent multiple flush attempts, EState has a flag that indicates whether temporary buffers have already been written to disk. Using these two flags, Gather flushes temporary buffers before launching any parallel worker. Add some checks to detect accidental scanning of a temp table whose buffers have not yet been flushed.
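The interaction of the two flags can be sketched as follows. All type, field, and function names here are hypothetical stand-ins (the real patch works on Gather, GatherMerge, and EState), and the sketch does not model when the "already flushed" flag would need to be reset.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the executor state; field names are assumptions. */
typedef struct EStateSketch
{
    bool        temp_buffers_flushed;   /* flush already done for this query? */
    int         flush_calls;            /* instrumentation for the sketch only */
} EStateSketch;

/* Stand-in for a Gather or GatherMerge plan node. */
typedef struct GatherSketch
{
    bool        has_temp_tables;        /* subtree scans a temporary table */
} GatherSketch;

/* Placeholder for the real "write all dirty local buffers" routine. */
void
flush_all_temp_buffers_sketch(EStateSketch *estate)
{
    estate->flush_calls++;
}

/* Called by the leader before any worker is launched. */
void
gather_prepare_launch(const GatherSketch *gather, EStateSketch *estate)
{
    if (gather->has_temp_tables && !estate->temp_buffers_flushed)
    {
        flush_all_temp_buffers_sketch(estate);
        estate->temp_buffers_flushed = true;
    }
}

/* Sanity check of the kind the commit mentions for temp-table scans. */
void
check_temp_scan_allowed(const EStateSketch *estate, bool in_parallel_section)
{
    assert(!in_parallel_section || estate->temp_buffers_flushed);
}
```

A plan with two Gather nodes over temporary tables therefore performs the flush only once.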
Consider the extra cost of flushing temporary tables in partial path comparisons. With this commit, the optimiser gains a basis for a cost-based decision on whether to enable the parallel scan of subtrees that include temporary tables. This is achieved by adding an extra 'flush buffers' weighting factor to the path comparison routine. The cost is simple to compute: the number of dirtied temporary buffers multiplied by the write_page_cost parameter. The functions compare_path_costs and compare_fractional_path_costs were modified to account for this additional factor.
The temp buffer flush penalty was applied during compare_path_costs() to paths marked NEEDS_TEMP_FLUSH. However, since Gather paths are always PARALLEL_UNSAFE, this penalty never affected the critical Gather-vs-serial comparison in pathlist -- the exact decision point where parallel vs serial plans compete. Move the flush cost accounting into cost_gather() and cost_gather_merge(), where it belongs: the flush is a one-time startup cost performed by the leader before launching workers. This makes the cost visible in the stored path cost, so add_path() correctly weighs it when comparing Gather-wrapped partial paths against serial alternatives. Remove the comparison-time adjustment from compare_path_costs(), which also eliminates false-positive penalties on paths that share a RelOptInfo with temp-touching paths but don't themselves require a flush.
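The effect of charging the flush inside the Gather cost can be sketched as below. This is an illustration under assumptions, not the patch: the struct and function names are hypothetical, and the other terms a real Gather cost includes (worker setup, per-tuple transfer) are omitted so that only the flush term is visible.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for a path's stored cost. */
typedef struct CostSketch
{
    double      startup;
    double      total;
} CostSketch;

/* Flush cost as described: dirtied temp buffers times write_page_cost. */
double
temp_flush_cost(long dirtied_buffers, double write_page_cost)
{
    assert(dirtied_buffers >= 0);
    return (double) dirtied_buffers * write_page_cost;
}

/*
 * Charge the flush once, as startup cost of the Gather node, so that the
 * stored cost already reflects it when compared with a serial path.
 */
CostSketch
cost_gather_sketch(CostSketch subpath, bool needs_temp_flush,
                   long dirtied_buffers, double write_page_cost)
{
    CostSketch  result = subpath;

    if (needs_temp_flush)
    {
        double      flush = temp_flush_cost(dirtied_buffers, write_page_cost);

        result.startup += flush;
        result.total += flush;
    }
    return result;
}
```

Because the penalty lands in the stored startup and total cost, an ordinary cost comparison against a serial alternative sees it without any special case in the comparison routine.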
The hasTempObject field in max_parallel_hazard_context was not initialized in is_parallel_safe() or max_parallel_hazard(). When the walker encountered a SubPlan it read the garbage value, which could falsely set NEEDS_TEMP_FLUSH on paths unrelated to temp tables, penalizing Gather Merge with flush cost and shifting plans to Gather + Sort even for non-temp queries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
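The class of bug described here, a stack-allocated context with one field left unset, is usually avoided by zero-initializing the whole struct before assigning individual fields. The sketch below uses a simplified, hypothetical context type; only the hasTempObject name comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the walker context. */
typedef struct HazardContextSketch
{
    char        max_hazard;     /* worst hazard seen so far; 's' = safe */
    bool        hasTempObject;  /* walker saw a temp-table reference */
} HazardContextSketch;

/* Every entry point builds its context through one initializer. */
HazardContextSketch
hazard_context_init(void)
{
    HazardContextSketch ctx = {0};  /* all fields start at zero / false */

    ctx.max_hazard = 's';
    return ctx;
}

/* The walker sets the flag only when it really finds a temp object. */
void
hazard_context_note_temp(HazardContextSketch *ctx)
{
    assert(ctx != NULL);
    ctx->hasTempObject = true;
}
```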
With parallel scanning of temporary tables, the leader and each worker
keep their own private local-buffer pool over the same on-disk file. The
pre-launch flush makes that file consistent only at the moment workers
start; nothing keeps it consistent during the scan.
A "read-only" scan is not write-free. Setting hint bits
(MarkBufferDirtyHint -> MarkLocalBufferDirty) and opportunistic pruning
(heap_page_prune_opt) both dirty the page. When such a dirty local buffer
is evicted it is written back via FlushLocalBuffer -> smgrwrite, and that
write races with the other participants reading the same file through
their own pools. The result is torn pages and "invalid page in block N of
relation base/..." errors.
Suppress these read-time, purely optional modifications on temporary
(local-buffer) relations while in a parallel section:
- MarkBufferDirtyHint(): for a local buffer, still set the hint on the
  page image, but do not mark the buffer dirty, so the change is
  discarded on eviction rather than written. This also covers index
  LP_DEAD "kill" hints, which take the same path.
- heap_page_prune_opt(): skip opportunistic pruning for local-buffer
  relations.
Gate both on IsInParallelMode(), true for the leader and the workers
alike, so the leader's own scan participation is covered too. Both
operations are optimizations only, so correctness is unaffected; the hints
and pruning are re-established by ordinary, non-parallel access later.
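The gating described in this commit message can be sketched as follows. The types and function names are hypothetical, and the in_parallel_mode parameter stands in for IsInParallelMode(). As a simplification, the hint bit is applied inside the function here rather than by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a buffer: one dirty flag plus some hint bits on the page. */
typedef struct BufferSketch
{
    bool        is_local;       /* belongs to a temporary relation */
    bool        dirty;
    unsigned    hint_bits;
} BufferSketch;

/*
 * Hint-bit update: the in-memory page image always gets the hint, but a
 * local buffer is not marked dirty while in a parallel section, so the
 * hint is simply lost if the buffer is evicted.
 */
void
mark_buffer_dirty_hint_sketch(BufferSketch *buf, unsigned hint,
                              bool in_parallel_mode)
{
    assert(buf != NULL);
    buf->hint_bits |= hint;
    if (buf->is_local && in_parallel_mode)
        return;                 /* keep the buffer clean */
    buf->dirty = true;
}

/* Opportunistic pruning is skipped under the same condition. */
bool
should_prune_opt_sketch(bool rel_uses_local_buffers, bool in_parallel_mode)
{
    return !(rel_uses_local_buffers && in_parallel_mode);
}
```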
A parallel worker must never dirty or write back a local
(temporary-relation) buffer: it shares the on-disk file with the leader
and the other workers, so any such write corrupts the file. The previous
commit already prevents workers from dirtying buffers by suppressing the
read-time hint and pruning paths; this adds defense in depth against any
path missed now or introduced later.
Guard the two chokepoints, gated on IsParallelWorker():
- MarkLocalBufferDirty(): the point where a local buffer becomes dirty.
- FlushLocalBuffer(): the point where one is written back. The check is
  placed before StartLocalBufferIO(), so an early return lets the caller
  (GetLocalVictimBuffer) invalidate the still-dirty buffer and discard
  the page rather than write it.
In assert-enabled builds these fail hard. In production they log once per
backend (to avoid flooding the log) and leave the buffer clean; refusing
the write is the safe action, since a worker has no legitimate temporary
write to lose.
The gate is IsParallelWorker(), not IsInParallelMode(), because the leader
legitimately writes temporary relations while in a parallel section -- for
example CREATE TABLE AS / SELECT INTO a temp table, or the transient temp
heap that REFRESH MATERIALIZED VIEW CONCURRENTLY fills -- using the
transaction's existing XID while the data-source query runs in parallel.
Those destinations are private to the leader and are never scanned by the
workers, so dirtying them is safe.
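A rough model of the two guards is sketched below. Everything in it is a hypothetical stand-in: the is_parallel_worker parameter represents IsParallelWorker(), SKETCH_ASSERT_BUILD represents an assert-enabled build, and the message counter replaces real logging.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct LocalBufferSketch
{
    bool        dirty;
} LocalBufferSketch;

int         guard_messages = 0;     /* log lines emitted by the sketch */
static bool guard_warned = false;   /* "log once per backend" latch */

/* Returns true when the caller is a worker and the write must be refused. */
static bool
refuse_worker_write(bool is_parallel_worker)
{
    if (!is_parallel_worker)
        return false;           /* the leader may write its temp relations */
#ifdef SKETCH_ASSERT_BUILD
    assert(!"parallel worker tried to write a temporary buffer");
#endif
    if (!guard_warned)
    {
        guard_warned = true;
        guard_messages++;       /* real code would log a message here */
    }
    return true;
}

/* Guard 1: the point where a local buffer would become dirty. */
void
mark_local_buffer_dirty_sketch(LocalBufferSketch *buf, bool is_parallel_worker)
{
    if (refuse_worker_write(is_parallel_worker))
        return;                 /* buffer stays clean */
    buf->dirty = true;
}

/*
 * Guard 2: the point where a dirty buffer would be written back.
 * Returns false when the write was refused; the caller then discards
 * the page instead of writing it.
 */
bool
flush_local_buffer_sketch(LocalBufferSketch *buf, bool is_parallel_worker,
                          int *pages_written)
{
    if (refuse_worker_write(is_parallel_worker))
        return false;
    if (buf->dirty)
    {
        (*pages_written)++;
        buf->dirty = false;
    }
    return true;
}
```

Keying the guard on the worker check rather than on parallel mode keeps the leader's own temporary writes, such as the CREATE TABLE AS destination mentioned above, working as before.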
No description provided.