Parallel Temp Tables - PG18#19

Open
danolivo wants to merge 10 commits into REL_18_STABLE from ptt-rel-18

Conversation

@danolivo
Owner

@danolivo danolivo commented Mar 9, 2026

No description provided.

@danolivo danolivo self-assigned this Mar 9, 2026
@danolivo danolivo added the enhancement New feature or request label Mar 9, 2026
@danolivo danolivo changed the title Parallel Temp Tables Parallel Temp Tables - PG18 Mar 11, 2026
@danolivo danolivo force-pushed the ptt-rel-18 branch 2 times, most recently from 478fc25 to c9244be Compare March 11, 2026 10:36
danolivo and others added 8 commits April 28, 2026 13:02
Builds PostgreSQL with cassert, debug, TAP tests, OpenSSL, ICU,
LZ4, and zstd, then runs the full make check-world suite.
On failure, collects regression diffs, logs, and TAP output as
downloadable artifacts and inlines diffs in the job summary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

This preliminary commit tracks the state of temporary buffers. The main goal
of this statistic is to give the optimiser the numbers it needs to estimate
the cost of flushing temporary buffers.

Such a flush may be necessary if the optimiser decides to build a plan that
includes a parallel section of the query, which involves scanning
a temporary table.
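
As a rough illustration of the statistic described above, the sketch below keeps a running count of dirty temporary buffers. All names (ToyTempBufStats, toy_note_dirtied, toy_note_flushed) are hypothetical and are not taken from the patch; the assumption is only that the counter changes on clean-to-dirty and dirty-to-clean transitions.

```c
#include <stdbool.h>

/* Hypothetical per-backend statistic; not the patch's actual structure. */
typedef struct ToyTempBufStats
{
    long        dirtied;        /* local buffers that are currently dirty */
} ToyTempBufStats;

/* Called when a local buffer is modified: count only clean -> dirty. */
static void
toy_note_dirtied(ToyTempBufStats *stats, bool *buf_dirty)
{
    if (!*buf_dirty)
    {
        *buf_dirty = true;
        stats->dirtied++;
    }
}

/* Called when a local buffer is written out: count only dirty -> clean. */
static void
toy_note_flushed(ToyTempBufStats *stats, bool *buf_dirty)
{
    if (*buf_dirty)
    {
        *buf_dirty = false;
        stats->dirtied--;
    }
}
```

With a counter like this, the planner could read the number of pages a flush would have to write without walking the local buffer pool.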

Change the concept of parallel safety slightly: a query subtree that includes
a temporary table scan may be parallel-safe, provided every buffer of that
temporary table has been flushed to disk. In this case, minor changes within
the planner and executor may allow scanning the temporary table in parallel.

With this commit, the optimiser uses the 'parallel_safe' flag to indicate that
the subtree refers to a source with temporary storage.

Path's parallel_safe field may be used in cost-based optimisation; Plan's
parallel_safe field indicates whether a Gather or GatherMerge node should flush
all temporary buffers before launching any parallel worker.

We don't make this flag very selective. If different paths of the same
RelOptInfo have various targets, we indicate that each path requires buffer
flushing, even if only one of them actually needs it.
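
One way to picture the widened flag is as three ordered levels, where a parent node is only as safe as its least safe child. This is a sketch under assumptions: the value names PARALLEL_UNSAFE and NEEDS_TEMP_FLUSH appear in later commit messages of this PR, but the enum layout, the PARALLEL_SAFE member, the ordering, and both helper functions are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical encoding; ordered from least safe to most safe. */
typedef enum ToyParallelSafety
{
    PARALLEL_UNSAFE = 0,        /* must not run below Gather */
    NEEDS_TEMP_FLUSH,           /* safe once temp buffers are on disk */
    PARALLEL_SAFE               /* safe unconditionally */
} ToyParallelSafety;

/* A parent inherits the weakest safety level of its inputs. */
static ToyParallelSafety
toy_combine_safety(ToyParallelSafety a, ToyParallelSafety b)
{
    return (a < b) ? a : b;
}

/* Gather / GatherMerge must flush only for the intermediate level. */
static bool
toy_gather_must_flush(ToyParallelSafety subtree)
{
    return subtree == NEEDS_TEMP_FLUSH;
}
```

The coarse marking described above corresponds to applying the combined level to every path of a RelOptInfo rather than tracking it per target.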

This commit adds a flag to Gather and GatherMerge that indicates whether
the subtree contains temporary tables. Additionally, to prevent multiple flush
attempts, EState has a flag that indicates whether temporary buffers have
already been written to disk.

Employing these two flags, Gather flushes temporary buffers before launching
any parallel worker.

Add some checks to detect accidental scanning of a temp table whose buffers
have not yet been flushed.
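
The interplay of the two flags can be sketched as follows. The struct and function names are hypothetical stand-ins (the real Gather state and EState carry many more fields), and the flush itself is reduced to a counter so that the once-only behaviour is visible.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the executor-wide state. */
typedef struct ToyEState
{
    bool        temp_buffers_flushed;   /* set after the first flush */
    int         flush_calls;            /* how often we really flushed */
} ToyEState;

/* Hypothetical stand-in for a Gather or GatherMerge node. */
typedef struct ToyGather
{
    bool        subtree_has_temp;       /* planner-provided flag */
} ToyGather;

/* Placeholder for writing out all dirty local buffers. */
static void
toy_flush_temp_buffers(ToyEState *estate)
{
    estate->flush_calls++;
    estate->temp_buffers_flushed = true;
}

/* Run by the leader before it launches any parallel worker. */
static void
toy_gather_before_launch(const ToyGather *node, ToyEState *estate)
{
    if (node->subtree_has_temp && !estate->temp_buffers_flushed)
        toy_flush_temp_buffers(estate);
}
```

A plan with several Gather nodes over temporary tables therefore pays for the flush only once per execution in this model.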

Consider the extra cost of flushing temporary tables in partial path
comparisons. With this commit, the optimiser gains a basis for a cost-based
decision on whether to enable parallel scans of subtrees that include temporary
tables. This is achieved by adding an extra 'flush buffers' weighting factor
to the path comparison routine.

The cost is simple to compute: track the number of dirtied temporary
buffers and multiply it by the write_page_cost parameter.
The functions compare_path_costs and compare_fractional_path_costs
were modified to account for this additional factor.

The temp buffer flush penalty was applied during compare_path_costs() to
paths marked NEEDS_TEMP_FLUSH. However, since Gather paths are always
PARALLEL_UNSAFE, this penalty never affected the critical Gather-vs-serial
comparison in pathlist -- the exact decision point where parallel vs serial
plans compete.

Move the flush cost accounting into cost_gather() and cost_gather_merge(),
where it belongs: the flush is a one-time startup cost performed by the
leader before launching workers. This makes the cost visible in the stored
path cost, so add_path() correctly weighs it when comparing Gather-wrapped
partial paths against serial alternatives.

Remove the comparison-time adjustment from compare_path_costs(), which also
eliminates false-positive penalties on paths that share a RelOptInfo with
temp-touching paths but don't themselves require a flush.
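
A minimal sketch of the relocated accounting, assuming the flush cost is the dirtied-buffer count multiplied by write_page_cost, as an earlier commit in this PR describes. The toy_ names and the ToyPathCost struct are hypothetical; the real cost_gather() also charges parallel setup and tuple transfer costs, which are omitted here.

```c
#include <stdbool.h>

typedef double Cost;

/* Hypothetical reduction of a path to its two stored costs. */
typedef struct ToyPathCost
{
    Cost        startup_cost;
    Cost        total_cost;
} ToyPathCost;

/* One page write per dirty local buffer. */
static Cost
toy_temp_flush_cost(long dirtied_buffers, Cost write_page_cost)
{
    return (Cost) dirtied_buffers * write_page_cost;
}

/*
 * Cost a Gather over 'subpath'.  The flush happens once, in the leader,
 * before workers start, so it is added to the startup cost and therefore
 * to the total cost as well.
 */
static ToyPathCost
toy_cost_gather(ToyPathCost subpath, bool needs_temp_flush,
                long dirtied_buffers, Cost write_page_cost)
{
    ToyPathCost result = subpath;

    if (needs_temp_flush)
    {
        Cost        flush = toy_temp_flush_cost(dirtied_buffers,
                                                write_page_cost);

        result.startup_cost += flush;
        result.total_cost += flush;
    }
    return result;
}
```

Because the penalty lives in the stored cost, any later comparison against a serial path sees it without special-casing in the comparison routine.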

The hasTempObject field in max_parallel_hazard_context was not
initialized in is_parallel_safe() or max_parallel_hazard(). When
the walker encountered a SubPlan it read the garbage value, which
could falsely set NEEDS_TEMP_FLUSH on paths unrelated to temp
tables, penalizing Gather Merge with flush cost and shifting plans
to Gather + Sort even for non-temp queries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
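
The class of bug fixed above can be avoided by initialising the whole context in one place. The struct below is a hypothetical, trimmed-down stand-in for max_parallel_hazard_context (only hasTempObject is named in the commit text), and toy_init_hazard_context is an illustrative helper, not a function from the patch.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, reduced version of the walker context. */
typedef struct ToyHazardContext
{
    char        max_hazard;         /* worst hazard seen so far */
    char        max_interesting;    /* stop walking at this level */
    bool        hasTempObject;      /* walker saw a temp relation */
} ToyHazardContext;

/*
 * Zero the whole struct first so that a field added later, such as
 * hasTempObject, can never be read as stack garbage.
 */
static void
toy_init_hazard_context(ToyHazardContext *ctx, char max_interesting)
{
    memset(ctx, 0, sizeof(*ctx));
    ctx->max_hazard = 's';          /* assume "safe" until proven otherwise */
    ctx->max_interesting = max_interesting;
}
```

Routing every entry point through one initialiser means a new field needs a default in only one place.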
danolivo added 2 commits June 4, 2026 22:58
With parallel scanning of temporary tables, the leader and each worker
keep their own private local-buffer pool over the same on-disk file. The
pre-launch flush makes that file consistent only at the moment workers
start; nothing keeps it consistent during the scan.

A "read-only" scan is not write-free. Setting hint bits
(MarkBufferDirtyHint -> MarkLocalBufferDirty) and opportunistic pruning
(heap_page_prune_opt) both dirty the page. When such a dirty local buffer
is evicted it is written back via FlushLocalBuffer -> smgrwrite, and that
write races with the other participants reading the same file through
their own pools. The result is torn pages and "invalid page in block N of
relation base/..." errors.

Suppress these read-time, purely optional modifications on temporary
(local-buffer) relations while in a parallel section:

  - MarkBufferDirtyHint(): for a local buffer, still set the hint on the
    page image, but do not mark the buffer dirty, so the change is
    discarded on eviction rather than written. This also covers index
    LP_DEAD "kill" hints, which take the same path.
  - heap_page_prune_opt(): skip opportunistic pruning for local-buffer
    relations.

Gate both on IsInParallelMode(), true for the leader and the workers
alike, so the leader's own scan participation is covered too. Both
operations are optimizations only, so correctness is unaffected; the hints
and pruning are re-established by ordinary, non-parallel access later.
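
The two suppressions can be modelled roughly as below. This is a toy model, not the patch: ToyBuffer and both functions are hypothetical, the in_parallel_mode argument stands in for IsInParallelMode(), and setting the hint is folded into the same function for brevity.

```c
#include <stdbool.h>

/* Hypothetical buffer descriptor. */
typedef struct ToyBuffer
{
    bool        is_local;       /* belongs to a temporary relation */
    bool        dirty;          /* will be written back on eviction */
    unsigned    hint_bits;      /* hints on the in-memory page image */
} ToyBuffer;

/*
 * Set a hint on the page image.  For a local buffer inside a parallel
 * section the buffer is left clean, so the hint is simply lost when the
 * buffer is evicted instead of being written to the shared file.
 */
static void
toy_mark_buffer_dirty_hint(ToyBuffer *buf, unsigned hint,
                           bool in_parallel_mode)
{
    buf->hint_bits |= hint;
    if (buf->is_local && in_parallel_mode)
        return;
    buf->dirty = true;
}

/* Opportunistic pruning is skipped under the same condition. */
static bool
toy_should_prune(const ToyBuffer *buf, bool in_parallel_mode)
{
    return !(buf->is_local && in_parallel_mode);
}
```

Shared buffers and non-parallel access to temporary relations keep their usual behaviour in this model.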

A parallel worker must never dirty or write back a local
(temporary-relation) buffer: it shares the on-disk file with the leader
and the other workers, so any such write corrupts the file. The previous
commit already prevents workers from dirtying buffers by suppressing the
read-time hint and pruning paths; this adds defense in depth against any
path missed now or introduced later.

Guard the two chokepoints, gated on IsParallelWorker():

  - MarkLocalBufferDirty(): the point where a local buffer becomes dirty.
  - FlushLocalBuffer(): the point where one is written back. The check is
    placed before StartLocalBufferIO(), so an early return lets the caller
    (GetLocalVictimBuffer) invalidate the still-dirty buffer and discard
    the page rather than write it.

In assert-enabled builds these fail hard. In production they log once per
backend (to avoid flooding the log) and leave the buffer clean; refusing
the write is the safe action, since a worker has no legitimate temporary
write to lose.

The gate is IsParallelWorker(), not IsInParallelMode(), because the leader
legitimately writes temporary relations while in a parallel section -- for
example CREATE TABLE AS / SELECT INTO a temp table, or the transient temp
heap that REFRESH MATERIALIZED VIEW CONCURRENTLY fills -- using the
transaction's existing XID while the data-source query runs in parallel.
Those destinations are private to the leader and are never scanned by the
workers, so dirtying them is safe.
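
A sketch of the production-build behaviour of the two guards follows. Everything here is hypothetical: the toy_ functions, the ToyLocalBuffer struct, and the is_worker argument that stands in for IsParallelWorker(). The assert-enabled variant, which fails hard, is not modelled.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical local buffer with a write counter for observation. */
typedef struct ToyLocalBuffer
{
    bool        dirty;
    int         writes;
} ToyLocalBuffer;

static bool toy_warned = false;     /* one warning per backend */
static int  toy_warnings_logged = 0;

/* Returns true when the caller must refuse the operation. */
static bool
toy_worker_write_refused(bool is_worker, const char *where)
{
    if (!is_worker)
        return false;               /* the leader may write temp relations */
    if (!toy_warned)
    {
        toy_warned = true;
        toy_warnings_logged++;
        fprintf(stderr, "worker attempted temp write in %s\n", where);
    }
    return true;
}

/* Chokepoint 1: the buffer becomes dirty. */
static void
toy_mark_local_buffer_dirty(ToyLocalBuffer *buf, bool is_worker)
{
    if (toy_worker_write_refused(is_worker, "mark dirty"))
        return;                     /* leave the buffer clean */
    buf->dirty = true;
}

/* Chokepoint 2: the buffer is written back; false means "not written". */
static bool
toy_flush_local_buffer(ToyLocalBuffer *buf, bool is_worker)
{
    if (toy_worker_write_refused(is_worker, "flush"))
        return false;               /* caller discards the page instead */
    buf->writes++;
    buf->dirty = false;
    return true;
}
```

Gating on the worker role rather than on parallel mode keeps the leader's writes working in this model, matching the CREATE TABLE AS case described above.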