Skip to content

Disconnect stale automation peers from UIA to fix memory leak in virtualized ItemsControls#11657

Open
amarinov-msft wants to merge 3 commits into
dotnet:mainfrom
amarinov-msft:amarinov-fix/automation-peer-disconnect-stale
Open

Disconnect stale automation peers from UIA to fix memory leak in virtualized ItemsControls#11657
amarinov-msft wants to merge 3 commits into
dotnet:mainfrom
amarinov-msft:amarinov-fix/automation-peer-disconnect-stale

Conversation

@amarinov-msft

@amarinov-msft amarinov-msft commented May 21, 2026

Copy link
Copy Markdown

Fixes #11337

Description

When a UIA client (screen reader, test-automation, inspector) repeatedly enumerates a virtualized ItemsControl (e.g. a large DataGrid), WPF accumulates two kinds of never-released objects, producing an unbounded leak:

  1. AutomationPeersElementProxy held its peer with a strong reference, and the UIA CCW held the proxy, so every peer surfaced to a client (and the data row / visual sub-tree it roots) was pinned forever.
  2. ElementProxy CCWs — even once a peer is no longer referenced, the proxy itself stays rooted by its UIA COM-callable wrapper (CCW), dragging its WeakReference, runtime-id and name strings with it.

This PR addresses both:

  • Weakly-held peers. ElementProxy now holds UIElement/ContentElement/ UIElement3D peers - and newly data-item peers - via WeakReference, so UIA references no longer pin recycled/virtualized peers. Data-item behavior is gated by the Switch.System.Windows.Automation.Peers.UseStrongReferenceForItemAutomationPeers AppContext switch (default false = weak; set true to restore legacy strong refs).
  • Keep-alive window. Because a weakly-held data-item peer could otherwise be collected in the middle of a live UIA traversal (surfacing ElementNotAvailableException), each peer surfaced to or touched by UIA is strong-rooted in a small rotating bucket for a bounded time window (one-to-two windows ≈ 9–18 s) via a Background-priority DispatcherTimer, then released. Retention is bounded by the in-flight working set, not the data-set size.
  • Proxy/CCW disconnect. A second Background-priority DispatcherTimer sweep (~2 s cadence, capped at 5000 disconnects/tick) calls the new UiaDisconnectProvider on ElementProxy instances whose weak peer has already been collected, releasing the CCW so the proxy and everything it roots can be reclaimed. Gated by Switch.System.Windows.Automation.Peers.DisableUiaProviderDisconnect (default false = sweep enabled). Both timers self-stop when there is nothing to track.

A new public API, AutomationInteropProvider.DisconnectProvider(IRawElementProviderSimple), is added (with matching reference-assembly entry) to expose UiaDisconnectProvider to the bridge.

UIA clients may observe ElementNotAvailableException when accessing an element whose peer has been collected/disconnected — standard UIA contract behavior that well-behaved clients already handle.

Customer Impact

WPF applications that expose a virtualized ItemsControl (DataGrid, ListView, etc.) to UI Automation experience an unbounded memory leak (≈260 MB/min in the customer's 200k-row DataGrid repro), ending in OutOfMemoryException and crashes in long-running sessions. Any app with accessibility active (assistive technology or test automation present) is affected. The customer is currently held on a workaround.

Regression

No.

Testing

Controlled customer repro (200k-row DataGrid + a UIA client driving FindAll ~1×/s), memory sampled after forced Gen2 GC:

  • **Baseline (no fix): ** ≈ +260 MB/min; ElementProxy and data-item peers grow without bound.
  • **Peer fix only (weak refs + keep-alive): ** AutomationPeers bounded (e.g. DataGridCellItemAutomationPeer flat/declining), but ElementProxy CCWs still grew (~3.7× over 15 min).
  • **Full fix (peer fix + proxy disconnect sweep): ** ElementProxy bounded over a 12-min run (2.5k ->3.6k, not growing); process working set flat (≈614–892 MB); no OOM.
  • Kill-switch paths verified: setting either AppContext switch restores the prior behavior.

Risk

Moderate, well-contained, and fully reversible via two AppContext kill-switches.
The risk areas are:

  1. Weak data-item peers (behavioral default change). Data-item peers are now held weakly by default. Any code that implicitly relied on a peer staying alive purely because UIA had seen it could see it collected earlier. Mitigated by the keep-alive window and fully reversible via UseStrongReferenceForItemAutomationPeers=true.
  2. Keep-alive retention window. Peers are deliberately retained for ≈9–18 s after their last UIA touch. This is the main new behavioral trade-off: a small, bounded increase in peak retention in exchange for safety against mid-walk collection. The 9 s window is sized around 2x the measured worst-case gap (around 4.4 s under forced-Gen2 GC) between an element's surface-time touch and its read-back touch; pathological client timing slower than the window could still surface ElementNotAvailableException.
  3. New public API surface. AutomationInteropProvider.DisconnectProvider is additive (plus a ref-assembly entry) and needs API-review sign-off; it is a thin wrapper over the documented UiaDisconnectProvider.
  4. Background disconnect sweep. Runs on the UI Dispatcher at Background priority and only disconnects proxies whose peer is already collected (which already throw ENA), so it cannot disturb a walk of a live element. Work is capped per tick (5000). Failed-disconnect handling. AutomationInteropProvider.DisconnectProvider can occasionally fail transiently (e.g. re-entrant input-sync, or a proxy whose runtime id was never cached because GetRuntimeId was never called on it). ElementProxy.Disconnect() returns a success/failure result, and the sweep re-queues a failed proxy for a later tick - its CCW is a GC root, so silently dropping it would leak the proxy. A per-proxy attempt counter ( failedAttempts) caps this at MaxDisconnectAttempts (8): genuine transients clear within 1–2 ticks and never approach the cap, while a proxy that fails every time is abandoned after 8 tries (leaking only the lightweight shell) so the timer can still self-stop instead of spinning forever. Reversible via DisableUiaProviderDisconnect=true.
  5. ENA timing. Clients may now receive ElementNotAvailableException during steady state (not only on tree changes). Conformant clients (Narrator, NVDA, JAWS) handle this; non-conformant tools that don't catch it could surface new errors.
  6. A client that retrieves an element for an unrealized item via FindItemByProperty and then calls VirtualizedItemPattern.Realize after the keep-alive window (~10–18 s) will get  ElementNotAvailableException, since the peer was collected and its proxy disconnected. This is within the UIA ENA contract but changes the pattern's previously indefinite realizability. Kill switch DisableUiaProviderDisconnect restores the old behavior.
  7. WPF runtime ids are {7, ProcessId, GetHashCode()}, so identity-hash collisions are possible. Because UiaDisconnectProvider matches by runtime id, a dead proxy's cached id that collides with a live peer's could invalidate a client's view of a healthy element. Pre-existing weakness; low probability, but the disconnect sweep is the first path to exercise it at scale.
Microsoft Reviewers: Open in CodeFlow

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a memory leak in WPF UI Automation (UIA) for virtualized ItemsControl scenarios by explicitly disconnecting UIA providers for removed automation peers, allowing COM references held by UIA Core to be released and enabling GC of stale peers and their visual subtrees.

Changes:

  • Added a helper to disconnect an AutomationPeer’s ElementProxy from UIA via UiaDisconnectProvider.
  • Updated UpdateChildrenInternal to disconnect removed children even when no StructureChanged listeners are registered.
  • Added the required interop import to call into UIAutomationCore.dll.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1838 to +1843
if (proxyWeakRef?.Target is ElementProxy proxy)
{
UiaDisconnectProvider(proxy);
}

peer._elementProxyWeakReference = null;

@h3xds1nz h3xds1nz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amarinov-msft I'm really worried about perf with these changes. UIA implementation itself is already a really heavy mess (things like #10676 and more), and this will further increase the weight (an async DispatcherOperation for every disconnected peer), further enumeration over the intermediate collections.

Now, I don't have an exact solution here since in our wpf fork I've reworked this extensively but I think it's worth a further investigation and profiling to fix this.

if (dispatcher != null && !dispatcher.HasShutdownStarted)
{
dispatcher.BeginInvoke(DispatcherPriority.Background,
new Action(() => UiaDisconnectProvider(proxy)));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates a closure unnecessarily and a new action everytime. Ideally use the parameter variant, it's an object you're passing so no boxing here.

// UiaDisconnectProvider must NOT be called during a UIA callback (e.g. during Navigate/FindAll handling).
// UpdateChildrenInternal is invoked from within UIA callbacks, so defer the actual COM disconnect to a separate dispatcher operation.
Dispatcher dispatcher = peer.Dispatcher;
if (dispatcher != null && !dispatcher.HasShutdownStarted)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is even required/true; the UIA callbacks should already run on the Dispatcher synchronously iirc, so that check seems redundant. If invoked from proxy directly, this would already throw if it was the case.

/// </remarks>
private static void DisconnectPeerFromUia(AutomationPeer peer)
{
if (peer == null)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is redundant since you're already checking in the calling function.

}

[DllImport("UIAutomationCore.dll", EntryPoint = "UiaDisconnectProvider", CharSet = CharSet.Unicode)]
private static extern int UiaDisconnectProvider(IRawElementProviderSimple provider);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be defined in this file.

// This must happen regardless of StructureChanged event registration.
if (removedCount > 0)
{
foreach (AutomationPeer removedChild in hs)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a sigh for another PR but this whole thing should be turned into a simple merge walk to differentiate old and new and avoid the HashSet allocations altogether. Sometimes this is tens of MBs for something that could have been a simple loop.

The least that can be done here is to repeat this loop over the HashSet only once; by either doing all the StructureChanged stuff or not.

if (old._elementProxyWeakReference == null)
continue;

if (_children == null || !_children.Contains(old))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This basically O(n*m), if this is a top automation peer, it's gonna be pretty heavy.

amarinov-msft added a commit to amarinov-msft/wpf that referenced this pull request Jun 18, 2026
- Cache static DispatcherOperationCallback; drop closure/Action allocation per call.
- Remove redundant Dispatcher null/HasShutdownStarted guard (matches InvalidatePeer pattern).
- Remove redundant peer null check inside DisconnectPeerFromUia (callers guarantee non-null).
- Move UiaDisconnectProvider P/Invoke onto ElementProxy as a single Disconnect() method; collapse ClearPeer() into it. Use DllImport.UIAutomationCore string constant per codebase convention.
- Build HashSet of new children lazily in EnsureChildren to avoid O(n*m) Contains scan; common path stays allocation-free.
- Drop redundant disconnect loop from UpdateChildrenInternal (EnsureChildren already handles it on this path); revert the merged loop back to events-only.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@amarinov-msft

Copy link
Copy Markdown
Author

@h3xds1nz thank you for your review. I tried addressing your comments. While the fundamental question about this issue is resolved (performance vs behavior), please check the updates when you have the opportunity and let me know if they are in the direction that you had in mind.

@h3xds1nz h3xds1nz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amarinov-msft Just a few nits, I guess this is acceptable at the current codebase state to resolve the issue.

peer._childrenValid = false;
}

private static readonly DispatcherOperationCallback _disconnectProxyCallback = static arg =>

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to use SendOrPostCallback which doesn't return but is special-cased.

if (callback is SendOrPostCallback sendOrPostCallback)

_peer = null;
}

[DllImport(DllImport.UIAutomationCore, EntryPoint = "UiaDisconnectProvider", CharSet = CharSet.Unicode)]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I've mentioned it before, I meant that the common practice is to put it in specialized file; in this case that would be https://github.com/dotnet/wpf/blob/main/src/Microsoft.DotNet.Wpf/src/UIAutomation/UIAutomationProvider/MS/Internal/Automation/UiaCoreProviderApi.cs ... the provider library would have been loaded at this point anyway.

@amarinov-msft

Copy link
Copy Markdown
Author

@h3xds1nz thank you for your review and effort! However, while testing this - I found a regression introduced by my changes where visible peers are being treated as stale and disconnected. The disconnect-based approach in the PR seems to be fundamentally incompatible with how UIA traverses peers. We probably need to change the approach completely.

@h3xds1nz

Copy link
Copy Markdown
Member

@amarinov-msft No worries; I unfortunately cannot test this as I'm unable to replicate the memory leak with the repro whatsoever. Plus WPFAutomationClient report seems invalid, given that there's literally no GC pressure, just forcing a GC manually clears out everything (not that this fix would be an attempt to solve that).

@amarinov-msft amarinov-msft force-pushed the amarinov-fix/automation-peer-disconnect-stale branch from c0b8c49 to cff011c Compare June 24, 2026 08:42
…item peers

ElementProxy now holds a weak reference to data-item automation peers (those
for which AutomationPeer.IsDataItemAutomationPeer() returns true: ItemAutomationPeer
and its subclasses, DataGridCellItemAutomationPeer, and DateTimeAutomationPeer),
matching the weak-reference treatment already applied to UIElement/ContentElement/
UIElement3D peers. This lets UI Automation client references release virtualized
item peers - and the controls they transitively root - so they can be collected,
fixing unbounded provider-side growth (customer OOM on a ~200k-row DataGrid under
continuous UIA querying).

To avoid an ElementNotAvailableException if a peer is collected mid-walk or during
property readback, a short-lived strong "keep-alive" (PeerKeepAlive) roots every
peer touched at StaticWrap/Peer access for a bounded window using a two-bucket
rotation. The 9s window was sized from a measured worst-case FindAll+readback
(~4.4s under GC stress, validated 16/16 trials clean across GC modes).

A new opt-out AppContext switch,
Switch.System.Windows.Automation.Peers.UseStrongReferenceForItemAutomationPeers,
restores the legacy strong-reference behavior.

Supersedes the earlier disconnect-based approach, which had an unrecoverable
mid-walk regression.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@amarinov-msft amarinov-msft force-pushed the amarinov-fix/automation-peer-disconnect-stale branch from cff011c to 6604bff Compare June 24, 2026 09:00
amarinov-msft added a commit to amarinov-msft/wpf that referenced this pull request Jun 29, 2026
PR dotnet#11657 bounded automation peers but ElementProxy bridge objects whose
weak peer had been collected stayed rooted by their UI Automation wrapper
(CCW), so memory still grew under a sustained UIA client.

Add a background sweep that calls UiaDisconnectProvider on ElementProxy
instances whose weak peer is gone (which already throw ENA, so the sweep
cannot disturb a walk of a live element). The sweep runs on a Background
DispatcherTimer, is bounded per tick, and self-stops when idle. Weak
proxies cache their runtime id so disconnect can still serve it.

Gated by the Switch.System.Windows.Automation.Peers.DisableUiaProviderDisconnect
AppContext kill-switch (default off = sweep enabled).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@amarinov-msft amarinov-msft force-pushed the amarinov-fix/automation-peer-disconnect-stale branch from 510e300 to 95d4f3b Compare June 29, 2026 16:53
amarinov-msft added a commit to amarinov-msft/wpf that referenced this pull request Jun 29, 2026
PR dotnet#11657 bounded automation peers but ElementProxy bridge objects whose
weak peer had been collected stayed rooted by their UI Automation wrapper
(CCW), so memory still grew under a sustained UIA client.

Add a background sweep that calls UiaDisconnectProvider on ElementProxy
instances whose weak peer is gone (which already throw ENA, so the sweep
cannot disturb a walk of a live element). The sweep runs on a Background
DispatcherTimer, is bounded per tick, and self-stops when idle. Weak
proxies cache their runtime id so disconnect can still serve it.

Gated by the Switch.System.Windows.Automation.Peers.DisableUiaProviderDisconnect
AppContext kill-switch (default off = sweep enabled).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@amarinov-msft amarinov-msft force-pushed the amarinov-fix/automation-peer-disconnect-stale branch from 95d4f3b to 25feec4 Compare June 29, 2026 17:03
amarinov-msft added a commit to amarinov-msft/wpf that referenced this pull request Jun 29, 2026
PR dotnet#11657 bounded automation peers but ElementProxy bridge objects whose
weak peer had been collected stayed rooted by their UI Automation wrapper
(CCW), so memory still grew under a sustained UIA client.

Add a background sweep that calls UiaDisconnectProvider on ElementProxy
instances whose weak peer is gone (which already throw ENA, so the sweep
cannot disturb a walk of a live element). The sweep runs on a Background
DispatcherTimer, is bounded per tick, and self-stops when idle. Weak
proxies cache their runtime id so disconnect can still serve it.

Gated by the Switch.System.Windows.Automation.Peers.DisableUiaProviderDisconnect
AppContext kill-switch (default off = sweep enabled).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@amarinov-msft amarinov-msft force-pushed the amarinov-fix/automation-peer-disconnect-stale branch from 25feec4 to 59b1f02 Compare June 29, 2026 18:55
amarinov-msft added a commit to amarinov-msft/wpf that referenced this pull request Jun 30, 2026
PR dotnet#11657 bounded automation peers but ElementProxy bridge objects whose
weak peer had been collected stayed rooted by their UI Automation wrapper
(CCW), so memory still grew under a sustained UIA client.

Add a background sweep that calls UiaDisconnectProvider on ElementProxy
instances whose weak peer is gone (which already throw ENA, so the sweep
cannot disturb a walk of a live element). The sweep runs on a Background
DispatcherTimer, is bounded per tick, and self-stops when idle. Weak
proxies cache their runtime id so disconnect can still serve it.

Gated by the Switch.System.Windows.Automation.Peers.DisableUiaProviderDisconnect
AppContext kill-switch (default off = sweep enabled).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@amarinov-msft amarinov-msft force-pushed the amarinov-fix/automation-peer-disconnect-stale branch from 59b1f02 to 1d053aa Compare June 30, 2026 05:19
amarinov-msft added a commit to amarinov-msft/wpf that referenced this pull request Jun 30, 2026
PR dotnet#11657 bounded automation peers but ElementProxy bridge objects whose
weak peer had been collected stayed rooted by their UI Automation wrapper
(CCW), so memory still grew under a sustained UIA client.

Add a background sweep that calls UiaDisconnectProvider on ElementProxy
instances whose weak peer is gone (which already throw ENA, so the sweep
cannot disturb a walk of a live element). The sweep runs on a Background
DispatcherTimer, is bounded per tick, and self-stops when idle. Weak
proxies cache their runtime id so disconnect can still serve it.

Gated by the Switch.System.Windows.Automation.Peers.DisableUiaProviderDisconnect
AppContext kill-switch (default off = sweep enabled).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@amarinov-msft amarinov-msft force-pushed the amarinov-fix/automation-peer-disconnect-stale branch from 1d053aa to 0fbde13 Compare June 30, 2026 08:45
amarinov-msft added a commit to amarinov-msft/wpf that referenced this pull request Jul 1, 2026
PR dotnet#11657 bounded automation peers but ElementProxy bridge objects whose
weak peer had been collected stayed rooted by their UI Automation wrapper
(CCW), so memory still grew under a sustained UIA client.

Add a background sweep that calls UiaDisconnectProvider on ElementProxy
instances whose weak peer is gone (which already throw ENA, so the sweep
cannot disturb a walk of a live element). The sweep runs on a Background
DispatcherTimer, is bounded per tick, and self-stops when idle. Weak
proxies cache their runtime id so disconnect can still serve it.

Gated by the Switch.System.Windows.Automation.Peers.DisableUiaProviderDisconnect
AppContext kill-switch (default off = sweep enabled).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@amarinov-msft amarinov-msft force-pushed the amarinov-fix/automation-peer-disconnect-stale branch from 0fbde13 to 3ae37fc Compare July 1, 2026 05:52
PR dotnet#11657 bounded automation peers but ElementProxy bridge objects whose
weak peer had been collected stayed rooted by their UI Automation wrapper
(CCW), so memory still grew under a sustained UIA client.

Add a background sweep that calls UiaDisconnectProvider on ElementProxy
instances whose weak peer is gone (which already throw ENA, so the sweep
cannot disturb a walk of a live element). The sweep runs on a Background
DispatcherTimer, is bounded per tick, and self-stops when idle. Weak
proxies cache their runtime id so disconnect can still serve it.

Gated by the Switch.System.Windows.Automation.Peers.DisableUiaProviderDisconnect
AppContext kill-switch (default off = sweep enabled).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@amarinov-msft amarinov-msft force-pushed the amarinov-fix/automation-peer-disconnect-stale branch from 3ae37fc to a256bd2 Compare July 1, 2026 06:03
Apply low-risk review suggestions to the data-item peer keep-alive:
- Move the (UseStrongReference / IsDataItemAutomationPeer) eligibility
  guard into PeerKeepAlive.KeepAlive so the two call sites cannot drift,
  and additionally honor the legacy AutomationWeakReferenceDisallow key
  (AutomationInteropReferenceType) that the StaticWrap stamp ignored.
- Skip the striped-lock write when a peer is already stamped for the
  current generation (repeat touches are the common case).
- Document that IsPeerCollected must not touch Peer, since doing so would
  hand out a transient strong reference and defeat the sweep.

No behavioral change to the leak fix; keep-alive and disconnect semantics
are unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

PR metadata: Label to tag PRs, to facilitate with triage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WPF leaks ElementProxy instances when UI Automation is used

3 participants