feat(agents): secure-by-default private networking via azure.yaml #8708
Merged
m5i-work merged 26 commits on Jun 24, 2026
Add a declarative network: block to the Foundry service in azure.yaml and
teach the bicep-less synthesizer to provision a VNet-bound (network-secured)
account from it. Additive: an absent block yields today's public account.
- schema: network: surface (mode byo|managed, byo vnet/subnets tri-state,
managed isolationMode, dns create-or-reference) on microsoft.foundry.json
- synthesizer: decode network:, resolve ${VAR}, validate (mode coherence,
vnet ARM id, subnet tri-state/CIDR, DNS rg/sub), emit network params +
NetworkMode for telemetry
- templates: new modules/network.bicep, subnet.bicep, private-endpoint-dns.bicep;
resources.bicep/main.bicep guard the network path on enableNetworkIsolation
(publicNetworkAccess Disabled, networkAcls Deny, agent networkInjections,
account private endpoint + 3 AI DNS zones); main.arm.json regenerated
- provider: pass azd env for ${VAR} resolution, emit provision.network_mode,
warn that network: is ignored when endpoint: (brownfield) is set
- docs/tests: synthesizer network tests, eject module assertions, extension
README network section, telemetry-data.md provision.network_mode
The existing on-disk provision flow resolves ${VAR} in main.parameters.json
from the azd environment at provision time. Eject must therefore keep ${VAR}
references verbatim instead of resolving them eagerly from the process env and
freezing a literal into the ejected file.
- synthesis.Input gains PreserveVarRefs; when set, byo.vnet.id and
dns.subscription pass through verbatim and the format checks that cannot run
on an unexpanded placeholder are skipped (concrete-but-malformed still fails)
- eject (init --infra) sets PreserveVarRefs so the ejected main.parameters.json
stays environment-portable; the provision path still resolves and validates
- tests: synthesizer preserve-mode (pass-through + concrete validation) and an
eject e2e asserting ${AZURE_VNET_ID}/${AZURE_DNS_SUBSCRIPTION_ID} survive
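The preserve-versus-resolve split described above can be sketched as follows. This is a minimal illustration, not the synthesizer's actual code: `resolveValue`, `validateVNetID`, the placeholder grammar, and the "unset variable is an error" behavior are all assumptions made for the example; only the `PreserveVarRefs` idea and the `${AZURE_VNET_ID}` placeholder come from the commit message.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// varRef matches ${NAME} placeholders (assumed grammar, for illustration only).
var varRef = regexp.MustCompile(`\$\{([A-Za-z_][A-Za-z0-9_]*)\}`)

// resolveValue expands placeholders from env, unless preserve is set, in which
// case the string passes through verbatim (the eject behavior).
// Treating an unset variable as an error is an assumption of this sketch.
func resolveValue(s string, env map[string]string, preserve bool) (string, error) {
	if preserve {
		return s, nil
	}
	var missing []string
	out := varRef.ReplaceAllStringFunc(s, func(m string) string {
		name := varRef.FindStringSubmatch(m)[1]
		v, ok := env[name]
		if !ok {
			missing = append(missing, name)
			return m
		}
		return v
	})
	if len(missing) > 0 {
		return "", fmt.Errorf("unset variables: %s", strings.Join(missing, ", "))
	}
	return out, nil
}

// validateVNetID skips the format check on an unexpanded placeholder but still
// rejects a concrete value that is malformed.
func validateVNetID(id string) error {
	if varRef.MatchString(id) {
		return nil // cannot check a placeholder; provision time validates it
	}
	lower := strings.ToLower(id)
	if !strings.HasPrefix(lower, "/subscriptions/") ||
		!strings.Contains(lower, "/providers/microsoft.network/virtualnetworks/") {
		return fmt.Errorf("malformed vnet id %q", id)
	}
	return nil
}

func main() {
	env := map[string]string{
		"AZURE_VNET_ID": "/subscriptions/s/resourceGroups/rg/providers/Microsoft.Network/virtualNetworks/v",
	}
	ejected, _ := resolveValue("${AZURE_VNET_ID}", env, true)
	provisioned, _ := resolveValue("${AZURE_VNET_ID}", env, false)
	fmt.Println(ejected)
	fmt.Println(provisioned)
	fmt.Println(validateVNetID(ejected), validateVNetID("not-an-id"))
}
```

The design point is that eject and provision share one validation routine; the placeholder check decides which rules can run, so a concrete but malformed value fails in both paths.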
Bash E2E harness validating host: microsoft.foundry private networking,
designed to minimize Azure resource-operation time:
- ONE real network account is provisioned (create+own matrix cell) with a BYO
--image agent, then deploy + invoke prove the agent works under the VNet.
- Scenario 1 (bicep-less) and the other 3 matrix cells (subnet create/reference
x DNS own/reference) are verified with 'azd provision --preview' (ARM what-if),
which creates nothing.
- Scenario 2 (eject) is verified against the live account: eject -> what-if
reports no changes (idempotent), proving the on-disk template + provision-time
${VAR} resolution reproduces the in-memory topology; a manual infra/ edit then
surfaces as the only delta. Guards the ${VAR}-preservation fix end-to-end.
- A shared BYO VNet (+ reference subnets / external DNS zones) is created once
and reused across cells.
Files: run-network-e2e.sh (phases 0-6 orchestrator), assert-resources.sh (live
az topology checks: publicNetworkAccess Disabled, account PE groupIds, 3 AI DNS
zones, agent-subnet delegation), lib.sh (logging/assert/azure.yaml mutation),
README.md (cost rationale, prerequisites, cleanup). Westus account region per
requirement; AcrPull granted to the project MI on the ABAC registry.
Decouple the private-networking E2E from the BYO-image init UX (PR 8689) so it
runs against the current branch today:
- Replace 'azd ai agent init --image' with a hand-authored azure.yaml fixture
(foundry service + network: block + agent image:), created via 'azd env new'.
image: yields includeAcr=false, matching BYO image, so no ACR at provision.
Verified the fixture synthesizes: mode=byo, enableNetworkIsolation=true,
includeAcr=false, ${VAR} resolved.
- Gate phase 5 (deploy + invoke) behind RUN_DEPLOY=true: the headless BYO-image
deploy needs the AZD_AGENT_SKIP_ACR short-circuit from PR 8689, otherwise
deploy defaults to build and fails. Phases 0-4 (local gates, shared VNet,
what-if matrix, one real provision, eject idempotency) validate all the
networking code without it.
- Fix the ABAC registry role: grant 'Container Registry Repository Reader'
(ABAC-aware) instead of AcrPull; move the grant into the gated deploy phase.
- Drop the --image preflight; README updated (scenario table, prerequisites,
RUN_DEPLOY usage, role).
…sing
Two product bugs surfaced by live E2E provisioning (ARM what-if does not catch either; both require a real deployment):
1. networkInjections preflight failure. The account and the network module deploy in the same template, so subnetArmId: network!.outputs.agentSubnetId compiled to an unresolved reference() at the CognitiveServices RP preflight, which then failed to convert networkInjections to its typed contract (InvalidResourceProperties). Build the subnet ARM id as a deterministic string from the concrete vnetId param instead, and add an explicit dependsOn so the subnet still exists before injection. Recompiled main.arm.json.
2. AZURE_FOUNDRY_NETWORK_MODE missing from canonicalOutputNames. ARM mangles output-name casing (AZURE_..._MODE -> azurE_..._MODE); without the canonical remapping the env key was stored mis-cased and azd env get-value AZURE_FOUNDRY_NETWORK_MODE returned empty. Added it to the restore list and a regression case to TestArmOutputsToProto_RepairsMangledKeyCase.
Validated end-to-end: real westus network-isolated Foundry account provisions green with all topology assertions passing (publicNetworkAccess Disabled, networkAcls Deny, private endpoint, agent-subnet delegation, 3 AI DNS zones, network mode byo), across the full subnet create/reference x DNS own/reference matrix, plus eject idempotency (what-if reports no changes).
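Both fixes are small enough to sketch. The code below is an illustration under stated assumptions, not the provider's implementation: `repairKeyCase` and `subnetARMID` are hypothetical helper names, the canonical list is a two-entry subset taken from names that appear in these commit messages, and the mangled key used in `main` is a made-up input.

```go
package main

import (
	"fmt"
	"strings"
)

// canonicalOutputNames is an illustrative subset; the real restore list lives
// in the provider and is longer.
var canonicalOutputNames = []string{
	"AZURE_FOUNDRY_NETWORK_MODE",
	"AZURE_FOUNDRY_MANAGED_ISOLATION_MODE",
}

// repairKeyCase restores the canonical casing of any output key that matches a
// canonical name case-insensitively; unknown keys are left untouched.
func repairKeyCase(outputs map[string]string) map[string]string {
	canon := make(map[string]string, len(canonicalOutputNames))
	for _, n := range canonicalOutputNames {
		canon[strings.ToUpper(n)] = n
	}
	fixed := make(map[string]string, len(outputs))
	for k, v := range outputs {
		if c, ok := canon[strings.ToUpper(k)]; ok {
			fixed[c] = v
			continue
		}
		fixed[k] = v
	}
	return fixed
}

// subnetARMID builds the subnet id as a plain string from a concrete VNet id,
// so nothing is left to resolve at preflight time.
func subnetARMID(vnetID, subnetName string) string {
	return strings.TrimRight(vnetID, "/") + "/subnets/" + subnetName
}

func main() {
	// Hypothetical mangled key, standing in for what ARM hands back.
	got := repairKeyCase(map[string]string{"azurE_FOUNDRY_NETWORK_MODE": "byo"})
	fmt.Println(got["AZURE_FOUNDRY_NETWORK_MODE"])
	fmt.Println(subnetARMID(
		"/subscriptions/s/resourceGroups/rg/providers/Microsoft.Network/virtualNetworks/v",
		"agents"))
}
```

In the template itself the string id is paired with an explicit dependsOn, since a hand-built id no longer gives the deployment engine an implicit ordering edge.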
Fixes found while running the harness against live Azure (phases 0-4):
- Hand-authored project must include an agent.yaml (kind: hosted + image:) alongside azure.yaml; the foundry provider requires an agent definition file.
- setup_project now sets AZURE_RESOURCE_GROUP (the subscription-scoped template creates the RG but the provider needs the name) and AZD_AGENT_SKIP_ACR=true (BYO-image deploy signal).
- Phase 0 refreshes the dev extension from current source (build -> pack -> publish -> install) so the run tests local code, registering the provisioning-provider capability + microsoft.foundry provider. Gated by SKIP_EXT_REFRESH.
- What-if matrix gates on a successful ARM what-if (exit 0) rather than grepping a summary-only preview; this still validates reference-mode subnet/zone existence and delegation against the real VNet.
- Idempotent private-dns zone creation (reruns no longer fail on existing zones).
- Add MAX_PHASE to stop early while iterating.
- ACR grant uses the ABAC-aware Container Registry Repository Reader role.
- Fix set -u unbound-variable crash in the phase-4 assert message.
- .gitignore the transient per-run log directories.
Phases 0-4 (local gates, shared VNet, what-if matrix, one real provision + topology assertions, eject idempotency) pass green. Phase 5 (deploy + invoke) stays gated behind RUN_DEPLOY=true and needs a reachable BYO agent image.
Update the Foundry private-network E2E harness so phase 5 can build the ~/agents/echo-dual image itself instead of requiring a prebuilt external image.
- Add BUILD_IMAGE=true, ECHO_DUAL_DIR, ACR_NAME/ACR_RG, IMAGE_REPO/IMAGE_TAG.
- Create the target ACR with --role-assignment-mode rbac-abac and reject reuse of non-ABAC registries.
- Grant the caller Container Registry Repository Writer before the ACR Task push. Resolve the caller object id from the ARM token oid claim to avoid Microsoft Graph / Conditional Access failures.
- Build with the required `az acr build --source-acr-auth-id [caller]` form.
- Keep the project MI grant on the ABAC-aware Container Registry Repository Reader role for image pull.
- Add TARGET_RG support so investigation runs can keep VNet, DNS, ACR, and the real Foundry env in a single RG.
Live validation: the harness created an ABAC ACR, granted caller writer, built and pushed ~/agents/echo-dual with --source-acr-auth-id [caller], provisioned a private-networked Foundry account, and granted the project MI Repository Reader. The subsequent deploy failed from this public runner with the expected private endpoint 403, which is documented.
Live phase-5 validation showed hosted-agent image pull uses the Foundry project managed identity, not the parent account identity. Update the network E2E harness to resolve AZURE_AI_PROJECT_ID via ARM and grant the project MI the ABAC-aware Container Registry Repository Reader role on the BYO ACR, falling back to the account MI only for older API shapes. Also persist AZURE_TENANT_ID in the azd env so postdeploy hooks do not fail on VM/managed-identity runners after deploy succeeds.
Add a concise README cheatsheet for initializing, provisioning, deploying, and invoking a hosted Foundry agent with a BYO container image under VNet private networking. Include ACR requirements for ABAC and private-only registries.
Keep the extension README concise by moving the detailed Foundry private networking schema, requirements, and BYO-image VNet cheatsheet into `docs/private-networking.md`, with a short README pointer.
Live managed-network provisioning showed that the resources module emitted AZURE_FOUNDRY_MANAGED_ISOLATION_MODE but the subscription-scoped main template never forwarded it, so azd env only received AZURE_FOUNDRY_NETWORK_MODE. Forward the output from main.bicep, add it to the provider canonical output-name restore list, and cover ARM casing repair with a regression test. Also document the managed VNet provisioning scenario in the private-networking guide. Live validation: provisioned network.mode=managed in westus and verified the account had publicNetworkAccess Disabled, networkAcls Deny, networkInjections with useMicrosoftManagedNetwork=true, AZURE_FOUNDRY_NETWORK_MODE=managed, and AZURE_FOUNDRY_MANAGED_ISOLATION_MODE=AllowInternetOutbound.
Live managed-network deploy validation showed that managed mode configures the hosted-agent runtime to use a Microsoft-managed network but does not create a customer private endpoint for the Foundry data plane. Disabling public access in that mode makes azd deploy/invoke fail with `403 Public access is disabled`. Keep public data-plane access enabled for managed mode while preserving BYO mode behavior (public access disabled + private endpoint). Update the private networking guide with managed deploy/invoke guidance. Live validation: provisioned managed mode, converted the test ACR to ABAC, built the echo-dual image with `az acr build --source-acr-auth-id [caller]`, granted the Foundry project MI `Container Registry Repository Reader`, deployed successfully, and invoked the hosted agent successfully.
Realign the azure.yaml `network:` surface to the natural Azure resource shape and make a network-bound Foundry account private in every mode. Reverses the prior managed-mode regression that flipped the account's publicNetworkAccess back to Enabled. Service sample 18 confirms managed mode supports a private data plane (customer private endpoint + the Microsoft-managed egress network), so declaring `network:` now always disables public data-plane access.
Config: flat `network:` block with two orthogonal axes, no `mode` enum.
- peSubnet (required) -> account private endpoint; omitting it while `network:` is declared is an error, never a silent public fallback.
- agentSubnet (optional) -> present injects the agent into a customer subnet (BYO egress); absent uses the Microsoft-managed network (managed egress), where isolationMode becomes valid.
Synthesizer/templates:
- derive egress from agentSubnet presence (useManagedEgress); replace the networkMode param with a useManagedEgress bool.
- disablePublicDataPlaneAccess = enableNetworkIsolation (always private).
- add a managedNetworks/default child resource carrying isolationMode for managed egress.
- validate peSubnet-required, isolationMode-managed-only, and single-VNet.
Docs/tests/e2e:
- rewrite docs/private-networking.md (host: azure.ai.agent, the value the provision provider actually accepts on this branch).
- add synthesizer unit tests + a compiled-ARM regression guard.
- add a live E2E harness (8-cell what-if matrix, BYO + managed-iso real provisions, eject idempotency) with an automatic jumpbox SOCKS tunnel so deploy/invoke can reach the private data plane; assert real account network-injection state rather than azd's echoed output.
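The derivation and validation rules for the flat `network:` block can be sketched as one function. The type and field names here (`Network`, `Subnet`, `Plan`, `synthesize`) are hypothetical and the real decoded shape may differ; the rules themselves are the ones listed in the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Subnet and Network are assumed shapes for the decoded network: block.
type Subnet struct {
	VNetID string // ARM id of the VNet this subnet rides on
	Name   string
}

type Network struct {
	PESubnet      *Subnet // required when network: is declared
	AgentSubnet   *Subnet // optional; nil means managed egress
	IsolationMode string  // valid only for managed egress
}

// Plan holds the template parameters derived from the block.
type Plan struct {
	EnableNetworkIsolation       bool
	UseManagedEgress             bool
	DisablePublicDataPlaneAccess bool
}

// synthesize applies peSubnet-required, isolationMode-managed-only and
// single-VNet, then derives egress from agentSubnet presence.
func synthesize(n *Network) (Plan, error) {
	if n == nil {
		return Plan{}, nil // no network: block, public account as before
	}
	if n.PESubnet == nil {
		return Plan{}, errors.New("network: requires peSubnet")
	}
	managed := n.AgentSubnet == nil
	if !managed && n.IsolationMode != "" {
		return Plan{}, errors.New("isolationMode is valid only for managed egress")
	}
	if !managed && !strings.EqualFold(n.AgentSubnet.VNetID, n.PESubnet.VNetID) {
		return Plan{}, errors.New("agentSubnet and peSubnet must share one VNet")
	}
	return Plan{
		EnableNetworkIsolation:       true,
		UseManagedEgress:             managed,
		DisablePublicDataPlaneAccess: true, // always private once network: is declared
	}, nil
}

func main() {
	pe := &Subnet{VNetID: "/subscriptions/s/.../virtualNetworks/v", Name: "pe"}
	plan, err := synthesize(&Network{PESubnet: pe, IsolationMode: "AllowOnlyApprovedOutbound"})
	fmt.Printf("%+v %v\n", plan, err)
}
```

Deriving the egress mode from a field's presence removes a whole class of incoherent configurations that a separate `mode` enum would have to reject.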
…n private-networking cheatsheet
Managed-egress cheatsheet now scaffolds the agent via 'azd ai agent init --image' (writes agent.yaml) instead of assuming a hand-authored manifest, matching the BYO cheatsheet. Replace the env-output 'Expected outputs' block (azd echoing its own AZURE_FOUNDRY_* classification) with real resource-state validation: account publicNetworkAccess=Disabled and the managedNetworks/default isolationMode, with the invoke echo response as the end-to-end proof.
'azd ai agent init --image' scaffolds azure.yaml/agent.yaml, so it must come before the network: block the reader adds to the generated service. Matches the actual timing and the BYO cheatsheet ordering.
- Drop the redundant 'azd env set AZD_AGENT_SKIP_ACR true': passing --image to 'azd ai agent init' already derives skipACR() and writes the env var into the environment init creates/selects. Reuse that env (no separate 'azd env new') so init and provision share one environment.
- BYO cheatsheet: remove the export-variable indirection; inline placeholders directly into 'azd env set', matching the managed-egress cheatsheet.
- Managed-egress cheatsheet: remove the weak 'validate via az show' / env-output block; a successful invoke over the private endpoint is the end-to-end proof.
In peered fallback mode, populate JB_HOST and wait for SSH before writing /etc/hosts. DNS-reference accounts can also assign different PE IPs per privatelink zone, so pin each account FQDN from the private DNS A record instead of mapping every FQDN to the first PE NIC IP.
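A rough sketch of the per-FQDN pinning, written in Go rather than the harness's Bash. `ARecord` and `hostsLines` are hypothetical names, and deriving the public FQDN by stripping the `privatelink.` prefix from the zone name is an assumption of this example, not something the commit message states.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// ARecord is one A record read from a private DNS zone (assumed shape).
type ARecord struct {
	Zone string // e.g. a privatelink.* zone name
	Name string // record name, typically the account name
	IP   string // private endpoint IP for this zone
}

// hostsLines renders one /etc/hosts line per record, each pinned to that
// record's own IP instead of a single shared PE NIC IP.
func hostsLines(records []ARecord) []string {
	lines := make([]string, 0, len(records))
	for _, r := range records {
		publicZone := strings.TrimPrefix(r.Zone, "privatelink.")
		lines = append(lines, fmt.Sprintf("%s %s.%s", r.IP, r.Name, publicZone))
	}
	sort.Strings(lines) // stable output across reruns
	return lines
}

func main() {
	for _, l := range hostsLines([]ARecord{
		{Zone: "privatelink.openai.azure.com", Name: "acct", IP: "10.0.1.6"},
		{Zone: "privatelink.cognitiveservices.azure.com", Name: "acct", IP: "10.0.1.5"},
	}) {
		fmt.Println(l)
	}
}
```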
Remove the ad-hoc live validation harness from the committed diff while keeping it available locally for investigation runs.
…m5i/foundry-private-network
# Conflicts:
#	cli/azd/extensions/azure.ai.agents/go.mod
1c71cfc to
ee27801
Compare normalized ARM JSON instead of raw bytes so CI/dev Bicep version metadata does not cause false stale-template failures while still catching semantic template drift.
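One way to compare templates semantically is to parse both sides and drop the compiler-version metadata before comparing. The sketch below assumes the noisy part is the `metadata._generator` object that Bicep writes into compiled ARM JSON; what the actual test normalizes is not stated in the commit message, and `sameTemplate` / `stripGenerator` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// stripGenerator removes metadata._generator at every level, so nested
// deployment templates are normalized too. Other metadata is kept.
func stripGenerator(v any) {
	switch t := v.(type) {
	case map[string]any:
		if md, ok := t["metadata"].(map[string]any); ok {
			delete(md, "_generator")
			if len(md) == 0 {
				delete(t, "metadata")
			}
		}
		for _, child := range t {
			stripGenerator(child)
		}
	case []any:
		for _, child := range t {
			stripGenerator(child)
		}
	}
}

// sameTemplate reports whether two ARM JSON documents are equal once parsed
// and normalized; key order and whitespace do not matter.
func sameTemplate(a, b []byte) (bool, error) {
	var x, y any
	if err := json.Unmarshal(a, &x); err != nil {
		return false, err
	}
	if err := json.Unmarshal(b, &y); err != nil {
		return false, err
	}
	stripGenerator(x)
	stripGenerator(y)
	return reflect.DeepEqual(x, y), nil
}

func main() {
	committed := []byte(`{"metadata":{"_generator":{"version":"1"}},"resources":[{"name":"x"}]}`)
	rebuilt := []byte(`{"resources":[{"name":"x"}],"metadata":{"_generator":{"version":"2"}}}`)
	fmt.Println(sameTemplate(committed, rebuilt))
}
```

A change to any resource, parameter, or output still fails the comparison, so semantic drift remains visible.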
ee27801 to
ea84132
…m5i/foundry-private-network
The Terraform module has no VNet/private-endpoint/DNS/networkInjections resources, so ejecting it for a network: service would silently drop the config and provision a public account. Fail fast with a clear error instead, preserving the network: secure-by-default contract. Bicep eject remains the supported path for private networking. Also fixes a merge artifact (single-arg ejectInfra call in a test).
…twork param set
- docs: add 'Advanced: eject the Bicep and customize it' section to the private-networking cheatsheet (eject -> manual edit -> provision -> deploy -> invoke), with two worked Bicep edits and a Terraform-unsupported note.
- test: assert the complete network parameter set lands in the ejected main.parameters.json for BYO egress, and extend the managed-egress eject test with the full param contract.
trangevi
approved these changes
Jun 23, 2026
Summary
Adds declarative, secure-by-default private networking for `host: azure.ai.agent` services in the Azure AI agents extension. Declaring a `network:` block on a Foundry-hosted service provisions a network-bound Foundry account/project from `azure.yaml` with the data plane private in every mode (account `publicNetworkAccess: Disabled` + customer private endpoint).
The config surface is flat and mirrors the natural Azure resource shape: two orthogonal axes, no `mode` enum.
- `peSubnet`: the account private endpoint. Its presence is what makes the data plane private; omitting it while declaring `network:` is an error (never a silent public fallback).
- `agentSubnet`: present ⇒ the agent is injected into your subnet (BYO egress); absent ⇒ the Microsoft-managed network is used (managed egress), where `isolationMode` becomes valid.
Changes
- `network:` schema for Foundry-hosted services (`agentSubnet`, `peSubnet`, `dns`, `isolationMode`); the VNet rides on each subnet.
- `publicNetworkAccess: Disabled` (`disablePublicDataPlaneAccess = enableNetworkIsolation`), Foundry private endpoint in `peSubnet` (+ private DNS zones / VNet links, or referenced existing zones via `dns:`), BYO egress via delegated hosted-agent subnet + account `networkInjections` pointing at that subnet, and managed egress via `useManagedEgress` injection + a `managedNetworks/default` child resource carrying `isolationMode`.
- Derive egress from `agentSubnet` presence (`useManagedEgress = agentSubnet == nil`); replace the `networkMode` template param with a `useManagedEgress` bool.
- Validate: `peSubnet` required when `network:` is declared; `isolationMode` valid only for managed egress; all subnets share one VNet.
- Preserve `${VAR}` placeholders during `azd ai agent init --infra` eject; resolve only at provision time.
- Build `networkInjections.subnetArmId` as a deterministic string (an inter-module `reference()` is unresolved at the CognitiveServices RP preflight and what-if does not catch it).
- Require that `agentSubnet.name` and `peSubnet.name` differ when they share one VNet (a single VNet cannot hold two subnets with the same name).
- Refuse `azd ai agent init --infra=terraform` for a service that declares `network:`. The Terraform IaC module (added in feat(agents): support Terraform as an IaC option for azd ai agent #8756) has no VNet / private-endpoint / DNS resources, so ejecting it would silently provision a public account; the guard preserves the secure-by-default invariant. Full Terraform network parity is a fast-follow.
- Document the eject workflow: eject the Bicep (`--infra`), edit it directly (e.g. add a subnet the schema can't express, or set an extra account property), then provision/deploy/invoke the edited tree (azd compiles the on-disk Bicep instead of synthesizing).
Test coverage
Three tiers; only the live tier creates resources.
Cells and cases covered:
- BYO egress: subnets `create` · DNS `create`
- Managed egress: `create` · `AllowOnlyApprovedOutbound`
- BYO egress: subnets `reference` · DNS `reference`
- Managed egress: `reference` · `AllowInternetOutbound`
- `peSubnet` omitted while `network:` declared
- `endpoint:` brownfield + `network:`
The full 8-cell BYO/managed × create/reference × own/reference × isolation-mode matrix was exercised by ARM what-if (template compiles and the CognitiveServices RP accepts the shape, but nothing is created) and by deterministic synthesizer unit tests. Three representative cells were live-provisioned: BYO create/DNS-create, managed `AllowOnlyApprovedOutbound`, and BYO with referenced subnets + referenced DNS zones. The real deploy/invoke data path was validated against both BYO live cells (DNS-create and DNS-reference). Managed subnet-reference mode and `AllowInternetOutbound` remain what-if + unit-test only.
E2E validation performed
Scenario 1: Provision a private-networked Foundry from `azure.yaml`
- BYO create/DNS-create: `publicNetworkAccess: Disabled`, `networkAcls.defaultAction: Deny`; Foundry private endpoint (`account` group) in `peSubnet`; agent subnet delegated to `Microsoft.App/environments`; the account `networkInjections` references the customer agent subnet; the three `privatelink.*` DNS zones created and linked.
- Managed egress: provisioned `AllowOnlyApprovedOutbound` into a dedicated VNet; verified the `managedNetworks/default` child resource was accepted with `isolationMode = AllowOnlyApprovedOutbound` (the one thing what-if cannot confirm, since the V2 managed network is created, not planned).
- BYO with referenced subnets + referenced DNS zones: `privatelink.*` zones linked to the VNet. Verified the private endpoint DNS-zone group points at the external DNS RG for all three zones and that azd did not create duplicate private DNS zones in the account RG. Then ran `azd deploy` and `azd ai agent invoke` through the jumpbox/SOCKS path. Validated run `azd-network-e2e-drdbg104254`: account `cog-bmgby7ooar752`, deploy succeeded in 1m45s, invoke returned `[netagent] 🔊 Echo: hello, are you up?`.
Live assertions read real resource state (account properties, private endpoint, subnet delegation, DNS zones / DNS-zone group, the account's network injection), not azd's own output variables. Deploy/invoke runs prove the private data-plane path is usable for the live BYO DNS-create and BYO DNS-reference cells.
Scenario 2: Eject preserves private-networking config
`azd ai agent init --infra` ejects equivalent Bicep and preserves `${VAR}` placeholders (e.g. `${AZURE_VNET_ID}` resolves at provision time, not eject time). The ejected template what-ifs as no changes against the already-provisioned account (idempotent).
Eject → edit → re-provision → deploy → invoke (live, full power-user path): applied two manual Bicep edits to the ejected tree, namely an extra subnet not expressible via the `network:` schema (`infra/modules/network.bicep`) and an extra account tag (`infra/modules/resources.bicep`), re-provisioned the edited template against the live account, and asserted both landed in real resource state (subnet `192.168.30.0/24` present; account tag `editedByPowerUser=true`). Then `azd deploy` + `azd ai agent invoke` succeeded through the jumpbox/SOCKS path against the edited account, proving the manual edit did not break the private data plane. Validated run `azd-net-e2e-live-20260623-152842`: account `cog-fgdp4sboffk2a`, invoke returned `[netagent] 🔊 Echo: hello, are you up?`.
Scenario 3: Deploy and invoke a hosted agent over the private data plane
Because the data plane is private in every mode, deploy/invoke must run with line-of-sight to the private endpoint. The manual validation used a jumpbox VM and local SOCKS5 proxy so `azd deploy`/`invoke` ran on the dev host with data-plane HTTPS tunneled into the VNet.
Validated against private BYO accounts (`publicNetworkAccess: Disabled`): the Foundry project MI was granted `Container Registry Repository Reader` on an ABAC-enabled ACR for the BYO image pull, `azd deploy` reached hosted-agent `active`, and `azd ai agent invoke` returned the expected echo response. A direct deploy/invoke from the public internet fails as expected with `403 Public access is disabled`.
Documentation
`cli/azd/extensions/azure.ai.agents/docs/private-networking.md`
Known limitations
- Single VNet for `agentSubnet` and `peSubnet`; cross-VNet topologies are deferred (require customer-managed peering + DNS-zone links).
- A `dns`-create account links the VNet to the three AI `privatelink.*` zones, and a VNet allows only one link per namespace; a second account (or brownfield hub) must use `dns:` reference mode.
- `azd ai agent init --infra=terraform` is refused when a service declares `network:`. Use Bicep (`--infra`) and customize via the eject workflow if needed. Full Terraform parity is a fast-follow.