ansible-devnet: 8-subnet homogeneous layout and full leanpoint upstreams #176
Open
ch4r10t33r wants to merge 31 commits into
Add support for configuring nodes as aggregators through validator-config.yaml. This allows selective designation of nodes to perform aggregation duties by setting isAggregator: true in the validator configuration.
Changes:
- Add isAggregator field (default: false) to all validators in both local and ansible configs
- Update parse-vc.sh to extract and export isAggregator flag
- Modify all client command scripts to pass --is-aggregator flag when enabled
- Add isAggregator status to node information output
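As a rough illustration of the rule in this commit, the sketch below maps one validator entry to the extra launcher argument. Only the isAggregator field, its false default, and the --is-aggregator flag come from the commit message; the helper name aggregator_args and the dictionary layout are hypothetical, and the real logic lives in parse-vc.sh and the client command scripts.

```python
def aggregator_args(validator: dict) -> list[str]:
    """Return the extra CLI args for one validator entry (hypothetical helper)."""
    # isAggregator defaults to false when the field is absent.
    if validator.get("isAggregator", False):
        return ["--is-aggregator"]
    return []


# Hypothetical entries shaped like parsed validator-config.yaml items.
validators = [
    {"name": "zeam_0", "isAggregator": True},
    {"name": "ream_0"},  # field omitted, so the default applies
]
for v in validators:
    print(v["name"], aggregator_args(v))
```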
Resolved conflicts in client-cmds scripts by keeping both:
- Aggregator flag support
- Checkpoint sync URL support
Updated Docker images:
- zeam: 0xpartha/zeam:devnet3
- lantern: piertwo/lantern:v0.0.3-test
- ethlambda: ghcr.io/lambdaclass/ethlambda:devnet3
Added httpPort support for lantern nodes.
Resolve zeam-cmd.sh: keep single attestation_committee block and zeam_global_flags in node_binary.
…int upstreams
Regenerate validator-config.yaml for 64 validators across 8 attestation subnets (one client family per subnet). Aggregators sit on dedicated aggregator hosts; regular validators use the Validator_servers IP pool.
Leanpoint convert/sync now emits one upstream per validator by default (removed per-subnet cap of two). Optional --subnet-sample restores the legacy subset behavior.
Ansible localhost plays that use add_host force strategy: linear so they work with ansible.cfg strategy=free on large devnets.
--prepare now installs tools, opens firewall ports, and starts Prometheus, Promtail, node_exporter, and cadvisor on every host. Add apt retries/throttle, prepare fork cap, and a single retry pass for transient lock failures.
…n slots
Aggregators now get --aggregate-subnet-ids for their committee only (validator_index % attestation_committee_count) via parse-vc.sh and ansible zeam/ethlambda roles, not the full 0..N-1 CSV. Client cmd scripts pass a single subnet id; peam allowed_topics match the same rule.
Rename former qlean_* / lantern_* validator nodes to zeam_8..15 and ethlambda_8..15 in ansible and local genesis configs to avoid clashing with existing zeam_0..7 / ethlambda_0..7 names.
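The committee rule quoted above (validator_index % attestation_committee_count) can be sketched as follows. The function names are hypothetical, and passing the flag and its value as two separate argv tokens is an assumption; only the modulo rule and the --aggregate-subnet-ids flag name come from the commit message.

```python
def aggregate_subnet_id(validator_index: int, attestation_committee_count: int) -> int:
    """Subnet an aggregator serves: its own committee only."""
    if attestation_committee_count <= 0:
        raise ValueError("attestation_committee_count must be positive")
    return validator_index % attestation_committee_count


def aggregate_subnet_args(validator_index: int, attestation_committee_count: int) -> list[str]:
    """Build the flag with a single subnet id rather than the full 0..N-1 CSV."""
    subnet = aggregate_subnet_id(validator_index, attestation_committee_count)
    # Assumption: flag and value are separate argv tokens.
    return ["--aggregate-subnet-ids", str(subnet)]


# With 8 committees, validator 10 aggregates for subnet 10 % 8.
print(aggregate_subnet_args(10, 8))
```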
Stop every Docker container on each unique validator-config IP except the per-host observability stack (prometheus, promtail, cadvisor, node_exporter). Document in README.
Use ansible command+loop instead of bash process substitution so the playbook runs under /bin/sh. Clarify that all stale containers are removed, not only validator-config names.
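The selection rule from these two commits (stop every container except the per-host observability stack) amounts to a simple filter, sketched below. The helper name and the assumption that containers are matched by exact name are hypothetical; the playbook itself uses an ansible command+loop.

```python
# Per-host observability stack named in the commit message.
OBSERVABILITY = frozenset({"prometheus", "promtail", "cadvisor", "node_exporter"})


def containers_to_stop(running: list[str]) -> list[str]:
    """Return every running container name that is not part of the observability stack."""
    # Assumption: containers are identified by their exact names.
    return [name for name in running if name not in OBSERVABILITY]


# Hypothetical list of container names on one host.
running = ["zeam_0", "prometheus", "stale_test_node", "cadvisor", "ream_3"]
print(containers_to_stop(running))
```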
Configure unless-stopped on all containers via prepare/deploy, systemd Restart=always for docker.service, and a shared group_vars policy for new docker run invocations.
Kernel logs showed ream using up to ~15GiB RSS on 16GiB hosts with 2–3 validators per IP. Add per-client docker --memory limits, tighter limits on the 8GiB host, and run docker-restart-policy only on prepare (not mid-deploy).
Set all client docker_memory_limits to 3g on 157.90.254.146.
Drop 157.90.254.146 overrides; all devnet hosts use the 16GiB defaults.
Set every per-client docker_memory_limits entry to a uniform 4g and make each role's docker run skip --memory/--memory-swap when the node is an aggregator (one container per IP, no co-tenant memory pressure). Previously ream (5g), ethlambda (1.5g) and the rest (3g) were inconsistent, and the cap was applied unconditionally — so aggregators were also being throttled even though they own their host.
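A minimal sketch of the policy described in this commit: a uniform 4g cap for regular validators and no memory flags for aggregators. The helper name is hypothetical, and setting --memory-swap to the same value as --memory is an assumption about how the roles fill in that flag.

```python
DOCKER_MEMORY_LIMIT = "4g"  # uniform per-client limit from the commit message


def memory_args(is_aggregator: bool, limit: str = DOCKER_MEMORY_LIMIT) -> list[str]:
    """Return docker run memory flags, skipping them for aggregators."""
    if is_aggregator:
        # One container per IP, so there is no co-tenant memory pressure to guard against.
        return []
    # Assumption: swap is capped at the same value as memory.
    return ["--memory", limit, "--memory-swap", limit]


print(memory_args(is_aggregator=False))
print(memory_args(is_aggregator=True))
```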
Replace the ansible fallback when client-cmd extraction fails: 0xpartha/zeam:local → blockblaz/zeam:devnet4 (matches defaults and zeam-cmd.sh).
Replace the former nlean column with grandine aggregators and add a second ream column on subnet 6 so gean and grandine each own one subnet.
Put grandine_0 on the grandine aggregator host, move ream_13 to the nlean slot, place gean_1 on 95.217.158.60, and relocate validators off hosts not in lean_ethereum_servers.txt.
Replace the eight aggregator host IPs with the Aggregator_servers list from lean_ethereum_servers.txt and keep assign-aggregator-ips.py in sync.
Wire the new zeam --rayon-threads CLI flag (zeam #903 / #899) into both
the zeam-cmd.sh shell launcher and the ansible/roles/zeam docker run.
Two knobs so non-aggregators can stay on zeam's compiled-in auto-split:
- ZEAM_RAYON_THREADS_AGGREGATOR / zeam_rayon_threads_aggregator: aggregator-only override (wins for aggregators)
- ZEAM_RAYON_THREADS / zeam_rayon_threads: uniform override applied to both roles
Leaving both unset (the default) is required for pre-#903 zeam images, which
would refuse the flag and fail to start. The recommended starting value on
16-vCPU hosts is 12 (= cpu_count - 4 reserved system threads).
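The precedence between the two knobs, as described in this commit message, might look like the sketch below. The function name and the use of a plain dictionary in place of the process environment are hypothetical; later commits in this PR change the defaults, which this sketch does not model.

```python
def rayon_threads_args(is_aggregator: bool, env: dict) -> list[str]:
    """Resolve --rayon-threads from the two override knobs."""
    value = None
    if is_aggregator:
        # The aggregator-only override wins for aggregators.
        value = env.get("ZEAM_RAYON_THREADS_AGGREGATOR") or None
    if value is None:
        # The uniform override applies to both roles.
        value = env.get("ZEAM_RAYON_THREADS") or None
    if value is None:
        # Both unset: omit the flag so pre-#903 images still start.
        return []
    return ["--rayon-threads", value]


print(rayon_threads_args(True, {"ZEAM_RAYON_THREADS_AGGREGATOR": "12", "ZEAM_RAYON_THREADS": "6"}))
print(rayon_threads_args(False, {}))
```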
Apply twelve rayon workers whenever isAggregator is true unless ZEAM_RAYON_THREADS_AGGREGATOR overrides it.
Add an explicit docker pull step to every client Ansible role and use --pull=always on spin-node docker runs so registry tags are refreshed on each deploy.
Fix zeam chain-worker and rayon-threads CLI generation, set aggregator rayon to 12, replace ream subnet 2 with lantern, and harden stop-all-containers against unreachable hosts.
Use 0xpartha/zeam:local, set non-aggregator rayon to 6 on 8-vCPU hosts, and let 16-vCPU aggregators auto-tune (cpu_count - 4) when no override is set.
Summary
- Regenerate ansible-devnet/genesis/validator-config.yaml for a 64-validator / 8-subnet homogeneous devnet: each subnet is a single client family (qlean → lantern → ream → zeam → ethlambda → gean → grandine → nlean). 46.225.10.32 is excluded.
- Leanpoint upstreams: one per validator in validator-config.yaml by default (removed the earlier cap of 2 upstreams per subnet). sync-leanpoint-upstreams.sh passes --all-upstreams. Legacy subsetting is opt-in via --subnet-sample.
- Ansible localhost add_host plays use strategy: linear so they work when ansible.cfg sets strategy=free for large devnet deploys.
Test plan
- python3 convert-validator-config.py ansible-devnet/genesis/validator-config.yaml /tmp/up.json → 64 upstreams
- python3 convert-validator-config.py ... /tmp/up.json --subnet-sample → 16 upstreams (8 subnets × 2)
- spin-node.sh / Ansible deploy against updated validator-config.yaml (dry-run or staging)
- sync-leanpoint-upstreams.sh regenerates tooling upstreams.json with full validator list
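The two upstream counts in the test plan can be reproduced with a small sketch of the selection rule. This is not the code of convert-validator-config.py: the function name, the subnet field on each entry, the synthetic validator list, and the choice to keep the first two entries seen per subnet are all assumptions made for illustration.

```python
from collections import defaultdict


def select_upstreams(validators: list[dict], subnet_sample: bool = False, per_subnet: int = 2) -> list[dict]:
    """Default: one upstream per validator. With subnet_sample: cap entries per subnet."""
    if not subnet_sample:
        return list(validators)
    kept = []
    seen = defaultdict(int)  # subnet id -> number of upstreams kept so far
    for v in validators:
        # Assumption: the legacy subset keeps the first entries encountered per subnet.
        if seen[v["subnet"]] < per_subnet:
            kept.append(v)
            seen[v["subnet"]] += 1
    return kept


# Synthetic 64-validator / 8-subnet layout (hypothetical names and subnet field).
validators = [{"name": f"node_{i}", "subnet": i % 8} for i in range(64)]
print(len(select_upstreams(validators)))                      # 64
print(len(select_upstreams(validators, subnet_sample=True)))  # 16
```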