
ansible-devnet: 8-subnet homogeneous layout and full leanpoint upstreams #176

Open
ch4r10t33r wants to merge 31 commits into main from feat/devnet8-homogeneous-leanpoint-full-upstreams


@ch4r10t33r
Contributor

Summary

  • Regenerate ansible-devnet/genesis/validator-config.yaml for a 64-validator / 8-subnet homogeneous devnet: each subnet is a single client family (qlean → lantern → ream → zeam → ethlambda → gean → grandine → nlean).
  • Aggregators (indices 0–7) use dedicated Aggregator_servers IPs; regular validators use the Validator_servers pool (23 hosts, up to 4 containers per IP). Tooling host 46.225.10.32 is excluded.
  • Leanpoint now polls every validator in validator-config.yaml by default (removed the earlier cap of 2 upstreams per subnet). sync-leanpoint-upstreams.sh passes --all-upstreams. Legacy subsetting is opt-in via --subnet-sample.
  • Ansible localhost add_host plays use strategy: linear so they work when ansible.cfg sets strategy=free for large devnet deploys.

Test plan

  • python3 convert-validator-config.py ansible-devnet/genesis/validator-config.yaml /tmp/up.json → 64 upstreams
  • python3 convert-validator-config.py ... /tmp/up.json --subnet-sample → 16 upstreams (8 subnets × 2)
  • spin-node.sh / Ansible deploy against updated validator-config.yaml (dry-run or staging)
  • sync-leanpoint-upstreams.sh regenerates tooling upstreams.json with full validator list
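
The two upstream counts in the test plan (64 by default, 16 with --subnet-sample) can be illustrated with a minimal sketch. This is not the code in convert-validator-config.py: the function name, the dict shape, and the rule that groups validators by `index % subnet_count` are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def select_upstreams(
    validators: List[Dict],
    subnet_count: int,
    per_subnet_cap: Optional[int] = None,
) -> List[Dict]:
    """Return the validators to poll.

    per_subnet_cap=None keeps every validator (the new default);
    an integer cap keeps at most that many per subnet (legacy sampling).
    """
    if per_subnet_cap is None:
        return list(validators)

    kept_per_subnet: Dict[int, int] = defaultdict(int)
    selected = []
    for validator in validators:
        # Assumed grouping rule: subnet is the validator index modulo the subnet count.
        subnet = validator["index"] % subnet_count
        if kept_per_subnet[subnet] < per_subnet_cap:
            kept_per_subnet[subnet] += 1
            selected.append(validator)
    return selected


# Hypothetical 64-validator layout; node names are placeholders.
validators = [{"name": f"node_{i}", "index": i} for i in range(64)]
full = select_upstreams(validators, subnet_count=8)
sampled = select_upstreams(validators, subnet_count=8, per_subnet_cap=2)
print(len(full), len(sampled))  # 64 16
```

Keeping the cap as an optional parameter mirrors the opt-in nature of --subnet-sample: the default path never filters.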

ch4r10t33r and others added 30 commits February 6, 2026 14:56
Add support for configuring nodes as aggregators through validator-config.yaml.
This allows selective designation of nodes to perform aggregation duties by
setting isAggregator: true in the validator configuration.

Changes:
- Add isAggregator field (default: false) to all validators in both local and ansible configs
- Update parse-vc.sh to extract and export isAggregator flag
- Modify all client command scripts to pass --is-aggregator flag when enabled
- Add isAggregator status to node information output
Resolved conflicts in client-cmds scripts by keeping both:
- Aggregator flag support
- Checkpoint sync URL support

Updated Docker images:
- zeam: 0xpartha/zeam:devnet3
- lantern: piertwo/lantern:v0.0.3-test
- ethlambda: ghcr.io/lambdaclass/ethlambda:devnet3

Added httpPort support for lantern nodes.
Resolve zeam-cmd.sh: keep single attestation_committee block and zeam_global_flags in node_binary.
…int upstreams

Regenerate validator-config.yaml for 64 validators across 8 attestation
subnets (one client family per subnet). Aggregators sit on dedicated
aggregator hosts; regular validators use the Validator_servers IP pool.

Leanpoint convert/sync now emit one upstream per validator by default
(the per-subnet cap of two is removed). Optional --subnet-sample restores
the legacy subset behavior.

Ansible localhost plays that use add_host force strategy: linear so they
work with ansible.cfg strategy=free on large devnets.
--prepare now installs tools, opens firewall ports, and starts Prometheus,
Promtail, node_exporter, and cadvisor on every host. Add apt retries/throttle,
prepare fork cap, and a single retry pass for transient lock failures.
…n slots

Aggregators now get --aggregate-subnet-ids for their committee only
(validator_index % attestation_committee_count) via parse-vc.sh and
ansible zeam/ethlambda roles, not the full 0..N-1 CSV. Client cmd scripts
pass a single subnet id; peam allowed_topics match the same rule.
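
As a rough sketch of the rule above, the single subnet id for an aggregator could be derived as follows. The helper names are hypothetical, and passing the flag and its value as two separate arguments is an assumption; only the modulo rule and the --aggregate-subnet-ids flag name come from the commit message.

```python
from typing import List


def aggregate_subnet_id(validator_index: int, attestation_committee_count: int) -> int:
    """Committee (subnet) an aggregator serves: index modulo committee count."""
    if attestation_committee_count <= 0:
        raise ValueError("attestation_committee_count must be positive")
    return validator_index % attestation_committee_count


def aggregate_subnet_args(validator_index: int, attestation_committee_count: int) -> List[str]:
    """Build the CLI arguments carrying one subnet id instead of the full 0..N-1 CSV."""
    subnet = aggregate_subnet_id(validator_index, attestation_committee_count)
    return ["--aggregate-subnet-ids", str(subnet)]


print(aggregate_subnet_args(13, 8))  # ['--aggregate-subnet-ids', '5']
```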

Rename former qlean_* / lantern_* validator nodes to zeam_8..15 and
ethlambda_8..15 in ansible and local genesis configs to avoid clashing
with existing zeam_0..7 / ethlambda_0..7 names.
Stop every Docker container on each unique validator-config IP except
the per-host observability stack (prometheus, promtail, cadvisor,
node_exporter). Document in README.
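
The selection rule can be sketched as a simple filter. This is an illustration, not the playbook itself: it assumes the observability containers are named exactly as listed in the commit message, and the function name is hypothetical.

```python
from typing import Iterable, List

# Per-host observability stack that must keep running (names assumed to match exactly).
OBSERVABILITY_CONTAINERS = frozenset(
    {"prometheus", "promtail", "cadvisor", "node_exporter"}
)


def containers_to_stop(running_names: Iterable[str]) -> List[str]:
    """Return every running container name that is not part of the observability stack."""
    return [name for name in running_names if name not in OBSERVABILITY_CONTAINERS]


# Hypothetical host state.
running = ["zeam_0", "prometheus", "ream_13", "node_exporter", "old_test_container"]
print(containers_to_stop(running))  # ['zeam_0', 'ream_13', 'old_test_container']
```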
Use ansible command+loop instead of bash process substitution so the
playbook runs under /bin/sh. Clarify that all stale containers are
removed, not only validator-config names.
Configure unless-stopped on all containers via prepare/deploy, systemd
Restart=always for docker.service, and a shared group_vars policy for
new docker run invocations.
Kernel logs showed ream using up to ~15GiB RSS on 16GiB hosts with
2–3 validators per IP. Add per-client docker --memory limits, tighter
limits on the 8GiB host, and run docker-restart-policy only on prepare
(not mid-deploy).
Set all client docker_memory_limits to 3g on 157.90.254.146.
Drop 157.90.254.146 overrides; all devnet hosts use 16gib defaults.
Set every per-client docker_memory_limits entry to a uniform 4g and
make each role's docker run skip --memory/--memory-swap when the node
is an aggregator (one container per IP, no co-tenant memory pressure).

Previously ream (5g), ethlambda (1.5g), and the rest (3g) used
inconsistent limits, and the cap was applied unconditionally, so
aggregators were throttled even though they own their host.
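
A minimal sketch of the resulting policy, assuming the limit is passed to docker run as --memory and --memory-swap with the same value (which prevents the container from using swap beyond its memory limit). The function name and the choice to set both flags to the same value are assumptions, not taken from the roles.

```python
from typing import List


def docker_memory_args(is_aggregator: bool, limit: str = "4g") -> List[str]:
    """Return memory-limit arguments for docker run.

    Aggregators run one container per IP, so they get no cap;
    every other node gets the uniform limit.
    """
    if is_aggregator:
        return []
    # Assumption: swap limit equals the memory limit.
    return ["--memory", limit, "--memory-swap", limit]


print(docker_memory_args(is_aggregator=False))  # ['--memory', '4g', '--memory-swap', '4g']
print(docker_memory_args(is_aggregator=True))   # []
```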
Replace the ansible fallback when client-cmd extraction fails:
0xpartha/zeam:local → blockblaz/zeam:devnet4 (matches defaults and zeam-cmd.sh).
Replace the former nlean column with grandine aggregators and add a
second ream column on subnet 6 so gean and grandine each own one subnet.
Put grandine_0 on the grandine aggregator host, move ream_13 to the
nlean slot, place gean_1 on 95.217.158.60, and relocate validators off
hosts not in lean_ethereum_servers.txt.
Replace the eight aggregator host IPs with the Aggregator_servers list
from lean_ethereum_servers.txt and keep assign-aggregator-ips.py in sync.
Wire the new zeam --rayon-threads CLI flag (zeam #903 / #899) into both
the zeam-cmd.sh shell launcher and the ansible/roles/zeam docker run.

Two knobs so non-aggregators can stay on zeam's compiled-in auto-split:

  ZEAM_RAYON_THREADS_AGGREGATOR / zeam_rayon_threads_aggregator
      aggregator-only override (wins for aggregators)
  ZEAM_RAYON_THREADS / zeam_rayon_threads
      uniform override applied to both roles

Leaving both unset (the default) is required for pre-#903 zeam images,
which would reject the flag and fail to start. On 16-vCPU hosts the
recommended starting value is 12 (cpu_count - 4, leaving four threads
reserved for the system).
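
The precedence between the two knobs described in this commit message can be sketched as follows. The function names are hypothetical, and treating an empty string as unset is an assumption; the variable names, the --rayon-threads flag, and the rule that the flag is omitted when both are unset come from the commit message.

```python
from typing import List, Mapping, Optional


def resolve_rayon_threads(is_aggregator: bool, env: Mapping[str, str]) -> Optional[int]:
    """Pick the rayon thread count, or None to leave zeam on its built-in auto-split."""
    aggregator_override = env.get("ZEAM_RAYON_THREADS_AGGREGATOR")
    uniform_override = env.get("ZEAM_RAYON_THREADS")
    # The aggregator-only knob wins, but only for aggregators.
    if is_aggregator and aggregator_override:
        return int(aggregator_override)
    # The uniform knob applies to both roles.
    if uniform_override:
        return int(uniform_override)
    return None


def rayon_args(is_aggregator: bool, env: Mapping[str, str]) -> List[str]:
    """Emit the flag only when a value is set, so pre-#903 images never see it."""
    threads = resolve_rayon_threads(is_aggregator, env)
    return [] if threads is None else ["--rayon-threads", str(threads)]


env = {"ZEAM_RAYON_THREADS_AGGREGATOR": "12", "ZEAM_RAYON_THREADS": "6"}
print(rayon_args(True, env))   # ['--rayon-threads', '12']
print(rayon_args(False, env))  # ['--rayon-threads', '6']
print(rayon_args(True, {}))    # []
```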
Apply twelve rayon workers whenever isAggregator is true unless
ZEAM_RAYON_THREADS_AGGREGATOR overrides it.
Add an explicit docker pull step to every client Ansible role and use
--pull=always on spin-node docker runs so registry tags are refreshed on
each deploy.
Fix zeam chain-worker and rayon-threads CLI generation, set aggregator
rayon to 12, replace ream subnet 2 with lantern, and harden
stop-all-containers against unreachable hosts.
Use 0xpartha/zeam:local, set non-aggregator rayon to 6 on 8-vCPU hosts,
and let 16-vCPU aggregators auto-tune (cpu_count - 4) when no override
is set.


1 participant