Commit c2e33d3

Merge branch 'main' into devnet4
2 parents c6c1688 + 0ee9ac1 commit c2e33d3

4 files changed

Lines changed: 230 additions & 5 deletions

.claude/skills/devnet-runner/SKILL.md

Lines changed: 16 additions & 0 deletions
@@ -201,8 +201,24 @@ For persistent devnets on remote servers (e.g., `ssh admin@ethlambda-1`), use de
201201

202202
See `references/long-lived-devnet.md` for the full procedure, including starting the devnet, rolling restart steps, verification, and troubleshooting.
203203

204+
## Testing Network Instability
205+
206+
Use `docker pause` to simulate validator failures and observe consensus degradation (finalization stalls, attestation backlog, block processing time increases).
207+
208+
**Quick start:**
209+
```bash
# Pause 2 non-aggregator nodes (causes loss of supermajority, finalization stalls)
docker pause ethlambda_0 ethlambda_1

# Recover
docker unpause ethlambda_0 ethlambda_1
```
216+
217+
See `references/instability-testing.md` for detailed scenarios, data extraction scripts, and analysis procedures.
218+
204219
## Reference
205220

206221
- `references/clients.md`: Client-specific details (images, ports, known issues)
207222
- `references/validator-config.md`: Full config schema, field reference, adding/removing nodes, port allocation
208223
- `references/long-lived-devnet.md`: Persistent devnets with detached containers and rolling restarts
224+
- `references/instability-testing.md`: Simulating validator failures, non-finality, and measuring block processing under load
Lines changed: 195 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,195 @@
1+
# Testing Network Instability and Non-Finality
2+
3+
Simulate validator failures, observe consensus degradation, and measure recovery. Useful for benchmarking block processing under load, testing finalization stall behavior, and validating fixes like parallel signature verification.
4+
5+
## Overview
6+
7+
The lean consensus protocol requires a supermajority (3 out of 4 validators in a 4-node devnet) to justify and finalize slots. Pausing containers with `docker pause` simulates sudden validator failures without destroying state, allowing clean recovery with `docker unpause`.
8+
9+
## Prerequisites
10+
11+
- A running devnet (local or long-lived) with 4 ethlambda nodes
12+
- Know which node is the aggregator (`--is-aggregator` flag, typically `ethlambda_3`)
13+
- The aggregator must remain running, since it aggregates attestation signatures into proofs
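If it is unclear which container is the aggregator, one option is to inspect each container's launch arguments. The following is a minimal sketch, assuming the `--is-aggregator` flag is passed on the container command line (rather than through an environment variable or config file); `find_aggregator` is a hypothetical helper name, not part of ethlambda or Docker.

```bash
# Hypothetical helper: print the names of containers whose launch
# arguments contain --is-aggregator.
# Assumption: the flag appears on the container's command line.
find_aggregator() {
  local c
  for c in "$@"; do
    # .Path is the entrypoint binary, .Args its arguments
    if docker inspect "$c" --format '{{.Path}} {{join .Args " "}}' 2>/dev/null |
        grep -q -- '--is-aggregator'; then
      echo "$c"
    fi
  done
}
```

Example: `find_aggregator ethlambda_0 ethlambda_1 ethlambda_2 ethlambda_3` prints the matching container names, one per line. An empty result most likely means the flag is configured some other way.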
14+
15+
## Quick Start: Induce Non-Finality
16+
17+
```bash
# 1. Verify all nodes are healthy
docker ps --format 'table {{.Names}}\t{{.Status}}' | grep ethlambda

# 2. Pause 2 non-aggregator nodes (causes loss of supermajority)
docker pause ethlambda_0 ethlambda_1

# 3. Verify they're paused
docker inspect ethlambda_0 --format '{{.State.Paused}}'
# Should output: true

# 4. Observe: finalization stalls, attestation backlog accumulates
docker logs --tail 20 ethlambda_2 2>&1 | sed 's/\x1b\[[0-9;]*m//g'

# 5. Recover: unpause both nodes
docker unpause ethlambda_0 ethlambda_1
```
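Steps 2 and 3 can be combined so that every paused container is checked, not only `ethlambda_0`. This is a minimal sketch; `pause_and_verify` is a hypothetical helper name, and it relies only on `docker pause` and the `State.Paused` field already used above.

```bash
# Hypothetical helper: pause the given containers and confirm that each
# one reports State.Paused == true. Returns non-zero on the first failure.
pause_and_verify() {
  local c
  docker pause "$@" >/dev/null || return 1
  for c in "$@"; do
    if [ "$(docker inspect "$c" --format '{{.State.Paused}}')" != "true" ]; then
      echo "not paused: $c" >&2
      return 1
    fi
  done
  echo "paused: $*"
}
```

Example: `pause_and_verify ethlambda_0 ethlambda_1 || echo "pause failed"`.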
34+
35+
## What Happens When You Pause 2 of 4 Nodes
36+
37+
### Immediate effects
38+
- Block production continues from the 2 active validators (slots assigned to paused validators are missed)
39+
- Attestation signatures from paused validators stop arriving
40+
- The aggregator can only aggregate attestations from 2 validators
41+
42+
### Within ~20 slots
43+
- Justification stalls (need 3/4 supermajority, only have 2/4)
44+
- Finalized and justified slots stop advancing
45+
- Attestation target falls behind the head (nodes vote for the last justified checkpoint)
46+
47+
### Steady state (50+ slots)
48+
- Blocks carry up to 36 attestations (backlog from all prior slots)
49+
- Block processing time increases proportionally with attestation count
50+
- Target-to-head gap grows linearly (~1 slot per slot)
51+
52+
### Observable metrics
53+
54+
```bash
# Check finalization progress (should be stuck)
docker logs --tail 20 ethlambda_2 2>&1 | \
  sed 's/\x1b\[[0-9;]*m//g' | grep 'Fork Choice Tree' -A 6 | tail -8

# Check attestation target gap (should be growing)
docker logs ethlambda_2 2>&1 | \
  sed 's/\x1b\[[0-9;]*m//g' | grep 'Published attestation' | \
  sed 's/.*slot=\([0-9]*\).*target_slot=\([0-9]*\).*/\1 \2/' | \
  awk 'NF==2 {print "slot=" $1 " target=" $2 " gap=" $1-$2}' | tail -10
```
65+
66+
## Extracting Block Processing Data
67+
68+
Extract attestation count and block processing time from all nodes for analysis:
69+
70+
```bash
for c in ethlambda_0 ethlambda_1 ethlambda_2 ethlambda_3; do
  # awk ignores stdin when file operands are given, so the logs are passed
  # twice via process substitution (bash): pass 1 records the attestation
  # count per slot, pass 2 joins it with the block processing times.
  awk -v node="$c" '
    NR==FNR {
      if (/Received block from gossip|Published block to gossipsub/) {
        match($0, /slot=[0-9]+/); s=substr($0, RSTART+5, RLENGTH-5)
        match($0, /attestation_count=[0-9]+/); a=substr($0, RSTART+18, RLENGTH-18)
        att[s]=a
      }
      next
    }
    # Normalize a duration such as "90ms", "1.4s" or "500µs" to milliseconds
    function to_ms(raw) {
      if (index(raw, "ms") > 0) { gsub(/ms/, "", raw); return raw+0 }
      if (index(raw, "µs") > 0) { gsub(/µs/, "", raw); return (raw+0)/1000 }
      gsub(/s/, "", raw); return (raw+0)*1000
    }
    /Processed new block/ {
      match($0, /slot=[0-9]+/); s=substr($0, RSTART+5, RLENGTH-5)
      match($0, /block_total=[^ ]+/); bt_raw=substr($0, RSTART+12, RLENGTH-12)
      match($0, /sig_verification=[^ ]+/); sv_raw=substr($0, RSTART+17, RLENGTH-17)
      bt=to_ms(bt_raw); sv=to_ms(sv_raw)
      if (s in att) ac=att[s]; else ac=0
      print node "," s "," ac "," bt "," sv
    }
  ' <(docker logs "$c" 2>&1 | sed "s/\x1b\[[0-9;]*m//g") \
    <(docker logs "$c" 2>&1 | sed "s/\x1b\[[0-9;]*m//g")
done > block_data.csv
```
98+
99+
Output CSV format: `node,slot,attestation_count,block_total_ms,sig_verification_ms`
100+
101+
**Important:** The `block_total` and `sig_verification` values are logged with mixed units (`µs` for microseconds, `ms` for milliseconds, `s` for seconds). The awk `to_ms` function above normalizes both fields to milliseconds.
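To spot-check a single logged duration outside the extraction script, the same normalization can be wrapped in a standalone function. This is a sketch rather than the project's own tooling: the shell-level `to_ms` below is a hypothetical helper, and unlike the awk version above it rejects unrecognized units instead of treating them as seconds.

```bash
# Hypothetical standalone converter: "90ms" -> 90, "2s" -> 2000, "500µs" -> 0.5
# Assumption: the value is a number immediately followed by its unit.
to_ms() {
  awk -v raw="$1" 'BEGIN {
    unit = raw
    sub(/^[0-9.]+/, "", unit)   # strip the leading number, keep the unit
    value = raw + 0             # awk keeps the leading numeric prefix
    if (unit == "ms")      print value
    else if (unit == "s")  print value * 1000
    else if (unit == "µs") print value / 1000
    else { print "unknown unit: " unit > "/dev/stderr"; exit 1 }
  }'
}
```

Example: `to_ms 2s` prints `2000`.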
102+
103+
## Quick Stats from Extracted Data
104+
105+
```bash
# Max block processing time
awk -F',' '{if($4>max){max=$4; line=$0}} END{print "MAX:", line}' block_data.csv

# Post-pause stats (replace 50 with your pause slot)
PAUSE_SLOT=50
awk -F',' -v ps="$PAUSE_SLOT" '$2>ps {sum+=$4; n++; if($4>max)max=$4}
  END{
    # Guard against division by zero when no block is past the pause slot
    if (n) print "Post-pause: n=" n " avg=" sum/n "ms max=" max "ms"
    else   print "Post-pause: no rows after slot " ps
  }' block_data.csv

# Attestation count distribution
awk -F',' '{print $3}' block_data.csv | sort -n | uniq -c | sort -rn | head -10
```
117+
118+
## Test Scenarios
119+
120+
### Scenario 1: Measure Signature Verification Scaling
121+
122+
**Goal:** Measure how block processing time scales with attestation count.
123+
124+
1. Start a 4-node devnet, let it stabilize (~20 slots)
125+
2. Record the pause slot: `PAUSE_SLOT=<current_slot>`
126+
3. Pause 2 non-aggregator nodes
127+
4. Wait for attestation backlog to build (100+ slots)
128+
5. Extract data and plot attestation count vs block processing time
129+
6. Unpause nodes, observe recovery
130+
131+
**Expected results (sequential verification):**
132+
- Pre-pause: ~90ms median, 0-6 attestations per block
133+
- Post-pause: ~1,400ms median, 36 attestations per block
134+
- Linear relationship between attestation count and processing time
135+
136+
**Expected results (parallel verification with rayon):**
137+
- Pre-pause: ~65ms median, 0-6 attestations per block
138+
- Post-pause: ~290ms median, 36 attestations per block
139+
- ~4.8x speedup on an 8-core machine
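The expected results are stated as medians, which can be computed from `block_data.csv` (column 4, `block_total_ms`). The following is a minimal sketch; `median_block_ms` is a hypothetical helper, and it assumes the CSV layout described in the extraction section.

```bash
# Hypothetical helper: median block_total_ms before or after the pause slot.
#   $1 = CSV path, $2 = pause slot, $3 = "pre" (slot <= pause) or "post" (slot > pause)
median_block_ms() {
  awk -F',' -v ps="$2" -v phase="$3" '
    (phase == "post" && $2 > ps) || (phase == "pre" && $2 <= ps) { print $4 }
  ' "$1" | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) { print "no rows" > "/dev/stderr"; exit 1 }
      if (NR % 2) print v[(NR + 1) / 2]             # odd count: middle value
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2    # even count: mean of the two middle values
    }'
}
```

Example: `median_block_ms block_data.csv "$PAUSE_SLOT" post`. The speedup is the ratio of the post-pause medians of the two runs, for instance 1,400 ms / 290 ms ≈ 4.8.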
140+
141+
### Scenario 2: Finalization Stall and Recovery
142+
143+
**Goal:** Verify that finalization resumes after paused validators rejoin.
144+
145+
1. Start a 4-node devnet, wait for finalization to start advancing (slot ~10+)
146+
2. Note the current finalized slot
147+
3. Pause 2 nodes
148+
4. Confirm finalization stalls (finalized slot stops advancing for 50+ slots)
149+
5. Unpause both nodes simultaneously
150+
6. Verify finalization resumes within ~20 slots
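Steps 3 to 5 can be scripted so that both nodes are unpaused in a single `docker unpause` call. This is a sketch only: `stall_and_recover` is a hypothetical helper, and because the slot duration is not stated here, the stall length is given in seconds and must be derived from your devnet's configured slot time.

```bash
# Hypothetical driver for the pause / wait / unpause part of this scenario.
#   $1 = stall duration in seconds, remaining args = containers to pause
stall_and_recover() {
  local stall_secs="$1"
  shift
  docker pause "$@" >/dev/null || return 1
  echo "paused $*; waiting ${stall_secs}s"
  sleep "$stall_secs"
  # One unpause call resumes all listed containers together
  docker unpause "$@" >/dev/null || return 1
  echo "unpaused $*"
}
```

Example: `stall_and_recover 300 ethlambda_0 ethlambda_1`, followed by the finalization check from the observable metrics section.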
151+
152+
### Scenario 3: Aggregator Failure
153+
154+
**Goal:** Observe the effect of losing the aggregator.
155+
156+
1. Start a 4-node devnet, confirm blocks include aggregated attestations (`attestation_count > 0`)
157+
2. Pause the aggregator (`ethlambda_3`)
158+
3. Observe: blocks are produced with `attestation_count=0`, finalization stalls immediately
159+
4. Unpause the aggregator
160+
5. Verify aggregation resumes and finalization recovers
161+
162+
**Note:** This is more severe than pausing non-aggregators because no attestation proofs are produced at all, not just a supermajority loss.
163+
164+
## Important Notes
165+
166+
### docker pause vs docker stop
167+
168+
| | `docker pause` | `docker stop` |
|---|---|---|
| Process state | Frozen (cgroup freezer on Linux; no signal is delivered to the process) | Terminated (SIGTERM, then SIGKILL after the grace period) |
| Container state | Still "Up" | Exited |
| Data preserved | Yes | Yes (if volume-mounted) |
| Recovery | `docker unpause` (instant) | `docker start` (full restart, needs checkpoint sync) |
| Gossipsub mesh | Peers detect timeout after ~30s | Peers detect disconnect immediately |
| Use case | Simulate temporary network partition | Simulate node crash |
176+
177+
**Prefer `docker pause`** for instability testing because:
178+
- Recovery is instant (no re-peering, no checkpoint sync needed)
179+
- The paused node's state is exactly preserved
180+
- Simulates a network partition more accurately than a crash
181+
182+
### Never pause the aggregator unless testing aggregator failure
183+
184+
Without the aggregator, blocks contain zero attestation proofs. This is a different failure mode than losing non-aggregator validators. For signature verification benchmarking, always keep the aggregator running.
185+
186+
### Supermajority thresholds
187+
188+
| Validators | Supermajority (3/4) | Max paused for finality |
|-----------|--------------------|-----------------------|
| 4 | 3 | 1 |
| 6 | 5 | 1 |
| 8 | 6 | 2 |
| 12 | 9 | 3 |
194+
195+
Pausing 2 of 4 nodes guarantees non-finality. Pausing 1 of 4 still allows finalization (3/4 supermajority met).
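For validator counts not listed in the table, the threshold can be computed as the ceiling of `n * 3 / 4`. The sketch below takes the 3/4 ratio from the table as given and lets the ratio be overridden; `supermajority` and `max_paused` are hypothetical helper names.

```bash
# Smallest validator count that meets a num/den supermajority of n,
# i.e. ceil(n * num / den), using integer arithmetic only.
# Defaults to the 3/4 ratio used in the table above.
supermajority() {
  local n="$1" num="${2:-3}" den="${3:-4}"
  echo $(( (n * num + den - 1) / den ))
}

# Largest number of validators that can be paused while finality is still possible.
max_paused() {
  echo $(( $1 - $(supermajority "$@") ))
}
```

Example: `supermajority 6` prints `5` and `max_paused 6` prints `1`, matching the second table row.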

README.md

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -91,6 +91,19 @@ Additional features:
9191
- [leanMetrics](docs/metrics.md) support for monitoring and observability
9292
- [lean-quickstart](https://github.com/blockblaz/lean-quickstart) integration for easier devnet running
9393

94+
### Container Releases
95+
96+
Docker images are published to `ghcr.io/lambdaclass/ethlambda` with the following tags:
97+
98+
| Tag | Description |
|-----|-------------|
| `devnetX` | Stable image for a specific devnet (e.g. `devnet3`) |
| `latest` | Alias for the stable image of the currently running devnet |
| `unstable` | Development builds; promoted to `devnetX`/`latest` once tested |
| `sha-XXXXXXX` | Specific commit |
104+
105+
[`RELEASE.md`](./RELEASE.md) has more details on our release process and how to tag new images.
106+
94107
### pq-devnet-3
95108

96109
We are running the [pq-devnet-3 spec](https://github.com/leanEthereum/pm/blob/main/breakout-rooms/leanConsensus/pq-interop/pq-devnet-3.md). A Docker tag `devnet3` is available for this version.
@@ -112,3 +125,4 @@ Some features we are looking to implement in the near future, in order of priori
112125
- [Add support for pq-devnet-4](https://github.com/lambdaclass/ethlambda/issues/155)
113126
- [RPC endpoints for chain data consumption](https://github.com/lambdaclass/ethlambda/issues/75)
114127
- [Add guest program and ZK proving of the STF](https://github.com/lambdaclass/ethlambda/issues/156)
128+
- [Formal verification of the STF](https://github.com/lambdaclass/ethlambda/issues/272)

RELEASE.md

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -21,17 +21,17 @@ exact commit it was built from.
2121
On top of that, the workflow accepts a comma-separated list of custom tags as a
2222
parameter (e.g. `latest,devnet2`). We use the following tagging convention:
2323

24-
- `latest` - the latest image built from the `main` branch
25-
- `devnet2` - the latest image built with `devnet2` support
26-
- `devnet1` - *(deprecated)* `devnet1` support
24+
- `unstable` - the latest image built from the `main` branch, without any devnet-specific features
25+
- `latest` - the latest image built for the current devnet (`devnet3` at the time of writing)
26+
- `devnetX` - the latest image built with `devnetX` support (e.g. `devnet3`, `devnet4`)
2727

2828
Future devnets will introduce new tags, with previous ones left without updates.
2929

3030
### Pulling an image
3131

3232
```bash
33-
docker pull ghcr.io/lambdaclass/ethlambda:latest # latest from main
34-
docker pull ghcr.io/lambdaclass/ethlambda:devnet2 # devnet2-compatible
33+
docker pull ghcr.io/lambdaclass/ethlambda:unstable # latest from main
34+
docker pull ghcr.io/lambdaclass/ethlambda:devnet3 # devnet3-compatible
3535
docker pull ghcr.io/lambdaclass/ethlambda:sha-12f8377 # pinned to a specific commit
3636
```
3737
