| 1 | +# Testing Network Instability and Non-Finality |
| 2 | + |
| 3 | +Simulate validator failures, observe consensus degradation, and measure recovery. Useful for benchmarking block processing under load, testing finalization stall behavior, and validating fixes like parallel signature verification. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The lean consensus protocol requires a supermajority (3 out of 4 validators in a 4-node devnet) to justify and finalize slots. Pausing containers with `docker pause` simulates sudden validator failures without destroying state, allowing clean recovery with `docker unpause`. |
| 8 | + |
| 9 | +## Prerequisites |
| 10 | + |
| 11 | +- A running devnet (local or long-lived) with 4 ethlambda nodes |
| 12 | +- Knowledge of which node is the aggregator (the one started with the `--is-aggregator` flag, typically `ethlambda_3`) |
| 13 | +- The aggregator kept running throughout (except when testing aggregator failure), since it aggregates attestation signatures into proofs |
| 14 | + |
| 15 | +## Quick Start: Induce Non-Finality |
| 16 | + |
| 17 | +```bash |
| 18 | +# 1. Verify all nodes are healthy |
| 19 | +docker ps --format 'table {{.Names}}\t{{.Status}}' | grep ethlambda |
| 20 | + |
| 21 | +# 2. Pause 2 non-aggregator nodes (causes loss of supermajority) |
| 22 | +docker pause ethlambda_0 ethlambda_1 |
| 23 | + |
| 24 | +# 3. Verify they're paused |
| 25 | +docker inspect ethlambda_0 --format '{{.State.Paused}}' |
| 26 | +# Should output: true |
| 27 | + |
| 28 | +# 4. Observe: finalization stalls, attestation backlog accumulates |
| 29 | +docker logs --tail 20 ethlambda_2 2>&1 | sed 's/\x1b\[[0-9;]*m//g' |
| 30 | + |
| 31 | +# 5. Recover: unpause both nodes |
| 32 | +docker unpause ethlambda_0 ethlambda_1 |
| 33 | +``` |
| 34 | + |
| 35 | +## What Happens When You Pause 2 of 4 Nodes |
| 36 | + |
| 37 | +### Immediate effects |
| 38 | +- Block production continues from the 2 active validators (slots assigned to paused validators are missed) |
| 39 | +- Attestation signatures from paused validators stop arriving |
| 40 | +- The aggregator can only aggregate attestations from 2 validators |
| 41 | + |
| 42 | +### Within ~20 slots |
| 43 | +- Justification stalls (it needs votes from 3 of 4 validators, but only 2 are attesting) |
| 44 | +- Finalized and justified slots stop advancing |
| 45 | +- Attestation target falls behind the head (nodes vote for the last justified checkpoint) |
| 46 | + |
| 47 | +### Steady state (50+ slots) |
| 48 | +- Blocks carry up to 36 attestations (backlog from all prior slots) |
| 49 | +- Block processing time increases proportionally with attestation count |
| 50 | +- Target-to-head gap grows linearly (~1 slot per slot) |
| 51 | + |
| 52 | +### Observable metrics |
| 53 | + |
| 54 | +```bash |
| 55 | +# Check finalization progress (should be stuck) |
| 56 | +docker logs --tail 20 ethlambda_2 2>&1 | \ |
| 57 | + sed 's/\x1b\[[0-9;]*m//g' | grep 'Fork Choice Tree' -A 6 | tail -8 |
| 58 | + |
| 59 | +# Check attestation target gap (should be growing) |
| 60 | +docker logs ethlambda_2 2>&1 | \ |
| 61 | + sed 's/\x1b\[[0-9;]*m//g' | grep 'Published attestation' | \ |
| 62 | + sed 's/.*slot=\([0-9]*\).*target_slot=\([0-9]*\).*/\1 \2/' | \ |
| 63 | + awk 'NF==2 {print "slot=" $1 " target=" $2 " gap=" $1-$2}' | tail -10 |
| 64 | +``` |
| 65 | + |
| 66 | +## Extracting Block Processing Data |
| 67 | + |
| 68 | +Extract attestation count and block processing time from all nodes for analysis: |
| 69 | + |
| 70 | +```bash |
| 71 | +for c in ethlambda_0 ethlambda_1 ethlambda_2 ethlambda_3; do |
| 72 | +  awk -v node="$c" ' |
| 73 | +NR==FNR { |
| 74 | + if (/Received block from gossip|Published block to gossipsub/) { |
| 75 | + match($0, /slot=[0-9]+/); s=substr($0, RSTART+5, RLENGTH-5) |
| 76 | + match($0, /attestation_count=[0-9]+/); a=substr($0, RSTART+18, RLENGTH-18) |
| 77 | + att[s]=a |
| 78 | + } |
| 79 | + next |
| 80 | +} |
| 81 | +function to_ms(raw) { |
| 82 | + if (index(raw, "ms") > 0) { gsub(/ms/, "", raw); return raw+0 } |
| 83 | + if (index(raw, "µs") > 0) { gsub(/µs/, "", raw); return (raw+0)/1000 } |
| 84 | + gsub(/s/, "", raw); return (raw+0)*1000 |
| 85 | +} |
| 86 | +/Processed new block/ { |
| 87 | + match($0, /slot=[0-9]+/); s=substr($0, RSTART+5, RLENGTH-5) |
| 88 | + match($0, /block_total=[^ ]+/); bt_raw=substr($0, RSTART+12, RLENGTH-12) |
| 89 | + match($0, /sig_verification=[^ ]+/); sv_raw=substr($0, RSTART+17, RLENGTH-17) |
| 90 | + bt=to_ms(bt_raw); sv=to_ms(sv_raw) |
| 91 | + if (s in att) ac=att[s]; else ac=0 |
| 92 | + print node "," s "," ac "," bt "," sv |
| 93 | +} |
| 94 | +' <(docker logs "$c" 2>&1 | sed "s/\x1b\[[0-9;]*m//g") \ |
| 95 | + <(docker logs "$c" 2>&1 | sed "s/\x1b\[[0-9;]*m//g") |
| 96 | +done > block_data.csv |
| 97 | +``` |
| 98 | + |
| 99 | +Output CSV format: `node,slot,attestation_count,block_total_ms,sig_verification_ms` |
| 100 | + |
| 101 | +**Important:** The timing fields (`block_total`, `sig_verification`) use mixed units (`ms` for milliseconds, `s` for seconds, `µs` for microseconds). The awk `to_ms` function above normalizes everything to milliseconds. |
| 102 | + |
| 103 | +## Quick Stats from Extracted Data |
| 104 | + |
| 105 | +```bash |
| 106 | +# Max block processing time |
| 107 | +awk -F',' '{if($4>max){max=$4; line=$0}} END{print "MAX:", line}' block_data.csv |
| 108 | + |
| 109 | +# Post-pause stats (replace 50 with your pause slot) |
| 110 | +PAUSE_SLOT=50 |
| 111 | +awk -F',' -v ps="$PAUSE_SLOT" '$2>ps {sum+=$4; n++; if($4>max)max=$4} |
| 112 | +  END{if(n) print "Post-pause: n=" n " avg=" sum/n "ms max=" max "ms"; else print "Post-pause: no rows after slot " ps}' block_data.csv |
| 113 | + |
| 114 | +# Attestation count distribution |
| 115 | +awk -F',' '{print $3}' block_data.csv | sort -n | uniq -c | sort -rn | head -10 |
| 116 | +``` |
| 117 | + |
| 118 | +## Test Scenarios |
| 119 | + |
| 120 | +### Scenario 1: Measure Signature Verification Scaling |
| 121 | + |
| 122 | +**Goal:** Measure how block processing time scales with attestation count. |
| 123 | + |
| 124 | +1. Start a 4-node devnet, let it stabilize (~20 slots) |
| 125 | +2. Record the pause slot: `PAUSE_SLOT=<current_slot>` |
| 126 | +3. Pause 2 non-aggregator nodes |
| 127 | +4. Wait for attestation backlog to build (100+ slots) |
| 128 | +5. Extract data and plot attestation count vs block processing time |
| 129 | +6. Unpause nodes, observe recovery |
| 130 | + |
| 131 | +**Expected results (sequential verification):** |
| 132 | +- Pre-pause: ~90ms median, 0-6 attestations per block |
| 133 | +- Post-pause: ~1,400ms median, 36 attestations per block |
| 134 | +- Linear relationship between attestation count and processing time |
| 135 | + |
| 136 | +**Expected results (parallel verification with rayon):** |
| 137 | +- Pre-pause: ~65ms median, 0-6 attestations per block |
| 138 | +- Post-pause: ~290ms median, 36 attestations per block |
| 139 | +- ~4.8x speedup on an 8-core machine |
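
The medians quoted above could be reproduced from `block_data.csv` with something like the following. This is a minimal sketch, assuming the CSV layout described earlier (`node,slot,attestation_count,block_total_ms,sig_verification_ms`); `phase_median` is a hypothetical helper name, not part of ethlambda.

```bash
# Hypothetical helper: median of block_total_ms (column 4) before or after a pause slot.
# $1 = CSV file, $2 = pause slot, $3 = "pre" (slot <= pause) or "post" (slot > pause)
phase_median() {
  awk -F',' -v ps="$2" -v phase="$3" '
    (phase == "pre" && $2 <= ps) || (phase == "post" && $2 > ps) { print $4 }
  ' "$1" | sort -n | awk '
    { v[NR] = $1 }                       # values arrive sorted ascending
    END {
      if (NR == 0) { print "n/a"; exit } # no rows in this phase
      if (NR % 2) print v[(NR + 1) / 2]  # odd count: middle value
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}
```

For example, `phase_median block_data.csv "$PAUSE_SLOT" post` prints the post-pause median in milliseconds, and dividing the sequential result by the parallel one gives the speedup figure.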
| 140 | + |
| 141 | +### Scenario 2: Finalization Stall and Recovery |
| 142 | + |
| 143 | +**Goal:** Verify that finalization resumes after paused validators rejoin. |
| 144 | + |
| 145 | +1. Start a 4-node devnet, wait for finalization to start advancing (slot ~10+) |
| 146 | +2. Note the current finalized slot |
| 147 | +3. Pause 2 nodes |
| 148 | +4. Confirm finalization stalls (finalized slot stops advancing for 50+ slots) |
| 149 | +5. Unpause both nodes simultaneously |
| 150 | +6. Verify finalization resumes within ~20 slots |
| 151 | + |
| 152 | +### Scenario 3: Aggregator Failure |
| 153 | + |
| 154 | +**Goal:** Observe the effect of losing the aggregator. |
| 155 | + |
| 156 | +1. Start a 4-node devnet, confirm blocks include aggregated attestations (`attestation_count > 0`) |
| 157 | +2. Pause the aggregator (`ethlambda_3`) |
| 158 | +3. Observe: blocks are produced with `attestation_count=0`, finalization stalls immediately |
| 159 | +4. Unpause the aggregator |
| 160 | +5. Verify aggregation resumes and finalization recovers |
| 161 | + |
| 162 | +**Note:** This is more severe than pausing non-aggregators because no attestation proofs are produced at all, not just a supermajority loss. |
| 163 | + |
| 164 | +## Important Notes |
| 165 | + |
| 166 | +### docker pause vs docker stop |
| 167 | + |
| 168 | +| | `docker pause` | `docker stop` | |
| 169 | +|---|---|---| |
| 170 | +| Process state | Frozen (cgroup freezer) | Terminated (SIGTERM, then SIGKILL after the grace period) | |
| 171 | +| Container state | Still "Up" | Exited | |
| 172 | +| Data preserved | Yes | Yes (if volume-mounted) | |
| 173 | +| Recovery | `docker unpause` (instant) | `docker start` (full restart, needs checkpoint sync) | |
| 174 | +| Gossipsub mesh | Peers detect timeout after ~30s | Peers detect disconnect immediately | |
| 175 | +| Use case | Simulate temporary network partition | Simulate node crash | |
| 176 | + |
| 177 | +**Prefer `docker pause`** for instability testing because: |
| 178 | +- Recovery is instant (no re-peering, no checkpoint sync needed) |
| 179 | +- The paused node's state is exactly preserved |
| 180 | +- Simulates a network partition more accurately than a crash |
| 181 | + |
| 182 | +### Never pause the aggregator unless testing aggregator failure |
| 183 | + |
| 184 | +Without the aggregator, blocks contain zero attestation proofs. This is a different failure mode than losing non-aggregator validators. For signature verification benchmarking, always keep the aggregator running. |
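
One way to enforce this rule in test scripts is a small wrapper around `docker pause`. The sketch below is illustrative only: `safe_pause`, `AGGREGATOR`, and `ALLOW_AGGREGATOR_PAUSE` are hypothetical names, and the default of `ethlambda_3` is just the typical aggregator name mentioned in the prerequisites.

```bash
# Assumption: the aggregator container name; override via the environment if yours differs.
AGGREGATOR="${AGGREGATOR:-ethlambda_3}"

# Hypothetical wrapper: refuse to pause the aggregator unless explicitly allowed.
safe_pause() {
  local node
  for node in "$@"; do
    if [ "$node" = "$AGGREGATOR" ] && [ "${ALLOW_AGGREGATOR_PAUSE:-0}" != "1" ]; then
      echo "refusing to pause aggregator $node (set ALLOW_AGGREGATOR_PAUSE=1 to override)" >&2
      return 1
    fi
  done
  # All names passed the check; pause them in a single call.
  docker pause "$@"
}
```

Usage: `safe_pause ethlambda_0 ethlambda_1` for the benchmarking scenarios, and `ALLOW_AGGREGATOR_PAUSE=1 safe_pause ethlambda_3` for Scenario 3.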
| 185 | + |
| 186 | +### Supermajority thresholds |
| 187 | + |
| 188 | +| Validators | Supermajority (3/4) | Max paused for finality | |
| 189 | +|-----------|--------------------|-----------------------| |
| 190 | +| 4 | 3 | 1 | |
| 191 | +| 6 | 5 | 1 | |
| 192 | +| 8 | 6 | 2 | |
| 193 | +| 12 | 9 | 3 | |
| 194 | + |
| 195 | +Pausing 2 of 4 nodes guarantees non-finality. Pausing 1 of 4 still allows finalization (3/4 supermajority met). |
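
For devnet sizes not listed in the table, the same numbers can be derived with integer arithmetic. This is a minimal sketch, assuming the 3/4 threshold stated in the table header; `supermajority` and `max_paused` are hypothetical helper names, and the fraction can be overridden through the optional arguments.

```bash
# Smallest vote count that reaches a num/den supermajority of n validators,
# i.e. ceil(n * num / den). Defaults to the 3/4 threshold from the table.
supermajority() {
  local n=$1 num=${2:-3} den=${3:-4}
  echo $(( (n * num + den - 1) / den ))
}

# Largest number of validators that can be paused while finality is still possible.
max_paused() {
  local n=$1
  echo $(( n - $(supermajority "$@") ))
}

for n in 4 6 8 12; do
  printf '%2d validators: need %d, can pause %d\n' \
    "$n" "$(supermajority "$n")" "$(max_paused "$n")"
done
```

Pausing one more validator than `max_paused` reports is the smallest change that guarantees a finalization stall for that devnet size.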