Failover Script Updated
Professional route-based Dual-WAN failover for Ethereum validator with failback.
Key properties (as implemented in the script):
route-based failover (never brings NICs down) — avoids killing TCP connections
hysteresis (consecutive failure and recovery counters) to avoid flapping
fast failover (configurable) and delayed, stable failback (configurable)
idempotent operations (safe to re-run)
logging to syslog + log file
simple health checks bound to each interface (ICMP + optional TCP port check)
small safety checks and useful systemctl/WireGuard test commands at the end
Copy the script, edit the user variables at the top to match your environment (interfaces, gateway IPs, ping targets), install it, enable the systemd service and you’re done.
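The hysteresis described in the list above can be tried out without touching any routes. The following is a minimal sketch that replays a fixed sequence of check results through the same two counters; the `simulate` function and the shortened recovery threshold are illustrative assumptions, not part of the script.

```shell
#!/usr/bin/env bash
# Simulate the failover/failback counters on a fixed sequence of check results.
# 1 = primary check OK, 0 = primary check failed. No routes are touched.
FAIL_THRESHOLD=3
RECOVER_THRESHOLD=4   # shortened for the demo; the script defaults to 12

simulate() {
  local active="primary" fail=0 ok=0 result out=""
  for result in "$@"; do
    if [ "$result" = 1 ]; then
      fail=0; ok=$((ok + 1))
      # fail back only after enough consecutive successes
      if [ "$active" = "secondary" ] && [ "$ok" -ge "$RECOVER_THRESHOLD" ]; then
        active="primary"; ok=0
      fi
    else
      ok=0; fail=$((fail + 1))
      # fail over only after enough consecutive failures
      if [ "$fail" -ge "$FAIL_THRESHOLD" ]; then
        active="secondary"
      fi
    fi
    out+="${active:0:1}"   # p = primary active, s = secondary active
  done
  echo "$out"
}

# Two isolated failures do not trigger failover; three in a row do.
simulate 1 0 0 1 0 0 0 1 1 1 1   # prints ppppppssssp
```

Each output character shows which link would be active after that check, so a brief blip leaves the route alone while a sustained outage moves it.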
1. Deployment files you need
Failover script — /usr/local/sbin/dual-wan-failover.sh
Save this file exactly as shown, as root, and make it executable (sudo nano /usr/local/sbin/dual-wan-failover.sh → paste → sudo chmod +x /usr/local/sbin/dual-wan-failover.sh).
#!/usr/bin/env bash
# dual-wan-failover.sh
# Professional route-based Dual-WAN failover for Ethereum validator
# - Does NOT bring interfaces down
# - Adds hysteresis for failover/failback
# - Replaces default route to preferred/backup gateway
# - Logs to syslog and /var/log/dual-wan-failover.log
#
# Edit the variables in the "USER CONFIG" block before running.
set -uo pipefail
LOGFILE="/var/log/dual-wan-failover.log"
exec 1>>"$LOGFILE" 2>&1
timestamp() { date "+%F %T"; }
# ------------------ USER CONFIG ------------------
# Interfaces
PRIMARY_IF="eth0" # interface for WAN1 (static public IP)
SECONDARY_IF="eth1" # interface for WAN2 (dynamic)
# Gateways (next-hop addresses on each WAN's LAN)
PRIMARY_GW="192.168.1.1" # gateway for PRIMARY_IF
SECONDARY_GW="192.168.2.1" # gateway for SECONDARY_IF
# Health check targets (public IPs or gateways). Use reliable IPs
# Prefer an external public IP (8.8.8.8) and/or the ISP gateway IP.
CHECK_TARGET="8.8.8.8"
# Health check behavior
PING_COUNT_PER_CHECK=1 # pings per check
PING_TIMEOUT=2 # seconds per ping
# Hysteresis
FAIL_THRESHOLD=3 # consecutive failed checks before failover
RECOVER_THRESHOLD=12 # consecutive successful checks before failback (e.g. 12*5s = 60s)
# Poll interval
SLEEP_INTERVAL=5 # seconds between checks
# Optional TCP check (e.g. test port 80/443) - leave blank to skip
TCP_CHECK_HOST=""
TCP_CHECK_PORT=""
# Safety: true = fail back automatically once primary is stable; set to false to require manual failback
AUTO_FAILBACK=true
# -------------------------------------------------
# Validate environment
if ! command -v ip >/dev/null 2>&1; then
  echo "$(timestamp) ERROR: iproute2 (ip) required" >&2
  exit 1
fi
if ! command -v ping >/dev/null 2>&1; then
  echo "$(timestamp) ERROR: ping required" >&2
  exit 1
fi
# helper: check target via interface with ping
ping_check() {
  local ifname="$1"
  local target="$2"
  # -I binds ping to the named interface; if it has no IPv4 address yet, ping fails
  ping -c "$PING_COUNT_PER_CHECK" -W "$PING_TIMEOUT" -I "$ifname" "$target" >/dev/null 2>&1
}
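# Optional TCP connect check using bash's /dev/tcp redirection, wrapped in timeout
# Note: this connection follows the current default route; it is not bound to an interface.
tcp_check() {
  local host="$1" port="$2" timeout="${3:-2}"
  # A blocked connect can hang for a long time, so cap it with timeout.
  # Host and port are passed as arguments rather than interpolated into the command string.
  timeout "$timeout" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" >/dev/null 2>&1
}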
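# Get current default nexthop (show only first default route)
get_current_default() {
  ip route show default 2>/dev/null | awk 'NR==1{print $3}'
}
# Set default route (replace)
set_default_route() {
  local gw="$1" dev="$2"
  # replace default route (idempotent); report failure instead of logging a false success
  if ip route replace default via "$gw" dev "$dev" proto static; then
    log "INFO: Default route set -> gw=$gw dev=$dev"
  else
    log "ERROR: could not set default route -> gw=$gw dev=$dev"
    return 1
  fi
}
# Log wrapper: writes to the log file (stdout is redirected above) and to syslog
log() {
  echo "$(timestamp) - $*"
  logger -t dual-wan-failover -- "$*" 2>/dev/null || true
}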
# initial counters
primary_fail_count=0
primary_ok_count=0
# On startup, set default to primary if available, else try secondary
if ping_check "$PRIMARY_IF" "$CHECK_TARGET"; then
  set_default_route "$PRIMARY_GW" "$PRIMARY_IF"
  log "Startup: primary reachable, using primary"
else
  set_default_route "$SECONDARY_GW" "$SECONDARY_IF"
  log "Startup: primary NOT reachable, using secondary"
fi
# main loop
while true; do
  # Check primary
  if ping_check "$PRIMARY_IF" "$CHECK_TARGET"; then
    # optional additional TCP check
    if [ -n "$TCP_CHECK_HOST" ] && [ -n "$TCP_CHECK_PORT" ]; then
      if tcp_check "$TCP_CHECK_HOST" "$TCP_CHECK_PORT" 2; then
        primary_ok=true
      else
        primary_ok=false
      fi
    else
      primary_ok=true
    fi
  else
    primary_ok=false
  fi
  if $primary_ok; then
    primary_fail_count=0
    primary_ok_count=$((primary_ok_count+1))
    # if default is on secondary and we've stabilized, switch back (only if AUTO_FAILBACK true)
    current_gw=$(get_current_default)
    if [ "$AUTO_FAILBACK" = true ] && [ "$current_gw" != "$PRIMARY_GW" ]; then
      if [ "$primary_ok_count" -ge "$RECOVER_THRESHOLD" ]; then
        log "Primary stabilized for $primary_ok_count checks -> switching default back to PRIMARY"
        set_default_route "$PRIMARY_GW" "$PRIMARY_IF"
        primary_ok_count=0
      else
        log "Primary OK (${primary_ok_count}/${RECOVER_THRESHOLD}) - waiting before failback"
      fi
    fi
  else
    primary_ok_count=0
    primary_fail_count=$((primary_fail_count+1))
    log "Primary NOT reachable (${primary_fail_count}/${FAIL_THRESHOLD})"
    # On threshold breach, switch to secondary
    if [ "$primary_fail_count" -ge "$FAIL_THRESHOLD" ]; then
      current_gw=$(get_current_default)
      if [ "$current_gw" != "$SECONDARY_GW" ]; then
        log "Primary down for $primary_fail_count checks -> switching default to SECONDARY"
        set_default_route "$SECONDARY_GW" "$SECONDARY_IF"
      else
        log "Already on SECONDARY"
      fi
    fi
  fi
  sleep "$SLEEP_INTERVAL"
done
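One caveat about the health check is worth verifying on your own system. While the default route points at the secondary gateway, a ping bound to the primary interface with -I may have no usable route to the check target, so the primary check could keep failing even after the link recovers, and failback would never fire. A common workaround, shown here as a sketch and not as part of the script above, is to pin the check target to the primary gateway with a /32 host route. The `pin_check_route` function and the `DRY_RUN` switch are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Sketch: pin the health-check target to the primary WAN with a /32 host route,
# so "ping -I $PRIMARY_IF" still has a route while the default points at the secondary.
PRIMARY_IF="eth0"
PRIMARY_GW="192.168.1.1"
CHECK_TARGET="8.8.8.8"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = print only, 0 = apply (needs root)

pin_check_route() {
  local cmd=(ip route replace "$CHECK_TARGET/32" via "$PRIMARY_GW" dev "$PRIMARY_IF")
  if [ "$DRY_RUN" = 1 ]; then
    echo "${cmd[*]}"      # show what would be run
  else
    "${cmd[@]}"           # idempotent, same as the default-route replace
  fi
}

pin_check_route
```

With such a route in place, all traffic to the check target leaves via the primary WAN regardless of the default route, so it is safer to pick a target the machine does not otherwise depend on (for example, not its DNS resolver).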
Systemd unit — /etc/systemd/system/dual-wan-failover.service
Create the unit file and enable (save as root):
[Unit]
Description=Dual WAN Failover (route-based) for Ethereum validator
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/sbin/dual-wan-failover.sh
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Then run:
sudo systemctl daemon-reload
sudo systemctl enable --now dual-wan-failover.service
2. What changed from the previous version of the script (why it’s safer)
Don’t bring interfaces down. The earlier script used ip link set <if> down, which kills sockets and may cause longer disruptions. This version replaces the default route instead.
Hysteresis: require N consecutive failures to fail over and M consecutive successes to fail back. This prevents route flapping on flaky links.
Use ip route replace default via <gw> dev <if>, which is idempotent and fast.
Logging to /var/log/dual-wan-failover.log and syslog.
Optional TCP check (in case ICMP is filtered by the ISP): you can set TCP_CHECK_HOST and TCP_CHECK_PORT.
Auto-failback toggle: set AUTO_FAILBACK=false to require manual failback.
3. Recommended parameter values (tweak if needed)
SLEEP_INTERVAL=5 (check every 5s)
FAIL_THRESHOLD=3 (failover in about 15s if ping fails)
RECOVER_THRESHOLD=12 (recover after ~60s of stable connectivity)
This gives a reasonable trade-off: fast failover, stable failback.
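The timing figures above follow from simple arithmetic on the config values. The sketch below works it out, assuming each failed check can block for up to PING_TIMEOUT before the loop sleeps; the two helper functions are illustrative and not part of the script.

```shell
#!/usr/bin/env bash
# Rough timing estimate for the hysteresis settings (illustrative helper only).
SLEEP_INTERVAL=5
FAIL_THRESHOLD=3
RECOVER_THRESHOLD=12
PING_TIMEOUT=2

# Worst case from the start of the first failed check to failover:
# every failed check may wait PING_TIMEOUT, with a sleep between checks.
failover_seconds() {
  echo $(( FAIL_THRESHOLD * PING_TIMEOUT + (FAIL_THRESHOLD - 1) * SLEEP_INTERVAL ))
}

# Approximate stable time needed before failback, ignoring ping round-trip time.
failback_seconds() {
  echo $(( RECOVER_THRESHOLD * SLEEP_INTERVAL ))
}

echo "failover after ~$(failover_seconds)s, failback after ~$(failback_seconds)s"
# prints: failover after ~16s, failback after ~60s
```

Changing SLEEP_INTERVAL scales both numbers, so a shorter interval speeds up failover but also shortens the stability window unless RECOVER_THRESHOLD is raised to compensate.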
4. Tests and verification
Start service & check logs
sudo systemctl start dual-wan-failover
sudo journalctl -fu dual-wan-failover -o cat
tail -f /var/log/dual-wan-failover.log
Force primary down (test)
Instead of actually disabling your cable, you can simulate unreachable gateway by replacing the ping target or by adding a temporary firewall rule that drops ICMP from the primary interface. If you must test real failover, ensure you are physically available to re-enable.
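The firewall-rule approach mentioned above could look like the following sketch, which assumes iptables is available and uses the interface and target from the USER CONFIG block. The `icmp_drop_rule` function and the `DRY_RUN` switch are hypothetical names; by default the block only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: temporarily drop outbound ICMP echo requests to the check target
# on the primary interface, so the health check fails without unplugging anything.
PRIMARY_IF="eth0"
CHECK_TARGET="8.8.8.8"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = print only, 0 = apply (needs root)

# $1 is the iptables action: -I to insert the rule, -D to delete it again.
icmp_drop_rule() {
  local cmd=(iptables "$1" OUTPUT -o "$PRIMARY_IF" -p icmp --icmp-type echo-request -d "$CHECK_TARGET" -j DROP)
  if [ "$DRY_RUN" = 1 ]; then echo "${cmd[*]}"; else "${cmd[@]}"; fi
}

icmp_drop_rule -I   # start the simulated outage
# ... watch the log until the switch to SECONDARY appears ...
icmp_drop_rule -D   # end the simulated outage and let failback proceed
```

Because the rule only matches echo requests to the check target on one interface, other traffic is unaffected, and deleting the rule restores the health check immediately.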
Check current default route
ip route show default
# expected output when primary active:
# default via 192.168.1.1 dev eth0 proto static
Confirm validator outbound IP (if using VPS/WireGuard egress test)
If you route validator traffic via WireGuard, test:
curl --interface wg0 https://ifconfig.co
5. Manual failback (if you prefer)
If you set AUTO_FAILBACK=false at the top of the script, the service will not automatically switch back. To switch manually:
sudo ip route replace default via <PRIMARY_GW> dev <PRIMARY_IF>
sudo logger "Manual failback executed: default -> PRIMARY"
Or re-enable auto-failback and let it happen automatically.
6. Extra hardening tips for validator reliability
Keep Beacon & Execution clients on the same machine (or ensure both active ELs are synced) — do not run the same validator key on two machines.
Use a WireGuard tunnel to a static VPS (Hetzner) to provide a stable egress IP if you want zero apparent IP changes.
Ensure rp_filter is off to avoid martian drops:
sudo sysctl -w net.ipv4.conf.all.rp_filter=0
sudo sysctl -w net.ipv4.conf.default.rp_filter=0
Monitor beacon logs for missed attestations and set up alerts (email/telegram).
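On the rp_filter tip above: the kernel's ip-sysctl documentation states that the value applied to an interface is the maximum of conf/all/rp_filter and conf/<interface>/rp_filter, so an interface that already exists may stay in strict mode even after the two commands shown. The read-only sketch below reports the effective value per interface; the `effective_rp_filter` helper is an illustrative name, not an existing tool.

```shell
#!/usr/bin/env bash
# Report the effective rp_filter mode per interface (read-only; no root needed).
# Effective value = max(conf/all/rp_filter, conf/<interface>/rp_filter).
effective_rp_filter() {
  local all="$1" iface="$2"
  if [ "$all" -gt "$iface" ]; then echo "$all"; else echo "$iface"; fi
}

CONF_DIR="/proc/sys/net/ipv4/conf"
if [ -r "$CONF_DIR/all/rp_filter" ]; then
  all_val=$(cat "$CONF_DIR/all/rp_filter")
  for dir in "$CONF_DIR"/*/; do
    name=$(basename "$dir")
    case "$name" in all|default) continue ;; esac   # skip the two pseudo-entries
    echo "$name: effective rp_filter=$(effective_rp_filter "$all_val" "$(cat "$dir/rp_filter")")"
  done
fi
```

If a WAN interface still reports a non-zero value, setting it explicitly (for example sudo sysctl -w net.ipv4.conf.eth0.rp_filter=0, using your own interface name) should bring it in line with the tip.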
7. Is manual switch-back OK to avoid missed attestations?
Yes. Manual switch-back is fine and safer than immediate automatic flapping, because you can confirm the primary link is stable before switching back. The provided script supports automatic controlled failback (recommended) or manual failback (set AUTO_FAILBACK=false); both approaches are valid. Manual switching prevents unnecessary flaps and so reduces the chance of missed attestations.