Failover Script Updated

Professional route-based Dual-WAN failover for Ethereum validator with failback.

Key properties (all implemented in the script):

  • route-based failover (never brings NICs down) — avoids killing TCP connections

  • hysteresis (consecutive failure and recovery counters) to avoid flapping

  • fast failover (configurable) and delayed, stable failback (configurable)

  • idempotent operations (safe to re-run)

  • logging to syslog + log file

  • simple health checks on the primary link (ICMP bound to the interface + optional TCP port check)

  • small safety checks and useful systemctl / wg test commands at the end
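The hysteresis property is easiest to see in isolation. The sketch below is not part of the failover script: it replays a made-up sequence of check results through the same kind of consecutive-failure and consecutive-success counters. The function name simulate, the P/S output letters, and the shortened RECOVER_THRESHOLD=4 are assumptions chosen for the demo.

```shell
# Replay check results (1 = primary OK, 0 = primary failed) and print which
# uplink would be active after each check: P = primary, S = secondary.
FAIL_THRESHOLD=3
RECOVER_THRESHOLD=4   # shortened for the demo; the script default is 12

simulate() {
  local active="P" fails=0 oks=0 r
  for r in "$@"; do
    if [ "$r" -eq 1 ]; then
      fails=0
      oks=$((oks + 1))
      # fail back only after enough consecutive successes
      if [ "$active" = "S" ] && [ "$oks" -ge "$RECOVER_THRESHOLD" ]; then
        active="P"
        oks=0
      fi
    else
      oks=0
      fails=$((fails + 1))
      # fail over only after enough consecutive failures
      if [ "$active" = "P" ] && [ "$fails" -ge "$FAIL_THRESHOLD" ]; then
        active="S"
      fi
    fi
    printf '%s' "$active"
  done
  printf '\n'
}

simulate 1 0 0 1 0 0 0 1 1 1 1   # prints PPPPPPSSSSP
```

Two failures in a row are absorbed without a switch. The third consecutive failure moves traffic to the secondary, and the primary is restored only after four consecutive successes.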

Copy the script, edit the user variables at the top to match your environment (interfaces, gateway IPs, ping targets), install it, enable the systemd service and you’re done.

1. Deployment files you need

Failover script — /usr/local/sbin/dual-wan-failover.sh

Save the script below as root and make it executable (sudo nano /usr/local/sbin/dual-wan-failover.sh → paste → sudo chmod +x /usr/local/sbin/dual-wan-failover.sh).

#!/usr/bin/env bash
# dual-wan-failover.sh
# Professional route-based Dual-WAN failover for Ethereum validator
# - Does NOT bring interfaces down
# - Adds hysteresis for failover/failback
# - Replaces default route to preferred/backup gateway
# - Logs to syslog and /var/log/dual-wan-failover.log
#
# Edit the variables in the "USER CONFIG" block before running.

set -uo pipefail

LOGFILE="/var/log/dual-wan-failover.log"
exec 1>>"$LOGFILE" 2>&1

timestamp() { date "+%F %T"; }

# ------------------ USER CONFIG ------------------
# Interfaces
PRIMARY_IF="eth0"            # interface for WAN1 (static public IP)
SECONDARY_IF="eth1"          # interface for WAN2 (dynamic)

# Gateways (next-hop addresses on each WAN's LAN)
PRIMARY_GW="192.168.1.1"     # gateway for PRIMARY_IF
SECONDARY_GW="192.168.2.1"   # gateway for SECONDARY_IF

# Health check targets (public IPs or gateways). Use reliable IPs
# Prefer an external public IP (8.8.8.8) and/or the ISP gateway IP.
CHECK_TARGET="8.8.8.8"

# Health check behavior
PING_COUNT_PER_CHECK=1       # pings per check
PING_TIMEOUT=2               # seconds per ping

# Hysteresis
FAIL_THRESHOLD=3             # consecutive failed checks before failover
RECOVER_THRESHOLD=12         # consecutive successful checks before failback (e.g. 12*5s = 60s)

# Poll interval
SLEEP_INTERVAL=5             # seconds between checks

# Optional TCP check (e.g. test port 80/443) - leave blank to skip
TCP_CHECK_HOST=""
TCP_CHECK_PORT=""

# Failback mode: true = switch back to primary automatically once it has passed
# RECOVER_THRESHOLD consecutive checks; false = require manual failback
AUTO_FAILBACK=true
# -------------------------------------------------

# Validate environment
if ! command -v ip >/dev/null 2>&1; then
  echo "$(timestamp) ERROR: iproute2 (ip) required" >&2
  exit 1
fi
if ! command -v ping >/dev/null 2>&1; then
  echo "$(timestamp) ERROR: ping required" >&2
  exit 1
fi

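# helper: check target via interface with ping
ping_check() {
  local ifname="$1"
  local target="$2"
  # Pin a host route for the target via the primary gateway. Once the default
  # route points at the secondary WAN, "ping -I $PRIMARY_IF" has no route through
  # that interface, so the kernel typically tries the target on-link and the
  # check fails even when the primary is healthy (failback would never trigger).
  # Side effect: traffic to the target always leaves via the primary WAN.
  # Re-applied on every call because the route is dropped if the link loses carrier.
  if [ "$ifname" = "$PRIMARY_IF" ] && [ "$target" != "$PRIMARY_GW" ]; then
    ip route replace "$target" via "$PRIMARY_GW" dev "$ifname" 2>/dev/null || true
  fi
  # -I binds to the interface; if it has no IPv4 address yet, ping will fail
  ping -c "$PING_COUNT_PER_CHECK" -W "$PING_TIMEOUT" -I "$ifname" "$target" >/dev/null 2>&1
}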

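# Optional TCP connect check using bash's /dev/tcp, wrapped in timeout(1).
# /dev/tcp cannot bind to an interface, so the connection follows the routing
# table. If host is an IPv4 address it is pinned to the primary gateway here;
# a hostname cannot be pinned and is tested over the current default route.
tcp_check() {
  local host="$1" port="$2" timeout="${3:-2}"
  ip route replace "$host" via "$PRIMARY_GW" dev "$PRIMARY_IF" 2>/dev/null || true
  # pass host/port as arguments instead of interpolating them into the command string
  timeout "$timeout" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" >/dev/null 2>&1
}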

# Get current default nexthop (show only first default route)
get_current_default() {
  ip route show default 2>/dev/null | awk 'NR==1{print $3}'
}

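# Log wrapper: writes to the log file (stdout is redirected above) and to syslog
log() {
  echo "$(timestamp) - $*"
  logger -t dual-wan-failover -- "$*" 2>/dev/null || true
}

# Set default route (replace)
set_default_route() {
  local gw="$1" dev="$2"
  # replace default route (idempotent); report failure instead of assuming success
  if ip route replace default via "$gw" dev "$dev" proto static; then
    log "INFO: Default route set -> gw=$gw dev=$dev"
  else
    log "ERROR: failed to set default route -> gw=$gw dev=$dev"
  fi
}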

# initial counters
primary_fail_count=0
primary_ok_count=0

# On startup, set default to primary if available, else try secondary
if ping_check "$PRIMARY_IF" "$CHECK_TARGET"; then
  set_default_route "$PRIMARY_GW" "$PRIMARY_IF"
  log "Startup: primary reachable, using primary"
else
  set_default_route "$SECONDARY_GW" "$SECONDARY_IF"
  log "Startup: primary NOT reachable, using secondary"
fi

# main loop
while true; do
  # Check primary
  if ping_check "$PRIMARY_IF" "$CHECK_TARGET"; then
    # optional additional TCP check
    if [ -n "$TCP_CHECK_HOST" ] && [ -n "$TCP_CHECK_PORT" ]; then
      if tcp_check "$TCP_CHECK_HOST" "$TCP_CHECK_PORT" 2; then
        primary_ok=true
      else
        primary_ok=false
      fi
    else
      primary_ok=true
    fi
  else
    primary_ok=false
  fi

  if $primary_ok; then
    primary_fail_count=0
    primary_ok_count=$((primary_ok_count+1))
    # if default is on secondary and we've stabilized, switch back (only if AUTO_FAILBACK true)
    current_gw=$(get_current_default)
    if [ "$AUTO_FAILBACK" = true ] && [ "$current_gw" != "$PRIMARY_GW" ]; then
      if [ "$primary_ok_count" -ge "$RECOVER_THRESHOLD" ]; then
        log "Primary stabilized for $primary_ok_count checks -> switching default back to PRIMARY"
        set_default_route "$PRIMARY_GW" "$PRIMARY_IF"
        primary_ok_count=0
      else
        log "Primary OK (${primary_ok_count}/${RECOVER_THRESHOLD}) - waiting before failback"
      fi
    fi
  else
    primary_ok_count=0
    primary_fail_count=$((primary_fail_count+1))
    log "Primary NOT reachable (${primary_fail_count}/${FAIL_THRESHOLD})"
    # On threshold breach, switch to secondary
    if [ "$primary_fail_count" -ge "$FAIL_THRESHOLD" ]; then
      current_gw=$(get_current_default)
      if [ "$current_gw" != "$SECONDARY_GW" ]; then
        log "Primary down for $primary_fail_count checks -> switching default to SECONDARY"
        set_default_route "$SECONDARY_GW" "$SECONDARY_IF"
      else
        log "Already on SECONDARY"
      fi
    fi
  fi

  sleep "$SLEEP_INTERVAL"
done
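The systemctl commands in this guide expect a unit named dual-wan-failover.service, but the unit file itself is not shown on this page. The following is a minimal sketch of what such a unit could look like, not the original author's file: the Description text, the restart policy, and the helper name render_unit are assumptions. Only the service name and the script path come from this guide.

```shell
# Print a minimal systemd unit for the failover script (hypothetical sketch).
render_unit() {
  cat <<'EOF'
[Unit]
Description=Dual-WAN route-based failover
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/sbin/dual-wan-failover.sh
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
}

# Install (needs root):
#   render_unit | sudo tee /etc/systemd/system/dual-wan-failover.service >/dev/null
```

The script must run as root because it changes the routing table. Restart=always is one reasonable choice for a long-running loop; adjust it to your own policy.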

Once the systemd unit (dual-wan-failover.service) is installed, run:

sudo systemctl daemon-reload
sudo systemctl enable --now dual-wan-failover.service

2. What changed from the previous version of the script (why it’s safer)

  • Don’t bring interfaces down. The earlier script used ip link set <if> down, which kills sockets and can cause longer disruptions. This version replaces the default route instead.

  • Hysteresis: require N consecutive failures to failover and M consecutive successes to fail back. This prevents route flapping for flaky links.

  • Use ip route replace default via <gw> dev <if> — idempotent and fast.

  • Logging to /var/log/dual-wan-failover.log and syslog.

  • Optional TCP check (in case ICMP is filtered by ISP) — you can set TCP_CHECK_HOST and TCP_CHECK_PORT.

  • Auto-failback toggle — set AUTO_FAILBACK=false to require manual failback.

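Default timing values in the script:

  • SLEEP_INTERVAL=5 (check every 5s)

  • FAIL_THRESHOLD=3 (failover after 3 consecutive failed checks, roughly 15 to 21s: 3 × 5s of sleep plus up to 3 × 2s of ping timeouts)

  • RECOVER_THRESHOLD=12 (failback after ~60s of stable connectivity)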

This gives a reasonable trade-off: fast failover, stable failback.
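The timing figures can be worked out from the config values. This is a rough estimate only, assuming one ping per check (PING_COUNT_PER_CHECK=1) and ignoring round-trip time on successful pings; the variable names failover_min, failover_max, and failback are introduced here for illustration.

```shell
# Values copied from the USER CONFIG block
SLEEP_INTERVAL=5
PING_TIMEOUT=2
FAIL_THRESHOLD=3
RECOVER_THRESHOLD=12

# Lower bound: count only the sleeps between checks
failover_min=$(( FAIL_THRESHOLD * SLEEP_INTERVAL ))
# Upper bound: every failed ping also waits out its full timeout
failover_max=$(( FAIL_THRESHOLD * (SLEEP_INTERVAL + PING_TIMEOUT) ))
# Successful pings return quickly, so failback is dominated by the sleeps
failback=$(( RECOVER_THRESHOLD * SLEEP_INTERVAL ))

echo "failover: ${failover_min}-${failover_max}s, failback: ~${failback}s"
```

Lowering SLEEP_INTERVAL or FAIL_THRESHOLD shortens the failover window at the cost of reacting to brief packet loss. Raising RECOVER_THRESHOLD makes failback more conservative.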

4. Tests and verification

Start service & check logs

sudo systemctl start dual-wan-failover
sudo journalctl -fu dual-wan-failover -o cat
tail -f /var/log/dual-wan-failover.log

Force primary down (test)

Instead of unplugging the cable, you can simulate an unreachable primary by changing the ping target (CHECK_TARGET) to an address that does not answer, or by adding a temporary firewall rule that drops outgoing ICMP on the primary interface. If you must test a real failover, make sure you have physical access to the machine so you can restore the link.
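The firewall-rule approach could look like the sketch below. It assumes iptables is available and that the primary interface is eth0, as in the script defaults. The helper names block_primary_icmp and unblock_primary_icmp and the RUN variable are hypothetical. By default the helpers only print the commands; set RUN=sudo to execute them.

```shell
PRIMARY_IF="eth0"      # assumption: matches PRIMARY_IF in the script
RUN="${RUN:-echo}"     # default is a dry run; use RUN=sudo to apply for real

# Drop outgoing ICMP echo requests on the primary interface so the
# health check fails while the link itself stays up.
block_primary_icmp() {
  $RUN iptables -I OUTPUT -o "$PRIMARY_IF" -p icmp --icmp-type echo-request -j DROP
}

# Remove the rule again so the health check can recover.
unblock_primary_icmp() {
  $RUN iptables -D OUTPUT -o "$PRIMARY_IF" -p icmp --icmp-type echo-request -j DROP
}
```

After blocking, the log should show the failure counter rising to FAIL_THRESHOLD and then the switch to the secondary. After unblocking, failback should follow once RECOVER_THRESHOLD successful checks have passed.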

Check current default route

ip route show default
# expected output when primary active:
# default via 192.168.1.1 dev eth0 proto static
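
To check the active uplink from a script rather than by eye, the default route line can be compared against the two gateway addresses. A minimal sketch, assuming the gateway values from the USER CONFIG block; the function name active_wan is hypothetical.

```shell
PRIMARY_GW="192.168.1.1"
SECONDARY_GW="192.168.2.1"

# Read "ip route show default" output on stdin and classify the first
# default route by its gateway address.
active_wan() {
  local gw
  gw=$(awk '$1 == "default" && $2 == "via" { print $3; exit }')
  case "$gw" in
    "$PRIMARY_GW")   echo "primary" ;;
    "$SECONDARY_GW") echo "secondary" ;;
    "")              echo "none" ;;
    *)               echo "unknown ($gw)" ;;
  esac
}

# Usage on a live system:
#   ip route show default | active_wan
```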

Confirm validator outbound IP (if using VPS/WireGuard egress test)

If you route validator traffic via WireGuard, test:

curl --interface wg0 https://ifconfig.co

5. Manual failback (if you prefer)

If you set AUTO_FAILBACK=false at the top of the script, the service will not switch back automatically. To switch manually:

sudo ip route replace default via <PRIMARY_GW> dev <PRIMARY_IF>
sudo logger "Manual failback executed: default -> PRIMARY"

Or re-enable auto-failback and let it happen automatically.

6. Extra hardening tips for validator reliability

  • Keep Beacon & Execution clients on the same machine (or ensure both active ELs are synced) — do not run the same validator key on two machines.

  • Use a WireGuard tunnel to a static VPS (Hetzner) to provide a stable egress IP if you want zero apparent IP changes.

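  • Ensure rp_filter is off to avoid martian drops. The kernel uses the higher of the "all" and per-interface values, so set the WAN interfaces as well (eth0 and eth1 here, matching the script defaults):

sudo sysctl -w net.ipv4.conf.all.rp_filter=0
sudo sysctl -w net.ipv4.conf.default.rp_filter=0
sudo sysctl -w net.ipv4.conf.eth0.rp_filter=0
sudo sysctl -w net.ipv4.conf.eth1.rp_filter=0

  • Monitor beacon logs for missed attestations and set up alerts (email/Telegram).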

7. Is manual switch-back OK to avoid missed attestations?

Yes. Manual switch-back is fine and safer than immediate automatic flapping, because you can confirm the primary link is stable before switching back. The provided script supports controlled automatic failback (recommended) or manual failback (set AUTO_FAILBACK=false); both approaches are valid. Manual switching prevents unnecessary flaps and so reduces the chance of missed attestations.
