# Failover Script Updated

#### Key properties (implemented in the script):

* **route-based failover (never brings NICs down)** — avoids killing TCP connections
* **hysteresis** (consecutive failure and recovery counters) to avoid flapping
* **fast failover** (configurable) and **delayed, stable failback** (configurable)
* **idempotent operations** (safe to re-run)
* **logging** to syslog + log file
* simple health checks on the primary link (ICMP bound to the interface, plus an optional TCP port check)
* small safety checks and useful `systemctl` / `wg` test commands at the end

Copy the script, edit the user variables at the top to match your environment (interfaces, gateway IPs, ping targets), install it, enable the systemd service and you’re done.
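Before installing, it can help to sanity-check the values you edited. The following is a minimal sketch and not part of the failover script: `is_ipv4`, `iface_exists` and `preflight` are hypothetical helper names, and the example values simply mirror the defaults from the USER CONFIG block.

```bash
#!/usr/bin/env bash
# Pre-flight sketch: validate the values you put in the USER CONFIG block.
# is_ipv4, iface_exists and preflight are illustrative helpers only.

# Return 0 if $1 is a dotted-quad IPv4 address with every octet in 0-255.
is_ipv4() {
  local ip="$1" octet
  [[ "$ip" =~ ^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$ ]] || return 1
  for octet in "${BASH_REMATCH[@]:1}"; do
    (( 10#$octet <= 255 )) || return 1
  done
}

# Return 0 if the kernel knows a network interface named $1.
iface_exists() {
  [ -e "/sys/class/net/$1" ]
}

# Check all configured addresses and interfaces; report every problem found.
preflight() {
  local rc=0 addr ifname
  for addr in "$PRIMARY_GW" "$SECONDARY_GW" "$CHECK_TARGET"; do
    is_ipv4 "$addr" || { echo "invalid IPv4 address: $addr"; rc=1; }
  done
  for ifname in "$PRIMARY_IF" "$SECONDARY_IF"; do
    iface_exists "$ifname" || { echo "interface not found: $ifname"; rc=1; }
  done
  return "$rc"
}

# Example values (the script defaults); replace them with your own.
PRIMARY_IF="eth0"; SECONDARY_IF="eth1"
PRIMARY_GW="192.168.1.1"; SECONDARY_GW="192.168.2.1"; CHECK_TARGET="8.8.8.8"
preflight && echo "config looks sane"
```

On a machine whose interfaces are named differently (for example `enp3s0`), the sketch reports the missing names instead of printing the success line.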

#### 1. Deployment files you need

{% tabs %}
{% tab title="Failover Script " %}
Failover script — `/usr/local/sbin/dual-wan-failover.sh`

Save this file as root and make it executable (`sudo nano /usr/local/sbin/dual-wan-failover.sh` → paste → `sudo chmod +x /usr/local/sbin/dual-wan-failover.sh`).

```
#!/usr/bin/env bash
# dual-wan-failover.sh
# Professional route-based Dual-WAN failover for Ethereum validator
# - Does NOT bring interfaces down
# - Adds hysteresis for failover/failback
# - Replaces default route to preferred/backup gateway
# - Logs to syslog and /var/log/dual-wan-failover.log
#
# Edit the variables in the "USER CONFIG" block before running.

set -uo pipefail

LOGFILE="/var/log/dual-wan-failover.log"
exec 1>>"$LOGFILE" 2>&1

timestamp() { date "+%F %T"; }

# ------------------ USER CONFIG ------------------
# Interfaces
PRIMARY_IF="eth0"            # interface for WAN1 (static public IP)
SECONDARY_IF="eth1"          # interface for WAN2 (dynamic)

# Gateways (next-hop addresses on each WAN's LAN)
PRIMARY_GW="192.168.1.1"     # gateway for PRIMARY_IF
SECONDARY_GW="192.168.2.1"   # gateway for SECONDARY_IF

# Health check targets (public IPs or gateways). Use reliable IPs
# Prefer an external public IP (8.8.8.8) and/or the ISP gateway IP.
CHECK_TARGET="8.8.8.8"

# Health check behavior
PING_COUNT_PER_CHECK=1       # pings per check
PING_TIMEOUT=2               # seconds per ping

# Hysteresis
FAIL_THRESHOLD=3             # consecutive failed checks before failover
RECOVER_THRESHOLD=12         # consecutive successful checks before failback (e.g. 12*5s = 60s)

# Poll interval
SLEEP_INTERVAL=5             # seconds between checks

# Optional TCP check (e.g. test port 80/443) - leave blank to skip
TCP_CHECK_HOST=""
TCP_CHECK_PORT=""

# Failback mode: true = fail back automatically after RECOVER_THRESHOLD good checks; false = require manual failback
AUTO_FAILBACK=true
# -------------------------------------------------

# Validate environment
if ! command -v ip >/dev/null 2>&1; then
  echo "$(timestamp) ERROR: iproute2 (ip) required" >&2
  exit 1
fi
if ! command -v ping >/dev/null 2>&1; then
  echo "$(timestamp) ERROR: ping required" >&2
  exit 1
fi

# helper: check target via interface with ping
ping_check() {
  local ifname="$1"
  local target="$2"
  # -I binds the probe to the interface; if it has no IPv4 address yet, ping fails
  ping -c "$PING_COUNT_PER_CHECK" -W "$PING_TIMEOUT" -I "$ifname" "$target" >/dev/null 2>&1
}

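# Optional TCP connect check using bash's /dev/tcp pseudo-device.
# Note: unlike ping_check, this is not bound to an interface; it follows the current default route.
tcp_check() {
  local host="$1" port="$2" timeout="${3:-2}"
  # Pass host/port as positional arguments rather than interpolating them into the command string;
  # the timeout wrapper guards against a connect that hangs
  timeout "$timeout" bash -c 'cat < /dev/null > "/dev/tcp/$1/$2"' _ "$host" "$port" >/dev/null 2>&1
}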

# Get current default nexthop (show only first default route)
get_current_default() {
  ip route show default 2>/dev/null | awk 'NR==1{print $3}'
}

# Set default route (replace)
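set_default_route() {
  local gw="$1" dev="$2"
  # "replace" is idempotent: it adds the route or overwrites the existing default
  if ip route replace default via "$gw" dev "$dev" proto static; then
    echo "$(timestamp) INFO: Default route set -> gw=$gw dev=$dev"
  else
    echo "$(timestamp) ERROR: could not set default route -> gw=$gw dev=$dev"
    return 1
  fi
}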

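# Log wrapper: writes to the log file (stdout is redirected above) and mirrors to syslog
log() {
  echo "$(timestamp) - $*"
  if command -v logger >/dev/null 2>&1; then
    logger -t dual-wan-failover "$*"
  fi
}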

# initial counters
primary_fail_count=0
primary_ok_count=0

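# Pin the health-check target to the primary path. Without this host route,
# `ping -I "$PRIMARY_IF"` has no usable route to the target while the default
# route points at the secondary, so a recovered primary would never be detected.
# Side effect: all traffic to CHECK_TARGET leaves via the primary link.
pin_check_route() {
  ip route replace "$CHECK_TARGET" via "$PRIMARY_GW" dev "$PRIMARY_IF" 2>/dev/null || true
}

# On startup, set default to primary if available, else try secondary
pin_check_route
if ping_check "$PRIMARY_IF" "$CHECK_TARGET"; then
  set_default_route "$PRIMARY_GW" "$PRIMARY_IF"
  log "Startup: primary reachable, using primary"
else
  set_default_route "$SECONDARY_GW" "$SECONDARY_IF"
  log "Startup: primary NOT reachable, using secondary"
fi

# main loop
while true; do
  # Check primary (re-apply the pinned route first in case it was removed)
  pin_check_route
  if ping_check "$PRIMARY_IF" "$CHECK_TARGET"; then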
    # optional additional TCP check
    if [ -n "$TCP_CHECK_HOST" ] && [ -n "$TCP_CHECK_PORT" ]; then
      if tcp_check "$TCP_CHECK_HOST" "$TCP_CHECK_PORT" 2; then
        primary_ok=true
      else
        primary_ok=false
      fi
    else
      primary_ok=true
    fi
  else
    primary_ok=false
  fi

  if $primary_ok; then
    primary_fail_count=0
    primary_ok_count=$((primary_ok_count+1))
    # if default is on secondary and we've stabilized, switch back (only if AUTO_FAILBACK true)
    current_gw=$(get_current_default)
    if [ "$AUTO_FAILBACK" = true ] && [ "$current_gw" != "$PRIMARY_GW" ]; then
      if [ "$primary_ok_count" -ge "$RECOVER_THRESHOLD" ]; then
        log "Primary stabilized for $primary_ok_count checks -> switching default back to PRIMARY"
        set_default_route "$PRIMARY_GW" "$PRIMARY_IF"
        primary_ok_count=0
      else
        log "Primary OK (${primary_ok_count}/${RECOVER_THRESHOLD}) - waiting before failback"
      fi
    fi
  else
    primary_ok_count=0
    primary_fail_count=$((primary_fail_count+1))
    log "Primary NOT reachable (${primary_fail_count}/${FAIL_THRESHOLD})"
    # On threshold breach, switch to secondary
    if [ "$primary_fail_count" -ge "$FAIL_THRESHOLD" ]; then
      current_gw=$(get_current_default)
      if [ "$current_gw" != "$SECONDARY_GW" ]; then
        log "Primary down for $primary_fail_count checks -> switching default to SECONDARY"
        set_default_route "$SECONDARY_GW" "$SECONDARY_IF"
      else
        log "Already on SECONDARY"
      fi
    fi
  fi

  sleep "$SLEEP_INTERVAL"
done

```

{% endtab %}

{% tab title="Systemd Unit" %}
Systemd unit — `/etc/systemd/system/dual-wan-failover.service`

Create the unit file and enable (save as root):

```
[Unit]
Description=Dual WAN Failover (route-based) for Ethereum validator
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/sbin/dual-wan-failover.sh
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

```

{% endtab %}
{% endtabs %}

Then run:

```
sudo systemctl daemon-reload
sudo systemctl enable --now dual-wan-failover.service
```

#### 2. What changed from the previous script (and why it’s safer)

* **Don’t bring interfaces down.** The earlier script used `ip link set <if> down`, which kills open sockets and can cause longer disruptions. This version replaces the default route instead.
* **Hysteresis:** require N consecutive failures to fail over and M consecutive successes to fail back. This prevents route flapping on flaky links.
* **Use `ip route replace default via <gw> dev <if>`**, which is idempotent and fast.
* **Logging** to `/var/log/dual-wan-failover.log` and syslog.
* **Optional TCP check** (in case the ISP filters ICMP): set `TCP_CHECK_HOST` and `TCP_CHECK_PORT` to enable it.
* **Auto-failback toggle:** set `AUTO_FAILBACK=false` to require manual failback.
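The hysteresis rule described in the list above can be tried out without touching any routes. This is a minimal sketch that mirrors the counter logic of the script; `simulate_hysteresis` is a hypothetical helper, and the recovery threshold is shortened to 4 purely to keep the demo sequence short.

```bash
#!/usr/bin/env bash
# Hysteresis sketch: replay a sequence of health-check results (1 = ok, 0 = fail)
# and print the state changes the consecutive-failure/success counters produce.
# simulate_hysteresis is an illustrative helper, not part of the failover script.

simulate_hysteresis() {
  local fail_threshold="$1" recover_threshold="$2"; shift 2
  local active="primary" fails=0 oks=0 step=0 result
  for result in "$@"; do
    step=$((step + 1))
    if [ "$result" -eq 1 ]; then
      fails=0
      oks=$((oks + 1))
      # Fail back only after enough consecutive successes
      if [ "$active" = "secondary" ] && [ "$oks" -ge "$recover_threshold" ]; then
        active="primary"; oks=0
        echo "step $step: failback to primary"
      fi
    else
      oks=0
      fails=$((fails + 1))
      # Fail over only after enough consecutive failures
      if [ "$active" = "primary" ] && [ "$fails" -ge "$fail_threshold" ]; then
        active="secondary"
        echo "step $step: failover to secondary"
      fi
    fi
  done
  echo "final: $active"
}

# FAIL_THRESHOLD=3, RECOVER_THRESHOLD=4 (shortened for the demo)
simulate_hysteresis 3 4  1 0 0 1 0 0 0 1 1 0 1 1 1 1
```

In this run the two isolated failures at steps 2 and 3 are ignored, the three consecutive failures ending at step 7 trigger failover, the single failure at step 10 restarts the recovery count, and failback happens at step 14.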

#### 3. Recommended parameter values (tweak if needed)

* `SLEEP_INTERVAL=5` (check every 5s)
* `FAIL_THRESHOLD=3` (failover after roughly 15 to 20s of continuous ping failure, because each failed check also waits up to `PING_TIMEOUT`)
* `RECOVER_THRESHOLD=12` (recover after \~60s of stable connectivity)

This gives a reasonable trade-off: fast failover, stable failback.
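The timing behind these values can be worked out with a little arithmetic. The sketch below is an approximation under the assumption that a failed ping blocks for the full timeout and a successful ping returns almost immediately; `failover_seconds` and `failback_seconds` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Timing sketch for the hysteresis parameters (helper names are illustrative).

# Worst-case seconds from the start of the first failed check to the route switch:
# every failed check blocks for up to count*timeout, with a sleep between checks.
failover_seconds() {
  local threshold="$1" interval="$2" ping_count="$3" ping_timeout="$4"
  echo $(( threshold * ping_count * ping_timeout + (threshold - 1) * interval ))
}

# Approximate seconds of stable connectivity required before failback
# (successful pings return quickly, so the sleep interval dominates).
failback_seconds() {
  local threshold="$1" interval="$2"
  echo $(( threshold * interval ))
}

# FAIL_THRESHOLD=3, SLEEP_INTERVAL=5, PING_COUNT_PER_CHECK=1, PING_TIMEOUT=2
echo "failover after up to $(failover_seconds 3 5 1 2)s of failed checks"
# RECOVER_THRESHOLD=12, SLEEP_INTERVAL=5
echo "failback after about $(failback_seconds 12 5)s of stable checks"
```

With the defaults this prints 16s and 60s. An outage that begins just after a successful check adds up to one more `SLEEP_INTERVAL` before the first failed check starts.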

#### 4. Tests and verification

**Start service & check logs**

```
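sudo systemctl start dual-wan-failover
# Follow the journal for the unit (Ctrl+C to stop) ...
sudo journalctl -fu dual-wan-failover -o cat
# ... or follow the log file written by the script
tail -f /var/log/dual-wan-failover.log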

```

**Force primary down (test)**

Instead of unplugging the cable, you can simulate an unreachable primary by changing the ping target to an unreachable address or by adding a temporary firewall rule that drops outgoing ICMP on the primary interface. If you must test a real link failure, make sure you have physical access to the machine so you can restore the link.
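One way to build the temporary firewall rule mentioned above is sketched here. It assumes `iptables` is the active firewall front end; `simulate_primary_down` is a hypothetical helper that only prints the command unless `apply` is passed, and the interface and target shown are the script defaults.

```bash
#!/usr/bin/env bash
# Test sketch: drop the health-check pings leaving the primary interface so the
# failover logic sees the primary as down while the link itself stays up.
# Assumes iptables is installed; simulate_primary_down is an illustrative helper.
# It prints the command by default; pass "apply" as the 4th argument to run it (as root).

simulate_primary_down() {
  local action="$1" ifname="$2" target="$3" mode="${4:-print}"
  local flag
  case "$action" in
    start) flag="-I" ;;   # insert the DROP rule
    stop)  flag="-D" ;;   # delete it again
    *) echo "usage: simulate_primary_down start|stop IFACE TARGET [apply]" >&2; return 2 ;;
  esac
  local cmd=(iptables "$flag" OUTPUT -o "$ifname" -d "$target" -p icmp --icmp-type echo-request -j DROP)
  if [ "$mode" = "apply" ]; then
    "${cmd[@]}"
  else
    echo "${cmd[*]}"
  fi
}

simulate_primary_down start eth0 8.8.8.8
simulate_primary_down stop  eth0 8.8.8.8
```

Run the `start` form with `apply`, watch the log until the failover message appears, then run the `stop` form so the primary can recover and the failback counter can start.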

**Check current default route**

```
ip route show default
# expected output when primary active:
# default via 192.168.1.1 dev eth0 proto static
```
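If you want a one-word answer instead of reading the route line, the output can be mapped to a label. This is a small sketch; `active_wan` is a hypothetical helper, and the gateway values are the script defaults and need to match your own configuration.

```bash
#!/usr/bin/env bash
# Sketch: map the gateway of the first default route to a WAN label.
# active_wan is an illustrative helper; the gateways below are the script defaults.

PRIMARY_GW="192.168.1.1"
SECONDARY_GW="192.168.2.1"

# Reads `ip route show default` output on stdin and prints primary/secondary/unknown.
active_wan() {
  local gw
  gw="$(awk '$1 == "default" && $2 == "via" { print $3; exit }')"
  case "$gw" in
    "$PRIMARY_GW")   echo "primary" ;;
    "$SECONDARY_GW") echo "secondary" ;;
    *)               echo "unknown" ;;
  esac
}

# Live usage: ip route show default | active_wan
echo "default via 192.168.2.1 dev eth1 proto static" | active_wan
```

The sample line at the end prints `secondary`; piping the real `ip route show default` output into `active_wan` reports the link currently carrying the default route.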

**Confirm validator outbound IP (if using VPS/WireGuard egress test)**

If you route validator traffic via WireGuard, test:

```
curl --interface wg0 https://ifconfig.co
```

#### 5. Manual failback (if you prefer)

If you set `AUTO_FAILBACK=false` at the top of the script, the service will not switch back automatically. To switch back manually:

```
sudo ip route replace default via <PRIMARY_GW> dev <PRIMARY_IF>
sudo logger "Manual failback executed: default -> PRIMARY"
```

Or re-enable auto-failback and let it happen automatically.

#### 6. Extra hardening tips for validator reliability

* Keep Beacon & Execution clients on the same machine (or ensure both active ELs are synced) — do **not** run the same validator key on two machines.
* Use a WireGuard tunnel to a static VPS (Hetzner) to provide a stable egress IP if you want zero apparent IP changes.
* Ensure `rp_filter` is off to avoid martian drops:

```
sudo sysctl -w net.ipv4.conf.all.rp_filter=0
sudo sysctl -w net.ipv4.conf.default.rp_filter=0
```

* Monitor beacon logs for missed attestations and set up alerts (email/telegram).
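The `sysctl -w` commands in the tips above only change the running kernel, so the `rp_filter` settings are lost on reboot. One way to persist them is a drop-in file under `/etc/sysctl.d/`. This is a sketch: the file name `99-dual-wan.conf` and the helper `write_rp_filter_conf` are assumptions, and the demo writes to a temporary file so it can be previewed without root.

```bash
#!/usr/bin/env bash
# Sketch: persist the rp_filter settings across reboots via a sysctl.d drop-in.
# The file name 99-dual-wan.conf and the helper name are assumptions.

write_rp_filter_conf() {
  local path="$1"
  cat > "$path" <<'EOF'
# Disable reverse-path filtering for dual-WAN routing
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
EOF
}

# Preview the drop-in in a temporary file first.
preview="$(mktemp)"
write_rp_filter_conf "$preview"
cat "$preview"
rm -f "$preview"

# To install for real (as root):
#   write_rp_filter_conf /etc/sysctl.d/99-dual-wan.conf && sysctl --system
```

`sysctl --system` reloads all system configuration files, so the values take effect immediately as well as on the next boot.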

#### 7. Is manual switch-back OK to avoid missed attestations?

Yes. Manual switch-back is fine and **safer** than immediate automatic failback, because you can confirm the primary link is stable before switching. The provided script supports both **automatic** controlled failback (recommended) and manual failback (set `AUTO_FAILBACK=false`); both approaches are valid. Manual switching avoids unnecessary flaps and therefore reduces the chance of missed attestations.

