fix(connectivity): clean up cell_relay policy routing on teardown

A cell_relay policy-routes an assigned peer with `ip rule from <peer>
lookup <table>` plus a shared `default via <cell-ip>` route in that table
inside cell-wireguard. Two teardown bugs leaked both the rule and the route
(confirmed on hardware, pic0<->pic1):

- remove_peer_route_via deleted the rule with a hardcoded default table 100,
  but the v2 cell_relay path adds it with the connection's own table (1000+),
  so the rule never matched and survived peer detach/delete. It now deletes
  by source IP (table-agnostic), covering both the v2 and the legacy
  route-via (table 100) paths.
- nothing ever removed the table's shared default route: delete_connection
  explicitly skipped cell_relay, and reconcile_cell_relays deletes the record
  directly. Add wireguard_manager.teardown_route_table(table), which removes
  any leftover lookup-<table> rules and flushes the table, and call it from
  both delete_connection and the reconcile removal path.
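
The selector mismatch behind both fixes can be illustrated with a toy model.
This is only a sketch, not the project's code: the `Rule` class, the helper
names, and the peer addresses are hypothetical, and the matching logic is a
simplification of how `ip rule del` picks a rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    src: str    # source prefix, e.g. "10.0.0.5/32" (hypothetical address)
    table: int  # routing table the rule looks up


def delete_first(rules, src=None, table=None):
    """Remove the first rule matching every selector that was given.

    Models a single `ip rule del` call; returns True if a rule was removed.
    """
    for i, rule in enumerate(rules):
        if (src is None or rule.src == src) and (table is None or rule.table == table):
            del rules[i]
            return True
    return False


def delete_all(rules, src=None, table=None, limit=32):
    """Repeat delete_first until nothing matches, bounded by `limit`."""
    removed = 0
    for _ in range(limit):
        if not delete_first(rules, src=src, table=table):
            break
        removed += 1
    return removed


# Two peers assigned to one relay whose connection table is 1003.
rules = [Rule("10.0.0.5/32", 1003), Rule("10.0.0.6/32", 1003)]

# Old behaviour: selecting on table 100 never matches a rule in table 1003.
old_fix_hit = delete_first(rules, src="10.0.0.5/32", table=100)

# New behaviour: selecting on the source alone is table-agnostic.
new_removed = delete_all(rules, src="10.0.0.5/32")

# Connection teardown: clear whatever still points at the table.
table_removed = delete_all(rules, table=1003, limit=64)
```

In this model the table-100 selector removes nothing, the source-only
selector removes the detached peer's rule, and the per-table sweep clears the
rule left behind by the other peer.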

Also clear a peer's relay rule on peer deletion so a peer deleted while still
assigned doesn't leave a stale source rule that could misroute a future peer
reusing the IP.

Regression tests: detach removes the rule by source; delete_connection and
reconcile-removal each flush the relay table.
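
A regression check of the "remove by source" behaviour might look roughly
like the following. It is a hedged sketch rather than the test added in this
commit: `remove_rules_by_source` is a hypothetical stand-in that takes an
injectable runner instead of calling `subprocess.run`, and `FakeRunner` and
the peer address are likewise assumptions.

```python
from types import SimpleNamespace


def remove_rules_by_source(run, peer_ip, limit=32):
    """Issue `ip rule del from <peer>/32` until the command fails.

    `run` is any callable that takes an argv list and returns an object
    with a `returncode` attribute. Returns the number of attempts made.
    """
    attempts = 0
    for _ in range(limit):
        result = run(['docker', 'exec', 'cell-wireguard',
                      'ip', 'rule', 'del', 'from', f'{peer_ip}/32'])
        attempts += 1
        if result.returncode != 0:
            break
    return attempts


class FakeRunner:
    """Records every command and replays a scripted list of return codes."""

    def __init__(self, returncodes):
        self.returncodes = list(returncodes)
        self.commands = []

    def __call__(self, cmd):
        self.commands.append(cmd)
        return SimpleNamespace(returncode=self.returncodes.pop(0))


# Two rules exist for the peer (a duplicate), so two deletes succeed
# and the third call fails, which ends the loop.
runner = FakeRunner([0, 0, 2])
attempts = remove_rules_by_source(runner, '10.0.0.5')
```

The properties worth asserting are that no issued command carries a `lookup`
or `pref` selector and that the loop stops at the first non-zero return code.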

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 11:34:41 -04:00
parent 639fb66e5b
commit 2ab6e715d8
4 changed files with 129 additions and 8 deletions
+47 -8
@@ -786,21 +786,60 @@ class WireGuardManager(BaseServiceManager):
             logger.error(f'apply_peer_route_via failed: {e}')
             return False
 
-    def remove_peer_route_via(self, peer_ip: str, table: int = 100) -> None:
-        """Remove the ip rule for peer_ip added by apply_peer_route_via. Non-fatal."""
+    def remove_peer_route_via(self, peer_ip: str) -> None:
+        """Remove the policy-routing ip rule(s) for peer_ip. Non-fatal.
+
+        Deletes every `ip rule from peer_ip/32` regardless of which table it
+        points at: the v2 cell_relay path adds the rule with the connection's
+        own table (1000+) while the legacy route-via path uses table 100, so a
+        caller clearing a peer's exit does not reliably know the table. Matching
+        by source alone removes the rule in both cases (and any duplicate). The
+        shared routing *table* itself is torn down separately at connection
+        teardown — see teardown_route_table.
+        """
         real_conf = self._config_file()
         if '/tmp/' in real_conf or 'pytest' in real_conf or 'wg_confs' not in real_conf:
             return
         try:
-            subprocess.run(
-                ['docker', 'exec', 'cell-wireguard',
-                 'ip', 'rule', 'del', 'from', f'{peer_ip}/32',
-                 'pref', str(table), 'lookup', str(table)],
-                capture_output=True, timeout=5
-            )
+            for _ in range(32):
+                r = subprocess.run(
+                    ['docker', 'exec', 'cell-wireguard',
+                     'ip', 'rule', 'del', 'from', f'{peer_ip}/32'],
+                    capture_output=True, timeout=5
+                )
+                if r.returncode != 0:
+                    break
         except Exception:
             pass
 
+    def teardown_route_table(self, table: int) -> None:
+        """Tear down a relay routing table when its connection is removed. Non-fatal.
+
+        Removes any remaining `ip rule ... lookup <table>` entries (e.g. one left
+        by a peer deleted while still assigned) and flushes the table's routes —
+        notably the `default via <cell-ip>` route that apply_peer_route_via
+        installs. That route is shared by every peer routed through the relay, so
+        no per-peer detach may remove it; it can only be cleared once the
+        connection itself is gone, or it leaks (stale default route + a possible
+        blackhole if a rule survives).
+        """
+        real_conf = self._config_file()
+        if '/tmp/' in real_conf or 'pytest' in real_conf or 'wg_confs' not in real_conf:
+            return
+        try:
+            def _wg(cmd):
+                return subprocess.run(
+                    ['docker', 'exec', 'cell-wireguard'] + cmd,
+                    capture_output=True, timeout=5
+                )
+            for _ in range(64):
+                r = _wg(['ip', 'rule', 'del', 'lookup', str(table)])
+                if r.returncode != 0:
+                    break
+            _wg(['ip', 'route', 'flush', 'table', str(table)])
+        except Exception as e:
+            logger.warning(f'teardown_route_table({table}) failed: {e}')
+
     def remove_peer(self, public_key: str) -> bool:
         """Remove the [Peer] block matching public_key from wg0.conf."""
         try: