Commit Graph

7 Commits

Author | SHA1 | Message | Date
roof 9a800e3b6b feat: fix cross-cell service access — DNS DNAT, service DNAT, Caddy routing
DNS A records now return the WireGuard server IP (10.0.0.1) instead of
Docker bridge VIPs so cross-cell peers resolve service names correctly
regardless of their bridge subnet. DNAT rules (wg0:53→cell-dns:53 and
wg0:80→cell-caddy:80) are applied at startup. Caddy routes by Host header,
eliminating the Docker bridge subnet conflict. Firewall cell rules allow
DNS and service (Caddy) traffic from linked cell subnets. Split-tunnel
AllowedIPs now dynamically includes connected-cell VPN subnets and drops
the 172.20.0.0/16 range. Peers with route_via set now receive full-tunnel
config (0.0.0.0/0) so all their traffic exits via the remote cell.
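
The DNS/service DNAT behavior described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the container IPs, interface name, and helper names are assumptions.

```python
import subprocess

WG_IF = "wg0"
DNS_IP = "172.20.0.2"    # assumed cell-dns container IP
CADDY_IP = "172.20.0.3"  # assumed cell-caddy container IP

def build_dnat_rules():
    """Return iptables PREROUTING rule arguments that redirect DNS and
    HTTP arriving on the WireGuard interface to the service containers."""
    return [
        # wg0:53 -> cell-dns:53 (UDP and TCP)
        ["-i", WG_IF, "-p", "udp", "--dport", "53",
         "-j", "DNAT", "--to-destination", f"{DNS_IP}:53"],
        ["-i", WG_IF, "-p", "tcp", "--dport", "53",
         "-j", "DNAT", "--to-destination", f"{DNS_IP}:53"],
        # wg0:80 -> cell-caddy:80; Caddy then routes by Host header
        ["-i", WG_IF, "-p", "tcp", "--dport", "80",
         "-j", "DNAT", "--to-destination", f"{CADDY_IP}:80"],
    ]

def apply_service_dnat():
    # applied at startup, per the commit message
    for rule in build_dnat_rules():
        subprocess.run(["iptables", "-t", "nat", "-A", "PREROUTING", *rule],
                       check=True)
```

Because DNS answers now point at 10.0.0.1 rather than a bridge VIP, every cross-cell peer hits the same wg0 address and the DNAT rules fan traffic out to the right container.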

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 03:12:09 -04:00
roof 8ea834e108 feat: Phase 3 - per-peer internet routing via exit cell
Adds the ability to route a specific peer's internet traffic through a
connected cell acting as an exit relay.

Cell A side:
- PUT /api/peers/<peer>/route-via {"via_cell": "cellB"} sets route_via
- Updates WG AllowedIPs to include 0.0.0.0/0 for the exit cell peer
- Adds ip rule + ip route in policy table inside cell-wireguard so the
  specific peer's traffic egresses via cellB's WG IP
- Sets exit_relay_active on the cell link and pushes use_as_exit_relay=True
  to cellB via peer-sync

Cell B side:
- Receives use_as_exit_relay in the peer-sync payload
- Calls apply_cell_rules(..., exit_relay=True) to add FORWARD -o eth0 ACCEPT
- Stores remote_exit_relay_active flag for startup recovery

Startup recovery:
- apply_all_cell_rules passes exit_relay=remote_exit_relay_active (cellB)
- _apply_startup_enforcement reapplies ip rule for each peer with route_via (cellA)
  since policy routing rules don't survive container restart

peer_registry gets route_via field with lazy migration.
22 new tests across test_cell_link_manager, test_peer_registry, test_peer_route_via.
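
The per-peer policy routing on the Cell A side can be sketched like this; the table number, command layout, and helper names are assumptions drawn from the message, not the real implementation.

```python
import subprocess

EXIT_TABLE = 200  # assumed dedicated policy-routing table

def build_route_via_cmds(peer_ip, exit_cell_wg_ip):
    """Commands that steer one peer's traffic out via the exit cell:
    an ip rule selecting the table by source address, plus a default
    route in that table pointing at the exit cell's WG IP."""
    return [
        ["ip", "rule", "add", "from", peer_ip, "table", str(EXIT_TABLE)],
        ["ip", "route", "replace", "default",
         "via", exit_cell_wg_ip, "table", str(EXIT_TABLE)],
    ]

def apply_route_via(peer_ip, exit_cell_wg_ip):
    # ip rules do not survive a container restart, which is why
    # startup enforcement must replay this for every route_via peer
    for cmd in build_route_via_cmds(peer_ip, exit_cell_wg_ip):
        subprocess.run(cmd, check=True)
```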

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 16:23:31 -04:00
roof dcee03dd3f feat(cells): Phase 2 — exit-offer signaling between connected cells
Adds the ability for a cell to signal to a connected peer cell that it
is willing to route internet traffic on its behalf. This is the
signaling layer for Phase 3 (per-peer routing via exit cell).

Changes:
- cell_links.json: exit_offered (bool) + remote_exit_offered (bool)
  fields with lazy migration (default false for existing records)
- _push_permissions_to_remote: includes exit_offered in the push body
- apply_remote_permissions: accepts exit_offered kwarg; stores it as
  remote_exit_offered on the matching cell link
- peer-sync receiver: passes exit_offered from body to apply_remote_permissions
- CellLinkManager.set_exit_offered(cell_name, offered): persists +
  triggers push so the remote learns of our offer immediately
- PUT /api/cells/<name>/exit-offer: REST endpoint to toggle the flag
- 12 new tests covering all new paths
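
The lazy migration and the receiver side of the exit-offer flag might look roughly like this; the record layout and function signatures are assumptions based on the bullet list above.

```python
def migrate_cell_link(record):
    """Backfill the Phase 2 fields on cell_links.json records written
    before this change, defaulting both flags to False."""
    record.setdefault("exit_offered", False)
    record.setdefault("remote_exit_offered", False)
    return record

def apply_remote_permissions(links, cell_name, permissions, exit_offered=False):
    """Store the mirrored permissions pushed by the remote cell and
    remember whether it offered to act as our internet exit."""
    link = migrate_cell_link(links[cell_name])
    link["permissions"] = permissions
    link["remote_exit_offered"] = bool(exit_offered)
    return link
```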

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 15:49:21 -04:00
roof a3d0cd5a48 feat(cells): Phase 1 — permission sync between connected PICs
When PIC A updates service sharing permissions, it immediately pushes
the mirrored state to PIC B over the WireGuard tunnel so B's UI shows
what A is sharing with it in real time.

Architecture:
- Push model: update_permissions() → _push_permissions_to_remote() →
  POST /api/cells/peer-sync/permissions on remote cell
- Auth: source IP must be inside a known cell's vpn_subnet (WireGuard
  tunnel proves identity) + body's from_public_key must match stored key
- Mirror semantics: our inbound (what we share) → their outbound view
- Non-fatal: push failures set pending_push=True; replay_pending_pushes()
  retries at startup so offline cells catch up on reconnect
- add_connection() also pushes initial state so remote sees permissions
  immediately on the first connect

New fields on cell_links.json records (lazy-migrated):
  remote_api_url, last_push_status, last_push_at, last_push_error,
  pending_push, last_remote_update_at

New endpoint: POST /api/cells/peer-sync/permissions

30 new tests (1101 total).
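
The auth step described above (source IP inside a known cell's vpn_subnet, plus a public-key match) can be sketched with the stdlib ipaddress module; the data layout and function name here are assumptions.

```python
import ipaddress

def authenticate_peer_sync(source_ip, from_public_key, cell_links):
    """Trust a peer-sync request only if its source IP falls inside a
    known cell's vpn_subnet (the WireGuard tunnel proves the IP) AND
    the claimed public key matches the one stored for that cell."""
    ip = ipaddress.ip_address(source_ip)
    for name, link in cell_links.items():
        if ip in ipaddress.ip_network(link["vpn_subnet"]):
            if link["public_key"] == from_public_key:
                return name
            return None  # right subnet, wrong key: reject
    return None  # source IP not inside any linked cell's subnet
```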

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 13:12:30 -04:00
roof 0b103ffafb feat(cells): fix PIC-to-PIC connection + add service-sharing permissions
Phase 1 — connection fixes:
- routing_manager.stop(): remove iptables -F / -t nat -F nuclear flush that
  would wipe WireGuard MASQUERADE and all peer rules on any UI stop action
- wireguard_manager.add_cell_peer(): reject vpn_subnet that overlaps the local
  WG network (routing blackhole — was the root cause of no handshake)
- wireguard_manager._syncconf(): pass Endpoint to 'wg set' so cell peers with
  static endpoints are synced to the kernel (not just AllowedIPs)

Phase 2 — service-sharing permissions backend:
- firewall_manager: add _cell_tag(), clear_cell_rules(), apply_cell_rules(),
  apply_all_cell_rules() — iptables FORWARD rules for cell-to-cell traffic
  using 'pic-cell-<name>' comment tags, distinct from 'pic-peer-*'
- app.py startup enforcement: call apply_all_cell_rules(cell_links) so rules
  survive API restarts
- cell_link_manager: permissions schema {inbound, outbound} per service;
  lazy migration for existing entries; update_permissions(), get_permissions();
  apply_cell_rules wired into add_connection/remove_connection
- routes/cells.py: GET /api/cells/services, GET+PUT /api/cells/<n>/permissions;
  RuntimeError now returns 400 (not 500) from add_connection

Removed broken 'test' cell (subnet 10.0.0.0/24 collided with local WG network).
Second PIC must use a distinct subnet (e.g. 10.0.1.0/24) before reconnecting.
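
The comment-tagged cell firewall rules from Phase 2 could be built along these lines; the exact chains, matched protocols, and helper layout are assumptions, only the 'pic-cell-<name>' tag convention comes from the message.

```python
def _cell_tag(name):
    return f"pic-cell-{name}"

def build_cell_forward_rules(cell_name, vpn_subnet, allowed_ports):
    """FORWARD ACCEPT rules for traffic from a linked cell's subnet,
    comment-tagged so clear_cell_rules() can find and delete exactly
    these rules without touching the 'pic-peer-*' peer rules."""
    tag = _cell_tag(cell_name)
    return [
        ["iptables", "-A", "FORWARD", "-s", vpn_subnet,
         "-p", "tcp", "--dport", str(port), "-j", "ACCEPT",
         "-m", "comment", "--comment", tag]
        for port in allowed_ports
    ]
```

Tagging via `-m comment` is what lets apply_all_cell_rules() reapply state idempotently at startup: stale rules are matched by tag and cleared before fresh ones are inserted.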

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 08:35:24 -04:00
roof 5d0238ff3c A5: Extract config routes into blueprint (app.py 1294 → 579 lines)
Move all /api/config/* routes and pending-restart helpers into
routes/config.py. Re-export helpers from app.py for backward compat:

  from routes.config import _set_pending_restart, _clear_pending_restart,
                           _collect_service_ports, _dedup_changes

Test patches updated:
  app._set_pending_restart     → routes.config._set_pending_restart
  app._clear_pending_restart   → routes.config._clear_pending_restart
  app.threading.Thread         → routes.config.threading.Thread

Remaining in app.py: Flask setup, middleware, health monitor thread,
/health, /api/status, /api/health/history* (use module-level state).

1021 tests passing.
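
The reason the test patches had to move can be demonstrated in miniature: once a helper lives in routes.config and the blueprint calls it through its own module namespace, patching the app-level re-export no longer intercepts the call. Module names below are stand-ins for the real layout.

```python
import types
from unittest import mock

# simulate routes/config.py owning the helper
routes_config = types.ModuleType("routes.config")

def _set_pending_restart(services):
    return f"restart pending: {services}"

routes_config._set_pending_restart = _set_pending_restart

def handler():
    # blueprint code resolves the helper via its own module
    return routes_config._set_pending_restart(["dns"])

routes_config.handler = handler

# simulate app.py re-exporting for backward compat
app_module = types.ModuleType("app")
app_module._set_pending_restart = routes_config._set_pending_restart

# patching the app-level re-export does NOT affect the blueprint...
with mock.patch.object(app_module, "_set_pending_restart",
                       return_value="patched"):
    unaffected = routes_config.handler()

# ...patching routes.config does
with mock.patch.object(routes_config, "_set_pending_restart",
                       return_value="patched"):
    affected = routes_config.handler()
```

This is the standard "patch where it's looked up, not where it's defined" rule, and it is exactly why `app._set_pending_restart` patches became `routes.config._set_pending_restart`.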

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 06:53:24 -04:00
roof 09138fbc18 A5: Extract all route groups into Flask blueprints (app.py -1735 lines)
Extract 9 route groups out of app.py into routes/ blueprints:
- routes/network.py  — DNS, DHCP, NTP, network info/test (10 routes)
- routes/wireguard.py — WireGuard keys, peers, config, enforcement (18 routes)
- routes/cells.py    — cell-to-cell connections (5 routes)
- routes/peers.py    — peer CRUD + IP update + _next_peer_ip helper (10 routes)
- routes/routing.py  — NAT, peer routes, firewall, iptables (17 routes)
- routes/vault.py    — certs, trust, secrets (19 routes)
- routes/containers.py — containers, images, volumes (14 routes)
- routes/services.py — service bus, logs, services status/connectivity (18 routes)
- routes/peer_dashboard.py — peer-scoped dashboard/services (2 routes)

All blueprints use lazy `from app import X` inside route bodies to preserve
test patch compatibility (patch('app.email_manager', mock) still works).

Also included in this commit:
- A1 fix: backup/restore now includes email/calendar user files
- A2 fix: apply_config sets applying=True flag via helper container
- A3 fix: add_peer rolls back firewall on DNS failure

app.py reduced: 3011 → 1294 lines. 1021 tests passing.
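
The lazy-import trick can be shown without Flask (module names below are stand-ins): because the `from app import X` happens inside the route body, the attribute is resolved at call time, so a test's `patch('app.email_manager', mock)` is still observed by the blueprint.

```python
import sys
import types

# stand-in for the real app.py module
app_stub = types.ModuleType("app_stub")

class EmailManager:
    def status(self):
        return {"running": True}

app_stub.email_manager = EmailManager()
sys.modules["app_stub"] = app_stub

def email_status():
    # lazy import inside the route body, as the blueprints do:
    # the attribute is looked up fresh on each call, so patches
    # applied to the app module are seen here
    from app_stub import email_manager
    return email_manager.status()
```

A module-level `from app import email_manager` in the blueprint would instead bind the object once at import time, and patching `app.email_manager` afterwards would be invisible to the route.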

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 06:11:21 -04:00