Commit Graph

71 Commits

Author SHA1 Message Date
roof 7d290c12c4 Phase 2: caddy_manager — Caddyfile generation, health monitor, DNS-01 support
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-09 09:04:11 -04:00
roof cf1b9672f4 Phase 1: first-run setup wizard, bash installer, Docker profiles
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-09 08:05:38 -04:00
roof 7391d7f7a2 Add e2e latency consistency test for WireGuard tunnel
Sends 50 pings at 0.2s intervals through the cell-to-cell tunnel and
asserts that ≤5% exceed 3× the median RTT (floor 15ms). Catches
server-side packet processing regressions on wired paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 15:13:27 -04:00
roof 1b61e9e290 Fix ICMP latency: re-anchor ESTABLISHED,RELATED to FORWARD position 1 on every health tick
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 18:51:38 -04:00
roof 6f84a3ffe1 Fix e2e fixture: use Table=off + manual routes to avoid wg-quick conflict
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 13:31:53 -04:00
roof e2c50c381a Fix cross-cell domain access: scope DNAT rules, add Docker→wg0 routing
- firewall_manager: add _get_wg_server_ip() helper; scope ensure_cell_api_dnat(),
  ensure_dns_dnat(), ensure_service_dnat() DNAT rules with -d server_ip; add
  ensure_wg_masquerade() (Docker→wg0 MASQUERADE+FORWARD) and
  ensure_cell_subnet_routes() (host routes via docker run busybox)
- wireguard_manager: scope PostUp DNAT rules with -d server_ip in generate_config()
  and ensure_postup_dnat(); add Docker→wg0 MASQUERADE+FORWARD rules
- app.py: call ensure_wg_masquerade() and ensure_cell_subnet_routes() in
  _apply_startup_enforcement()
- tests/test_firewall_manager.py: mock _get_wg_server_ip, add
  test_dnat_is_scoped_to_server_ip and test_returns_false_when_wg_server_ip_not_found
- tests/e2e/wg/test_cell_to_cell_routing.py: rewrite to use dynamic config
  (no hardcoded IPs/ports), add latency and domain access tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 12:37:02 -04:00
roof 1e1bda4679 Fix cross-cell ICMP routing: state-based cell DROP + e2e test
The cell catch-all DROP rule blocked all traffic from a connected cell's
subnet, including ESTABLISHED/RELATED packets (ICMP replies, TCP ACKs) for
connections initiated by local VPN peers. This broke ping to the remote
cell's WireGuard IP even when the cell-to-cell tunnel was healthy.

Change the DROP to match only NEW,INVALID connections so established reply
traffic passes through to the stateful ACCEPT rule.

Also adds tests/e2e/wg/test_cell_to_cell_routing.py — an end-to-end test
that brings up a real WireGuard tunnel from the test runner to pic1 and
verifies full cross-cell routing including ICMP ping, API /health, and Caddy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 10:59:11 -04:00
roof 5a4e292440 fix: allow reply traffic from connected cells through FORWARD chain
apply_cell_rules drops all traffic from a cell's subnet except specific
service ports. This also drops ICMP replies and TCP ACKs for connections
initiated by local peers to the connected cell, breaking cross-cell
routing (ping to 10.0.0.1 silently dropped by test's cell DROP rule).

Fix: ensure_forward_stateful() inserts a stateful ESTABLISHED,RELATED
ACCEPT at the top of FORWARD. Called from apply_cell_rules (every cell
add/update) and from _apply_startup_enforcement. Idempotent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 15:13:59 -04:00
roof c2d215ee2e fix: cross-cell routing for split-tunnel peers
Three related fixes for split-tunnel peers that need to reach connected cells:

1. apply_peer_rules/apply_all_peer_rules now accept wg_subnet (actual local VPN
   subnet) and cell_subnets (connected cells' vpn_subnets) parameters instead of
   hardcoding 10.0.0.0/24. All callers (startup, add_peer, update_peer,
   apply-enforcement endpoint) pass the real values.

2. Explicit ACCEPT rules are inserted in FORWARD for each connected cell's
   subnet so split-tunnel peers (internet_access=False) can still reach
   connected cells via the wg0→wg0 path.

3. apply_ip_range in network_manager now loads cell_links.json and passes it
   to generate_corefile(), fixing a race where the bootstrap DNS thread could
   overwrite the Corefile and wipe cross-cell DNS forwarding zones on startup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 14:36:28 -04:00
roof 8ee1d88e37 Add subnet conflict validation for wireguard.address and ip_range changes
When a cell is connected to others, changing the local WireGuard address
or Docker ip_range to a subnet that overlaps a connected cell's vpn_subnet
would break routing. Both now return 409 with the conflicting cell name.

- wireguard.address: derive network from new address, check all connected
  cells' vpn_subnet for overlap (after existing format validation)
- ip_range: check all connected cells' vpn_subnet for overlap (after
  existing RFC-1918 validation)

Tests: 4 cases each (overlap → 409, no overlap → ok, no cells → ok,
format error still fires first → 400).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 10:00:58 -04:00
roof c658d2b16c Add domain conflict validation when changing domain or accepting heal invite
Two gaps allowed a cell to take a domain already in use by a connected cell:

1. PUT /api/config domain change: added check against cell_link_manager's
   connected cells list before saving — returns 409 if the new domain
   collides with any connected cell's domain.

2. accept_invite healing path: a remote cell changing its domain via a
   re-invite was not validated against other connected cells' domains.
   Now calls _check_invite_conflicts(invite, exclude_cell=name) before
   applying any change.

Also: the healing path now detects domain changes (alongside dns_ip/
vpn_subnet/endpoint), updates the stored domain, and refreshes the DNS
forward rule when the domain changes.

Tests: 3 new domain-conflict tests in test_config_validation.py;
3 new accept_invite healing tests in test_cell_link_manager.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 09:46:58 -04:00
roof ac0c16c97b Fix session cookie name collision when running multiple PIC instances on localhost
Flask's default cookie name ('session') is shared across all ports on the same
hostname. When two PIC instances are accessed via localhost:portA and localhost:portB,
logging into one overwrites the other's session cookie, causing repeated logouts.

Derive a unique 8-hex suffix from each instance's persistent SECRET_KEY and set
SESSION_COOKIE_NAME = 'pic_sess_<suffix>'. This ensures each cell uses a distinct
cookie name, so sessions are fully isolated regardless of hostname.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 09:15:42 -04:00
roof 28a193e430 Fix ensure_postup_dnat to strip-and-replace all DNAT rules idempotently
_get_dnat_container_ips() used a concatenating docker inspect format that
produced "invalid IP" when containers had multiple network attachments.
The old ensure_postup_dnat appended rather than replacing, so each update
call added a broken duplicate set of rules causing iptables to fail on
startup and tear down wg0 entirely.

Fix _get_dnat_container_ips to use a space separator in the format string
and validate each token as a real IP before accepting it.

Rewrite ensure_postup_dnat with _is_dnat_rule() helper: strips every
managed DNAT/FORWARD rule (any IP, port 53/80) on semicolon-split and
appends a single correct set — fully idempotent regardless of prior state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 06:54:20 -04:00
roof 67362349d1 test: add loop detection tests for PUT /api/peers/<peer>/route-via
3 new tests in TestSetPeerRouteVia:
- 409 when remote_exit_relay_active=True (would create A→B→A cycle)
- disable (via_cell=null) bypasses loop check — always allowed
- no 409 when remote_exit_relay_active=False (safe to enable)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 04:24:02 -04:00
roof dc2606541c feat: Phase 4 hardening — retry/backoff, loop detection, sync status UI + tests
Phase 4.1 — Retry/backoff for failed permission pushes:
- _compute_next_retry(): capped exponential backoff with jitter (60s–1h)
- _record_push_result(): tracks push_attempts and next_retry_at per link
- replay_pending_pushes(): skips links still in backoff window, logs deferred count
- _load() migration: adds push_attempts/next_retry_at to existing records

Phase 4.2 — Loop detection (A→B→A routing cycle):
- set_peer_route_via(): returns 409 if target cell already routes peers through us
- apply_remote_permissions(): soft warning when accepting exit-relay that would cycle

Phase 4.3 — Sync staleness indicator in Cell Network UI:
- SyncBadge component: green (synced), amber (pending/failed), gray (never)
- Shows relativeTime of last sync + error message + next retry estimate
- Injected into CellPanel header alongside tunnel online/handshake status

Tests (54 new):
- TestCheckInviteConflicts: subnet overlap, domain conflict, exclude_cell (9 tests)
- TestPushInviteToRemote: success, 4xx, no endpoint, subprocess errors (7 tests)
- TestAcceptInviteNew: new cell, idempotent, healing dns/subnet changes (16 tests)
- TestAddConnectionMutualPairing: push-invite call, non-fatal failure (5 tests)
- TestPeerSyncAcceptInvite endpoint: happy path, field validation, error propagation (16 tests)
- Fixed 2 existing replay tests to clear backoff gate (simulates elapsed window)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 04:18:36 -04:00
roof 960a4ecc51 fix: WG address change now queues pending restart + heals cell connections
Three issues fixed together:

1. WireGuard address changes now go through the pending-restart queue
   (shown in the UI banner) instead of restarting cell-wireguard immediately.
   Only private_key changes still restart immediately; address and port
   changes both defer to the user-initiated Apply flow.  Previously the
   address change was silently applied and never appeared in Settings →
   Pending Configuration.

2. When the WG address changes, the API spawns a background thread that
   pushes the updated invite to all connected cells (over LAN, before the
   WG tunnel is back up).  This lets remote cells automatically update
   their dns_ip, AllowedIPs, and CoreDNS forwarding rules without manual
   re-pairing.

3. accept_invite now handles the "already connected but changed" case:
   if the remote cell re-sends an invite with a different dns_ip, vpn_subnet
   or endpoint, we update the stored link, the WG AllowedIPs, and the
   CoreDNS forward rule in place — no delete/re-add required.  Previously
   the endpoint was ignored and returned the stale record unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 08:29:18 -04:00
roof 0e16d6968a fix: prevent test runs from corrupting live WG state; sync wg0.conf on IP change
Three fixes:

1. Extend the docker-exec safety guard in wireguard_manager to also check
   for 'wg_confs' in the config path.  When running unit tests on the host
   the API uses /app/config/wireguard/wg0.conf (no wg_confs subdir), so the
   old '/tmp/' | 'pytest' check didn't fire — _syncconf and friends were
   executing live 'docker exec cell-wireguard wg set' calls against the
   running container, removing real VPN peers that didn't appear in the
   test config.  The wg_confs subdir only exists inside the container mount,
   so its presence reliably gates live calls.

2. Fix get_split_tunnel_ips() wrong path: self.data_dir + 'api/cell_links.json'
   → self.data_dir + 'cell_links.json'.  The extra 'api/' segment produced
   /app/data/api/cell_links.json inside the container instead of the real
   /app/data/cell_links.json, so connected cells were silently excluded from
   split-tunnel CIDRs.

3. update_peer_ip_registry and ip_update now also call
   wireguard_manager.update_peer_ip so wg0.conf AllowedIPs stay in sync when
   a peer's VPN IP changes at runtime (previously only peers.json was updated).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 07:45:28 -04:00
roof 1a611e0474 fix: UI always accessible; fix exit-relay AllowedIPs not updating
**PIC UI always accessible (service_access=[])**
Remove the per-peer Caddy:80 ACCEPT/DROP rule from apply_peer_rules.
Service access was enforced at two layers (iptables DROP + CoreDNS ACL),
but the iptables layer also blocked the PIC web UI served through Caddy.
CoreDNS ACL alone is sufficient — DNS blocks service hostnames; the UI
path through Caddy remains reachable regardless of service_access value.

**Exit-relay internet routing (route_via another cell)**
update_peer_ip validated new_ip as a single ip_network, rejecting the
comma-separated '10.0.1.0/24, 0.0.0.0/0' string passed by
update_cell_peer_allowed_ips(add_default_route=True). The AllowedIPs
in wg0.conf was never updated, so WireGuard never routed internet traffic
through the exit cell's tunnel. Fix: validate each CIDR individually and
apply the change live via wg set without a container restart.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 05:41:22 -04:00
roof c521fab1cb fix: merge CoreDNS ACL per-service and add reload plugin; add peer/cell e2e tests
- _build_acl_block: put all blocked IPs for a service in ONE acl block instead
  of one block per peer — the first block's allow-all was silently granting
  access to every peer after the first blocked one (first-match semantics)
- generate_corefile: add 'reload' plugin so SIGUSR1 triggers Corefile reload
  in newer CoreDNS builds (without it the signal was a no-op)
- tests/test_firewall_manager.py: new tests for single merged ACL block and
  the reload directive
- tests/e2e/api/test_peer_access_update.py: e2e tests for service_access,
  internet_access, and peer_access updates persisting live to iptables/CoreDNS
- tests/e2e/api/test_cell_to_cell.py: e2e tests for cell-to-cell connection
  management, permissions API, and cross-cell service access restrictions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 04:57:37 -04:00
roof f1666ba19c fix: embed DNAT rules in wg0.conf PostUp for persistence + fix dns_ip in server config
DNAT rules applied via docker exec are lost whenever wg-easy reloads the
WireGuard interface (PostDown flushes the nat table then PostUp only
re-adds static rules). Fix: embed DNS (port 53) and service (port 80)
DNAT rules directly in wg0.conf PostUp/PostDown so they reapply on every
interface restart. ensure_postup_dnat() patches existing configs on startup.

get_server_config() now returns the WG server IP (e.g. 10.0.0.1) for
dns_ip instead of the cell-dns container IP (172.20.0.3). This makes the
value consistent with what get_peer_config() writes into the .conf file,
and fixes the stale hint text in Peers.jsx and WireGuard.jsx.

UI: fallback dns_ip changed from 172.20.0.3 to 10.0.0.1; split-tunnel
fallback drops the 172.20.0.0/16 stale range.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 04:07:10 -04:00
roof 9a800e3b6b feat: fix cross-cell service access — DNS DNAT, service DNAT, Caddy routing
DNS A records now return the WireGuard server IP (10.0.0.1) instead of
Docker bridge VIPs so cross-cell peers resolve service names correctly
regardless of their bridge subnet. DNAT rules (wg0:53→cell-dns:53 and
wg0:80→cell-caddy:80) are applied at startup. Caddy routes by Host header,
eliminating the Docker bridge subnet conflict. Firewall cell rules allow
DNS and service (Caddy) traffic from linked cell subnets. Split-tunnel
AllowedIPs now dynamically includes connected-cell VPN subnets and drops
the 172.20.0.0/16 range. Peers with route_via set now receive full-tunnel
config (0.0.0.0/0) so all their traffic exits via the remote cell.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 03:12:09 -04:00
roof 68c27b4521 security: replace WireGuard catch-all ACCEPT with DROP
The PostUp rule appended `iptables -A FORWARD -i wg0 -j ACCEPT` which
allowed any WireGuard-connected client full internet access regardless of
per-peer rules, even when no peers were configured in wg0.conf.

Fix: change PostUp/PostDown to use DROP as the catch-all. Per-peer and
per-cell rules use -I (insert at top) so they take precedence; unknown
or unconfigured WG traffic hits the DROP at the bottom.

Also add reconcile_stale_peer_rules() called on startup to remove FORWARD
rules for peer IPs that no longer exist in the registry, preventing deleted
peers from retaining firewall access across container restarts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 00:31:55 -04:00
roof 8ea834e108 feat: Phase 3 - per-peer internet routing via exit cell
Adds the ability to route a specific peer's internet traffic through a
connected cell acting as an exit relay.

Cell A side:
- PUT /api/peers/<peer>/route-via {"via_cell": "cellB"} sets route_via
- Updates WG AllowedIPs to include 0.0.0.0/0 for the exit cell peer
- Adds ip rule + ip route in policy table inside cell-wireguard so the
  specific peer's traffic egresses via cellB's WG IP
- Sets exit_relay_active on the cell link and pushes use_as_exit_relay=True
  to cellB via peer-sync

Cell B side:
- Receives use_as_exit_relay in the peer-sync payload
- Calls apply_cell_rules(..., exit_relay=True) to add FORWARD -o eth0 ACCEPT
- Stores remote_exit_relay_active flag for startup recovery

Startup recovery:
- apply_all_cell_rules passes exit_relay=remote_exit_relay_active (cellB)
- _apply_startup_enforcement reapplies ip rule for each peer with route_via (cellA)
  since policy routing rules don't survive container restart

peer_registry gets route_via field with lazy migration.
22 new tests across test_cell_link_manager, test_peer_registry, test_peer_route_via.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 16:23:31 -04:00
roof dcee03dd3f feat(cells): Phase 2 — exit-offer signaling between connected cells
Adds the ability for a cell to signal to a peer that it's willing to
route internet traffic on their behalf.  This is the signaling layer
for Phase 3 (per-peer routing via exit cell).

Changes:
- cell_links.json: exit_offered (bool) + remote_exit_offered (bool)
  fields with lazy migration (default false for existing records)
- _push_permissions_to_remote: includes exit_offered in the push body
- apply_remote_permissions: accepts exit_offered kwarg; stores it as
  remote_exit_offered on the matching cell link
- peer-sync receiver: passes exit_offered from body to apply_remote_permissions
- CellLinkManager.set_exit_offered(cell_name, offered): persists +
  triggers push so the remote learns of our offer immediately
- PUT /api/cells/<name>/exit-offer: REST endpoint to toggle the flag
- 12 new tests covering all new paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 15:49:21 -04:00
roof 7da0cbb714 fix: add X-Forwarded-For WG IP to peer-sync push curl command
MASQUERADE rewrites the source IP of forwarded packets from
the cell's WG address (10.0.x.1) to cell-wireguard's bridge
IP (172.20.x.9).  The peer-sync endpoint authenticates callers
by checking that the source IP is inside a known cell's vpn_subnet,
so MASQUERADE caused all pushes to fail with 403.

Fix: _push_permissions_to_remote() now calls _local_wg_ip() to
get the local wg0 address and passes it as X-Forwarded-For.
_authenticate_peer_cell() already supports XFF for exactly this
proxying scenario.  Also adds a test verifying the header is present
in the constructed curl command.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 15:24:08 -04:00
roof 59927b6ad7 fix: whitelist peer-sync endpoint from session auth + CSRF
/api/cells/peer-sync/permissions is called over the WireGuard tunnel
by remote cells — they have no session cookie and cannot produce a CSRF
token. The endpoint authenticates via source IP (must be in the remote
cell's vpn_subnet) and WireGuard public key instead.

Without this, the global enforce_auth hook returns 401 before the route
handler runs, so all cross-cell permission pushes fail even when the
WG tunnel and iptables rules are correct.

Also adds a test verifying the route can be reached without a session.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 14:59:57 -04:00
roof 4a9c4cc58b fix: add kernel routes for cell peers after wg set
wg set updates WireGuard peer state but does not add kernel routes —
unlike wg-quick. Without ip route add, traffic to a remote cell's
vpn_subnet is routed via the default gateway (internet) instead of wg0,
causing all cross-cell pushes to time out with HTTP 000.

- add_cell_peer() now calls _ensure_cell_route(vpn_subnet) after
  writing the peer config and running _syncconf
- _ensure_cell_route() runs docker exec cell-wireguard ip route add
  (idempotent, non-fatal); no-op inside test dirs
- sync_cell_routes() parses wg0.conf at startup to re-add any routes
  lost across container restarts; called from _apply_startup_enforcement
- 5 new unit tests covering both normal and test-dir no-op paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 14:47:22 -04:00
roof ea6731d62c Fix FORWARD rule ordering: embed API-sync ACCEPT inside apply_cell_rules
The per-cell catch-all DROP was reaching position 5 before our ACCEPT
(position 6) because apply_all_cell_rules can re-run after
ensure_cell_api_dnat, pushing the DNAT ACCEPT below the DROP.

Fix: add the API-sync ACCEPT inside apply_cell_rules itself, tagged with
the cell's own tag and inserted LAST (= position 1, above the DROP).
Since it's part of the cell's rule block it is always in the right
position relative to the catch-all DROP, regardless of call order.

Also adds _get_cell_api_ip() helper (docker inspect cell-api) so the
destination IP is always current, and two new tests that verify both the
rule exists and that the insertion order guarantees it wins over DROP.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 14:05:49 -04:00
roof 4ba79fd614 Fix Phase 1 permission sync: route push via cell-wireguard + DNAT receive
cell-api has no route to remote WG tunnel IPs — only cell-wireguard does.
Fix _push_permissions_to_remote() to use 'docker exec cell-wireguard curl'
so outbound sync HTTP traverses the WG tunnel from the right namespace.

On the receive side, add ensure_cell_api_dnat() which installs three
iptables rules inside cell-wireguard on startup:
  - PREROUTING DNAT: wg0:3000 → cell-api:3000 (Docker bridge IP)
  - POSTROUTING MASQUERADE: so cell-api's reply routes back via wg0
  - FORWARD ACCEPT: allow the wg0→eth0 forwarded traffic

Called from _apply_startup_enforcement() so rules survive container restarts.
Tests updated to mock subprocess.run instead of urllib.request.urlopen.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 13:48:49 -04:00
roof a3d0cd5a48 feat(cells): Phase 1 — permission sync between connected PICs
When PIC A updates service sharing permissions, it immediately pushes
the mirrored state to PIC B over the WireGuard tunnel so B's UI shows
what A is sharing with it in real time.

Architecture:
- Push model: update_permissions() → _push_permissions_to_remote() →
  POST /api/cells/peer-sync/permissions on remote cell
- Auth: source IP must be inside a known cell's vpn_subnet (WireGuard
  tunnel proves identity) + body's from_public_key must match stored key
- Mirror semantics: our inbound (what we share) → their outbound view
- Non-fatal: push failures set pending_push=True; replay_pending_pushes()
  retries at startup so offline cells catch up on reconnect
- add_connection() also pushes initial state so remote sees permissions
  immediately on the first connect

New fields on cell_links.json records (lazy-migrated):
  remote_api_url, last_push_status, last_push_at, last_push_error,
  pending_push, last_remote_update_at

New endpoint: POST /api/cells/peer-sync/permissions

30 new tests (1101 total).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 13:12:30 -04:00
roof 562d866a65 feat(cells): Phase 3 tests + Phase 4 UI for cell service-sharing
Phase 3 — tests (50 new, total now 1071):
- test_cell_link_manager: atomicity (WG fail → DNS not called, link not
  persisted), DNS warning non-fatal, inbound_services arg, unknown service
  filtered, update/get permissions, lazy migration of legacy entries
- test_wireguard_manager: subnet overlap rejection (exact, supernet, adjacent
  non-overlapping, different class-A, honours wg0.conf configured network)
- test_firewall_manager: _cell_tag sanitisation, apply_cell_rules emits correct
  ACCEPT/DROP per service + catch-all DROP, clear_cell_rules no-op and exact
  line removal, apply_all_cell_rules iterates with correct args
- test_cells_endpoints: RuntimeError→400, GET /services, GET/PUT permissions
  (200/400/404 paths, service name validation, arg forwarding)

Phase 4 — UI:
- CellNetwork.jsx: replace flat cell list with CellPanel expandable cards;
  add ServiceShareToggle (ARIA switch, saves immediately), InboundServiceBadge
  (read-only), DisconnectConfirmModal (replaces window.confirm); relative
  timestamps; paste validation on blur; WireGuard status merged by public_key
- api.js: add cellLinkAPI.getPermissions, updatePermissions, getServices

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 08:45:32 -04:00
roof 5d0238ff3c A5: Extract config routes into blueprint (app.py 1294 → 579 lines)
Move all /api/config/* routes and pending-restart helpers into
routes/config.py. Re-export helpers from app.py for backward compat:

  from routes.config import _set_pending_restart, _clear_pending_restart,
                           _collect_service_ports, _dedup_changes

Test patches updated:
  app._set_pending_restart     → routes.config._set_pending_restart
  app._clear_pending_restart   → routes.config._clear_pending_restart
  app.threading.Thread         → routes.config.threading.Thread

Remaining in app.py: Flask setup, middleware, health monitor thread,
/health, /api/status, /api/health/history* (use module-level state).

1021 tests passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 06:53:24 -04:00
roof 09138fbc18 A5: Extract all route groups into Flask blueprints (app.py -1735 lines)
Extract 9 route groups out of app.py into routes/ blueprints:
- routes/network.py  — DNS, DHCP, NTP, network info/test (10 routes)
- routes/wireguard.py — WireGuard keys, peers, config, enforcement (18 routes)
- routes/cells.py    — cell-to-cell connections (5 routes)
- routes/peers.py    — peer CRUD + IP update + _next_peer_ip helper (10 routes)
- routes/routing.py  — NAT, peer routes, firewall, iptables (17 routes)
- routes/vault.py    — certs, trust, secrets (19 routes)
- routes/containers.py — containers, images, volumes (14 routes)
- routes/services.py — service bus, logs, services status/connectivity (18 routes)
- routes/peer_dashboard.py — peer-scoped dashboard/services (2 routes)

All blueprints use lazy `from app import X` inside route bodies to preserve
test patch compatibility (patch('app.email_manager', mock) still works).

Also included in this commit:
- A1 fix: backup/restore now includes email/calendar user files
- A2 fix: apply_config sets applying=True flag via helper container
- A3 fix: add_peer rolls back firewall on DNS failure

app.py reduced: 3011 → 1294 lines. 1021 tests passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 06:11:21 -04:00
roof d54844cd44 fix(P2): peer add rollback, helper failure recovery, manager extraction (A2/A3/A5)
A3 — Peer add atomicity: track firewall_applied flag and call
clear_peer_rules() during rollback so partial peer-add failures
don't leave stale iptables rules behind. Added test.

A2 — Pending config flag: instead of clearing before spawning the
helper container (fire-and-forget), set applying=True and let the
helper clear it on success by writing to cell_config.json via a
mounted /app/data volume. On API restart after a failed apply,
_recover_pending_apply() resets the applying flag so the UI shows
pending changes and the user can retry. GET /api/config/pending now
includes the applying field.

A5 (foundation) — Extract all manager instantiation into managers.py.
app.py re-exports every name so existing test patches (patch('app.X'))
continue to work unchanged. 1021 unit tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 05:27:39 -04:00
roof 2455fe189e fix: apply_cell_name regex now matches zone files with TTL field
_generate_zone_content writes records as "name TTL IN A value" but the
regex only matched "name IN A value" (no TTL), so renaming the cell
never updated the DNS hostname record. Updated regex to make TTL optional.
Also fixed the unit test zone fixture to use the actual generated format.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 09:32:51 -04:00
roof a43f9fbf0d fix: full security audit remediation — P0/P1/P2/P3 fixes + 1020 passing tests
P0 — Broken functionality:
- Fix 12+ endpoints with wrong manager method signatures (email/calendar/file/routing)
- Fix email_manager.delete_email_user() missing domain arg
- Fix cell-link DNS forwarding wiped on every peer change (generate_corefile now
  accepts cell_links param; add/remove_cell_dns_forward no longer clobber the file)
- Fix Flask SECRET_KEY regenerating on every restart (persisted to DATA_DIR)
- Fix _next_peer_ip exhaustion returning 500 instead of 409
- Fix ConfigManager Caddyfile path (/app/config-caddy/)
- Fix UI double-add and wrong-key peer bugs in Peers.jsx / WireGuard.jsx
- Remove hardcoded credentials from Dashboard.jsx

P1 — Security:
- CSRF token validation on all POST/PUT/DELETE/PATCH to /api/* (double-submit pattern)
- enforce_auth: 503 only when users file readable but empty; never bypass on IOError
- WireGuard add_cell_peer: validate pubkey, name, endpoint against strict regexes
- DNS add_cell_dns_forward: validate IP and domain; reject injection chars
- DNS zone write: realpath containment + record content validation
- iptables comment /32 suffix prevents substring match deleting wrong peer rules
- is_local_request() trusts only loopback + 172.16.0.0/12 (Docker bridge)
- POST /api/containers: volume allow-list prevents arbitrary host mounts
- file_manager: bcrypt ($2b→$2y) for WebDAV; realpath containment in delete_user
- email/calendar: stop persisting plaintext passwords in user records
- routing_manager: validate IPs, networks, and interface names
- peer_registry: write peers.json at mode 0o600
- vault_manager: Fernet key file at mode 0o600
- CORS: lock down to explicit origin list
- domain/cell_name validation: reject newline, brace, semicolon injection chars

P2 — Architecture:
- Peer add: rollback registry entry if firewall rules fail post-add
- restart_service(): base class now calls _restart_container(); email and calendar
  managers call cell-mail / cell-radicale respectively
- email/calendar managers sync user list (no passwords) to cell_config.json
- Pending-restart flag cleared only after helper subprocess exits with code 0
- docker-compose.yml: add config-caddy volume to API container

P3 — Tests (854 → 1020):
- Fill test_email_endpoints.py, test_calendar_endpoints.py,
  test_network_endpoints.py, test_routing_endpoints.py
- New: test_peer_management_update.py, test_peer_management_edge_cases.py,
  test_input_validation.py, test_enforce_auth_configured.py,
  test_cell_link_dns.py, test_logs_endpoints.py, test_cells_endpoints.py,
  test_is_local_request_per_endpoint.py, test_caddy_routing.py
- E2E conftest: skip WireGuard suite when wg-quick absent
- Update existing tests to match fixed signatures and comment formats

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-27 11:30:21 -04:00
roof 0c12e3fc97 fix: change domain from dev to lan to avoid browser HSTS preload blocking HTTP
The .dev TLD has been HSTS preloaded in Chrome/Firefox/Safari/Edge since 2019.
Browsers silently redirect http://anything.dev to https://anything.dev before
making any network request. Since Caddy has auto_https off, all browser-based
access to .dev domains fails with a connection error even though DNS, routing,
and HTTP all work correctly (curl works; browsers don't).

- cell_config.json: domain "dev" -> "lan"
- Caddyfile: all http://*.dev blocks -> http://*.lan
- Corefile: dev zone -> lan zone (file /data/lan.zone)
- data/dns/lan.zone: new zone file (dev.zone removed live)
- test_wg_domain_access.py: remove hardcoded DOMAIN_IPS / .dev references;
  read domain from /api/config at runtime so tests work with any configured TLD

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-27 01:54:33 -04:00
roof 32272420cb test: add E2E coverage for peer dashboard/services, DNS records, and WG domain access
- test_peer_dashboard_services.py (63 tests): unit tests for all API fixes
  * peer_dashboard field names (name/transfer_rx/transfer_tx vs old stale names)
  * peer_dashboard service_urls dict with correct domain-keyed URLs
  * peer_services email structure (nested smtp/imap, address not username)
  * peer_services files key (not webdav), caldav URL (calendar.dev not radicale.dev:5232)
  * peer_services wireguard DNS (not 10.0.0.1), config text with DNS line
  * DNS zone records (api/webui → Caddy, VIPs for calendar/files/mail/webdav)
  * Caddyfile generation (all service blocks including webui.dev)
  * Access control (401 anon, 403 admin on peer-only routes, 404 missing peer)
- e2e/api/test_peer_endpoints.py: fix stale field assertions, add structure checks
- e2e/wg/test_wg_domain_access.py: E2E WG tests for DNS resolution via VPN tunnel
  * All *.dev domains resolve to correct IPs via CoreDNS
  * api.dev/webui.dev must resolve to Caddy, not container direct IPs
  * CoreDNS reachability through VPN tunnel
  * Peer config DNS field correctness
- e2e/ui/test_peer_dashboard.py: UI checks for service icon links, CalDAV URL, email

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-26 17:41:21 -04:00
roof e5d59fd94d fix: sync API key-store from wg0.conf to prevent WireGuard handshake failure
linuxserver/wireguard auto-generates its own PrivateKey on first container
start, independently of the PIC API's key-store.  When the two diverge, the
API generates peer configs with the wrong server public key and the WireGuard
handshake fails silently — the client can ping the VPN subnet (10.0.0.x) but
gets no internet and cannot reach any Docker service (172.20.0.x).

Adds _sync_keys_from_conf(): called at the top of apply_config(), reads the
PrivateKey from wg0.conf, derives the matching public key, and overwrites the
API key files (private.key / public.key) if they differ.  This makes wg0.conf
the authoritative source for the server identity, keeping get_peer_config()
consistent with the live WireGuard interface.

Adds 5 new tests in TestSyncKeysFromConf covering:
- key-store update when conf key differs
- no-op when keys already match
- get_peer_config() uses the synced key
- no raise when conf is missing
- apply_config() passes the synced key through bootstrap

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-26 16:40:21 -04:00
roof 9418c3da5b feat: restore WireGuard peers after bootstrap and add VPN routing tests
apply_config() now calls _load_registered_peers() when wg0.conf is empty
so all active peers from peers.json are written back into the config file
after a bootstrap — preventing clients from losing tunnel access after
an API restart that regenerated wg0.conf from scratch.

Adds test_wireguard_vpn_routing.py (36 tests) covering:
- generate_config() PostUp/PostDown rules enabling internet forwarding
  (MASQUERADE + FORWARD ACCEPT required for internet-through-VPN)
- get_peer_config() DNS field pointing to cell-dns for domain resolution
- apply_config() bootstrap peer restoration from peers.json
- _load_registered_peers() filtering (inactive, missing fields, malformed)
- add_peer() /32 AllowedIPs enforcement to prevent route leaks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-26 15:33:57 -04:00
roof 580d8af7ae fix: port changes now propagate to containers via env file in-place writes
Root cause: write_env_file used os.replace() which creates a new inode.
Docker file bind-mounts track the original inode at mount time, so the
container's /app/.env.compose never saw updates — docker compose always
read the stale port value and skipped container recreation.

Fixes:
- ip_utils.write_env_file: write in-place (open 'w') instead of os.replace()
  so Docker bind-mounted files see the update immediately
- apply_pending_config: add --force-recreate to docker compose up for
  specific-container restarts, bypassing config-hash comparison as a
  belt-and-suspenders measure

Tests added:
- TestWriteEnvFileInPlace: verifies inode is preserved across writes
- TestApplyPendingConfigForceRecreate: verifies --force-recreate is in the
  docker compose command for specific-container restarts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-26 15:00:43 -04:00
roof de5ff75a2e fix: wireguard_port identity change and check_port_open verification
Bug 1 — port not propagated to wg0.conf:
  The identity update path (wireguard_port via PUT /api/config) was calling
  wireguard_manager.update_config() which only saves to a JSON file via
  BaseServiceManager. wg0.conf was never updated, so after a container
  restart the WireGuard interface would still listen on the old port.
  Fix: call apply_config() instead — it writes ListenPort into wg0.conf.

Bug 2 — check_port_open ignored configured port:
  check_port_open() checked for 'listening port' in wg show output but
  never compared it against the configured port. A port-mismatch (e.g.
  after config change but before restart) would return True — misleading.
  Fix: require 'listening port: {configured_port}' to match exactly.

Tests added:
  - test_check_port_open_wrong_port_returns_false
  - test_check_port_open_explicit_port_matches
  - test_check_port_open_explicit_port_mismatch
  - test_wireguard_port_identity_change_calls_apply_config
  - test_wireguard_port_same_value_does_not_call_apply_config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-26 08:41:22 -04:00
roof 9677755b4f fix: e2e/integration test infrastructure and Makefile test targets
- Fix make test: was pointing to non-existent api/tests/, now runs unit tests
  correctly with --ignore=e2e --ignore=integration
- Remove dead phase test targets (test-phase1..4, test-all-phases) that all
  referenced cd api && pytest tests/ (non-existent path)
- Add .test_admin_pass file: reset_admin_password.py now writes a persistent
  test password file alongside .admin_initial_password; the API never deletes
  it (unlike .admin_initial_password which is consumed on first startup)
- Update both integration/conftest.py and e2e/helpers/admin_password.py to
  read .test_admin_pass before .admin_initial_password — so tests work after
  make restart without needing PIC_ADMIN_PASS env var
- Add AI collaboration rules to CLAUDE.md (auto-loaded every session)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-26 08:27:27 -04:00
roof 420dced9ff fix: WireGuard peer sync, privileged mode, E2E and integration test correctness
- api/app.py: sync WireGuard server config on peer add/remove (non-fatal)
- docker-compose.yml: add privileged:true to wireguard service
- E2E tests: fix logout selector, DNS IP lookup, wg config DNS line, VIP skip guards,
  badge text selectors, heading .first, async logout wait
- Integration tests: fix 4 tests that sent unauthenticated requests expecting 400
  (now use authenticated session helpers); accept 401 as valid in webui proxy test;
  add password field to service_access validation test
- Remove stale tracked config templates (config/api/api/*, config/api/cell.env, etc.)
  that no longer exist on disk after config layout was reorganised

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-26 06:04:40 -04:00
roof 31a7951ffd fix: 4 issues — admin password sudo, peer modal, WireGuard fetch creds, port check
1. make reset/show-admin-password: use sudo so data/api/ owned-by-root
   files are writable without explicit sudo prefix

2. Peers.jsx: remove one-time password modal on peer creation — admin
   already knows the password they typed; replace with a success toast
   showing peer name and provisioned accounts

3. WireGuard.jsx + Peers.jsx: add credentials:'include' to every raw
   fetch() call (7 calls across two files, plus fix one hardcoded
   localhost:3000 URL); the port check and peer status calls were
   returning 401 because they didn't send the session cookie

4. test_admin_wireguard.py: update test to match new toast flow (no modal),
   add Scenario 10 test that verifies the port check badge renders on the
   WireGuard page after the credentials fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-26 03:33:11 -04:00
roof 7d2979b8af fix: integration and E2E test correctness after auth enforcement
config_manager: make per-file copy errors non-fatal during restore
  (resolves test failures when /app/config/* is not writable by test runner)
test_live_api.py: fix NameError (_req.Session not requests.Session)
test_negative_scenarios.py: replace raw requests.* with authenticated _S.*
  (all endpoints now require auth; unauthenticated calls return 401)
wg/conftest.py: fix wg_server_info — public key is at /api/wireguard/keys
test_admin_navigation.py, test_peer_acl.py: add .first to ambiguous locators
  to avoid Playwright strict-mode errors when desktop+mobile nav both mount

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 18:14:38 -04:00
roof 0d32038150 feat: add comprehensive E2E test suite (Playwright + WireGuard + API)
Adds tests/e2e/ with three layers of E2E coverage:
- API layer (tests/e2e/api/): unauthenticated access, admin endpoints,
  peer endpoints, access control enforcement — 24 tests
- Playwright UI (tests/e2e/ui/): login flows, admin navigation, peer
  dashboard/services, role-based ACL, password change — 60+ tests
- WireGuard connectivity (tests/e2e/wg/): tunnel up/down, DNS resolution
  through VPN, service ACL enforcement via iptables, full-tunnel routing
Shared helpers: PicAPIClient, WGInterface, playwright_login, cleanup.
Makefile targets: test-e2e-api, test-e2e-ui, test-e2e-wg, test-e2e.
Adds scripts/reset_admin_password.py for test bootstrap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 16:41:13 -04:00
roof fc3cfc9741 Fix post-deploy auth issues: best-effort service provisioning, integration test auth, test mock corrections
- api/app.py: email/calendar/files provisioning now best-effort (non-fatal); fixed email_manager.create_email_user call to include domain argument
- tests/integration: added module-level auth sessions to all integration test files; added admin auth to api fixture and _resolve_admin_pass() helper; added TEST_PEER_PASSWORD constant; added password to peer creation calls
- tests/test_peer_provisioning.py: renamed rollback test to reflect new best-effort semantics (email failure no longer causes rollback)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 15:42:03 -04:00
roof 8650704316 feat: add authentication and authorization system
Backend:
- AuthManager (api/auth_manager.py): server-side user store with bcrypt
  password hashing, account lockout after 5 failed attempts (15 min),
  and atomic file writes
- AuthRoutes (api/auth_routes.py): Blueprint at /api/auth/* — login,
  logout, me, change-password, admin reset-password, list-users
- app.py: register auth_bp blueprint; add enforce_auth before_request
  hook (401 for unauthenticated, 403 for wrong role; only active when
  auth store has users so pre-auth tests remain green); instantiate
  AuthManager; update POST /api/peers to require password >= 10 chars
  and auto-provision email + calendar + files + auth accounts with full
  rollback on any failure; extend DELETE /api/peers to tear down all
  four service accounts; add /api/peer/dashboard and /api/peer/services
  peer-scoped routes; fix is_local_request to also trust the last
  X-Forwarded-For entry appended by the reverse proxy (Caddy)
- Role-based access: admin for /api/* (except /api/auth/* which is
  public and /api/peer/* which is peer-only)
- setup_cell.py: generate and print initial admin password, store in
  .admin_initial_password with 0600 permissions; cleaned up on first
  admin login

Frontend:
- AuthContext.jsx: React context with login/logout/me state and Axios
  interceptor for automatic 401 redirect
- PrivateRoute.jsx: route guard component
- Login.jsx: login page with error handling and must-change-password
  redirect
- AccountSettings.jsx: change-password form for any authenticated user
- PeerDashboard.jsx: peer-role landing page (IP, service list)
- MyServices.jsx: peer service links page
- App.jsx, Sidebar.jsx: AuthContext integration, logout button,
  PrivateRoute wrappers, peer-role routing
- Peers.jsx, WireGuard.jsx, api.js: auth-aware API calls

Tests: 100 new auth tests all pass (test_auth_manager, test_auth_routes,
test_route_protection, test_peer_provisioning). Fix pre-existing test
failures: update WireGuard test keys to valid 44-char base64 format
(test_wireguard_manager, test_peer_wg_integration), add password field
and service manager mocks to test_api_endpoints peer tests, add auth
helpers to conftest.py. Full suite: 845 passed, 0 failures.

Fixed: .admin_initial_password security cleanup on bootstrap, username
minimum length (3 chars enforced by USERNAME_RE regex)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 15:00:06 -04:00
roof a338836bb8 add security fixes, port hardening, and expanded QA coverage
Security fixes:
- Replace debug=True with env-driven FLASK_DEBUG in app.py
- Add _safe_path helper and path-traversal protection to all 6 file routes
  in file_manager.py
- Add peer_name regex and input validation (public_key, name, endpoint_ip)
  in wireguard_manager.py
- Stop returning private key from GET /api/wireguard/keys; return only
  public_key + has_private_key boolean
- Fix is_local_request() XFF bypass by checking remote_addr only, ignoring
  X-Forwarded-For
- Remove duplicate get_all_configs / get_config_summary methods from
  config_manager.py

DevOps:
- Bind 6 internal service ports to 127.0.0.1 in docker-compose.yml
  (radicale, webdav, api, webui, rainloop, filegator)
- Move WebDAV credentials to env vars (WEBDAV_USER, WEBDAV_PASS)
- Pin flask, flask-cors, requests, cryptography, docker to secure minimum
  versions in requirements.txt

QA (560 tests, 0 failures):
- tests/test_wireguard_endpoints.py: 18 new endpoint tests
- tests/test_file_endpoints.py: 24 new endpoint tests incl. path traversal
- tests/test_container_manager.py: expanded from 2 to 30 tests
- tests/test_config_backup_restore_http.py: 25 new tests (new file)
- tests/test_config_apply.py: 9 new tests (new file)

Docs:
- Rewrite README.md with accurate architecture, ports, env vars, security notes
- Rewrite QUICKSTART.md with verified commands

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 13:08:24 -04:00