Commit Graph

133 Commits

Author SHA1 Message Date
roof 16fb362df7 feat: replace hardcoded service names with ServiceRegistry-driven Caddy and CoreDNS config
Unit Tests / test (push) Failing after 11s
Previously, CaddyManager and NetworkManager contained hardcoded lists of
service names (calendar, files, mail, webdav, etc.), meaning every new
service required a code change to appear in Caddy routes and DNS records.
Now both managers accept a service_registry parameter and derive their
service lists dynamically from the registry at runtime.

- CaddyManager: new _build_registry_service_routes() and
  _http01_service_pairs() methods pull routes from the registry
- NetworkManager: new _get_service_subdomains() method returns registry
  subdomains with a hardcoded fallback when no registry is wired in;
  _build_dns_records, stale-record detection, and service name sets all
  use the registry
- managers.py: service_registry constructed before network_manager so it
  can be injected into both CaddyManager and NetworkManager
- service_registry.py: validation chokepoint in get_caddy_routes() rejects
  invalid subdomain/backend values and reserved service names
- service_store_manager.py: _validate_manifest now validates top-level
  subdomain, backend, extra_subdomains, and extra_backends fields
- tests: 24 new tests covering registry-driven routing and DNS subdomain
  generation (test_caddy_registry_integration.py)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 18:27:52 -04:00
roof 0afdee32da feat: Services UI — nested nav, per-service pages, settings migration
Rename Store → Services: ServicesIndex.jsx shows built-in core services
(Email, Calendar, Files) with Manage links, plus the existing add-on
store below.

New service sub-pages at /services/email|calendar|files serve both
admin and peer roles. Admins see connection info, service status, users
list, and an inline config form (port/data-dir). Peers see connection
info and their personal credentials fetched from peerAPI.

Navigation restructured: a Services parent item expands to show the
three sub-pages via a collapsible sidebar group (ChevronDown toggle).
Both admin and peer navigation include the Services group. NavItem and
NavList components were extracted from Sidebar to eliminate the
duplicated mobile/desktop rendering.

Settings.jsx drops EmailForm, CalendarForm, FilesForm and their
SERVICE_DEFS entries. Port conflict detection and per-service validation
logic extracted to utils/serviceConfig.js, shared by Settings and the
new service pages. Service form flushers are registered without cleanup
so the Apply banner saves dirty config even when the user navigates away
from a service page before clicking Apply.

Legacy routes /email, /calendar, /files, /store redirect to their new
canonical paths.

GET /api/config now includes installed_services so the nav can derive
which add-ons are installed without a separate store fetch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 06:46:17 -04:00
roof b16189d00f Fix three DNS corruption bugs in DDNS/non-LAN mode
Unit Tests / test (push) Successful in 11m30s
apply_cell_name() now skips multi-label zone files (split-horizon DDNS
zones like pic2.pic.ngo.zone) and excludes '*' and '@' from hostname
candidate detection, preventing the wildcard record from being renamed
to the old cell name during a cell rename.

update_split_horizon_zone() now deletes stale zone files from previous
cell names sharing the same TLD (e.g. pic3.pic.ngo.zone when renaming
to pic2.pic.ngo), eliminating orphaned DNS entries.

_bootstrap_dns() now detects non-LAN domain modes and calls
update_split_horizon_zone() instead of apply_ip_range(), preventing
service records (api, calendar, files…) from being re-injected into
the DDNS parent zone on every container restart.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 05:56:00 -04:00
roof 66500bb128 fix: use effective_domain for service links and clean up stale DNS records
Unit Tests / test (push) Successful in 11m32s
Dashboard, Email, Calendar, and Files pages were building service URLs
with the internal LAN zone name (e.g. 'cell') instead of the public
effective domain (e.g. 'pic2.pic.ngo'), and always using http:// even
in DDNS mode where HTTPS is available.

Changes:
- Dashboard/Email/Calendar/Files: read effective_domain + domain_mode
  from ConfigContext; use effective_domain in non-LAN mode and https://
  for all DDNS domain modes.
- Calendar: show port 443 instead of 80 in DDNS mode.
- network_manager.update_split_horizon_zone: when the primary internal
  zone name is a parent of the effective DDNS domain (e.g. pic.ngo is a
  parent of pic2.pic.ngo), remove stale bootstrap service records (api,
  calendar, files, mail, webmail, webdav) that pollute the DNS display
  and would shadow public DNS responses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 05:06:52 -04:00
roof d7dbd596ab feat: route PIC services as subdomains of the cell's effective domain
Unit Tests / test (push) Successful in 11m33s
In DDNS modes (pic_ngo, cloudflare, duckdns, http01), all built-in
services are now reachable as subdomains of the cell domain, e.g.
calendar.pic1.pic.ngo instead of pic1.pic.ngo/calendar.

Key changes:
- CaddyManager._build_core_service_routes(): new helper generates
  Caddy named-matcher host blocks for calendar, mail/webmail, files,
  webdav, and api subdomains within the wildcard TLS server block.
- All ACME modes (pic_ngo, cloudflare, duckdns) use the new
  subdomain matchers; http01 emits a dedicated server block per service.
- http01: installed store-plugin services whose name clashes with a
  core service are skipped to prevent duplicate server blocks.
- routes/config.py: ip_utils.write_caddyfile() is skipped in non-LAN
  modes so LAN Caddy config never overwrites the ACME config.
- firewall_manager.generate_corefile(): new split_horizon_zones param
  adds local authoritative file zones so LAN clients resolve
  *.pic1.pic.ngo to the internal Caddy IP without hairpin NAT.
- NetworkManager.update_split_horizon_zone(): writes the wildcard zone
  file and regenerates the Corefile with the split-horizon block;
  called automatically after every identity change in non-LAN mode.
- Added @ to allowed record-name chars in update_dns_zone validation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 04:31:57 -04:00
roof 1f016de855 feat: make DDNS domain_name the effective domain across all services
Unit Tests / test (push) Successful in 11m35s
- ConfigManager.get_effective_domain(): returns domain_name when DDNS
  active (pic_ngo/cloudflare/duckdns), domain otherwise. Used by all
  public-facing services so they use the real registered FQDN.
- ConfigManager.get_internal_domain(): always returns _identity.domain
  (CoreDNS zone name, dnsmasq, cell-link invites — stays internal).
- Silent migration: if domain_mode != lan and domain is generic "cell",
  auto-set to {cell_name}.local for unique CoreDNS zone naming.
- caddy_manager: fix custom_domain bug — cloudflare/http01 modes were
  reading identity.get('custom_domain') which never exists; now reads
  domain_name correctly.
- routes/config, app: expose effective_domain in GET /api/config and
  /api/status responses.
- email_manager, routes/email: use get_effective_domain() for
  OVERRIDE_HOSTNAME, POSTMASTER_ADDRESS, and new-user email defaults.
- ServiceBus.IDENTITY_CHANGED event: emitted from PUT /api/config and
  POST /api/ddns/register after identity writes; caddy_manager and
  email_manager subscribe to regenerate config automatically.
- Settings.jsx: hide Local Domain input in non-LAN modes; show
  read-only effective_domain with "managed by DDNS" badge and an
  Advanced toggle for the internal CoreDNS zone name.
- 11 new test classes covering all new helpers, event subscriptions,
  caddy/email handlers, and the custom_domain fix.
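
A minimal sketch of the effective/internal domain split described above. This is not the actual ConfigManager code: the standalone functions, the plain-dict identity, and the "cell" default are assumptions made for illustration. Only the mode names and the domain/domain_name keys come from the commit message.

```python
from typing import Mapping

# Modes the commit message lists as "DDNS active".
DDNS_MODES = frozenset({"pic_ngo", "cloudflare", "duckdns"})


def effective_domain(identity: Mapping[str, str]) -> str:
    """Return the public FQDN when DDNS is active, else the internal domain."""
    mode = identity.get("domain_mode", "lan")
    domain_name = identity.get("domain_name")
    if mode in DDNS_MODES and domain_name:
        return domain_name
    # Hypothetical default: fall back to the generic internal zone name.
    return identity.get("domain", "cell")


def internal_domain(identity: Mapping[str, str]) -> str:
    """Always return the internal (CoreDNS zone) domain, regardless of mode."""
    return identity.get("domain", "cell")
```

Keeping the two lookups separate means public-facing consumers never need to know which DDNS provider is active, while internal consumers (CoreDNS, dnsmasq, cell-link invites) are unaffected by a mode switch.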

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 02:48:47 -04:00
roof ad2eaca273 feat: release old pic.ngo subdomain when cell name changes
Unit Tests / test (push) Successful in 15m45s
Adds DELETE /api/v1/registration to the DDNS server (token-authenticated,
owner-only) and PicNgoDDNS.release() on the client. DDNSManager.register()
now automatically releases the old subdomain before claiming the new one,
so stale names are freed for others to use. Release failures are logged as
warnings and do not block the new registration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-26 17:07:13 -04:00
roof de43f4a9a0 fix: DDNS register() always sends public IP and saves token to correct location
Unit Tests / test (push) Successful in 15m27s
Two bugs that prevented registration from working after wizard completion:
1. register(name, '') sent empty IP; server stored blank A record. Now calls
   _get_public_ip() when ip is empty so the A record is always set correctly.
2. Token was saved to _identity.domain.ddns.token (TypeError when domain is a
   string) instead of the top-level ddns config where update_ip() reads it.
   Subdomain also now correctly written to _identity.domain_name.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-26 16:05:55 -04:00
roof 0b31d02f10 feat: DDNS self-healing heartbeat + manual re-register endpoint
Unit Tests / test (push) Successful in 15m26s
- DDNSTokenExpired exception triggers auto re-register in update_ip()
  so cells recover silently after a DDNS DB reset
- POST /api/ddns/register lets the user force re-registration from Settings
- Re-register button in Settings → External Domain & DDNS (pic_ngo only)
- 3 new tests covering register endpoint: wrong provider, missing name, success

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-26 15:05:27 -04:00
roof cde177966d fix: DDNS URL env var takes priority; switch default to HTTPS
- ddns_manager: DDNS_URL env var overrides stored api_base_url so
  existing cells pick up the new HTTPS endpoint without re-registering
- docker-compose.yml: default DDNS_URL now points to https://ddns.pic.ngo
- setup_manager.py: add rstrip('/') before replacing /api/v1 to handle
  URLs with or without trailing slash

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-26 14:50:28 -04:00
roof 61e8631c7d feat: DDNS settings integration — check availability, update credentials
- GET /api/config now returns domain_mode, domain_name, ddns.{provider,subdomain,has_token}
- GET /api/ddns/check/<name> proxies availability check to DDNS service
- PUT /api/ddns validates and saves cloudflare/duckdns credentials post-setup
- When cell_name changes for pic_ngo provider, auto-registers the new subdomain
- Settings: Cell Name shows availability badge for pic_ngo; auto-save blocks on taken
- Settings: new External Domain & DDNS section — pic_ngo info, cloudflare/duckdns edit
- 11 new tests for the two new endpoints (all pass)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-26 14:35:37 -04:00
roof 777ffa4fb2 fix: use DDNS_URL env var for availability check; default to port 8080
Unit Tests / test (push) Successful in 15m23s
_check_pic_ngo_available was hardcoding https://ddns.pic.ngo, ignoring
DDNS_URL. Now imports DDNS_API_BASE from setup_manager so both the
availability check and DDNS registration use the same configured URL.

API container now receives DDNS_URL and DDNS_TOTP_SECRET from env.
Default DDNS_URL points to http://ddns.pic.ngo:8080/api/v1 (the
FastAPI service runs on port 8080 without TLS termination in front).

Also returns 503 (not 500) when the DDNS service is unreachable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-26 13:06:44 -04:00
roof 1c62c47475 fix: 500 on setup complete + wizard shows all 7 steps
Unit Tests / test (push) Successful in 15m41s
Two bugs:

1. AttributeError: AuthManager.update_password does not exist — the
   fallback when create_user fails should call set_password_admin().
   This caused a 500 on every setup submit when an admin user already
   existed (e.g. from a previous install attempt).

2. Wizard was jumping to step 2 and skipping domain steps 3-4 when
   preconfigured data existed in cell_config.json. Since the installer
   no longer sets that data, and the wizard must always show all steps,
   the installerConfigured state and all step-skipping navigation is
   removed. Values are still pre-filled if found in config.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 16:41:33 -04:00
roof 4a42ff5dcc wizard: move all config to /setup; install.sh is infrastructure-only
Unit Tests / test (push) Successful in 15m41s
install.sh no longer prompts for anything. It installs packages (with sudo),
creates the system user, clones the repo, and runs 'make install' — all as
the invoking user. Only package installs and system-level ops use sudo.
All folder creation happens under the user's own account, no chown needed.

/setup wizard gains the missing validation that was previously in install.sh:
- Step 1: checks pic.ngo name availability via backend (non-blocking)
- Step 4: 'Verify token' button for Cloudflare and DuckDNS tokens,
  validated server-side through new /api/setup/validate steps

API changes (routes/setup.py):
- validate step 'pic_ngo_available': proxy check to ddns.pic.ngo
- validate step 'cloudflare_token': verify via Cloudflare tokens API
- validate step 'duckdns_token': verify via DuckDNS update endpoint

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 16:07:56 -04:00
roof 2d842abe5b installer: restore cell identity prompts and domain setup
Unit Tests / test (push) Successful in 15m39s
Reverts 8d1ef39. The installer must collect cell name, domain mode, and
provider tokens before 'make install' so that DDNS registration,
availability checks, and Caddy TLS can be configured at first boot.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 15:01:32 -04:00
roof f550f04ce2 Fix DDNS registration and wizard pre-fill after installer run
Unit Tests / test (push) Successful in 15m29s
DDNS registration (setup_cell.py):
- Replace pyotp dependency with stdlib TOTP (HMAC-SHA1, RFC 6238)
  pyotp is only available inside the Docker container, not on the host
  where setup_cell.py runs — registration was silently skipped every time
- OTP header still sent if generation succeeds; omitted gracefully if not
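
For reference, a stdlib-only TOTP generator along the lines of RFC 6238 (HMAC-SHA1) might look like the sketch below. The function name and signature are hypothetical and not taken from setup_cell.py. The key is passed as raw bytes, so a base32 shared secret would need decoding first (for example with base64.b32decode).

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(key: bytes, for_time: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 TOTP code using HMAC-SHA1 (hypothetical helper)."""
    if for_time is None:
        for_time = time.time()
    # Number of whole time steps since the Unix epoch.
    counter = int(for_time) // step
    # HOTP (RFC 4226): HMAC over the 8-byte big-endian counter.
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte selects the offset.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

With the RFC 6238 test key b"12345678901234567890", a timestamp of 59, and 8 digits, this yields "94287082", which matches the SHA-1 test vector published in the RFC.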

Wizard pre-fill (setup_manager + Setup.jsx):
- GET /api/setup/status now returns 'preconfigured' dict with cell_name,
  domain_mode, domain_name, and provider tokens from installer-written config
- Setup.jsx fetches status on mount and pre-fills all form state so the
  user only needs to set password, services, and timezone — not re-enter
  the identity they already configured in the bash installer
- Fails silently so wizard still works on fresh installs with no config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 12:22:53 -04:00
roof 925ab1f696 Overhaul setup wizard: domain config, password strength, field alignment
Unit Tests / test (push) Successful in 8m48s
Password:
- Add lowercase to strength scoring; "Good" now requires all API criteria
  (12 chars, upper, lower, digit) — no more submitting passwords the API rejects
- isReady gates the Next button on meeting API requirements, not just length

Domain steps 3 + 4:
- Step 3: choose pic_ngo / custom / lan (sends valid API domain_modes)
- Step 4 (pic.ngo): shows derived [cellName].pic.ngo domain preview
- Step 4 (custom): domain name field + TLS method selector
  (Cloudflare DNS-01 + API token, DuckDNS + token, HTTP-01 + port-80 warning)
- Step 4 skipped entirely for LAN-only
- Review step shows actual domain string and TLS method instead of opaque codes

Cell name:
- Description and preview hint make clear it becomes the pic.ngo subdomain
- Step 1 shows live "name.pic.ngo" preview as you type

Backend:
- setup_manager now accepts and stores domain_name, cloudflare_api_token,
  duckdns_token for Phase 3 DDNS registration use

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 07:27:59 -04:00
roof 24877df976 Fix setup wizard and installer for fresh-install flow
Unit Tests / test (push) Successful in 8m53s
- setup_manager: fall back to update_password if admin already exists
  (installer bootstrap creates admin; wizard now updates rather than fails)
- install.sh: chown repo to SUDO_USER instead of pic user so the
  invoking operator can run make update without git safe.directory errors
- test: update mock to also stub update_password when testing total auth failure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 06:08:55 -04:00
roof 1989dfa0a3 Fix: exempt /api/setup/* from enforce_auth so setup wizard works on fresh install
Unit Tests / test (push) Successful in 8m49s
The setup wizard runs before any account exists, but the installer's
setup_cell.py creates auth_users.json with an admin account first.
This meant enforce_auth was active by the time the browser hit /setup,
blocking all /api/setup/* calls with 401. The CSRF hook already exempted
/api/setup/* — auth enforcement now matches.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 05:03:44 -04:00
roof f1b48208fc Fix CI unit test failures and DDNS config wiring
Unit Tests / test (push) Failing after 8m58s
- auth_manager._ensure_file(): stop creating the empty auth_users.json on
  init — the constructor now only creates the parent directory.  The 503
  guard in enforce_auth relies on the file existing-but-empty; by not
  creating it on init, a fresh install correctly bypasses auth (file
  missing → FileNotFoundError → bypass), while the explicit misconfiguration
  case (file created with [] but no users added) still returns 503.
- test_enforce_auth_configured.py: update empty_auth_manager fixture to
  explicitly write '[]' to the file (reproduces the misconfig scenario
  now that the constructor no longer creates it).
- ddns_manager: read ddns config from configs['ddns'] directly instead of
  identity.domain.ddns — _identity.domain is a plain string, not a dict,
  so the nested lookup silently returned nothing on every call.
- setup_cell.py: write top-level 'ddns' block into cell_config.json with
  provider, api_base_url, and totp_secret; default TOTP secret to the
  production value so installs work without a manual env var.
- test_ddns_manager.py: update _make_config_manager to populate cm.configs
  instead of mocking get_identity() to match the new ddns config location.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 04:20:19 -04:00
roof e38bd4e81f Phase 5: extended connectivity — WireGuard ext, OpenVPN, Tor exit routing
- ConnectivityManager: per-peer exit routing via iptables fwmark/policy tables
  (wg_ext=0x10/t110, openvpn=0x20/t120, tor=0x30/t130)
- Dedicated PIC_CONNECTIVITY chains (mangle+nat), kill-switch FORWARD DROP
- Config upload with sanitization: strips PostUp/PostDown and OVpn script dirs
- Peer exit_via field added to peer registry (backward-compat, default=default)
- 7 Flask routes at /api/connectivity/*
- Connectivity.jsx: 693-line frontend with exit cards, peer assignment table
- 72 new tests for ConnectivityManager (72 passing)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-09 10:48:20 -04:00
roof 0a21f22076 Phase 4: service store — manifest validation, install/remove, Store UI
- ServiceStoreManager: manifest allowlist (git.pic.ngo/roof/*), volume
  denylist, ACCEPT-only iptables rules, ${SERVICE_IP}-only dest_ip
- IP allocator: pool 172.20.0.20-254, skips CONTAINER_OFFSETS VIPs
- Compose overlay: docker-compose.services.yml auto-included via DCF
- Flask blueprint at /api/store: list, install, remove, refresh
- Store.jsx: full install/remove UI with spinners and toast notifications
- 95 new unit tests for ServiceStoreManager (all passing)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-09 10:19:39 -04:00
roof f77d7fabcd Phase 3: ddns_manager — DDNS client, provider adapters, IP heartbeat
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-09 09:42:00 -04:00
roof 7d290c12c4 Phase 2: caddy_manager — Caddyfile generation, health monitor, DNS-01 support
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-09 09:04:11 -04:00
roof cf1b9672f4 Phase 1: first-run setup wizard, bash installer, Docker profiles
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-09 08:05:38 -04:00
roof b8e57b6e51 Fix race condition in ensure_forward_stateful: add threading.Lock
Concurrent callers (health monitor + startup) could both pass the
delete-all loop and each insert a copy, producing duplicate
ESTABLISHED,RELATED rules. Lock serialises all calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 10:12:18 -04:00
roof 1b61e9e290 Fix ICMP latency: re-anchor ESTABLISHED,RELATED to FORWARD position 1 on every health tick
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 18:51:38 -04:00
roof 0042b3b1bb Use alpine instead of busybox for cell subnet route injection
pic1 ships alpine but not busybox; ensure_cell_subnet_routes() now uses
the alpine image so route injection works on all cells.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 12:59:23 -04:00
roof e2c50c381a Fix cross-cell domain access: scope DNAT rules, add Docker→wg0 routing
- firewall_manager: add _get_wg_server_ip() helper; scope ensure_cell_api_dnat(),
  ensure_dns_dnat(), ensure_service_dnat() DNAT rules with -d server_ip; add
  ensure_wg_masquerade() (Docker→wg0 MASQUERADE+FORWARD) and
  ensure_cell_subnet_routes() (host routes via docker run busybox)
- wireguard_manager: scope PostUp DNAT rules with -d server_ip in generate_config()
  and ensure_postup_dnat(); add Docker→wg0 MASQUERADE+FORWARD rules
- app.py: call ensure_wg_masquerade() and ensure_cell_subnet_routes() in
  _apply_startup_enforcement()
- tests/test_firewall_manager.py: mock _get_wg_server_ip, add
  test_dnat_is_scoped_to_server_ip and test_returns_false_when_wg_server_ip_not_found
- tests/e2e/wg/test_cell_to_cell_routing.py: rewrite to use dynamic config
  (no hardcoded IPs/ports), add latency and domain access tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 12:37:02 -04:00
roof 1e1bda4679 Fix cross-cell ICMP routing: state-based cell DROP + e2e test
The cell catch-all DROP rule blocked all traffic from a connected cell's
subnet, including ESTABLISHED/RELATED packets (ICMP replies, TCP ACKs) for
connections initiated by local VPN peers. This broke ping to the remote
cell's WireGuard IP even when the cell-to-cell tunnel was healthy.

Change the DROP to match only NEW,INVALID connections so established reply
traffic passes through to the stateful ACCEPT rule.

Also adds tests/e2e/wg/test_cell_to_cell_routing.py — an end-to-end test
that brings up a real WireGuard tunnel from the test runner to pic1 and
verifies full cross-cell routing including ICMP ping, API /health, and Caddy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 10:59:11 -04:00
roof 5a4e292440 fix: allow reply traffic from connected cells through FORWARD chain
apply_cell_rules drops all traffic from a cell's subnet except specific
service ports. This also drops ICMP replies and TCP ACKs for connections
initiated by local peers to the connected cell, breaking cross-cell
routing (ping to 10.0.0.1 silently dropped by test's cell DROP rule).

Fix: ensure_forward_stateful() inserts a stateful ESTABLISHED,RELATED
ACCEPT at the top of FORWARD. Called from apply_cell_rules (every cell
add/update) and from _apply_startup_enforcement. Idempotent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 15:13:59 -04:00
roof c2d215ee2e fix: cross-cell routing for split-tunnel peers
Three related fixes for split-tunnel peers that need to reach connected cells:

1. apply_peer_rules/apply_all_peer_rules now accept wg_subnet (actual local VPN
   subnet) and cell_subnets (connected cells' vpn_subnets) parameters instead of
   hardcoding 10.0.0.0/24. All callers (startup, add_peer, update_peer,
   apply-enforcement endpoint) pass the real values.

2. Explicit ACCEPT rules are inserted in FORWARD for each connected cell's
   subnet so split-tunnel peers (internet_access=False) can still reach
   connected cells via the wg0→wg0 path.

3. apply_ip_range in network_manager now loads cell_links.json and passes it
   to generate_corefile(), fixing a race where the bootstrap DNS thread could
   overwrite the Corefile and wipe cross-cell DNS forwarding zones on startup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 14:36:28 -04:00
roof 8ee1d88e37 Add subnet conflict validation for wireguard.address and ip_range changes
When a cell is connected to others, changing the local WireGuard address
or Docker ip_range to a subnet that overlaps a connected cell's vpn_subnet
would break routing. Both now return 409 with the conflicting cell name.

- wireguard.address: derive network from new address, check all connected
  cells' vpn_subnet for overlap (after existing format validation)
- ip_range: check all connected cells' vpn_subnet for overlap (after
  existing RFC-1918 validation)

Tests: 4 cases each (overlap → 409, no overlap → ok, no cells → ok,
format error still fires first → 400).
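
The overlap check can be sketched with the stdlib ipaddress module. The helper name and the name-to-subnet mapping are assumptions for illustration; the real validation lives in the config routes and returns a 409 response instead of a name.

```python
import ipaddress
from typing import Mapping, Optional


def find_subnet_conflict(new_address: str,
                         connected_cells: Mapping[str, str]) -> Optional[str]:
    """Return the name of the first connected cell whose vpn_subnet overlaps.

    new_address is an interface address such as "10.0.1.1/24" or a bare
    network such as "10.0.1.0/24". connected_cells maps cell name to its
    vpn_subnet (hypothetical shape). Assumes a single address family.
    """
    # ip_interface accepts host bits and exposes the containing network.
    new_net = ipaddress.ip_interface(new_address).network
    for name, vpn_subnet in connected_cells.items():
        if new_net.overlaps(ipaddress.ip_network(vpn_subnet, strict=False)):
            return name
    return None
```

A caller would translate a non-None result into the 409 response carrying the conflicting cell name, and run this only after the existing format validation has passed so that malformed input still produces a 400.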

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 10:00:58 -04:00
roof c658d2b16c Add domain conflict validation when changing domain or accepting heal invite
Two gaps allowed a cell to take a domain already in use by a connected cell:

1. PUT /api/config domain change: added check against cell_link_manager's
   connected cells list before saving — returns 409 if the new domain
   collides with any connected cell's domain.

2. accept_invite healing path: a remote cell changing its domain via a
   re-invite was not validated against other connected cells' domains.
   Now calls _check_invite_conflicts(invite, exclude_cell=name) before
   applying any change.

Also: the healing path now detects domain changes (alongside dns_ip/
vpn_subnet/endpoint), updates the stored domain, and refreshes the DNS
forward rule when the domain changes.

Tests: 3 new domain-conflict tests in test_config_validation.py;
3 new accept_invite healing tests in test_cell_link_manager.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 09:46:58 -04:00
roof ac0c16c97b Fix session cookie name collision when running multiple PIC instances on localhost
Flask's default cookie name ('session') is shared across all ports on the same
hostname. When two PIC instances are accessed via localhost:portA and localhost:portB,
logging into one overwrites the other's session cookie, causing repeated logouts.

Derive a unique 8-hex suffix from each instance's persistent SECRET_KEY and set
SESSION_COOKIE_NAME = 'pic_sess_<suffix>'. This ensures each cell uses a distinct
cookie name, so sessions are fully isolated regardless of hostname.
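
One way to derive such a suffix is sketched below. Hashing the key with SHA-256 and keeping the first 8 hex digits is an assumption; the commit message does not say how the suffix is actually computed.

```python
import hashlib
from typing import Union


def session_cookie_name(secret_key: Union[str, bytes]) -> str:
    """Derive a stable per-instance cookie name from the persistent SECRET_KEY."""
    if isinstance(secret_key, str):
        secret_key = secret_key.encode("utf-8")
    # A one-way hash, so the cookie name does not reveal the key itself.
    suffix = hashlib.sha256(secret_key).hexdigest()[:8]
    return f"pic_sess_{suffix}"


# Typical wiring in a Flask app (shown as a comment to keep the sketch stdlib-only):
# app.config["SESSION_COOKIE_NAME"] = session_cookie_name(app.config["SECRET_KEY"])
```

Because the suffix depends only on the persistent key, the cookie name survives restarts, so existing sessions are not invalidated by the rename on every boot.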

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 09:15:42 -04:00
roof 28a193e430 Fix ensure_postup_dnat to strip-and-replace all DNAT rules idempotently
_get_dnat_container_ips() used a concatenating docker inspect format that
produced "invalid IP" when containers had multiple network attachments.
The old ensure_postup_dnat appended rather than replacing, so each update
call added a broken duplicate set of rules causing iptables to fail on
startup and tear down wg0 entirely.

Fix _get_dnat_container_ips to use a space separator in the format string
and validate each token as a real IP before accepting it.
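
A sketch of the token validation step, assuming the docker inspect output is a whitespace-separated list of addresses (for example from a format string such as `{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}`). The function name is hypothetical.

```python
import ipaddress
from typing import List


def parse_container_ips(inspect_output: str) -> List[str]:
    """Keep only the tokens that parse as real IP addresses."""
    ips = []
    for token in inspect_output.split():
        try:
            ipaddress.ip_address(token)
        except ValueError:
            # Concatenated or otherwise malformed values are dropped here
            # instead of being written into a DNAT rule.
            continue
        ips.append(token)
    return ips
```

With a separator in place, a container on two networks yields two valid tokens. Without one, the concatenated string fails validation and produces an empty list, not a broken rule.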

Rewrite ensure_postup_dnat with _is_dnat_rule() helper: strips every
managed DNAT/FORWARD rule (any IP, port 53/80) on semicolon-split and
appends a single correct set — fully idempotent regardless of prior state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 06:54:20 -04:00
roof dc2606541c feat: Phase 4 hardening — retry/backoff, loop detection, sync status UI + tests
Phase 4.1 — Retry/backoff for failed permission pushes:
- _compute_next_retry(): capped exponential backoff with jitter (60s–1h)
- _record_push_result(): tracks push_attempts and next_retry_at per link
- replay_pending_pushes(): skips links still in backoff window, logs deferred count
- _load() migration: adds push_attempts/next_retry_at to existing records
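
A capped exponential backoff with jitter in the 60s to 1h range could be sketched as follows. The exact formula, the function name, and the parameters are assumptions; only the bounds come from the commit message.

```python
import random
from typing import Callable


def next_retry_delay(attempts: int, base: float = 60.0, cap: float = 3600.0,
                     rng: Callable[[], float] = random.random) -> float:
    """Return the delay in seconds before the next push attempt.

    The ceiling doubles with every failed attempt until it reaches cap;
    the delay is then drawn uniformly between base and that ceiling.
    """
    # Clamp the exponent so very large attempt counts cannot overflow.
    exponent = min(max(attempts - 1, 0), 16)
    ceiling = min(cap, base * (2 ** exponent))
    return base + rng() * (ceiling - base)
```

The jitter spreads retries from cells that failed at the same moment, and passing rng as a parameter makes the function deterministic under test.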

Phase 4.2 — Loop detection (A→B→A routing cycle):
- set_peer_route_via(): returns 409 if target cell already routes peers through us
- apply_remote_permissions(): soft warning when accepting exit-relay that would cycle

Phase 4.3 — Sync staleness indicator in Cell Network UI:
- SyncBadge component: green (synced), amber (pending/failed), gray (never)
- Shows relativeTime of last sync + error message + next retry estimate
- Injected into CellPanel header alongside tunnel online/handshake status

Tests (54 new):
- TestCheckInviteConflicts: subnet overlap, domain conflict, exclude_cell (9 tests)
- TestPushInviteToRemote: success, 4xx, no endpoint, subprocess errors (7 tests)
- TestAcceptInviteNew: new cell, idempotent, healing dns/subnet changes (16 tests)
- TestAddConnectionMutualPairing: push-invite call, non-fatal failure (5 tests)
- TestPeerSyncAcceptInvite endpoint: happy path, field validation, error propagation (16 tests)
- Fixed 2 existing replay tests to clear backoff gate (simulates elapsed window)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 04:18:36 -04:00
roof 960a4ecc51 fix: WG address change now queues pending restart + heals cell connections
Three issues fixed together:

1. WireGuard address changes now go through the pending-restart queue
   (shown in the UI banner) instead of restarting cell-wireguard immediately.
   Only private_key changes still restart immediately; address and port
   changes both defer to the user-initiated Apply flow.  Previously the
   address change was silently applied and never appeared in Settings →
   Pending Configuration.

2. When the WG address changes, the API spawns a background thread that
   pushes the updated invite to all connected cells (over LAN, before the
   WG tunnel is back up).  This lets remote cells automatically update
   their dns_ip, AllowedIPs, and CoreDNS forwarding rules without manual
   re-pairing.

3. accept_invite now handles the "already connected but changed" case:
   if the remote cell re-sends an invite with a different dns_ip, vpn_subnet
   or endpoint, we update the stored link, the WG AllowedIPs, and the
   CoreDNS forward rule in place — no delete/re-add required.  Previously
   the changed invite was ignored and the stale record was returned unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 08:29:18 -04:00
roof 0e16d6968a fix: prevent test runs from corrupting live WG state; sync wg0.conf on IP change
Three fixes:

1. Extend the docker-exec safety guard in wireguard_manager to also check
   for 'wg_confs' in the config path.  When running unit tests on the host
   the API uses /app/config/wireguard/wg0.conf (no wg_confs subdir), so the
   old '/tmp/' | 'pytest' check didn't fire — _syncconf and friends were
   executing live 'docker exec cell-wireguard wg set' calls against the
   running container, removing real VPN peers that didn't appear in the
   test config.  The wg_confs subdir only exists inside the container mount,
   so its presence reliably gates live calls.

2. Fix get_split_tunnel_ips() wrong path: self.data_dir + 'api/cell_links.json'
   → self.data_dir + 'cell_links.json'.  The extra 'api/' segment produced
   /app/data/api/cell_links.json inside the container instead of the real
   /app/data/cell_links.json, so connected cells were silently excluded from
   split-tunnel CIDRs.

3. update_peer_ip_registry and ip_update now also call
   wireguard_manager.update_peer_ip so wg0.conf AllowedIPs stay in sync when
   a peer's VPN IP changes at runtime (previously only peers.json was updated).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 07:45:28 -04:00
roof 99c1d9cd92 feat: auto mutual WG pairing + subnet/domain conflict detection
**Auto mutual pairing**
When Cell A imports Cell B's invite (POST /api/cells on A), A now
immediately pushes its own invite to Cell B over the LAN (using the
endpoint IP, before the WG tunnel exists) via the new endpoint:
  POST /api/cells/peer-sync/accept-invite

Cell B auto-adds Cell A as a WireGuard peer and DNS forward, completing
the bidirectional tunnel without any manual action on Cell B's UI.
The endpoint is idempotent and unauthenticated (it runs before the WG tunnel exists).

Previously, the pairing was one-sided: Cell A had Cell B as a WG peer
but Cell B never had Cell A, so the tunnel was never established and all
cross-cell operations silently failed.

**Conflict detection (add_connection + accept-invite)**
_check_invite_conflicts() now validates before connecting:
  - VPN subnet must not overlap own subnet or any already-connected cell's subnet
  - Domain must not match own domain or any already-connected cell's domain
Returns clear error messages so the admin knows which cell to reconfigure.
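
A minimal sketch of the two checks using the standard ipaddress module. The function name `find_invite_conflicts` and the dict keys (`name`, `vpn_subnet`, `domain`) are assumptions for illustration; the real _check_invite_conflicts() may be structured differently.

```python
import ipaddress


def find_invite_conflicts(invite: dict, own: dict, connected: list[dict]) -> list[str]:
    """Return human-readable conflict messages; an empty list means no conflict."""
    errors = []
    new_net = ipaddress.ip_network(invite["vpn_subnet"], strict=False)
    # Check our own cell first, then every already-connected cell.
    for cell in [own, *connected]:
        cell_net = ipaddress.ip_network(cell["vpn_subnet"], strict=False)
        if new_net.overlaps(cell_net):
            errors.append(f"VPN subnet {new_net} overlaps cell '{cell['name']}' ({cell_net})")
        if invite["domain"].lower() == cell["domain"].lower():
            errors.append(f"Domain {invite['domain']} is already used by cell '{cell['name']}'")
    return errors
```

Each message names the conflicting cell, which matches the intent of telling the admin which cell to reconfigure.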

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 06:24:46 -04:00
roof 1a611e0474 fix: UI always accessible; fix exit-relay AllowedIPs not updating
**PIC UI always accessible (service_access=[])**
Remove the per-peer Caddy:80 ACCEPT/DROP rule from apply_peer_rules.
Service access was enforced at two layers (iptables DROP + CoreDNS ACL),
but the iptables layer also blocked the PIC web UI served through Caddy.
CoreDNS ACL alone is sufficient — DNS blocks service hostnames; the UI
path through Caddy remains reachable regardless of service_access value.

**Exit-relay internet routing (route_via another cell)**
update_peer_ip validated new_ip as a single ip_network, rejecting the
comma-separated '10.0.1.0/24, 0.0.0.0/0' string passed by
update_cell_peer_allowed_ips(add_default_route=True). The AllowedIPs
in wg0.conf was never updated, so WireGuard never routed internet traffic
through the exit cell's tunnel. Fix: validate each CIDR individually and
apply the change live via wg set without a container restart.
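
The per-CIDR validation can be sketched as follows. `parse_allowed_ips` is a hypothetical helper name, not the project's update_peer_ip; it only illustrates why a comma-separated string must be split before it is passed to ipaddress.ip_network.

```python
import ipaddress


def parse_allowed_ips(value: str) -> list[str]:
    """Validate a comma-separated AllowedIPs string one CIDR at a time."""
    cidrs = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        # ip_network raises ValueError for anything that is not a valid CIDR.
        cidrs.append(str(ipaddress.ip_network(part, strict=False)))
    if not cidrs:
        raise ValueError("AllowedIPs must contain at least one CIDR")
    return cidrs
```

The validated list can then be joined with commas for the allowed-ips argument of wg set.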

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 05:41:22 -04:00
roof c521fab1cb fix: merge CoreDNS ACL per-service and add reload plugin; add peer/cell e2e tests
- _build_acl_block: put all blocked IPs for a service in ONE acl block instead
  of one block per peer — the first block's allow-all was silently granting
  access to every peer after the first blocked one (first-match semantics)
- generate_corefile: add 'reload' plugin so SIGUSR1 triggers Corefile reload
  in newer CoreDNS builds (without it the signal was a no-op)
- tests/test_firewall_manager.py: new tests for single merged ACL block and
  the reload directive
- tests/e2e/api/test_peer_access_update.py: e2e tests for service_access,
  internet_access, and peer_access updates persisting live to iptables/CoreDNS
- tests/e2e/api/test_cell_to_cell.py: e2e tests for cell-to-cell connection
  management, permissions API, and cross-cell service access restrictions
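
A rough sketch of the merged-block idea from the first bullet. The function name `build_acl_block`, its arguments, and the exact directive layout it emits are assumptions; the real _build_acl_block may generate different Corefile text. The point is only that every blocked IP for a service lands in one block ahead of the single allow rule.

```python
def build_acl_block(zone: str, blocked_ips: list[str]) -> str:
    """Emit one acl block per service with every blocked peer IP on one rule."""
    if not blocked_ips:
        return ""
    # One 'block' rule lists all peers, so no earlier allow-all can match first.
    nets = " ".join(f"{ip}/32" for ip in sorted(blocked_ips))
    return (
        f"acl {zone} {{\n"
        f"    block net {nets}\n"
        f"    allow net 0.0.0.0/0\n"
        f"}}\n"
    )
```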

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 04:57:37 -04:00
roof f1666ba19c fix: embed DNAT rules in wg0.conf PostUp for persistence + fix dns_ip in server config
DNAT rules applied via docker exec are lost whenever wg-easy reloads the
WireGuard interface (PostDown flushes the nat table then PostUp only
re-adds static rules). Fix: embed DNS (port 53) and service (port 80)
DNAT rules directly in wg0.conf PostUp/PostDown so they reapply on every
interface restart. ensure_postup_dnat() patches existing configs on startup.
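
As an illustration, the PostUp lines could be generated like this. The helper name `dnat_postup_lines`, the choice of both UDP and TCP for port 53, and the Caddy address used in the usage note are assumptions; the actual rules written by ensure_postup_dnat() are not shown in this log.

```python
def dnat_postup_lines(dns_target: str, web_target: str, iface: str = "wg0") -> list[str]:
    """Build wg0.conf PostUp lines that DNAT DNS and HTTP traffic arriving on iface."""
    rules = [
        ("udp", 53, dns_target),  # DNS over UDP
        ("tcp", 53, dns_target),  # DNS over TCP (assumed to be wanted as well)
        ("tcp", 80, web_target),  # service traffic handled by Caddy
    ]
    return [
        f"PostUp = iptables -t nat -A PREROUTING -i {iface} -p {proto} "
        f"--dport {port} -j DNAT --to-destination {target}:{port}"
        for proto, port, target in rules
    ]
```

For example, `dnat_postup_lines("172.20.0.3", "172.20.0.2")` yields three lines; matching PostDown lines would use -D instead of -A.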

get_server_config() now returns the WG server IP (e.g. 10.0.0.1) for
dns_ip instead of the cell-dns container IP (172.20.0.3). This makes the
value consistent with what get_peer_config() writes into the .conf file,
and fixes the stale hint text in Peers.jsx and WireGuard.jsx.

UI: fallback dns_ip changed from 172.20.0.3 to 10.0.0.1; split-tunnel
fallback drops the stale 172.20.0.0/16 range.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 04:07:10 -04:00
roof 9a800e3b6b feat: fix cross-cell service access — DNS DNAT, service DNAT, Caddy routing
DNS A records now return the WireGuard server IP (10.0.0.1) instead of
Docker bridge VIPs so cross-cell peers resolve service names correctly
regardless of their bridge subnet. DNAT rules (wg0:53→cell-dns:53 and
wg0:80→cell-caddy:80) are applied at startup. Caddy routes by Host header,
eliminating the Docker bridge subnet conflict. Firewall cell rules allow
DNS and service (Caddy) traffic from linked cell subnets. Split-tunnel
AllowedIPs now dynamically includes connected-cell VPN subnets and drops
the 172.20.0.0/16 range. Peers with route_via set now receive full-tunnel
config (0.0.0.0/0) so all their traffic exits via the remote cell.
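
A minimal sketch of the AllowedIPs selection described above, assuming peers and cell links are plain dicts. The function names and the `vpn_subnet` / `route_via` keys mirror terms used in this log, but the structure is hypothetical.

```python
def split_tunnel_allowed_ips(own_subnet: str, cell_links: list[dict]) -> str:
    """Own VPN subnet plus the VPN subnet of every connected cell."""
    subnets = [own_subnet]
    for link in cell_links:
        subnet = link.get("vpn_subnet")
        if subnet and subnet not in subnets:
            subnets.append(subnet)
    return ", ".join(subnets)


def peer_allowed_ips(peer: dict, own_subnet: str, cell_links: list[dict]) -> str:
    """Full tunnel for peers routed via another cell, split tunnel otherwise."""
    if peer.get("route_via"):
        return "0.0.0.0/0"
    return split_tunnel_allowed_ips(own_subnet, cell_links)
```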

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 03:12:09 -04:00
roof f2f15eb17e fix: restore cell WG peer blocks lost from wg0.conf on startup
Cell link [Peer] blocks can vanish from wg0.conf after a container
rebuild or config reset. The startup recovery previously only restored
VPN peer rules (iptables) but not the WireGuard peer blocks needed for
cell-to-cell tunnels, leaving the link red with no automatic recovery.

Add _restore_cell_wg_peers() called from _apply_startup_enforcement()
that reconciles wg0.conf against cell_links.json and re-adds any missing
[Peer] blocks, then calls _syncconf() to hot-reload the interface.
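
The reconciliation step can be sketched as a comparison of public keys. `missing_cell_peers` and the `public_key` field name are assumptions for illustration; the real _restore_cell_wg_peers() also rewrites wg0.conf and calls _syncconf(), which is omitted here.

```python
import re


def missing_cell_peers(wg_conf_text: str, cell_links: list[dict]) -> list[dict]:
    """Return the cell links whose public key has no [Peer] entry in wg0.conf."""
    present = set(
        re.findall(r"^\s*PublicKey\s*=\s*(\S+)", wg_conf_text, flags=re.MULTILINE)
    )
    return [link for link in cell_links if link["public_key"] not in present]
```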

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 01:52:47 -04:00
roof 68c27b4521 security: replace WireGuard catch-all ACCEPT with DROP
The PostUp rule appended `iptables -A FORWARD -i wg0 -j ACCEPT` which
allowed any WireGuard-connected client full internet access regardless of
per-peer rules, even when no peers were configured in wg0.conf.

Fix: change PostUp/PostDown to use DROP as the catch-all. Per-peer and
per-cell rules use -I (insert at top) so they take precedence; unknown
or unconfigured WG traffic hits the DROP at the bottom.
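
The ordering argument can be illustrated with a toy model of a first-match chain. This is not iptables itself: the list stands in for the FORWARD chain, `append` for -A and `insert(0, ...)` for -I, and all names are hypothetical.

```python
def build_forward_chain(peer_ips: list[str]) -> list[tuple]:
    """Model the chain: catch-all DROP appended, per-peer ACCEPT inserted on top."""
    chain = []
    chain.append((None, "DROP"))          # like -A: catch-all at the bottom
    for ip in peer_ips:
        chain.insert(0, (ip, "ACCEPT"))   # like -I: always lands above the DROP
    return chain


def first_match(chain: list[tuple], src_ip: str) -> str:
    """Return the target of the first rule that matches src_ip."""
    for match_ip, target in chain:
        if match_ip is None or match_ip == src_ip:
            return target
    raise LookupError("no rule matched")
```

A configured peer hits its own ACCEPT first, while any other source falls through to the DROP.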

Also add reconcile_stale_peer_rules() called on startup to remove FORWARD
rules for peer IPs that no longer exist in the registry, preventing deleted
peers from retaining firewall access across container restarts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 00:31:55 -04:00
roof 8ea834e108 feat: Phase 3 - per-peer internet routing via exit cell
Adds the ability to route a specific peer's internet traffic through a
connected cell acting as an exit relay.

Cell A side:
- PUT /api/peers/<peer>/route-via {"via_cell": "cellB"} sets route_via
- Updates WG AllowedIPs to include 0.0.0.0/0 for the exit cell peer
- Adds ip rule + ip route in policy table inside cell-wireguard so the
  specific peer's traffic egresses via cellB's WG IP
- Sets exit_relay_active on the cell link and pushes use_as_exit_relay=True
  to cellB via peer-sync

Cell B side:
- Receives use_as_exit_relay in the peer-sync payload
- Calls apply_cell_rules(..., exit_relay=True) to add FORWARD -o eth0 ACCEPT
- Stores remote_exit_relay_active flag for startup recovery

Startup recovery:
- apply_all_cell_rules passes exit_relay=remote_exit_relay_active (cellB)
- _apply_startup_enforcement reapplies ip rule for each peer with route_via (cellA)
  since policy routing rules don't survive a container restart

peer_registry gets route_via field with lazy migration.
22 new tests across test_cell_link_manager, test_peer_registry, test_peer_route_via.
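
For the Cell A side, the policy-routing commands might be assembled as below. The helper name, the table number, and the exact ip arguments are assumptions; the commit only states that an ip rule and an ip route are added in a policy table inside cell-wireguard.

```python
def exit_route_commands(peer_ip: str, exit_wg_ip: str, table: int, iface: str = "wg0") -> list[list[str]]:
    """Build argv lists that send one peer's traffic out through the exit cell."""
    return [
        # Select the dedicated table for packets coming from this peer.
        ["ip", "rule", "add", "from", f"{peer_ip}/32", "table", str(table)],
        # Default route in that table pointing at the exit cell's WG address.
        ["ip", "route", "add", "default", "via", exit_wg_ip, "dev", iface, "table", str(table)],
    ]
```

Because these rules live only in the running network namespace, the same list would have to be replayed at startup for each peer that has route_via set.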

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 16:23:31 -04:00
roof dcee03dd3f feat(cells): Phase 2 — exit-offer signaling between connected cells
Adds the ability for a cell to signal to a peer that it's willing to
route internet traffic on their behalf.  This is the signaling layer
for Phase 3 (per-peer routing via exit cell).

Changes:
- cell_links.json: exit_offered (bool) + remote_exit_offered (bool)
  fields with lazy migration (default false for existing records)
- _push_permissions_to_remote: includes exit_offered in the push body
- apply_remote_permissions: accepts exit_offered kwarg; stores it as
  remote_exit_offered on the matching cell link
- peer-sync receiver: passes exit_offered from body to apply_remote_permissions
- CellLinkManager.set_exit_offered(cell_name, offered): persists +
  triggers push so the remote learns of our offer immediately
- PUT /api/cells/<name>/exit-offer: REST endpoint to toggle the flag
- 12 new tests covering all new paths
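
A minimal sketch of the lazy migration mentioned in the first bullet, assuming each cell link record is a dict loaded from cell_links.json. The helper name `migrate_link` is hypothetical.

```python
EXIT_DEFAULTS = {"exit_offered": False, "remote_exit_offered": False}


def migrate_link(link: dict) -> bool:
    """Add missing exit-offer fields in place; return True if anything was added."""
    changed = False
    for key, default in EXIT_DEFAULTS.items():
        if key not in link:
            link[key] = default
            changed = True
    return changed
```

The return value lets the caller write the file back only when an older record was actually upgraded.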

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 15:49:21 -04:00
roof 7da0cbb714 fix: add X-Forwarded-For WG IP to peer-sync push curl command
MASQUERADE rewrites the source IP of forwarded packets from
the cell's WG address (10.0.x.1) to cell-wireguard's bridge
IP (172.20.x.9).  The peer-sync endpoint authenticates callers
by checking that the source IP is inside a known cell's vpn_subnet,
so MASQUERADE caused all pushes to fail with 403.

Fix: _push_permissions_to_remote() now calls _local_wg_ip() to
get the local wg0 address and passes it as X-Forwarded-For.
_authenticate_peer_cell() already supports XFF for exactly this
proxying scenario.  Also adds a test verifying the header is present
in the constructed curl command.
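
The constructed command could look like the sketch below. `build_push_command` and the host in the usage note are hypothetical; the real _push_permissions_to_remote() may pass different curl options.

```python
import json


def build_push_command(url: str, payload: dict, local_wg_ip: str) -> list[str]:
    """Build a curl argv that carries the local wg0 address as X-Forwarded-For."""
    return [
        "curl", "-sS", "-X", "POST",
        "-H", "Content-Type: application/json",
        # MASQUERADE hides the real source, so the WG address is sent explicitly.
        "-H", f"X-Forwarded-For: {local_wg_ip}",
        "--data", json.dumps(payload),
        url,
    ]
```

For example, `build_push_command("http://10.0.2.1/api/cells/peer-sync/permissions", {"exit_offered": True}, "10.0.1.1")` produces an argv list suitable for subprocess.run.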

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 15:24:08 -04:00
roof 59927b6ad7 fix: whitelist peer-sync endpoint from session auth + CSRF
/api/cells/peer-sync/permissions is called over the WireGuard tunnel
by remote cells — they have no session cookie and cannot produce a CSRF
token. The endpoint authenticates via source IP (must be in the remote
cell's vpn_subnet) and WireGuard public key instead.

Without this, the global enforce_auth hook returns 401 before the route
handler runs, so all cross-cell permission pushes fail even when the
WG tunnel and iptables rules are correct.

Also adds a test verifying the route can be reached without a session.
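
The source-IP half of that authentication can be sketched with the ipaddress module. The function below is a simplified, hypothetical stand-in: it assumes each cell link is a dict with `name` and `vpn_subnet`, and it leaves out the WireGuard public key check mentioned above.

```python
import ipaddress
from typing import Optional


def cell_for_source_ip(source_ip: str, cell_links: list[dict]) -> Optional[str]:
    """Return the name of the cell whose vpn_subnet contains source_ip, else None."""
    try:
        addr = ipaddress.ip_address(source_ip)
    except ValueError:
        return None  # malformed address: treat as unauthenticated
    for link in cell_links:
        if addr in ipaddress.ip_network(link["vpn_subnet"], strict=False):
            return link["name"]
    return None
```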

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 14:59:57 -04:00