Commit Graph

40 Commits

Author SHA1 Message Date
roof 3d750ed1e8 Fix DDNS security and reliability gaps (#2, #3, #5, #6, #7)
Unit Tests / test (push) Successful in 7m23s
- Fix #2: Move DDNS bearer token from cell_config.json to data/api/ddns_token.
  Token is now in the secrets store (data/) rather than the config store (config/).
  Auto-migrates existing installs on first access. ConfigManager.get/set_ddns_token()
  added. set_ddns_config() now strips 'token' key to prevent it leaking back.

- Fix #3: Set Caddyfile permissions to 0o600 after write so the token embedded
  in the Caddyfile is not world-readable on the host filesystem.

- Fix #5: Heartbeat now fires IDENTITY_CHANGED after re-registration so Caddy
  regenerates its config with the new token automatically — users no longer need
  to click Re-register in Settings after a wizard registration failure.
  Also: heartbeat skips the 401-cycle when no token exists and goes straight to
  registration instead. DDNSManager now accepts service_bus= and is wired up.

- Fix #6: Settings page starts polling GET /api/caddy/cert-status every 15s
  after a successful DDNS re-registration and shows "Acquiring certificate…"
  feedback until Let's Encrypt issues the cert (up to 5 minutes).

- Fix #7: regenerate_with_installed() is debounced (5 s window) so two rapid
  IDENTITY_CHANGED events (e.g. wizard + heartbeat) can't start simultaneous
  ACME orders that interfere with each other.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 03:37:48 -04:00
roof 40f9d90fad feat: improve setup wizard and DDNS UX
Unit Tests / test (push) Successful in 7m29s
Setup wizard (Issue 1 — UI):
- pic.ngo subdomain input now uses the same split-field style as DuckDNS:
  input + static '.pic.ngo' suffix in a flex row, availability status below

Setup wizard (Issue 2 — Caddy not regenerating after completion):
- complete_setup route now fires IDENTITY_CHANGED after a successful wizard
  submission so CaddyManager regenerates the Caddyfile immediately; users
  no longer need to press 'Renew Certificate' to start ACME

Settings — DDNS status (Issue 2 — domain status missing):
- New GET /api/ddns/status endpoint: returns registered flag, domain_name,
  public_ip (ipify with 30s cache), last_ip from heartbeat
- Settings DDNS section for pic_ngo now shows a live status row with
  color-coded dot (green=registered+current, yellow=registered+stale,
  gray=not registered), current public IP, and a Check button
- Status auto-refreshes on mount and after each successful re-registration

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 00:36:47 -04:00
roof 76bbc2b67a fix: EmailManager route calls get_email_users not get_users
Unit Tests / test (push) Successful in 7m27s
The method is named get_email_users in EmailManager; the route was
calling the non-existent get_users, causing an AttributeError on every
GET /api/email/users request.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 10:12:24 -04:00
roof 69862331e7 fix: DDNS update token in body, webdav gating, regression tests
Unit Tests / test (push) Successful in 7m25s
- PicNgoDDNS.update(): send token in request body instead of Authorization
  header; DDNS server validates it from body (was returning HTTP 422 on
  every heartbeat, leaving IP record stale after fresh install)
- peers.py / Peers.jsx: webdav service_access only valid when 'files' store
  service is installed; was always shown even with no services, confusing
  users into thinking WebDAV was pre-installed
- 10 new regression tests: DDNS update body contract, Caddy always
  regenerates on startup with no services, peer role allowed on
  /api/services/active, webdav gating by installed services

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 16:56:12 -04:00
roof c696ca9ef6 fix: DNS split-horizon in DDNS mode, service access filter, health check, verbosity persistence
Unit Tests / test (push) Successful in 7m32s
- DNS (critical): add _configured_dns_params() that returns (primary_domain,
  split_horizon_zones) from config_manager so all apply_all_dns_rules() callers
  pass the correct primary zone (e.g. 'pic.ngo') and split-horizon list
  (e.g. ['pic1.pic.ngo']) instead of the FQDN as the primary — fixes
  DNS_PROBE_FINISHED_BAD_CONFIG for all external domains when on VPN

- firewall_manager: add split_horizon_zones param to apply_all_dns_rules()
  and forward it to generate_corefile()

- Peers: filter service_access list to installed services only; peers.py
  derives valid services from config_manager.get_installed_services() with
  the email→mail ID mapping; Peers.jsx fetches from /api/store/installed
  and filters the checkboxes and defaults accordingly

- Health check: fix file_manager→'files' ID mapping so files service health
  is checked when installed (was silently skipped due to 'file' vs 'files')

- Verbosity persistence: move log_levels.json from non-mounted
  /app/api/config/ to CONFIG_DIR (/app/config/) which maps to config/api/
  on the host; both load (managers.py) and save (routes/services.py) updated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 13:05:58 -04:00
roof 26576e1124 Fix: use domain_name (FQDN) in cell invite and conflict checks
Unit Tests / test (push) Successful in 7m39s
The GET /api/cells/invite endpoint was returning domain='pic.ngo' instead
of the full FQDN 'test5.pic.ngo' because it read _identity.domain rather
than _identity.domain_name.

Apply the same domain_name preference (domain_name || domain) to:
- routes/cells.py get_cell_invite() — the invite shown to connecting cells
- routes/cells.py update_cell_permissions() — Corefile DNS regeneration
- cell_link_manager.py _check_invite_conflicts() — incoming domain collision check
- cell_link_manager.py exchange_invites() — own invite construction

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 11:56:42 -04:00
roof f84f16fcd6 fix: add /api/network/dns/corefile endpoint and per-line iptables check
Unit Tests / test (push) Successful in 11m13s
The e2e tests were reading a stale Corefile at a hardcoded fallback path
(/home/roof/pic/config/dns/Corefile) instead of the live one written by
the API (/opt/pic/config/dns/Corefile on pic1). Adding a proper API
endpoint eliminates the path ambiguity.

The iptables test was checking whether peer_ip, DROP, and dpt:80 appeared
anywhere in the full multi-line output rather than on the same rule line,
producing false positives. Now checks per line.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 05:54:17 -04:00
roof eee0e800aa feat: add GET /api/peers/<peer_name> endpoint
Unit Tests / test (push) Successful in 11m19s
Allows fetching a single peer by name. E2E tests need this to verify
persisted peer state after PUT operations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 05:19:10 -04:00
roof 39c59fd3ef feat: WireGuard endpoint override + fix Docker network label issue
Unit Tests / test (push) Successful in 11m14s
Endpoint override:
- Add PUT /api/wireguard/endpoint to set endpoint_override in identity
  config; GET returns detected, override, and effective endpoints
- _effective_endpoint() helper applies override in peer config generation
  (wireguard.py and peer_dashboard.py); detected IP still shown in UI
- Add Endpoint Override input in WireGuard page — solves the common case
  where auto-detected IP is a gateway/VPS but peers connect via LAN IP

Docker cell-network fix:
- Declare cell-network external in docker-compose.yml; Docker Compose v5
  enforces label ownership and rejects networks created by older versions
- Makefile start/update pre-create cell-network idempotently
- reinstall/uninstall(full) explicitly delete and recreate the network
- Fix uninstall loop path: data/api/services/ (not data/services/)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 04:51:38 -04:00
roof 0267dce73d feat: HTTPS cert status, IDENTITY_CHANGED wiring, remove stale ip_utils Caddyfile writes
Unit Tests / test (push) Successful in 11m18s
- CaddyManager: add refresh_cert_status() and get_cert_status_fresh() that
  open a live TLS connection to cell-caddy:443 to read cert expiry; avoids
  needing a volume mount into the API container
- CaddyManager: periodic cert refresh in health_monitor_loop (every 60 cycles)
- config.py PUT /api/ddns: publish IDENTITY_CHANGED so CaddyManager regenerates
  the Caddyfile immediately after any domain/cell_name change — previously the
  event was never fired from this route
- config.py: remove all ip_utils.write_caddyfile() calls; CaddyManager is now
  the sole authority for Caddyfile generation
- app.py: add GET /api/caddy/cert-status route
- app.py: add GET /api/egress/status and PUT /api/egress/services/<id>/exit routes
- Settings.jsx: display cert status badge (valid/expired/internal/unknown) with
  expiry date and days-remaining in the domain section
- Tests: TestRefreshCertStatus (8 tests), TestDdnsConfigUpdatesFiresIdentityChanged,
  TestCaddyCertStatusRoute added; fix expired-cert helper to set not_valid_before
  relative to expiry so it's always earlier

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-05 11:39:36 -04:00
roof 41d09c598b wire: AccountManager HTTP dispatch + EgressManager startup + egress API routes
Unit Tests / test (push) Successful in 11m15s
- add_peer() now calls account_manager.provision() for any installed store
  service whose manifest declares accounts.manager == 'http', enabling
  per-peer credential provisioning to third-party HTTP services
- reapply_on_startup() calls egress_manager.apply_all() so fwmark rules
  survive container restarts without manual intervention
- add GET /api/egress/status and PUT /api/egress/services/<id>/exit routes
  so the UI can read and override per-service egress policy
- tests: HTTP provision wiring (happy path + non-fatal failure), egress
  apply_all at startup (wired/unwired/failure cases)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-05 10:30:41 -04:00
roof e87022dc55 fix: cell-network name, install error surfacing, health history cleanup
Unit Tests / test (push) Successful in 11m22s
- docker-compose.services.yml: change external network name from
  pic_cell-network to cell-network so store-service compose files can find
  it.  The project-prefixed name was overriding the explicit name: cell-network
  fix in docker-compose.yml when both files were merged by make start.

- service_store.py: normalize docker compose stderr into the error key in
  the 400 response so the Store page shows the actual failure reason instead
  of the generic fallback message.

- app.py: skip health checks for email/calendar/files managers when those
  optional store services are not installed — prevents false Down alerts and
  unnecessary noise in health history.

- Logs.jsx: remove Email/Calendar/Files columns from the health history table;
  they are optional store services, not core builtins that should always appear.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 14:28:46 -04:00
roof 44d7e96f29 feat: Phase 6 — require_active_service decorator + wizard install wiring
Email/calendar/files routes now return 404 when the service is not
installed, using a require_active_service decorator that checks
ServiceRegistry. Status endpoints are exempt so health checks always work.

SetupManager.complete_setup() now accepts a service_store_manager and
installs any wizard-selected services in a background daemon thread after
setup completes. Failures are logged but do not fail the wizard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 16:58:57 -04:00
roof a10fe11136 feat: Phase 4 — dynamic nav + service visibility based on installed services
Unit Tests / test (push) Successful in 11m24s
Email, calendar, and files no longer appear in the nav or as usable pages
unless they are installed. The nav refreshes whenever a service is installed
or removed via the new pic-services-changed CustomEvent.

Changes:
- routes/services.py: add GET /api/services/active endpoint
- api.js: add servicesAPI.listActive()
- App.jsx: replace hardcoded coreServiceChildren with dynamic state fetched
  from /api/services/active; SERVICE_META maps ids to nav entry shapes
- ServiceNotInstalledBanner.jsx: new component — admin gets catalog link,
  peer gets "contact admin" message
- EmailPage/CalendarPage/FilesPage: show banner when service not installed
- ServicesIndex.jsx: remove CoreServiceCard + CORE_SERVICES "Built-in"
  section; rename Remove → Uninstall; dispatch pic-services-changed on
  install/uninstall success
- MyServices.jsx: conditionally render service cards based on active list;
  placeholder card when absent; page-level notice when nothing is installed
- tests/test_services_active_endpoint.py: 4 new endpoint tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 12:15:02 -04:00
roof 16fb362df7 feat: replace hardcoded service names with ServiceRegistry-driven Caddy and CoreDNS config
Unit Tests / test (push) Failing after 11s
Previously, CaddyManager and NetworkManager contained hardcoded lists of
service names (calendar, files, mail, webdav, etc.), meaning every new
service required a code change to appear in Caddy routes and DNS records.
Now both managers accept a service_registry parameter and derive their
service lists dynamically from the registry at runtime.

- CaddyManager: new _build_registry_service_routes() and
  _http01_service_pairs() methods pull routes from the registry
- NetworkManager: new _get_service_subdomains() method returns registry
  subdomains with a hardcoded fallback when no registry is wired in;
  _build_dns_records, stale-record detection, and service name sets all
  use the registry
- managers.py: service_registry constructed before network_manager so it
  can be injected into both CaddyManager and NetworkManager
- service_registry.py: validation chokepoint in get_caddy_routes() rejects
  invalid subdomain/backend values and reserved service names
- service_store_manager.py: _validate_manifest now validates top-level
  subdomain, backend, extra_subdomains, and extra_backends fields
- tests: 24 new tests covering registry-driven routing and DNS subdomain
  generation (test_caddy_registry_integration.py)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 18:27:52 -04:00
roof 0afdee32da feat: Services UI — nested nav, per-service pages, settings migration
Rename Store → Services: ServicesIndex.jsx shows built-in core services
(Email, Calendar, Files) with Manage links, plus the existing add-on
store below.

New service sub-pages at /services/email|calendar|files serve both
admin and peer roles. Admins see connection info, service status, users
list, and an inline config form (port/data-dir). Peers see connection
info and their personal credentials fetched from peerAPI.

Navigation restructured: a Services parent item expands to show the
three sub-pages via a collapsible sidebar group (ChevronDown toggle).
Both admin and peer navigation include the Services group. Sidebar
extracted NavItem/NavList components to eliminate the duplicate mobile/
desktop rendering.

Settings.jsx drops EmailForm, CalendarForm, FilesForm and their
SERVICE_DEFS entries. Port conflict detection and per-service validation
logic extracted to utils/serviceConfig.js, shared by Settings and the
new service pages. Service form flushers are registered without cleanup
so the Apply banner saves dirty config even when the user navigates away
from a service page before clicking Apply.

Legacy routes /email, /calendar, /files, /store redirect to their new
canonical paths.

GET /api/config now includes installed_services so the nav can derive
which add-ons are installed without a separate store fetch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 06:46:17 -04:00
roof d7dbd596ab feat: route PIC services as subdomains of the cell's effective domain
Unit Tests / test (push) Successful in 11m33s
In DDNS modes (pic_ngo, cloudflare, duckdns, http01), all built-in
services are now reachable as subdomains of the cell domain, e.g.
calendar.pic1.pic.ngo instead of pic1.pic.ngo/calendar.

Key changes:
- CaddyManager._build_core_service_routes(): new helper generates
  Caddy named-matcher host blocks for calendar, mail/webmail, files,
  webdav, and api subdomains within the wildcard TLS server block.
- All ACME modes (pic_ngo, cloudflare, duckdns) use the new
  subdomain matchers; http01 emits a dedicated server block per service.
- http01: installed store-plugin services whose name clashes with a
  core service are skipped to prevent duplicate server blocks.
- routes/config.py: ip_utils.write_caddyfile() is skipped in non-LAN
  modes so LAN Caddy config never overwrites the ACME config.
- firewall_manager.generate_corefile(): new split_horizon_zones param
  adds local authoritative file zones so LAN clients resolve
  *.pic1.pic.ngo to the internal Caddy IP without hairpin NAT.
- NetworkManager.update_split_horizon_zone(): writes the wildcard zone
  file and regenerates the Corefile with the split-horizon block;
  called automatically after every identity change in non-LAN mode.
- Added @ to allowed record-name chars in update_dns_zone validation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 04:31:57 -04:00
roof 1f016de855 feat: make DDNS domain_name the effective domain across all services
Unit Tests / test (push) Successful in 11m35s
- ConfigManager.get_effective_domain(): returns domain_name when DDNS
  active (pic_ngo/cloudflare/duckdns), domain otherwise. Used by all
  public-facing services so they use the real registered FQDN.
- ConfigManager.get_internal_domain(): always returns _identity.domain
  (CoreDNS zone name, dnsmasq, cell-link invites — stays internal).
- Silent migration: if domain_mode != lan and domain is generic "cell",
  auto-set to {cell_name}.local for unique CoreDNS zone naming.
- caddy_manager: fix custom_domain bug — cloudflare/http01 modes were
  reading identity.get('custom_domain') which never exists; now reads
  domain_name correctly.
- routes/config, app: expose effective_domain in GET /api/config and
  /api/status responses.
- email_manager, routes/email: use get_effective_domain() for
  OVERRIDE_HOSTNAME, POSTMASTER_ADDRESS, and new-user email defaults.
- ServiceBus.IDENTITY_CHANGED event: emitted from PUT /api/config and
  POST /api/ddns/register after identity writes; caddy_manager and
  email_manager subscribe to regenerate config automatically.
- Settings.jsx: hide Local Domain input in non-LAN modes; show
  read-only effective_domain with "managed by DDNS" badge and an
  Advanced toggle for the internal CoreDNS zone name.
- 11 new test classes covering all new helpers, event subscriptions,
  caddy/email handlers, and the custom_domain fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 02:48:47 -04:00
roof 0b31d02f10 feat: DDNS self-healing heartbeat + manual re-register endpoint
Unit Tests / test (push) Successful in 15m26s
- DDNSTokenExpired exception triggers auto re-register in update_ip()
  so cells recover silently after a DDNS DB reset
- POST /api/ddns/register lets the user force re-registration from Settings
- Re-register button in Settings → External Domain & DDNS (pic_ngo only)
- 3 new tests covering register endpoint: wrong provider, missing name, success

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-26 15:05:27 -04:00
roof 61e8631c7d feat: DDNS settings integration — check availability, update credentials
- GET /api/config now returns domain_mode, domain_name, ddns.{provider,subdomain,has_token}
- GET /api/ddns/check/<name> proxies availability check to DDNS service
- PUT /api/ddns validates and saves cloudflare/duckdns credentials post-setup
- When cell_name changes for pic_ngo provider, auto-registers the new subdomain
- Settings: Cell Name shows availability badge for pic_ngo; auto-save blocks on taken
- Settings: new External Domain & DDNS section — pic_ngo info, cloudflare/duckdns edit
- 11 new tests for the two new endpoints (all pass)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-26 14:35:37 -04:00
roof 777ffa4fb2 fix: use DDNS_URL env var for availability check; default to port 8080
Unit Tests / test (push) Successful in 15m23s
_check_pic_ngo_available was hardcoding https://ddns.pic.ngo, ignoring
DDNS_URL. Now imports DDNS_API_BASE from setup_manager so both the
availability check and DDNS registration use the same configured URL.

API container now receives DDNS_URL and DDNS_TOTP_SECRET from env.
Default DDNS_URL points to http://ddns.pic.ngo:8080/api/v1 (the
FastAPI service runs on port 8080 without TLS termination in front).

Also returns 503 (not 500) when the DDNS service is unreachable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-26 13:06:44 -04:00
roof 4a42ff5dcc wizard: move all config to /setup; install.sh is infrastructure-only
Unit Tests / test (push) Successful in 15m41s
install.sh no longer prompts for anything. It installs packages (with sudo),
creates the system user, clones the repo, and runs 'make install' — all as
the invoking user. Only package installs and system-level ops use sudo.
All folder creation happens under the user's own account, no chown needed.

/setup wizard gains the missing validation that was previously in install.sh:
- Step 1: checks pic.ngo name availability via backend (non-blocking)
- Step 4: 'Verify token' button for Cloudflare and DuckDNS tokens,
  validated server-side through new /api/setup/validate steps

API changes (routes/setup.py):
- validate step 'pic_ngo_available': proxy check to ddns.pic.ngo
- validate step 'cloudflare_token': verify via Cloudflare tokens API
- validate step 'duckdns_token': verify via DuckDNS update endpoint

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 16:07:56 -04:00
roof 2d842abe5b installer: restore cell identity prompts and domain setup
Unit Tests / test (push) Successful in 15m39s
Reverts 8d1ef39. The installer must collect cell name, domain mode, and
provider tokens before 'make install' so that DDNS registration,
availability checks, and Caddy TLS can be configured at first boot.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 15:01:32 -04:00
roof 0a21f22076 Phase 4: service store — manifest validation, install/remove, Store UI
- ServiceStoreManager: manifest allowlist (git.pic.ngo/roof/*), volume
  denylist, ACCEPT-only iptables rules, ${SERVICE_IP}-only dest_ip
- IP allocator: pool 172.20.0.20-254, skips CONTAINER_OFFSETS VIPs
- Compose overlay: docker-compose.services.yml auto-included via DCF
- Flask blueprint at /api/store: list, install, remove, refresh
- Store.jsx: full install/remove UI with spinners and toast notifications
- 95 new unit tests for ServiceStoreManager (all passing)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-09 10:19:39 -04:00
roof f77d7fabcd Phase 3: ddns_manager — DDNS client, provider adapters, IP heartbeat
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-09 09:42:00 -04:00
roof cf1b9672f4 Phase 1: first-run setup wizard, bash installer, Docker profiles
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-09 08:05:38 -04:00
roof c2d215ee2e fix: cross-cell routing for split-tunnel peers
Three related fixes for split-tunnel peers that need to reach connected cells:

1. apply_peer_rules/apply_all_peer_rules now accept wg_subnet (actual local VPN
   subnet) and cell_subnets (connected cells' vpn_subnets) parameters instead of
   hardcoding 10.0.0.0/24. All callers (startup, add_peer, update_peer,
   apply-enforcement endpoint) pass the real values.

2. Explicit ACCEPT rules are inserted in FORWARD for each connected cell's
   subnet so split-tunnel peers (internet_access=False) can still reach
   connected cells via the wg0→wg0 path.

3. apply_ip_range in network_manager now loads cell_links.json and passes it
   to generate_corefile(), fixing a race where the bootstrap DNS thread could
   overwrite the Corefile and wipe cross-cell DNS forwarding zones on startup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 14:36:28 -04:00
roof 8ee1d88e37 Add subnet conflict validation for wireguard.address and ip_range changes
When a cell is connected to others, changing the local WireGuard address
or Docker ip_range to a subnet that overlaps a connected cell's vpn_subnet
would break routing. Both now return 409 with the conflicting cell name.

- wireguard.address: derive network from new address, check all connected
  cells' vpn_subnet for overlap (after existing format validation)
- ip_range: check all connected cells' vpn_subnet for overlap (after
  existing RFC-1918 validation)

Tests: 4 cases each (overlap → 409, no overlap → ok, no cells → ok,
format error still fires first → 400).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 10:00:58 -04:00
roof c658d2b16c Add domain conflict validation when changing domain or accepting heal invite
Two gaps allowed a cell to take a domain already in use by a connected cell:

1. PUT /api/config domain change: added check against cell_link_manager's
   connected cells list before saving — returns 409 if the new domain
   collides with any connected cell's domain.

2. accept_invite healing path: a remote cell changing its domain via a
   re-invite was not validated against other connected cells' domains.
   Now calls _check_invite_conflicts(invite, exclude_cell=name) before
   applying any change.

Also: the healing path now detects domain changes (alongside dns_ip/
vpn_subnet/endpoint), updates the stored domain, and refreshes the DNS
forward rule when the domain changes.

Tests: 3 new domain-conflict tests in test_config_validation.py;
3 new accept_invite healing tests in test_cell_link_manager.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 09:46:58 -04:00
roof dc2606541c feat: Phase 4 hardening — retry/backoff, loop detection, sync status UI + tests
Phase 4.1 — Retry/backoff for failed permission pushes:
- _compute_next_retry(): capped exponential backoff with jitter (60s–1h)
- _record_push_result(): tracks push_attempts and next_retry_at per link
- replay_pending_pushes(): skips links still in backoff window, logs deferred count
- _load() migration: adds push_attempts/next_retry_at to existing records

Phase 4.2 — Loop detection (A→B→A routing cycle):
- set_peer_route_via(): returns 409 if target cell already routes peers through us
- apply_remote_permissions(): soft warning when accepting exit-relay that would cycle

Phase 4.3 — Sync staleness indicator in Cell Network UI:
- SyncBadge component: green (synced), amber (pending/failed), gray (never)
- Shows relativeTime of last sync + error message + next retry estimate
- Injected into CellPanel header alongside tunnel online/handshake status

Tests (54 new):
- TestCheckInviteConflicts: subnet overlap, domain conflict, exclude_cell (9 tests)
- TestPushInviteToRemote: success, 4xx, no endpoint, subprocess errors (7 tests)
- TestAcceptInviteNew: new cell, idempotent, healing dns/subnet changes (16 tests)
- TestAddConnectionMutualPairing: push-invite call, non-fatal failure (5 tests)
- TestPeerSyncAcceptInvite endpoint: happy path, field validation, error propagation (16 tests)
- Fixed 2 existing replay tests to clear backoff gate (simulates elapsed window)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 04:18:36 -04:00
roof 960a4ecc51 fix: WG address change now queues pending restart + heals cell connections
Three issues fixed together:

1. WireGuard address changes now go through the pending-restart queue
   (shown in the UI banner) instead of restarting cell-wireguard immediately.
   Only private_key changes still restart immediately; address and port
   changes both defer to the user-initiated Apply flow.  Previously the
   address change was silently applied and never appeared in Settings →
   Pending Configuration.

2. When the WG address changes, the API spawns a background thread that
   pushes the updated invite to all connected cells (over LAN, before the
   WG tunnel is back up).  This lets remote cells automatically update
   their dns_ip, AllowedIPs, and CoreDNS forwarding rules without manual
   re-pairing.

3. accept_invite now handles the "already connected but changed" case:
   if the remote cell re-sends an invite with a different dns_ip, vpn_subnet
   or endpoint, we update the stored link, the WG AllowedIPs, and the
   CoreDNS forward rule in place — no delete/re-add required.  Previously
   the endpoint was ignored and returned the stale record unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 08:29:18 -04:00
roof 0e16d6968a fix: prevent test runs from corrupting live WG state; sync wg0.conf on IP change
Three fixes:

1. Extend the docker-exec safety guard in wireguard_manager to also check
   for 'wg_confs' in the config path.  When running unit tests on the host
   the API uses /app/config/wireguard/wg0.conf (no wg_confs subdir), so the
   old '/tmp/' | 'pytest' check didn't fire — _syncconf and friends were
   executing live 'docker exec cell-wireguard wg set' calls against the
   running container, removing real VPN peers that didn't appear in the
   test config.  The wg_confs subdir only exists inside the container mount,
   so its presence reliably gates live calls.

2. Fix get_split_tunnel_ips() wrong path: self.data_dir + 'api/cell_links.json'
   → self.data_dir + 'cell_links.json'.  The extra 'api/' segment produced
   /app/data/api/cell_links.json inside the container instead of the real
   /app/data/cell_links.json, so connected cells were silently excluded from
   split-tunnel CIDRs.

3. update_peer_ip_registry and ip_update now also call
   wireguard_manager.update_peer_ip so wg0.conf AllowedIPs stay in sync when
   a peer's VPN IP changes at runtime (previously only peers.json was updated).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 07:45:28 -04:00
roof 99c1d9cd92 feat: auto mutual WG pairing + subnet/domain conflict detection
**Auto mutual pairing**
When Cell A imports Cell B's invite (POST /api/cells on A), A now
immediately pushes its own invite to Cell B over the LAN (using the
endpoint IP, before the WG tunnel exists) via the new endpoint:
  POST /api/cells/peer-sync/accept-invite

Cell B auto-adds Cell A as a WireGuard peer and DNS forward, completing
the bidirectional tunnel without any manual action on Cell B's UI.
The endpoint is idempotent and unauthenticated (runs before WG tunnel).

Previously, the pairing was one-sided: Cell A had Cell B as a WG peer
but Cell B never had Cell A — the tunnel never established and all
cross-cell operations silently failed.

**Conflict detection (add_connection + accept-invite)**
_check_invite_conflicts() now validates before connecting:
  - VPN subnet must not overlap own subnet or any already-connected cell's subnet
  - Domain must not match own domain or any already-connected cell's domain
Returns clear error messages so the admin knows which cell to reconfigure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 06:24:46 -04:00
roof 9a800e3b6b feat: fix cross-cell service access — DNS DNAT, service DNAT, Caddy routing
DNS A records now return the WireGuard server IP (10.0.0.1) instead of
Docker bridge VIPs so cross-cell peers resolve service names correctly
regardless of their bridge subnet. DNAT rules (wg0:53→cell-dns:53 and
wg0:80→cell-caddy:80) are applied at startup. Caddy routes by Host header,
eliminating the Docker bridge subnet conflict. Firewall cell rules allow
DNS and service (Caddy) traffic from linked cell subnets. Split-tunnel
AllowedIPs now dynamically includes connected-cell VPN subnets and drops
the 172.20.0.0/16 range. Peers with route_via set now receive full-tunnel
config (0.0.0.0/0) so all their traffic exits via the remote cell.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 03:12:09 -04:00
roof 8ea834e108 feat: Phase 3 - per-peer internet routing via exit cell
Adds the ability to route a specific peer's internet traffic through a
connected cell acting as an exit relay.

Cell A side:
- PUT /api/peers/<peer>/route-via {"via_cell": "cellB"} sets route_via
- Updates WG AllowedIPs to include 0.0.0.0/0 for the exit cell peer
- Adds ip rule + ip route in policy table inside cell-wireguard so the
  specific peer's traffic egresses via cellB's WG IP
- Sets exit_relay_active on the cell link and pushes use_as_exit_relay=True
  to cellB via peer-sync

Cell B side:
- Receives use_as_exit_relay in the peer-sync payload
- Calls apply_cell_rules(..., exit_relay=True) to add FORWARD -o eth0 ACCEPT
- Stores remote_exit_relay_active flag for startup recovery

Startup recovery:
- apply_all_cell_rules passes exit_relay=remote_exit_relay_active (cellB)
- _apply_startup_enforcement reapplies ip rule for each peer with route_via (cellA)
  since policy routing rules don't survive container restart

peer_registry gets route_via field with lazy migration.
22 new tests across test_cell_link_manager, test_peer_registry, test_peer_route_via.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 16:23:31 -04:00
roof dcee03dd3f feat(cells): Phase 2 — exit-offer signaling between connected cells
Adds the ability for a cell to signal to a peer that it's willing to
route internet traffic on their behalf.  This is the signaling layer
for Phase 3 (per-peer routing via exit cell).

Changes:
- cell_links.json: exit_offered (bool) + remote_exit_offered (bool)
  fields with lazy migration (default false for existing records)
- _push_permissions_to_remote: includes exit_offered in the push body
- apply_remote_permissions: accepts exit_offered kwarg; stores it as
  remote_exit_offered on the matching cell link
- peer-sync receiver: passes exit_offered from body to apply_remote_permissions
- CellLinkManager.set_exit_offered(cell_name, offered): persists +
  triggers push so the remote learns of our offer immediately
- PUT /api/cells/<name>/exit-offer: REST endpoint to toggle the flag
- 12 new tests covering all new paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 15:49:21 -04:00
roof a3d0cd5a48 feat(cells): Phase 1 — permission sync between connected PICs
When PIC A updates service sharing permissions, it immediately pushes
the mirrored state to PIC B over the WireGuard tunnel so B's UI shows
what A is sharing with it in real time.

Architecture:
- Push model: update_permissions() → _push_permissions_to_remote() →
  POST /api/cells/peer-sync/permissions on remote cell
- Auth: source IP must be inside a known cell's vpn_subnet (WireGuard
  tunnel proves identity) + body's from_public_key must match stored key
- Mirror semantics: our inbound (what we share) → their outbound view
- Non-fatal: push failures set pending_push=True; replay_pending_pushes()
  retries at startup so offline cells catch up on reconnect
- add_connection() also pushes initial state so remote sees permissions
  immediately on the first connect

New fields on cell_links.json records (lazy-migrated):
  remote_api_url, last_push_status, last_push_at, last_push_error,
  pending_push, last_remote_update_at

New endpoint: POST /api/cells/peer-sync/permissions

30 new tests (1101 total).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 13:12:30 -04:00
roof 0b103ffafb feat(cells): fix PIC-to-PIC connection + add service-sharing permissions
Phase 1 — connection fixes:
- routing_manager.stop(): remove iptables -F / -t nat -F nuclear flush that
  would wipe WireGuard MASQUERADE and all peer rules on any UI stop action
- wireguard_manager.add_cell_peer(): reject vpn_subnet that overlaps the local
  WG network (routing blackhole — was the root cause of no handshake)
- wireguard_manager._syncconf(): pass Endpoint to 'wg set' so cell peers with
  static endpoints are synced to the kernel (not just AllowedIPs)

Phase 2 — service-sharing permissions backend:
- firewall_manager: add _cell_tag(), clear_cell_rules(), apply_cell_rules(),
  apply_all_cell_rules() — iptables FORWARD rules for cell-to-cell traffic
  using 'pic-cell-<name>' comment tags, distinct from 'pic-peer-*'
- app.py startup enforcement: call apply_all_cell_rules(cell_links) so rules
  survive API restarts
- cell_link_manager: permissions schema {inbound, outbound} per service;
  lazy migration for existing entries; update_permissions(), get_permissions();
  apply_cell_rules wired into add_connection/remove_connection
- routes/cells.py: GET /api/cells/services, GET+PUT /api/cells/<n>/permissions;
  RuntimeError now returns 400 (not 500) from add_connection

Removed broken 'test' cell (subnet 10.0.0.0/24 collided with local WG network).
Second PIC must use a distinct subnet (e.g. 10.0.1.0/24) before reconnecting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 08:35:24 -04:00
roof 5d0238ff3c A5: Extract config routes into blueprint (app.py 1294 → 579 lines)
Move all /api/config/* routes and pending-restart helpers into
routes/config.py. Re-export helpers from app.py for backward compat:

  from routes.config import _set_pending_restart, _clear_pending_restart,
                           _collect_service_ports, _dedup_changes

Test patches updated:
  app._set_pending_restart     → routes.config._set_pending_restart
  app._clear_pending_restart   → routes.config._clear_pending_restart
  app.threading.Thread         → routes.config.threading.Thread

Remaining in app.py: Flask setup, middleware, health monitor thread,
/health, /api/status, /api/health/history* (use module-level state).

1021 tests passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 06:53:24 -04:00
roof 09138fbc18 A5: Extract all route groups into Flask blueprints (app.py -1735 lines)
Extract 9 route groups out of app.py into routes/ blueprints:
- routes/network.py  — DNS, DHCP, NTP, network info/test (10 routes)
- routes/wireguard.py — WireGuard keys, peers, config, enforcement (18 routes)
- routes/cells.py    — cell-to-cell connections (5 routes)
- routes/peers.py    — peer CRUD + IP update + _next_peer_ip helper (10 routes)
- routes/routing.py  — NAT, peer routes, firewall, iptables (17 routes)
- routes/vault.py    — certs, trust, secrets (19 routes)
- routes/containers.py — containers, images, volumes (14 routes)
- routes/services.py — service bus, logs, services status/connectivity (18 routes)
- routes/peer_dashboard.py — peer-scoped dashboard/services (2 routes)

All blueprints use lazy `from app import X` inside route bodies to preserve
test patch compatibility (patch('app.email_manager', mock) still works).

Also included in this commit:
- A1 fix: backup/restore now includes email/calendar user files
- A2 fix: apply_config sets applying=True flag via helper container
- A3 fix: add_peer rolls back firewall on DNS failure

app.py reduced: 3011 → 1294 lines. 1021 tests passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 06:11:21 -04:00