Commit Graph

195 Commits

Author SHA1 Message Date
roof 743b026b01 feat: connectivity redesign phase 7 — cell-relay as a connection type
Unit Tests / test (push) Successful in 13m22s
cell exits surface as cell_relay connections via reconcile, bridged onto
the existing cell route_via mechanism, health from handshake, loop
detection, assignable in the unified UI

- CELL_RELAY_TYPE constant; not manually creatable
- reconcile_cell_relays() derives connections from cell links offering an
  exit (name "Cell: <cellname>", mark+table only, no iface/port/container)
- apply_routes bridges cell_relay to existing route_via path via
  apply_peer_route_via + cell firewall rules + set_exit_relay_active;
  keeps peer.route_via in sync
- _probe_cell_relay health from cell handshake + offer state
- _cell_relay_loops loop detection at assign and apply time
- FAILOPEN_DEFAULTS cell_relay=False
- set_peer_exit clears stale route_via on reassignment
- reconcile hooked into PUT /exit-offer and peer-sync/permissions handlers
- cell_link_manager + wireguard_manager wired into connectivity_manager
- UI: cell_relay in TYPE_META/GROUP_TYPES/GROUP_LABELS (Cells optgroup),
  removed "coming soon" placeholder
- 18 new tests in tests/test_connectivity_cell_relay.py

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 23:58:19 -04:00
roof 603225694c feat: connectivity redesign phase 5 — one container per connection instance
Unit Tests / test (push) Successful in 13m5s
instanceable rendering, per-instance up/down on create/delete,
store-service-installed gate, per-instance health

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 22:56:31 -04:00
roof d39c091cec feat: connectivity redesign phase 3+4 — per-connection health, per-peer fallback, connection CRUD API
Unit Tests / test (push) Successful in 13m15s
Health probes (probe_health/refresh_health) are type-aware: WireGuard
checks the last WG handshake timestamp, OpenVPN checks the tun/tap
interface, Tor checks the control-port GETINFO, and sshuttle/proxy
types do a TCP reachability probe to the remote endpoint. Results are
persisted via set_connection_status and wired into the health_monitor_loop
so the UI always has a current health snapshot without polling.

Per-peer fail-open semantics: VPN, SSH, and proxy connections default to
fail-closed (kill-switch stays active even when the tunnel is down).
Tor defaults to fail-open. The default can be overridden per-peer via
set_peer_failopen/effective_failopen. apply_routes skips the fwmark and
kill-switch rules for any fail-open peer whose connection health is not
"working", letting traffic fall back to direct routing transparently.

New generic admin-only connection CRUD endpoints (GET/POST/PUT/DELETE
/api/connectivity/connections, GET /<id>/health, PUT
/api/connectivity/peers/<peer>/failopen) are guarded by the existing
admin role check. connection.create, connection.update, connection.delete,
and peer.failopen are all registered in ROUTE_ACTION_MAP for the audit
hook so every change is recorded in the owner-visible change log.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 21:50:45 -04:00
roof 8b50fb1036 feat: audit/change log — owner-visible record of who changed what
Unit Tests / test (push) Successful in 12m47s
Add AuditManager (api/audit_manager.py): JSONL append-only log at
data/api/audit/audit.log with SHA-256 hash chain for tamper detection,
verify endpoint, size-based rotation, and automatic redaction of secret
fields before any entry is written. Supports structured query (actor,
action, date range) and CSV export.

Wire an @app.after_request hook in app.py that fires on every mutating
/api/* request: captures actor, role, remote IP, and maps the route +
method to a human-readable action via ROUTE_ACTION_MAP. Explicit audit
entries for password_change and password_reset are added in
auth_routes.py so those events record the actor without logging secret
values.

Expose an admin-only blueprint (api/routes/audit.py):
  GET /api/audit          — paginated query
  GET /api/audit/export   — CSV download
  GET /api/audit/verify   — hash-chain integrity check

Register AuditManager in managers.py and add api/audit to
config_manager.py critical_data_paths so it is included in backups and
restored with other persistent state.

Add Activity page (webui/src/pages/Activity.jsx, admin-only) reachable
from the nav in App.jsx. New auditAPI helper in api.js covers all three
endpoints.

Tests: test_audit_manager.py (unit: hash chain, redaction, rotation,
query, csv, verify) and test_audit_hook_routes.py (integration: hook
fires on mutating routes, skips safe methods, records actor/ip/action,
backup-inclusion assertion).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 20:19:38 -04:00
roof 13074f56cb fix: logging verbosity now actually applies + per-service log levels
Unit Tests / test (push) Successful in 12m34s
Root causes fixed:
- Dead LOG_LEVEL globals() lookup pinned root logger at INFO regardless of
  PIC_LOG_LEVEL env or config; replaced with _resolve_root_log_level() +
  apply_root_log_level() which sets both root logger and all attached handlers
  at startup and on runtime re-apply.
- set_service_level() only set the named 'pic.<service>' logger; bare module
  loggers (e.g. 'caddy_manager') were never reached, so per-service log files
  stayed 0 bytes. Fixed via _SERVICE_MODULE_LOGGERS map covering all managers.
- Log viewer GET /api/logs had no level filter; added ?level= query param.
- Per-service log levels lived in an out-of-band config/api/log_levels.json
  side-file with no validation; migrated into ConfigManager under a new
  'logging' section ({python:{root,services}, containers:{caddy,coredns,
  wireguard,mailserver,api}}) with get/set helpers, invalid-level rejection,
  and one-time migration from the old file on first load.

New capabilities:
- Container log levels: Caddy (injects global log { level X } + hot reload),
  CoreDNS (DEBUG enables log plugin, else errors-only), WireGuard/mailserver
  via pending_restart path.
- PUT /api/logs/verbosity accepts {python, containers} dict; returns per-entry
  applied:hot|pending_restart status.
- Webui Logs page gains two-section Verbosity tab (Python services + Container
  services) with needs-restart badges.
- managers.py wires per-service loggers before manager instantiation and
  re-applies persisted levels from ConfigManager; legacy log_levels.json read
  removed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 19:14:01 -04:00
roof 89aed4efe0 feat: connectivity redesign phase 2 — instance-aware routing + reference connections by id
Unit Tests / test (push) Successful in 12m6s
apply_routes now iterates over connection instances rather than types:
each instance gets its own fwmark, routing table, interface, and
redirect_port via _routing_connections / _resolve_peer_connection /
_apply_connection_for_src; kill-switch is enforced per iface-instance.
Old per-type MARKS/TABLES constants are kept only as migration scaffolding.

peer_registry: exit_via is now stored as a connection id (or 'default');
_migrate_exit_via_to_connection_id runs on _load_peers to upgrade legacy
type-string values; set_peer_exit_via validates against known connection
ids; VALID_EXIT_VIA removed; config_manager wired in from managers.py.

egress_manager: egress_overrides keyed by service_id → connection_id;
local MARKS/TABLES/EXIT_TYPES/_REDIRECT_PORTS/_add_tor_redirect removed;
(mark, table, redirect_port) resolved at apply-time via
connectivity_manager.get_connection; manifest egress.allowed still
enforced by connection type.

api/app.py + api.js: PUT peer/service exit endpoints accept {connection_id};
back-compat shim resolves a legacy type string to its single active instance.

Tests extended: two same-type instances produce distinct marks/tables/ports;
peer exit_via and egress override id migrations round-trip correctly;
single-instance behaviour is equivalent to the old type-keyed path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:35:28 -04:00
roof 5b9d20eeac feat: connectivity redesign phase 1 — multi-instance connection data model
Unit Tests / test (push) Successful in 12m51s
Migrate from the single-exit-per-type model (one wireguard_exit, one
tor_exit, etc.) to N named connection instances, each carrying its own
resource allocations and vault-backed secret refs.

config_manager.py:
- Connectivity v2 schema: top-level `connections` list, each entry has
  id, name, type, enabled, status, config, secret_ref, and allocated
  resources (mark, table, iface, redirect_port).
- Helpers: get_connectivity / list_connections / get_connection /
  add_connection / update_connection / delete_connection /
  set_connection_status.
- v1→v2 migration: promotes legacy wireguard_exit / tor fields into
  the new list on first load; idempotent on v2 configs.

connectivity_manager.py:
- Resource allocator: per-instance fwmark range 0x1000–0x1FFF, routing
  table range 1000+, interface names, and redirect ports 9100–9199;
  all tracked in config to survive restarts.
- Connection CRUD: create / update / delete / list / get with vault
  secret refs for WireGuard private keys and Tor credentials.
- Single-Tor enforcement: rejects a second tor/tor_bridge instance at
  creation time.
- Per-instance config validation for each connection type.
- apply_routes, peer wiring, and egress hookups are intentionally left
  unchanged in this phase; they land in later phases alongside UI.

tests/test_connectivity_connections.py (new, 473 lines):
- Allocator uniqueness, v1→v2 migration round-trip, CRUD lifecycle,
  single-Tor enforcement, and status transitions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 16:34:56 -04:00
roof 82a0c0e9bd fix: overhaul backup/restore — full secrets coverage, ordered reapply, optional passphrase encryption
Unit Tests / test (push) Successful in 12m25s
P0 — backups previously omitted peers/keys/vault(CA+fernet)/auth/cell-links/ddns/connectivity
configs (a restore lost everything incl admin login + CA) and included logs/trash; restore did
file-copies only with no reapply.

Changes:
- api/config_manager.py: backup_config now includes auth_users.json, .flask_secret_key,
  peers.json, peer_service_credentials.json, WireGuard keys + wg_confs + api/wireguard/keys,
  vault/** (incl fernet.key), api/services + service configs, cell_links.json, ddns_token,
  caddy/**; new _is_excluded() drops logs/config_backups/.test_admin_pass/.gitkeep/*.tmp/
  *.partial/__pycache__; restore_config reordered (vault/fernet → config → wg keys/peers →
  cell_links → caddy/dns → service configs → auth/ddns → volumes) + new _reapply_runtime_state()
  (regenerate Caddyfile/Corefile, reapply services, connectivity apply_routes, replay cell pushes)
- api/backup_crypto.py (new): optional passphrase encryption via scrypt-derived key + Fernet;
  encrypted archives written 0600
- api/routes/config.py: backup/restore accept optional {passphrase}; wrong/missing passphrase
  returns 400; backup response warns it contains secrets
- Makefile: backup target applies same excludes + chmod 0600 + secrets warning
- webui/src/services/api.js + webui/src/pages/Settings.jsx: passphrase field on create backup,
  restore prompt, "contains secrets" banner
- tests/test_config_backup_overhaul.py (new, 18 tests) + tests/test_config_backup_restore_http.py
  (2 assertions updated)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:41:10 -04:00
roof c65beb27a6 fix: remove sysctl rp_filter from WireGuard PostUp/PostDown
Unit Tests / test (push) Failing after 11m57s
sysctl writes to /proc/sys/net/ are blocked in unprivileged containers
(NET_ADMIN only, no SYS_ADMIN). The rp_filter=0 call at the end of
PostUp caused wg-quick to tear down wg0 immediately on every start,
putting cell-wireguard into a crash loop.

Remove the sysctl lines from both the seed (setup_cell.py) and the
API-regenerated (wireguard_manager.py) wg0.conf. Reverse-path filtering
is an optimisation, not required for VPN functionality; the iptables
FORWARD/MASQUERADE/DNAT rules all still work correctly without it.

Found during clean-install hardening verification on pic1 (f4b8d5c).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-10 14:33:05 -04:00
roof f4b8d5c4f7 harden containers: drop WG privileged, slim images, digest pins; fix WG path + empty chrony.conf
Unit Tests / test (push) Successful in 12m16s
Security — WireGuard:
- Replace linuxserver/wireguard (privileged + SYS_MODULE + /lib/modules) with a
  bespoke alpine image (wireguard/Dockerfile + entrypoint.sh): CAP_NET_ADMIN only,
  119 MB → 14.7 MB. Modern kernels (≥5.6) have WireGuard built in; no module
  loading required. Kernel-fallback comment left in compose for rare old kernels.

Security — supply-chain digest pins:
- CoreDNS image pinned by SHA-256 digest in docker-compose.yml.
- api/Dockerfile: python:3.11-slim and docker:27-cli pinned by digest.
- webui/Dockerfile: node:20-alpine and nginxinc/nginx-unprivileged:alpine pinned.
- ntp/Dockerfile: alpine:3.20 pinned by digest.
- wireguard/Dockerfile: alpine:3.20 pinned by digest.

Security — webui non-root:
- Switch from nginx:alpine (root, port 80) to nginxinc/nginx-unprivileged:alpine
  (port 8080, runs as nginx uid 101). Compose port mapping and all Caddy upstream
  references updated: cell-webui:80 → cell-webui:8080 everywhere.

API layer reduction (561 MB → 245 MB):
- Multi-stage api/Dockerfile: docker CLI copied from docker:27-cli stage instead
  of being installed via apt from Docker's external repo (removes GPG key fetch,
  lsb-release, gnupg, two apt-get update rounds). --no-install-recommends on
  remaining apt install. mkdir folded into the same RUN layer.

Bug fix — WireGuard config path mismatch:
- setup_cell.py wrote wg0.conf to config/wireguard/wg0.conf but wireguard_manager
  and the new entrypoint expect config/wireguard/wg_confs/wg0.conf (the standard
  wg-quick sub-directory). Fixed by creating the wg_confs/ sub-dir and writing
  there; REQUIRED_DIRS updated to pre-create it.

Bug fix — empty chrony.conf:
- config/ntp/chrony.conf was 0 bytes (pre-existing gap); added a real config
  (pool.ntp.org + Cloudflare, allow 172.20/10.0, local stratum 10, driftfile,
  makestep, rtcsync). NTP compose service now builds from ./ntp instead of
  pulling alpine:latest and running apk at every container start.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:07:54 -04:00
roof 1daace48eb fix: DNS first-install — split-horizon zone creation + CoreDNS inode bind-mount
VPN clients got dns_probe_finished_bad_config / couldn't resolve any domain
after first setup because:

1. complete_setup() never wrote the split-horizon DNS zone for non-LAN modes;
   SetupManager now accepts network_manager as an optional 3rd constructor
   param, and complete_setup() calls
   self.network_manager.update_split_horizon_zone(effective_domain, wg_ip,
   primary_domain) for pic_ngo/cell_to_cell modes.

2. generate_corefile() used a tmp-file + os.replace pattern; the Corefile is
   a Docker FILE bind-mount, so os.replace orphaned the inode and CoreDNS
   never saw config updates.  Fixed by truncating and rewriting in place
   (open with 'w', seek(0), truncate()), preserving the inode CoreDNS holds.

api/managers.py passes network_manager into SetupManager.
Tests: new mock_network_manager fixture, 2 setup-zone tests, 1 inode
regression test in test_firewall_manager.py.
Verified live on pic1.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 12:48:37 -04:00
roof c41cadafb4 refactor: Network Services rebuilt, DHCP decommissioned, infra cleanup
Network Services page is rebuilt around real API data: GET /api/dns/overview
returns provider-aware records; per-service Cloudflare sync is exposed via
POST /api/ddns/sync; effective domain is displayed so operators can verify
what external name resolves to the cell; NTP status reflects the actual
systemd-timesyncd state rather than a hardcoded boolean.

DHCP is fully decommissioned: the cell-dhcp container is removed from
docker-compose.yml, DHCP methods are stripped from network_manager, the
setup_cell script no longer seeds DHCP config, and the Settings DHCP field
is gone. DHCP was never a PIC responsibility and the container was consuming
resources for no benefit.

Dead code removed: api/config.py (superseded by config_manager), the
standalone Email/Calendar/Files pages (these are now optional store services
and do not need dedicated pages). api/constants.py is introduced to hold
RESERVED_SUBDOMAINS in one place rather than scattered literals.

Docker resource limits (mem_limit, cpus, pids_limit) are added to all
compose services so a runaway process cannot starve the host.

Makefile gains a warning before the backup target so operators are not
surprised by the archive path. Settings same/accept state fix ensures
the Cell Identity section correctly shows the accept/discard banner and
does not flash a false-positive change indicator on first load.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 08:50:00 -04:00
roof 6232ef23a9 feat: connectivity — registry-driven peer table, sshuttle/proxy egress, egress UI
The peer table was empty because it was not consulting the peer registry;
now peers are driven by PeerRegistry so the Connectivity page reflects actual
connected cells.

Exit-key handling is unified: all code paths now use the same key derivation
so a store-service exit bridge and a manual WireGuard peer both produce
consistent routing state.

Two new egress exit types are added (sshuttle via SSH tunnel and proxy via
redsocks SOCKS5), wiring through connectivity_manager, egress_manager, and
app.py routes. This lets a cell route its traffic through an SSH host or a
SOCKS5 proxy as an alternative to WireGuard exit nodes.

ServiceStoreManager and ServiceBus updated so the egress lifecycle (install /
uninstall) is cleanly signalled between components.

Connectivity.jsx gains the Service Egress section, letting operators assign
and reassign egress methods from the UI without touching config files.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 08:36:15 -04:00
roof cc7a223fdf fix: P0/P1 audit fixes — DDNS correctness, peer provisioning gates, honest stubs
CloudflareDDNS.update() was calling the wrong endpoint; fix to use the
correct zone-records API so DDNS updates actually land.

NoIP and FreeDNS providers now return explicit "not implemented" errors
instead of silently claiming success, preventing false-positive health state.

PicNgoDNS ACME dns-challenge now sends the token in the request body (was
missing), so cert issuance no longer silently fails.

add_peer gates builtin-service provisioning on the installed-services list
so a freshly-provisioned peer does not attempt to configure services that
aren't present, eliminating the startup error loop.

Startup Caddyfile regeneration added to routes/config.py so that a stale
on-disk Caddyfile no longer triggers the health-monitor restart loop after
a config change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 08:23:00 -04:00
roof 649378b59b fix: resolve all Cell Identity banner and cert issues
Unit Tests / test (push) Successful in 7m17s
Four bugs fixed:

1. Banner delay (up to 5 s): DraftConfigContext now exposes isDirty as
   reactive useState so App.jsx re-renders immediately when any section
   marks itself dirty, instead of waiting for the next checkPending() poll.

2. Banner re-triggers after Apply (race): For non-'*' container restarts
   (e.g., cell_name → DNS restart) the background thread took ~300 ms to
   clear _pending_restart. A concurrent checkPending() poll could see
   needs_restart=True and overwrite the frontend's optimistic clear.
   Fix: set needs_restart=False and applying=True synchronously before
   spawning the thread.

3. Apply showed banner during applyPending() when hasDirty()==false:
   setApplyStatus('saving') was skipped for the auto-save-then-apply
   path, leaving applyStatus=null while applyPending() ran and the
   banner stayed visible. Always set 'saving' before applyPending().

4. Cert status always 'unknown' in pic_ngo mode: _check_cert_via_ssl
   connected to cell-caddy:443 but sent SNI='cell-caddy'. Caddy finds no
   matching cert and returns nothing. Fix: pass the effective public
   domain (e.g. pic1.pic.ngo) as SNI so Caddy returns the right cert.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-10 04:17:56 -04:00
roof c1e93f2058 Fix stale DNS zone after wizard completes (#8)
Unit Tests / test (push) Successful in 7m29s
_bootstrap_dns runs at container start before the wizard, writing the
default cell name ('mycell') into cell.zone.  When the wizard completed
it fired IDENTITY_CHANGED for Caddy but never updated the DNS zone, so
DNS records kept showing 'mycell.cell' even after naming the cell.

After successful wizard completion, call network_manager.apply_cell_name
to rename the hostname record in the primary zone file, then reload
CoreDNS.  The empty old_name triggers auto-detection so it works even
when the zone was written with the env-var default.

Adds test_setup_route.py covering: apply_cell_name called on success,
not called on failure, 410 on repeat completion, and IDENTITY_CHANGED
publication.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 05:14:22 -04:00
roof 3d750ed1e8 Fix DDNS security and reliability gaps (#2, #3, #5, #6, #7)
Unit Tests / test (push) Successful in 7m23s
- Fix #2: Move DDNS bearer token from cell_config.json to data/api/ddns_token.
  Token is now in the secrets store (data/) rather than the config store (config/).
  Auto-migrates existing installs on first access. ConfigManager.get/set_ddns_token()
  added. set_ddns_config() now strips 'token' key to prevent it leaking back.

- Fix #3: Set Caddyfile permissions to 0o600 after write so the token embedded
  in the Caddyfile is not world-readable on the host filesystem.

- Fix #5: Heartbeat now fires IDENTITY_CHANGED after re-registration so Caddy
  regenerates its config with the new token automatically — users no longer need
  to click Re-register in Settings after a wizard registration failure.
  Also: heartbeat skips the 401-cycle when no token exists and goes straight to
  registration instead. DDNSManager now accepts service_bus= and is wired up.

- Fix #6: Settings page starts polling GET /api/caddy/cert-status every 15s
  after a successful DDNS re-registration and shows "Acquiring certificate…"
  feedback until Let's Encrypt issues the cert (up to 5 minutes).

- Fix #7: regenerate_with_installed() is debounced (5 s window) so two rapid
  IDENTITY_CHANGED events (e.g. wizard + heartbeat) can't start simultaneous
  ACME orders that interfere with each other.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 03:37:48 -04:00
roof 40f9d90fad feat: improve setup wizard and DDNS UX
Unit Tests / test (push) Successful in 7m29s
Setup wizard (Issue 1 — UI):
- pic.ngo subdomain input now uses the same split-field style as DuckDNS:
  input + static '.pic.ngo' suffix in a flex row, availability status below

Setup wizard (Issue 2 — Caddy not regenerating after completion):
- complete_setup route now fires IDENTITY_CHANGED after a successful wizard
  submission so CaddyManager regenerates the Caddyfile immediately; users
  no longer need to press 'Renew Certificate' to start ACME

Settings — DDNS status (Issue 2 — domain status missing):
- New GET /api/ddns/status endpoint: returns registered flag, domain_name,
  public_ip (ipify with 30s cache), last_ip from heartbeat
- Settings DDNS section for pic_ngo now shows a live status row with
  color-coded dot (green=registered+current, yellow=registered+stale,
  gray=not registered), current public IP, and a Check button
- Status auto-refreshes on mount and after each successful re-registration

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 00:36:47 -04:00
roof e9077b2633 fix: Caddy health check must hit /config/ not /
Unit Tests / test (push) Successful in 7m35s
GET http://cell-caddy:2019/ returns 404 because Caddy's admin API has no
root handler.  The health monitor interpreted every response as a failure,
restarted Caddy every 3 minutes, and prevented ACME from ever completing.

/config/ returns 200 + the running config JSON whenever Caddy is up and
serving — that is the correct liveness indicator.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 15:57:32 -04:00
roof da302b5d54 fix: renew_cert regenerates Caddyfile before reload
Unit Tests / test (push) Successful in 7m32s
A stale or empty-token Caddyfile on disk caused Caddy to reject the
/load request, so the Renew button appeared to do nothing. Now
renew_cert() calls regenerate_with_installed([]) first, which writes a
fresh Caddyfile from current identity/config before reloading Caddy.
This ensures a broken on-disk file never blocks ACME renewal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 14:38:30 -04:00
roof 6bd5f02b03 fix: surface DDNS registration failure during setup wizard
Unit Tests / test (push) Successful in 7m34s
Two problems on fresh install with pic_ngo mode:

1. Caddy crashed at startup because ddns.token was empty (registration
   hadn't completed yet), producing a bare `token` keyword in the
   Caddyfile that Caddy rejects with "wrong argument count".
   Fix: fall back to lan mode in _caddyfile_pic_ngo when the token is
   empty so Caddy always starts cleanly. The Caddyfile is regenerated
   once registration completes and the token is persisted.

2. DDNS registration failures were silently swallowed — the wizard
   showed "Setup complete!" with no indication that HTTPS wouldn't work.
   This made it look like everything was fine when the subdomain was
   never registered (e.g. name already taken from a previous install,
   or transient network error).
   Fix: capture the exception, classify it (name_taken vs transient),
   and return it as a `warnings` list in the setup response. The wizard
   done screen now shows amber warning cards with actionable text instead
   of auto-redirecting, giving the user a "Continue to login" button and
   a clear explanation of what went wrong.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 13:52:00 -04:00
roof 7ef294fd65 fix: fall back to lan mode in pic_ngo Caddyfile when token is empty
Unit Tests / test (push) Successful in 7m42s
On a fresh install before DDNS registration completes, ddns.token is
empty. Writing `token ` (bare keyword, no value) causes Caddy to reject
the Caddyfile at startup with "wrong argument count or unexpected line
ending after 'token'".

Guard added: if the token is empty, generate a LAN-mode Caddyfile so
Caddy starts cleanly. The Caddyfile is regenerated automatically once
registration completes and the token is persisted to cell_config.json.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 13:38:51 -04:00
roof 33d255f089 feat: TLS certificate management in Vault page
Unit Tests / test (push) Successful in 7m26s
Adds live cert status, one-click ACME renewal, and custom cert upload
directly to the Vault page so users never need to touch Caddy config.

Backend:
- CaddyManager.get_cert_status() now returns domain, domain_mode, and
  cert_type so the UI can render the right controls without a separate
  identity fetch
- CaddyManager.renew_cert() reloads Caddy and invalidates the status
  cache; the frontend polls until the cert turns valid
- CaddyManager.upload_custom_cert() validates PEM, writes cert+key to
  the shared config/caddy/certs/ volume, updates identity (cert_type=custom),
  and regenerates the Caddyfile so Caddy references the new paths
- LAN-mode Caddyfile switches from /etc/caddy/internal/ to the shared
  certs dir automatically when cert_type=custom is set
- ddns_api default no longer includes /api/v1 — the plugin appends it;
  legacy /api/v1 suffix is stripped at write time to keep the Caddyfile clean
- POST /api/caddy/cert-renew and POST /api/caddy/custom-cert routes added

Frontend:
- TLSPanel component at the top of Vault.jsx shows status badge
  (valid/expiring-soon/expired/pending/internal) with domain and expiry
- Renew button visible only for ACME modes; spins during the API call
  then polls GET /api/caddy/cert-status every 10 s until valid
- Upload Custom Cert opens a modal with PEM text areas; works for all modes
- caddyAPI.renewCert() and uploadCustomCert() added to api.js

Tests: 22 new tests across 5 classes covering enriched status,
renew_cert guards, upload_custom_cert validation/writes/persistence,
custom-cert Caddyfile path selection, and ddns_api suffix stripping.
All 2093 existing tests continue to pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 12:53:42 -04:00
roof 85d265187d fix: Caddy TLS cert acquisition — two DNS-01 blockers
Unit Tests / test (push) Successful in 7m32s
1. caddy_manager: embed ddns.token (registration bearer token) in
   Caddyfile, not DDNS_TOTP_SECRET. The pic_ngo plugin sends the token
   to POST /api/v1/dns-challenge; using the TOTP secret caused 401 on
   every attempt.

2. firewall_manager: add _acme-challenge.<zone> forwarding block before
   each split-horizon zone in the Corefile. Without this, CoreDNS was
   authoritative for the challenge name and returned NODATA for TXT
   queries (wildcard A record matches but wrong type), blocking Caddy's
   internal DNS pre-verification step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 10:45:15 -04:00
roof 76bbc2b67a fix: EmailManager route calls get_email_users not get_users
Unit Tests / test (push) Successful in 7m27s
The method is named get_email_users in EmailManager; the route was
calling the non-existent get_users, causing an AttributeError on every
GET /api/email/users request.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 10:12:24 -04:00
roof bd71466a87 fix: split-horizon DNS zone uses WireGuard IP, not Docker bridge IP
Unit Tests / test (push) Successful in 7m31s
VPN peers can reach Caddy via the host's WireGuard interface (10.0.0.1),
not via the Docker bridge IP (172.20.0.2) which is unreachable outside
the container network. _bootstrap_dns now calls _get_wg_server_ip()
instead of ip_utils.get_service_ips() so the internal zone returns a
routable address for service subdomains.

Also log config save failures instead of silently swallowing them —
the silent PermissionError/OSError was masking write failures and
making it impossible to diagnose why installed services disappeared
after container restarts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-08 02:11:01 -04:00
roof 69862331e7 fix: DDNS update token in body, webdav gating, regression tests
Unit Tests / test (push) Successful in 7m25s
- PicNgoDDNS.update(): send token in request body instead of Authorization
  header; DDNS server validates it from body (was returning HTTP 422 on
  every heartbeat, leaving IP record stale after fresh install)
- peers.py / Peers.jsx: webdav service_access only valid when 'files' store
  service is installed; was always shown even with no services, confusing
  users into thinking WebDAV was pre-installed
- 10 new regression tests: DDNS update body contract, Caddy always
  regenerates on startup with no services, peer role allowed on
  /api/services/active, webdav gating by installed services

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 16:56:12 -04:00
roof 1607a2e86f fix: peer access to /api/services/active and unconditional Caddy startup regen
Unit Tests / test (push) Successful in 7m23s
- Add _PEER_READABLE_PATHS allowlist in enforce_auth so peer-role sessions
  can read /api/services/active; fixes My Services showing 'not installed'
  for cell members when services are installed
- Move Caddy regeneration before the early-return in reapply_on_startup so
  the Caddyfile is always rebuilt from current identity on startup, even when
  no store services are installed; fixes ERR_SSL_PROTOCOL_ERROR after a cell
  rename (Caddyfile retained old wildcard domain)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 15:58:27 -04:00
roof 9bdda6aaf8 fix: service credential provisioning and install reliability
Unit Tests / test (push) Successful in 7m21s
- calendar: create_calendar_user() now writes bcrypt htpasswd entry to
  data/services/calendar/config/users (the path Radicale reads at
  /etc/radicale/users); delete_calendar_user() removes the entry

- email: create_email_user() calls `docker exec cell-mail setup email add`
  to register the account in docker-mailserver's Dovecot/Postfix store;
  delete_email_user() calls the matching `setup email del` — both are
  non-fatal if the container isn't running

- service_composer.install(): pull image separately before up so slow
  registry pulls don't race with container startup; retry up once on
  failure so a transient registry hiccup on first install doesn't
  require the user to manually retry

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 13:41:41 -04:00
roof c696ca9ef6 fix: DNS split-horizon in DDNS mode, service access filter, health check, verbosity persistence
Unit Tests / test (push) Successful in 7m32s
- DNS (critical): add _configured_dns_params() that returns (primary_domain,
  split_horizon_zones) from config_manager so all apply_all_dns_rules() callers
  pass the correct primary zone (e.g. 'pic.ngo') and split-horizon list
  (e.g. ['pic1.pic.ngo']) instead of the FQDN as the primary — fixes
  DNS_PROBE_FINISHED_BAD_CONFIG for all external domains when on VPN

- firewall_manager: add split_horizon_zones param to apply_all_dns_rules()
  and forward it to generate_corefile()

- Peers: filter service_access list to installed services only; peers.py
  derives valid services from config_manager.get_installed_services() with
  the email→mail ID mapping; Peers.jsx fetches from /api/store/installed
  and filters the checkboxes and defaults accordingly

- Health check: fix file_manager→'files' ID mapping so files service health
  is checked when installed (was silently skipped due to 'file' vs 'files')

- Verbosity persistence: move log_levels.json from non-mounted
  /app/api/config/ to CONFIG_DIR (/app/config/) which maps to config/api/
  on the host; both load (managers.py) and save (routes/services.py) updated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 13:05:58 -04:00
roof 4ebcb1d077 fix: don't overwrite split-horizon Corefile from _bootstrap_dns
Unit Tests / test (push) Successful in 7m29s
The apply_all_dns_rules() call at the end of _bootstrap_dns() was
added to force reload 30s into the Corefile on startup. Now that
reload 30s is removed (it broke CoreDNS zone serving), the call is
unnecessary in LAN mode and actively harmful in DDNS mode:
update_split_horizon_zone() already writes the correct Corefile
with the split-horizon block; the subsequent apply_all_dns_rules()
call would overwrite it without the split-horizon zones, causing
all service subdomain lookups to return NXDOMAIN.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 04:56:41 -04:00
roof 0507445d86 fix: remove file reload 30s from CoreDNS zone blocks
Unit Tests / test (push) Successful in 7m29s
CoreDNS 1.14.3 returns REFUSED for all zones that use
'file /data/zone reload 30s' — the reload timer defers the
initial zone load, causing the plugin to return REFUSED until
the timer fires. The timer never resolves this correctly.

Zone updates are already triggered by SIGUSR1 sent from
_reload_dns_service() after every zone file write, which
causes CoreDNS to reinitialise all plugins and re-read zone
files. No periodic zone polling is needed.

Also update config/dns/Corefile to remove the stale reload 30s.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 04:33:19 -04:00
roof 9b5c2e1994 fix: ensure DNS zone changes take effect immediately on startup
Unit Tests / test (push) Successful in 7m35s
Three related issues prevented CoreDNS from serving updated zone records:

1. The `file` plugin blocks in generate_corefile() lacked a `reload`
   option, so CoreDNS never re-read zone files after they were written.
   Added `reload 30s` so zone file changes are picked up within 30s.

2. _reload_dns_service() sent SIGHUP via `docker exec ... kill -HUP 1`,
   which doesn't trigger zone reloads. Changed to SIGUSR1 via
   `docker kill --signal=SIGUSR1` (same as firewall_manager.reload_coredns).

3. _bootstrap_dns() wrote the zone file but never regenerated the
   Corefile. CoreDNS's reload plugin only fires when the Corefile
   changes, so zone records from startup were invisible until the next
   peer modification triggered apply_all_dns_rules(). Now _bootstrap_dns()
   always calls apply_all_dns_rules() after the zone write.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 03:41:19 -04:00
roof 08f46332b0 fix: add built-in service subdomains to DNS zone on startup
Unit Tests / test (push) Successful in 7m45s
_build_dns_records() only hardcoded 'api' and 'webui', relying on the
optional service registry for the rest. Built-in services (calendar,
files, mail, webdav) were never registered, so they were absent from
the zone file and tests querying webdav.<domain> via CoreDNS got
NXDOMAIN.

Add _BUILTIN_SERVICE_SUBDOMAINS constant and include those names in
every zone build. Also update _stale and apply_cell_name exclusion
sets so DDNS mode correctly removes them from the parent zone.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-07 03:14:34 -04:00
roof 65d6d07c8d fix: get_status returns actual configured WG address instead of hardcoded default
Unit Tests / test (push) Successful in 7m41s
The address field in get_status() was hardcoded to SERVER_ADDRESS
('10.0.0.1/24') regardless of what wg0.conf contains, so instances
with a non-default subnet (e.g. pic1 at 10.0.1.1/24) always reported
the wrong server IP to callers such as the e2e WG conftest fixture.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 14:48:49 -04:00
roof 568e4f9783 Fix: prevent wg0.conf truncation when remove_peer splits blocks
Unit Tests / test (push) Successful in 7m46s
_write_config() was stripping trailing newlines, causing the next
add_cell_peer() to create a single-newline separator between [Interface]
and [Peer] blocks instead of the required blank line. On the following
remove_peer() call, split('\n\n') treated both sections as one block,
matched the PublicKey filter, and wrote an empty string — destroying the
[Interface] section and reverting to the hardcoded SERVER_ADDRESS fallback.

Two-part fix:
1. _write_config() always ends content with a newline
2. remove_peer() normalises single-newline [Peer] headers to blank-line
   separators before splitting, and refuses to write if [Interface] would
   be lost

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 12:31:05 -04:00
roof 26576e1124 Fix: use domain_name (FQDN) in cell invite and conflict checks
Unit Tests / test (push) Successful in 7m39s
The GET /api/cells/invite endpoint was returning domain='pic.ngo' instead
of the full FQDN 'test5.pic.ngo' because it read _identity.domain rather
than _identity.domain_name.

Apply the same domain_name preference (domain_name || domain) to:
- routes/cells.py get_cell_invite() — the invite shown to connecting cells
- routes/cells.py update_cell_permissions() — Corefile DNS regeneration
- cell_link_manager.py _check_invite_conflicts() — incoming domain collision check
- cell_link_manager.py exchange_invites() — own invite construction

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 11:56:42 -04:00
roof 31f76c54fa Fix: use domain_name as service URL base and harden WG e2e tests
Unit Tests / test (push) Successful in 11m15s
API:
- _configured_domain() now prefers _identity.domain_name (full FQDN
  e.g. 'test5.pic.ngo') over domain ('pic.ngo'). Service URLs in
  /api/peer/services and /api/peer/dashboard now correctly return
  'calendar.test5.pic.ngo' instead of 'calendar.pic.ngo'.

WG e2e tests:
- test_api_domain_returns_json_not_webui: accept 3xx redirect as
  valid routing (Caddy redirects HTTP→HTTPS in pic_ngo mode).
- test_catchall_api_path_returns_json and test_catchall_root_serves_webui:
  skip when Caddy is in HTTPS-redirect mode — catch-all :80 block only
  exists in HTTP-mode cells (lan/local domain).
- test_http_api_domain_reaches_api: replace --dns-servers (requires
  c-ares) with dig + curl --host pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 08:40:59 -04:00
roof 463db029e1 Fix: expose listen_port in WG status API and add HTTPS DNAT to PostUp/PreDown
Unit Tests / test (push) Successful in 11m6s
Adds listen_port to /api/wireguard/status response so e2e test conftest
picks up the actual port (51821) instead of defaulting to 51820.

Extends PostUp/PreDown in generate_config to also DNAT and forward port
443 (HTTPS) through to cell-caddy — mirrors the ensure_service_dnat fix
so HTTPS works even after a WireGuard container restart without an API
restart. Updates _is_dnat_rule to recognize 443 rules.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 07:42:49 -04:00
roof 8da711e366 fix: DNAT and forward port 443 (HTTPS) to Caddy from WireGuard peers
Unit Tests / test (push) Successful in 11m9s
ensure_service_dnat() only wired port 80 → cell-caddy, so HTTPS was
silently dropped: no DNAT rule redirected 443 to the Caddy container,
and the FORWARD chain had no ACCEPT for dport 443. Refactored the
function to loop over both 80 and 443 so both are DNAT'd and forwarded.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 07:14:55 -04:00
roof f84f16fcd6 fix: add /api/network/dns/corefile endpoint and per-line iptables check
Unit Tests / test (push) Successful in 11m13s
The e2e tests were reading a stale Corefile at a hardcoded fallback path
(/home/roof/pic/config/dns/Corefile) instead of the live one written by
the API (/opt/pic/config/dns/Corefile on pic1). Adding a proper API
endpoint eliminates the path ambiguity.

The iptables test was checking whether peer_ip, DROP, and dpt:80 appeared
anywhere in the full multi-line output rather than on the same rule line,
producing false positives. Now checks per line.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 05:54:17 -04:00
roof eee0e800aa feat: add GET /api/peers/<peer_name> endpoint
Unit Tests / test (push) Successful in 11m19s
Allows fetching a single peer by name. E2E tests need this to verify
persisted peer state after PUT operations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 05:19:10 -04:00
roof 39c59fd3ef feat: WireGuard endpoint override + fix Docker network label issue
Unit Tests / test (push) Successful in 11m14s
Endpoint override:
- Add PUT /api/wireguard/endpoint to set endpoint_override in identity
  config; GET returns detected, override, and effective endpoints
- _effective_endpoint() helper applies override in peer config generation
  (wireguard.py and peer_dashboard.py); detected IP still shown in UI
- Add Endpoint Override input in WireGuard page — solves the common case
  where auto-detected IP is a gateway/VPS but peers connect via LAN IP

Docker cell-network fix:
- Declare cell-network external in docker-compose.yml; Docker Compose v5
  enforces label ownership and rejects networks created by older versions
- Makefile start/update pre-create cell-network idempotently
- reinstall/uninstall(full) explicitly delete and recreate the network
- Fix uninstall loop path: data/api/services/ (not data/services/)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 04:51:38 -04:00
roof f3737acfa4 fix: fall back to cell effective domain when email service domain not configured
Unit Tests / test (push) Successful in 11m10s
When the email store service is installed but no explicit domain has been
set in its config, _provision_email now falls back to
config_manager.get_effective_domain() so peer account creation works
immediately without requiring a separate config step.

Also threads config_manager into AccountManager.__init__ (optional kwarg,
no existing callers break) so the fallback is available without a global
import.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-05 17:06:51 -04:00
roof 0267dce73d feat: HTTPS cert status, IDENTITY_CHANGED wiring, remove stale ip_utils Caddyfile writes
Unit Tests / test (push) Successful in 11m18s
- CaddyManager: add refresh_cert_status() and get_cert_status_fresh() that
  open a live TLS connection to cell-caddy:443 to read cert expiry; avoids
  needing a volume mount into the API container
- CaddyManager: periodic cert refresh in health_monitor_loop (every 60 cycles)
- config.py PUT /api/ddns: publish IDENTITY_CHANGED so CaddyManager regenerates
  the Caddyfile immediately after any domain/cell_name change — previously the
  event was never fired from this route
- config.py: remove all ip_utils.write_caddyfile() calls; CaddyManager is now
  the sole authority for Caddyfile generation
- app.py: add GET /api/caddy/cert-status route
- app.py: add GET /api/egress/status and PUT /api/egress/services/<id>/exit routes
- Settings.jsx: display cert status badge (valid/expired/internal/unknown) with
  expiry date and days-remaining in the domain section
- Tests: TestRefreshCertStatus (8 tests), TestDdnsConfigUpdatesFiresIdentityChanged,
  TestCaddyCertStatusRoute added; fix expired-cert helper to set not_valid_before
  relative to expiry so it's always earlier

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-05 11:39:36 -04:00
roof 41d09c598b wire: AccountManager HTTP dispatch + EgressManager startup + egress API routes
Unit Tests / test (push) Successful in 11m15s
- add_peer() now calls account_manager.provision() for any installed store
  service whose manifest declares accounts.manager == 'http', enabling
  per-peer credential provisioning to third-party HTTP services
- reapply_on_startup() calls egress_manager.apply_all() so fwmark rules
  survive container restarts without manual intervention
- add GET /api/egress/status and PUT /api/egress/services/<id>/exit routes
  so the UI can read and override per-service egress policy
- tests: HTTP provision wiring (happy path + non-fatal failure), egress
  apply_all at startup (wired/unwired/failure cases)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-05 10:30:41 -04:00
roof a906c26b5d fix: resolve Caddy env vars at write time to prevent parse errors
Unit Tests / test (push) Successful in 11m25s
acme_ca and the pic_ngo DNS credentials ({$PIC_NGO_DDNS_TOKEN},
{$PIC_NGO_DDNS_API}) were written as Caddy env-var placeholders, but the
Caddy container does not inherit the API container's environment, so the
substitutions always failed — Caddy saw bare directive names with no
arguments and rejected the Caddyfile.

- _global_acme_block: only emit the acme_ca directive when ACME_CA_URL is
  actually set; omitting it makes Caddy default to Let's Encrypt production.
- _caddyfile_pic_ngo: embed the DDNS_TOTP_SECRET and DDNS_URL values directly
  into the Caddyfile at write time rather than relying on Caddy env expansion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 15:01:15 -04:00
roof e87022dc55 fix: cell-network name, install error surfacing, health history cleanup
Unit Tests / test (push) Successful in 11m22s
- docker-compose.services.yml: change external network name from
  pic_cell-network to cell-network so store-service compose files can find
  it.  The project-prefixed name was overriding the explicit name: cell-network
  fix in docker-compose.yml when both files were merged by make start.

- service_store.py: normalize docker compose stderr into the error key in
  the 400 response so the Store page shows the actual failure reason instead
  of the generic fallback message.

- app.py: skip health checks for email/calendar/files managers when those
  optional store services are not installed — prevents false Down alerts and
  unnecessary noise in health history.

- Logs.jsx: remove Email/Calendar/Files columns from the health history table;
  they are optional store services, not core builtins that should always appear.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 14:28:46 -04:00
roof 7d5c5421f1 Implement connectivity store services (wireguard-ext, openvpn-client, tor)
Unit Tests / test (push) Successful in 11m31s
- ConnectivityManager: move config dirs to data_dir/services/<id>/config so
  Docker can bind-mount them into store-service containers (Docker resolves
  bind-mount paths on the host, not inside the API container).  Add
  _migrate_legacy_configs to copy existing files from the old config_dir
  location on first boot.

- manifest_validator: add allow_host_network parameter to
  validate_rendered_compose.  When True, waives the external-network
  requirement, permits network_mode: host, and allows devices: — all needed
  by VPN/Tor containers that must share the host network namespace to create
  tun/wg interfaces.  Non-host services are unaffected.

- service_composer: read requires_host_network from the manifest and pass
  allow_host_network=True to validate_rendered_compose for connectivity
  services.

- Tests: update file-path assertions to new data_dir layout; add
  TestMigrateLegacyConfigs, TestValidateRenderedComposeHostNetwork, and
  two TestWriteCompose cases for the host-network path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 10:06:48 -04:00
roof f7bb2cc962 fix: allow first-party store service subdomains and registry images
Unit Tests / test (push) Successful in 11m25s
Two manifest validation bugs blocked all store service installs:

1. service_store_manager.RESERVED_SUBDOMAINS included 'mail', which
   prevented the email service from using its required subdomain.
   Removed mail/calendar/files/webmail — they belong to official PIC
   store services and must be claimable by them.

2. manifest_validator required @sha256 digest pins on ALL images,
   including first-party git.pic.ngo/roof/* images that the PIC team
   builds and controls. service_store_manager._validate_manifest already
   only warned for first-party images; the secondary validator was
   stricter than intended, causing a hard reject on :latest tags.
   Aligned to warn-not-reject for first-party; malformed digests (when
   provided) are still a hard error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 03:09:41 -04:00