1. caddy_manager: embed ddns.token (registration bearer token) in
Caddyfile, not DDNS_TOTP_SECRET. The pic_ngo plugin sends the token
to POST /api/v1/dns-challenge; using the TOTP secret caused 401 on
every attempt.
2. firewall_manager: add _acme-challenge.<zone> forwarding block before
each split-horizon zone in the Corefile. Without this, CoreDNS was
authoritative for the challenge name and returned NODATA for TXT
queries (wildcard A record matches but wrong type), blocking Caddy's
internal DNS pre-verification step.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- DNS (critical): add _configured_dns_params() that returns (primary_domain,
split_horizon_zones) from config_manager so all apply_all_dns_rules() callers
pass the correct primary zone (e.g. 'pic.ngo') and split-horizon list
(e.g. ['pic1.pic.ngo']) instead of the FQDN as the primary — fixes
DNS_PROBE_FINISHED_BAD_CONFIG for all external domains when on VPN
- firewall_manager: add split_horizon_zones param to apply_all_dns_rules()
and forward it to generate_corefile()
- Peers: filter service_access list to installed services only; peers.py
derives valid services from config_manager.get_installed_services() with
the email→mail ID mapping; Peers.jsx fetches from /api/store/installed
and filters the checkboxes and defaults accordingly
- Health check: fix file_manager→'files' ID mapping so files service health
is checked when installed (was silently skipped due to 'file' vs 'files')
- Verbosity persistence: move log_levels.json from non-mounted
/app/api/config/ to CONFIG_DIR (/app/config/) which maps to config/api/
on the host; both load (managers.py) and save (routes/services.py) updated
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CoreDNS 1.14.3 returns REFUSED for all zones that use
'file /data/zone reload 30s' — the reload timer defers the
initial zone load, causing the plugin to return REFUSED until
the timer fires. The timer never resolves this correctly.
Zone updates are already triggered by SIGUSR1 sent from
_reload_dns_service() after every zone file write, which
causes CoreDNS to reinitialise all plugins and re-read zone
files. No periodic zone polling is needed.
Also update config/dns/Corefile to remove the stale reload 30s.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three related issues prevented CoreDNS from serving updated zone records:
1. The `file` plugin blocks in generate_corefile() lacked a `reload`
option, so CoreDNS never re-read zone files after they were written.
Added `reload 30s` so zone file changes are picked up within 30s.
2. _reload_dns_service() sent SIGHUP via `docker exec ... kill -HUP 1`,
which doesn't trigger zone reloads. Changed to SIGUSR1 via
`docker kill --signal=SIGUSR1` (same as firewall_manager.reload_coredns).
3. _bootstrap_dns() wrote the zone file but never regenerated the
Corefile. CoreDNS's reload plugin only fires when the Corefile
changes, so zone records from startup were invisible until the next
peer modification triggered apply_all_dns_rules(). Now _bootstrap_dns()
always calls apply_all_dns_rules() after the zone write.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ensure_service_dnat() only wired port 80 → cell-caddy, so HTTPS was
silently dropped: no DNAT rule redirected 443 to the Caddy container,
and the FORWARD chain had no ACCEPT for dport 443. Refactored the
function to loop over both 80 and 443 so both are DNAT'd and forwarded.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
In DDNS modes (pic_ngo, cloudflare, duckdns, http01), all built-in
services are now reachable as subdomains of the cell domain, e.g.
calendar.pic1.pic.ngo instead of pic1.pic.ngo/calendar.
Key changes:
- CaddyManager._build_core_service_routes(): new helper generates
Caddy named-matcher host blocks for calendar, mail/webmail, files,
webdav, and api subdomains within the wildcard TLS server block.
- All ACME modes (pic_ngo, cloudflare, duckdns) use the new
subdomain matchers; http01 emits a dedicated server block per service.
- http01: installed store-plugin services whose name clashes with a
core service are skipped to prevent duplicate server blocks.
- routes/config.py: ip_utils.write_caddyfile() is skipped in non-LAN
modes so LAN Caddy config never overwrites the ACME config.
- firewall_manager.generate_corefile(): new split_horizon_zones param
adds local authoritative file zones so LAN clients resolve
*.pic1.pic.ngo to the internal Caddy IP without hairpin NAT.
- NetworkManager.update_split_horizon_zone(): writes the wildcard zone
file and regenerates the Corefile with the split-horizon block;
called automatically after every identity change in non-LAN mode.
- Added @ to allowed record-name chars in update_dns_zone validation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Concurrent callers (health monitor + startup) could both pass the
delete-all loop and each insert a copy, producing duplicate
ESTABLISHED,RELATED rules. Lock serialises all calls.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pic1 ships alpine but not busybox; ensure_cell_subnet_routes() now uses
the alpine image so route injection works on all cells.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- firewall_manager: add _get_wg_server_ip() helper; scope ensure_cell_api_dnat(),
ensure_dns_dnat(), ensure_service_dnat() DNAT rules with -d server_ip; add
ensure_wg_masquerade() (Docker→wg0 MASQUERADE+FORWARD) and
ensure_cell_subnet_routes() (host routes via docker run busybox)
- wireguard_manager: scope PostUp DNAT rules with -d server_ip in generate_config()
and ensure_postup_dnat(); add Docker→wg0 MASQUERADE+FORWARD rules
- app.py: call ensure_wg_masquerade() and ensure_cell_subnet_routes() in
_apply_startup_enforcement()
- tests/test_firewall_manager.py: mock _get_wg_server_ip, add
test_dnat_is_scoped_to_server_ip and test_returns_false_when_wg_server_ip_not_found
- tests/e2e/wg/test_cell_to_cell_routing.py: rewrite to use dynamic config
(no hardcoded IPs/ports), add latency and domain access tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cell catch-all DROP rule blocked all traffic from a connected cell's
subnet, including ESTABLISHED/RELATED packets (ICMP replies, TCP ACKs) for
connections initiated by local VPN peers. This broke ping to the remote
cell's WireGuard IP even when the cell-to-cell tunnel was healthy.
Change the DROP to match only NEW,INVALID connections so established reply
traffic passes through to the stateful ACCEPT rule.
Also adds tests/e2e/wg/test_cell_to_cell_routing.py — an end-to-end test
that brings up a real WireGuard tunnel from the test runner to pic1 and
verifies full cross-cell routing including ICMP ping, API /health, and Caddy.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
apply_cell_rules drops all traffic from a cell's subnet except specific
service ports. This also drops ICMP replies and TCP ACKs for connections
initiated by local peers to the connected cell, breaking cross-cell
routing (ping to 10.0.0.1 silently dropped by test's cell DROP rule).
Fix: ensure_forward_stateful() inserts a stateful ESTABLISHED,RELATED
ACCEPT at the top of FORWARD. Called from apply_cell_rules (every cell
add/update) and from _apply_startup_enforcement. Idempotent.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three related fixes for split-tunnel peers that need to reach connected cells:
1. apply_peer_rules/apply_all_peer_rules now accept wg_subnet (actual local VPN
subnet) and cell_subnets (connected cells' vpn_subnets) parameters instead of
hardcoding 10.0.0.0/24. All callers (startup, add_peer, update_peer,
apply-enforcement endpoint) pass the real values.
2. Explicit ACCEPT rules are inserted in FORWARD for each connected cell's
subnet so split-tunnel peers (internet_access=False) can still reach
connected cells via the wg0→wg0 path.
3. apply_ip_range in network_manager now loads cell_links.json and passes it
to generate_corefile(), fixing a race where the bootstrap DNS thread could
overwrite the Corefile and wipe cross-cell DNS forwarding zones on startup.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
**PIC UI always accessible (service_access=[])**
Remove the per-peer Caddy:80 ACCEPT/DROP rule from apply_peer_rules.
Service access was enforced at two layers (iptables DROP + CoreDNS ACL),
but the iptables layer also blocked the PIC web UI served through Caddy.
CoreDNS ACL alone is sufficient — DNS blocks service hostnames; the UI
path through Caddy remains reachable regardless of service_access value.
**Exit-relay internet routing (route_via another cell)**
update_peer_ip validated new_ip as a single ip_network, rejecting the
comma-separated '10.0.1.0/24, 0.0.0.0/0' string passed by
update_cell_peer_allowed_ips(add_default_route=True). The AllowedIPs
in wg0.conf was never updated, so WireGuard never routed internet traffic
through the exit cell's tunnel. Fix: validate each CIDR individually and
apply the change live via wg set without a container restart.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- _build_acl_block: put all blocked IPs for a service in ONE acl block instead
of one block per peer — the first block's allow-all was silently granting
access to every peer after the first blocked one (first-match semantics)
- generate_corefile: add 'reload' plugin so SIGUSR1 triggers Corefile reload
in newer CoreDNS builds (without it the signal was a no-op)
- tests/test_firewall_manager.py: new tests for single merged ACL block and
the reload directive
- tests/e2e/api/test_peer_access_update.py: e2e tests for service_access,
internet_access, and peer_access updates persisting live to iptables/CoreDNS
- tests/e2e/api/test_cell_to_cell.py: e2e tests for cell-to-cell connection
management, permissions API, and cross-cell service access restrictions
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DNS A records now return the WireGuard server IP (10.0.0.1) instead of
Docker bridge VIPs so cross-cell peers resolve service names correctly
regardless of their bridge subnet. DNAT rules (wg0:53→cell-dns:53 and
wg0:80→cell-caddy:80) are applied at startup. Caddy routes by Host header,
eliminating the Docker bridge subnet conflict. Firewall cell rules allow
DNS and service (Caddy) traffic from linked cell subnets. Split-tunnel
AllowedIPs now dynamically includes connected-cell VPN subnets and drops
the 172.20.0.0/16 range. Peers with route_via set now receive full-tunnel
config (0.0.0.0/0) so all their traffic exits via the remote cell.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The PostUp rule appended `iptables -A FORWARD -i wg0 -j ACCEPT` which
allowed any WireGuard-connected client full internet access regardless of
per-peer rules, even when no peers were configured in wg0.conf.
Fix: change PostUp/PostDown to use DROP as the catch-all. Per-peer and
per-cell rules use -I (insert at top) so they take precedence; unknown
or unconfigured WG traffic hits the DROP at the bottom.
Also add reconcile_stale_peer_rules() called on startup to remove FORWARD
rules for peer IPs that no longer exist in the registry, preventing deleted
peers from retaining firewall access across container restarts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the ability to route a specific peer's internet traffic through a
connected cell acting as an exit relay.
Cell A side:
- PUT /api/peers/<peer>/route-via {"via_cell": "cellB"} sets route_via
- Updates WG AllowedIPs to include 0.0.0.0/0 for the exit cell peer
- Adds ip rule + ip route in policy table inside cell-wireguard so the
specific peer's traffic egresses via cellB's WG IP
- Sets exit_relay_active on the cell link and pushes use_as_exit_relay=True
to cellB via peer-sync
Cell B side:
- Receives use_as_exit_relay in the peer-sync payload
- Calls apply_cell_rules(..., exit_relay=True) to add FORWARD -o eth0 ACCEPT
- Stores remote_exit_relay_active flag for startup recovery
Startup recovery:
- apply_all_cell_rules passes exit_relay=remote_exit_relay_active (cellB)
- _apply_startup_enforcement reapplies ip rule for each peer with route_via (cellA)
since policy routing rules don't survive container restart
peer_registry gets route_via field with lazy migration.
22 new tests across test_cell_link_manager, test_peer_registry, test_peer_route_via.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The per-cell catch-all DROP was reaching position 5 before our ACCEPT
(position 6) because apply_all_cell_rules can re-run after
ensure_cell_api_dnat, pushing the DNAT ACCEPT below the DROP.
Fix: add the API-sync ACCEPT inside apply_cell_rules itself, tagged with
the cell's own tag and inserted LAST (= position 1, above the DROP).
Since it's part of the cell's rule block it is always in the right
position relative to the catch-all DROP, regardless of call order.
Also adds _get_cell_api_ip() helper (docker inspect cell-api) so the
destination IP is always current, and two new tests that verify both the
rule exists and that the insertion order guarantees it wins over DROP.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cell-api has no route to remote WG tunnel IPs — only cell-wireguard does.
Fix _push_permissions_to_remote() to use 'docker exec cell-wireguard curl'
so outbound sync HTTP traverses the WG tunnel from the right namespace.
On the receive side, add ensure_cell_api_dnat() which installs three
iptables rules inside cell-wireguard on startup:
- PREROUTING DNAT: wg0:3000 → cell-api:3000 (Docker bridge IP)
- POSTROUTING MASQUERADE: so cell-api's reply routes back via wg0
- FORWARD ACCEPT: allow the wg0→eth0 forwarded traffic
Called from _apply_startup_enforcement() so rules survive container restarts.
Tests updated to mock subprocess.run instead of urllib.request.urlopen.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 1 — connection fixes:
- routing_manager.stop(): remove iptables -F / -t nat -F nuclear flush that
would wipe WireGuard MASQUERADE and all peer rules on any UI stop action
- wireguard_manager.add_cell_peer(): reject vpn_subnet that overlaps the local
WG network (routing blackhole — was the root cause of no handshake)
- wireguard_manager._syncconf(): pass Endpoint to 'wg set' so cell peers with
static endpoints are synced to the kernel (not just AllowedIPs)
Phase 2 — service-sharing permissions backend:
- firewall_manager: add _cell_tag(), clear_cell_rules(), apply_cell_rules(),
apply_all_cell_rules() — iptables FORWARD rules for cell-to-cell traffic
using 'pic-cell-<name>' comment tags, distinct from 'pic-peer-*'
- app.py startup enforcement: call apply_all_cell_rules(cell_links) so rules
survive API restarts
- cell_link_manager: permissions schema {inbound, outbound} per service;
lazy migration for existing entries; update_permissions(), get_permissions();
apply_cell_rules wired into add_connection/remove_connection
- routes/cells.py: GET /api/cells/services, GET+PUT /api/cells/<n>/permissions;
RuntimeError now returns 400 (not 500) from add_connection
Removed broken 'test' cell (subnet 10.0.0.0/24 collided with local WG network).
Second PIC must use a distinct subnet (e.g. 10.0.1.0/24) before reconnecting.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sprint 1 — Security & correctness:
- Restore all 10 commented-out is_local_request() checks (vault, containers, images, volumes)
- Fix XFF spoofing: only trust the LAST X-Forwarded-For entry (Caddy's append), not all
- Require prefix length in wireguard.address (was accepting bare IPs like 10.0.0.1)
- Validate service_access list in add_peer (valid: calendar/files/mail/webdav)
- Fix dhcp/reservations POST/DELETE: unpack mac/ip/hostname from body (was passing dict as positional arg)
- Fix network/test POST: remove spurious data arg (test_connectivity takes no args)
- Fix remove_peer: clear iptables rules and regenerate DNS ACLs on deletion (was leaving stale rules)
- Fix CoreDNS reload: SIGHUP → SIGUSR1 (SIGHUP kills the process; SIGUSR1 triggers reload plugin)
- Remove local.{domain} block from Corefile template (local.zone doesn't exist, caused log spam)
- Fix routing_manager._remove_nat_rule: targeted -D instead of flushing entire POSTROUTING chain
Sprint 2 — State consistency:
- Atomic config writes in config_manager, ip_utils, firewall_manager, network_manager
(write to .tmp → fsync → os.replace, prevents truncated files on kill)
- backup_config: now also backs up Caddyfile, Corefile, .env, DNS zone files
- restore_config: restores all of the above so config stays consistent after restore
Sprint 3 — Dead code / documentation:
- Remove CellManager instantiation from app startup (was never called, double-instantiated all managers)
- Document routing_manager scope (targets host, not cell-wireguard; methods not called by any active route)
Sprint 4 — Test infrastructure:
- Add tests/conftest.py with shared tmp_dir, tmp_config_dir, tmp_data_dir, flask_client fixtures
- Add tests/test_config_validation.py: 400 paths for ip_range, port, wireguard.address validation
- Add tests/test_ip_utils_caddyfile.py: 14 tests for write_caddyfile (was completely untested)
- Expand test_app_misc.py: 7 new is_local_request tests covering XFF spoofing and cell-network IPs
- Add --cov-fail-under=70 to make test-coverage
- Add pre-commit hook that runs pytest before every commit
414 tests pass (was 372).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs caused DNS to fail when the domain name changes:
1. generate_corefile() hardcoded 'cell' as the zone name instead of
using the configured domain — on startup it would silently reset any
domain change back to 'cell'
2. apply_domain() regex replaced ALL non-dot zones (including local.cell)
with the new domain → duplicate zone blocks → CoreDNS crash
Fix: add a domain parameter to generate_corefile/apply_all_dns_rules,
add _configured_domain() helper in app.py, and delegate Corefile updates
in apply_domain() to generate_corefile() so the logic is in one place.
Also parameterise SERVICE_HOSTS ACL entries via the domain argument.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When ip_range changes in Settings, the new subnet is now applied to:
- DNS zone records (network_manager.apply_ip_range)
- Caddy virtual IPs (firewall_manager.ensure_caddy_virtual_ips)
- iptables per-service rules (firewall_manager.update_service_ips)
- docker-compose.yml static IPs if writable (ip_utils.update_docker_compose_ips)
New module ip_utils.py derives all container IPs from the subnet using
fixed offsets so the entire stack stays consistent from one setting.
321 tests pass (72 new tests added for ip_utils, apply_ip_range, update_service_ips).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Server-side access control:
- firewall_manager.py: per-peer iptables FORWARD rules in WireGuard container;
virtual IPs on Caddy (172.20.0.21-24) for per-service DROP/ACCEPT targeting
- CoreDNS Corefile regenerated with ACL blocks for blocked services per peer
- POST /api/wireguard/apply-enforcement re-applies rules after WireGuard restart;
wg0.conf PostUp calls it via curl so rules restore automatically on container start
WireGuard fixes:
- _syncconf uses `wg set peer` instead of `wg syncconf` to avoid resetting ListenPort
- add_peer validates AllowedIPs must be /32 — rejects full/split tunnel CIDRs that
would route internet or LAN traffic to that peer
- _config_file() checks for linuxserver wg_confs/ subdirectory first
UI:
- Peers page fetches /api/wireguard/peers/statuses for live handshake data;
status badge now shows real Online/Offline + seconds since last handshake
- IP field removed from Add Peer form (auto-assigned from 10.0.0.0/24)
Tests (246 pass):
- test_firewall_manager.py: 22 tests for ACL generation, iptables rule correctness,
comment tagging, clear_peer_rules filter logic
- test_peer_wg_integration.py: 10 tests for /32 enforcement, IP auto-assignment,
syncconf called on add/remove
- test_wireguard_manager.py: updated to reflect correct IPs and /32 requirement
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>