docs(phase-c): add four seed ADRs for key architectural decisions

New pages:
- ADR – 001 Isolated Build and Signing Pipeline for Store Images: two-stage
  kaniko/cosign pipeline; sacrificial gitea-action1 for untrusted builds;
  trusted gitea-action0 for scan/sign/push; enforce-by-default image
  verification on cells; private key only in Gitea Secrets.

- ADR – 002 Named Connection Instances for Connectivity: replaces v1 fixed
  fwmarks (0x10–0x50) with v2 named instances; fwmark pool 0x1000–0x1FFF
  stride 0x10; routing tables from 1000; per-peer assignment by UUID;
  configurable fail-open/fail-closed; cell_relay auto-derived; v1→v2 migration
  on first get_connectivity() call. Verified against connectivity_manager.py.

- ADR – 003 All Optional Functionality Ships as Store Services: core is six
  containers only; email/calendar/files/exits ship as store services with v3
  manifests; uniform install/uninstall/backup via ServiceRegistry; account
  provisioning via AccountManager HTTP dispatch.

- ADR – 004 The Wiki Is the Single Documentation Source: repo had five markdown
  manuals that drifted; decision to move all guides/references/decisions into
  this Gitea wiki; repo keeps README.md only; version-pinned content linked at
  tag/commit, not pasted.

All ADRs verified against /home/roof/pic source before asserting code claims.
Sidebar already references all four ADRs from Phase A commit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 15:33:50 -04:00
parent d1a0c77b40
commit af809e1583
4 changed files with 229 additions and 0 deletions
@@ -0,0 +1,54 @@
> **Status:** Active | **Owner:** @roof | **Updated:** 2026-06-11
# ADR – 001 Isolated Build and Signing Pipeline for Store Images
## Context
The service store will accept community submissions. Any community-submitted pull request contains a Dockerfile that runs arbitrary code during `docker build`. Two distinct threat vectors exist:
1. **Malware in pre-built images.** A maintainer who blindly pulls and runs a pre-built image submitted by a community member has no assurance that the image matches the published source.
2. **Credential theft during build.** Building untrusted Dockerfiles on a CI runner that has access to signing keys, registry credentials, or other secrets gives the attacker direct access to those secrets via the build environment.
These two risks mean neither "trust the community image" nor "build on the trusted runner" is safe as a sole approach.
---
## Options Considered
### Option A — Mount Docker socket into the CI runner
The simplest approach: give a single CI runner both Docker build access and the ability to sign and push images. No isolation.
**Rejected because:** Docker socket access is root-equivalent on the host. A malicious Dockerfile that exploits the build to escape the container gains the runner's full credentials, including signing keys and registry push access. This is a complete supply-chain compromise.
### Option B — Rootless builders on the trusted runner
Use rootless Buildah or Podman on the trusted runner (`gitea-action0`) where secrets live, with strict process isolation but no VM boundary.
**Rejected because:** rootless builders have a history of container escapes and Linux kernel exploits. Without a VM boundary, a sophisticated Dockerfile can still exfiltrate secrets through timing attacks, shared kernel state, or sandbox escapes. The risk surface is smaller but not zero.
### Option C — Two-stage pipeline with a sacrificial VM (chosen)
Separate the build from the sign-and-push step using two distinct runner VMs:
- **`gitea-action1`** (sacrificial, no secrets): accepts pull requests from the community, builds the Docker image using kaniko (no Docker daemon, no host privileges), and pushes an unverified image to a staging area. This runner is treated as potentially hostile — it is rebuilt between jobs and never given signing keys or registry push credentials beyond the staging target.
- **`gitea-action0`** (trusted, has secrets): pulls the built image, runs a Trivy vulnerability scan, signs with cosign using the PIC signing key (stored only in Gitea Secrets on this runner), re-tags and pushes to the production registry, then writes the `@sha256:` digest back into the manifest in the `pic-services` repository.
Cells bundle the cosign public key and verify every store image before starting its containers. Rollout used warn-by-default first; the default was flipped to enforce once all existing store images had been signed.
---
## Decision
We use the two-stage pipeline (Option C). Untrusted Dockerfiles are built in an isolated, no-secrets kaniko environment on `gitea-action1`. The resulting image is then pulled, scanned, signed, and published by the trusted `gitea-action0`. Cells reject unsigned or undigested images by default (`image_verification.mode` set to `enforce`).
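
The cell-side decision can be sketched roughly as follows. This is a minimal illustration, not the PIC implementation: the `Mode` enum, `is_digest_pinned`, and `may_start` are hypothetical names, and `signature_valid` stands in for the result of a cosign verification performed elsewhere.

```python
import re
from enum import Enum


class Mode(Enum):
    """The three verification modes named in this ADR."""
    ENFORCE = "enforce"
    WARN = "warn"
    OFF = "off"


# A digest-pinned reference ends in "@sha256:" plus 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    return bool(_DIGEST_RE.search(image_ref))


def may_start(image_ref: str, signature_valid: bool, mode: Mode) -> tuple[bool, str]:
    """Return (allowed, reason) for one store image (hypothetical helper)."""
    if mode is Mode.OFF:
        return True, "verification disabled"
    problems = []
    if not is_digest_pinned(image_ref):
        problems.append("image is not pinned to a digest")
    if not signature_valid:
        problems.append("signature did not verify")
    if not problems:
        return True, "verified"
    reason = "; ".join(problems)
    if mode is Mode.WARN:
        # warn mode logs the problem but still lets the container start
        return True, "warning: " + reason
    return False, "rejected: " + reason
```

Under these assumptions, a mutable tag such as `registry.example/pic/mail:latest` is rejected in `enforce` mode even when its signature verifies, because it is not pinned to a digest.
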
---
## Consequences
- **Unsigned or undigested images refuse to install** on any cell running the default `enforce` mode. This blocks malware injected between the community's source and the image pull.
- **Service developers must disable or downgrade verification locally.** During development, set `image_verification.mode` to `warn` or `off` in `config/api/cell_config.json` (then restart the API container) to run unsigned local images. See [[Dev – Build a Store Service]].
- **The cosign private key lives only in Gitea Secrets on `gitea-action0`.** It is never on disk on `gitea-action1` or on any cell. The public key is bundled with PIC and used only for verification.
- **Trivy scan is a gate, not an alert.** A failed scan blocks the sign-and-push step. A scan fails when it reports at least one HIGH or CRITICAL severity CVE for which a fixed version is available.
- **`gitea-action1` is disposable.** It has no persistent state worth protecting. If it is compromised during a build, the worst outcome is a poisoned staging image, which must still pass the Trivy scan gate on `gitea-action0` before it is signed.
- **Manifest digests are written back by CI.** The `image` field in a merged manifest is always a `@sha256:` digest, not a mutable tag. This prevents tag-redirect attacks after the manifest is merged.
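
The scan gate described in the list above could be expressed along these lines. This is a sketch that assumes the scan is read from Trivy's JSON report, with a top-level `Results` list whose entries carry `Vulnerabilities` with `VulnerabilityID`, `Severity`, and `FixedVersion` fields. The function names are hypothetical and do not come from the PIC pipeline.

```python
BLOCKING_SEVERITIES = {"HIGH", "CRITICAL"}


def blocking_findings(report: dict) -> list[str]:
    """Return IDs of HIGH/CRITICAL findings that have a fixed version available."""
    found = []
    for result in report.get("Results") or []:
        # "Vulnerabilities" may be missing or null for clean targets
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING_SEVERITIES and vuln.get("FixedVersion"):
                found.append(vuln.get("VulnerabilityID", "unknown"))
    return found


def gate_passes(report: dict) -> bool:
    """The sign-and-push step may run only when nothing blocks."""
    return not blocking_findings(report)
```

With this rule, a CRITICAL finding that has no fixed version does not block the gate, and neither does a LOW finding with a fix. Only the combination named in the Consequences list does.
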
@@ -0,0 +1,57 @@
> **Status:** Active | **Owner:** @roof | **Updated:** 2026-06-11
# ADR – 002 Named Connection Instances for Connectivity
## Context
The v1 connectivity model allocated one exit per exit type with fixed, hardcoded fwmarks (`wireguard_ext` → `0x10`, `openvpn` → `0x20`, `tor` → `0x30`, `sshuttle` → `0x40`, `proxy` → `0x50`) and fixed routing tables (`110`–`150`). This meant:
- You could not run two WireGuard providers simultaneously (for example, a personal VPN and a work VPN).
- You could not assign different peers to different exits of the same type.
- Adding a new peer to a connection required touching the global exit config rather than a per-peer record.
- UI navigation for connectivity was a single flat page with one configuration slot per type.
As the number of use-cases for connectivity grew (privacy routing for some peers, geographic routing for others, per-service egress), the one-per-type limit became a blocker.
---
## Options Considered
### Option A — Keep v1, add per-peer overrides
Extend the one-per-type model with a per-peer config field that allows overriding the type's single global config. For example, `peer.wireguard_ext_config` could point to an alternate server.
**Rejected because:** this still limits you to one WireGuard interface per type globally. Two peers needing different WireGuard upstreams would share one interface with multiplexed config, creating routing conflicts. The data model also becomes inconsistent — exit config partly global, partly per-peer.
### Option B — Named connection instances (chosen)
Replace the fixed-per-type model with N named connection instances. Each instance is identified by a UUID, has a human label, and is allocated its own fwmark and routing table from a pool. Peer assignment references the instance UUID, not the type.
---
## Decision
We adopted the named connection instance model (v2). The key decisions within that model:
- **fwmark pool:** `0x1000`–`0x1FFF`, stride `0x10` (allows up to 256 simultaneous instances). The old v1 marks (`0x10`–`0x50`) lie outside the pool and are never re-allocated, which avoids conflicts on upgraded cells.
- **Routing table pool:** starting from `1000`, one table per instance.
- **Per-instance exit containers:** each `wireguard_ext` or `openvpn` instance gets a dedicated interface name (`wgext_<suffix>` or `ovpn_<suffix>`). Redirect-type instances (tor, sshuttle, proxy) get a dedicated redirect port from `9100`–`9199`.
- **Per-peer assignment:** each peer record stores `exit_connection_id` (UUID of the assigned instance). The legacy `route_via` field is kept in sync for backward compatibility.
- **Fail-open/fail-closed:** configurable per-instance with peer-level override. Default is fail-closed for all types except Tor (where fail-open is the default, because the typical Tor user prefers degraded-but-connected over completely blocked).
- **Per-connection health probes:** each instance has its own health check mechanism appropriate to its type. Results are cached for 30 seconds.
- **`cell_relay` auto-derived from cell links:** when a linked remote cell advertises an internet exit, a `cell_relay` connection instance appears automatically without admin configuration. These are fail-closed by default.
- **UI moved to sub-pages:** `/connectivity` became a connection list, with each connection's peers and config on a sub-page.
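
Two of the rules in the list above lend themselves to a short sketch: pool allocation and fail-mode resolution. The code below is illustrative only and is not taken from `connectivity_manager.py`. The function names are hypothetical, as are the `"open"` and `"closed"` string values. Pairing each fwmark slot with the routing table at the same offset is an assumption, as is the override precedence of peer over instance over type default.

```python
from typing import Optional

FWMARK_START, FWMARK_END, FWMARK_STRIDE = 0x1000, 0x1FFF, 0x10
TABLE_START = 1000
FAIL_OPEN_TYPES = {"tor"}  # every other type defaults to fail-closed

# Number of stride-aligned marks that fit in the pool.
POOL_SIZE = len(range(FWMARK_START, FWMARK_END + 1, FWMARK_STRIDE))


def allocate(used_marks: set[int]) -> tuple[int, int]:
    """Return the first free (fwmark, routing_table) pair.

    Assumption: slot n maps to fwmark START + n * STRIDE and table 1000 + n.
    """
    marks = range(FWMARK_START, FWMARK_END + 1, FWMARK_STRIDE)
    for slot, mark in enumerate(marks):
        if mark not in used_marks:
            return mark, TABLE_START + slot
    raise RuntimeError("fwmark pool exhausted")


def effective_fail_mode(
    conn_type: str,
    instance_mode: Optional[str] = None,
    peer_override: Optional[str] = None,
) -> str:
    """Resolve fail-open vs fail-closed: peer override, then instance, then type default."""
    if peer_override is not None:
        return peer_override
    if instance_mode is not None:
        return instance_mode
    return "open" if conn_type in FAIL_OPEN_TYPES else "closed"
```

For example, `allocate(set())` yields `(0x1000, 1000)`, and `effective_fail_mode("cell_relay")` yields `"closed"`.
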
**Migration v1 → v2:** on the first call to `get_connectivity()` after upgrade, `ConnectivityManager._migrate_connectivity_v1_to_v2()` runs if the stored schema version is less than 2. It creates one named instance per previously-configured exit type, re-points vault secret references to the new `conn_<id>_<field>` naming scheme, deletes the old references, and writes the new schema version. The migration is idempotent.
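
The shape of such an idempotent migration can be sketched as follows. This is not the body of `_migrate_connectivity_v1_to_v2()`. The config layout (`exits`, `connections`) and the plain-dict vault are hypothetical stand-ins chosen only to show the version guard and the key renaming.

```python
import uuid


def migrate_v1_to_v2(config: dict, vault: dict) -> dict:
    """Convert a v1 config in place; a second call is a no-op."""
    if config.get("schema_version", 1) >= 2:
        return config  # already migrated, so the function is idempotent

    connections = {}
    for exit_type, settings in config.get("exits", {}).items():
        conn_id = uuid.uuid4().hex
        connections[conn_id] = {"type": exit_type, "label": exit_type, **settings}

        # Re-point secrets: "<type>_<field>" becomes "conn_<id>_<field>".
        prefix = exit_type + "_"
        for old_key in [k for k in vault if k.startswith(prefix)]:
            field = old_key[len(prefix):]
            vault[f"conn_{conn_id}_{field}"] = vault.pop(old_key)

    config.pop("exits", None)
    config["connections"] = connections
    config["schema_version"] = 2  # written last, after the data is in place
    return config
```

Writing the schema version last means an interrupted run is retried on the next call rather than being marked as complete.
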
---
## Consequences
- **Operators must update client configs on major upgrade** if they used connectivity in v1. The migration runs automatically but peer re-assignment to the new instance UUIDs happens in the background.
- **v1 fwmarks (`0x10`–`0x50`) are reserved** and not re-used by the v2 allocator, to avoid collisions on upgraded cells during the migration window.
- **The Tor type remains limited to one instance per cell.** The Tor container does not support multiple simultaneous SOCKS listeners with isolated routing in the current implementation. This constraint is enforced by `ConnectivityManager.create_connection()` which returns an error if a second `tor` instance is requested.
- **UI for connectivity became deeper.** Admins navigate to a connection and then to its sub-pages (Peers, Config, Health). This is more clicks for simple use-cases but necessary for the multi-instance model.
- **Vault secret naming changed.** Old keys (`wireguard_ext_conf`) became `conn_<uuid>_conf`. The migration handles existing cells; new cells only ever see the new naming.
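
The single-instance Tor constraint from the list above might look roughly like this. It is a hedged sketch: the real `ConnectivityManager.create_connection()` signature and return shape are not shown in this ADR, so the arguments and the `ok`/`error` dictionary below are assumptions.

```python
import uuid

SINGLE_INSTANCE_TYPES = {"tor"}


def create_connection(connections: dict, conn_type: str, label: str) -> dict:
    """Add a named instance, refusing a second instance of a single-instance type."""
    already_present = any(c["type"] == conn_type for c in connections.values())
    if conn_type in SINGLE_INSTANCE_TYPES and already_present:
        return {"ok": False, "error": f"only one '{conn_type}' connection is allowed per cell"}

    conn_id = str(uuid.uuid4())
    connections[conn_id] = {"type": conn_type, "label": label}
    return {"ok": True, "id": conn_id}
```

Other types are unaffected by the guard, so two `wireguard_ext` instances (for example, personal and work) can coexist.
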
Internals: see [[Dev – Architecture]] (Connectivity v2 data model section).
@@ -0,0 +1,59 @@
> **Status:** Active | **Owner:** @roof | **Updated:** 2026-06-11
# ADR – 003 All Optional Functionality Ships as Store Services
## Context
The initial PIC implementation treated email, calendar/contacts, and file storage as built-in components. Their manager classes (`EmailManager`, `CalendarManager`, `FileManager`) lived directly in `api/`, their containers were defined in `docker-compose.yml` alongside the core stack, and their API routes were always registered regardless of whether the services were wanted.
This design created several problems:
- **Bloated core.** Every PIC install ran email, calendar, and file containers even when the admin had no use for them. This consumed RAM and disk and widened the attack surface for no benefit.
- **Coupled release cycles.** A bug in the email service required a PIC core release. An improvement to file storage blocked on a full core release cycle.
- **No uniform lifecycle.** Installing email was "enable in config and restart"; installing a new service later was a different process. There was no single install/uninstall/backup abstraction.
- **Growing `docker-compose.yml`.** Every new service added to the monolith made the compose file larger and harder to reason about.
---
## Options Considered
### Option A — Keep services built-in, add feature flags
Add configuration flags (`email_enabled`, `calendar_enabled`, etc.) so admins can disable services they do not want. Containers are still defined in `docker-compose.yml` but only started if enabled.
**Rejected because:** the compose file and the core API still carry all the service code. The attack surface reduction is superficial: the code paths are present even when disabled. Coupled release cycles remain a problem.
### Option B — Git submodules for each service
Move each service to its own repository as a git submodule. The core `docker-compose.yml` includes service compose files from submodules.
**Rejected because:** git submodule workflows are operationally complex for self-hosted users. Updating a service requires updating the submodule reference in the core repo, which re-couples release cycles. Backup and account provisioning still need to be wired into the core.
### Option C — Store services with a manifest-driven lifecycle (chosen)
Define a manifest schema (`schema_version: 3`, `"kind": "store"`) that encodes everything the core needs to know about a service: where to route HTTPS, what accounts to provision, what to back up, what egress options to expose. The core service store installs any service from the catalog at runtime without a core code change or restart.
The core shrinks to exactly six containers (Caddy, CoreDNS, NTP, WireGuard, Flask API, React WebUI). Everything else — email, calendar, files, webmail, connectivity exits — ships as a store service.
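
As a rough illustration of what a manifest check could involve, the sketch below validates the two fields named above plus the six capability flags listed under Consequences. The flat placement of the flags at the top level of the manifest and the `validate_manifest` helper are assumptions. The authoritative schema is the one in [[Dev – Service Manifest Reference]].

```python
CAPABILITY_FLAGS = (
    "has_subdomain",
    "has_accounts",
    "has_admin_config",
    "has_storage",
    "has_egress",
    "has_api_hooks",
)


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the basic shape is fine."""
    errors = []
    if manifest.get("schema_version") != 3:
        errors.append("schema_version must be 3")
    if manifest.get("kind") != "store":
        errors.append('kind must be "store"')
    for flag in CAPABILITY_FLAGS:
        # A missing flag is treated as False; a present flag must be a boolean.
        if not isinstance(manifest.get(flag, False), bool):
            errors.append(f"{flag} must be a boolean")
    return errors
```

A manifest such as `{"schema_version": 3, "kind": "store", "has_subdomain": True}` passes this check, while one with `schema_version: 2` is reported.
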
---
## Decision
All optional functionality ships as a store service (Option C). The six core containers are the stable, security-critical surface; everything else installs and uninstalls independently via the service store.
The `ServiceRegistry` (`api/service_registry.py`) is the single source of truth for all installed services. `CaddyManager`, the backup system, and the peer services endpoint all read from it rather than from hardcoded lists.
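
The idea of deriving everything from one registry can be sketched with a toy in-memory version. This is not the class in `api/service_registry.py`. The method names and the manifest fields `id`, `subdomain`, and `backup_paths` are hypothetical and serve only to show consumers reading from the registry instead of hardcoded lists.

```python
class ServiceRegistry:
    """Toy registry keyed by service id (illustrative only)."""

    def __init__(self) -> None:
        self._manifests: dict[str, dict] = {}

    def install(self, manifest: dict) -> None:
        self._manifests[manifest["id"]] = manifest

    def uninstall(self, service_id: str) -> None:
        self._manifests.pop(service_id, None)

    def installed(self) -> list[dict]:
        return list(self._manifests.values())


def caddy_routes(registry: ServiceRegistry) -> dict[str, str]:
    """Map each declared subdomain to the service that owns it."""
    return {
        m["subdomain"]: m["id"]
        for m in registry.installed()
        if m.get("has_subdomain")
    }


def backup_paths(registry: ServiceRegistry) -> list[str]:
    """Collect every path that installed services declare for backup."""
    return [path for m in registry.installed() for path in m.get("backup_paths", [])]
```

Because both helpers read the registry on every call, uninstalling a service removes its route and its backup paths without touching either consumer.
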
Account provisioning for store services uses the HTTP dispatch path in `AccountManager` (`api/account_manager.py`) via `_provision_http` / `_deprovision_http`, which calls the service container's `/service-api/accounts` endpoint. The reference services (email, calendar, files) use named internal managers for backward compatibility; new store services use `"manager": "http"` in their manifest.
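
A provisioning call on the HTTP path could be built roughly as below. Only the `/service-api/accounts` path and the `"manager": "http"` value come from this ADR. The POST verb, the JSON payload, the base URL, and the helper names are assumptions, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def uses_http_dispatch(accounts_config: dict) -> bool:
    """True when the manifest's account section selects the HTTP path.

    Assumption: the "manager" key sits in an accounts section of the manifest.
    """
    return accounts_config.get("manager") == "http"


def build_provision_request(base_url: str, username: str) -> urllib.request.Request:
    """Build (but do not send) a request to the service's accounts endpoint."""
    body = json.dumps({"username": username}).encode("utf-8")  # hypothetical payload
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/service-api/accounts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumed verb; the ADR does not state it
    )
```

A caller would pass the returned object to `urllib.request.urlopen()` and treat a non-2xx response as a failed provisioning attempt.
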
---
## Consequences
- **Uniform install/uninstall/backup.** Every service follows the same lifecycle: install from the store, run via its own compose project under `data/services/<id>/`, back up via manifest-declared volume paths, uninstall with data preserved on disk. No special cases.
- **Smaller attack surface for minimal installs.** A cell that only needs VPN and DNS runs six containers. Email, calendar, and files add containers only when explicitly installed.
- **Services upgrade independently.** Bumping the email service version is a manifest version bump and a store re-install, not a PIC core release.
- **Core manager classes for email/calendar/files remain** (`EmailManager`, `CalendarManager`, `FileManager`), but they are invoked through `ServiceRegistry` rather than being always-active. They will eventually be moved into the service packages themselves as the migration matures.
- **The manifest schema (`schema_version: 3`) is a stability contract.** Breaking changes require a new schema version and a migration path. The six capability flags (`has_subdomain`, `has_accounts`, `has_admin_config`, `has_storage`, `has_egress`, `has_api_hooks`) are the extension points; adding a new integration type requires a new flag.
- **Community services must pass the signing pipeline** before cells will install them (under the default `enforce` image verification mode). See [[ADR – 001 Isolated Build and Signing Pipeline for Store Images]].
Internals: see [[Dev – Build a Store Service]] and [[Dev – Service Manifest Reference]].
@@ -0,0 +1,59 @@
> **Status:** Active | **Owner:** @roof | **Updated:** 2026-06-11
# ADR – 004 The Wiki Is the Single Documentation Source
## Context
As PIC grew, documentation accumulated in multiple places:
- `QUICKSTART.md` — install guide in the repo root
- `Personal Internet Cell – Project Wiki.md` — a monolithic ~5,000-word markdown file in the repo
- `docs/` directory — several topic-specific markdown files
- `webui/README.md` — frontend notes for developers
- `Dev-API.md` / `Dev-Architecture.md` — developer docs checked into the repo
These sources drifted rapidly from reality. Because they lived in the git repo, updating them required a code review and merge. Docs for last month's code frequently shipped alongside this month's features. Users reading the README on the repository landing page followed instructions that no longer applied.
The Gitea wiki (this wiki) also existed but was treated as optional — some topics appeared here and also in the repo, with the two copies diverging.
---
## Options Considered
### Option A — Docs-as-code in the repository
Keep all documentation in the repo, enforce updates via CI lint checks, require docs changes in the same PR as code changes.
**Rejected because:** this tightens the coupling without solving the drift problem. A CI lint check can verify that documentation exists but cannot verify that it is accurate. Requiring docs PRs for every code change slows down contributors without improving quality. The monolithic markdown file becomes a merge-conflict magnet.
### Option B — External hosted documentation (Docusaurus, Read the Docs, etc.)
Move all documentation to a dedicated docs site with versioning.
**Rejected because:** it introduces another infrastructure dependency for a project whose primary goal is minimising dependencies. It also creates a three-way split (repo, wiki, docs site) rather than resolving the two-way split that already exists.
### Option C — Gitea wiki as the single source (chosen)
Move all guides, references, and decisions into this Gitea wiki. Strip the repository down to `README.md` only (project overview and a pointer to the wiki). Content pinned to a code version (API reference, sample configs, build scripts, CHANGELOG) is linked to the repo at a tag/commit permalink rather than pasted into a wiki page.
The wiki is a git repository (`pic.wiki.git`) with its own commit history, accessible directly without going through a code review process, and editable by anyone with wiki write access. This lowers the friction for keeping docs current.
---
## Decision
This Gitea wiki is the single authoritative source for all explanations, guides, decisions, and troubleshooting. The repository's only documentation file is `README.md` (overview + pointer to the wiki). All other markdown documentation files in the repo have been deleted.
The rule is: when you merge a code change that changes user-visible behaviour, admin procedures, or the developer API, you update the wiki in the same session. The wiki must not drift from the code.
---
## Consequences
- **One place to update.** Contributors know where to look and where to write. There is no ambiguity about which copy is authoritative.
- **The wiki must be updated in the same session as the code change.** This is a discipline constraint, not a technical enforcement: the wiki is a separate git repository, and the pre-commit hook in the repo does not check it. Reviewers are responsible for flagging PRs that lack a matching wiki update.
- **Version-sensitive content is linked, not pasted.** API reference pages, manifest examples, and install scripts in the wiki link to the repository at a specific commit or tag. Pasting them into a wiki page would re-introduce drift. The `Applies to: main (YYYY-MM)` header on version-dependent pages signals when a page was last verified against the code.
- **The `Archive –` prefix marks deprecated pages.** Old pages are never deleted; they are renamed with `Archive –` and moved out of the sidebar's main sections. This preserves historical context without polluting current navigation.
- **The wiki structure uses audience prefixes** (User / Admin / Dev / ADR / Meta / Archive). This prevents documentation for one audience from being buried in material aimed at another, which was a recurring problem with the monolithic markdown file.
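
The "linked, not pasted" rule in the list above can be checked mechanically to a degree. The sketch below assumes repository links follow Gitea's `/src/branch/…`, `/src/tag/…`, and `/src/commit/…` URL layout and flags links that point at a mutable branch. The function name and the example host are hypothetical, and no such check exists in the repo today.

```python
import re

# Matches a repository file link and captures whether it targets a branch, tag, or commit.
_REPO_LINK = re.compile(
    r"https?://[^\s)]+/src/(branch|tag|commit)/[^/\s)]+(?:/[^\s)]*)?"
)


def mutable_links(page_text: str) -> list[str]:
    """Return links that target a branch rather than a tag or commit."""
    return [
        match.group(0)
        for match in _REPO_LINK.finditer(page_text)
        if match.group(1) == "branch"
    ]
```

Run over a wiki page, this would flag `https://git.example/roof/pic/src/branch/main/manifest.json` but accept the same path under `/src/tag/v1.2.0/`.
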
See also [[ADR – 003 All Optional Functionality Ships as Store Services]] for the related decision that removed service documentation from the repo.