Disaster Recovery Concept

Status: Single-tenant recovery LIVE since 2026-05-02. Restore via the Backups page or the tenant board card → ⚡ button → restored within 30 s. Implementation details are in session-recap-2026-05-02.md, blocks C+D.

Current implementation:

Tenant-box snapshots in Hetzner Object Storage Frankfurt (AES-256-CBC, per-tenant key via HKDF)
Daily backup cron + 30-day retention
Restore in <30 s with SHA-256 verification (verified with the demo3 restore)
Cross-host recovery via the tenant-migration pipeline (restore with targetHostId)
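
The per-tenant key derivation can be illustrated with Node's built-in HKDF. This is a minimal sketch, not the live implementation: the source only states "AES-256-CBC, per-tenant key via HKDF", so the hash (SHA-256), the empty salt, the info label `prilog-backup:<slug>`, and the function name `deriveTenantKey` are all assumptions.

```typescript
import { hkdfSync } from "node:crypto";

// Hypothetical helper: derives a 32-byte (AES-256) key for one tenant from a
// single master key. Hash, salt, and label format are assumptions.
export function deriveTenantKey(masterKey: Buffer, tenantSlug: string): Buffer {
  const salt = Buffer.alloc(0); // assumption: no salt configured
  const info = Buffer.from(`prilog-backup:${tenantSlug}`, "utf8");
  // hkdfSync returns an ArrayBuffer; wrap it in a Buffer for convenience.
  return Buffer.from(hkdfSync("sha256", masterKey, salt, info, 32));
}
```

Because the derivation is deterministic, a restore needs only the master key and the tenant slug. That is also why a lost master key would make every snapshot unreadable.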

Still open:

L2/L3 (monthly/yearly) with Object Lock for ransomware protection
Restore-drill cron with a sandbox host (proactively find backup corruption)
Multi-region replication (Frankfurt → Helsinki)
Master-key escrow (currently only in the backend .env, no vault backup)

Goal: For any realistic server failure, the affected tenant can be restored with one click in the admin portal: onto a fresh host, with no SSH sessions and no manual tinkering. A tenant restored onto a new host is reachable again within 15-20 minutes.

SLA targets

Metric	Target	Reality check
RPO (max. data loss)	24h (daily) → 1h (hourly opt-in)	"How much do we lose?"
RTO (max. downtime)	< 25 min (auto-provision 10 + restore 10 + DNS 5)	"How fast are we live again?"
Backup success rate	> 99 %	Health check + alert after > 24h without a successful backup
Restore-test success rate	> 95 % (monthly random drill)	"Does restore really work?"

Threat model: what we protect against

Scenario	Likelihood	Protected today?	With DR concept
Disk failure on shared host	medium	❌	✅ Restore onto a new host
Hetzner DC outage (FSN1 down)	low	❌	✅ Restore in another location
Accidental tenant deletion	medium	❌	✅ Point-in-time restore
DB corruption (software bug)	low	❌	✅ Restore previous snapshot
Ransomware on host	very low	❌	✅ Off-host backup not reachable from the host
Hetzner account compromised	very low	❌	partial (backup bucket protected separately)
Code bug destroys data	low	❌	✅ Rollback to the prior state

Architecture: three layers

┌─────────────────────────────────────────────────────────────────┐
│  Layer 3 — Restore-Engine + Admin-UI                            │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │ admin.prilog.chat/recover/<tenant>                         │ │
│  │  → Wizard: pick date, pick target, "Restore"               │ │
│  │  → Auto-provision (if new host) + restore + DNS            │ │
│  └────────────────────────────────────────────────────────────┘ │
│                              ▲                                  │
└──────────────────────────────│──────────────────────────────────┘
                               │
┌──────────────────────────────│──────────────────────────────────┐
│  Layer 2 — Backup-Inventory + Monitoring                        │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │ Table: backup_inventory                                    │ │
│  │  (tenantId, date, bundleUrl, sizeBytes, checksum, status)  │ │
│  │  → Health check: alert if > 24h without backup             │ │
│  │  → Restore drill: monthly test of a random tenant          │ │
│  └────────────────────────────────────────────────────────────┘ │
│                              ▲                                  │
└──────────────────────────────│──────────────────────────────────┘
                               │
┌──────────────────────────────│──────────────────────────────────┐
│  Layer 1 — Backup cron on every shared host                     │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │ Cron 03:00: backup-all-tenants.sh                          │ │
│  │  per tenant:                                               │ │
│  │    pg_dump + mc mirror + tar synapse + tar compose         │ │
│  │    → bundle.tar.gz                                         │ │
│  │    → upload to hetzner-obj-storage://prilog-backup/        │ │
│  │  Retention: 7 daily, 4 weekly, 12 monthly                  │ │
│  └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Layer 1 — Backup

What is backed up

Current paths (tenant box)

With the tenant-box architecture (LIVE since 2026-05-02), everything lives under /srv/tenants/<slug>/. The snippet below documents the original plan variant; the live implementation in tenant-backup.service.ts works exclusively with /srv/tenants/<slug>/{postgres,minio,synapse,connectors,docker-compose.yml,homeserver.yaml}.

Per tenant:

1. Postgres DB (pg_dump -Fc from the pg-<slug> container, ~50-500 MB)
2. MinIO bucket (mc mirror from the minio-<slug> container, all DMS files)
3. Synapse media (volume of the synapse-<slug> container, chat attachments)
4. Box config (/srv/tenants/<slug>/ without *-data, i.e. homeserver.yaml, docker-compose.yml, connectors/, credentials.env)

Plus per host:

5. Host config (/etc/prilog/host.json, /etc/prilog/port-registry.json)
6. Nginx tenants (/etc/nginx/prilog-tenants/)

Items 1-4 are tenant-specific and are bundled into one tarball per tenant. Items 5-6 are host-specific and are backed up separately (smaller, needed less often).

Where

Hetzner Object Storage: a separate bucket prilog-backup in a different Hetzner location than the shared hosts (e.g. NBG1 if the hosts are in FSN1). Off-DC storage protects against a local outage.

Path layout:

prilog-backup/
└── <host-name>/
    └── <tenant-sub>/
        ├── 2026-05-01/        ← Daily
        │   ├── bundle.tar.gz
        │   └── manifest.json  (contents + checksum)
        ├── 2026-04-26/        ← Weekly (Sunday)
        ├── 2026-04-01/        ← Monthly (1st of the month)
        └── _latest → symlink to the newest daily
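
As an illustration of the layout above, the object key for a bundle could be built as follows. This is a hedged sketch: the helper name `bundleObjectKey` and the allowed-character rule for host and tenant names are assumptions, not part of the plan.

```typescript
// Hypothetical helper mirroring the bucket layout <host-name>/<tenant-sub>/<date>/.
// Assumption: host and tenant names are lowercase alphanumerics plus hyphens.
const SAFE_SEGMENT = /^[a-z0-9][a-z0-9-]*$/;

export function bundleObjectKey(hostName: string, tenantSub: string, date: Date): string {
  for (const segment of [hostName, tenantSub]) {
    // Reject anything that could escape the tenant's prefix (e.g. "../").
    if (!SAFE_SEGMENT.test(segment)) {
      throw new Error(`unsafe path segment: ${segment}`);
    }
  }
  const day = date.toISOString().slice(0, 10); // UTC calendar day, YYYY-MM-DD
  return `${hostName}/${tenantSub}/${day}/bundle.tar.gz`;
}
```

Using the UTC calendar day keeps the key stable regardless of the host's local time zone, which matches the 03:00 UTC schedule.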

When

On every shared host:

Daily at 03:00 UTC (cron job, ~30-90 min runtime depending on the number of tenants)
Hourly (opt-in per tenant via TenantSetting backup.hourly), for premium plans with an RPO of 1h
On demand before every migration (as a safety net in addition to the migration engine)

Retention

Type	Retention	Created
Daily	7 days	every day at 03:00
Weekly	4 weeks	every Sunday (the daily bundle is "promoted")
Monthly	12 months	every 1st of the month

The cron cleans up automatically:

find prilog-backup/<host>/<tenant>/*/ -mtime +7 -name 'daily*' -delete
analogous for weekly/monthly
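
The retention rules can be expressed as a small classifier that also yields a value for expires_at. This is a sketch under stated assumptions: the function names are hypothetical, all dates are treated as UTC, and a bundle created on the 1st is assumed to count as monthly even when that day is a Sunday.

```typescript
type BackupType = "daily" | "weekly" | "monthly";

// Assumption: "monthly" takes precedence over "weekly" when both apply.
export function classifyBackup(createdAt: Date): BackupType {
  if (createdAt.getUTCDate() === 1) return "monthly";
  if (createdAt.getUTCDay() === 0) return "weekly"; // 0 = Sunday
  return "daily";
}

// Computes the expiry from the retention table: 7 days, 4 weeks, 12 months.
export function expiresAt(createdAt: Date): Date {
  const expiry = new Date(createdAt.getTime());
  switch (classifyBackup(createdAt)) {
    case "monthly":
      expiry.setUTCMonth(expiry.getUTCMonth() + 12);
      break;
    case "weekly":
      expiry.setUTCDate(expiry.getUTCDate() + 28);
      break;
    case "daily":
      expiry.setUTCDate(expiry.getUTCDate() + 7);
      break;
  }
  return expiry;
}
```

Storing the computed expiry with each inventory row would let the cleanup job delete by expires_at instead of guessing the type from a file name.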

Storage cost example (100 GB tenant, 7d+4w+12m = 23 bundles ≈ 2.3 TB):

2.3 TB × 2.80 EUR/TB/month = 6.44 EUR/month per tenant with 100 GB of data
More realistic: 5-10 GB per tenant → ~0.50 EUR/month
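
The arithmetic behind these figures, as a small sketch. It assumes every retained bundle is a full, uncompressed copy of the tenant data and uses decimal units (1 TB = 1000 GB); the function name is hypothetical.

```typescript
const BUNDLES_RETAINED = 7 + 4 + 12; // daily + weekly + monthly = 23
const EUR_PER_TB_MONTH = 2.8; // price quoted in the example above

// Upper-bound estimate: each bundle is assumed to be as large as the tenant data.
export function monthlyStorageCostEur(tenantDataGb: number): number {
  const storedTb = (tenantDataGb * BUNDLES_RETAINED) / 1000;
  return storedTb * EUR_PER_TB_MONTH;
}
```

For 5 GB and 10 GB tenants this gives roughly 0.32 and 0.64 EUR per month, which brackets the ~0.50 EUR estimate.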

Implementation

New agent handler prilog-agent/src/handlers/backup.ts:

import { exec } from "node:child_process";
import { stat } from "node:fs/promises";
import { promisify } from "node:util";

// Runs a shell command and resolves with { stdout, stderr }.
const sh = promisify(exec);

interface BackupTenantArgs {
  sub: string;
  dbName: string;
  bucketName: string;
  hostName: string;
}

export async function handleBackupTenant(args: BackupTenantArgs) {
  const date = new Date().toISOString().slice(0, 10); // UTC day, e.g. 2026-05-01
  const dir = `/tmp/backup-${args.sub}-${date}`;
  const bundle = `${dir}/bundle.tar.gz`;
  await sh(`mkdir -p ${dir}`);

  // 1. DB (custom format, so pg_restore can read it)
  await sh(`sudo -u postgres pg_dump -Fc ${args.dbName} > ${dir}/db.dump`);
  // 2. MinIO bucket
  await sh(`mc mirror --quiet local/${args.bucketName} ${dir}/bucket/`);
  // 3. Synapse media
  await sh(`tar -czf ${dir}/synapse.tar.gz -C /var/lib/prilog/synapse-${args.sub} .`);
  // 4. Compose config
  await sh(`tar -czf ${dir}/compose.tar.gz -C /opt/prilog/tenants/${args.sub} .`);
  // 5. Bundle + checksum + size
  await sh(`tar -czf ${bundle} -C ${dir} db.dump bucket synapse.tar.gz compose.tar.gz`);
  const checksum = (await sh(`sha256sum ${bundle} | cut -d' ' -f1`)).stdout.trim();
  const sizeBytes = (await stat(bundle)).size;
  // 6. Upload; remotePath follows the bucket layout <host-name>/<tenant-sub>/<date>
  const remotePath = `${args.hostName}/${args.sub}/${date}`;
  const remoteBase = `hetzner-backup/prilog-backup/${remotePath}`;
  await sh(`mc cp ${bundle} ${remoteBase}/bundle.tar.gz`);
  // Manifest fields beyond date and checksum are elided in the original plan.
  const manifest = JSON.stringify({ date, checksum });
  await sh(`echo '${manifest}' | mc pipe ${remoteBase}/manifest.json`);
  // 7. Clean up the local working directory
  await sh(`rm -rf ${dir}`);

  return { remotePath, checksum, sizeBytes };
}

New cron on every shared host (registered by the agent):

bash

0 3 * * * /opt/prilog-agent/dist/cli.js backup-all

Backend cron backup-monitor:

Lists all tenants
Checks backup_inventory for a latest entry younger than 25h
If there is none → alert mail to admin@prilog.chat

Layer 2 — Inventory + Monitoring

Schema

sql

CREATE TABLE backup_inventory (
  id            VARCHAR(50)   PRIMARY KEY,
  tenant_id     VARCHAR(64)   NOT NULL,
  host_name     VARCHAR(50)   NOT NULL,
  bundle_url    VARCHAR(255)  NOT NULL,
  size_bytes    BIGINT        NOT NULL,
  checksum      VARCHAR(80)   NOT NULL,
  type          VARCHAR(10)   NOT NULL,  -- 'daily' | 'weekly' | 'monthly' | 'on-demand'
  created_at    TIMESTAMPTZ   NOT NULL DEFAULT now(),
  expires_at    TIMESTAMPTZ   NOT NULL,
  status        VARCHAR(20)   NOT NULL DEFAULT 'success',  -- success | failed | corrupted
  notes         TEXT
);

CREATE INDEX idx_backup_tenant_date ON backup_inventory(tenant_id, created_at DESC);
CREATE INDEX idx_backup_status ON backup_inventory(status);

After every backup run, the agent reports to the backend (existing WS channel) → the backend persists the result in backup_inventory.

Health cron `backup-monitor`

Schedule: 0 4 * * * (1h after the backup cron)
Logic:

for each active tenant T:
  latest = SELECT MAX(created_at) FROM backup_inventory WHERE tenant_id = T AND status = 'success'
  if latest IS NULL or latest < now() - 25h:
    send alert mail "Tenant T has had no backup for > 24h"
    create Slack notification (later: PagerDuty)
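
The staleness check can be sketched as a pure function over inventory rows. The row shape mirrors the backup_inventory columns; the function name is hypothetical, and two behaviours are assumptions made explicit here: only rows with status 'success' count, and a tenant with no successful backup at all is reported as stale.

```typescript
interface InventoryRow {
  tenantId: string;
  createdAt: Date;
  status: "success" | "failed" | "corrupted";
}

// Returns the active tenants whose latest successful backup is older than
// maxAgeHours, including tenants that have never been backed up successfully.
export function findStaleTenants(
  activeTenantIds: string[],
  rows: InventoryRow[],
  now: Date,
  maxAgeHours = 25,
): string[] {
  const latestSuccess = new Map<string, number>();
  for (const row of rows) {
    if (row.status !== "success") continue;
    const ts = row.createdAt.getTime();
    const previous = latestSuccess.get(row.tenantId);
    if (previous === undefined || ts > previous) latestSuccess.set(row.tenantId, ts);
  }
  const cutoff = now.getTime() - maxAgeHours * 60 * 60 * 1000;
  return activeTenantIds.filter((id) => {
    const latest = latestSuccess.get(id);
    return latest === undefined || latest < cutoff;
  });
}
```

Keeping the check free of database and mail code makes it easy to unit-test; the cron job would only load the rows and send one alert per returned tenant.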

Restore-drill cron (monthly)

Schedule: 0 5 1 * * (1st of the month, 05:00)
Logic:

pick 1 random tenant from the active ones
on a dedicated "drill host" (separate shared host, for tests only):
  download the newest bundle
  pg_restore into a temporary DB
  compare row counts with the live DB (sample)
  compare bucket contents (listing)
  if ok: status='verified', cleanup
  if fail: status='failed', alert
the result is persisted in backup_drill_log
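
The row-count comparison could look like the sketch below. Because the bundle can be up to 24h older than the live DB, exact equality is not expected; the 5 % drift tolerance, the function name, and the message format are assumptions for illustration.

```typescript
type RowCounts = Record<string, number>;

// Compares per-table row counts of the live DB against the restored copy.
// Returns a list of human-readable problems; an empty list means the drill passed.
export function compareRowCounts(
  live: RowCounts,
  restored: RowCounts,
  maxDrift = 0.05, // assumption: up to 5 % difference is normal daily growth
): string[] {
  const problems: string[] = [];
  for (const table of Object.keys(live)) {
    const liveCount = live[table];
    const restoredCount = restored[table];
    if (liveCount === undefined) continue;
    if (restoredCount === undefined) {
      problems.push(`${table}: missing in restore`);
      continue;
    }
    const drift = Math.abs(liveCount - restoredCount) / Math.max(liveCount, 1);
    if (drift > maxDrift) {
      problems.push(`${table}: live=${liveCount} restored=${restoredCount}`);
    }
  }
  return problems;
}
```

A relative tolerance keeps the drill from raising false alarms on busy tables while still catching truncated or empty restores.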

Layer 3 — Restore-Engine + Admin-UI

Admin-UI

New route /admin/recover/<tenant-sub>:

┌─────────────────────────────────────────────────┐
│ Restore tenant 'leander'                        │
├─────────────────────────────────────────────────┤
│ Select backup date:                             │
│ ○ 2026-05-01 03:00 (today)         8 MB        │
│ ○ 2026-04-30 03:00 (yesterday)     8 MB        │
│ ○ 2026-04-29 03:00 (2 days ago)    8 MB        │
│ ○ 2026-04-26 03:00 (weekly)        8 MB        │
│ ○ 2026-04-01 03:00 (monthly)       8 MB        │
│                                                 │
│ Target host:                                    │
│ ○ shared-1 (4/15 tenants)                      │
│ ○ shared-2 (0/15 tenants)                      │
│ ◉ Auto-provision new host (CCX13, FSN1)        │
│                                                 │
│ ⚠ All data after this date will be lost!       │
│                                                 │
│               [Cancel]  [Restore]               │
└─────────────────────────────────────────────────┘

Backend-Flow

POST /admin/tenants/<id>/recover { backupId, targetHostId? }

1. If !targetHostId: createSharedHost() → wait until active (~10 min)
2. sendCommand to the target agent: 'tenant.restore_from_backup'
   args: { backupUrl, dbName, bucketName, slug, ... }
3. Agent (handleRestoreFromBackup):
   - mc cp hetzner-backup://.../bundle.tar.gz /tmp/
   - sha256sum verify
   - extract bundle
   - createuser + createdb
   - pg_restore --clean --if-exists
   - mc mb local/tenant-<sub> + mc mirror /tmp/bucket/ local/...
   - tar -xzf synapse.tar.gz into /var/lib/prilog/synapse-<sub>
   - tar -xzf compose.tar.gz into /opt/prilog/tenants/<sub>
   - docker compose up -d
4. DNS update via Bunny (analogous to the migration cutover)
5. Update ServerOrder.sharedHostId + synapsePort
6. Health-check Synapse
7. UI update: "Tenant restored at https://<sub>.prilog.team"
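
The "sha256sum verify" step in the agent can also be done in-process. The following is a sketch, not the actual handler: the function names are hypothetical, and it assumes the expected checksum is the lowercase hex SHA-256 stored in backup_inventory or manifest.json. The file is read in 1 MiB chunks so a large bundle is never loaded into memory at once.

```typescript
import { createHash } from "node:crypto";
import { closeSync, openSync, readSync } from "node:fs";

// Computes the hex SHA-256 of a file, reading it in fixed-size chunks.
export function sha256OfFile(path: string): string {
  const hash = createHash("sha256");
  const buffer = Buffer.alloc(1024 * 1024);
  const fd = openSync(path, "r");
  try {
    let bytesRead = readSync(fd, buffer, 0, buffer.length, null);
    while (bytesRead > 0) {
      hash.update(buffer.subarray(0, bytesRead));
      bytesRead = readSync(fd, buffer, 0, buffer.length, null);
    }
  } finally {
    closeSync(fd);
  }
  return hash.digest("hex");
}

// Throws if the bundle on disk does not match the recorded checksum.
export function verifyBundle(path: string, expectedChecksum: string): void {
  const actual = sha256OfFile(path);
  if (actual !== expectedChecksum.trim().toLowerCase()) {
    throw new Error(`checksum mismatch for ${path}: expected ${expectedChecksum}, got ${actual}`);
  }
}
```

Verifying before extraction means a corrupted or tampered bundle is rejected before pg_restore --clean touches any existing data.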

Restore vs. migration: differences

	Migration	Restore from backup
Source	live source host (rsync)	Hetzner Object Storage (download)
Data freshness	a few seconds old	up to 24h old
Procedure	8 steps with freeze	5 steps, no freeze (source may be dead)
Use case	planned move	disaster or rollback

Implementation phases

Phase 1: Backup cron (~1-2 days)

Set up the Hetzner Object Storage bucket + service account
mc alias set hetzner-backup ... on every shared host (cloud-init + shared-host.sh)
Agent handler backup.ts (handleBackupTenant + handleBackupAllTenants)
Cron on every shared host
backup_inventory table + migration
Test: back up 1 tenant manually, see the bundle in Hetzner Object Storage

Phase 2: Health monitoring (~½ day)

Backend cron backup-monitor
Email alert after > 24h without a backup
Admin UI: small "Backup status" section per tenant (last date + size)

Phase 3: Restore engine (~2-3 days)

Agent handler restore-from-backup.ts
Backend endpoint POST /admin/tenants/<id>/recover
Admin UI wizard with date selection + target host
DNS update (can reuse the existing migration-cutover code)
Test: delete a tenant on purpose, then restore it

Phase 4: Restore drill (~½ day)

Monthly cron restore-drill
Drill host marked as a separate shared host
Verify logic (row count + bucket listing)
Slack/email notification with the result

Phase 5: Hourly backup opt-in (~½ day)

TenantSetting backup.hourly flag
Hourly cron checks tenants with the flag
Incremental backups (DB diff only instead of a full dump)

Total effort: ~5-7 days. The phases make sense in sequence: after Phase 1 you immediately have protection against disk failure.

What "1-click" means

After Phase 3, the admin needs exactly three clicks in a disaster:

1. Log in at admin.prilog.chat  →  tenant list
2. "Restore" on the tenant card
3. "Restore" in the wizard (default: latest backup, auto host)

→ After 15-20 min: tenant reachable again, with at most 24h of data loss.

Plus: in a total DC outage, the admin can switch the drop-down to "Other location" → the tenant moves to NBG1/HEL1 without further intervention.

Trade-offs / deliberately not included

What	Why not	When to add, if ever
Hot standby (live replication)	complex, ~10x effort, the only gain is an RTO in the range of seconds	never worthwhile for a school SaaS; an RTO of 25 min is enough
Geo-replication between DCs	Hetzner Object Storage already replicates internally	if we leave Hetzner
Encryption at rest for backups	Hetzner Object Storage uses AES-256 by default	if a compliance question comes up: server-side encryption with customer keys
Version-by-version restore (browse history)	UI effort, for niche cases	Phase 6+
Selective table restore	complex, risk of schema drift	once the DB format migration has stabilized

Related documents

tenant-migration-implementation.md: migration engine, most of its code is reused
admin-tenant-board.md: user view, the recovery button goes there
