02 // project

Vertretungsplan

A two-worker Cloudflare system that scrapes, parses, stores and serves the AES Schwalbach substitution schedule. A cron worker snapshots PDFs into R2 every hour; a second worker parses the raw HTML live, merges dual sources, exposes a clean JSON API, and drives a React + Tailwind frontend at vp.l3on.org.

live · cloudflare workers · durable objects · r2 storage · d1 sqlite
[Diagram: vplan-aes-cf-worker — main worker architecture (light/dark mode)]
2 CF Workers · 1h Cron Interval · 60s Cache TTL · Edge PoPs
vplan-worker — the snapshot engine [TypeScript]
01 A Cloudflare Cron Trigger fires every hour. The worker launches a Puppeteer browser session (via @cloudflare/puppeteer) and navigates to both subst_001.htm and subst_002.htm on the Untis portal.
02 Before generating a PDF it scrapes the page for a date string — trying the title, a "Stand:" timestamp, header tags, and finally falling back to today's date if nothing is found.
03 Each page is printed to A4 PDF with page.pdf() and uploaded to R2 under the key DD.MM.YYYY/subst_00N.pdf with rich custom metadata (date, source URL, polledAt).
04 A separate GET /vplan/parse/:date endpoint extracts plain text from stored PDFs via unpdf (WASM, no Node) and runs a custom token-stream parser that handles the flat Untis table layout — splitting rows on repeating day+date anchors, expanding multi-class cells, and detecting Entfall via column 12 "x" markers.
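The date-fallback chain in step 02 can be sketched as a small pure function. This is an illustrative reconstruction, not the worker's actual code: the function name, regexes, and candidate order are assumptions based on the description above (title, then "Stand:" timestamp, then header tags, then today's date).

```javascript
// Hypothetical sketch of the snapshot worker's date extraction.
// Tries several page locations for a DD.MM.YYYY date, falling back to today.
function extractPlanDate(html, now = new Date()) {
  const DATE_RE = /(\d{1,2})\.(\d{1,2})\.(\d{4})/;

  const title = html.match(/<title>([^<]*)<\/title>/i)?.[1] ?? "";      // 1. <title>
  const stand = html.match(/Stand:\s*([^<\n]*)/i)?.[1] ?? "";           // 2. "Stand:" line
  const header = html.match(/<h[1-3][^>]*>([^<]*)<\/h[1-3]>/i)?.[1] ?? ""; // 3. headers

  for (const candidate of [title, stand, header]) {
    const m = candidate.match(DATE_RE);
    if (m) {
      const [, d, mo, y] = m;
      return `${d.padStart(2, "0")}.${mo.padStart(2, "0")}.${y}`;
    }
  }
  // 4. nothing found: fall back to today's date as DD.MM.YYYY
  const pad = (n) => String(n).padStart(2, "0");
  return `${pad(now.getDate())}.${pad(now.getMonth() + 1)}.${now.getFullYear()}`;
}
```

The normalized DD.MM.YYYY result doubles as the R2 key prefix, so snapshots group naturally by day.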
vplan-aes-cf-worker — the API + frontend [JavaScript]
01 The main worker fetches subst_001.htm and subst_002.htm live on each request (with a 60-second Cloudflare cache). It decodes the response with the correct charset from the Content-Type header (often ISO-8859-1 for German Umlauts).
02 A regex-based HTML parser extracts table rows matching .list* CSS classes, reconstructing 12-column substitution records. Entfall detection is tri-modal: explicit "x" in column 12, info field containing "entfall", or both teacher and room being empty dashes.
03 The admin dual-merge endpoint fetches both files in parallel via Promise.allSettled. It compares their "Stand:" timestamps to decide which is primary, then merges them into a deduplicated map keyed by datum|klasse|stunde|fach. Multi-class rows like "07A, 07B, 07C" are expanded into individual records.
04 The frontend is a React 18 + Tailwind SPA served via env.ASSETS. It auto-refreshes every 60 s, splits compound lesson slots like "1-2" into individual hour cards, and persists the class filter and view mode to localStorage.
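The dual-merge in step 03 boils down to two pure operations: expanding multi-class rows and deduplicating into a map keyed by datum|klasse|stunde|fach. A minimal sketch, with illustrative names (the worker's real identifiers may differ):

```javascript
// "07A, 07B, 07C" -> one record per class
function expandClasses(entry) {
  return entry.klasse.split(",").map((k) => ({ ...entry, klasse: k.trim() }));
}

// Merge two parsed sources; the primary (newer "Stand:") wins on conflicts.
function mergeDual(primaryEntries, secondaryEntries) {
  const key = (e) => `${e.datum}|${e.klasse}|${e.stunde}|${e.fach}`;
  const merged = new Map();
  // Insert secondary first so primary rows overwrite on key collisions.
  for (const e of secondaryEntries.flatMap(expandClasses)) merged.set(key(e), e);
  for (const e of primaryEntries.flatMap(expandClasses)) merged.set(key(e), e);
  return [...merged.values()];
}
```

Inserting the secondary source first is what makes "primary overwrites" fall out of plain `Map.set` semantics, with no explicit conflict branch.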
ChatRoom Durable Object — realtime chat [Durable Objects]
01 Each chat room is a Cloudflare Durable Object with an embedded SQLite database. On construction it creates two tables: messages and meta. Messages older than 1 hour are pruned on each write to keep storage lean.
02 Clients connect via WebSocket. The DO uses hibernation (state.acceptWebSocket) so it only consumes compute when messages arrive. Incoming JSON is validated, stored, and broadcast to all other connected sockets in the same room.
03 Mentioning @ai or @smartypants triggers Groq (llama-3.3-70b-versatile) via a 15-second timeout fetch. The last 20 messages are sent as conversation context. A 3-second per-room cooldown (stored in the meta table) prevents API flooding.
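The AI gating in step 03 reduces to two small decisions: whether a message should trigger a Groq call, and what context window to send. A hedged sketch, assuming illustrative names and message shapes:

```javascript
const AI_COOLDOWN_MS = 3_000; // per-room cooldown, stored in the meta table
const CONTEXT_SIZE = 20;      // messages sent as conversation context

// Trigger only when mentioned AND the room's cooldown has elapsed.
function shouldTriggerAi(text, lastAiCallMs, nowMs) {
  const mentioned = /@(ai|smartypants)\b/i.test(text);
  const coolingDown = nowMs - lastAiCallMs < AI_COOLDOWN_MS;
  return mentioned && !coolingDown;
}

// Shape the last N messages as chat-completion turns, oldest first.
function buildContext(messages) {
  return messages.slice(-CONTEXT_SIZE).map((m) => ({
    role: m.fromAi ? "assistant" : "user",
    content: m.text,
  }));
}
```

Keeping both checks pure makes them trivially testable outside the Durable Object runtime; only the `meta`-table reads and the actual `fetch` to Groq need the DO environment.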
API endpoints exposed [Routes]
Method + Path · What it does · Auth
GET /api/vertretungsplan/today · Live parse of subst_001.htm → JSON entries · none
GET /api/vertretungsplan/tomorrow · Live parse of subst_002.htm → JSON entries · none
GET /api/admin/vplan-dual · Parallel fetch + timestamp-aware merge of both sources · X-Admin-Code
GET /api/news · Scrape + decode AES news feed, cache 5 min · none
GET /api/calendar · Proxy school iCal feed, cache 10 min · none
GET /api/chat/:code/history · Last hour of chat messages for a room · none
GET /api/chat/:code/ws · WebSocket upgrade → Durable Object room · none
GET /api/hub/files · List all files in the R2 file hub · none
POST /api/hub/upload · Multipart file upload to R2 (PNG/JPEG/PDF ≤10 MB) · none
POST /api/mensa/sync · FlareSolverr bypass → scrape & store Mensa PDFs · env secret
GET /vplan/parse/:date · Extract & parse stored R2 PDFs for a date (vplan-worker) · X-API-Key
Interesting engineering decisions [Design]
⚡ No n8n dependency — early versions piped everything through n8n webhooks, adding 8–10 s of latency. Rewriting the HTML parser directly in the Worker dropped response time to under 300 ms globally.
🔀 Dual-source merge — the school sometimes updates subst_001 and subst_002 asynchronously. The admin merge endpoint compares ISO timestamps from the "Stand:" field and designates the newer file as "primary", letting it overwrite any conflicting rows from the secondary.
🔤 Charset detection — Untis serves HTML in ISO-8859-1 without declaring it consistently. The parser reads the charset= parameter from the Content-Type header before decoding, so German Umlauts (ä, ö, ü, ß) always survive the round-trip.
🗄️ SQLite in a Durable Object — instead of a separate D1 database for chat history, the DO uses its embedded SQLite via state.storage.sql.exec(). This keeps all room state co-located with the WebSocket connections and eliminates cross-datacenter round-trips.
🦎 WASM PDF parsing on the edge — the vplan-worker uses unpdf (compiled to WASM) to extract text from stored R2 PDFs without any Node.js runtime. The token stream is then walked with a state machine that anchors on repeating "Mi 6.5." / "Do 7.5." day+date patterns to split rows.
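The charset-detection note above is essentially one helper around the standard `TextDecoder`. A minimal sketch, assuming the ISO-8859-1 default described for Untis (function name is illustrative):

```javascript
// Decode a response body with the charset declared in Content-Type,
// defaulting to ISO-8859-1 when Untis omits the declaration.
function decodeBody(bytes, contentType) {
  const m = /charset=([^;\s]+)/i.exec(contentType ?? "");
  const charset = m ? m[1].toLowerCase() : "iso-8859-1";
  return new TextDecoder(charset).decode(bytes);
}
```

Skipping this and decoding as UTF-8 would mangle every umlaut into replacement characters, which is exactly the failure mode the design note guards against.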
Current state & future plans [Post Mortem]
The app is live and used daily during the school year. The core vplan pipeline is stable — the main active development is around the surrounding features: the Datei-Hub (file sharing for class materials), the Mensa scraper (bypassing Cloudflare via self-hosted FlareSolverr), and the school calendar ICS proxy.

Next planned: push notifications when a class's schedule changes (comparing the new parse result against the previous snapshot stored in D1), and a teacher-view mode behind the admin code that shows room conflicts across all classes simultaneously.
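The planned change detection could be a set-difference over stable row keys: diff the fresh parse against the previous D1 snapshot and report which classes gained or lost entries. A hedged sketch of that idea, with assumed field names mirroring the merge key used elsewhere:

```javascript
// Return the classes whose entries differ between two parse snapshots.
function changedClasses(prevEntries, nextEntries) {
  const key = (e) => `${e.datum}|${e.klasse}|${e.stunde}|${e.fach}|${e.vertreter ?? ""}`;
  const prev = new Set(prevEntries.map(key));
  const next = new Set(nextEntries.map(key));
  const changed = new Set();
  for (const e of nextEntries) if (!prev.has(key(e))) changed.add(e.klasse); // added/modified
  for (const e of prevEntries) if (!next.has(key(e))) changed.add(e.klasse); // removed/modified
  return [...changed];
}
```

Each class name in the result would then map to a push-notification topic, so only affected students get pinged.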