Operations & runbook
In one line: how to build, deploy, roll back, and survive failures, including the LLM failover lever, the "technical difficulties" platform fallback, and the deployment-drift trap between `src/` and the live build.
What it is
HIGHFIELD is live, so changes follow a strict discipline: diagnose before designing, additive-only, one primitive per deploy, smoke between. This page covers the deploy/rollback mechanics and the failure-mode levers an operator reaches for.
How it works — build, deploy, rollback
⚠️ This section is operational practice, not all source-verifiable. Confirm against your current `lua-cli` version and the team's runbook before acting.
- Typecheck: `npx tsc -p tsconfig.json --noEmit` (ignore `dist-v2` errors). Avoid compound `tsc/lua compile` commands; the deploy hook blocks them.
- Agent versioning is ON. Push with `lua push all --force`, deploy with `lua deploy all --force` (granular push is deprecated). A bare push auto-deploys skills only; webhooks/persona/jobs need `deploy all`.
- Deploys are run by the human operator (the deploy hook blocks the assistant even with the prefix). The assistant stages/pushes; a person runs `LUA_DEPLOY_CONFIRMED=1 lua deploy <type>`.
- Rollback: per-primitive `lua deploy --set-version <prior>` flips the live pointer without pushing (surgical). `lua source rollback` is workspace-wide and trample-y; avoid unless intended.
- Verify LIVE state by timestamp/version, not by assuming a push deployed: `lua <type> versions` shows the ⭐ DEPLOYED label. A `sync --check` result of "in sync" only matches version labels, not code.
- Poisoned per-user memory: `lua chat clear --user <id|email|mobile>` wipes a user's agent conversation memory (the escape hatch for expired-S3-URL history or runaway context).
⚠️ Deployment drift. The deployed build mirror is `dist-v2/sources/<hash>.ts` and can diverge from the working-tree `src/`. Example: the live build has `LLM_FAILOVER_MODEL` + `model: () => resolveModel(...)` and a different job set; the current branch hardcodes the model (`src/index.ts:567`) and `src/utils/model-failover.ts` does not exist. Never assume `src/` == production.
How it works — failure-mode levers
LLM failover (LLM_FAILOVER_MODEL)
In the deployed build, the agent model is () => resolveModel(env('LLM_FAILOVER_MODEL')). PRIMARY = anthropic/claude-sonnet-4-6, default failover google/gemini-2.5-pro. Toggle: unset/off/false/0 → primary; on/1/true/gemini → Gemini; an explicit provider/model string used verbatim. One-command flip, reversible in seconds, no redeploy — the lever to pull during an Anthropic 529 outage. Lua has no in-request retry/fallback on 529, which is why this env lever exists.
⚠️ This lever is in the deployed build, absent from the current branch source. See drift note above.
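The toggle semantics above can be sketched as a small pure function. This is a minimal illustration, not the deployed code: the real `resolveModel` exists only in the live build and may differ, and the constant names plus the trimming and case-insensitive matching are assumptions made for this example.

```typescript
// Hypothetical sketch of the LLM_FAILOVER_MODEL toggle; not the deployed implementation.
const PRIMARY_MODEL = "anthropic/claude-sonnet-4-6";
const DEFAULT_FAILOVER_MODEL = "google/gemini-2.5-pro";

function resolveModel(raw: string | undefined): string {
  // Assumption: surrounding whitespace and letter case are ignored for the keywords.
  const value = (raw ?? "").trim();
  const key = value.toLowerCase();

  // unset / off / false / 0 -> stay on the primary model
  if (key === "" || key === "off" || key === "false" || key === "0") {
    return PRIMARY_MODEL;
  }
  // on / 1 / true / gemini -> default failover model
  if (key === "on" || key === "1" || key === "true" || key === "gemini") {
    return DEFAULT_FAILOVER_MODEL;
  }
  // Anything else is treated as an explicit provider/model string and used verbatim.
  return value;
}
```

With this sketch, `resolveModel(undefined)` stays on the primary and `resolveModel("gemini")` selects the default failover model, so flipping the env value is the only change needed during an outage.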
"Technical difficulties" platform fallback
The Lua platform substitutes a generic "...technical difficulties... try again later..." message on any channel when a turn crashes (oversized attachment, Anthropic 529). It is server-side and not SDK-configurable.
- `src/utils/error-reply-copy.ts` provides warm replacement copy (`getAwayReplyMessage()`), chosen by channel.
- Live use: the email webhook's `AI.generate` catch (`inbound-email.webhook.ts:651-673`) is a status-based trigger: it emails the away copy and returns 200; if even that fails, silent 200.
- `isPlatformErrorFallback(text)` detects the platform string precisely (full contiguous phrases; never the bare words "technical difficulties"; any `MT-…` reference is a hard negative guard). ⚠️ It is NOT wired as a postprocessor: a platform fallback on the native channel (not the webhook) would not be rewritten.
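The webhook's degrade-gracefully flow (real reply, else away copy, else silent 200) might look roughly like the sketch below. Everything here is hypothetical: `ReplyDeps`, `WebhookResult`, and `handleInboundEmail` are illustrative names, the injected functions only stand in for `AI.generate` and `getAwayReplyMessage()`, and the status check that the real catch performs is omitted because this page does not describe it.

```typescript
// Hypothetical dependency shapes; the real webhook's helpers and signatures are not shown on this page.
interface ReplyDeps {
  generate: (prompt: string) => Promise<string>; // stands in for the AI.generate call
  sendEmail: (to: string, body: string) => Promise<void>;
  awayCopy: () => string; // stands in for getAwayReplyMessage()
}

interface WebhookResult {
  status: number;
  outcome: "replied" | "away-copy" | "silent";
}

async function handleInboundEmail(
  from: string,
  prompt: string,
  deps: ReplyDeps,
): Promise<WebhookResult> {
  // A failed generation becomes null instead of crashing the turn.
  const reply = await deps.generate(prompt).catch(() => null);

  if (reply === null) {
    try {
      // First fallback: send the warm away copy in place of the platform message.
      await deps.sendEmail(from, deps.awayCopy());
      return { status: 200, outcome: "away-copy" };
    } catch {
      // Last resort: acknowledge the webhook anyway so the sender is not retried.
      return { status: 200, outcome: "silent" };
    }
  }

  await deps.sendEmail(from, reply);
  return { status: 200, outcome: "replied" };
}
```

Returning 200 on every branch keeps the inbound provider from redelivering the same email while the model is unavailable.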
Retry / timeout / backoff
- No in-request LLM retry (use the failover lever instead).
- BC custom-extension writes: `withParentRetry` retries on FK-not-yet-visible with backoff (1s/2s/4s); `syncMessage` retries on entryNo collision. HTTP timeouts 30s/60s.
- Ticket-ID allocation: cross-process duplicate backstopped by BC's "already exists" retry.
- Slot no-pick timeout: `SLOT_PICK_TIMEOUT_HOURS` (default 12h) auto-cancels.
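The 1s/2s/4s backoff pattern used for BC writes can be illustrated with a generic helper. This is a sketch under assumptions, not the actual `withParentRetry`: the function names, the injectable `sleep`, and the choice of three retries after the first attempt are all hypothetical, since the real signature and attempt count are not shown here.

```typescript
// Hypothetical helpers; the real withParentRetry signature is not shown on this page.

// Doubling schedule: backoffDelaysMs(3, 1000) gives 1000, 2000, 4000.
function backoffDelaysMs(retries: number, baseMs: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < retries; i++) {
    delays.push(baseMs * Math.pow(2, i));
  }
  return delays;
}

async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  delaysMs: number[] = backoffDelaysMs(3, 1000),
  // Injectable so callers and tests can avoid real waiting.
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up when the schedule is exhausted or the error is not transient.
      if (attempt >= delaysMs.length || !isRetryable(err)) throw err;
      await sleep(delaysMs[attempt]);
    }
  }
}
```

The `isRetryable` predicate is what keeps the retry narrow: only the specific transient condition (for example, a parent row that is not yet visible) should return true, so genuine failures surface immediately.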
Rate-limit backstop
Per-sender cap, default 40 replies / 10 min (AUTO_REPLY_RATE_LIMIT / AUTO_REPLY_RATE_WINDOW_MIN); fail-open; email-only; enforced at inbound-email.webhook.ts:113-119.
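A per-sender cap with fail-open behavior can be sketched as follows. This is an in-memory illustration only: `createAutoReplyLimiter` is a hypothetical name, the sliding-window interpretation is an assumption (the real limiter may use fixed windows or persistent storage), and only the defaults of 40 replies per 10 minutes come from this page.

```typescript
// Hypothetical in-memory sketch; the real limiter's storage and window semantics are not shown here.
function createAutoReplyLimiter(limit: number = 40, windowMin: number = 10) {
  // Timestamps (ms) of recent replies, keyed by sender address.
  const sentAt = new Map<string, number[]>();

  // Returns true if a reply to `sender` may be sent at `nowMs`, and records it when allowed.
  return function allow(sender: string, nowMs: number): boolean {
    try {
      const cutoff = nowMs - windowMin * 60000;
      const recent = (sentAt.get(sender) ?? []).filter((t) => t > cutoff);

      if (recent.length >= limit) {
        sentAt.set(sender, recent);
        return false; // cap reached: suppress this auto-reply
      }

      recent.push(nowMs);
      sentAt.set(sender, recent);
      return true;
    } catch {
      // Fail-open: a limiter fault must not block a legitimate reply.
      return true;
    }
  };
}
```

Fail-open is the deliberate trade-off for a backstop: if the limiter itself breaks, the worst case is an unthrottled sender rather than a tenant who gets no reply at all.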
Message batching / serialization
src/index.ts:597-602: firstMessageDelayMs:0, debounceWindowMs:8000, maxBatchMessages:8, serializeProcessing:true — coalesces rapid follow-ups into one turn; a message arriving mid-turn queues behind it instead of aborting (eliminated duplicate/contradictory/side-effect-then-no-reply failures).
Incident triage quick reference
| Symptom | First check |
|---|---|
| Tenant gets "technical difficulties" emails | status.claude.com + grep logs for Overloaded/529 first (often an Anthropic outage, not our limiter). Then check for expired-S3 image history — lua chat clear the user |
| Outbound mail "missing" | Check SPAM (SPF/DKIM); confirm LUA_EMAIL_CHANNEL_ID set; it's the Lua send-message lane, not AgentMail |
| Brief sent 2–3× | Cron re-fire on a slow run; confirm the claim-guard collection (daily_report_sent/weekly_report_sent) |
| Burst of reminder emails | Backlog flush — a data-fix unblocked the hourly completion-feedback job |
| Vendor "ticket not recognised" | Duplicate vendor record (active vs inactive sharing an email); resolver prefer-active |
| Wrong/empty reply on heavy turn | ~60s platform fallback dropping the real reply; tool-sent receipts mitigate |
Gotchas & failure modes
- `dist-v2/sources` ≠ `src/`. Verify live with `lua <type> versions`; treat `dist-v2/sources/<hash>.ts` as the deployed mirror.
- Push auto-deploys skills only. Webhooks/persona/jobs need an explicit `deploy all`.
- No central state-machine guard on ticket status (see Data model).
- Humanize-error postprocessor not wired — only the webhook's status-based catch is live.
- BC secret placeholder offline → 401; verify BC-dependent fixes via the read-only `seed-data` webhook against prod.