
Operations & runbook

In one line: how to build, deploy, roll back, and survive failures — including the LLM failover lever, the "technical difficulties" platform fallback, and the deployment-drift trap between src/ and the live build.

What it is

HIGHFIELD is live, so changes follow a strict discipline: diagnose before designing, make additive-only changes, ship one primitive per deploy, and smoke-test between deploys. This page covers the deploy/rollback mechanics and the failure-mode levers an operator reaches for.

How it works — build, deploy, rollback

⚠️ This section is operational practice, not all source-verifiable. Confirm against your current lua-cli version and the team's runbook before acting.

  • Typecheck: npx tsc -p tsconfig.json --noEmit (ignore dist-v2 errors). Avoid compound tsc/lua compile commands — the deploy hook blocks them.
  • Agent versioning is ON. Push with lua push all --force, deploy with lua deploy all --force (granular push is deprecated). A bare push auto-deploys skills only; webhooks/persona/jobs need deploy all.
  • Deploys are run by the human operator (the deploy hook blocks the assistant even with the LUA_DEPLOY_CONFIRMED=1 prefix). The assistant stages/pushes; a person runs LUA_DEPLOY_CONFIRMED=1 lua deploy <type>.
  • Rollback: per-primitive lua deploy --set-version <prior> flips the live pointer without pushing (surgical). lua source rollback is workspace-wide and tramples unrelated work; avoid it unless that is the intent.
  • Verify LIVE state by timestamp/version rather than assuming a push deployed: lua <type> versions shows the ⭐ DEPLOYED label. An "in sync" result from sync --check only compares version labels, not code.
  • Poisoned per-user memory: lua chat clear --user <id|email|mobile> wipes a user's agent conversation memory (the escape hatch for expired-S3-URL history or runaway context).

⚠️ Deployment drift. The deployed build mirror is dist-v2/sources/<hash>.ts and can diverge from the working-tree src/. Example: the live build has LLM_FAILOVER_MODEL + model: () => resolveModel(...) and a different job set; the current branch hardcodes the model (src/index.ts:567) and src/utils/model-failover.ts does not exist. Never assume src/ == production.

How it works — failure-mode levers

LLM failover (LLM_FAILOVER_MODEL)

In the deployed build, the agent model is () => resolveModel(env('LLM_FAILOVER_MODEL')). PRIMARY = anthropic/claude-sonnet-4-6; the default failover is google/gemini-2.5-pro. Toggle: unset/off/false/0 → primary; on/1/true/gemini → Gemini; an explicit provider/model string is used verbatim. The flip is one command, reversible in seconds, and needs no redeploy, which makes it the lever to pull during an Anthropic 529 outage. Lua has no in-request retry/fallback on 529, which is why this env lever exists.
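The toggle rules above can be sketched as a small pure function. This is an illustration only, not the deployed resolveModel: the trimming and case-insensitive matching are assumptions, and the real source is not available on the current branch.

```typescript
// Sketch only: the deployed resolveModel may differ in details.
const PRIMARY_MODEL = "anthropic/claude-sonnet-4-6";
const DEFAULT_FAILOVER_MODEL = "google/gemini-2.5-pro";

// Maps the raw LLM_FAILOVER_MODEL value to a model id, following the toggle rules above.
function resolveModel(raw: string | undefined): string {
  const value = (raw ?? "").trim();
  const key = value.toLowerCase(); // assumption: toggle words are matched case-insensitively

  // unset / off / false / 0 -> stay on the primary model
  if (key === "" || key === "off" || key === "false" || key === "0") {
    return PRIMARY_MODEL;
  }
  // on / 1 / true / gemini -> default failover model
  if (key === "on" || key === "1" || key === "true" || key === "gemini") {
    return DEFAULT_FAILOVER_MODEL;
  }
  // Anything else is treated as an explicit provider/model string and used verbatim.
  return value;
}
```

Because the mapping is a pure function of the env value, an operator can reason about the outcome of a flip before setting the variable.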

⚠️ This lever is in the deployed build, absent from the current branch source. See drift note above.

"Technical difficulties" platform fallback

The Lua platform substitutes a generic "...technical difficulties... try again later..." message on any channel when a turn crashes (oversized attachment, Anthropic 529). It is server-side and not SDK-configurable.

  • src/utils/error-reply-copy.ts provides warm replacement copy (getAwayReplyMessage()), chosen by channel.
  • Live use: the email webhook's AI.generate catch (inbound-email.webhook.ts:651-673) is a status-based trigger — it emails the away copy and returns 200; if even that fails, silent 200.
  • isPlatformErrorFallback(text) detects the platform string precisely (full contiguous phrases; never the bare words "technical difficulties"; any MT-… reference is a hard negative guard). ⚠️ It is NOT wired as a postprocessor — a platform fallback on the native channel (not the webhook) would not be rewritten.

Retry / timeout / backoff

  • No in-request LLM retry (use the failover lever instead).
  • BC custom-extension writes: withParentRetry retries on FK-not-yet-visible with backoff (1s/2s/4s); syncMessage retries on entryNo collision. HTTP timeouts 30s/60s.
  • Ticket-ID allocation: cross-process duplicate backstopped by BC's "already exists" retry.
  • Slot no-pick timeout: SLOT_PICK_TIMEOUT_HOURS (default 12h) auto-cancels.
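The 1s/2s/4s backoff used for BC writes can be sketched as below. This is a minimal illustration, not the source of withParentRetry: the helper name retryWithBackoff, the isRetryable predicate, the injectable sleep, and the "one initial attempt plus three retries" count are all assumptions.

```typescript
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Delays between attempts, matching the 1s/2s/4s schedule quoted above.
const BACKOFF_MS: readonly number[] = [1000, 2000, 4000];

// Hypothetical helper: runs `op`, retrying while `isRetryable(error)` is true.
// Makes one initial attempt plus at most BACKOFF_MS.length retries (assumed count);
// non-retryable errors and the final failure are rethrown to the caller.
async function retryWithBackoff<T>(
  op: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  sleep: Sleep = defaultSleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (error) {
      const delay = BACKOFF_MS[attempt];
      // Out of retries, or an error we should not retry: give up immediately.
      if (delay === undefined || !isRetryable(error)) throw error;
      await sleep(delay);
    }
  }
}
```

Passing sleep as a parameter keeps the schedule testable without real waiting; in the FK-not-yet-visible case, isRetryable would inspect the BC error and return true only for that condition.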

Rate-limit backstop

Per-sender cap, default 40 replies / 10 min (AUTO_REPLY_RATE_LIMIT / AUTO_REPLY_RATE_WINDOW_MIN); fail-open; email-only; enforced at inbound-email.webhook.ts:113-119.
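A per-sender cap with fail-open behavior might look like the sketch below. It assumes an in-memory sliding window and case-insensitive sender keys; the actual check at inbound-email.webhook.ts:113-119 may count and store replies differently, and createSenderRateLimiter is a hypothetical name.

```typescript
// Defaults quoted in this runbook (AUTO_REPLY_RATE_LIMIT / AUTO_REPLY_RATE_WINDOW_MIN).
const DEFAULT_RATE_LIMIT = 40;
const DEFAULT_RATE_WINDOW_MIN = 10;

// Hypothetical helper: returns a function answering "may we auto-reply to this sender now?"
function createSenderRateLimiter(
  limit: number = DEFAULT_RATE_LIMIT,
  windowMin: number = DEFAULT_RATE_WINDOW_MIN,
): (sender: string, nowMs?: number) => boolean {
  const windowMs = windowMin * 60_000;
  const sentAt = new Map<string, number[]>(); // sender -> timestamps of allowed replies

  return (sender: string, nowMs: number = Date.now()): boolean => {
    try {
      const key = sender.trim().toLowerCase(); // assumption: senders compared case-insensitively
      const cutoff = nowMs - windowMs;
      // Keep only replies that are still inside the window.
      const recent = (sentAt.get(key) ?? []).filter((t) => t > cutoff);
      const allowed = recent.length < limit;
      if (allowed) recent.push(nowMs); // blocked attempts are not counted
      sentAt.set(key, recent);
      return allowed;
    } catch {
      // Fail-open: if the bookkeeping itself breaks, send the reply rather than drop it.
      return true;
    }
  };
}
```

Fail-open means a limiter fault can never silence legitimate replies; the trade-off is that the cap is a backstop against loops, not a hard guarantee.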

Message batching / serialization

src/index.ts:597-602: firstMessageDelayMs:0, debounceWindowMs:8000, maxBatchMessages:8, serializeProcessing:true — coalesces rapid follow-ups into one turn; a message arriving mid-turn queues behind it instead of aborting (eliminated duplicate/contradictory/side-effect-then-no-reply failures).
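The effect of serializeProcessing can be pictured as a per-conversation promise chain. The sketch below only illustrates the queue-behind-the-running-turn behavior; it is not the Lua platform's implementation, and runSerialized and its key parameter are hypothetical.

```typescript
// Tail of the work queue for each conversation key (for example a user or thread id).
// Entries are never evicted here; a real implementation would clean up idle keys.
const queueTails = new Map<string, Promise<void>>();

// Hypothetical helper: runs `turn` only after every earlier turn for the same key has settled.
function runSerialized(key: string, turn: () => Promise<void>): Promise<void> {
  const previous = queueTails.get(key) ?? Promise.resolve();
  // Swallow the previous turn's failure so one crashed turn does not block the queue.
  const next = previous.catch(() => undefined).then(turn);
  queueTails.set(key, next);
  return next;
}
```

A message that arrives mid-turn simply becomes the next link in the chain, so it is processed after the running turn finishes instead of aborting it.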

Incident triage quick reference

| Symptom | First check |
| --- | --- |
| Tenant gets "technical difficulties" emails | status.claude.com + grep logs for Overloaded/529 first (often an Anthropic outage, not our limiter). Then check for expired-S3 image history and lua chat clear the user |
| Outbound mail "missing" | Check SPAM (SPF/DKIM); confirm LUA_EMAIL_CHANNEL_ID set; it's the Lua send-message lane, not AgentMail |
| Brief sent 2–3× | Cron re-fire on a slow run; confirm the claim-guard collection (daily_report_sent/weekly_report_sent) |
| Burst of reminder emails | Backlog flush: a data-fix unblocked the hourly completion-feedback job |
| Vendor "ticket not recognised" | Duplicate vendor record (active vs inactive sharing an email); resolver prefer-active |
| Wrong/empty reply on heavy turn | ~60s platform fallback dropping the real reply; tool-sent receipts mitigate |

Gotchas & failure modes

  1. dist-v2/sources ≠ src/. Verify live with lua <type> versions; treat dist-v2/sources/<hash>.ts as the deployed mirror.
  2. Push auto-deploys skills only. Webhooks/persona/jobs need an explicit deploy all.
  3. No central state-machine guard on ticket status (see Data model).
  4. Humanize-error postprocessor not wired — only the webhook's status-based catch is live.
  5. Offline, the BC secret is a placeholder, so BC calls return 401; verify BC-dependent fixes via the read-only seed-data webhook against prod.