Skip to main content

Command Palette

Search for a command to run...

AI Production Operations Runbook for European SMEs

Runbook for European SMEs running AI in production: incident classification, cost monitoring, model versioning, and a 30-day rhythm.

Updated
9 min read
AI Production Operations Runbook for European SMEs

TL;DR: Runbook for European SMEs running AI in production: incident classification, cost monitoring, model versioning, and a 30-day rhythm.

Running AI tools in production is a different problem from choosing which tools to buy. Why this matters: a 20-person company that has deployed AI across customer communications, internal workflows, and reporting now has a set of operational dependencies that can fail, drift, or generate costs in ways that are invisible without structured monitoring. This runbook gives operations leaders and technical teams a practical framework for managing those dependencies without enterprise-scale tooling or headcount.

The term "runbook" comes from software operations. It means a documented set of procedures for recurring situations. For a mid-sized company running AI in production, it means having written answers to three questions before the situation arises: what broke, who owns it, and what do we do first.


What "AI in Production" Means for a 20-50 Person Company

Enterprise AI Ops frameworks assume dedicated ML engineering teams, model training infrastructure, and observability tooling that costs more than most SME annual AI budgets. That framing does not apply here.

For a growing software team or professional services firm, "AI in production" falls into four practical categories:

1. Customer-facing automation. AI that touches customers directly: email response drafting, chatbot interactions, proposal generation, or document summarisation sent externally. This category carries the highest reputational and regulatory risk. Any failure here is a business event, not just a technical one.

2. Internal process automation. AI running inside workflows that employees depend on daily: meeting summaries, internal report generation, data extraction from documents, HR query handling. Failures here affect productivity and employee trust.

3. Developer tooling. AI coding assistants, code review tools, or test generation tools used by a technical team. Failures here affect delivery speed. Security considerations around code exposure to external APIs apply specifically to this category.

4. Analytics and reporting. AI summarising data, flagging anomalies, or generating dashboards. Failures here affect decision quality, which may not be immediately visible.

Each category has a different failure mode, a different stakeholder who notices first, and a different acceptable response time. Your runbook should address all four that apply to your organisation.


Incident Classification for AI Systems

Not all AI failures are equal. A classification system keeps responses proportionate and prevents either under-reaction (ignoring a serious problem) or over-reaction (treating a minor model hiccup as a crisis).

P1: Critical. An AI system has produced and delivered an output that causes immediate harm. Examples: incorrect personal data sent to the wrong recipient via an automated email; a customer-facing chatbot providing false safety or legal information; an internal reporting tool generating figures that led to a financial decision before the error was caught.

Trigger: any external delivery of harmful AI output, or any internal AI output acted upon with material consequences before detection.

Who gets notified: AI Lead (or equivalent) within one hour, founder or CEO within two hours if the impact is external.

First response: pause the affected workflow. Do not attempt to correct with a follow-up AI output. Assess scope manually before any further automated action.

P2: Significant. An AI system is producing consistently degraded output, failing silently, or generating costs significantly above baseline. Examples: a model responding to all inputs with generic refusals; token costs three times the daily baseline for 24 hours; an automated workflow completing zero tasks over a business day.

Trigger: any quality or cost metric outside two standard deviations of baseline for more than four hours during business hours.

Who gets notified: AI Lead. Escalate to P1 if the root cause cannot be identified within four hours.

First response: switch affected workflows to manual process. Log the degradation with timestamps.

P3: Minor. Isolated output quality issues, single-instance errors not delivered externally, or expected model variability. No immediate action required beyond logging.

Who gets notified: logged in the weekly review. No individual notification required.


Model Version Management

AI vendors update their underlying models regularly. For a technical team running prompts against an API, a model update can silently change output format, tone, refusal behaviour, or reasoning quality. None of these changes require your consent.

Three practices manage this risk at SME scale:

When to update. Do not update automatically. When a vendor announces a new model version, schedule a planned update during a low-traffic period. Emergency updates (security patches) are an exception.

What to test before updating. Maintain a regression test list for each production workflow. This does not require a test framework. A spreadsheet with ten representative inputs and their expected output formats is sufficient. Before any model update, run those inputs against the new version and confirm outputs fall within acceptable range.

What to preserve. Version-control your prompts. A prompt is a contract between your workflow and the model. When a model changes, the prompt may need to change with it. If you cannot recall what your prompt said six weeks ago when a workflow started producing different results, you cannot diagnose the problem. Store prompts in your version control system alongside the code that calls them.

Output format contracts matter especially for internal process automation and analytics. If downstream systems parse AI outputs by structure (JSON, bullet lists, specific field names), preserve those format specifications explicitly in the prompt and test them with every model update.


Capacity and Cost Monitoring

AI API costs are consumption-based and can spike without warning. For a founder-led company or mid-sized company without a dedicated finance-tech function, three metrics are sufficient for operational control:

Daily token spend. Set a daily cost alert at 150 percent of your rolling 7-day average. Most AI API platforms support this natively. If they do not, a daily export to a spreadsheet with a conditional flag takes thirty minutes to set up and will catch runaway loops or unexpected usage spikes before they become monthly billing surprises.

Per-workflow cost baseline. For each production workflow, calculate the expected cost per run based on typical input/output token counts. When a workflow's per-run cost exceeds 200 percent of baseline, treat it as a P2 incident. Common causes: input data growing unexpectedly (someone feeding a 200-page document to a tool designed for 5-page inputs), prompt changes that increase output length, or a model update that generates more verbose responses.

Error rate. Track the percentage of API calls that return errors (rate limit, timeout, model error) per workflow per day. A baseline error rate of under 2 percent is typical. Above 5 percent sustained for more than two hours is a P2 incident. Above 20 percent is P1.

These three metrics do not require observability infrastructure. They require a daily review habit and a place to log the numbers.


The 30-Day Operational Rhythm

Consistent operational rhythm prevents the accumulation of small problems into large ones. For a 20-50 person company, three review cycles cover the full operational surface:

Weekly review (30 minutes). Owner: AI Lead or technical lead. Agenda: (1) review cost metrics against baseline, (2) review error rates across production workflows, (3) review any P3 incidents logged since last week, (4) confirm all production workflows ran as expected. Document any anomalies. Escalate anything that cannot be explained.

Monthly review (60 minutes). Owner: AI Lead with operations representative. Agenda: (1) performance review across all four production AI categories, (2) vendor notifications received since last month (model updates, terms changes, pricing changes), (3) review and update regression test lists for any workflows that changed, (4) confirm cost baselines are still accurate, (5) review any P1 or P2 incidents and confirm root causes are resolved. Document outcomes.

Quarterly review (90 minutes). Owner: AI Lead with founder or CTO as sponsor. Agenda: (1) model update schedule: review any vendor announcements, plan updates, (2) contract review: pricing, data processing agreements, service level terms, (3) portfolio review: which production workflows are delivering measurable value, which should be retired, (4) EU AI Act compliance check: any new guidance relevant to current deployments.

This rhythm is compatible with the AI governance committee structure described in AI Governance Committee Charter for European SMEs 2026. The weekly and monthly reviews are operational; the quarterly review feeds into the committee's strategic session.

If your organisation is assessing operational maturity against this framework and wants an external evaluation, the AI readiness assessment covers production AI operations as one of its five assessment dimensions.


FAQ

How do we handle an AI vendor changing their model without notice?

Most vendors provide some notice via changelog or release notes, but the timelines are short and the changes are not always flagged as breaking. The protection is maintaining your regression test list and running it as part of your weekly review, not just when you know an update is coming. If outputs start drifting, run your regression tests immediately to determine whether the model changed or your inputs changed. Prompt versioning means you can rule out internal changes first.

What should a 20-person company do if it cannot afford dedicated AI operations headcount?

The practices in this runbook are designed for existing staff with part-time operational responsibility. The AI Lead role (typically two to four hours per week for a company with three to five production AI workflows) covers the weekly review, incident response, and vendor monitoring. The monthly and quarterly reviews are group sessions, not individual work. No additional headcount is required for a portfolio of under ten production workflows. Above that threshold, consider whether AI operations has become a distinct function.

Is this runbook compatible with EU AI Act requirements?

The EU AI Act's deployer obligations include human oversight of high-risk systems, monitoring procedures, and incident logging. The incident classification system (P1/P2/P3), the regression testing practice, and the documented review rhythm together constitute the monitoring and human oversight procedures the Act requires. The quarterly compliance check in the quarterly review provides the scheduled touchpoint to incorporate new regulatory guidance as the Act's implementing acts are published through 2026 and 2027.


Further Reading