How to Run an AI Pilot to Production: A 90-Day Framework for European SMEs
Most AI pilots stall before production. This 90-day framework covers scoping, KPIs, kill criteria, and the handoff checklist for European SME teams.
TL;DR: Most AI pilots stall before production. This 90-day framework covers scoping, KPIs, kill criteria, and the handoff checklist for European SME teams.
Consider a 35-person logistics software firm in Rotterdam. Six months ago, the operations lead ran an AI pilot to summarise incoming client emails and draft initial routing responses. The pilot ran for three weeks. Everyone agreed it was "promising." Then the team lead who championed it moved on to a quarterly planning cycle, no one had a clear owner for production infrastructure, and the tool quietly stopped being used.
That story is not unusual. Across founder-led firms and mid-sized operations teams in Europe, AI pilots finish their trial period without a single production deployment. The technology worked well enough. The process did not.
This article gives you a structured AI pilot to production framework that runs in 90 days. It covers the three phases, how to write kill criteria before you start, what GDPR and the EU AI Act require you to log in production, and the four-item handoff checklist that determines whether your pilot becomes a real system or a completed experiment.
Why Most AI Pilots Stall
Three structural problems explain the majority of failed AI pilot to production transitions in European SMEs.
The wrong success metric. Teams often measure whether the tool produced output that looked reasonable. That is a subjective impression, not a metric. Without a baseline measurement taken before the pilot, there is nothing to compare the output against at the end.
No kill criteria defined before day one. If you never wrote down the condition under which you stop, you will extend indefinitely. "Let us give it another few weeks" replaces the decision. Kill criteria force a binary outcome: either the threshold was met, or it was not.
No production owner named from the start. The person who runs the pilot is often not the person whose job it is to maintain a production system. When those roles are separated and the handoff is not planned, the pilot finishes and the production question defaults to "someone else's problem."
The 90-day framework below is designed to close all three gaps before they open.
Phase 1 (Days 1 to 30): Scope, Baseline, and Use-Case Selection
The first 30 days are entirely about definition. No AI model should be running in a live context yet.
Pick one workflow, not the whole company. This is the most common scoping error in a professional services company or 30-person software house: the pilot tries to touch four different processes at once and produces diffuse, inconclusive results. Select one workflow where the input is structured, the output is reviewable by a human, and the time cost of the current process is measurable.
Take a baseline measurement. Before any AI tooling is introduced, measure the current state of that workflow. Typical baselines for SME pilots: average time per task, error rate per 100 outputs, or staff hours per week consumed. This number is your comparison point at day 60.
Write the kill criteria on day one. The kill criteria should be a single sentence: "If we do not see [X improvement or threshold] by day 60, we stop the pilot." Examples: "If processing time has not dropped by at least 25 percent" or "If error rate has not fallen below 5 percent on the reviewed sample." This sentence should be signed off by whoever holds budget authority: the CEO, the technical lead, or both.
Name a production owner before the pilot starts. This person is not the champion or the vendor contact. This is the internal operator who will own the system if it passes the day-60 decision. They should be involved from day one so they understand the architecture, the data flows, and the compliance obligations before they inherit the system.
For a broader view of how this scoping work fits into a multi-system transformation programme, see the 90-Day AI Platform Transformation Framework for Fractional CTOs.
Phase 2 (Days 31 to 60): Controlled Rollout and Measurement
In the second phase, you introduce the AI system to a group of three to five users in a controlled setting. This is not a full team rollout. The purpose is to generate clean, comparable data.
Format for the weekly check-in. Hold a 15-minute session each week with the pilot users and the production owner. The agenda has three fixed points: (1) what the metric looks like this week versus baseline, (2) any errors or anomalies that need to be logged, (3) any data-handling concerns to surface before the day-60 decision. Keep notes. These notes form part of your compliance documentation under GDPR accountability requirements.
Measure against the baseline every week. Do not wait until day 60 to look at the numbers. Weekly measurement surfaces problems early and gives you time to adjust the prompt, the input format, or the user workflow before the go/kill decision.
Log what the system is doing. For any AI system that will move into production in a European context, you need a logging structure in place during the pilot, not after. The EU AI Act (Regulation 2024/1689), particularly Article 13 on transparency obligations for high-risk AI systems, requires that production AI systems maintain records of decisions made, the model version in use, and the nature of the data inputs. Even if your use case falls below the high-risk threshold, building logging habits during the pilot dramatically reduces the compliance work required if the system scope expands later.
What to log during the pilot: the model version or API version used, the category of data input (not the personal data itself), the output decision or classification, and whether a human reviewed or overrode the output. This logging approach also satisfies the GDPR Article 5 accountability principle, which requires that your organisation can demonstrate how personal data is processed in automated systems.
Phase 3 (Days 61 to 90): Go/Kill Decision and Production Handoff
At day 60, you have six weeks of controlled measurement data. The go/kill decision is made against the kill criteria you wrote on day one.
If the threshold was not met, stop. Document the result, share it with the team, and treat it as useful data. A clean kill is not a failure; it is the outcome the framework was designed to produce when the evidence does not support continuing.
If the threshold was met, execute the production handoff checklist.
The checklist has four mandatory items:
Incident response owner confirmed. Who receives the alert if the system produces a critical error or behaves unexpectedly? This must be a named individual, not a team or a role.
Rollback procedure documented and tested. Before go-live, the production owner must have executed a rollback at least once in the test environment. This means reverting to the pre-AI workflow without data loss.
Usage monitoring in place. Volume of requests, latency, error rate, and human override rate should all be observable from day one of production. This is both an operational requirement and the foundation for the post-production review at day 90.
Model version pinned. The specific model or API version used in the pilot must be locked at go-live. Automatic model upgrades from a vendor can change output behaviour in ways that invalidate your baseline comparison. Pin the version and schedule a deliberate review before any upgrade.
The full production operations runbook (covering incident response, monitoring architecture, and version governance) is documented in the AI Production Operations Runbook for European SMEs.
Days 61 to 90 in production serve as an extended observation window. The production owner reviews usage data weekly. At day 90, the team holds a structured review: did the system maintain the improvement seen in the pilot? Have any new compliance questions surfaced? Is the use case ready to be treated as standard infrastructure, or does it need a further review cycle?
GDPR and EU AI Act: What to Log in Production
For any AI system in production at a European company, the minimum logging standard covers four categories.
Decisions made by the system. What did the AI output, and what action did the business take as a result? This record supports both internal audit and any GDPR data subject access request.
Model version. The exact model identifier or API version active at the time of each decision. This matters when outputs change after a vendor update.
Data input category. Not the raw personal data, but the category: "email text from a customer," "invoice line items," "HR leave request." This is sufficient for Article 30 GDPR records of processing activities and does not require storing the underlying data in the log.
Human review or override. Whether a human reviewed the AI output before it affected a decision. Systems where humans routinely override AI outputs tell you something important: either the model is underperforming, or the workflow design needs adjustment.
For teams exploring more autonomous AI systems where the human review step is reduced or removed, the governance and accountability requirements increase substantially. The Agentic AI Adoption Framework for European SMEs covers the additional compliance architecture required in those contexts.
Summary
The 90-day AI pilot to production framework reduces the three structural failure modes to process decisions made before the pilot begins. Define a single workflow. Measure the baseline. Write the kill criteria. Name the production owner. Run a controlled 30-day rollout with weekly check-ins. Make the go/kill call against the data at day 60. If you proceed, execute the four-item handoff checklist and operate under structured logging from day one.
For a technical lead or operations director at a 30-person software house or mid-sized operations team, this framework turns a pilot from an open-ended experiment into a time-bounded, evidence-driven decision. The discipline required is organisational, not technical.
If you want to assess whether your organisation has the governance structure to support a production AI system, start with our AI Readiness Assessment.
FAQ
How long should an AI pilot last before we decide to go to production?
The 90-day framework places the go/kill decision at day 60, after 30 days of controlled measurement with three to five users. Days 61 to 90 serve as the initial production observation period. Pilots that run beyond 90 days without a clear decision point tend to drift: the team loses focus, the baseline comparison becomes stale, and the vendor contract starts to feel like a sunk cost. Sixty days of measurement data is sufficient for most SME use cases to make a defensible decision.
What should our kill criteria look like in practice?
Kill criteria should be a single, measurable sentence agreed before day one: "If the average processing time for invoice classification has not fallen by at least 30 percent by day 60, we stop." The metric must be something you measured before the pilot started, and the threshold must be agreed by whoever holds budget authority. Avoid vague criteria like "significant improvement" or "positive feedback from users." Those are impressions, not measurements.
Does the EU AI Act apply to our AI pilot, or only to production systems?
The EU AI Act's obligations (including logging requirements under Article 13 for high-risk systems) apply to systems in operation, not to time-limited pilots running in a controlled setting. However, building your logging architecture during the pilot means you have no compliance retrofit work if the system moves to production. Most European SMEs benefit from treating pilot logging as production-standard from day 30 onward, even if the strict regulatory obligation does not attach until go-live.

