Skills, Memory, and Agent Harnesses Are the Next AI Platform Layer
European scale-ups can turn reusable skills, governed memory, and audited harnesses into a compliance layer for EU AI Act and DORA.
The AI agent hype cycle has crested. Chat interfaces alone do not make agents useful. The next platform layer is reusable skills, governed memory, and audited execution harnesses. Why this matters: European scale-ups face the EU AI Act regulatory sandbox milestone by 2 August 2026 (S1) and DORA Article 28 third-party risk reporting (S6). For CTOs, platform engineering leads, AI transformation leads, security leads, privacy leads, product/operations leaders, and procurement-aware engineering managers, this layer is not a convenience; it is a compliance artefact.
The short version
Skills are versioned, reusable capabilities any agent can call. Memory is the governed subset of context an agent can read and write across sessions, with a documented retention and scope policy (S4, S9). The harness is the execution and audit layer: it governs who calls which skill, with what arguments, against which tools, and produces a structured log for compliance (S5, S7). Together they form a platform layer that turns AI agents from experimental toys into operational tools. For European scale-ups, this layer is the natural home for EU AI Act conformity assessments and DORA third-party risk reports.
Why this matters for European scale-ups
Founder-led companies and growing software teams are under pressure to deploy AI without introducing uncontrolled risk. The EU AI Act (S1) requires high-risk AI systems to be auditable, explainable, and reversible by 2026. DORA (S6) demands that financial sector third-party risk from AI tools be tracked and reported. Meanwhile, security leads worry about prompt injection (S8) and memory poisoning, while privacy leads must ensure that agent memory does not become an ungoverned data lake (S9). The skills-memory-harness layer addresses all these concerns structurally, not through policy workarounds.
The audit-facing version of this argument is sharper. EU AI Act Article 16 obligations are not satisfied by a vendor questionnaire response that says "our agent calls an LLM and uses good prompts". They are satisfied by repeatable evidence: a versioned skill registry with an owner per skill, a memory policy document with retention windows per data class, and a harness log that captures every tool call with prompt context, arguments, result, and latency. The same three artefacts feed DORA Article 28 (S6) third-party risk reports without re-derivation. For a 30-person engineering scale-up, the difference between a half-day audit response and a two-week audit response is whether the skills-memory-harness layer was structured up front. Finance teams and operations leaders who sign off the compliance budget should treat the layer as the single most reusable compliance investment available; the same artefacts cover EU AI Act, DORA, and most internal incident response runbooks at once. ENISA (S14) explicitly frames data-flow and tool-execution risks as priority threats for European deployments, which makes the harness log a regulator-readable artefact, not just an internal nicety.
What "skills, memory, and harnesses" actually means
These three components form a new platform layer that sits between the language model and the application. Skills replace ad-hoc prompts with named, versioned capabilities. Memory replaces raw context with a controlled, policy-governed store. The harness replaces the chat interface as the primary execution surface, logging every call and enforcing boundaries.
The shift is similar in shape to what happened with web applications around 2008 to 2012. Teams stopped writing PHP files per page and started using frameworks that separated routing, business logic, persistence, and audit logging into distinct, testable layers. The framework imposed structure, and structure produced reliability. Agents in 2026 are at the equivalent inflection point: prompts have done their job, the same way raw PHP did its job, and the next reliability gain comes from a structured platform layer. A skill catalog plays the role of route handlers. Governed memory plays the role of persistence with controlled access. The harness plays the role of the request-response middleware: authentication, authorization, logging, rate limiting, and traceability. Once a team has all three in place, the question shifts from "did the model produce a good response" to "can we explain why the agent took this action, and can we reverse it cleanly". That is the question regulators ask, and it is the question that distinguishes a controlled rollout from a risk event.
The three operating components
Skills: Reusable, named, versioned capabilities that a team can call from any agent. Skills encapsulate tool calls, data lookups, and business logic, and are stored in a versioned registry (S2, S3). They enable teams to reuse proven components across multiple agents without duplicating prompts.
Memory: The controlled subset of context an agent is allowed to read and write across sessions. Memory has a documented retention policy, a scope boundary (e.g., user-level vs. session-level), and a privacy review process that aligns with GDPR Article 30 (S4) and EDPB guidance (S9).
Harness: The execution and audit layer that governs how skills are called, which tools are permitted, and what is logged. The harness is the enforcement point for access control, rate limiting, and audit-trail generation. It also exposes a surface for MCP servers (S5) to plug in, and it is the primary defence against prompt-injection attacks (S8) because it controls tool invocation independently of the prompt.
Concrete shape of each component for a 30-person scale-up. The skill catalog is typically a git-tracked directory of YAML or TypeScript files, one per skill, each with a name, a version, a description, an owner, the tools it may call, the input and output schema, and a list of test fixtures. The memory store is typically a row-per-record table in PostgreSQL or a vector store with a per-record retention timestamp and a scope key; the policy that governs reads and writes lives in the same git repository. The harness is a small service (usually under 1,000 lines) that wraps the model call, the skill registry, and the memory store, and that emits a structured audit log per invocation. None of these pieces require a new vendor; all three can be built from open-source components combined with the team's existing observability stack. The investment is in the discipline of treating skills, memory, and the harness as first-class platform concerns rather than scattered prompt files.
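To make the catalog concrete, here is a minimal sketch of what one registry entry could look like if a team takes the TypeScript route; the `SkillDefinition` shape and the `invoice-lookup` example are illustrative assumptions, not a vendor schema.

```typescript
// Hypothetical shape of one skill-catalog entry; the field names are
// illustrative assumptions, not a vendor schema. Each skill file is
// git-tracked and semantically versioned.
interface SkillDefinition {
  name: string;             // stable identifier, e.g. "invoice-lookup"
  version: string;          // semantic version, bumped on any behavioural change
  description: string;      // what the skill does, in audit-readable language
  owner: string;            // accountable person or team (EU AI Act evidence)
  permittedTools: string[]; // allowlisted tool identifiers the skill may call
  inputSchema: object;      // JSON Schema for arguments, validated by the harness
  outputSchema: object;     // JSON Schema for results
  testFixtures: string[];   // fixture paths the CI pipeline replays on every bump
}

// An illustrative entry; the billing tool and schemas are invented for the example.
const invoiceLookup: SkillDefinition = {
  name: "invoice-lookup",
  version: "1.2.0",
  description: "Fetches an invoice summary by invoice id from the billing API.",
  owner: "platform-engineering",
  permittedTools: ["billing-api.read"],
  inputSchema: {
    type: "object",
    required: ["invoiceId"],
    properties: { invoiceId: { type: "string" } },
  },
  outputSchema: {
    type: "object",
    properties: { total: { type: "number" }, currency: { type: "string" } },
  },
  testFixtures: ["fixtures/invoice-lookup/basic.json"],
};
```

Every field doubles as audit evidence: owner and version feed the conformity artefacts discussed below, and the fixtures give the harness something deterministic to replay before a new version is promoted.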
The maturity model: from prompts to platform layer
| Level | Description | What lives where | Failure mode |
| --- | --- | --- | --- |
| L0 prompt chaos | Ad-hoc prompts scattered across teams | No central repository | Teams rebuild the same agent logic repeatedly |
| L1 prompt library | Shared folder of prompts | File system or wiki | Prompts become stale; no versioning |
| L2 reusable skills | Named, versioned skills in a catalog | Dedicated skill registry | Skills lack memory governance |
| L3 governed memory | Memory with retention and scope policies | Policy engine + context store | No execution audit trail |
| L4 audited harness | Full execution logs, access control, SBOM | Harness platform | Overhead if not automated |
A 30-day adoption path
Days 1-7: Audit + skill catalog
- Owner: CTO, platform engineering lead
- Artifact: Skill inventory (list of all existing prompt-based agent capabilities)
- Success criterion: 20 skills cataloged in a versioned registry (e.g., a Git-based store with semantic versioning)
Days 8-21: Controlled-memory pilot
- Owner: AI transformation lead, security lead, privacy lead
- Artifact: Memory policy document (retention, scope, access rules); a register sketch follows this step
- Success criterion: Pilot agent passes privacy review; memory operations are logged and reversible
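A minimal sketch of what the memory policy artifact could look like when kept in code next to the skill catalog; the data classes, field names, and values below are illustrative assumptions, not legal advice.

```typescript
// Illustrative memory policy register, one entry per data class the agent may
// touch. Field names, data classes, and values are assumptions, not legal advice.
type Scope = "session" | "user" | "organisation";

interface MemoryPolicyEntry {
  dataClass: string;     // e.g. "user-preference", "conversation-summary"
  scope: Scope;          // boundary the record is keyed to
  retentionDays: number; // purge window, per GDPR storage limitation
  containsPII: boolean;  // true triggers the privacy review path
  legalBasis: string;    // feeds the GDPR Article 30 (S4) register
  purgeOwner: string;    // role accountable for executing the purge
}

const memoryPolicy: MemoryPolicyEntry[] = [
  { dataClass: "user-preference", scope: "user", retentionDays: 365,
    containsPII: true, legalBasis: "contract", purgeOwner: "privacy-lead" },
  { dataClass: "conversation-summary", scope: "session", retentionDays: 30,
    containsPII: true, legalBasis: "legitimate-interest", purgeOwner: "platform-engineering" },
];
```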
Days 22-30: Harness + audit-trail rollout
- Owner: Operations leader, procurement-aware engineering manager
- Artifact: Harness configuration (skill-to-tool mappings, audit-log schema); a record sketch follows this step
- Success criterion: First compliance report generated from harness logs, covering skill calls, memory accesses, and tool invocations
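A sketch of the audit-log record the harness could emit per invocation, using the field set this article names; the exact schema is an illustrative assumption.

```typescript
// One structured record per harness invocation. The field set mirrors the
// evidence this article names; the exact schema is an illustrative assumption.
interface HarnessAuditRecord {
  traceId: string;   // stable identifier for correlation with app traces
  timestamp: string; // ISO 8601
  agentId: string;   // which agent identity made the call
  skill: { name: string; version: string };
  toolCalls: { tool: string; args: unknown; resultCode: string }[];
  latencyMs: number;
  resultCode: "ok" | "rejected" | "error";
}

// An example record as the harness might emit it; values are invented.
const record: HarnessAuditRecord = {
  traceId: "tr-7f3a91",
  timestamp: "2026-02-03T09:15:00Z",
  agentId: "support-agent",
  skill: { name: "invoice-lookup", version: "1.2.0" },
  toolCalls: [{ tool: "billing-api.read", args: { invoiceId: "INV-42" }, resultCode: "ok" }],
  latencyMs: 840,
  resultCode: "ok",
};
```

Writing these records as append-only JSON lines to a store the agent holds no credentials for is what makes them usable as the tamper-resistant cross-reference discussed under "What must stay human-reviewed".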
What you can govern safely today
Using existing open-source and commercial tools, you can implement a basic harness that logs all agent actions, enforces an allowlist of MCP servers (S5), and rejects calls that exceed defined argument bounds. You can generate SBOMs (S7) for the harness binary itself and apply SLSA build provenance (S11) to your skill registry. OpenSSF Scorecard (S10) can rate the health of your skill repositories. NIST AI RMF (S13) provides a structured framework to communicate trade-offs to non-technical stakeholders. ENISA AI threat landscape (S14) gives a European-specific attack taxonomy to validate your harness defences.
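A minimal enforcement sketch of the two checks just described, an MCP-server allowlist and server-side argument bounds; the server names and the bounds table are illustrative assumptions.

```typescript
// Minimal enforcement sketch for the two checks just described. The server
// names and the bounds table are illustrative assumptions.
const MCP_ALLOWLIST = new Set(["billing-api", "crm-readonly"]);

const ARG_BOUNDS: Record<string, { maxLimit: number }> = {
  "crm-readonly.search": { maxLimit: 100 }, // per-tool bounds, security-reviewed
};

function authorizeToolCall(server: string, tool: string, args: Record<string, unknown>): void {
  if (!MCP_ALLOWLIST.has(server)) {
    throw new Error(`MCP server "${server}" is not on the reviewed allowlist`);
  }
  // Validate arguments server-side; never trust bounds asserted in the prompt.
  const bounds = ARG_BOUNDS[`${server}.${tool}`];
  if (bounds && typeof args.limit === "number" && args.limit > bounds.maxLimit) {
    throw new Error(`Argument "limit" exceeds the defined bound for ${server}.${tool}`);
  }
}
```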
Concrete first-week actions a platform engineering lead can take without budget approval:
- Inventory every prompt currently in production by grepping the codebase for known prompt markers; classify each as a candidate skill, a candidate memory entry, or a one-off.
- Pick the top three highest-traffic prompts and rewrite them as named YAML skills in a new git directory; this is the seed of the skill catalog.
- Add a per-invocation structured log to the existing model call site, capturing prompt, arguments, result, latency, and a stable trace identifier.
- Run the OpenSSF Scorecard (S10) against the new skill repository and record the score as the baseline.
- Sketch the memory policy on one page: what data classes the agent may read, what it may write, where it lives, how long it is retained, and which role purges it.
None of these five actions require a vendor decision; all five produce evidence the security lead and the procurement-aware engineering manager can review on day 7.
What must stay human-reviewed
Even with a mature harness, certain decisions must remain under human control to prevent catastrophic failures.
- Do not let a skill catalog become shadow automation; every skill must have a documented owner, a tested rollback, and a published audit signature.
- Do not allow memory to accumulate indefinitely without a retention review; memory that stores PII must be purged on the retention schedule documented in your GDPR Article 30 (S4) register.
- Do not grant an agent write access to production databases without a human-in-the-loop approval gate (a minimal gate sketch follows this list).
- Do not deploy a harness that can call arbitrary MCP servers; maintain an allowlist reviewed by operations and security leads.
- Do not use agent-generated logs as the sole source of truth for compliance; cross-reference with harness logs that cannot be modified by the agent.
- Do not give the same agent access to both internal finance data and customer support tools without separate context boundaries.
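For the production-write rule above, a minimal sketch of what the human-in-the-loop gate could look like inside the harness; the `Approval` shape and the single-use token flow are illustrative assumptions.

```typescript
// Sketch of a human-in-the-loop gate for production writes: the harness refuses
// any write-classified tool call without a pre-issued, unexpired approval token.
// The Approval shape and single-use flow are illustrative assumptions.
interface Approval { token: string; approver: string; expiresAt: number }

const approvals = new Map<string, Approval>(); // populated by a human review step

function gateProductionWrite(tool: string, approvalToken?: string): void {
  const approval = approvalToken !== undefined ? approvals.get(approvalToken) : undefined;
  if (!approval || approval.expiresAt < Date.now()) {
    throw new Error(`Write via ${tool} blocked: no valid human approval on file`);
  }
  approvals.delete(approval.token); // single-use: one approval covers one write
}
```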
How the platform layer maps to EU AI Act and DORA
(a) Skill metadata (version, author, test results), memory policy (retention, scope, purges), and harness logs (every call, argument, result, duration) together form the evidence required for EU AI Act (S1) conformity assessments. A scale-up can produce a compliance artifact that shows exactly which skills were used, in what context, and how memory was handled.
(b) DORA Article 28 (S6) requires financial firms to monitor third-party ICT providers and report incidents. A harness that logs every skill invocation and tool call can be consumed directly by DORA reporting pipelines, providing an unbroken chain of evidence for any incident involving an AI agent.
A concrete walk-through for a 30-person engineering scale-up. The platform engineering lead exports the skill catalog as JSON; each row carries skill name, version, owner, last test pass date, list of permitted tools. The privacy lead exports the memory policy as a one-page register: per data class, the retention window, the scope key, the purge cadence, and the legal basis. The AI transformation lead exports the harness audit log for the inventory window: per invocation, the timestamp, the agent identity, the skill name, the tool calls made, the result code, the latency, and a stable trace id. Combine the three exports and you have the EU AI Act technical-documentation kernel for any high-risk AI system using this agent platform; the same three exports become the DORA third-party-risk register entry and the EDPB-aligned (S9) Article 30 entry for processing activities. The procurement-aware engineering manager attaches these artefacts to the third-party register; the security lead cross-references them against incident response runbooks; the operations leader uses the same audit log to drive the quarterly skill-deprecation review. One investment, three regulators, four internal processes.
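A sketch of the assembly step, assuming the three exports land as JSON files; the file names and bundle shape are illustrative assumptions.

```typescript
// Sketch of the evidence-bundle assembly, assuming the three exports land as
// JSON files; the file names and bundle shape are illustrative assumptions.
import { readFileSync, writeFileSync } from "node:fs";

const bundle = {
  generatedAt: new Date().toISOString(),
  skillCatalog: JSON.parse(readFileSync("exports/skill-catalog.json", "utf8")),
  memoryPolicyRegister: JSON.parse(readFileSync("exports/memory-policy.json", "utf8")),
  harnessAuditLog: JSON.parse(readFileSync("exports/harness-audit-log.json", "utf8")),
};

// The same bundle feeds the EU AI Act technical documentation, the DORA
// third-party-risk register entry, and the GDPR Article 30 record.
writeFileSync("exports/compliance-bundle.json", JSON.stringify(bundle, null, 2));
```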
To assess your organization's readiness for this platform layer, start with our AI Readiness Assessment: https://radar.firstaimovers.com/page/ai-readiness-assessment. For hands-on implementation support, visit our AI Consulting page: https://radar.firstaimovers.com/page/ai-consulting.
Limits and failure modes
No platform layer is a silver bullet. Skills can become too granular, leading to a proliferation of micro-skills that are hard to manage. Memory policies can be too restrictive, crippling agent usefulness. The harness itself can become a bottleneck if not designed for low latency and high throughput. Prompt-injection attacks (S8) can still bypass some harness controls if the harness does not validate tool arguments server-side. Finally, the layer does not solve the fundamental problem of model hallucination or bias; it only governs how the model interacts with the world.
A second class of failure is operational. A skill registry that grows past 200 entries without a deprecation policy becomes a parking lot for half-tested capabilities. Mitigations: schedule a quarterly review where the platform engineering lead and the AI transformation lead jointly retire any skill with zero invocations in the prior 90 days; require a written deprecation notice for any skill an active workflow depends on; track skill-count growth as a process metric in the CTO's monthly review. The same applies to memory: a memory store without a documented purge cadence is a discovery liability waiting for a regulator request.
A third class is human. Engineers under shipping pressure will route around the harness for a quick fix by calling the model directly. This is the easiest failure to miss because the system still works. Mitigations: gate model-API credentials behind the harness service so direct calls are observable; surface direct-call counts as a process metric; require a written exception with a documented rollback for any temporary bypass. The security lead owns the metric; the operations leader owns the exception process. Track the bypass rate per quarter; if it exceeds 10 percent of agent invocations, the harness itself needs a redesign, not the bypass policy.
A fourth class deserves naming because European scale-ups hit it harder than US peers: memory poisoning through indirect injection. An agent retrieves a document into its context, and that document contains instructions that look benign to a human but tell the model to write a record into memory with elevated trust. The next agent run reads that record and treats it as authoritative. OWASP LLM Top 10 (S12) frames the broader risk class; the harness-specific mitigation is to namespace memory writes by the writing agent's identity, require a signed assertion for any cross-namespace read, and never let the model itself decide whether a memory record is authoritative.
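A minimal sketch of that namespacing mitigation, assuming an in-memory store and an HMAC-signed grant issued out-of-band by a human reviewer; all names here are illustrative.

```typescript
// Sketch of namespaced memory writes with signed cross-namespace grants,
// assuming an in-memory store; key names and the grant format are illustrative.
import { createHmac } from "node:crypto";

const SIGNING_KEY = process.env.MEMORY_GRANT_KEY ?? "dev-only-key";
const store = new Map<string, string>(); // key: `${namespace}:${recordKey}`

function writeMemory(agentId: string, key: string, value: string): void {
  // Writes land only in the writing agent's own namespace.
  store.set(`${agentId}:${key}`, value);
}

function readMemory(agentId: string, namespace: string, key: string, grant?: string): string | undefined {
  if (namespace !== agentId) {
    // Cross-namespace reads require a grant signed out-of-band by a reviewer;
    // the model never decides whether a record is authoritative.
    const expected = createHmac("sha256", SIGNING_KEY)
      .update(`${agentId}->${namespace}`)
      .digest("hex");
    if (grant !== expected) {
      throw new Error("Cross-namespace read denied: no valid signed grant");
    }
  }
  return store.get(`${namespace}:${key}`);
}
```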
Frequently Asked Questions
Q: Are skills, memory, and harnesses just another vendor wrapper? A: No. This is an architectural pattern, not a product. Several vendors offer components (e.g., Anthropic's tools (S2), OpenAI's Assistants API (S3), MCP specification (S5)), but the pattern is implementable with open-source building blocks like LangChain or Haystack, or with custom code.
Q: How do we decide what belongs in memory versus a skill? A: Memory stores data that the agent reads or writes across sessions (e.g., user preferences, conversation history). Skills encapsulate actions and computations that do not persist beyond the current use. A rule of thumb: if you need to keep a value for a future session, it belongs in memory; if you need to perform a deterministic operation, it belongs in a skill.
Q: Does this layer replace prompt engineering? A: It does not replace prompt engineering; it constrains it. Prompts still matter for instructing the model, but the skills-memory-harness layer ensures that even a poorly written prompt cannot break out of the governed boundaries. The harness is the structural defence against prompt injection (S8).
Q: How does an agent harness interact with MCP servers? A: MCP servers (S5) expose tools that the harness can call, but the harness sits between the agent and the MCP server. The harness decides which MCP server calls are allowed, validates arguments, and logs the interaction. This decouples the model from direct tool access.
Q: How long does the first useful maturity jump take? A: A team that already has a basic agent in production can reach Level 2 (reusable skills) in approximately one sprint and Level 3 (governed memory) in two to three sprints, depending on privacy review requirements. The full audited harness (Level 4) typically takes a quarter if the organisation already has compliance processes in place. The five first-week actions in the "What you can govern safely today" section are deliberately scoped so a platform engineering lead can complete them inside a single sprint without a budget request, which is usually the gating factor for these projects.
Further Reading
- "Enterprise AI Agent Memory Layer" (FAIM, 2026)
- "Canonical Docs for AI Memory Systems" (FAIM, 2026)
- "Evaluate MCP Servers for Enterprise Workflows" (FAIM, 2026)
- "Local-First AI Stack: Privacy Trade-offs" (FAIM, 2026)
- "Open Source AI Stack for Engineering Leaders" (FAIM, 2026)

