Skip to main content

Command Palette

Search for a command to run...

AI Data Quality Framework for European SMEs: What to Fix Before You Deploy

Poor data quality causes AI projects to fail at growing companies. Covers assessment, GDPR mapping, cleaning pipelines, and a go/no-go checklist.

Updated
10 min read
AI Data Quality Framework for European SMEs: What to Fix Before You Deploy
D
PhD in Computational Linguistics. I build the operating systems for responsible AI. Founder of First AI Movers, helping companies move from "experimentation" to "governance and scale." Writing about the intersection of code, policy (EU AI Act), and automation.

TL;DR: Poor data quality causes AI projects to fail at growing companies. Covers assessment, GDPR mapping, cleaning pipelines, and a go/no-go checklist.

Why this matters: the most common reason AI projects fail at mid-sized companies is not the AI model. It is the data. An AI system trained on or processing incomplete, inconsistent, or duplicated business data produces outputs that operators cannot trust. Two months into deployment, the team stops using the tool. The investment is wasted.

This framework covers how to assess data quality before deploying AI, what to fix, and how to map your data processing to GDPR requirements at the same time. The two activities complement each other: a data quality audit is also the foundation for a GDPR Article 30 records-of-processing inventory.

The Four Dimensions of Data Quality

Not all data quality problems are equal. Before investing in cleanup, identify which dimension is the bottleneck for your specific AI use case.

Completeness: the percentage of required fields that contain a value. An AI system for customer churn prediction needs complete customer profile data (contract start date, product tier, contact history). If 30% of customer records are missing the contract start date, the model's churn predictions will be systematically biased.

How to measure: run a completeness check on the fields your AI use case requires. Count the percentage of records with non-null, non-empty values in each required field. Target: 95%+ for the fields the AI model uses as inputs.

Consistency: the same entity is represented the same way across systems. Customer "Müller GmbH" appears as "Mueller GmbH" in your CRM, "Müller" in your invoicing system, and "muller gmbh" in your support tickets. An AI system joining these records produces duplicate entries, incorrect aggregations, and unreliable outputs.

How to measure: pick your top 20 customer or supplier records and compare their representation across the three to five systems your AI project will query. Manual review for 20 records takes under an hour and surfaces the consistency patterns quickly.

Accuracy: the values in your records reflect the current real-world state. A customer address that was correct two years ago may now point to a former employee who left the company. An AI that sends automated communications based on this data creates friction and compliance risk.

How to measure: accuracy is harder to measure systematically than completeness. For customer and prospect data, a periodic validation campaign (confirmation emails, returned mail checks) is the most reliable method. Target: annual refresh for high-volume transactional datasets.

Timeliness: the data is recent enough to be useful for the AI use case. A demand forecasting model that uses sales data from 18 months ago is not useful for a business that shifted its product mix in the last quarter.

How to measure: identify the staleness threshold for your use case (last 6 months, last 12 months, last 3 years) and calculate what percentage of your dataset falls within it.

Assessment Process: Five Steps Before Deployment

Step 1: Define the AI use case precisely. What decision will the AI system help with? What data inputs does it need? What output does it produce? Write this in one paragraph before touching any data. This prevents scope creep in the assessment and focuses the cleanup effort on what actually matters.

Step 2: Inventory the source data. List every data source the AI system will access: CRM, ERP, support tickets, financial records, HR data. For each source, document: the system name, the data type, approximate record count, and the responsible data owner (the person who can authorise access and cleanup).

Step 3: Run the quality checks. For each required field in each source, measure the four dimensions above. Use whatever tool is available: a SQL query, an Excel formula, or a basic Python script. The goal is a scorecard, not a full data science analysis.

A simple scorecard format:

Data sourceFieldCompletenessConsistency issue?Last validated
CRMCompany name97%Yes (3 variants)2025-01
CRMContract start71%No2024-06
ERPInvoice amount100%No2026-03
Support systemCustomer ID84%Yes (CRM mismatch)Unknown

Step 4: Apply the go/no-go threshold. Before deploying an AI system, set minimum quality thresholds. A reasonable starting point for most operational AI use cases:

  • Completeness: at least 90% for all required fields
  • Consistency: entity matching across systems resolved for the top 200 records (by volume)
  • Accuracy: validated within the last 12 months for the primary entity records (customers, suppliers)
  • Timeliness: at least 80% of records within the use case's relevant time window

If your data does not meet these thresholds, do not deploy the AI system. Fix the data first.

Step 5: Build the cleaning pipeline. Data cleanup is not a one-time activity. Once you have resolved the initial quality issues, build a lightweight ongoing pipeline:

  • Duplicate detection run on a weekly or monthly schedule (depending on data volume)
  • Field validation rules applied at data entry (CRM form validation, ERP field constraints)
  • An alerting rule that flags records with critical missing fields for manual review

Most small and mid-sized companies can implement this with existing tools: a scheduled report in their CRM, a Zapier or Make automation, or a simple Python script in their data stack.

GDPR Integration: Run Both Assessments Together

The data quality assessment described above maps almost directly to the GDPR Article 30 records-of-processing inventory that regulated EU businesses are required to maintain. Running both at the same time is more efficient than treating them as separate projects.

What the quality assessment captures that GDPR also needs:

Data source inventory: required for Article 30 (processor names and contact details, description of processing categories).

Data fields and record types: required for Article 30 (categories of data subjects and personal data).

Data owner and responsible person: required for Article 30 (controller identity and representative).

Last validated date: the Article 5(1)(d) accuracy principle requires that personal data be kept "accurate and, where necessary, kept up to date."

What the GDPR assessment adds:

Legal basis for processing: for each data source that contains personal data, document the GDPR lawful basis (Article 6): contract, legitimate interests, consent, or legal obligation. This is a GDPR-only requirement that the quality framework does not cover.

Retention period: the Article 5(1)(e) storage limitation principle requires that personal data not be retained longer than necessary. Set and enforce retention periods as part of the data pipeline cleanup step above.

Sub-processor notification: if the AI tool you are deploying is a data processor (a third-party tool processing personal data on your behalf), verify that the vendor is listed as a sub-processor in your DPA and that they handle data in accordance with GDPR.

Common Failure Patterns at Growing Companies

The legacy CRM problem. A 35-person software company had 8,000 CRM records accumulated over 6 years. 3,000 records were for contacts at companies that had been acquired, dissolved, or were no longer relevant. The AI-powered lead scoring tool they deployed produced nonsensical scores because 38% of its input records were stale. Fix: a one-time archival of records with zero activity in 24 months reduced the dataset to 5,000 live records and the model performance improved immediately.

The system mismatch problem. A 40-person logistics company wanted to deploy an AI forecasting tool that joined their order management system (orders) with their ERP (invoices). The customer IDs did not match between systems: 800 customer records in the order system had no corresponding invoice record. The root cause was that a legacy import two years earlier had used a different ID format. Fix: a mapping table created in a single afternoon resolved the join issue and the forecasting tool deployed successfully.

The "clean enough" trap. A 25-person marketing agency ran a data quality check and found 94% completeness on their email field. They declared the data "clean enough" and deployed an AI email personalisation tool. The 6% missing emails translated to 180 broken records in a 3,000-contact database. The personalisation tool produced errors for those contacts and the team spent two weeks manually investigating. Fix: 94% is not clean enough for a process where missing data produces a visible error. Set the threshold to 99%+ for fields where missing values cause failure rather than just lower accuracy.

FAQ

How long does a data quality assessment take for a 30-person company?

For a company with 3-5 data sources and 1,000-10,000 records per source, the assessment takes 3-5 days of focused effort: 1 day to inventory sources and define required fields, 1-2 days to run the quality checks, 1 day to produce the scorecard and go/no-go recommendation. This assumes the data owners are accessible and the tools are available. It does not require a data scientist, only a person comfortable with spreadsheet formulas or basic SQL.

Do I need to fix all data quality issues before deploying AI?

No. Fix the issues that affect your specific AI use case. If you are deploying an AI tool for invoice processing, the completeness and accuracy of your HR records are irrelevant. Focus the cleanup effort on the data the AI system will actually use. This is both more efficient and more defensible under GDPR's data minimisation principle (Article 5(1)(c)).

Is a one-time cleanup enough?

No. Data quality degrades over time as new records are created, old records become stale, and system migrations introduce inconsistencies. Build the lightweight ongoing pipeline described in Step 5 to maintain the quality levels you establish in the initial cleanup. The pipeline effort is minimal: typically 2-4 hours per month of automated monitoring, with manual intervention when alerts fire.

How does data quality connect to EU AI Act compliance?

EU AI Act Article 10 requires that training, validation, and testing data for AI systems meet quality criteria: appropriateness for the intended purpose, freedom from errors and biases, and completeness. For SMEs using third-party AI tools rather than building their own models, Article 10 applies to the AI provider, not to you directly. However, the data you feed into the AI system as inputs is subject to GDPR quality requirements regardless. The quality framework above satisfies both.

Further Reading

If you want a structured review of your data readiness before committing to an AI deployment, take the AI readiness assessment or speak with an AI consulting specialist.