
The Data Readiness Checklist Before Your First AI Deployment

Thinkscoop Engineering Dec 5, 2023 13 min read

Most AI projects that stalled between 2022–24 didn’t have a model problem. They had an accessibility, governance, or freshness problem in the data.

By late 2023 we had a depressing pattern in our inbound calls: teams that tried an AI pilot that year but never made it to production. When we unpacked what happened, nine times out of ten the blockers were data-related - and almost all of them could have been spotted during a rigorous discovery session before a single line of code was written.

The cruel irony is that data problems are often invisible during early prototyping. You grab a sample dataset, connect it to a demo environment, and the prototype looks great. The problems only surface when you try to connect the real system to the real data - which happens weeks or months later, when expectations are high and timelines are tight.

Five Dimensions of Data Readiness

Across dozens of AI pre-mortems between 2022 and 2024, we identified five dimensions that predict whether a project will hit a data wall. Scoring each data source on these five dimensions early in discovery gives teams a concrete picture of risk.

  • Accessibility: can the AI system actually read this data from its runtime environment, not just from a laptop with VPN?
  • Quality: is the data coherent and consistent enough to serve as ground truth for either retrieval or training?
  • Volume: do we have enough labelled examples, documents, or records for this specific use case and technique?
  • Governance: are we contractually, legally, and ethically permitted to use this data for this AI purpose?
  • Freshness: will the data still be correct and relevant when the model reads it at query time?

How to use this as a scorecard

Rate each data source 1 (blocker), 2 (concern), or 3 (green) on each dimension. A single '1' on governance or accessibility is enough to flag the project as high-risk. Do this before architecture discussions begin - it changes the conversation completely.
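The scorecard logic above can be sketched in a few lines. This is our own minimal illustration, not a formal tool; the dimension names follow the list above, and the verdict labels ("high-risk", "at-risk", "watch", "green") are made up for this sketch:

```python
DIMENSIONS = ["accessibility", "quality", "volume", "governance", "freshness"]

# A single 1 on either of these flags the whole project as high-risk.
HARD_BLOCKERS = {"accessibility", "governance"}

def score_source(name, scores):
    """Score one data source. `scores` maps dimension -> 1 (blocker),
    2 (concern), or 3 (green). Returns a risk verdict string."""
    missing = set(DIMENSIONS) - scores.keys()
    if missing:
        raise ValueError(f"{name}: unscored dimensions {sorted(missing)}")
    if any(scores[d] == 1 for d in HARD_BLOCKERS):
        return "high-risk"
    if any(v == 1 for v in scores.values()):
        return "at-risk"
    if any(v == 2 for v in scores.values()):
        return "watch"
    return "green"
```

Run during the discovery workshop, one call per data source; any "high-risk" verdict means scope or timeline talks happen before architecture talks.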

Accessibility: The Most Underestimated Blocker

Of all five dimensions, accessibility caused the most surprise. Teams assumed that because they could see the data - in a SharePoint folder, an ERP system, or a legacy database - the AI system could too. Not so. Production AI systems run in specific network environments with specific service accounts, and the path from 'I can see this data' to 'the system can reliably read this data at query time' is often weeks of infrastructure work.

  1. What service account will the AI system use, and can it be granted read access to all required sources?
  2. Are there network boundaries (firewalls, VPNs, on-premise systems) that block programmatic access?
  3. Are there rate limits or API quotas that would be hit by realistic query volumes?
  4. Is there a stable, versioned API or connection string, or does it change with system updates?
  5. What happens to AI availability if the data source has an outage?
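Question 2 on that list can be answered mechanically: run a reachability preflight from the environment the AI system will actually deploy into, not from a laptop on VPN. A minimal sketch using only the standard library (source names and ports here are placeholders):

```python
import socket

def preflight(sources, timeout=3.0):
    """Check that each (name, host, port) source accepts a TCP connection
    from the environment this code runs in. Run it from the AI system's
    runtime environment, with its service account - that is the whole point."""
    results = {}
    for name, host, port in sources:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[name] = "reachable"
        except OSError as exc:
            results[name] = f"blocked: {exc}"
    return results
```

A TCP handshake is only the first layer - it says nothing about auth, quotas, or API stability - but a "blocked" result here in week one is far cheaper than the same discovery in sprint four.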

Quality: When Good Enough for Humans Is Not Good Enough for AI

Many enterprise data sources are good enough for human workers who know the context - they can skip over inconsistencies, interpret ambiguous abbreviations, and mentally fill gaps. AI systems generally cannot. A knowledge base where half the articles are outdated or contradictory will generate a RAG system that confidently gives different answers to the same question depending on which documents get retrieved.

  • Duplicate documents with conflicting information - especially policy and compliance docs updated without deleting old versions
  • Scanned PDFs with poor OCR - structurally present but semantically noisy
  • Internal abbreviations and jargon not defined anywhere in the corpus - leaves the model guessing
  • Mixed languages or regional variants not reflected in the embedding model's training
  • Structured data with inconsistent null handling - empty strings, nulls, and zeros all meaning 'unknown'
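Two of the issues above - duplicate documents and inconsistent null handling - are cheap to screen for mechanically. A rough first-pass sketch (the sentinel list and the whitespace-normalized hash comparison are our own simplifications; real deduplication usually needs fuzzy matching):

```python
import hashlib

# Spellings that different source systems use to mean 'unknown'.
UNKNOWN_SENTINELS = {"", "n/a", "null", "none", "-", "unknown"}

def normalize_unknowns(record):
    """Map the many string spellings of 'unknown' to a single None."""
    return {
        k: (None if isinstance(v, str) and v.strip().lower() in UNKNOWN_SENTINELS else v)
        for k, v in record.items()
    }

def find_duplicates(docs):
    """Flag documents whose text is identical after case and whitespace
    normalization - a cheap first pass at the 'conflicting versions' problem."""
    seen, dupes = {}, []
    for doc_id, text in docs.items():
        digest = hashlib.sha256(" ".join(text.split()).lower().encode()).hexdigest()
        if digest in seen:
            dupes.append((seen[digest], doc_id))
        else:
            seen[digest] = doc_id
    return dupes
```

Exact-hash matching will miss the harder case of two policy versions that differ by one paragraph; those need human review, but this pass tells you how big the pile is.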

Volume: Different Techniques Need Very Different Amounts

Volume requirements vary enormously depending on what you are building. A RAG system on top of policy documents can work with anywhere from fifty to a few hundred well-structured documents. A fine-tuned model for a specific task needs thousands of labelled examples. An evaluation set needs at least 100–300 carefully labelled queries before it gives trustworthy signal.

  • RAG knowledge base: 50+ well-structured, reasonably current documents to start; quality matters more than raw count
  • Evaluation set: 100–300 representative questions with human-labelled expected answers, adversarial cases included
  • Fine-tuning: 500–5,000+ labelled input/output pairs depending on task complexity and base model capability
  • RLHF or preference data: typically 1,000–10,000 pairwise comparisons for meaningful signal
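Those floors translate directly into a gap check you can run against a data inventory. The technique keys and thresholds below mirror the list above; treat them as rough planning floors, not guarantees:

```python
# Rough minimum data volumes per technique, as discussed above.
MIN_VOLUME = {
    "rag_documents": 50,
    "eval_queries": 100,
    "fine_tuning_pairs": 500,
    "preference_pairs": 1_000,
}

def volume_gaps(available):
    """Return, for each technique below its floor, how many more
    examples are needed. `available` maps technique -> current count."""
    return {
        technique: floor - available.get(technique, 0)
        for technique, floor in MIN_VOLUME.items()
        if available.get(technique, 0) < floor
    }
```

An empty result does not mean the data is sufficient - quality and representativeness still dominate - but a non-empty one means a labelling effort belongs in the project plan from day one.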

Governance: The Most Expensive Problem to Find Late

A governance problem discovered in sprint four is far more expensive than one discovered in week one. We saw several projects either stall or require significant re-architecture because the governance check happened after the system was mostly built. Common governance questions to resolve before starting:

  1. Does the data contain personal data covered by GDPR, CCPA, or other privacy regulations - and does our use case satisfy the lawful basis?
  2. Are there vendor contracts that restrict how data can be processed or shared with third-party AI APIs?
  3. Has the legal or compliance team signed off on sending this data to a cloud AI provider?
  4. If we fine-tune a model on this data, who owns the resulting weights?
  5. What is the retention and deletion policy for data that passes through the AI system's logs and caches?
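Question 1 cannot be answered by code, but a crude scan can size the problem before the legal conversation. The patterns below are our own illustrative regexes for two obvious PII categories; a match means "route this source to compliance review", never "cleared":

```python
import re

# Crude first-pass patterns. Real PII detection needs far more than
# regexes (names, addresses, IDs, free-text context) - this only
# estimates how much obviously sensitive material is present.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def flag_pii(text):
    """Return the PII categories whose pattern matches the text."""
    return sorted(name for name, pat in PII_PATTERNS.items() if pat.search(text))
```

Running this over a sample of each source during discovery gives legal a concrete number ("12% of records contain email addresses") instead of an abstract question.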

Freshness: The Slow Failure

Freshness failures are insidious because they do not break the system - they make it gradually wrong. A RAG system built on documents that are refreshed monthly will give confident but outdated answers on fast-moving topics. A model trained on data from 2022 will not know about policy changes from 2024.
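Making the freshness requirement explicit means writing down a maximum acceptable document age and auditing the corpus against it. A minimal sketch, assuming each document carries a last-modified timestamp:

```python
from datetime import datetime, timedelta, timezone

def stale_documents(corpus, max_age_days):
    """Return the IDs of documents older than the freshness SLA.
    `corpus` maps doc_id -> last-modified datetime (timezone-aware, UTC)."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    return sorted(doc_id for doc_id, modified in corpus.items() if modified < cutoff)
```

Run on a schedule, this turns the slow failure into a visible metric: if the stale fraction of a pricing corpus climbs above an agreed threshold, the re-index pipeline is the fix, and it gets scoped before launch rather than after trust collapses.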

A real pattern from 2023

One client's internal search assistant was built on a document corpus refreshed quarterly. Within two weeks of launch, users discovered the system gave confident answers about pricing that had changed. Trust collapsed. The fix required a near-daily re-index pipeline - which nobody had scoped or budgeted. Freshness requirements caught early would have changed the architecture before a line of code was written.

Running the Scorecard in Practice

The most effective way to run this checklist is as a structured workshop in the first week of discovery. Bring together the AI team, the data owners, and the security or compliance representative. Go through each data source that the proposed system will use, and score each of the five dimensions together.

A project with two or more red scores across different data sources is a project that needs its scope reduced or its timeline extended before any technical work begins. This conversation is uncomfortable, but it is vastly more comfortable than having it three months later when a launch is looming.


Key takeaways

  • Data accessibility blocks more projects than data quality
  • Volume requirements differ massively for RAG, fine-tuning, and eval sets
  • Governance discovered late is the most expensive kind
  • Freshness requirements should be explicit in every AI design doc
  • A simple scorecard catches most readiness issues early