Is Your Data Ready for AI? A Practical Checklist
Before investing in AI tools, audit your data. A practical checklist covering completeness, consistency, access, and minimum viable data quality.
I talk to business owners every week who want to "add AI" to their operations. They have seen the demos, read the case studies, and they are ready to buy a tool. My first question is always the same: "Show me your data."
That is usually where the conversation gets uncomfortable.
Here is the reality that AI vendor pitches will not tell you: most AI projects fail because of data problems, not algorithm problems. The technology works. It works remarkably well. But it works on clean, accessible, structured data -- and most businesses do not have that.
Before you spend a dollar on an AI tool, SaaS subscription, or consulting engagement, run your data through this checklist. It will save you months of frustration and potentially tens of thousands in wasted investment.
Why Data Quality Kills AI Projects
AI models learn from patterns in your data. If your data is full of gaps, inconsistencies, and duplicates, the patterns the model learns will be garbage. This is not a technology limitation -- it is math. Feed a model bad inputs and you get bad outputs. Every time.
I have seen companies spend six figures on AI-powered forecasting tools only to discover their historical sales data was incomplete, their product categories were inconsistent across systems, and half their customer records were duplicates. The AI did exactly what it was told: it found patterns in the mess and produced confidently wrong predictions.
The fix is not better AI. The fix is better data.
The Checklist
1. Data Completeness: Are the Fields Actually Filled In?
This is the most basic check and the one that catches the most problems. Pull up your key datasets -- customer records, sales transactions, product catalog, whatever the AI tool will consume -- and look at fill rates.
What to check:
- For each critical field, what percentage of records have a value?
- Are "empty" fields truly empty, or filled with placeholder junk like "N/A", "TBD", "unknown", or "0"?
- Are required fields enforced at the point of entry, or optional?
- When data is missing, is there a pattern? (e.g., all records before 2023 lack a certain field)
Thresholds that matter:
- If more than 20% of a key field is blank, fix that before buying any AI tool. The model will either ignore those records (reducing your effective dataset) or learn the wrong patterns from the gaps.
- If more than 50% of a field is blank, that field is unusable for AI purposes. Do not include it.
- For classification and prediction tasks, you need the target variable (the thing you are trying to predict) filled in on at least 80% of historical records. Less than that and you are training on too little signal.
Common offenders: email addresses entered as "none@none.com", phone numbers stored as "000-000-0000", dates left as the system default, free-text fields where a dropdown should exist.
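The fill-rate check above is easy to script. Here is a minimal sketch in plain Python; the placeholder list and the sample records are illustrative assumptions -- extend the list with whatever junk your own systems produce:

```python
# Values that look "filled" but carry no information (an illustrative,
# non-exhaustive list of the common offenders above).
PLACEHOLDERS = {"", "n/a", "na", "tbd", "unknown", "none", "0",
                "none@none.com", "000-000-0000"}

def fill_rate(records, field):
    """Fraction of records whose `field` holds a real, non-placeholder value."""
    if not records:
        return 0.0
    filled = sum(
        1 for r in records
        if str(r.get(field, "")).strip().lower() not in PLACEHOLDERS
    )
    return filled / len(records)

# Hypothetical sample data for illustration.
customers = [
    {"email": "ann@example.com", "phone": "555-0101"},
    {"email": "none@none.com",   "phone": "000-000-0000"},
    {"email": "bob@example.com", "phone": ""},
    {"email": "N/A",             "phone": "555-0102"},
]

for field in ("email", "phone"):
    rate = fill_rate(customers, field)
    flag = "OK" if rate >= 0.8 else "FIX BEFORE BUYING AI"
    print(f"{field}: {rate:.0%} filled -- {flag}")
```

Run this against every field the AI tool will consume, not just the obvious ones; the pattern of which fields fail often tells you where your entry process is broken.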
2. Data Consistency: Same Thing, Same Way, Every Time
Consistency is where I see the most painful problems. Your data might be complete -- every field filled -- but if the same information is recorded differently across records, the AI sees them as different things entirely.
What to check:
- Are names, categories, and labels standardized? ("US" vs "USA" vs "United States" vs "U.S." are four different values to a machine)
- Are dates in a single format? (MM/DD/YYYY vs DD/MM/YYYY vs "January 5, 2025" will cause real problems)
- Are numerical values in consistent units? (Revenue in dollars vs thousands of dollars; weight in pounds vs kilograms)
- Do the same products, customers, or entities have the same ID across all systems?
- Are status fields standardized? ("Active", "active", "ACTIVE", "A" are four different values)
Thresholds that matter:
- If a categorical field contains at least 30% more unique values than it has legitimate categories (e.g., you have 50 product categories but the field contains 200 unique entries), you have a consistency problem.
- If you find more than 3 different formats for the same type of data (dates, phone numbers, addresses), standardize before proceeding.
The duplicate record problem: Duplicates are a consistency issue that deserves special attention. If the same customer appears 3 times with slightly different names ("John Smith", "J. Smith", "John A. Smith"), any AI analysis of customer behavior will be wrong. It will see three customers where there is one. Run a deduplication pass before any AI project. Match on email, phone, or address -- not just name.
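Both fixes -- standardizing labels and deduplicating on a reliable key -- can be sketched in a few lines. The country mapping, field names, and sample records here are assumptions for illustration:

```python
# Map the variant spellings above onto one canonical label.
COUNTRY_MAP = {"us": "US", "usa": "US", "united states": "US", "u.s.": "US"}

def standardize_country(value):
    """Collapse known variants to a canonical form; pass unknowns through."""
    return COUNTRY_MAP.get(value.strip().lower(), value.strip())

def dedupe_by_email(records):
    """Keep the first record seen per normalized email address."""
    seen = {}
    for r in records:
        key = r["email"].strip().lower()
        if key not in seen:
            seen[key] = r
    return list(seen.values())

# The three "John Smith" variants from the example above.
customers = [
    {"name": "John Smith",    "email": "JSMITH@EXAMPLE.COM", "country": "USA"},
    {"name": "J. Smith",      "email": "jsmith@example.com", "country": "US"},
    {"name": "John A. Smith", "email": "jsmith@example.com ", "country": "United States"},
]

for c in customers:
    c["country"] = standardize_country(c["country"])

unique = dedupe_by_email(customers)
print(len(unique))  # three raw records collapse to one customer
```

Note that matching happens on the normalized email, not the name -- exactly because the name field is the inconsistent one.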
3. Data Freshness: How Old Is This Stuff?
AI models trained on stale data make predictions about a world that no longer exists. This matters more in some domains than others, but it always matters.
What to check:
- When was the dataset last updated?
- Is there a regular update cadence, or is it ad hoc?
- Are there gaps in the time series? (e.g., no data for Q3 2024)
- Has anything fundamental changed since the data was collected? (new product lines, pricing changes, market shifts, a pandemic)
Thresholds that matter:
- For customer behavior predictions: data older than 12-18 months loses relevance fast. Use 2+ years for seasonality, but weight recent data more heavily.
- For operational forecasting (inventory, staffing): you need at least weekly data points. Monthly is too coarse for most AI use cases.
- For any model: if your business went through a major change (acquisition, new product launch, price restructuring), data from before that change may actively hurt the model. Consider using only post-change data.
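Gaps in a time series are easy to miss by eye but trivial to detect programmatically. A minimal sketch, assuming weekly data bucketed by the Monday of each week (the dates are hypothetical):

```python
from datetime import date, timedelta

def missing_weeks(dates, start, end):
    """Return the week-start (Monday) dates in [start, end] with no data point."""
    def week_of(d):
        return d - timedelta(days=d.weekday())  # roll back to that week's Monday
    have = {week_of(d) for d in dates}
    gaps, cursor = [], week_of(start)
    while cursor <= end:
        if cursor not in have:
            gaps.append(cursor)
        cursor += timedelta(weeks=1)
    return gaps

# Hypothetical sales dates with a hole in the middle.
sales_dates = [date(2025, 1, 6), date(2025, 1, 13), date(2025, 1, 27)]
gaps = missing_weeks(sales_dates, date(2025, 1, 6), date(2025, 1, 27))
print(gaps)  # the week of 2025-01-20 has no data
```

The same pattern works at daily or monthly granularity; what matters is that you look for the gaps before the model silently interpolates over them.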
4. Data Accessibility: Can the Tool Actually Reach It?
This is the one that trips up even technically sophisticated teams. Your data might be clean and complete, but if the AI tool cannot access it programmatically, it is useless.
What to check:
- Where does the data actually live? (Database, CRM, spreadsheets, email inboxes, paper files, someone's head)
- Is there an API or export mechanism?
- Is the data behind authentication or firewall restrictions?
- Can you extract it without manual copy-paste?
- Are there legal or compliance restrictions on sharing the data with a third-party AI tool?
- Is there a single source of truth, or do multiple conflicting versions exist?
Red flags:
- Data lives in Excel files on individual employees' laptops. This is more common than anyone admits. Until that data is in a centralized system, AI cannot use it reliably.
- "The data is in our CRM, but we would need to export it manually." If you cannot automate the data pipeline, the AI project becomes a manual process with extra steps.
- Multiple systems contain overlapping but inconsistent data (CRM says one thing, accounting says another, the spreadsheet says a third). Decide on a source of truth first.
5. Data Volume: Is There Enough to Learn From?
AI needs examples to learn from. The question is always: how many?
What to check:
- How many records do you have for the task at hand?
- For classification tasks: how many examples of each category?
- Is the data balanced, or are some outcomes vastly overrepresented?
- Can you supplement with external data if internal data is thin?
Thresholds that matter (rough minimums):
- Text classification (spam detection, sentiment, categorization): 100-500 labeled examples per category, minimum. More is better.
- Demand/sales forecasting: 2+ years of historical data with at least weekly granularity. Less than that and you are guessing with extra steps.
- Customer segmentation: At least 1,000 customer records with behavioral data attached. Below that, traditional analysis works just as well and costs less.
- Anomaly detection (fraud, error detection): You need thousands of "normal" examples. The anomalies themselves can be rare, but the baseline must be robust.
- Chatbots and Q&A systems: At least 200 real question-answer pairs from your actual customers. Synthetic or made-up training data produces chatbots that sound confident and get things wrong.
The imbalanced data problem: If you are trying to predict something rare (fraud, churn, equipment failure), you might have 10,000 "normal" records and 50 "event" records. Most off-the-shelf AI tools will struggle with this ratio. You will need either more event data, synthetic oversampling techniques, or a model designed for imbalanced classes. This is a conversation to have with your vendor before signing anything.
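Checking balance takes one pass over your labels. A minimal sketch using the 10,000-to-50 example above; the 5% cutoff is an illustrative assumption, not a standard:

```python
from collections import Counter

def class_balance(labels, min_ratio=0.05):
    """Return {class: (share, too_rare)} -- flag classes below `min_ratio`.
    The 5% default cutoff is an illustrative assumption."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: (n / total, n / total < min_ratio)
            for cls, n in counts.items()}

# 10,000 "normal" rows and 50 "event" rows, per the example above.
labels = ["normal"] * 10_000 + ["fraud"] * 50
for cls, (share, flagged) in class_balance(labels).items():
    note = "  <- too rare for most off-the-shelf tools" if flagged else ""
    print(f"{cls}: {share:.2%}{note}")
```

If the rare class gets flagged, that is the number to put in front of the vendor: ask specifically how their tool handles a class that makes up half a percent of the data.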
Before You Buy: The 5-Minute Gut Check
Before any AI investment, answer these five questions honestly:
- Can I describe the specific business problem this will solve in one sentence? If not, you are shopping for technology, not solutions.
- Can I pull the relevant data into a single spreadsheet right now? If extracting the data is a project in itself, that is your first project -- not the AI tool.
- Would I trust a new employee to make decisions based on this data? If the data is too messy for a human to work with, it is too messy for AI.
- Do I have at least 6 months of clean historical data for the task? Less than that and most AI approaches will not outperform simple rules or human judgment.
- Is there someone on my team who will own data quality ongoing? AI is not a one-time setup. The data pipeline needs maintenance. If nobody owns it, quality will degrade within months.
The Bottom Line
Data quality work is not glamorous. Nobody gets excited about deduplicating customer records or standardizing date formats. But it is the foundation that determines whether your AI investment produces results or produces expensive noise.
The companies I have seen succeed with AI are not the ones with the fanciest tools or the biggest budgets. They are the ones who did the boring work first: cleaned their data, centralized it, established standards, and only then went looking for AI tools to build on that foundation.
If you run through this checklist and find significant problems, that is actually good news. You have just saved yourself from investing in an AI project that would have failed. Fix the data first. The AI tools will still be there when you are ready -- and they will actually work.
About the Author
Founder & Principal Consultant
Josh helps SMBs implement AI and analytics that drive measurable outcomes. With experience building data products and scaling analytics infrastructure, he focuses on practical, cost-effective solutions that deliver ROI within months, not years.