
Most companies investing in AI focus on the model, the algorithm, the tooling. The data underneath gets treated as someone else's problem. But Gartner estimates that poor data quality costs organizations an average of $12.9 million per year, and a McKinsey Global Data Transformation Survey found that employees spend 30 percent of their time on non-value-added tasks because of poor data quality and availability. AI doesn't fix these problems; it scales them. A model trained on messy data produces confident wrong answers, and it does so across every decision it touches. This is a big part of why Gartner projects that through 2026, 60 percent of AI projects will be abandoned because of insufficient data quality.
This post walks through what data preparation actually involves, how to approach it practically, and why it matters even more when you're building with RAG (Retrieval-Augmented Generation), the architecture behind most enterprise AI knowledge tools today.
The most expensive misconception about data preparation is that it belongs entirely to engineering. The consequences of bad data land on revenue, operations, and compliance, and that's where ownership needs to sit too.
A machine learning model trained on flawed data produces a confident wrong answer that someone on the commercial team acts on. A post-merger integration that stalls because nobody reconciled customer databases is an operations cost with a dollar figure attached. An AI support tool referencing outdated documents creates compliance exposure that legal will eventually have to address.
Inside most organizations, these kinds of costs are visible if you look for them.
None of it appears on a P&L as "cost of bad data." It shows up as slower decisions, larger teams than necessary, and AI investments that deliver a fraction of what they should.
When people say "clean your data," they almost never define what that means. And the definition changes depending on the use case.
There are six dimensions of data quality, commonly framed as accuracy, completeness, consistency, timeliness, validity, and uniqueness, that show up across virtually every AI project. They determine where time and money are best spent during preparation.
Most organizations have data spread across disconnected systems with no single team seeing the full picture. This is why inconsistency across sources is the most commonly cited data quality challenge, year after year.
The practical approach is to prioritize by use case: which business functions carry the highest stakes if the data is wrong? Those get addressed first.
Companies typically spend about 80 percent of their data-related time finding, cleaning, and organizing information, with only about 20 percent going to analysis. That ratio reflects the actual scope of preparation work needed to support reliable AI.
Before cleaning anything, the organization needs to know what it has. That means auditing every relevant data source (CRM, marketing tools, web analytics, legacy databases, spreadsheets) and documenting what each contains, how it's structured, and where quality problems are concentrated.
Profiling tools map the statistical distribution of the data and identify where gaps and anomalies sit. When those results are visualized, it becomes much easier to build a case for targeted work rather than committing to an unfocused enterprise-wide cleanup.
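A minimal sketch of what profiling surfaces: for each field, count missing values and distinct values across all records. The sample rows below are hypothetical CRM records, invented for illustration.

```python
# Profile hypothetical CRM rows: per field, how many records are missing
# a value, and how many distinct values appear among those present.

def profile(records):
    """Return {field: {"missing": n, "distinct": n}} across all records."""
    fields = {key for row in records for key in row}
    report = {}
    for field in fields:
        values = [row.get(field) for row in records]
        present = [v for v in values if v not in (None, "")]
        report[field] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report

crm_rows = [
    {"name": "Acme Corp",  "country": "US", "revenue": 1200},
    {"name": "Acme Corp.", "country": "",   "revenue": 1200},
    {"name": "Globex",     "country": "DE", "revenue": None},
]
print(profile(crm_rows))
```

Even this toy report exposes the problems described above: a gap in `country`, a missing `revenue`, and three "distinct" names for what are probably two companies.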
Data from different sources gets merged into a single environment. This tends to be more complex than expected: overlapping records, contradictory schema definitions, and different naming conventions across departments are normal. The goal is to connect systems securely without overwriting historical data that may still have value.
This is the most labor-intensive phase, and where the most direct value gets created. Raw business data is full of typos, format inconsistencies, outdated entries, and missing fields. The work involves correcting errors, filling gaps where context exists, and standardizing formats so AI can process the data without misinterpreting anything.
Deduplication is especially important here. Most companies have the same customer or product appearing in multiple systems under slightly different names. Consolidating those into a single authoritative version (a "golden record") is what gives downstream models a reliable base to work from.
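The golden-record idea can be sketched in a few lines. This assumes a fuzzy name match is enough to mark two rows as the same entity, with `difflib` standing in for a proper matching engine; real matching logic is domain-specific.

```python
# Sketch: greedily group rows by fuzzy name similarity, then merge each
# group into one "golden record", filling empty fields from duplicates.
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def golden_records(rows):
    groups = []
    for row in rows:
        for group in groups:
            if similar(group["name"], row["name"]):
                # fill gaps in the golden record from the duplicate
                for key, value in row.items():
                    if not group.get(key) and value:
                        group[key] = value
                break
        else:
            groups.append(dict(row))
    return groups

rows = [
    {"name": "Acme Corp",  "email": ""},
    {"name": "ACME Corp.", "email": "hello@acme.example"},
    {"name": "Globex",     "email": "info@globex.example"},
]
merged = golden_records(rows)
print(merged)  # two records: the Acme pair collapses into one
```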
Cleaned data gets restructured to match what the target AI model expects. That might mean normalizing number ranges, encoding text categories into machine-readable formats, or aggregating daily transactions into monthly behavioral patterns.
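Two of those transformations can be sketched directly. The field names are illustrative; min-max scaling and one-hot encoding are shown here because they are the most common forms of "normalizing ranges" and "encoding categories".

```python
# Min-max scale a numeric field to [0, 1] and one-hot encode a category
# field into machine-readable columns.

def min_max_scale(values):
    lo, hi = min(values), max(values)
    span = hi - lo or 1  # avoid division by zero on constant columns
    return [(v - lo) / span for v in values]

def one_hot(values):
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

amounts = [100, 250, 400]
tiers = ["gold", "silver", "gold"]
print(min_max_scale(amounts))  # [0.0, 0.5, 1.0]
print(one_hot(tiers))          # columns ordered alphabetically: gold, silver
```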
The prepared dataset is checked against predefined business rules to confirm that the cleaning process hasn't accidentally changed the meaning of anything.
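A rule-based validation pass can be as simple as a list of predicates with messages. The two rules below are hypothetical examples of business rules, not a prescribed set.

```python
# Sketch: each rule is (message, predicate); the report lists every
# record index that breaks a rule.

RULES = [
    ("revenue must be non-negative", lambda r: r.get("revenue", 0) >= 0),
    ("country code must be 2 letters", lambda r: len(r.get("country", "")) == 2),
]

def validate(records):
    violations = []
    for i, record in enumerate(records):
        for message, check in RULES:
            if not check(record):
                violations.append((i, message))
    return violations

records = [
    {"revenue": 500, "country": "US"},
    {"revenue": -10, "country": "USA"},
]
print(validate(records))  # only the second record violates rules
```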
The validated data goes into a centralized warehouse, data lake, or feature store where models can access it for training and real-time use.
Whether this effort produces lasting value or turns into a one-time project often comes down to automation. Manual cleaning doesn't scale. Organizations seeing sustained AI performance have built automated pipelines that continuously profile, clean, and structure incoming data, so quality standards hold up as new information flows in.
Even thoroughly cleaned internal data only captures what happens inside the organization's own systems: transactions, support interactions, operational logs. AI needs external context to make useful predictions, and data enrichment fills this gap.
Cloud marketplaces on Snowflake and AWS have made this much more accessible. Organizations can subscribe to third-party datasets and run enrichment directly inside their data warehouse, without building separate transfer infrastructure.
Clean first, enrich second. Layering third-party data onto duplicated or misspelled internal records amplifies existing problems and wastes the enrichment budget.
Check compliance before appending. Consumer attributes that violate GDPR or CCPA can turn a data improvement initiative into a legal issue.
Everything described above (the cleaning, the structuring, the enrichment) leads to one question: what do you actually build on top of it?
For most organizations deploying AI to work with internal knowledge, the answer is Retrieval-Augmented Generation (RAG). RAG connects an existing large language model to the organization's own document library. When someone asks a question, the system retrieves relevant content and feeds it to the model in real time.
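The retrieval half of that loop can be sketched as a toy. Production systems score chunks with embeddings and a vector index; simple word overlap stands in for that here, and the policy snippets are invented.

```python
# Toy RAG retrieval: score each document chunk by word overlap with the
# question and return the top matches to feed the LLM as context.
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, chunks, top_k=2):
    q = tokens(question)
    return sorted(chunks, key=lambda c: len(q & tokens(c)), reverse=True)[:top_k]

chunks = [
    "Refund requests must be filed within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Refund amounts are issued to the original payment method.",
]
context = retrieve("how do I get a refund for a purchase", chunks)
# the retrieved chunks would then be inserted into the LLM prompt
print(context)
```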
This approach avoids the cost and complexity of fine-tuning a custom model, reduces hallucinations by grounding answers in actual source material, and keeps information current because the organization controls the retrieval database. If you're weighing different approaches, we wrote a detailed comparison of ChatGPT Enterprise vs. custom RAG knowledge bases that covers the trade-offs.
The place where RAG most often fails is document quality. When an organization loads millions of raw, uncurated files into the system (internal wikis with outdated policies, PDFs with formatting artifacts, Word files with broken layouts), the retrieval system pulls from all of it indiscriminately. Users get unclear or contradictory answers, they lose confidence in the tool, and the investment goes unused.
Teams getting real performance from RAG spend most of their effort on the documents themselves. Practitioners often describe the split as 90 percent document preparation and 10 percent AI implementation.
Extraction and parsing removes noise from source files (headers, footers, raw HTML tags, page numbers, boilerplate). Leaving this in the system raises token-based processing costs and makes it harder for the AI to find relevant content.
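A minimal sketch of that parsing pass, stripping HTML tags and dropping boilerplate lines before chunking. The boilerplate patterns are illustrative; real document sets need their own.

```python
# Remove HTML tags, then drop lines matching known boilerplate patterns
# (page numbers, legal footers) before the text moves on to chunking.
import re

BOILERPLATE = re.compile(r"^(page \d+|confidential|all rights reserved)", re.I)

def clean_text(raw):
    text = re.sub(r"<[^>]+>", "", raw)  # strip HTML tags
    lines = [ln.strip() for ln in text.splitlines()]
    return "\n".join(ln for ln in lines if ln and not BOILERPLATE.match(ln))

raw = """<h1>Travel Policy</h1>
<p>Employees book flights through the approved portal.</p>
Page 3
All rights reserved."""
print(clean_text(raw))
```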
Semantic chunking breaks long documents into smaller pieces the AI can process within its context window. The pieces need to be split at logical boundaries like paragraph breaks or section headings. Splitting at a fixed character count cuts sentences in half and destroys context.
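The boundary-respecting version can be sketched by packing paragraphs into chunks up to a size budget, never splitting mid-paragraph. The 200-character budget is illustrative; real budgets are token-based.

```python
# Pack whole paragraphs into chunks no larger than max_chars, splitting
# only at paragraph boundaries rather than at a fixed character count.

def chunk_by_paragraph(text, max_chars=200):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = (current + "\n\n" + para).strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

doc = ("Section 1: Expenses. " + "Receipts are required. " * 5 + "\n\n"
       + "Section 2: Travel. " + "Book through the portal. " * 5)
for c in chunk_by_paragraph(doc):
    print(len(c), c[:30])
```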
Metadata tagging labels each piece with a document ID, creation date, version number, and source pointer. This ensures the system retrieves the current version of a policy rather than an outdated one, and preserves document-level access permissions.
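In practice that just means wrapping each chunk in a record the retriever can filter on. The field names below follow the list above; `access_groups` is an illustrative addition for the permissions point, and the URL is a placeholder.

```python
# Wrap a chunk with the metadata the retriever filters on: identity,
# freshness (date + version), provenance, and access permissions.
from datetime import date

def tag_chunk(text, doc_id, version, source, access_groups):
    return {
        "text": text,
        "doc_id": doc_id,
        "created": date.today().isoformat(),
        "version": version,
        "source": source,
        "access_groups": access_groups,
    }

chunk = tag_chunk(
    "Refund requests must be filed within 30 days.",
    doc_id="policy-042",
    version=3,
    source="https://wiki.example.com/policies/refunds",
    access_groups=["support", "finance"],
)
# at query time: keep only the highest version, filtered to the caller's groups
print(chunk["doc_id"], chunk["version"])
```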
For a deeper look at how RAG specifically addresses hallucination risk, see our resource on tackling hallucinations in LLMs with RAG.

A RAG deployment doesn't maintain itself. Three things need to be in place to keep it reliable over time.
Governance is the framework of policies, access controls, and ownership that keeps data accurate, secure, and compliant as time passes. A 2025 Dataversity survey found that only 4 percent of organizations report high maturity in both data governance and AI governance, which is a significant gap given how quickly document libraries change. At minimum, a governance program needs defined ownership across functions (not just IT), a data classification system, granular access controls for AI agents, and real-time monitoring that blocks unauthorized actions before they execute.
Bias auditing matters whenever AI outputs affect people. Amazon's recruiting AI, trained on predominantly male resumes, learned to downgrade applications containing the word "women's" and was scrapped entirely. HireVue's interview AI triggered an FTC investigation over demographic bias. iTutorGroup paid a $365,000 EEOC settlement after its algorithm rejected applicants solely for being over 55. The fix starts in data preparation: audit datasets for demographic representation, define fairness metrics, and run continuous monitoring in production. With the EU AI Act and evolving U.S. guidelines, this is becoming a compliance requirement, not an optional safeguard.
Human-in-the-Loop (HITL) provides the ongoing human oversight that AI systems need. Data labeling and annotation take up to 80 percent of an AI project's timeline, and model quality is bounded by label quality. Organizations seeing the best results use subject matter experts (physicians, lawyers, financial analysts) rather than outsourced generalist labor. HITL also serves as a risk layer: "red teamers" deliberately try to provoke the AI into harmful outputs before it reaches real users, catching problems while they can still be fixed.
The difference between AI projects that deliver and those that stall consistently traces back to how the organization treated its data before building anything.
DHL feeds carefully prepared historical routing and inventory data into its AI systems. Automated sorting handles over 1,000 parcels per hour at 99 percent accuracy, and warehouse picking productivity has increased by up to 180 percent.
Britannia used rigorously standardized employee competency data to restructure its assessment process. Clean, validated metrics let AI cut evaluation time by 75 percent, compressing what used to take 10 weeks and freeing up over 280 hours of productivity in the first phase.
Mercari, Japan's largest online marketplace, connected generative AI to deeply profiled customer transaction data across support operations. The initiative is expected to return 500 percent ROI while cutting the support team's manual workload by 20 percent.
In each case, the competitive advantage came from the data work, not the model selection.
Not sure where your organization stands? Our AI Readiness Score tool can help you assess your current position and identify where to focus first.