The Value of Data

Abstract

Every discussion of enterprise AI eventually reaches for the model: which one, how large, how fine-tuned. This is the wrong altitude. A model is a lens; the picture is the data beneath it. The value an AI system can return is bounded, above everything else, by the quality and connectedness of the data it reasons over. Two facts hold most enterprise data below the threshold of usefulness: most of it is never collected or connected (the dark-data problem), and most of what is collected cannot be trusted (the data-quality problem). Both carry a large, mostly unbudgeted cost. But the deeper cost is strategic: dirty, disconnected data forecloses the cross-silo joins where the non-obvious value lives, the insight that only appears when a machine reading, a complaint email and a supplier's news are the same entity in one picture. We argue that collection quality, entity resolution and provenance, performed at ingest, are the real, compounding, un-copyable moat.

1. The data is the product

The interface is a commodity. A chat box, a dashboard, a report — these are the depreciating surface of an AI system, and they are increasingly interchangeable. What is not interchangeable is the substrate: the corpus of a specific company's commitments, invoices, contracts, machine readings, emails and hard-won operational judgment, collected accurately, connected correctly, and kept over time. The data is the product; the interface is just an instrument for reading it.

This inverts the usual buying conversation. "Garbage in, garbage out" is not a slogan; it is an architectural constraint. The most capable model in the world, pointed at a firm's duplicated supplier records and un-timestamped spreadsheets, will produce confident, fluent, wrong answers — and do so faster than a human could. The binding constraint on enterprise-AI value has moved: it is no longer the intelligence of the model but the fidelity of the foundation. Most AI initiatives that stall do not stall at the model. They stall because the data underneath was never collected, never connected, or never trustworthy.

2. The dark-data problem — what you never collected

Gartner coined "dark data" for the information assets an organisation collects, processes and stores in the course of business but generally fails to use for any other purpose.[1] The category is enormous. IDC's Global DataSphere work has long held that the large majority of enterprise data — on the order of 80% or more — is unstructured, and the overwhelming share of it is never analysed after capture.[4] The machine historian that logs a vibration trace nobody reads; the inbox where a customer's third complaint about the same defect sits unlinked to the first two; the drive folder of countersigned PDFs no system can query — each is a pool of value that exists but does not count.

FIG 1: The dark-data iceberg (unstructured: ~80%+; structured: ~20%; analysed: a fraction). Most enterprise data is unstructured and never analysed after capture. Indicative shares. Source: IDC.[4]

The defining property of dark data is that its cost is invisible by construction. You cannot miss what you never measured. A firm that never collected the join between its machine readings and its warranty claims does not see a line item labelled "insight foregone"; it simply never has the conversation. This is the silent cost at the centre of this paper: silent not because it is small, but because nothing in the firm's accounts or dashboards ever names it. The cost of not collecting is the one cost that never shows up on the invoice.

3. The data-quality problem — what you can't trust

If dark data is the cost of not collecting, poor quality is the compounding cost of collecting badly. The evidence here is unusually blunt. In a study that scored real data against basic quality rules, Nagle, Redman and Sammon found that only 3% of companies' data met basic quality standards, and that 47% of newly-created data records carried at least one critical, work-impacting error.[2]

$12.9M: the average annual cost of poor data quality to an organisation, per Gartner, while only 3% of companies' data meets basic quality standards and 47% of new records carry a critical error. Sources: Gartner; Nagle, Redman & Sammon, Harvard Business Review (2017).

Gartner estimates that poor data quality costs organisations an average of $12.9 million per year.[1] Redman, summarising an IBM estimate, put the cost of bad data to the US economy at roughly $3.1 trillion per year; separately he has argued the cost to an individual firm runs to 15–25% of revenue.[3][5]

The reason bad data is worse than no data is that errors compound downstream. The well-known 1-10-100 heuristic of data quality captures the shape: an error costs roughly a unit to prevent at entry, ten units to correct later, and a hundred units in downstream failure if it is never caught. A duplicated supplier — "ACME S.r.l." and "Acme Srl" as two entities — silently splits that supplier's spend, its risk history and its contract terms across two records, so every report, forecast and decision built on top inherits the split. Collecting badly does not merely waste effort; it manufactures confident wrong answers, and an AI layer laid over it industrialises them.
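The shape of the 1-10-100 heuristic is easier to see with numbers. The sketch below is illustrative only: the function name, the batch of 1,000 errors and the two splits across stages are assumptions invented for the example, and the 1, 10 and 100 multipliers are relative cost units from the heuristic, not currency.

```python
# Relative cost units from the 1-10-100 heuristic (not currency).
COST_PREVENT = 1    # error stopped at entry
COST_CORRECT = 10   # error found and fixed later
COST_FAILURE = 100  # error never caught, paid for downstream


def total_cost(prevented: int, corrected: int, never_caught: int) -> int:
    """Total cost units for a batch of errors, split by where each one is handled."""
    return (
        prevented * COST_PREVENT
        + corrected * COST_CORRECT
        + never_caught * COST_FAILURE
    )


# Two hypothetical firms, each facing the same 1,000 errors.
disciplined = total_cost(prevented=900, corrected=90, never_caught=10)
neglected = total_cost(prevented=100, corrected=300, never_caught=600)

print(disciplined)                       # 2800
print(neglected)                         # 63100
print(f"{neglected / disciplined:.1f}")  # 22.5
```

With these assumed splits, the same 1,000 errors cost about 22 times more when most of them are left to surface downstream, which is the compounding the heuristic describes.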

4. Where the value actually hides — the foreclosed joins

The costs in §2 and §3 — foregone insight and downstream rework — are real, but they understate the true loss, because the largest cost is the one that is structurally hardest to see: the insight that is never surfaced because two datasets were never connected.

Consider the substrate of a manufacturer as four pillars — its digital communications, its people's declared knowledge, its machines' readings, and the world around it. A single-domain tool sees one pillar. To a maintenance system, a vibration reading is noise until it crosses a threshold. To a helpdesk, a complaint email is one ticket among many. To no one at all, a trade-press item about a supplier's reformulated resin is trivia. Each is meaningless alone. Overlaid on one connected graph, they can become a single sentence: one root cause behind three symptoms. That sentence is invisible to every tool that holds only one pillar — and it is exactly the sentence a mid-market firm most needs.

FIG 2: Machine (vibration) + Inbox (complaint) + World (supplier news) → one graph (provenance-stamped join) → one insight (root cause, three symptoms). Each signal is noise alone; joined on one connected graph they become a single root cause. This join is impossible for any single-domain tool. [Dimbo analysis].

This is why connected data does not add value linearly. The number of potential joins across domains grows combinatorially with the number of connected domains, and value tracks the joins, not the domains. It is the reason a second data domain, populated into a shared graph, roughly triples rather than doubles the value of the first — the join across domains is where the non-obvious value lives. [Dimbo analysis] The corollary is severe: a firm whose data sits in disconnected silos is not missing a fraction of its potential insight; it is missing the super-linear majority of it, and no amount of model quality recovers a join the data architecture never made possible.

FIG 3: Value by number of connected domains: 1 domain ×1; 2 domains ≈×3; 3 domains ≈×6. Value tracks the joins, not the domains, so connected data compounds super-linearly. Illustrative. [Dimbo analysis].
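One way to read the illustrative multipliers in FIG 3 is to count each domain once and each pairwise cross-domain join once. That counting rule is an interpretation assumed here, not a formula stated in the paper, but it reproduces the ×1, ≈×3, ≈×6 pattern. The function name and the short pillar labels are hypothetical; the fourth value is simply where the same rule would lead.

```python
from itertools import combinations


def value_units(domains: list[str]) -> int:
    """Count one unit per domain plus one unit per pairwise cross-domain join."""
    pairwise_joins = list(combinations(domains, 2))
    return len(domains) + len(pairwise_joins)


# The four pillars named in section 4, abbreviated.
pillars = ["communications", "people", "machines", "world"]

for n in range(1, len(pillars) + 1):
    print(n, "domain(s) ->", value_units(pillars[:n]))
# 1 domain(s) -> 1
# 2 domain(s) -> 3
# 3 domain(s) -> 6
# 4 domain(s) -> 10
```

Under this rule the total is n(n+1)/2, so value grows roughly with the square of the number of connected domains while the count of domains grows only linearly.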

5. Data as a governed, valued asset — the EU policy tailwind

The regulatory direction of travel reinforces the same thesis from an unexpected quarter. The EU Data Governance Act (Reg. (EU) 2022/868) and the Data Act (Reg. (EU) 2023/2854) treat data — and specifically industrial and machine-generated data — as an asset with rights of access, portability and sharing attached.[6][7] The practical effect is that a firm's ability to locate, govern, and — on its own terms — share its data with clear provenance is shifting from a nice-to-have to a regulated capability. The OECD made the economic case a decade ago in Data-Driven Innovation, framing data as a genuine capital asset whose value is realised only through reuse and recombination.[8]

Two conclusions follow for the mid-market. First, the data a firm can neither find nor trust is not merely unused — it is increasingly a compliance and commercial liability, because the law now assumes the firm can account for it. Second, the firms positioned to benefit from the machine-data provisions are precisely those that have already done the unglamorous work of collecting that data well, with provenance intact. The policy tailwind rewards the same discipline the economics does.

6. Architecting for compounding value

If the problem is that data is never collected, never connected and never trusted, the fix is not a better report at the end of the pipeline but a better discipline at the start of it: collect and structure automatically, with provenance, at ingest. Dimbo is built to that discipline, and its mechanisms are concrete rather than aspirational.

Entity resolution at the door. Incoming records are resolved to a single canonical entity — with Italian legal-name normalisation folding "ACME S.r.l.", "Acme Srl" and "ACME SRL" into one supplier before the split of §3 can occur (normalize_org_name, the entity_resolver, a configurable fuzzy-match threshold). One supplier becomes one node carrying all its invoices, shipments, contracts, news sentiment and machine how-tos at once.
Provenance and typing stamped on every edge. Every relationship written into the graph carries where it came from (a ProvenanceScope) and a canonical schema.org type, so the firm can always answer "how do we know this?", which is exactly the accountability the Data Act now assumes.
One shared graph, one shared knowledge store. Because every module writes into the same substrate, the cross-domain join that dark data forecloses becomes native, not a bespoke integration project. This is the mechanism behind the compounding of §4.
A system that actively fights dark data. The knowledge_hunter runs deterministic gap heuristics — a client with traffic but no notes, an open commitment with no owner, a device with incidents but no how-to — and turns each gap into a specific question for the person who can close it. Dark data is attacked, not merely tolerated.
Engines that sharpen as the graph fills. The forecast and what-if engines are deterministic and explainable, and they become materially more accurate as more modules populate the graph with clean, connected, provenanced records — value that accrues to the data, not the model.
The un-copyable local moat. Software is copyable; the years of a specific plant's approvals, edits and rejections folded back into every future proposal are not. A horizontal vendor starts from zero on every customer. A firm that has collected its own data well starts from zero once — and the gap only widens.
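The first mechanism in the list, entity resolution at ingest, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Dimbo's implementation: the paper names normalize_org_name and a fuzzy-match threshold but does not show them, so the body below, the resolve helper, the short list of legal forms, the 0.85 threshold and the invoice amounts are all hypothetical. Python's standard difflib.SequenceMatcher stands in for whatever fuzzy matcher the product actually uses.

```python
import re
from difflib import SequenceMatcher

# Illustrative subset of Italian legal forms (assumption; a real list is longer).
LEGAL_FORMS = {"srl", "srls", "spa", "sas", "snc"}


def normalize_org_name(raw: str) -> str:
    """Lower-case the name, drop punctuation and strip trailing legal forms."""
    tokens = re.findall(r"\w+", raw.lower().replace(".", ""))
    while tokens and tokens[-1] in LEGAL_FORMS:
        tokens.pop()
    return " ".join(tokens)


def resolve(raw: str, known: list[str], threshold: float = 0.85) -> str:
    """Return the canonical key for raw, registering a new entity if none is close."""
    key = normalize_org_name(raw)

    def similarity(candidate: str) -> float:
        return SequenceMatcher(None, key, candidate).ratio()

    best = max(known, key=similarity, default=None)
    if best is not None and similarity(best) >= threshold:
        return best
    known.append(key)
    return key


# Hypothetical invoices arriving under inconsistent supplier spellings.
invoices = [
    ("ACME S.r.l.", 1200.0),
    ("Acme Srl", 800.0),
    ("ACME SRL", 500.0),
    ("Beta Resine S.p.A.", 300.0),
]

known_entities: list[str] = []
spend: dict[str, float] = {}
for name, amount in invoices:
    entity = resolve(name, known_entities)
    spend[entity] = spend.get(entity, 0.0) + amount

print(spend)  # {'acme': 2500.0, 'beta resine': 300.0}
```

The three ACME spellings collapse to one key before any spend is aggregated, so the split described in §3 never reaches a report. Stripping the legal form entirely is a deliberate simplification: it would also merge an "Acme Srl" with an unrelated "Acme SpA", which is one reason a real resolver would weigh further attributes such as a VAT number.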

Data is the appreciating asset in an AI system; the interface is the depreciating one. Collect well once, and every future capability inherits it. — The compounding thesis

7. Conclusion — collect well, once

The three costs stack into a single argument. The cost of not collecting is invisible, so it is never budgeted. The cost of collecting badly compounds downstream, so it is paid many times over. And the value of collecting well compounds faster than either — because connected, trustworthy, provenanced data unlocks the super-linear majority of insight that lives in the joins.

A company that builds its collection layer right does not buy intelligence once; it compounds it. That is the difference between a feature bolted onto software built for something else, and a system born to collect data well and reason over it — the difference between renting a lens and owning the picture.

A representative scenario. Rivertex Compositi, a fictional €90M composites manufacturer, holds three dark-data pools that never touch: complaint emails in an inbox, a machine historian logging cure-oven temperatures, and countersigned supply contracts in a drive. Resolved into one provenance-stamped graph, a single correlation traces a recurring delamination complaint — previously three unrelated tickets — to a resin supplier's formulation change flagged in trade news six weeks earlier, correlated with a drift in cure-oven temperature the historian had recorded but no one had read. No new data was created; the value was always present in data the firm already owned. It had simply never been collected into a shape where the join could be made.
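The join in this scenario can be sketched as a walk over a small provenance-stamped graph. Everything below is hypothetical: the entity ids, relation names and source labels are invented to mirror the fictional Rivertex case, and a plain string stands in for the ProvenanceScope mentioned in §6, whose actual structure the paper does not describe.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    subject: str
    relation: str
    obj: str
    source: str  # provenance stamp: the pool this fact was ingested from


# Hypothetical facts, assumed already resolved to shared entity ids at ingest.
EDGES = [
    Edge("ticket-1", "reports_delamination_on", "panel-line-A", "inbox"),
    Edge("ticket-2", "reports_delamination_on", "panel-line-A", "inbox"),
    Edge("ticket-3", "reports_delamination_on", "panel-line-A", "inbox"),
    Edge("panel-line-A", "cured_in", "oven-2", "historian"),
    Edge("oven-2", "shows", "cure-temperature-drift", "historian"),
    Edge("panel-line-A", "uses_resin_from", "resin-supplier", "contracts"),
    Edge("resin-supplier", "announced", "formulation-change", "trade_news"),
]


def reachable(start: str, edges: list[Edge]) -> dict[str, list[Edge]]:
    """Map every node reachable from start to one edge path that leads to it."""
    outgoing = defaultdict(list)
    for edge in edges:
        outgoing[edge.subject].append(edge)
    paths: dict[str, list[Edge]] = {}
    stack = [(start, [])]
    while stack:
        node, path = stack.pop()
        for edge in outgoing[node]:
            if edge.obj not in paths:
                paths[edge.obj] = path + [edge]
                stack.append((edge.obj, paths[edge.obj]))
    return paths


# Candidate causes are the end points: nodes that assert nothing further.
subjects = {edge.subject for edge in EDGES}

for ticket in ["ticket-1", "ticket-2", "ticket-3"]:
    paths = reachable(ticket, EDGES)
    for cause in sorted(node for node in paths if node not in subjects):
        sources = [edge.source for edge in paths[cause]]
        print(ticket, "->", cause, "via", sources)
```

All three tickets reach the same two end points, and each path lists the pools it crossed, so the answer to "how do we know this?" travels with the finding. Remove the shared entity ids and the same facts sit in separate pools with no path between them.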

Honesty note

The €90M Rivertex scenario and the "second module triples value" and "super-linear joins" claims are transparent Dimbo analysis and reasoning, not cited statistics — flagged as such throughout, and the value figures are illustrative. The cited figures (Gartner $12.9M; HBR 3% / 47%; Redman & IBM $3.1T; MIT SMR 15–25%; IDC unstructured share) should be verified against their latest releases before publication. Dimbo's stated properties — entity resolution, provenance stamping, on-prem sovereignty, GDPR-by-design, PII anonymisation, full audit trail — are real; no unheld certification is claimed anywhere in this paper.

References

[1] Gartner, Data Quality (poor data quality costs organisations an average of $12.9M/yr; defines "dark data"). gartner.com
[2] Nagle, Redman & Sammon, "Only 3% of Companies' Data Meets Basic Quality Standards," Harvard Business Review (September 2017) (3% meet basic standards; 47% of newly-created records carry at least one critical error). hbr.org
[3] Thomas C. Redman, "Bad Data Costs the U.S. $3 Trillion Per Year," Harvard Business Review (September 2016), citing an IBM estimate. hbr.org
[4] IDC, Global DataSphere / Data Age research (the large majority of enterprise data is unstructured and un-analysed). idc.com
[5] Thomas C. Redman, MIT Sloan Management Review (the cost of bad data to a firm ≈ 15–25% of revenue; secondary anchor). sloanreview.mit.edu
[6] European Union, Data Act, Regulation (EU) 2023/2854 (access, portability and sharing rights for industrial / machine-generated data). eur-lex.europa.eu
[7] European Union, Data Governance Act, Regulation (EU) 2022/868 (framework for trusted data sharing and reuse). eur-lex.europa.eu
[8] OECD, Data-Driven Innovation: Big Data for Growth and Well-Being (2015) (data as a capital asset realised through reuse and recombination). oecd.org

Figures flagged should be confirmed against the latest release before publication. The Rivertex Compositi scenario is a representative fictional illustration flagged as Dimbo analysis; see the companion Value Model for the full transparent loss-and-capture working. No unheld certifications are claimed anywhere in this paper.