Federated learning keeps gradients local. Clean rooms control access. But neither keeps raw data where it belongs and supports lossless, governed, multimodal sharing. Datasent does.

AI systems are data hungry. The larger and more capable the model, the more data it requires — and the more diverse that data needs to be. This creates a structural tension: the organizations with the most valuable data are often the ones least able to share it.
Healthcare providers hold patient records that would transform medical AI. Financial institutions hold transaction histories that would unlock credit models for the underserved. Manufacturers hold sensor streams from production lines that would make predictive maintenance dramatically more effective. But the compliance exposure, security risk, and operational complexity of sharing that data are prohibitive.
Existing solutions address parts of this problem. Federated learning keeps raw training data local, but it is lossy — gradient aggregation discards information — and application-specific. Secure data clean rooms govern access, but they typically require raw data to enter a controlled environment managed by a third party. The raw data still moves.

"Datasent eliminates that requirement entirely. Raw data never leaves the Data Holder in any configuration. What AI systems receive is a structured residual representation — compact, governed, and lossless."
There is a second advantage beyond privacy: Datasent tokens are natively useful for machine learning, at zero additional computational cost.
Each token contains a coefficient matrix that captures local trend, curvature, and spectral components — precisely the features that preprocessing pipelines normally derive from raw data before feeding a model. The residual captures local deviations from the fitted structure. The model family identifier provides an unsupervised structural label for each segment. Together, these constitute a rich, structured feature representation that is available immediately from the encoding step.
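The token structure described above can be sketched as a simple data type. This is a minimal illustration, assuming hypothetical field names (`coefficients`, `residual`, `model_family`) — it is not the actual Datasent schema.

```python
# Illustrative sketch of a Datasent-style token, per the description above.
# All names are assumptions for illustration, not the real Datasent API.
from dataclasses import dataclass
import numpy as np


@dataclass
class Token:
    coefficients: np.ndarray  # local trend, curvature, spectral components
    residual: np.ndarray      # local deviations from the fitted structure
    model_family: int         # unsupervised structural label for the segment


def as_features(token: Token) -> np.ndarray:
    """Flatten one token into a feature vector for a downstream model,
    reusing the encoding output directly instead of re-deriving features
    from raw data."""
    return np.concatenate([
        token.coefficients.ravel(),
        np.asarray([token.model_family], dtype=float),
    ])
```

Because these fields are produced by the encoding step itself, the feature vector comes at no additional preprocessing cost.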
For time-series data, this has dramatic implications for transformer architectures. A dataset with one million time steps, encoded with an average segment length of one thousand steps, produces a token sequence of one thousand elements rather than one million. Attention complexity falls from quadratic in one million to quadratic in one thousand — a reduction of six orders of magnitude, with no loss of information.
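The arithmetic behind that claim is straightforward. A short sketch, using the numbers from the paragraph above (one million steps, segments of one thousand steps; the variable names are illustrative):

```python
# Attention cost falls quadratically with sequence length.
raw_steps = 1_000_000                  # time steps in the raw series
segment_length = 1_000                 # average steps encoded per token
tokens = raw_steps // segment_length   # 1,000 tokens after encoding

raw_attention_ops = raw_steps ** 2     # O(N^2) on raw steps: 10^12
token_attention_ops = tokens ** 2      # O(N^2) on tokens:    10^6

reduction = raw_attention_ops // token_attention_ops
print(reduction)  # 1,000,000 — six orders of magnitude
```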
The trusted setup architecture makes federated machine learning across organizational boundaries structurally simpler than any existing approach. Organizations exchange tokenized residuals rather than raw datasets or gradient updates. No raw training data leaves any organization's environment. The custodian model provides explicit authorization and audit logging for every reconstruction event.
This is not federated learning in the conventional sense — it is something more powerful. Federated learning aggregates gradient updates from distributed training runs, accepting information loss in exchange for privacy. Datasent allows the actual data to be shared, losslessly, without the raw data ever moving. The receiving party can train on the full reconstructed dataset under authorized access, with a complete audit trail.

The longer-term direction is more radical: running ML inference directly on Datasent-encoded data, without a decode step. The token algebra established in Datasent's full mathematical framework provides a formal foundation for defining operations that compose naturally over the structured token representation — meaning that model weights trained on decoded data can, in principle, operate directly on tokens.
This direction is early. But it points toward a future in which the encoding layer is not a preprocessing step that happens before computation — it is the substrate on which computation runs.
AI is often framed as a problem of models, compute, or algorithms. But in practice, most limitations emerge much earlier — at the data layer.
Modern AI systems depend on continuous, large-scale access to data, yet the infrastructure supporting that data was never designed for this reality. As a result, organizations face bottlenecks in performance, rising costs, and increasing exposure risk — not because their models are insufficient, but because their data systems cannot keep up.

John Rhye
Position