Why ArcellAI Needs to Exist
Most AI in R&D Fails for the Same Reason
Most AI efforts in R&D don’t fail because the models are weak. They fail because the data is fragmented, the workflows are brittle, and the infrastructure was never designed for how scientific discovery actually unfolds. I’ve run into this repeatedly—writing production systems as an engineer, doing research in academic labs, and working at the boundary where AI meets biology—and the response was almost always the same: patch another pipeline, fine-tune another model, add another layer of glue code. Over time, it became hard to ignore the pattern. The real bottleneck wasn’t intelligence, but the scaffolding we rely on to organize, test, and reason over complex scientific data. ArcellAI exists because that bottleneck is structural, not accidental.
I first ran into versions of this problem at Pinterest, where core product analytics and online A/B experimentation sat at the heart of the business. Delays, silent data quality failures, and brittle experimentation pipelines weren’t abstract inconveniences—they created systemic risk and real financial loss, costing the company hundreds of millions of dollars annually in wasted labor and missed monetization opportunities. I spent years building systems at petabyte scale to mitigate these issues: data quality checks, anomaly detection, decision science tooling, and reasoning layers designed to make experimentation trustworthy again. Even in a world-class engineering organization, the gap between data, tooling, and reliable decision-making was persistent.
After leaving Pinterest, I joined a seed-stage startup as a forward-deployed engineer working on one of the first centralized metrics layers. There, I saw the same failures play out across companies and industries: fragmented data sources, bespoke workflows, and expensive, one-off solutions built customer by customer. That company was eventually acquired by a leading analytics platform, but the lesson stuck with me. Even with deep experience building AI-driven data systems, I had no tools for constructing automated, end-to-end data science and engineering workflows in specialized domains. That constraint has only recently begun to shift. Advances in agentic systems and generative AI now make it possible to combine the hard-won lessons of data engineering with modern AI architectures and domain-specific scientific intelligence. That intersection is what ArcellAI is built around.
From Software to Biology, the Same Constraints
I ran into the same structural failures again after moving into biomedical AI research at Harvard. The domain was different, but the pattern was unmistakable. Modern therapeutics research depends on combining genomics, molecular biology, single-cell measurements, and clinical signals, yet the infrastructure supporting machine learning across these modalities was fragmented, brittle, or missing entirely. Models were trained on narrow slices of reality, evaluated in isolation, and deployed through workflows that broke as soon as assumptions shifted. Progress was constrained less by scientific ideas than by the systems meant to operationalize them.
That gap motivated the development of PyTDC: an open platform designed to unify heterogeneous biomedical data sources and standardize how models are trained, evaluated, and compared in therapeutically relevant settings. PyTDC was not an attempt to build “better models,” but to make visible where existing approaches broke down—particularly in out-of-distribution generalization, multimodal integration, and domain-specific evaluation. What emerged was a familiar lesson: even state-of-the-art methods struggled not because they lacked sophistication, but because the surrounding data abstractions, workflows, and evaluation infrastructure were brittle. The tooling debt scaled with ambition.
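To make "standardize" concrete, here is roughly what that looks like in practice: a few lines that load a curated benchmark and request a split designed to stress out-of-distribution generalization. The snippet follows PyTDC's public documentation; exact class, dataset, and method names may vary across versions.

```python
# Minimal sketch of standardized dataset access in PyTDC, following the
# public TDC documentation; names may differ across versions.
from tdc.single_pred import ADME

# Load a curated ADME benchmark by name (Caco-2 cell permeability).
data = ADME(name='Caco2_Wang')

# Scaffold splitting holds out unseen chemical scaffolds, probing the
# out-of-distribution generalization described above.
split = data.get_split(method='scaffold')
train, valid, test = split['train'], split['valid'], split['test']
print(len(train), len(valid), len(test))
```

The point is not the three lines themselves, but that every model evaluated this way sees the same data, the same split logic, and the same metrics, which is exactly where ad hoc pipelines tend to diverge.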
This matters because biology is one of the most data-intensive and expensive R&D domains in the world. Biotech and pharmaceutical companies spend tens of billions of dollars annually on data generation, analytics, and infrastructure, yet much of that investment is lost to fragmented pipelines, bespoke tooling, and slow, failure-prone experimentation cycles. As data volume and modality diversity grow, so does infrastructure debt—and with it, the cost of missed discoveries, delayed programs, and irreproducible results. PyTDC made this dynamic concrete. It also made clear that incremental tooling would not be enough.
Why This Constraint Shows Up Everywhere
What PyTDC surfaced in biology is not unique to life sciences. I encountered the same constraint during a brief stint as a senior AI infrastructure engineer at a leading autonomous vehicles company, where the challenge was not perception models or planning algorithms, but the machinery required to continuously ingest, validate, replay, and reason over real-world data at scale. Building safe autonomous systems required an industry-leading data engineering and observability stack—one capable of tracking edge cases, managing massive sensor streams, enforcing strict provenance, and closing the loop between deployment and learning. It worked, but only because it was built by a highly specialized team with enormous capital, deep integration across hardware and software, and years of focused effort. Most organizations simply cannot afford to build this kind of infrastructure from scratch.
The same pattern appears across techbio, deep tech, and physical AI more broadly. These domains generate unprecedented volumes of high-dimensional, multimodal data—wet-lab assays, robotics telemetry, simulations, manufacturing logs—but rely on tooling that was never designed for continuous experimentation or closed-loop learning. As a result, teams fall back on fragmented pipelines, bespoke scripts, and manual coordination at precisely the point where rigor and automation matter most.
The economic cost of this mismatch is substantial. Across biotech, advanced manufacturing, robotics, and energy, organizations invest more than $100 billion annually in data generation, analytics, and R&D infrastructure. Yet a significant portion of that spend is absorbed by infrastructure debt: duplicated workflows, fragile ETL, inconsistent metrics, and slow iteration cycles that compound as systems grow more complex. The cost is not just operational inefficiency, but delayed programs, irreproducible results, and missed opportunities to learn from data already collected.
Physical AI makes this especially visible. The emergence of specialized data observability and debugging platforms in robotics and autonomy is a response to the failure of traditional data stacks under real-time, multimodal conditions. But these tools typically address narrow slices of the problem—telemetry, simulation replay, or monitoring—without solving the full lifecycle from data ingestion to hypothesis testing to model evaluation in domain-specific contexts.
Across science and engineering, the conclusion is consistent: the limiting factor is not the sophistication of models, but the absence of integrated systems that can support data-driven reasoning end to end. The constraint transcends domains. It is structural, it is expensive, and it explains why progress repeatedly stalls as ambition increases.
The Heresy: Better Model Architectures Are Insufficient
The default response to these failures is still to reach for better models. Larger architectures. More parameters. More pretraining. In isolation, these advances matter. But across science, engineering, and physical AI, they consistently fail to resolve the bottlenecks that determine whether systems work in practice.
What breaks first is not prediction quality, but everything surrounding it. Data arrives late, partially labeled, and misaligned across modalities. Assumptions embedded in preprocessing pipelines are rarely documented or enforced. Metrics drift across teams and over time. Experiments cannot be replayed reliably, let alone compared across programs. When performance degrades or fails to generalize, it is often impossible to determine whether the issue lies in the data, the workflow, the evaluation protocol, or the model itself.
This is why so many AI systems succeed in controlled settings and stall in production. They are inserted into stacks that were never designed for continuous learning, multimodal reasoning, or closed-loop experimentation. In safety-critical domains like autonomy and high-stakes domains like biotech, this mismatch forces teams to rely on manual oversight, bespoke glue code, and institutional knowledge embedded in scripts and dashboards. Over time, these systems become increasingly brittle precisely when adaptability matters most.
The deeper issue is that modern R&D requires systems that reason, not just models that predict. It requires infrastructure that can track provenance across experiments, adapt workflows as data and objectives change, enforce domain-specific constraints, and make uncertainty explicit. Without that scaffolding, even the most advanced models are forced to operate in isolation—powerful, but fragile.
This is the heresy: progress in AI for science and engineering will not be driven primarily by better models, but by better systems for orchestrating data, experiments, and decisions. Until that layer exists, model improvements will continue to produce diminishing returns in the real world.
Why Now — and Why ArcellAI
If this constraint has existed for decades, the obvious question is why it is tractable now. The answer is not a single breakthrough, but a convergence. Advances in agentic systems, foundation models, and tooling have finally made it possible to encode hard-won data-engineering lessons into software that can reason, adapt, and operate across complex R&D workflows.
Agents change the shape of the problem. For the first time, it is feasible to build systems that don’t just execute predefined pipelines, but plan multi-step workflows, inspect intermediate results, recover from failure, and adapt as data and objectives change. When combined with modern foundation models, these agents can operate across heterogeneous tools and modalities—code, databases, instruments, simulations—without requiring brittle, hand-written orchestration for every new use case. What previously demanded large, specialized teams can now be expressed as reusable, composable systems.
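A deliberately simplified sketch of that shape follows, with every name invented for illustration rather than drawn from any real ArcellAI interface: a planner produces steps, each step is executed and its intermediate result inspected, and a failed check hands control back to the planner instead of silently propagating bad data.

```python
# Illustrative-only sketch of an agentic workflow loop: plan, execute,
# inspect intermediate results, recover by replanning. No real API implied.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    name: str
    action: Callable[[dict], object]   # e.g. query a database, run an analysis
    check: Callable[[object], bool]    # validate the intermediate result

@dataclass
class Plan:
    steps: list[Step] = field(default_factory=list)

def run(plan: Plan, replan: Callable[[Plan, Step, object], Plan]) -> dict:
    results = {}
    for step in plan.steps:
        result = step.action(results)      # execute with accumulated context
        if step.check(result):             # inspect before moving on
            results[step.name] = result
        else:
            # Recover: ask the planner for an amended plan rather than failing silently.
            return run(replan(plan, step, result), replan)
    return results
```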
But agents alone are not enough. Without domain intelligence, they simply automate confusion. What makes this moment different is the ability to ground agentic reasoning in domain-specific context: scientific ontologies, experimental constraints, domain-appropriate metrics, and the semantics of real R&D workflows. This is where lessons from data engineering matter most. Decades of experience building reliable analytics systems—tracking provenance, enforcing contracts, validating assumptions, and managing change—can now be embedded directly into the behavior of autonomous systems rather than re-implemented manually, team by team.
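As a toy example of what embedding those lessons looks like, here is a hypothetical data contract, with invented column names and thresholds, that an agent would validate before running any downstream analysis, rather than leaving the assumptions implicit in a scientist's head or a notebook.

```python
# Hypothetical data contract for a single-cell expression table; column names
# and thresholds are illustrative assumptions, not a real ArcellAI schema.
import pandas as pd

CONTRACT = {
    "required_columns": ["sample_id", "cell_type", "expression", "batch"],
    "max_missing_fraction": 0.05,
    "allowed_batches": {"run_01", "run_02"},
}

def violations(df: pd.DataFrame, contract: dict) -> list[str]:
    """Return contract violations; an empty list means the data is usable."""
    issues = []
    missing = [c for c in contract["required_columns"] if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    frac = df[contract["required_columns"]].isna().mean().max()
    if frac > contract["max_missing_fraction"]:
        issues.append(f"missing-value fraction {frac:.1%} exceeds contract limit")
    unknown = set(df["batch"].dropna()) - contract["allowed_batches"]
    if unknown:
        issues.append(f"unrecognized batches: {sorted(unknown)}")
    return issues
```

An agent that refuses to proceed past a non-empty violation list is, in effect, carrying the data-engineering discipline described above into every workflow it touches.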
ArcellAI is built at this intersection. It is an attempt to merge the rigor of mature data engineering with the flexibility of agentic AI and the specificity required by scientific and engineering domains. Rather than treating data preparation, experimentation, and evaluation as separate concerns, ArcellAI approaches them as parts of a single, continuous system—one designed to support reasoning, reproducibility, and adaptation end to end.
This is not about replacing scientists or engineers, nor about chasing ever larger models. It is about building the missing infrastructure layer that allows AI to function reliably inside real R&D environments: environments defined by messy data, shifting hypotheses, multimodal inputs, and high consequences. The same constraints that slowed progress in software, biology, and physical AI point to the same conclusion. The next gains will come not from intelligence alone, but from systems that make intelligence usable.
That is why ArcellAI exists—and why now is the moment to build it.

