Open Table Formats and the Lakehouse Paradigm: Reclaiming Ownership and Control over your Data

For years, data platforms defined how our data could be stored, accessed, and moved. Open table formats and Lakehouse architectures now separate data from the tools that process it, restoring true ownership.

Biel Llobera
December 19th, 2025

    The Architectural Break: Decoupling Storage and Compute

    The success of modern cloud data warehouses was not driven primarily by better SQL engines, but by a fundamental design choice: storing data in cheap, durable object storage while allocating compute independently and elastically.

    This model introduced several properties that quickly became non-negotiable for analytical workloads:

    • Pay only for the compute actually used, rather than provisioning for peak demand.
    • Scale storage and compute independently, without downtime.
    • Allow multiple, heterogeneous compute engines to query the same data concurrently.
    • Treat data files as immutable, shifting complexity from in-place mutation to metadata management.

    Immutability, in particular, proved to be a powerful constraint. By never modifying files in place and instead writing new versions with updated metadata, systems can support time travel, table snapshots, and logical clones almost as a side effect. These were not bolt-on features, but natural consequences of the design.
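
    As a rough sketch of what this looks like in practice, the snippet below uses the delta-rs Python bindings (the deltalake package); the local ./events path and toy schema are illustrative assumptions, not a recommendation for any particular setup:

        # Sketch: time travel as a by-product of immutable files plus versioned metadata.
        # Assumes the deltalake (delta-rs) package is installed; the ./events path is illustrative.
        import pandas as pd
        from deltalake import DeltaTable, write_deltalake

        # Each write adds new immutable Parquet files and a new metadata version;
        # previously written files are never modified in place.
        write_deltalake("./events", pd.DataFrame({"id": [1, 2], "value": ["a", "b"]}))
        write_deltalake("./events", pd.DataFrame({"id": [3], "value": ["c"]}), mode="append")

        # Reading an older table version only requires resolving older metadata.
        latest = DeltaTable("./events").to_pandas()               # rows from both writes
        as_of_v0 = DeltaTable("./events", version=0).to_pandas()  # only the first write

    Snapshots and logical clones follow the same principle: they are new metadata references to existing immutable files, not copies of the data.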

    Once this pattern proved viable, it became clear that the traditional coupling of databases, storage engines, and execution engines was no longer necessary.

    Why Traditional Databases Struggled with Analytics

    Relational databases were never designed to serve as large-scale analytical engines, even though they were often pressed into that role. Their limitations in this context were architectural rather than incidental.

    Common friction points included:

    • Fixed storage and compute, forcing teams to size systems for worst-case analytical queries.
    • Row-oriented storage optimized for transactions rather than aggregations.
    • Operational fragility under analytical loads, such as disk exhaustion from full reloads.
    • Manual backups, limited schema evolution, and poor handling of semi-structured data.

    None of this diminishes the value of traditional databases in transactional systems. It highlights that analytical workloads require different trade-offs. As data volumes grew and analytical use cases expanded, it became clear that incremental optimizations were no longer sufficient. A different abstraction was needed.

    Open Table Formats: Rebuilding the Warehouse on Object Storage

    Open table formats emerged to replicate the essential guarantees of modern data warehouses—while remaining fully compatible with object storage and open standards.

    At a high level, they combine:

    • Columnar file formats, such as Parquet or ORC, for efficient storage and compression.
    • A metadata layer that defines what constitutes a table at any point in time.
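
    Both ingredients are visible directly on disk. As a small sketch, listing the ./events Delta table from the earlier example shows immutable Parquet data files next to a _delta_log directory of JSON commit files (exact file names are illustrative):

        # Sketch of the physical layout behind an open table format (Delta Lake shown here).
        import os

        for root, _dirs, files in os.walk("./events"):
            for name in sorted(files):
                print(os.path.join(root, name))

        # Illustrative output:
        #   ./events/part-00000-...-c000.snappy.parquet      <- immutable columnar data files
        #   ./events/part-00001-...-c000.snappy.parquet
        #   ./events/_delta_log/00000000000000000000.json    <- metadata: the table at version 0
        #   ./events/_delta_log/00000000000000000001.json    <- metadata: the table at version 1

    Iceberg arranges the same two ingredients differently, with a metadata directory of JSON table metadata and Avro manifest files, but the split between data files and metadata is the same.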

    That metadata layer is the critical component. It enables:

    • ACID transactions on immutable files.
    • Table snapshots and time travel.
    • Query pruning through file- and column-level statistics.
    • Controlled schema evolution over time.

    Formats such as Iceberg and Delta Lake differ in ecosystem maturity and tooling, but converge on the same core idea: a table is no longer a physical structure managed by a single engine, but a logical construct defined by metadata over files in object storage.
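
    The sketch below illustrates this idea with PyIceberg; the REST catalog endpoint, warehouse location, table name, and columns are assumptions made for the example:

        # Sketch: a table resolved purely from catalog metadata, then pruned before any data is read.
        # Assumes a PyIceberg-compatible REST catalog; endpoint, warehouse, and identifiers are illustrative.
        from pyiceberg.catalog import load_catalog

        catalog = load_catalog(
            "default",
            uri="http://localhost:8181",
            warehouse="s3://my-bucket/warehouse",
        )
        table = catalog.load_table("analytics.events")

        # The filter is evaluated against file- and column-level statistics in the metadata layer,
        # so files that cannot contain matching rows are skipped without being opened.
        scan = table.scan(
            row_filter="event_date >= '2025-01-01'",
            selected_fields=("user_id", "event_date"),
        )
        result = scan.to_arrow()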

    This abstraction also benefits data science and machine learning workflows. By keeping data in open formats, models can access the same underlying files used for analytics, without costly exports or format conversions.
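
    As a small illustration, continuing the ./events example from earlier, an embedded engine such as DuckDB can read the same table files through its delta extension, without any export or conversion step (extension names and availability depend on the DuckDB version):

        # Sketch: a second engine reading the same table files directly.
        # Assumes DuckDB with its delta extension; ./events is the table from the earlier sketch.
        import duckdb

        con = duckdb.connect()
        con.sql("INSTALL delta")
        con.sql("LOAD delta")

        counts = con.sql("""
            SELECT value, count(*) AS n
            FROM delta_scan('./events')
            GROUP BY value
        """).df()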

    Metadata as the Control Plane: Catalogs and Governance

    As storage becomes open and shared, metadata becomes the primary mechanism for control. This shift explains the growing importance of data catalogs in lakehouse architectures.

    Modern catalogs provide:

    • Centralized access control and auditing.
    • Lineage and impact analysis.
    • Discovery and semantic organization of datasets.
    • A logical database-and-schema view over raw object storage paths.
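
    Much of this surface is programmatic. As a minimal sketch, assuming a PyIceberg-compatible REST catalog at an illustrative endpoint, even simple discovery calls already present a database-and-schema view rather than raw storage paths:

        # Sketch: the catalog as the logical entry point over object storage.
        # The endpoint is an assumption; real deployments add authentication and access policies.
        from pyiceberg.catalog import load_catalog

        catalog = load_catalog("default", uri="http://localhost:8181")

        # Namespaces and tables form the discovery layer; physical paths stay an implementation detail.
        for namespace in catalog.list_namespaces():
            for identifier in catalog.list_tables(namespace):
                print(identifier)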

    In effect, catalogs restore many of the usability guarantees of traditional databases—while operating over open storage and open table formats. Unsurprisingly, they have also become a focal point of vendor competition. Control over metadata increasingly determines who defines the user experience, governance model, and integration surface of the analytics stack.

    From a system design perspective, this reinforces a broader pattern: once execution engines become interchangeable, the durable sources of power are metadata, governance, and developer workflows.

    What This Unlocks Next: Simpler Systems, Selective Scale

    One of the more subtle consequences of open table formats is that they weaken the assumption that all analytics must be distributed by default.

    In practice:

    • Many analytical queries touch only a small fraction of large datasets.
    • Advances in vertical scaling make large single-node machines viable again.
    • Embedded and single-node engines can execute complex queries efficiently when paired with columnar data and pruning-aware metadata.

    This does not eliminate the need for distributed systems, but it reframes them as a tool for specific scaling problems rather than a universal requirement. The emerging pattern is composable:

    • One shared storage layer.
    • One or more open table formats.
    • A common catalog.
    • Multiple execution engines, chosen per workload.

    Small queries can run locally, while larger jobs can scale out as needed. Complexity is introduced deliberately, not by default.
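
    A rough sketch of the "small queries run locally" path, using DuckDB over Hive-partitioned Parquet files (paths and column names are illustrative):

        # Sketch: a selective query handled by a single embedded process.
        # Partition values in the path and column statistics in the Parquet footers let the engine
        # skip most files and read only the referenced columns from the rest.
        import duckdb

        daily = duckdb.sql("""
            SELECT event_type, count(*) AS events
            FROM read_parquet('./warehouse/events/*/*.parquet', hive_partitioning = true)
            WHERE event_date = '2025-12-01'
            GROUP BY event_type
        """).df()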

    Data Ownership as a First-Class Concern

    One of the most underappreciated effects of open table formats is how they clarify data ownership.

    In traditional analytics stacks, ownership is often implicit. Data may be “yours,” but its usability is tightly tied to a specific platform, storage layout, or execution engine. Over time, that coupling turns tooling choices into long-term commitments: switching systems can require migrating data, rewriting pipelines, or sacrificing historical guarantees.

    Open table formats invert this relationship. By storing data in open, well-defined formats on object storage, organizations keep direct control over the physical data and its evolution. Tables become shared assets defined by metadata that multiple engines can interpret.

    Practical implications include:

    • Data can outlive any single engine or vendor.
    • Analytics, data science, and ML can share one source of truth without duplication.
    • Governance and access policies can be enforced at the metadata layer, rather than relying on proprietary systems.
    • Architectural decisions become more reversible, reducing lock-in risk.

    In this model, ownership is less about who runs the queries and more about who controls the data’s format, location, and access rules—an increasingly important distinction as stacks become more heterogeneous and long-lived.

    Closing Thoughts

    Open table formats are often discussed in terms of performance, cost, or ecosystem momentum. While those factors matter, their deeper value lies in architectural flexibility.

    By decoupling storage, metadata, and execution, they enable teams to select tools based on their current needs without compromising control over their data. Complexity can be introduced deliberately, scale can be applied selectively, and systems can evolve without wholesale rewrites.

    In that sense, open table formats are less about chasing the next analytics trend and more about restoring a durable foundation—one where data remains stable, owned, and adaptable as everything around it changes.
