
As organisations scale their modern data platforms, the debate around open table formats increasingly centres on Databricks Delta Lake vs. Apache Iceberg. These two leading technologies both aim to deliver reliability, performance, and strong governance to data lakes—but they take distinct approaches, offer different strengths, and align with different use cases.

Whether you’re building a Lakehouse from scratch or modernising an existing data lake, understanding these differences is essential.

What Are Open Table Formats?

Open table formats enable data lakes to behave like databases—supporting ACID transactions, schema evolution, versioning, and efficient queries over massive datasets—while still using open storage (usually cloud object stores).
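To make that concrete, here is a minimal PySpark sketch of the database-like behaviour an open table format layers on top of object storage: an ACID upsert followed by a versioned (time-travel) read. It uses Delta Lake syntax purely as an example; the table and column names (lakehouse.sales, staging_sales, order_id) are illustrative and assume a Spark environment where the format is already available.

```python
from pyspark.sql import SparkSession

# Illustrative sketch; assumes a Spark environment with Delta Lake available
# and existing tables named lakehouse.sales and staging_sales.
spark = SparkSession.builder.getOrCreate()

# ACID upsert: readers never observe a half-applied change.
spark.sql("""
    MERGE INTO lakehouse.sales AS t
    USING staging_sales AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Versioning: read the table as it looked before the merge (time travel).
spark.sql("SELECT * FROM lakehouse.sales VERSION AS OF 1").show()
```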

The three major table formats today are:

  • Delta Lake (originated by Databricks)
  • Apache Iceberg (originally from Netflix)
  • Apache Hudi

This blog focuses on Delta Lake vs. Iceberg, the two most commonly compared options.

1. Architecture and Design Philosophy

Delta Lake (Databricks)

Delta Lake was built for high-performance analytics inside the Databricks Lakehouse Platform. It features:

  • Transaction logs stored in JSON
  • Tight integration with Databricks runtimes
  • Excellent performance with the Databricks Photon engine

Delta can be used outside Databricks, but the best features (Unity Catalog, Delta Live Tables, optimized writes) are available only on the Databricks platform.
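As a rough sketch of how the transaction log behaves (assuming a local Spark session with the open-source delta-spark package; the path is illustrative), each commit to a Delta table appears as an ordered JSON entry under _delta_log:

```python
import os
from pyspark.sql import SparkSession

# Minimal sketch; assumes the delta-spark package is installed locally.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/delta/events"  # illustrative local path
spark.range(100).write.format("delta").mode("overwrite").save(path)
spark.range(100, 200).write.format("delta").mode("append").save(path)

# Each commit is recorded as an ordered JSON file in _delta_log/:
# 00000000000000000000.json, 00000000000000000001.json, ...
print(sorted(os.listdir(os.path.join(path, "_delta_log"))))
```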

Design philosophy: Performance-first, deeply integrated into the Databricks ecosystem.

Apache Iceberg

Iceberg is a vendor-neutral, open, community-driven project designed for multi-engine interoperability (Spark, Flink, Trino, Presto, Snowflake, Dremio, BigQuery, etc.).

It uses:

  • A highly scalable metadata tree structure (metadata files, manifest lists, and manifest files)
  • A table snapshot model designed for massive datasets
  • Hidden partitioning and robust, field-ID-based schema evolution

Design philosophy: Open, flexible, engine-agnostic, built for multi-cloud and multi-engine architectures.
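A minimal sketch of those ideas in Spark SQL, assuming the iceberg-spark-runtime package and a locally configured Hadoop catalog (the catalog name lake, namespace db, and warehouse path are illustrative): the table is partitioned by a hidden days(ts) transform, and every commit becomes a queryable snapshot.

```python
from pyspark.sql import SparkSession

# Minimal sketch; assumes iceberg-spark-runtime is on the classpath. The catalog
# name "lake" and the warehouse path are illustrative placeholders.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Hidden partitioning: queries and writes never reference the derived day value.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.events (
        id BIGINT,
        ts TIMESTAMP,
        payload STRING
    )
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Every commit produces a snapshot, queryable through Iceberg's metadata tables.
spark.sql(
    "SELECT snapshot_id, committed_at, operation FROM lake.db.events.snapshots"
).show()
```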

2. Feature Comparison: Databricks Delta Lake vs. Apache Iceberg

ACID Transactions

Both Delta Lake and Iceberg support ACID transactions.

  • Delta Lake: JSON-based transaction log
  • Iceberg: Metadata & manifest trees

Verdict: Both are reliable, but Iceberg tends to scale better for very large metadata sets.

Schema Evolution

Both support schema evolution, but with some nuance:

  • Delta Lake: Supports add/drop/rename fields, but renames may be less reliable across all engines.
  • Iceberg: Offers the most robust schema evolution in the market, including field ID tracking and hidden partition evolution.

Verdict: Iceberg wins for long-term governance and cross-engine compatibility.
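For illustration, here is the kind of evolution each format supports, continuing the example tables above (all table and column names are illustrative; the Delta properties follow Databricks documentation for enabling column mapping and should be checked against your runtime version):

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg- and Delta-enabled sessions/tables from the earlier sketches.
spark = SparkSession.builder.getOrCreate()

# Iceberg: columns are tracked by field ID, so these changes stay consistent
# across engines.
spark.sql("ALTER TABLE lake.db.events ADD COLUMN amount INT")
spark.sql("ALTER TABLE lake.db.events ALTER COLUMN amount TYPE BIGINT")  # int -> long widening
spark.sql("ALTER TABLE lake.db.events RENAME COLUMN payload TO body")

# Delta: renaming generally requires name-based column mapping to be enabled first
# (property names per Databricks documentation; verify for your runtime).
spark.sql("""
    ALTER TABLE lakehouse.sales SET TBLPROPERTIES (
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5',
        'delta.columnMapping.mode' = 'name'
    )
""")
spark.sql("ALTER TABLE lakehouse.sales RENAME COLUMN order_date TO order_dt")
```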

Partitioning

  • Delta Lake: Partition pruning works well but relies on stored partition columns.
  • Iceberg: Introduced hidden partitioning, keeping partition logic internal to metadata.

Verdict: Iceberg is more flexible and easier to operate as data evolves.
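A short sketch of Iceberg partition evolution, continuing the lake.db.events example (illustrative names, Iceberg Spark SQL extensions assumed): the partition spec changes in metadata only, without rewriting existing files.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg-enabled session and the lake.db.events table from the
# earlier sketch.
spark = SparkSession.builder.getOrCreate()

# Evolve the partition spec: add a bucket on id, and switch from daily to
# monthly granularity on ts.
spark.sql("ALTER TABLE lake.db.events ADD PARTITION FIELD bucket(16, id)")
spark.sql("ALTER TABLE lake.db.events DROP PARTITION FIELD days(ts)")
spark.sql("ALTER TABLE lake.db.events ADD PARTITION FIELD months(ts)")

# Existing files keep their original spec; new writes use the new one, and
# queries still prune correctly because the partition logic lives in metadata.
```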

Performance

  • Delta Lake: Exceptional performance when paired with Databricks Photon.
  • Iceberg: Performance depends more on the query engine; strong with Trino, Spark, Snowflake, Dremio.

Verdict: If you’re all-in on Databricks, Delta wins. If you’re multi-engine, Iceberg is more flexible.
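On the Delta side, much of the performance story (Photon) is a runtime setting rather than a code change; here is a small, hedged example of the file-layout side of tuning, with illustrative table and column names and a Delta-enabled session assumed:

```python
from pyspark.sql import SparkSession

# Assumes a Delta-enabled session (e.g. a Databricks cluster); illustrative names.
spark = SparkSession.builder.getOrCreate()

# Compact small files and co-locate rows that are frequently filtered together.
spark.sql("OPTIMIZE lakehouse.sales ZORDER BY (order_date)")

# Photon itself is enabled on the Databricks cluster/runtime configuration,
# not switched on in the query.
```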

3. Interoperability

Delta Lake

  • Best performance inside Databricks
  • Limited writable interoperability across other engines
  • Delta Universal Format (UniForm) aims to bridge Delta → Iceberg/Hudi readers, but adoption is still growing.
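As a sketch of what enabling UniForm looks like on a recent Databricks runtime, the table properties below follow the Databricks documentation at the time of writing and should be verified for your version:

```python
from pyspark.sql import SparkSession

# Assumes a recent Databricks runtime; property names should be verified against
# current documentation. Table and column names are illustrative.
spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.sales_uniform (
        order_id BIGINT,
        order_date DATE,
        amount DOUBLE
    )
    USING DELTA
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```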

Apache Iceberg

  • Designed from day one for interoperability
  • Supported by Spark, Flink, Trino, Presto, Snowflake, Athena, Dremio, BigQuery (read support)

Verdict: If you want vendor neutrality and multi-engine support, Iceberg is the clear winner.

4. Governance and Catalog Integration

Delta Lake

  • Unity Catalog provides centralized governance—but only on Databricks.
  • Outside Databricks, Delta has fewer cataloging/governance features.

Iceberg

  • Works with many catalogs, including the Hive Metastore, AWS Glue, Nessie, JDBC, and REST-based catalogs (see the configuration sketch below)

Verdict: Iceberg offers broader ecosystem support.
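For example, attaching Spark to an Iceberg REST catalog takes only a few configuration keys; the catalog name lake and the URI below are placeholders, and other engines point at the same catalog through their own equivalent settings:

```python
from pyspark.sql import SparkSession

# Sketch of connecting Spark to an Iceberg REST catalog; the catalog name, URI,
# and namespace are illustrative placeholders for your own deployment.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://catalog.example.com/api")
    .getOrCreate()
)

# Trino, Flink, Snowflake, etc. attach to the same catalog with their own
# configuration, so every engine sees the same Iceberg tables.
spark.sql("SHOW TABLES IN lake.db").show()
```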

5. Use Cases Best Suited for Each

Choose Databricks Delta Lake if:

  • You are heavily invested in Databricks
  • You want the best performance with Photon
  • You prefer a fully managed Lakehouse ecosystem
  • You rely on Databricks features like MLflow, DLT, Unity Catalog

Choose Apache Iceberg if:

  • You need multi-engine interoperability
  • You want the most flexible open table format
  • You want to avoid vendor lock-in
  • You run workloads on multiple clouds or different query engines
  • Governance and schema evolution are priorities

Final Thoughts

The choice between Delta Lake and Apache Iceberg ultimately comes down to one key question:

Are you all-in on Databricks, or do you want an open, engine-agnostic data lake architecture?

  • If your data strategy revolves around Databricks, Delta Lake offers unmatched integration and performance.
  • If you’re building a flexible, future-proof data lake with multiple compute engines, Apache Iceberg is the best choice today.

In my next blog post, I will do a technical deep dive into Delta Lake vs. Apache Iceberg v3!

Why the Remote Client Cannot Create a SparkContext and How to Fix It

Since 2022, our data engineering team has been running Databricks and dbt Core to power our Data Vault environment. Everything ran smoothly—until we encountered the “remote client cannot create a SparkContext” error. This issue forced us to switch to creating a SparkSession instead and prompted a deep dive into its cause and solution.

That streak of reliability came to an abrupt stop last week when our dbt Python models running on Databricks started failing with the following error message:

[CONTEXT_UNAVAILABLE_FOR_REMOTE_CLIENT] The remote client cannot create a SparkContext. Create SparkSession instead.

This unexpected error disrupted our dbt runs, which had been stable for years. At first, it seemed related to how Spark contexts were being initialized—something that had not changed in our codebase. A deeper dive pointed instead to recent Databricks platform updates that changed how dbt’s Python models connect to the cluster: they now run as a remote client, which cannot create a SparkContext and must use a SparkSession instead.
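In practice, the fix on the code side is to stay on the SparkSession that dbt injects into Python models and avoid anything that touches a SparkContext (RDD APIs, SparkContext.getOrCreate(), session.sparkContext). A minimal sketch of such a model, with illustrative model and column names:

```python
# models/daily_orders.py -- illustrative dbt Python model.
# dbt passes a SparkSession ("session") into the model; on remote-client style
# clusters there is no SparkContext available, so stick to the DataFrame/SQL API
# and never call SparkContext.getOrCreate() or session.sparkContext.

def model(dbt, session):
    dbt.config(materialized="table")

    # Read an upstream dbt model as a Spark DataFrame.
    orders = dbt.ref("stg_orders")

    # Pure DataFrame operations -- no RDDs, no SparkContext.
    daily = (
        orders
        .groupBy("order_date")
        .agg({"order_id": "count"})
        .withColumnRenamed("count(order_id)", "order_count")
    )
    return daily
```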


Initial Debugging Attempts

We spent hours debugging our code, testing different approaches, and combing through Databricks and dbt documentation for clues—but nothing seemed to resolve the issue. The error persisted across multiple models and environments, leaving us puzzled. Eventually, we decided to experiment with the infrastructure itself. By switching the cluster type, we finally managed to get our dbt jobs running again. This confirmed that the problem wasn’t within our code or dbt configuration, but rather linked to the Databricks cluster environment.

Using the dbt_cli Cluster

During our investigation, we discovered Databricks’ dedicated dbt_cli cluster, which runs dbt jobs efficiently. This cluster simplifies integration by providing a pre-configured environment where dbt Core and its dependencies come pre-installed. Setup becomes faster, and the cluster reduces compatibility issues. However, it primarily supports job execution rather than interactive development or broader data processing tasks. While convenient and lightweight, it offers less flexibility and scalability than an all-purpose cluster. For example, it cannot handle mixed workloads or support ad-hoc queries as efficiently. In our case, switching to the dbt_cli cluster resolved the SparkContext problem. We did need to adjust our workflow to match the job-oriented design of this cluster type.

Exploring Serverless Clusters

In addition to the dbt_cli cluster, Databricks also offers serverless clusters, which have recently become a strong option for development and debugging. We found that when the cluster configuration includes "spark.databricks.serverless.environmentVersion": "3", it fully supports dbt runs without the SparkContext issue. Serverless clusters start up quickly, scale efficiently, and provide a clean environment that’s ideal for testing and interactive development. However, there’s a trade-off—these clusters have limited direct access to Unity Catalog in notebooks.
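A quick way to see what a given session resolved to, assuming that configuration key is exposed on the runtime you are connected to (the fallback covers clusters where it is not set):

```python
from pyspark.sql import SparkSession

# Sanity check from a notebook or dbt Python model; the key may not be exposed
# on every runtime, in which case the fallback value is returned.
spark = SparkSession.builder.getOrCreate()

version = spark.conf.get("spark.databricks.serverless.environmentVersion", "not set")
print(f"serverless environment version: {version}")
```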

Why All-Purpose Clusters Remain the Best Choice

In the end, we found that all-purpose clusters remain the best and fastest option for running our dbt workloads in Databricks. Their flexibility, performance, and compatibility with our Data Vault framework make them ideal for both development and production. While the recent issue forced us to explore alternatives like the dbt_cli and serverless clusters, these workarounds kept our pipelines running and gave us valuable insights into Databricks’ evolving infrastructure. Hopefully, future updates will restore full support for running dbt Python models directly on all-purpose clusters—bringing back the seamless experience we’ve enjoyed since 2022.