
As organisations scale their modern data platforms, the debate around open table formats increasingly centres on Databricks Delta Lake vs. Apache Iceberg. These two leading technologies both aim to deliver reliability, performance, and strong governance to data lakes—but they take distinct approaches, offer different strengths, and align with different use cases.

Whether you’re building a Lakehouse from scratch or modernising an existing data lake, understanding these differences is essential.

What Are Open Table Formats?

Open table formats enable data lakes to behave like databases—supporting ACID transactions, schema evolution, versioning, and efficient queries over massive datasets—while still using open storage (usually cloud object stores).
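To make that concrete, here is a minimal PySpark sketch of the database-like behaviour an open table format layers on top of object storage: an ACID upsert followed by a versioned (time-travel) read. It uses Delta Lake syntax purely as an example; the table and column names (lakehouse.sales, staging_sales, order_id) are illustrative and assume a Spark environment where the format is already available.

```python
from pyspark.sql import SparkSession

# Illustrative sketch; assumes a Spark environment with Delta Lake available
# and existing tables named lakehouse.sales and staging_sales.
spark = SparkSession.builder.getOrCreate()

# ACID upsert: readers never observe a half-applied change.
spark.sql("""
    MERGE INTO lakehouse.sales AS t
    USING staging_sales AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Versioning: read the table as it looked before the merge (time travel).
spark.sql("SELECT * FROM lakehouse.sales VERSION AS OF 1").show()
```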

The three major table formats today are:

  • Delta Lake (originated by Databricks)
  • Apache Iceberg (originally from Netflix)
  • Apache Hudi

This blog focuses on Delta Lake vs. Iceberg, the two most commonly compared options.

1. Architecture and Design Philosophy

Delta Lake (Databricks)

Delta Lake was built for high-performance analytics inside the Databricks Lakehouse Platform. It features:

  • Transaction logs stored in JSON
  • Tight integration with Databricks runtimes
  • Excellent performance with the Databricks Photon engine

Delta can be used outside Databricks, but the best features (Unity Catalog, Delta Live Tables, optimized writes) are available only on the Databricks platform.
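As a rough sketch of how the transaction log behaves (assuming a local Spark session with the open-source delta-spark package; the path is illustrative), each commit to a Delta table appears as an ordered JSON entry under _delta_log:

```python
import os
from pyspark.sql import SparkSession

# Minimal sketch; assumes the delta-spark package is installed locally.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/delta/events"  # illustrative local path
spark.range(100).write.format("delta").mode("overwrite").save(path)
spark.range(100, 200).write.format("delta").mode("append").save(path)

# Each commit is recorded as an ordered JSON file in _delta_log/:
# 00000000000000000000.json, 00000000000000000001.json, ...
print(sorted(os.listdir(os.path.join(path, "_delta_log"))))
```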

Design philosophy: Performance-first, deeply integrated into the Databricks ecosystem.

Apache Iceberg

Iceberg is a vendor-neutral, open, community-driven project designed for multi-engine interoperability (Spark, Flink, Trino, Presto, Snowflake, Dremio, BigQuery, etc.).

It uses:

  • A highly scalable metadata tree structure (metadata files, manifest lists, and manifest files)
  • A table snapshot model designed for massive datasets
  • Hidden partitioning and robust, field-ID-based schema evolution

Design philosophy: Open, flexible, engine-agnostic, built for multi-cloud and multi-engine architectures.
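A minimal sketch of those ideas in Spark SQL, assuming the iceberg-spark-runtime package and a locally configured Hadoop catalog (the catalog name lake, namespace db, and warehouse path are illustrative): the table is partitioned by a hidden days(ts) transform, and every commit becomes a queryable snapshot.

```python
from pyspark.sql import SparkSession

# Minimal sketch; assumes iceberg-spark-runtime is on the classpath. The catalog
# name "lake" and the warehouse path are illustrative placeholders.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Hidden partitioning: queries and writes never reference the derived day value.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.events (
        id BIGINT,
        ts TIMESTAMP,
        payload STRING
    )
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Every commit produces a snapshot, queryable through Iceberg's metadata tables.
spark.sql(
    "SELECT snapshot_id, committed_at, operation FROM lake.db.events.snapshots"
).show()
```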

2. Feature Comparison: Databricks Delta Lake vs. Apache Iceberg

ACID Transactions

Both Delta Lake and Iceberg support ACID transactions.

  • Delta Lake: JSON-based transaction log
  • Iceberg: Metadata & manifest trees

Verdict: Both are reliable, but Iceberg tends to scale better for very large metadata sets.

Schema Evolution

Both support schema evolution, but with some nuance:

  • Delta Lake: Supports add/drop/rename fields, but renames may be less reliable across all engines.
  • Iceberg: Offers the most robust schema evolution in the market, including field ID tracking and hidden partition evolution.

Verdict: Iceberg wins for long-term governance and cross-engine compatibility.
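For illustration, here is the kind of evolution each format supports, continuing the example tables above (all table and column names are illustrative; the Delta properties follow Databricks documentation for enabling column mapping and should be checked against your runtime version):

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg- and Delta-enabled sessions/tables from the earlier sketches.
spark = SparkSession.builder.getOrCreate()

# Iceberg: columns are tracked by field ID, so these changes stay consistent
# across engines.
spark.sql("ALTER TABLE lake.db.events ADD COLUMN amount INT")
spark.sql("ALTER TABLE lake.db.events ALTER COLUMN amount TYPE BIGINT")  # int -> long widening
spark.sql("ALTER TABLE lake.db.events RENAME COLUMN payload TO body")

# Delta: renaming generally requires name-based column mapping to be enabled first
# (property names per Databricks documentation; verify for your runtime).
spark.sql("""
    ALTER TABLE lakehouse.sales SET TBLPROPERTIES (
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5',
        'delta.columnMapping.mode' = 'name'
    )
""")
spark.sql("ALTER TABLE lakehouse.sales RENAME COLUMN order_date TO order_dt")
```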

Partitioning

  • Delta Lake: Partition pruning works well but relies on stored partition columns.
  • Iceberg: Introduced hidden partitioning, keeping partition logic internal to metadata.

Verdict: Iceberg is more flexible and easier to operate as data evolves.
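A short sketch of Iceberg partition evolution, continuing the lake.db.events example (illustrative names, Iceberg Spark SQL extensions assumed): the partition spec changes in metadata only, without rewriting existing files.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg-enabled session and the lake.db.events table from the
# earlier sketch.
spark = SparkSession.builder.getOrCreate()

# Evolve the partition spec: add a bucket on id, and switch from daily to
# monthly granularity on ts.
spark.sql("ALTER TABLE lake.db.events ADD PARTITION FIELD bucket(16, id)")
spark.sql("ALTER TABLE lake.db.events DROP PARTITION FIELD days(ts)")
spark.sql("ALTER TABLE lake.db.events ADD PARTITION FIELD months(ts)")

# Existing files keep their original spec; new writes use the new one, and
# queries still prune correctly because the partition logic lives in metadata.
```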

Performance

  • Delta Lake: Exceptional performance when paired with Databricks Photon.
  • Iceberg: Performance depends more on the query engine; strong with Trino, Spark, Snowflake, Dremio.

Verdict: If you’re all-in on Databricks, Delta wins. If you’re multi-engine, Iceberg is more flexible.
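On the Delta side, much of the performance story (Photon) is a runtime setting rather than a code change; here is a small, hedged example of the file-layout side of tuning, with illustrative table and column names and a Delta-enabled session assumed:

```python
from pyspark.sql import SparkSession

# Assumes a Delta-enabled session (e.g. a Databricks cluster); illustrative names.
spark = SparkSession.builder.getOrCreate()

# Compact small files and co-locate rows that are frequently filtered together.
spark.sql("OPTIMIZE lakehouse.sales ZORDER BY (order_date)")

# Photon itself is enabled on the Databricks cluster/runtime configuration,
# not switched on in the query.
```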

3. Interoperability

Delta Lake

  • Best performance inside Databricks
  • Limited writable interoperability across other engines
  • Delta Universal Format (UniForm) aims to bridge Delta → Iceberg/Hudi readers, but adoption is still growing.
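As a sketch of what enabling UniForm looks like on a recent Databricks runtime, the table properties below follow the Databricks documentation at the time of writing and should be verified for your version:

```python
from pyspark.sql import SparkSession

# Assumes a recent Databricks runtime; property names should be verified against
# current documentation. Table and column names are illustrative.
spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.sales_uniform (
        order_id BIGINT,
        order_date DATE,
        amount DOUBLE
    )
    USING DELTA
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```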

Apache Iceberg

  • Designed from day one for interoperability
  • Supported by Spark, Flink, Trino, Presto, Snowflake, Athena, Dremio, BigQuery (read support)

Verdict: If you want vendor neutrality and multi-engine support, Iceberg is the clear winner.

4. Governance and Catalog Integration

Delta Lake

  • Unity Catalog provides centralized governance—but only on Databricks.
  • Outside Databricks, Delta has fewer cataloging/governance features.

Iceberg

  • Works with many catalogs, including the Hive Metastore, AWS Glue, Nessie, JDBC, and REST-based catalogs (see the configuration sketch below)

Verdict: Iceberg offers broader ecosystem support.
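For example, attaching Spark to an Iceberg REST catalog takes only a few configuration keys; the catalog name lake and the URI below are placeholders, and other engines point at the same catalog through their own equivalent settings:

```python
from pyspark.sql import SparkSession

# Sketch of connecting Spark to an Iceberg REST catalog; the catalog name, URI,
# and namespace are illustrative placeholders for your own deployment.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://catalog.example.com/api")
    .getOrCreate()
)

# Trino, Flink, Snowflake, etc. attach to the same catalog with their own
# configuration, so every engine sees the same Iceberg tables.
spark.sql("SHOW TABLES IN lake.db").show()
```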

5. Use Cases Best Suited for Each

Choose Databricks Delta Lake if:

  • You are heavily invested in Databricks
  • You want the best performance with Photon
  • You prefer a fully managed Lakehouse ecosystem
  • You rely on Databricks features like MLflow, DLT, Unity Catalog

Choose Apache Iceberg if:

  • You need multi-engine interoperability
  • You want the most flexible open table format
  • You want to avoid vendor lock-in
  • You run workloads on multiple clouds or different query engines
  • Governance and schema evolution are priorities

Final Thoughts

The choice between Delta Lake and Apache Iceberg ultimately comes down to one key question:

Are you all-in on Databricks, or do you want an open, engine-agnostic data lake architecture?

  • If your data strategy revolves around Databricks, Delta Lake offers unmatched integration and performance.
  • If you’re building a flexible, future-proof data lake with multiple compute engines, Apache Iceberg is the best choice today.

In my next blog post, I will do a technical deep dive into Delta Lake vs. Apache Iceberg v3!

Why the Remote Client Cannot Create a SparkContext and How to Fix It

Since 2022, our data engineering team has been running Databricks and dbt Core to power our Data Vault environment. Everything ran smoothly—until we encountered the “remote client cannot create a SparkContext” error. This issue forced us to switch to creating a SparkSession instead and prompted a deep dive into its cause and solution.

That streak of reliability came to an abrupt stop last week when our dbt Python models running on Databricks started failing with the following error message:

[CONTEXT_UNAVAILABLE_FOR_REMOTE_CLIENT] The remote client cannot create a SparkContext. Create SparkSession instead.

This unexpected error disrupted our dbt runs, which had been stable for years. At first, it seemed related to how Spark contexts were being initialized—something that had not changed in our codebase. A deeper dive pointed instead to recent Databricks platform updates that changed how dbt’s Python models connect to the cluster: they now run as a remote client, which cannot create a SparkContext and must use a SparkSession instead.
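In practice, the fix on the code side is to stay on the SparkSession that dbt injects into Python models and avoid anything that touches a SparkContext (RDD APIs, SparkContext.getOrCreate(), session.sparkContext). A minimal sketch of such a model, with illustrative model and column names:

```python
# models/daily_orders.py -- illustrative dbt Python model.
# dbt passes a SparkSession ("session") into the model; on remote-client style
# clusters there is no SparkContext available, so stick to the DataFrame/SQL API
# and never call SparkContext.getOrCreate() or session.sparkContext.

def model(dbt, session):
    dbt.config(materialized="table")

    # Read an upstream dbt model as a Spark DataFrame.
    orders = dbt.ref("stg_orders")

    # Pure DataFrame operations -- no RDDs, no SparkContext.
    daily = (
        orders
        .groupBy("order_date")
        .agg({"order_id": "count"})
        .withColumnRenamed("count(order_id)", "order_count")
    )
    return daily
```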


Initial Debugging Attempts

We spent hours debugging our code, testing different approaches, and combing through Databricks and dbt documentation for clues—but nothing seemed to resolve the issue. The error persisted across multiple models and environments, leaving us puzzled. Eventually, we decided to experiment with the infrastructure itself. By switching the cluster type, we finally managed to get our dbt jobs running again. This confirmed that the problem wasn’t within our code or dbt configuration, but rather linked to the Databricks cluster environment.

Using the dbt_cli Cluster

During our investigation, we discovered Databricks’ dedicated dbt_cli cluster, which runs dbt jobs efficiently. This cluster simplifies integration by providing a pre-configured environment where dbt Core and its dependencies come pre-installed. Setup becomes faster, and the cluster reduces compatibility issues. However, it primarily supports job execution rather than interactive development or broader data processing tasks. While convenient and lightweight, it offers less flexibility and scalability than an all-purpose cluster. For example, it cannot handle mixed workloads or support ad-hoc queries as efficiently. In our case, switching to the dbt_cli cluster resolved the SparkContext problem. We did need to adjust our workflow to match the job-oriented design of this cluster type.

Exploring Serverless Clusters

In addition to the dbt_cli cluster, Databricks also offers serverless clusters, which have recently become a strong option for development and debugging. We found that when the cluster configuration includes "spark.databricks.serverless.environmentVersion": "3", it fully supports dbt runs without the SparkContext issue. Serverless clusters start up quickly, scale efficiently, and provide a clean environment that’s ideal for testing and interactive development. However, there’s a trade-off—these clusters have limited direct access to Unity Catalog in notebooks.
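A quick way to see what a given session resolved to, assuming that configuration key is exposed on the runtime you are connected to (the fallback covers clusters where it is not set):

```python
from pyspark.sql import SparkSession

# Sanity check from a notebook or dbt Python model; the key may not be exposed
# on every runtime, in which case the fallback value is returned.
spark = SparkSession.builder.getOrCreate()

version = spark.conf.get("spark.databricks.serverless.environmentVersion", "not set")
print(f"serverless environment version: {version}")
```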

Why All-Purpose Clusters Remain the Best Choice

In the end, we found that all-purpose clusters remain the best and fastest option for running our dbt workloads in Databricks. Their flexibility, performance, and compatibility with our Data Vault framework make them ideal for both development and production. While the recent issue forced us to explore alternatives like the dbt_cli and serverless clusters, these workarounds kept our pipelines running and gave us valuable insights into Databricks’ evolving infrastructure. Hopefully, future updates will restore full support for running dbt Python models directly on all-purpose clusters—bringing back the seamless experience we’ve enjoyed since 2022.