Delta Lake without Databricks

Scenario

Databricks is a great tool for big data: it manages the environment for creating tables in the Delta Lake format. Databricks launches a Spark cluster on which developers run PySpark code that creates the tables and appends to them.

What if appending to a Delta Lake table could be done in a cheaper environment than a Spark cluster? If we view an append as simply adding another metadata file to the delta log, then we can remove the need for a Spark cluster. The solution is a Rust crate called delta-rs. With this crate, you can create and append to Delta Lake tables without Spark.
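To make the "just another metadata file" idea concrete, here is a minimal toy sketch in Python that writes an `add` action as the next numbered JSON commit in a table's `_delta_log` directory. The file naming and action fields follow the Delta transaction-log protocol, but this ignores the `protocol` and `metaData` actions a real version 0 needs, as well as atomic commits and column statistics; real writers should use delta-rs.

```python
import json
import os


def next_commit_path(log_dir: str) -> str:
    """Return the path of the next numbered commit file in _delta_log."""
    versions = [
        int(name.split(".")[0])
        for name in os.listdir(log_dir)
        if name.endswith(".json")
    ]
    next_version = max(versions) + 1 if versions else 0
    # Commit files are zero-padded to 20 digits, e.g. 00000000000000000000.json
    return os.path.join(log_dir, f"{next_version:020d}.json")


def append_commit(log_dir: str, parquet_path: str, size_bytes: int) -> str:
    """Record a new parquet file as an 'add' action in the delta log.

    Toy version only: a real commit also needs atomic rename semantics
    and, for version 0, protocol and metaData actions.
    """
    action = {
        "add": {
            "path": parquet_path,
            "partitionValues": {},
            "size": size_bytes,
            "modificationTime": 0,
            "dataChange": True,
        }
    }
    commit_path = next_commit_path(log_dir)
    with open(commit_path, "w") as f:
        f.write(json.dumps(action) + "\n")
    return commit_path
```

Note that `append_commit` never opens the parquet file itself; the append is purely a matter of writing one more small JSON file to the log.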

Solution

With a solution written in Rust, the need for a Spark cluster is removed, which allows cheaper and more scalable environments to be used. For example, oxbow, a Rust project, runs in an AWS Lambda to turn newly landed parquet files into additions to a Delta Lake table (see the oxbow GitHub page). Another use of delta-rs is handling Kafka messages, as seen in the kafka-delta-ingest project. As the creator explained in a video on the project, moving off Spark removed the cluster failures caused by spikes in Kafka messages.

Code

The project rust-deltalake-poc contains example Rust code that reads a parquet file as the source and appends it to a Delta Lake table. The code creates version 0 of the table if it does not already exist. As files are added to the table, new versions are appended to the delta log. However, column-level metadata, such as min and max values, is not stored.

The repo also includes a Python script for reading the Delta Lake table. The script accepts a table version number, which is useful for exploring table state at a point in time. Additionally, there is a function to read and display a given checkpoint file. These options, along with the delta log being local, make it easier to learn how the delta log is structured and how it functions in Delta Lake.
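For exploring the log without any Delta libraries at all, the numbered commit files and the `_last_checkpoint` marker can be inspected with the standard library alone. A small sketch (the function names are mine, and `_last_checkpoint` is assumed to follow the documented single-line JSON format; reading the checkpoint parquet itself would still need a parquet reader):

```python
import json
import os


def list_versions(table_path: str) -> list[int]:
    """Return the table versions present as JSON commits in _delta_log."""
    log_dir = os.path.join(table_path, "_delta_log")
    return sorted(
        int(name.split(".")[0])
        for name in os.listdir(log_dir)
        if name.endswith(".json") and name.split(".")[0].isdigit()
    )


def last_checkpoint_version(table_path: str):
    """Return the version of the most recent checkpoint, or None if absent.

    _last_checkpoint is a small JSON file pointing readers at the latest
    checkpoint so they can skip replaying every commit from version 0.
    """
    marker = os.path.join(table_path, "_delta_log", "_last_checkpoint")
    if not os.path.exists(marker):
        return None
    with open(marker) as f:
        return json.load(f)["version"]
```

Listing the versions this way makes it obvious that "time travel" to an older version is just replaying the log up to that commit number.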

The code could be refactored to skip reading the source files before adding them to the Delta Lake table; this assumes the schema does not change. The benefit is that the code can add a large number of files to the table without needing to read those files.

Repos:

delta-rs: https://github.com/delta-io/delta-rs

rust-deltalake-poc: https://github.com/upjohnc/rust-deltalake-poc