Databricks Multiple Filters using a Python Lambda statement

databricks widgets

Simplifying Multiple Null Checks in Databricks

Recently, I ran into a case where I needed to check if 11 different fields were null. Yes, I could have used Copilot (or my preferred Codeium) to generate it for me, but I knew it had to be easier. There had to be an easier way…

The Scenario

You have a list of conditions in a python list (in my case, I pasted it as separate lines and did the following. Doesn’t matter how you specify your conditions, just that it’s a list somehow

conditions = """
condition 1
condition2
...".split("\n")

You have a way of associating this list with a set of Databricks columns

Using reduce from functools

Here’s what I found:

  • There’s a python function reduce in functools that takes a list a reduces it down to a result
  • You specify how this list is “reduced”, such as “and each element of this list together”

Example Implementation

Here’s how it works. first, I have an array of join_columns which are the common columns between two datasets. In my setup, each column of join_columns maps to the corresponding filter:

conditions <span class="token operator">=</span> <span class="token punctuation">[</span>F<span class="token punctuation">.</span>col<span class="token punctuation">(</span>join_columns<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token operator">==</span> filter_cols<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token keyword">for</span> i <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">2</span><span class="token punctuation">)</span><span class="token punctuation">]</span>

Now, we use the reduce function to combine these into a condition for filtering:

condition <span class="token operator">=</span> functools<span class="token punctuation">.</span><span class="token builtin">reduce</span><span class="token punctuation">(</span><span class="token keyword">lambda</span> a<span class="token punctuation">,</span> b<span class="token punctuation">:</span> a <span class="token operator">&</span> b<span class="token punctuation">,</span> conditions<span class="token punctuation">)</span>

Combining Conditions

So, in this case, we are saying all these conditions should be and’d together – or rather all of them should match. Naturally, had I wanted any of them to match, this would have have been:

condition <span class="token operator">=</span> functools<span class="token punctuation">.</span><span class="token builtin">reduce</span><span class="token punctuation">(</span><span class="token keyword">lambda</span> a<span class="token punctuation">,</span> b<span class="token punctuation">:</span> a <span class="token operator">|</span> b<span class="token punctuation">,</span> conditions<span class="token punctuation">)</span>

(the & is now a |)

Final Condition for Filtering

That’s it–I now have a condition that I can use:

res_df <span class="token operator">=</span> input_df<span class="token punctuation">.</span><span class="token builtin">filter</span><span class="token punctuation">(</span>condition<span class="token punctuation">)</span>