Removing Empty Columns

6th June 2020

Remove mostly empty columns to clean your data using pandas.

Show me everything!
Oh yeah, coding time.
Just the plots
Code is nasty.

The sad reality of life as a data scientist is we spend too much time cleaning and processing data. And a lot of the time, our data contains some features which simply need to go in the bin. A column of 1000 values with only 2 entries is probably not going to be useful, after all. Let’s make some data to illustrate this:

import pandas as pd
import numpy as np

data = np.random.random(size=(100, 3))  # Random data between 0 and 1
data[data[:, 2] < 0.95, 2] = np.nan  # Set most of the last column to NaN
df = pd.DataFrame(data, columns=["A", "B", "C"])
display(df)

	A	B	C
0	0.220628	0.245779	NaN
1	0.854705	0.370203	NaN
2	0.669266	0.735099	NaN
3	0.890542	0.267307	NaN
4	0.595718	0.001851	NaN
...	...	...	...
95	0.544113	0.792105	NaN
96	0.734058	0.224512	NaN
97	0.500468	0.004238	NaN
98	0.152202	0.780630	NaN
99	0.279031	0.041635	NaN

100 rows × 3 columns

Here is a handy snippet to remove columns which are more than threshold empty:

def remove_empty_columns(df, threshold=0.9):
    column_mask = df.isnull().mean(axis=0) < threshold
    return df.loc[:, column_mask]

Super simple. Notice the use of isnull here instead of isnan - the former is more general.

df2 = remove_empty_columns(df)
display(df2)

	A	B
0	0.220628	0.245779
1	0.854705	0.370203
2	0.669266	0.735099
3	0.890542	0.267307
4	0.595718	0.001851
...	...	...
95	0.544113	0.792105
96	0.734058	0.224512
97	0.500468	0.004238
98	0.152202	0.780630
99	0.279031	0.041635

100 rows × 2 columns

And bam, column C is gone!

That’s it for this post, wanted to keep this one super short because I know I’ve seen at least a dozen different implementations to remove empty columns, and some of them are definitely better than others. May this hopefully be a useful snippet!

For your convenience, here’s the code in one block:

import pandas as pd
import numpy as np

data = np.random.random(size=(100, 3))  # Random data between 0 and 1
data[data[:, 2] < 0.95, 2] = np.nan  # Set most of the last column to NaN
df = pd.DataFrame(data, columns=["A", "B", "C"])
display(df)
def remove_empty_columns(df, threshold=0.9):
    column_mask = df.isnull().mean(axis=0) < threshold
    return df.loc[:, column_mask]
df2 = remove_empty_columns(df)
display(df2)