Removing Empty Columns

6th June 2020

Remove mostly empty columns to clean your data using pandas.

The sad reality of life as a data scientist is we spend too much time cleaning and processing data. And a lot of the time, our data contains some features which simply need to go in the bin. A column of 1000 values with only 2 entries is probably not going to be useful, after all. Let’s make some data to illustrate this:

import pandas as pd
import numpy as np

data = np.random.random(size=(100, 3))  # Random data between 0 and 1
data[data[:, 2] < 0.95, 2] = np.nan  # Set most of the last column to NaN
df = pd.DataFrame(data, columns=["A", "B", "C"])
display(df)
ABC
00.2206280.245779NaN
10.8547050.370203NaN
20.6692660.735099NaN
30.8905420.267307NaN
40.5957180.001851NaN
............
950.5441130.792105NaN
960.7340580.224512NaN
970.5004680.004238NaN
980.1522020.780630NaN
990.2790310.041635NaN

100 rows × 3 columns

Here is a handy snippet to remove columns which are more than threshold empty:

def remove_empty_columns(df, threshold=0.9):
    column_mask = df.isnull().mean(axis=0) < threshold
    return df.loc[:, column_mask]

Super simple. Notice the use of isnull here instead of isnan - the former is more general.

df2 = remove_empty_columns(df)
display(df2)
AB
00.2206280.245779
10.8547050.370203
20.6692660.735099
30.8905420.267307
40.5957180.001851
.........
950.5441130.792105
960.7340580.224512
970.5004680.004238
980.1522020.780630
990.2790310.041635

100 rows × 2 columns

And bam, column C is gone!

That’s it for this post, wanted to keep this one super short because I know I’ve seen at least a dozen different implementations to remove empty columns, and some of them are definitely better than others. May this hopefully be a useful snippet!


For your convenience, here’s the code in one block:

import pandas as pd
import numpy as np

data = np.random.random(size=(100, 3))  # Random data between 0 and 1
data[data[:, 2] < 0.95, 2] = np.nan  # Set most of the last column to NaN
df = pd.DataFrame(data, columns=["A", "B", "C"])
display(df)
def remove_empty_columns(df, threshold=0.9):
    column_mask = df.isnull().mean(axis=0) < threshold
    return df.loc[:, column_mask]
df2 = remove_empty_columns(df)
display(df2)