18 Jun 2020, Samuel Hinton

Removing Empty Columns


Remove mostly empty columns to clean your data using pandas.

The sad reality of life as a data scientist is we spend too much time cleaning and processing data. And a lot of the time, our data contains some features which simply need to go in the bin. A column of 1000 values with only 2 entries is probably not going to be useful, after all. Let’s make some data to illustrate this:

import pandas as pd
import numpy as np

data = np.random.random(size=(100, 3))  # Random data between 0 and 1
data[data[:, 2] < 0.95, 2] = np.nan  # Set most of the last column to NaN
df = pd.DataFrame(data, columns=["A", "B", "C"])
display(df)
A B C
0 0.220628 0.245779 NaN
1 0.854705 0.370203 NaN
2 0.669266 0.735099 NaN
3 0.890542 0.267307 NaN
4 0.595718 0.001851 NaN
... ... ... ...
95 0.544113 0.792105 NaN
96 0.734058 0.224512 NaN
97 0.500468 0.004238 NaN
98 0.152202 0.780630 NaN
99 0.279031 0.041635 NaN

100 rows × 3 columns

Here is a handy snippet to remove columns which are more than threshold empty:

Super simple. Notice the use of isnull here instead of isnan - the former is more general.

df2 = remove_empty_columns(df)
display(df2)
A B
0 0.220628 0.245779
1 0.854705 0.370203
2 0.669266 0.735099
3 0.890542 0.267307
4 0.595718 0.001851
... ... ...
95 0.544113 0.792105
96 0.734058 0.224512
97 0.500468 0.004238
98 0.152202 0.780630
99 0.279031 0.041635

100 rows × 2 columns

And bam, column C is gone!

That’s it for this post, wanted to keep this one super short because I know I’ve seen at least a dozen different implementations to remove empty columns, and some of them are definitely better than others. May this hopefully be a useful snippet!

Here’s the full code for convenience:

import numpy as np
import pandas as pd


data = np.random.random(size=(100, 3))  # Random data between 0 and 1
data[data[:, 2] < 0.95, 2] = np.nan  # Set most of the last column to NaN
df = pd.DataFrame(data, columns=["A", "B", "C"])
display(df)

def remove_empty_columns(df, threshold=0.9):
    column_mask = df.isnull().mean(axis=0) < threshold
    return df.loc[:, column_mask]

df2 = remove_empty_columns(df)
display(df2)