6th June 2020
Remove mostly empty columns to clean your data using pandas.
The sad reality of life as a data scientist is we spend too much time cleaning and processing data. And a lot of the time, our data contains some features which simply need to go in the bin. A column of 1000 values with only 2 entries is probably not going to be useful, after all. Let’s make some data to illustrate this:
import pandas as pd
import numpy as np
data = np.random.random(size=(100, 3)) # Random data between 0 and 1
data[data[:, 2] < 0.95, 2] = np.nan # Set most of the last column to NaN
df = pd.DataFrame(data, columns=["A", "B", "C"])
display(df)
A | B | C | |
---|---|---|---|
0 | 0.220628 | 0.245779 | NaN |
1 | 0.854705 | 0.370203 | NaN |
2 | 0.669266 | 0.735099 | NaN |
3 | 0.890542 | 0.267307 | NaN |
4 | 0.595718 | 0.001851 | NaN |
... | ... | ... | ... |
95 | 0.544113 | 0.792105 | NaN |
96 | 0.734058 | 0.224512 | NaN |
97 | 0.500468 | 0.004238 | NaN |
98 | 0.152202 | 0.780630 | NaN |
99 | 0.279031 | 0.041635 | NaN |
100 rows × 3 columns
Here is a handy snippet to remove columns which are more than threshold
empty:
def remove_empty_columns(df, threshold=0.9):
column_mask = df.isnull().mean(axis=0) < threshold
return df.loc[:, column_mask]
Super simple. Notice the use of isnull
here instead of isnan
- the former is more general.
df2 = remove_empty_columns(df)
display(df2)
A | B | |
---|---|---|
0 | 0.220628 | 0.245779 |
1 | 0.854705 | 0.370203 |
2 | 0.669266 | 0.735099 |
3 | 0.890542 | 0.267307 |
4 | 0.595718 | 0.001851 |
... | ... | ... |
95 | 0.544113 | 0.792105 |
96 | 0.734058 | 0.224512 |
97 | 0.500468 | 0.004238 |
98 | 0.152202 | 0.780630 |
99 | 0.279031 | 0.041635 |
100 rows × 2 columns
And bam, column C is gone!
That’s it for this post, wanted to keep this one super short because I know I’ve seen at least a dozen different implementations to remove empty columns, and some of them are definitely better than others. May this hopefully be a useful snippet!
For your convenience, here’s the code in one block:
import pandas as pd
import numpy as np
data = np.random.random(size=(100, 3)) # Random data between 0 and 1
data[data[:, 2] < 0.95, 2] = np.nan # Set most of the last column to NaN
df = pd.DataFrame(data, columns=["A", "B", "C"])
display(df)
def remove_empty_columns(df, threshold=0.9):
column_mask = df.isnull().mean(axis=0) < threshold
return df.loc[:, column_mask]
df2 = remove_empty_columns(df)
display(df2)