6th June 2020
The most efficient code snippet to one-hot encode columns
One-hot encoding is something we do very commonly in machine learning, where we want to turn a categorical feature into a vector of ones and zeros that algorithms can make much easier sense of.
For example, take this toy dataframe of people and their favourite food. As it stands, the FavFood column is just strings, which most algorithms cannot use directly.
import pandas as pd
from IPython.display import display  # already available in Jupyter; needed elsewhere
df = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"],
    "FavFood": ["Pizza", "Vegetables", "Cake", "Happiness"]
}).set_index("Person")
display(df)
Person | FavFood |
---|---|
Sam | Pizza |
Ali | Vegetables |
Jane | Cake |
John | Happiness |
I’ve seen enough different implementations of one-hot encoding written from first principles in machine learning code that I thought I’d throw my own version into the ring. If you want a “big boy” solution, you can always just appeal to scikit-learn’s OneHotEncoder, but the pandas methods are even simpler. The first one operates on a dataframe as a whole; the second one operates on a specific column, if that’s all you care about.
def one_hot_df(df):
    # dtype=int keeps the 0/1 output shown below; pandas 2.0+ defaults to booleans
    return pd.get_dummies(df, dtype=int)
def one_hot_col(df, col):
    return df[col].str.get_dummies()
The difference between the two is that the generic pd.get_dummies keeps the original column name as a prefix on each new column, while the str.get_dummies version uses the bare category values.
display(one_hot_df(df), one_hot_col(df, "FavFood"))
Person | FavFood_Cake | FavFood_Happiness | FavFood_Pizza | FavFood_Vegetables |
---|---|---|---|---|
Sam | 0 | 0 | 1 | 0 |
Ali | 0 | 0 | 0 | 1 |
Jane | 1 | 0 | 0 | 0 |
John | 0 | 1 | 0 | 0 |

Person | Cake | Happiness | Pizza | Vegetables |
---|---|---|---|---|
Sam | 0 | 0 | 1 | 0 |
Ali | 0 | 0 | 0 | 1 |
Jane | 1 | 0 | 0 | 0 |
John | 0 | 1 | 0 | 0 |
Amazing and super simple!
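For comparison, here is a rough sketch of the scikit-learn route mentioned earlier. It is not part of the original snippets; the names encoder, one_hot_sklearn and new_people are made up for illustration, and it assumes scikit-learn is installed. The main reason to reach for it is that the fitted encoder remembers its categories and can be reused on new data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"],
    "FavFood": ["Pizza", "Vegetables", "Cake", "Happiness"],
}).set_index("Person")

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[["FavFood"]])  # sparse output by default

# Wrap the dense array back into a dataframe, labelled by the learned categories
one_hot_sklearn = pd.DataFrame(
    encoded.toarray().astype(int),
    index=df.index,
    columns=encoder.categories_[0],
)
print(one_hot_sklearn)

# The fitted encoder can be applied to new rows with the same column name
new_people = pd.DataFrame({"FavFood": ["Cake", "Sushi"]})
print(encoder.transform(new_people).toarray())
```

With pandas alone, an unseen value like "Sushi" would simply produce a new column, so the encoded frames from two datasets may not line up.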
But I’ve also seen survey results with multi-choice options, where the answers come back as lists packed into a single delimited string. Like this:
df2 = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"],
    "Nationality": ["Australia", "Australia", "USA", "USA/German"]
}).set_index("Person")
display(df2)
Person | Nationality |
---|---|
Sam | Australia |
Ali | Australia |
Jane | USA |
John | USA/German |
So now the question is “What is the simplest way we can hot encode this data?” And the answer is to change almost nothing! str.get_dummies already accepts a separator argument, sep, which defaults to "|".
def hot_encode_col(df, col, sep="/"):
    return df[col].str.get_dummies(sep=sep)
display(hot_encode_col(df2, "Nationality"))
Person | Australia | German | USA |
---|---|---|---|
Sam | 1 | 0 | 0 |
Ali | 1 | 0 | 0 |
Jane | 0 | 0 | 1 |
John | 0 | 1 | 1 |
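To see why the separator matters, here is a small sketch comparing the default separator with an explicit one. The series nat and the variable names are made up for illustration.

```python
import pandas as pd

nat = pd.Series(["Australia", "USA", "USA/German"], name="Nationality")

# With the default sep="|", "USA/German" is treated as a single category
default_cols = list(nat.str.get_dummies().columns)

# With sep="/", the string is split and each part gets its own column
split_cols = list(nat.str.get_dummies(sep="/").columns)

print(default_cols)  # ['Australia', 'USA', 'USA/German']
print(split_cols)    # ['Australia', 'German', 'USA']
```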
Very, very simple. And if, for some reason, get_dummies is not behaving nicely, or you really want those multi-level column indexes, you can do it manually using melt and pivot_table:
def one_hot_melt_pivot(df):
    names = df.index.names
    melted = df.reset_index().melt(id_vars=names)
    return melted.pivot_table(index=names,
                              columns=["variable", "value"],
                              aggfunc=len,
                              fill_value=0)
display(one_hot_melt_pivot(df))
Person | (FavFood, Cake) | (FavFood, Happiness) | (FavFood, Pizza) | (FavFood, Vegetables) |
---|---|---|---|---|
Ali | 0 | 0 | 0 | 1 |
Jane | 1 | 0 | 0 | 0 |
John | 0 | 1 | 0 | 0 |
Sam | 0 | 0 | 1 | 0 |
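Going back to the multi-choice case: the example above stores the options as one delimited string. If a survey export instead gives you real Python lists in each cell, one possible manual route is explode, then get_dummies, then a group-wise max. This is only a sketch under that assumption; df3 and hot_encode_list_col are hypothetical names, not part of the original snippets.

```python
import pandas as pd

df3 = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"],
    "Nationality": [["Australia"], ["Australia"], ["USA"], ["USA", "German"]],
}).set_index("Person")

def hot_encode_list_col(df, col):
    # One row per list element; the index label is repeated for each element
    exploded = df[col].explode()
    # One-hot encode each element, then collapse back to one row per index label
    dummies = pd.get_dummies(exploded, dtype=int)
    return dummies.groupby(level=0, sort=False).max()

encoded_lists = hot_encode_list_col(df3, "Nationality")
print(encoded_lists)
```

Taking the max rather than the sum keeps the result at 0 or 1 even if a person lists the same option twice.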
For your convenience, here’s the code in one block:
import pandas as pd
from IPython.display import display  # already available in Jupyter; needed elsewhere
df = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"],
    "FavFood": ["Pizza", "Vegetables", "Cake", "Happiness"]
}).set_index("Person")
display(df)
def one_hot_df(df):
    # dtype=int keeps 0/1 output; pandas 2.0+ defaults to booleans
    return pd.get_dummies(df, dtype=int)
def one_hot_col(df, col):
    return df[col].str.get_dummies()
display(one_hot_df(df), one_hot_col(df, "FavFood"))
df2 = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"],
    "Nationality": ["Australia", "Australia", "USA", "USA/German"]
}).set_index("Person")
display(df2)
def hot_encode_col(df, col, sep="/"):
    return df[col].str.get_dummies(sep=sep)
display(hot_encode_col(df2, "Nationality"))
def one_hot_melt_pivot(df):
    names = df.index.names
    melted = df.reset_index().melt(id_vars=names)
    return melted.pivot_table(index=names,
                              columns=["variable", "value"],
                              aggfunc=len,
                              fill_value=0)
display(one_hot_melt_pivot(df))