17 Jun 2020, Samuel Hinton

Trivial One Hot Encoding in Python


The most efficient code snippet to one-hot encode columns

One hot encoding is something we do very commonly in machine learning, where we want to turn a categorical feature into a vector of ones and zeros that algorithms can make much easier sense of.
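To make the idea concrete, here is a minimal sketch in plain Python with no libraries involved. The helper name `one_hot_vectors` and the sample labels are made up for illustration, and sorting the categories alphabetically is an arbitrary choice.

```python
def one_hot_vectors(values):
    """Map each label to a list of 0/1 flags, one slot per distinct category."""
    categories = sorted(set(values))  # fixed category order (alphabetical, by assumption)
    position = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1  # exactly one slot is "hot" per row
        vectors.append(row)
    return categories, vectors


categories, vectors = one_hot_vectors(["red", "green", "red", "blue"])
print(categories)  # ['blue', 'green', 'red']
print(vectors)     # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```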

For example, take this toy dataframe of people and their favourite food. As it stands, the FavFood column is just text, which most algorithms can’t consume directly.

import pandas as pd

df = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"], 
    "FavFood": ["Pizza", "Vegetables", "Cake", "Happiness"]
}).set_index("Person")

display(df)
           FavFood
Person
Sam          Pizza
Ali     Vegetables
Jane          Cake
John     Happiness

I’ve seen enough different implementations of one-hot encoding in machine learning from first principles that I thought I’d throw my own version into the ring. If you want a “big boy” solution, you can always just appeal to scikit-learn’s OneHotEncoder, but the methods here are even simpler. The first one operates on a dataframe as a whole; the second one operates on a specific column if that’s all you care about.
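The two helpers described above are defined in the full listing at the end of this post; they are repeated here so the next call makes sense. The dataframe is rebuilt so the snippet runs on its own, and the comments are added for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"],
    "FavFood": ["Pizza", "Vegetables", "Cake", "Happiness"],
}).set_index("Person")


def one_hot_df(df):
    # Encodes every text/categorical column; new columns are named "<column>_<value>".
    return pd.get_dummies(df)


def one_hot_col(df, col):
    # Encodes a single column; new columns are named after the values alone.
    return df[col].str.get_dummies()
```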

The difference between the two is in the column names: the generic pd.get_dummies keeps the original column name as a prefix (FavFood_Cake), while the single-column version uses the bare values (Cake).

display(one_hot_df(df), one_hot_col(df, "FavFood"))
        FavFood_Cake  FavFood_Happiness  FavFood_Pizza  FavFood_Vegetables
Person
Sam                0                  0              1                   0
Ali                0                  0              0                   1
Jane               1                  0              0                   0
John               0                  1              0                   0

        Cake  Happiness  Pizza  Vegetables
Person
Sam        0          0      1           0
Ali        0          0      0           1
Jane       1          0      0           0
John       0          1      0           0

Amazing and super simple!
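If you want the dataframe-wide version to produce the same bare column names as the single-column one, pd.get_dummies takes `prefix` and `prefix_sep` arguments. The sketch below also passes `dtype=int`, since pandas 2.0 and later return True/False columns from pd.get_dummies by default rather than the 0/1 shown above. The helper name `one_hot_df_bare` is made up for this example, and blanking the prefix only really makes sense when a single column is being encoded.

```python
import pandas as pd

df = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"],
    "FavFood": ["Pizza", "Vegetables", "Cake", "Happiness"],
}).set_index("Person")


def one_hot_df_bare(df):
    # Hypothetical helper: an empty prefix and separator leave only the value
    # as the column name; dtype=int forces 0/1 instead of booleans.
    return pd.get_dummies(df, prefix="", prefix_sep="", dtype=int)


encoded = one_hot_df_bare(df)
print(list(encoded.columns))  # ['Cake', 'Happiness', 'Pizza', 'Vegetables']
```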

Unnecessary complications

But I’ve also seen survey results with multi-choice questions, where the answers come back as a delimited list packed into a single string. Like this:

df2 = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"], 
    "Nationality": ["Australia", "Australia", "USA", "USA/German"]
}).set_index("Person")
display(df2)
        Nationality
Person
Sam       Australia
Ali       Australia
Jane            USA
John     USA/German

So now the question is “What is the simplest way we can hot encode this data?” And the answer is to change almost nothing! str.get_dummies already accepts a separator through its sep argument (the default is "|"), so we only need to pass ours along.

def hot_encode_col(df, col, sep="/"):
    return df[col].str.get_dummies(sep=sep)

display(hot_encode_col(df2, "Nationality"))
        Australia  German  USA
Person
Sam             1       0    0
Ali             1       0    0
Jane            0       0    1
John            0       1    1
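For intuition about what the separator is doing, the same table can be built by hand: split each string, give every piece its own row, then collapse back to one row per person. This is a longhand sketch and not a description of how pandas implements str.get_dummies internally; the name `hot_encode_col_longhand` is made up here. It also strips whitespace around each piece, so an entry typed as "USA / German" is treated the same as "USA/German".

```python
import pandas as pd

df2 = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"],
    "Nationality": ["Australia", "Australia", "USA", "USA / German"],
}).set_index("Person")


def hot_encode_col_longhand(df, col, sep="/"):
    # One row per piece: "USA / German" becomes two rows, both indexed by "John".
    pieces = df[col].str.split(sep).explode().str.strip()
    # One-hot encode the pieces, then merge the rows that share an index label.
    # sort=False keeps the people in their original order.
    return pd.get_dummies(pieces, dtype=int).groupby(level=0, sort=False).max()


print(hot_encode_col_longhand(df2, "Nationality"))
```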

Very, very simple. And if, for some reason, get_dummies is not behaving nicely, or you really want those multi-level column indexes, you can do it manually using melt and pivot_table:

def one_hot_melt_pivot(df):
    names = df.index.names
    melted = df.reset_index().melt(id_vars=names)
    return melted.pivot_table(index=names, 
                              columns=["variable", "value"], 
                              aggfunc=len, 
                              fill_value=0)

display(one_hot_melt_pivot(df))
variable FavFood
value       Cake Happiness Pizza Vegetables
Person
Ali            0         0     0          1
Jane           1         0     0          0
John           0         1     0          0
Sam            0         0     1          0
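Where the multi-level columns earn their keep is a dataframe with more than one categorical column. The sketch below is a variation on the function above rather than the same code: it adds an explicit helper column of ones (named `hit` here, a made-up name) and sums it, so the pivot has a concrete value column to aggregate. The Drink column and the function name `one_hot_melt_pivot_explicit` are invented for the example.

```python
import pandas as pd

df3 = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"],
    "FavFood": ["Pizza", "Vegetables", "Cake", "Happiness"],
    "Drink": ["Tea", "Coffee", "Tea", "Water"],  # invented column for illustration
}).set_index("Person")


def one_hot_melt_pivot_explicit(df):
    names = list(df.index.names)
    # Long format: one row per (person, original column name, value).
    melted = df.reset_index().melt(id_vars=names)
    melted["hit"] = 1  # helper column of ones to aggregate
    return melted.pivot_table(index=names,
                              columns=["variable", "value"],
                              values="hit",
                              aggfunc="sum",
                              fill_value=0)


wide = one_hot_melt_pivot_explicit(df3)
print(wide["Drink"])  # selecting the top level returns just that column's block
```

Selecting on the first column level, as in the last line, is the main convenience this layout buys over prefixed flat names.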

Here’s the full code for convenience:

import pandas as pd
from IPython.display import display  # built into Jupyter; imported so this also runs as a script


df = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"], 
    "FavFood": ["Pizza", "Vegetables", "Cake", "Happiness"]
}).set_index("Person")

display(df)

def one_hot_df(df):
    return pd.get_dummies(df)

def one_hot_col(df, col):
    return df[col].str.get_dummies()

display(one_hot_df(df), one_hot_col(df, "FavFood"))

df2 = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"], 
    "Nationality": ["Australia", "Australia", "USA", "USA/German"]
}).set_index("Person")
display(df2)

def hot_encode_col(df, col, sep="/"):
    return df[col].str.get_dummies(sep=sep)

display(hot_encode_col(df2, "Nationality"))

def one_hot_melt_pivot(df):
    names = df.index.names
    melted = df.reset_index().melt(id_vars=names)
    return melted.pivot_table(index=names, 
                              columns=["variable", "value"], 
                              aggfunc=len, 
                              fill_value=0)

display(one_hot_melt_pivot(df))