Trivial One Hot Encoding in Python

6th June 2020

The most efficient code snippet to one-hot encode columns

One-hot encoding is something we do all the time in machine learning: we turn a categorical feature into a vector of ones and zeros that algorithms can make much easier sense of.
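To make the idea concrete, here is a minimal from-scratch sketch in plain Python. The `one_hot` helper and the category list are hypothetical names used only for illustration; they are not part of pandas or any other library.

```python
def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 everywhere else."""
    return [1 if value == category else 0 for category in categories]

# Hypothetical category list, sorted so each category keeps a stable position.
categories = sorted(["Pizza", "Vegetables", "Cake"])

for food in ["Pizza", "Cake"]:
    print(food, one_hot(food, categories))
```

Each category gets its own slot in the vector, and exactly one slot is "hot" for any given value.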

Take this toy dataframe of people and their favourite food. As raw strings, the FavFood column is useless to most algorithms.

import pandas as pd

df = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"], 
    "FavFood": ["Pizza", "Vegetables", "Cake", "Happiness"]
}).set_index("Person")

display(df)

           FavFood
Person
Sam          Pizza
Ali     Vegetables
Jane          Cake
John     Happiness

I’ve seen enough different implementations of one-hot encoding in machine learning, written from first principles, that I thought I’d throw my own version into the ring. If you want a “big boy” solution, you can always just appeal to scikit-learn’s OneHotEncoder, but the methods below are even simpler. The first one operates on a dataframe as a whole, the second one operates on a specific column if that’s all you care about.

def one_hot_df(df):
    # dtype=int keeps the 0/1 output on pandas 2.0+, where the default dtype is bool
    return pd.get_dummies(df, dtype=int)

def one_hot_col(df, col):
    return df[col].str.get_dummies()

The difference between the two is that the generic pd.get_dummies keeps the original column name as a prefix on every new column, while the string version, Series.str.get_dummies, uses the bare category values.

display(one_hot_df(df), one_hot_col(df, "FavFood"))

        FavFood_Cake  FavFood_Happiness  FavFood_Pizza  FavFood_Vegetables
Person
Sam                0                  0              1                   0
Ali                0                  0              0                   1
Jane               1                  0              0                   0
John               0                  1              0                   0

        Cake  Happiness  Pizza  Vegetables
Person
Sam        0          0      1           0
Ali        0          0      0           1
Jane       1          0      0           0
John       0          1      0           0

Amazing and super simple!
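If you would like the column names of the two versions to line up, one option is sketched below. It rebuilds the toy dataframe so the snippet is self-contained, and relies on the `prefix` and `prefix_sep` arguments of pd.get_dummies and on DataFrame.add_prefix. The variable names `no_prefix` and `with_prefix` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"],
    "FavFood": ["Pizza", "Vegetables", "Cake", "Happiness"]
}).set_index("Person")

# Option 1: drop the "FavFood_" prefix from the dataframe-wide version.
no_prefix = pd.get_dummies(df, prefix="", prefix_sep="", dtype=int)

# Option 2: add the prefix to the single-column version instead.
with_prefix = df["FavFood"].str.get_dummies().add_prefix("FavFood_")

print(list(no_prefix.columns))
print(list(with_prefix.columns))
```

Dropping the prefix is only safe when a single column is being encoded; with several categorical columns the prefix is what keeps identical category names apart.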

Unnecessary complications

But I’ve also seen survey results where there are multi-choice options, and the answers have come back as lists packed into a single delimited string. Like this:

df2 = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"], 
    "Nationality": ["Australia", "Australia", "USA", "USA/German"]
}).set_index("Person")
display(df2)

        Nationality
Person
Sam       Australia
Ali       Australia
Jane            USA
John     USA/German

So now the question is “What is the simplest way we can one-hot encode this data?” And the answer is to change almost nothing! The string version, Series.str.get_dummies, already accepts a separator through its sep argument (the default is "|").

def hot_encode_col(df, col, sep="/"):
    return df[col].str.get_dummies(sep=sep)

display(hot_encode_col(df2, "Nationality"))

        Australia  German  USA
Person
Sam             1       0    0
Ali             1       0    0
Jane            0       0    1
John            0       1    1
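To see why the string version matters here, the sketch below runs both encoders on the same column. It rebuilds `df2` so it can be run on its own; the names `whole` and `split` are illustrative only.

```python
import pandas as pd

df2 = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"],
    "Nationality": ["Australia", "Australia", "USA", "USA/German"]
}).set_index("Person")

# pd.get_dummies has no separator argument, so "USA/German" becomes its own category.
whole = pd.get_dummies(df2["Nationality"], dtype=int)

# Series.str.get_dummies splits each string on `sep` before encoding.
split = df2["Nationality"].str.get_dummies(sep="/")

print(list(whole.columns))
print(list(split.columns))
print(split.loc["John"].tolist())
```

With the split version John is marked under both German and USA, which is usually what a multi-choice survey answer is meant to express.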

Very, very simple. And if for some reason get_dummies is not behaving nicely, or you really want multi-level column indexes, you can do it manually using melt and pivot_table:

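def one_hot_melt_pivot(df):
    # Remember the index names so they survive the reshape
    names = df.index.names
    # Long format: one row per (index entry, original column, value)
    melted = df.reset_index().melt(id_vars=names)
    # Count each (column, value) pair per index entry; missing pairs become 0
    return melted.pivot_table(index=names, 
                              columns=["variable", "value"], 
                              aggfunc=len, 
                              fill_value=0)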

display(one_hot_melt_pivot(df))

variable  FavFood
value        Cake  Happiness  Pizza  Vegetables
Person
Ali             0          0      0           1
Jane            1          0      0           0
John            0          1      0           0
Sam             0          0      1           0
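After the melt, every column is used either as the index or as a column key, so the version above depends on pivot_table counting rows with no value column left over. If you would rather make the counting explicit, a variant is sketched here. The function name `one_hot_melt_pivot_explicit` and the helper column `n` are hypothetical, introduced only for this example.

```python
import pandas as pd

def one_hot_melt_pivot_explicit(df):
    names = list(df.index.names)
    melted = df.reset_index().melt(id_vars=names)
    # Hypothetical helper column: one count per (index entry, column, value) row.
    melted["n"] = 1
    table = melted.pivot_table(index=names,
                               columns=["variable", "value"],
                               values="n",
                               aggfunc="sum",
                               fill_value=0)
    # Cast to int in case the fill step left floats behind.
    return table.astype(int)

df = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"],
    "FavFood": ["Pizza", "Vegetables", "Cake", "Happiness"]
}).set_index("Person")

table = one_hot_melt_pivot_explicit(df)
print(table)
```

The result should have the same two-level columns (variable, value) as the original function, so individual cells can be read with a tuple key such as `table.loc["Sam", ("FavFood", "Pizza")]`.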

For your convenience, here’s the code in one block:

import pandas as pd
from IPython.display import display  # already available in Jupyter; needed in a plain script

df = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"], 
    "FavFood": ["Pizza", "Vegetables", "Cake", "Happiness"]
}).set_index("Person")

display(df)

def one_hot_df(df):
    # dtype=int keeps the 0/1 output on pandas 2.0+, where the default dtype is bool
    return pd.get_dummies(df, dtype=int)

def one_hot_col(df, col):
    return df[col].str.get_dummies()

display(one_hot_df(df), one_hot_col(df, "FavFood"))

df2 = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"], 
    "Nationality": ["Australia", "Australia", "USA", "USA/German"]
}).set_index("Person")

display(df2)

def hot_encode_col(df, col, sep="/"):
    return df[col].str.get_dummies(sep=sep)

display(hot_encode_col(df2, "Nationality"))

def one_hot_melt_pivot(df):
    names = df.index.names
    melted = df.reset_index().melt(id_vars=names)
    return melted.pivot_table(index=names, 
                              columns=["variable", "value"], 
                              aggfunc=len, 
                              fill_value=0)

display(one_hot_melt_pivot(df))