Trivial One Hot Encoding in Python

6th June 2020

The most efficient code snippet to one-hot encode columns


One hot encoding is something we do very commonly in machine learning, where we want to turn a categorical feature into a vector of ones and zeros that algorithms can make much easier sense of.

For example, take this toy dataframe of people and their favourite food. As raw text, it’s useless to most algorithms.

import pandas as pd

df = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"], 
    "FavFood": ["Pizza", "Vegetables", "Cake", "Happiness"]
}).set_index("Person")

display(df)

	FavFood
Person
Sam	Pizza
Ali	Vegetables
Jane	Cake
John	Happiness

I’ve seen enough different implementations of one-hot encoding in machine learning, written from first principles, that I thought I’d throw my own version into the ring. If you want a “big boy” solution, you can always appeal to scikit-learn’s OneHotEncoder, but the two pandas methods below are even simpler. The first one operates on a dataframe as a whole; the second one operates on a specific column if that’s all you care about.

def one_hot_df(df):
    return pd.get_dummies(df)

def one_hot_col(df, col):
    return df[col].str.get_dummies()

The difference between the two is that the generic pd.get_dummies keeps the original column name as a prefix, while the string version, str.get_dummies, uses the bare values as column names.

display(one_hot_df(df), one_hot_col(df, "FavFood"))

	FavFood_Cake	FavFood_Happiness	FavFood_Pizza	FavFood_Vegetables
Person
Sam	0	0	1	0
Ali	0	0	0	1
Jane	1	0	0	0
John	0	1	0	0

	Cake	Happiness	Pizza	Vegetables
Person
Sam	0	0	1	0
Ali	0	0	0	1
Jane	1	0	0	0
John	0	1	0	0
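One version note on the output above: since pandas 2.0, pd.get_dummies returns boolean columns by default, so the first table may show True/False instead of 1/0 (str.get_dummies still returns integers). A minimal sketch, assuming you want the integer form, is to pass dtype=int:

```python
import pandas as pd

df = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"],
    "FavFood": ["Pizza", "Vegetables", "Cake", "Happiness"]
}).set_index("Person")

# dtype=int forces 0/1 columns regardless of the installed pandas default
encoded = pd.get_dummies(df, dtype=int)
print(encoded)
```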

Amazing and super simple!
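For comparison, here is roughly what the scikit-learn route mentioned earlier could look like. This is only a sketch, assuming scikit-learn is installed; wrapping the result back into a dataframe is my own choice for readability, not something the encoder does for you.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"],
    "FavFood": ["Pizza", "Vegetables", "Cake", "Happiness"]
}).set_index("Person")

encoder = OneHotEncoder()
# fit_transform expects 2D input and returns a sparse matrix by default
encoded = encoder.fit_transform(df[["FavFood"]])

# categories_ holds one array of learned categories per input column
one_hot = pd.DataFrame(
    encoded.toarray().astype(int),
    index=df.index,
    columns=encoder.categories_[0],
)
print(one_hot)
```

The extra ceremony buys you a fitted encoder that can be reused on new data with the same columns, which get_dummies does not give you.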

Unnecessary complications

But I’ve also seen survey results with multi-choice questions, where each answer comes back as a single string of options joined by a separator. Like this:

df2 = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"], 
    "Nationality": ["Australia", "Australia", "USA", "USA/German"]
}).set_index("Person")
display(df2)

	Nationality
Person
Sam	Australia
Ali	Australia
Jane	USA
John	USA/German

So now the question is “What is the simplest way we can one-hot encode this data?” And the answer is to change almost nothing! The string version, str.get_dummies, already accepts a separator argument, sep (its default is "|"), so we only need to pass "/".

def hot_encode_col(df, col, sep="/"):
    return df[col].str.get_dummies(sep=sep)

display(hot_encode_col(df2, "Nationality"))

	Australia	German	USA
Person
Sam	1	0	0
Ali	1	0	0
Jane	0	0	1
John	0	1	1
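One thing to watch with real survey data: str.get_dummies splits on the separator but does not trim whitespace, so an answer like "USA / German" would produce columns with stray spaces in their names. A minimal sketch of one workaround follows; the function name hot_encode_col_stripped and the messy dataframe are made up for illustration, and it assumes the index values are unique, since rows are regrouped by index.

```python
import pandas as pd

def hot_encode_col_stripped(df, col, sep="/"):
    # One row per option, with surrounding whitespace removed
    tokens = df[col].str.split(sep).explode().str.strip()
    # Encode each option, then collapse back to one row per original index value
    return pd.get_dummies(tokens, dtype=int).groupby(level=0, sort=False).max()

messy = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"],
    "Nationality": ["Australia", " Australia", "USA", "USA / German"]
}).set_index("Person")

result = hot_encode_col_stripped(messy, "Nationality")
print(result)
```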

Very, very simple. And if, for some reason, get_dummies is not behaving nicely, or you really want multi-level columns, you can do it manually using melt and pivot_table:

def one_hot_melt_pivot(df):
    names = df.index.names
    melted = df.reset_index().melt(id_vars=names)
    return melted.pivot_table(index=names, 
                              columns=["variable", "value"], 
                              aggfunc=len, 
                              fill_value=0)

display(one_hot_melt_pivot(df))

variable	FavFood
value	Cake	Happiness	Pizza	Vegetables
Person
Ali	0	0	0	1
Jane	1	0	0	0
John	0	1	0	0
Sam	0	0	1	0

For your convenience, here’s the code in one block:

import pandas as pd

df = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"],
    "FavFood": ["Pizza", "Vegetables", "Cake", "Happiness"]
}).set_index("Person")

display(df)

def one_hot_df(df):
    return pd.get_dummies(df)

def one_hot_col(df, col):
    return df[col].str.get_dummies()

display(one_hot_df(df), one_hot_col(df, "FavFood"))

df2 = pd.DataFrame({
    "Person": ["Sam", "Ali", "Jane", "John"],
    "Nationality": ["Australia", "Australia", "USA", "USA/German"]
}).set_index("Person")

display(df2)

def hot_encode_col(df, col, sep="/"):
    return df[col].str.get_dummies(sep=sep)

display(hot_encode_col(df2, "Nationality"))

def one_hot_melt_pivot(df):
    names = df.index.names
    melted = df.reset_index().melt(id_vars=names)
    return melted.pivot_table(index=names,
                              columns=["variable", "value"],
                              aggfunc=len,
                              fill_value=0)

display(one_hot_melt_pivot(df))