# Training a Neural Network Embedding Layer with Keras

26 Jul 2020, Samuel Hinton

Using Python, Keras and some colours to illustrate embeddings as simply as possible

This little write-up is designed to explain what **embeddings** are, and how we can train a naive version of an embedding to understand and visualise the process. We'll do this using a colour dataset, Keras and good old-fashioned matplotlib.

# Introduction

Let's start simple: *What is an embedding?*

An embedding is a way to represent a categorical feature (like a word) as a dense vector of learned parameters. Often these vectors are normalised to unit length, so each category maps to a point on a high-dimensional hypersphere.

A common way of encoding a categorical feature in machine learning is to one-hot encode it. However, for a large number of categories, this creates a very sparse matrix. Imagine encoding the names of babies born in 2020. You might have a million records, but with ten thousand possible names, that is a **very** big matrix filled with mostly zeros. Also, names like Mat and Matt are just as similar as Mat and Patricia when you one-hot encode them. That is, not similar at all.

Instead, if we can create a dense vector (aka a vector filled with numbers and not mostly zeros), we can represent Mat, Matt and Patricia as locations in higher-dimensional space, where the Mat and Matt vectors are similar to each other. This is what we are trying to do with embeddings: learn the translation from a categorical feature to a vector in higher-dimensional space.
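As a toy illustration of the difference (the dense vectors below are completely made up), compare distances under one-hot encoding and under a dense embedding:

```python
import numpy as np

names = ["Mat", "Matt", "Patricia"]

# One-hot: every name is exactly as far from every other name.
one_hot = np.eye(len(names))

# A made-up dense embedding that places similar names close together.
dense = np.array([
    [0.91, 0.10],   # Mat
    [0.89, 0.12],   # Matt
    [-0.40, 0.75],  # Patricia
])

def dist(a, b):
    return np.linalg.norm(a - b)

# In one-hot space, Mat is equally distant from Matt and Patricia...
print(dist(one_hot[0], one_hot[1]), dist(one_hot[0], one_hot[2]))
# ...but in the dense space, Mat and Matt sit right next to each other.
print(dist(dense[0], dense[1]), dist(dense[0], dense[2]))
```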

Most of the time when you use embeddings, you'll use ones already trained and available - you won't be training them yourself. However, to better understand what they are, we'll mock up a dataset based on colour combinations, and learn embeddings to turn a colour name into a location in both 2D and 3D space.

So, for the rest of this write up, the goal is to:

- Start with some colours.
- Create a data product similar to how Word2Vec and other embeddings are trained.
- Create a model with a 2D embedding layer and train it.
- Visualise the embedding layer.
- Do the same for a 3D normalised embedding just for fun.

Let's get cracking!

# The colour dataset

We'll use the colour dataset available from Kaggle. Let's load it in and view a few samples from it.

```
import pandas as pd
import numpy as np
df_original = pd.read_csv("encoding_colours/colours.csv")
df_original = df_original.dropna(subset=["Color Name"])
num_colours = df_original.shape[0]
print(f"We have {num_colours} colours")
df_original.sample(5)
```

```
We have 646 colours
```

| | Color Name | Credits | R;G;B Dec | RGB Hex | CSS Hex | BG/FG color sample |
|---|---|---|---|---|---|---|
| 398 | honeydew4 | X | 131;139;131 | 838B83 | NaN | ### SAMPLE ### |
| 126 | CadetBlue | X | 95;158;160 | 5F9EA0 | NaN | ### SAMPLE ### |
| 335 | SpringGreen4 | X | 0;139;69 | 008B45 | NaN | #©2006 walsh@njit.edu# |
| 310 | GreenYellow | X | 173;255;47 | ADFF2F | NaN | ### SAMPLE ### |
| 426 | HotPink4 | X | 139;58;98 | 8B3A62 | NaN | ### SAMPLE ### |

So after dropping NaNs, we have 646 different colour names. Let's throw out the columns we don't want, and split the R;G;B Dec column into separate columns (and then normalise them to 1).
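The cleaning code isn't shown, so here's a sketch of that step, assuming the column names from the sample above (with a two-row stand-in for `df_original` so the snippet runs on its own):

```python
import pandas as pd

# Two-row stand-in for the df_original loaded from the csv above
df_original = pd.DataFrame({
    "Color Name": ["honeydew4", "CadetBlue"],
    "R;G;B Dec": ["131;139;131", "95;158;160"],
})

# Keep only the name and the decimal RGB triple
df = df_original[["Color Name", "R;G;B Dec"]].copy()
df.columns = ["name", "rgb"]
# "131;139;131" -> three float columns, normalised to [0, 1]
df[["r", "g", "b"]] = df["rgb"].str.split(";", expand=True).astype(float) / 255
df = df.drop(columns="rgb")
print(df)
```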

| | name | r | g | b |
|---|---|---|---|---|
| 93 | grey82 | 0.819608 | 0.819608 | 0.819608 |
| 556 | NavajoWhite4 | 0.545098 | 0.474510 | 0.368627 |
| 440 | PaleVioletRed4 | 0.545098 | 0.278431 | 0.364706 |
| 401 | sienna4 | 0.545098 | 0.278431 | 0.149020 |
| 58 | grey47 | 0.470588 | 0.470588 | 0.470588 |
| 392 | salmon | 0.980392 | 0.501961 | 0.447059 |
| 626 | gold2 | 0.933333 | 0.788235 | 0.000000 |
| 240 | Free Speech Blue | 0.254902 | 0.337255 | 0.772549 |
| 212 | cyan2 | 0.000000 | 0.933333 | 0.933333 |
| 319 | SpringGreen | 0.000000 | 1.000000 | 0.498039 |

Now there's just one more issue - you don't pass strings or text into a neural network, you pass in numbers. So let's one-hot encode our colours to give them a numeric representation. We *could* use the Keras preprocessing `one_hot` here… but we've got this nice dataframe which already has an index… so we'll use that, and I'll make it explicit and add it as a column.

```
df["num"] = df.index
df.head(10)
```

| | name | r | g | b | num |
|---|---|---|---|---|---|
| 0 | Grey | 0.329412 | 0.329412 | 0.329412 | 0 |
| 1 | Grey, Silver | 0.752941 | 0.752941 | 0.752941 | 1 |
| 2 | grey | 0.745098 | 0.745098 | 0.745098 | 2 |
| 3 | LightGray | 0.827451 | 0.827451 | 0.827451 | 3 |
| 4 | LightSlateGrey | 0.466667 | 0.533333 | 0.600000 | 4 |
| 5 | SlateGray | 0.439216 | 0.501961 | 0.564706 | 5 |
| 6 | SlateGray1 | 0.776471 | 0.886275 | 1.000000 | 6 |
| 7 | SlateGray2 | 0.725490 | 0.827451 | 0.933333 | 7 |
| 8 | SlateGray3 | 0.623529 | 0.713725 | 0.803922 | 8 |
| 9 | SlateGray4 | 0.423529 | 0.482353 | 0.545098 | 9 |

At this point, we have a nice data product, but it doesn't look like how you might train embeddings for words.

Words don't have a well-defined mathematical representation to start with; instead, we simply see certain words next to (or close to) each other more often, and we learn from that. So let's start by generating pairs of colours to emulate pairs of sequential words, in what is now a colour-palette-like example.
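The pairing code is omitted, so here's one way it might be done (hypothetical - the original may have sampled differently), with a small stand-in frame so it runs on its own:

```python
import numpy as np
import pandas as pd

# Small stand-in for the `df` built above (name/r/g/b/num columns)
df = pd.DataFrame({"name": ["red", "green", "blue", "grey50"],
                   "r": [1.0, 0.0, 0.0, 0.5],
                   "g": [0.0, 1.0, 0.0, 0.5],
                   "b": [0.0, 0.0, 1.0, 0.5]})
df["num"] = df.index

# Draw 100k random pairs of colour indices, then attach each side's
# name and rgb values via a merge
rng = np.random.default_rng(0)
n_pairs = 100_000
c = pd.DataFrame({"num_x": rng.integers(0, df.shape[0], n_pairs),
                  "num_y": rng.integers(0, df.shape[0], n_pairs)})
c = (c.merge(df.add_suffix("_x"), on="num_x")
      .merge(df.add_suffix("_y"), on="num_y"))
print(c[["name_x", "name_y"]].head())
```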

| | name_x | name_y |
|---|---|---|
| 9838 | LightSalmon1 | bright gold |
| 44160 | DarkSeaGreen2 | goldenrod4 |
| 38804 | firebrick | CadetBlue |
| 52395 | DarkKhaki | grey76 |
| 80675 | DarkOliveGreen | OliveDrab2 |
| 28397 | Dark Turquoise | DarkGoldenrod |
| 35853 | aquamarine4 | orange |
| 67955 | LightSalmon3 | aquamarine3, MediumAquamarine |
| 42872 | grey32 | grey13 |
| 68885 | Neon Pink | coral1 |

Great! A hundred thousand colour combinations. However, this is still useless. When we train embeddings on words, we want positive **and** negative examples.

In a textual example, the pairs that come from words being next to each other are positive examples. We can generate negative examples by scrambling the document to remove meaning, and then taking pairs (of course, there are many formal methods of doing this that I'm not going to go into!).

In our case, we'll define a similarity metric ourselves, by just using the distance the two colours are from each other.
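The diff values in the table below are consistent with the mean of the squared per-channel differences; here's a sketch of that metric, with a one-row stand-in for the pair frame `c`:

```python
import pandas as pd

# One-row stand-in for the pair frame `c`, carrying both colours' rgb
c = pd.DataFrame({"r_x": [0.86], "g_x": [0.86], "b_x": [0.86],
                  "r_y": [0.91], "g_y": [0.91], "b_y": [0.91]})

# Mean squared difference across the three colour channels
c["diff"] = ((c["r_x"] - c["r_y"]) ** 2
             + (c["g_x"] - c["g_y"]) ** 2
             + (c["b_x"] - c["b_y"]) ** 2) / 3
print(c["diff"][0])
```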

| | name_x | num_x | name_y | num_y | diff |
|---|---|---|---|---|---|
| 0 | gainsboro | 559 | grey91 | 102 | 0.002215 |
| 1 | Goldenrod | 629 | OrangeRed4 | 435 | 0.266913 |
| 2 | SteelBlue4 | 192 | tan2 | 270 | 0.210832 |
| 3 | DarkSalmon | 359 | grey95 | 106 | 0.117621 |
| 4 | SlateGray4 | 9 | grey60 | 71 | 0.015999 |
| ... | ... | ... | ... | ... | ... |
| 99995 | grey65 | 76 | Very Dark Brown | 281 | 0.149199 |
| 99996 | SkyBlue2 | 180 | DarkOrchid2 | 479 | 0.105908 |
| 99997 | LemonChiffon2 | 592 | purple1 | 525 | 0.231757 |
| 99998 | cornsilk1 | 609 | grey74 | 85 | 0.045101 |
| 99999 | brown3 | 253 | LightYellow2 | 603 | 0.312813 |

100000 rows × 5 columns

Now that we have a difference, let's turn that into a set of positive and negative examples. I'm just going to compare the difference to a random number, and you can see this will generate a bunch of predicted values of 1 and 0.
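The labelling code is omitted; here's one scheme consistent with the stochastic labels in the table below (the 0.3 scale factor is my guess - tune it for whatever positive ratio you want):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in for the pair frame `c` with its diff column
c = pd.DataFrame({"diff": rng.uniform(0, 0.4, 1000)})

# A pair is "similar" (1) if its diff beats a scaled random draw, so
# small diffs are mostly 1s and large diffs mostly 0s, with some noise.
c["predict"] = (c["diff"] < 0.3 * rng.random(len(c))).astype(int)
print(f"{c['predict'].mean():.1%} positive values")
```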

```
22.5% positive values
```

| | name_x | num_x | name_y | num_y | diff | predict |
|---|---|---|---|---|---|---|
| 63834 | LightYellow1 | 602 | burlywood1 | 257 | 0.034330 | 0 |
| 12504 | Dusty Rose | 468 | VioletRed3 | 444 | 0.041143 | 1 |
| 48828 | salmon1 | 393 | Free Speech Green | 352 | 0.410821 | 0 |
| 32737 | orange3 | 390 | grey89 | 100 | 0.311926 | 0 |
| 97626 | red4 | 462 | grey44 | 55 | 0.132344 | 1 |
| 31753 | grey57 | 68 | grey78 | 89 | 0.044844 | 0 |
| 81431 | DarkSeaGreen4 | 296 | burlywood3 | 259 | 0.058239 | 1 |
| 32403 | grey15 | 26 | BlueViolet | 117 | 0.232572 | 0 |
| 62143 | DarkOliveGreen1 | 287 | SlateBlue4 | 187 | 0.286633 | 0 |
| 99415 | MistyRose2 | 428 | SkyBlue1 | 179 | 0.065016 | 1 |

It's common to have more negative values than positive, because you can generate essentially infinite negative examples (just keep sticking random words together for text), so I've made sure to have around a quarter positive values here. You can generate whatever ratio you like; it won't change how the rest of this example runs.

Now that we have a training dataset, let's make a Keras model!
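The `get_model` and `setseed` code doesn't appear above, so here is a sketch that matches the layer summary printed further below. The sigmoid activation and the adam/binary-crossentropy compile settings are my assumptions, and newer Keras versions may warn that `input_length` is deprecated:

```python
import random
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Lambda, Dense

def setseed(seed):
    """Seed every RNG in play so the run is reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

def get_model(num_colours=646, dims=2):
    model = Sequential([
        # A lookup table: one row per colour, each mapped to a 2D vector.
        # Two colours go in at once, hence input_length=2.
        Embedding(num_colours, dims, input_length=2),
        # Split the (2, dims) embedding output into the two colour
        # vectors and return the squared Euclidean distance between them.
        Lambda(lambda x: tf.reduce_sum(tf.square(x[:, 0, :] - x[:, 1, :]),
                                       axis=-1, keepdims=True), name="Dist"),
        # A single weight + bias; a negative weight lets the network
        # invert "small distance" into "high similarity".
        Dense(1, activation="sigmoid"),
    ])
    model.build(input_shape=(None, 2))
    model.compile(optimizer="adam", loss="binary_crossentropy")
    print(model.summary())
    return model
```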

Okay, let's step through this. In `get_model` we ask for a normal sequential model. The `Embedding` layer is a lookup table, which will have 646 rows (one for each colour), and will produce a 2D vector for each colour. Because we pass in two colours at a time, we set `input_length=2` - which means the output of the embedding layer will be two 2D vectors (aka a matrix of shape `(2, 2)`).

If you don't want to combine the inputs together like I am doing, you can instead build a model with two separate inputs that are combined later.

After the embedding, we have a `Lambda` layer, which simply takes that `2x2` matrix from the embedding output, splits it into `a` and `b` (the embedding vector for each of our two colours), and then returns the sum of the squared difference - which is the squared Euclidean distance.

The keen-eyed among you might remember that earlier we set a positive value in our predict column for similar vectors… and here similar vectors will produce a result close to zero!

It doesn't matter!

The reason we don't care is that the `Lambda` layer is connected to a single `Dense` layer, which will also train its weight and bias. If the weight is negative and the bias positive, it will act as an inverting mechanism. Neural networks really are magic!

If this still upsets you though, you can change out the `K.sum`, but a naive invert (one over the distance) will cause division by zero. To get around it, you could use Lidstone smoothing (aka add a small number to the denominator so it's never zero), but I digress.

Finally, to make sure I can reproduce these plots exactly, I added the `setseed` function.

Let's now instantiate the model and fit it. Oh, I'll also save out the embedding weights using a custom callback as we go, so that we can see their evolution over the epochs.

```
weights = []
save = LambdaCallback(on_epoch_end=lambda batch, logs:
weights.append(model.layers[0].get_weights()[0]))
X, y = c[["num_x", "num_y"]], c["predict"]
setseed(7)
model = get_model()
model.fit(X, y, epochs=500, verbose=0, batch_size=512, callbacks=[save]);
```

```
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 2, 2) 1292
_________________________________________________________________
Dist (Lambda) (None, 1) 0
_________________________________________________________________
dense (Dense) (None, 1) 2
=================================================================
Total params: 1,294
Trainable params: 1,294
Non-trainable params: 0
_________________________________________________________________
None
```

**Warning: nasty plotting code below.**

Now that we have a trained model, let's see how it performed. Below is an animation of the embeddings evolving.

We can then use this ugly plotting code to output PNGs and turn them into a video using ffmpeg.
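The plotting code itself is hidden, but a minimal sketch of what it might look like is below; the `weights` list would come from the callback above and `cs` would hold each colour's true RGB value (both stubbed with random data here so the snippet stands alone):

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

weights = [np.random.randn(646, 2) for _ in range(10)]  # stand-in history
cs = np.random.rand(646, 3)                             # stand-in colours

fig, ax = plt.subplots()
scat = ax.scatter(weights[0][:, 0], weights[0][:, 1], color=cs, s=20)

def update(i):
    # Move each point to its embedding position at epoch i
    scat.set_offsets(weights[i])
    ax.set_title(f"Epoch {i}")
    return scat,

ani = FuncAnimation(fig, update, frames=len(weights), interval=33)
```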

```
from IPython.display import Video
ani.save('encoding_colours/embed_2d.mp4', fps=30,
extra_args=['-vcodec', 'libx264', '-crf', '26'])
Video("encoding_colours/embed_2d.mp4")
# If you're running this in a Jupyter notebook, the below will do the
# same without saving it to file
# HTML(ani.to_html5_video())
```

This is beautiful. From the randomly initialised starting mess, structure quickly emerges. You can see the black-to-white gradient down the middle, with blue and red being split on either side. Green, limited by the 2D space, gets stuck in between.

And to give you something that isn't moving to look at better, here is the final embedding.

```
fig, ax = plt.subplots()
scat = ax.scatter(weights[-1][:, 0], weights[-1][:, 1], color=cs, s=20)
ax.set_xlabel("$x_0$"), ax.set_ylabel("$x_1$")
ax.set_title("2D Colour Embeddings");
```

And that's it! The embeddings are trained, and we could now save them out and use them in a future model where, for some reason, we need to ascribe a numeric value to colour names, such that similar colours are close to each other in that vector space.

# 3D Embeddings

Just for fun, let's move to 3D space. Many embeddings are normalised such that the magnitude of each embedded vector is one, so I was tempted to do that for this example… but then we'd be back to a 2D surface (the surface of the unit sphere). So no normalisation for this one, but if you wanted to know how to do it, I've redefined the model and commented out an `l2_normalize` layer, which would enforce unit vectors across arbitrary dimensions.
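A sketch of that redefined model, with the normalisation layer left commented out (again, the activation and compile settings are my assumptions):

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Lambda, Dense

def get_model_3d(num_colours=646, dims=3):
    model = Sequential([
        # Same lookup table as before, but now mapping to 3D vectors
        Embedding(num_colours, dims, input_length=2),
        # Uncomment to constrain every embedding to the unit hypersphere:
        # Lambda(lambda x: tf.math.l2_normalize(x, axis=-1)),
        Lambda(lambda x: tf.reduce_sum(tf.square(x[:, 0, :] - x[:, 1, :]),
                                       axis=-1, keepdims=True), name="Dist"),
        Dense(1, activation="sigmoid"),
    ])
    model.build(input_shape=(None, 2))
    model.compile(optimizer="adam", loss="binary_crossentropy")
    print(model.summary())
    return model
```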

```
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 2, 3) 1938
_________________________________________________________________
Dist (Lambda) (None, 1) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 2
=================================================================
Total params: 1,940
Trainable params: 1,940
Non-trainable params: 0
_________________________________________________________________
None
```

And because the animation code for 3D plots is even uglier than the 2D version, I've hidden it away. But here are the trained 3D embeddings!

And with the beauty of one more dimension, we can see that the RGB colour clusters can start to be separated independently.

# Summary

Hopefully in this small write up you've seen how we can train embeddings to map categorical features to a coordinate in vector space. In our specific example, we started with a colour dataset, and generated pairs of colours with a label of either 1 for "similar" or 0 for "dissimilar". We then managed to recover the relationship between colours by training an embedding only using this information.

If you want to read more on how to use embeddings in Keras, Jason Brownlee has some great write ups:

- What are word embeddings for text?
- How to Use Word Embeddings Layers for Deep Learning with Keras
- How to Develop Word Embeddings in Python with Gensim

Have fun!
