Embeddings

In this lab, we are going to rewrite TinyLM once more. In the matrices lab, the model stored everything it learned in a count matrix: one row per context window, one column per word, a number counting how many times that word followed that context. This worked, but the model could only ever produce output for context windows it had seen before—if a new context came up, it stopped.

This lab introduces a fundamentally different approach. Instead of counting, the model will learn a dense representation of each word—a list of numbers called an embedding—and it will adjust those numbers over many passes through the training data until its predictions improve.

Training

Now train the model on our familiar tongue-twister corpus:

tlm train --filepath chuck.txt

You will see output like this:

Epoch 1/5  loss=2.3512
Epoch 2/5  loss=2.2375
Epoch 3/5  loss=2.1561
Epoch 4/5  loss=2.1002
Epoch 5/5  loss=2.0608
Model saved to model.json

Something new is happening here. In the count model, training meant reading through the corpus once and filling in a matrix. Here, training means making many passes through the corpus, measuring how wrong the predictions are, and gradually getting better. The number on each line is the loss: a measure of how wrong the model currently is, on average. Watch it decrease across epochs.
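Concretely, the loss is the average negative log of the probability the model assigned to each correct next word (we will see exactly where that formula comes from below). A small sketch of the arithmetic, using made-up probabilities rather than real model output:

```python
import numpy as np

# Hypothetical probabilities a model assigned to the correct
# next word for four training examples.
p_correct = np.array([0.09, 0.12, 0.10, 0.11])

# Average negative log-probability: lower means better predictions.
loss = -np.log(p_correct).mean()
print(round(loss, 4))
```

A model that assigned the correct word probability 1 every time would have loss 0; uniform guessing over 11 words gives about 2.4, which is roughly where the training run above starts.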

💻 Generate some text:

tlm generate --model model.json

Where is the model?

In the count model, we could inspect model in an interactive shell and read the count matrix directly—rows for contexts, columns for words, integers we could interpret at a glance. Where does the learning live in this model?

💻 Use --interact to open a Python shell after generating:

tlm generate --model model.json --interact

Look at the two main matrices:

>>> model.E.shape
(11, 32)
>>> model.W.shape
(32, 11)

The vocabulary has 11 words; the default embedding size is 32. E is an $ 11 \times 32 $ matrix and W is a $ 32 \times 11 $ matrix. Compare this to the count matrix from the previous lab, which was $ 20 \times 11 $: one row per unique context window seen in the corpus, one column per word. This model has no such limit: E has one row per word, regardless of how many context windows appeared.
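One way to see the difference: the number of learned values depends only on the vocabulary size and the embedding size, never on how many context windows occur in the corpus. A quick check, using the shapes from the session above:

```python
vocab_size, embed_size = 11, 32

# E maps each word to an embedding; W maps an averaged
# embedding back to one score per vocabulary word.
params_E = vocab_size * embed_size   # 11 x 32
params_W = embed_size * vocab_size   # 32 x 11
print(params_E + params_W)           # total learned parameters
```

Train on a corpus with ten times as many distinct context windows and these numbers do not change; only the values stored in E and W do.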

Look at one word's row:

>>> model.vocab
['a', 'all', 'chuck', 'could', 'how', 'if', 'it', 'much', 'the', 'wood', 'would']
>>> model.E[model.word_to_idx['chuck']]
array([ 0.043, -0.112,  0.087, -0.201,  0.034, ...])  # 32 numbers

This row—32 numbers—is the model's learned representation of the word "chuck." It is called an embedding. The model started with random numbers here and adjusted them over training to make better predictions.

Words in space

Think of each word's embedding as a set of coordinates that places the word at a point in a 32-dimensional space. (We can't draw 32 dimensions, but the math works the same as it does in 2 or 3.)

Words that appear in similar contexts will tend to end up near each other in this space, because the model learns to make similar predictions for them. The model is never told anything about what words mean—the structure emerges from statistics.

You can measure how close two words are using cosine similarity: a number between -1 and 1, where 1 means the two embedding vectors point in exactly the same direction, 0 means they are perpendicular (unrelated), and -1 means they point in opposite directions.

>>> import numpy as np
>>> chuck = model.E[model.word_to_idx['chuck']]
>>> wood = model.E[model.word_to_idx['wood']]
>>> np.dot(chuck, wood) / (np.linalg.norm(chuck) * np.linalg.norm(wood))

With a corpus this small there is not much to learn from, and the similarities will not be very meaningful. The same model, however, scales up.

💻 Train on a larger corpus and explore the resulting embeddings:

tlm train --gutenberg austen-emma.txt -t lower -t alpha --epochs 10 --output emma.json
tlm generate --model emma.json --interact
>>> import numpy as np
>>> def sim(a, b):
...     return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
>>> E = model.E
>>> wi = model.word_to_idx

Compute similarities between pairs of words that you expect to be related, and pairs you expect to be unrelated. Do the numbers match your intuition?

How prediction works

Look at the _forward method in tlm/model.py. When the model predicts the next word, it does three things:

  1. Look up and average: retrieve the embedding row E[i] for each word i in the context window, and average them into a single vector.
  2. Multiply: compute context_vector @ W to produce one score—called a logit—for each word in the vocabulary.
  3. Softmax: turn those scores into a probability distribution.

Steps 2 and 3 have the same structure as in the matrices lab—a matrix multiplication followed by normalization—but now W is learned rather than filled in by counting, and the context vector is a dense embedding rather than a one-hot selector.
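The three steps can be sketched in plain NumPy. This is a simplified stand-in for _forward, with small random matrices in place of the trained E and W:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size = 11, 32
E = rng.normal(size=(vocab_size, embed_size))
W = rng.normal(size=(embed_size, vocab_size))

context = [2, 9]   # indices of two context words, e.g. "chuck", "wood"

# 1. Look up and average: one embedding row per context word.
context_vector = E[context].mean(axis=0)

# 2. Multiply: one logit per vocabulary word.
logits = context_vector @ W

# 3. Softmax: exponentiate and normalize into probabilities.
exp = np.exp(logits - logits.max())   # subtract max for numerical stability
probs = exp / exp.sum()

print(probs.shape, round(float(probs.sum()), 6))
```

The result is always a length-11 vector of non-negative numbers summing to 1: a probability distribution over the vocabulary, ready to be sampled from.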

This structure—input, matrix multiplication, nonlinear transformation, output—is the basic building block of a neural network. This model has one such layer; the large language models you use today stack dozens or hundreds of them.

How training works

Look at the _step method in tlm/model.py. For each (context, target) pair:

  1. Run _forward to get a probability distribution.
  2. Check the probability the model assigned to the correct next word. The loss for this step is $ -\log(p_{\text{target}}) $: it is near zero when the model is confident and correct, and grows large when the model is wrong or uncertain.
  3. Compute how much each number in E and W contributed to the error. This is backpropagation: tracing the loss back through the math to figure out how to change each parameter to do better.
  4. Nudge each parameter by a small amount in the direction that would reduce the loss. This is gradient descent.

This loop—forward pass, compute loss, backpropagate, update—runs for every training example, and repeats every epoch. Over many epochs, E and W move toward values that produce better predictions.

Playing with embeddings

Training a language model produces embeddings as a by-product—the model needed them to make good predictions, but once training is done, the embedding matrix E can be pulled out and used on its own. It turns out to be remarkably useful.

Because words that appear in similar contexts end up with similar embedding vectors, the geometry of the embedding space encodes semantic relationships. Words that mean similar things cluster together; antonyms, which occur in very similar contexts, often end up surprisingly close as well. Relationships like "capital city of" or "past tense of" show up as consistent directions in the space. None of this was programmed in—it fell out of the statistics.

Researchers have trained embeddings on enormous corpora and released them for anyone to use. The gensim library makes it easy to download and work with these. The wp command in this project uses them.

💻 Try this in a Python shell (python or uv run python):

>>> import gensim.downloader
>>> model = gensim.downloader.load('glove-wiki-gigaword-100')

The first time this runs it downloads about 128 MB; after that it loads from cache in a few seconds. The result is a KeyedVectors object: essentially a large lookup table mapping words to their 100-dimensional embedding vectors.

>>> model['cat']
array([ 0.23088,  0.28283, -0.6142, ...])   # 100 numbers
>>> model.similarity('cat', 'dog')
0.8219
>>> model.similarity('cat', 'democracy')
0.0412

You can also ask for the words most similar to a given word:

>>> model.most_similar('king', topn=5)
[('queen', 0.7699), ('prince', 0.6840), ('royal', 0.6523), ('kings', 0.6510), ('throne', 0.6398)]

This returns a list of (word, similarity) pairs, ranked by cosine similarity to the query word. Explore the space a bit before moving on—try words from different domains, and try things you would not expect to work.
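The "consistent directions" mentioned earlier can be probed directly: most_similar also accepts positive and negative word lists, adding and subtracting the corresponding vectors before ranking. The classic demonstration adds the "woman" direction to "king" and subtracts "man":

```python
>>> model.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)
```

With these GloVe vectors, "queen" typically appears at or near the top of the result. Try building analogies of your own—country to capital, singular to plural—and see which directions the space has actually learned.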