Matrices

In this lab, we are going to rewrite the TinyLM language model using a different data structure: matrices. The resulting model will work in the same way.

About matrices

You may be familiar with matrices if you have taken Algebra 2 or Precalculus. (Linear algebra, usually a college-level math class, is a deep-dive into matrices.) If not, don't worry--we'll introduce them here. Matrices are used to represent all kinds of things in mathematics, ranging from systems of equations to rotations of objects to connections within networks. But a matrix is just a table of numbers, like these:

$$ \underset{\textbf{a}}{\begin{bmatrix} 1 & 2 & 3 \end{bmatrix}} \quad \underset{\textbf{B}}{\begin{bmatrix} 6 & 1 \\ 1 & 4 \end{bmatrix}} \quad \underset{\textbf{C}}{\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}} \quad \underset{\textbf{D}}{\begin{bmatrix} 1 & 1 \\ 2 & 2 \\ 3 & 3 \end{bmatrix}} \quad \underset{\textbf{E}}{\begin{bmatrix} 1 & 2 \\ 1 & 2 \\ 1 & 2 \end{bmatrix}} $$

A few notes on how we talk about matrices. The size (or dimensions) of a matrix is written $ (m \times n) $ and pronounced "m by n". For example, the dimensions of $ \textbf{D} $ are $ 3 \times 2 $. Matrices are typically represented by bold capital letters. A matrix with 1 as one of its dimensions (like $ \textbf{a} $) can also be called a vector. Vectors are often represented by bold lowercase letters. Regular numbers, called scalars to distinguish them from vectors and matrices, are represented by lowercase letters like $ x $.

When you want to talk about a particular element of a matrix, subscripts indicate the row and column, in that order. (Mathematicians use 1 for the first element, unlike computer scientists. If you have a hunch that matrices are a bit like lists in Python, hold that thought.) So, for example, $ \textbf{D}_{12} $ is 1 and $ \textbf{D}_{21} $ is 2.
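Since matrices really are a lot like nested Python lists, here is a quick sketch of how those subscripts translate, keeping in mind that Python indices start at 0, not 1:

```python
# The matrix D from above, stored as a plain list of rows.
D = [[1, 1],
     [2, 2],
     [3, 3]]

# Math subscripts are 1-based; Python indices are 0-based,
# so D_12 (row 1, column 2) becomes D[0][1].
print(D[0][1])  # D_12 -> 1
print(D[1][0])  # D_21 -> 2
```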

Matrix addition and subtraction

Adding and subtracting matrices is easy! Two matrices can only be added or subtracted if they have the same dimensions. The result matrix will also have the same dimensions. Then just deal with each position as its own addition or subtraction problem, storing the result in the same position in the result matrix. Here's $ \textbf{D} + \textbf{E} $:

$ \begin{bmatrix}1 & 1 \\ 2 & 2 \\ 3 & 3 \end{bmatrix} + \begin{bmatrix}1 & 2 \\ 1 & 2 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix}2 & 3 \\ 3 & 4 \\ 4 & 5 \end{bmatrix} $

See how the top-left element of the result is $ 1 + 1 = 2 $? This is just six separate additions. Keep this theme in mind: matrices are all about doing many things in parallel.
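As a sketch in plain Python (we'll let numpy do this for us later in the lab), the six additions in $ \textbf{D} + \textbf{E} $ look like this:

```python
# D + E as six separate additions, using plain nested lists.
D = [[1, 1], [2, 2], [3, 3]]
E = [[1, 2], [1, 2], [1, 2]]

# One addition per position, stored in the same position of the result.
result = [[D[i][j] + E[i][j] for j in range(2)] for i in range(3)]
print(result)  # [[2, 3], [3, 4], [4, 5]]
```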

Matrix multiplication

When multiplying a matrix by a scalar, just multiply each individual element by the scalar. For example:

$ 2 \begin{bmatrix}1 & 1 \\ 2 & 2 \\ 3 & 3 \end{bmatrix} = \begin{bmatrix}2 & 2 \\ 4 & 4 \\ 6 & 6 \end{bmatrix} $

Multiplying matrices together is not quite as intuitive. You can only multiply two matrices if their "inner" dimensions match, for example $(4 \times 3)$ and $(3 \times 5)$. The two matching "inner" dimensions are squashed together and consumed during multiplication, and the dimensions of the resulting matrix are the remaining "outer" dimensions. So if you multiply matrices with dimensions $(4 \times 3)$ and $(3 \times 5)$, the result matrix will be $(4 \times 5)$. Unlike with regular numbers, the order matters for matrix multiplication. $ DB $ is allowed, since the "inner" dimensions match ($(3 \times 2)$ and $(2 \times 2)$), but $ BD $ is not allowed.

To multiply two matrices, take the first row of the first matrix and the first column of the second, multiply together the matching pairs of elements, and then add up these products. Write the sum in the top left element of the result matrix. Continue for the other rows of the first matrix, and for the other columns of the second matrix.

Here's an animation of $ DB $. Step through the process until you can follow the math.
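Once you can follow the recipe by hand, it may help to see it written as three nested loops. This is a plain-Python sketch of $ \textbf{DB} $; numpy will do the same thing for us in the next section:

```python
# Row-by-column matrix multiplication, following the recipe above:
# pair up elements, multiply them, and sum the products.
D = [[1, 1], [2, 2], [3, 3]]  # 3 x 2
B = [[6, 1], [1, 4]]          # 2 x 2

rows, inner, cols = 3, 2, 2   # the inner dimension (2) is consumed
result = [[0] * cols for _ in range(rows)]
for i in range(rows):          # each row of D...
    for j in range(cols):      # ...meets each column of B
        for k in range(inner): # multiply matching pairs and accumulate
            result[i][j] += D[i][k] * B[k][j]

print(result)  # [[7, 5], [14, 10], [21, 15]]
```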

Let the computer do the math

Matrices are a perfect fit for computers: lots of simple, boring math! numpy is Python's matrix library.

💻 Try the previous section's examples in Python. matrices.py defines all the matrices we have been working with.

% python -i matrices.py
>>> D + E
array([[2, 3],
       [3, 4],
       [4, 5]])
>>> D @ B
array([[ 7,  5],
       [14, 10],
       [21, 15]])
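numpy also handles scalar multiplication, and it checks dimensions for you. Here's a sketch that defines $ \textbf{B} $ and $ \textbf{D} $ directly, so it runs without matrices.py:

```python
import numpy as np

B = np.array([[6, 1], [1, 4]])          # 2 x 2
D = np.array([[1, 1], [2, 2], [3, 3]])  # 3 x 2

print(2 * D)    # scalar multiplication: every element doubled
print(D.shape)  # the dimensions, as a tuple: (3, 2)

# B @ D is not allowed: the "inner" dimensions (2 and 3) don't
# match, so numpy raises a ValueError.
try:
    B @ D
except ValueError:
    print("B @ D is not allowed")
```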

Rebuilding TinyLM

OK, it's time to rebuild TinyLM using matrices. In the previous model, we stored our language model in a dict:

{
  ('a', 'wood'): ['chuck', 'chuck', 'chuck', 'chuck'],
  ('all', 'the'): ['wood'],
  ('chuck', 'all'): ['the'],
  ('chuck', 'chuck'): ['if'],
  ('chuck', 'could'): ['chuck', 'chuck'],
  ('chuck', 'if'): ['a', 'a'],
  ('chuck', 'wood'): ['a'],
  ('chuck', 'would'): ['chuck'],
  ('could', 'chuck'): ['wood', 'if', 'wood'],
  ('how', 'much'): ['wood'],
  ('if', 'a'): ['wood', 'wood'],
  ('it', 'could'): ['chuck'],
  ('much', 'wood'): ['would'],
  ('the', 'wood'): ['it'],
  ('wood', 'a'): ['wood'],
  ('wood', 'chuck'): ['chuck', 'could', 'would', 'could'],
  ('wood', 'it'): ['could'],
  ('wood', 'would'): ['a'],
  ('would', 'a'): ['wood'],
  ('would', 'chuck'): ['all']
}

We're going to store the same data in a matrix instead, with one row for each possible output word (['a', 'all', 'chuck', 'could', 'how', 'if', 'it', 'much', 'the', 'wood', 'would']) and one column for each context we have seen (in the same sorted order as the dict keys above). Let's call this matrix $\textbf{W}$, for weights.

$ \textbf{W} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 2 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 4 & 0 & 0 & 0 & 2 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 2 & 1 & 2 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ \end{bmatrix} $

The first column of $\textbf{W}$ represents how many times each word was observed following "a wood": 0 occurrences of "a", 0 of "all", 4 of "chuck", and 0 for all the rest. Similarly, the second column tells us that "all the" was followed by "wood" the one time it appeared as context. Same data as before, just written down differently.

Building the vocabulary

The size of the matrix we need will depend on how many unique contexts we observe and on the size of the vocabulary--all the unique words in the corpus. So before our model creates its weights matrix, it will need to go through the corpus and calculate both.
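Here's a sketch of that pass. The corpus below is the woodchuck tongue twister tokenized into single words, which reproduces the counts in the dict above; the variable names are just illustrative, not from the lab's starter code:

```python
corpus = ("how much wood would a wood chuck chuck if a wood chuck "
          "could chuck wood a wood chuck would chuck all the wood "
          "it could chuck if a wood chuck could chuck wood").split()

# The vocabulary: every unique word, in sorted order.
vocab = sorted(set(corpus))

# The contexts: every unique pair of adjacent words, in sorted order
# (the same order as the dict keys above).
contexts = sorted(set(zip(corpus, corpus[1:])))

print(len(vocab))     # 11
print(len(contexts))  # 20
```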

Training

Training will work almost the same as in our previous model: we will walk through the whole corpus, stopping at each position to observe a context and a word. Then we will look up the index of the word (let's call this $i$) and the index of the context (let's call this $j$), and add one to $\textbf{W}_{ij}$.
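Here's a sketch of that training loop with numpy, reusing the vocab/contexts idea from the previous section (again, the names are illustrative, not the lab's starter code):

```python
import numpy as np

corpus = ("how much wood would a wood chuck chuck if a wood chuck "
          "could chuck wood a wood chuck would chuck all the wood "
          "it could chuck if a wood chuck could chuck wood").split()
vocab = sorted(set(corpus))
contexts = sorted(set(zip(corpus, corpus[1:])))

# One row per word, one column per context, every count starting at 0.
W = np.zeros((len(vocab), len(contexts)), dtype=int)

# Walk the corpus: at each position, observe a two-word context
# (a, b) and the word that follows it, then count the observation.
for a, b, word in zip(corpus, corpus[1:], corpus[2:]):
    i = vocab.index(word)       # row index of the observed word
    j = contexts.index((a, b))  # column index of the context
    W[i, j] += 1

# "a wood" was followed by "chuck" four times.
print(W[vocab.index('chuck'), contexts.index(('a', 'wood'))])  # 4
```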

Inference

Inference is the process of using the trained model to generate output; in our case, inference means choosing the next word to generate, based on the context. First, we will need to select the column of $\textbf{W}$ corresponding to the context. We can do this by representing each context as a one-hot vector: a vector of all zeros except for a single 1. For example, the fifth context ("chuck could") would be represented as:

$ \textbf{x} = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}^T $

The "T" means "transpose," or flip the rows and columns. $\textbf{x}$ needs to be a really tall vector, not a really wide vector, but it's easier to display this way.

Now the matrix multiplication $\textbf{Wx}$ selects just the column we care about from $\textbf{W}$. Here's an animation of the process, just using part of $\textbf{W}$.
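Here's a small numpy sketch of the same selection, using a made-up $3 \times 4$ stand-in for $\textbf{W}$ so the output is easy to read:

```python
import numpy as np

# A small stand-in for W: 3 words x 4 contexts (made-up counts).
W = np.array([[0, 2, 0, 1],
              [4, 0, 0, 0],
              [0, 0, 2, 0]])

# A one-hot column vector picking out context index 1.
x = np.zeros((4, 1), dtype=int)
x[1, 0] = 1

# W @ x is the selected column of W, as a (3 x 1) column vector.
print(W @ x)
```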

Now we have the counts; we can divide by the total to get a probability distribution over the possible next words. Since "chuck" appeared twice after "chuck could" and nothing else did, dividing by the sum of 2 would give "chuck" a probability of 1 and every other word a probability of 0.
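Finishing the sketch in numpy: divide by the total, then sample (np.random.choice handles the weighted pick). The counts here are the $\textbf{Wx}$ result for "chuck could", written out by hand:

```python
import numpy as np

vocab = ['a', 'all', 'chuck', 'could', 'how', 'if', 'it',
         'much', 'the', 'wood', 'would']

# Counts of each word observed after "chuck could": two "chuck"s.
counts = np.array([0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0])

# Divide by the total to turn counts into probabilities.
probs = counts / counts.sum()
print(probs[vocab.index('chuck')])  # 1.0

# Sample the next word according to those probabilities.
next_word = np.random.choice(vocab, p=probs)
print(next_word)  # always "chuck" here, since it has probability 1
```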