TinyLM

Lab setup

First, make sure you have completed the initial setup.

If you are part of a course

  1. Open Terminal. Run the update command to make sure you have the latest code.
    $ mwc update
  2. Move to this lab's directory.
    $ cd ~/Desktop/making_with_code/llm/labs/lab_tinylm
    

If you are working on your own

  1. Move to your MWC directory.
    $ cd ~/Desktop/making_with_code
    
  2. Get a copy of this lab's materials.
    $ git clone https://git.makingwithcode.org/mwc/lab_tinylm.git

Setup

First, follow the setup instructions. Your username is your UB email name (e.g. chrisp), and your password is id + your UB ID number (e.g. id12345678).

Introduction

This lab introduces the idea of a language model, specifically a generative language model.

A model is a simplified representation of a complex system. Models can help us study phenomena which are hard to see, and they can let us try experiments which we could not, or would not, conduct in the real world. There is a common saying in science that "All models are wrong, but some are useful." Some models are conceptual--just a way of thinking about something. But computers have transformed how we do science by enabling computational models. We can set up a model and then have the computer run it!

So a language model is a simplified representation of language: it might represent what utterances mean, distinguish what can be understood from what doesn't make any sense, or try to produce language the way people do.

Over the next several labs, we are going to explore some foundational concepts of large language models, starting with very simple models and working our way up. In practice, you will not implement your own language models--the goal here is to give you a conceptual understanding of the goals and mechanisms of large language models, so you can use them more effectively and so that you can imagine and create tools with them.

A simple generative model

The task of a generative model is to predict the next word, given some context. For example, if an email message started with, "I am concerned about your...", you can imagine some of the ways the sentence could end--perhaps "attitude," "attendance," maybe "haircut." Probably not "toenails" or "antiquity." The task of a generative model is to estimate the most likely completion of the sentence.

This involves syntax (the rules of grammar) and semantics (the meanings of words). There have been lots of efforts to write out the rules of syntax and to manually articulate the meanings of words. However, over the last twenty years or so, it has become clear that a machine learning approach works best: if you want to know what words mean, don't look in the dictionary. Just look at how people talk.

Let's create a very simple generative model, trained on a corpus of text. The model will be configured with a context window size (let's start with 2). Then the model will look at every pair of words in the corpus and observe what comes next. Here's a simple corpus:

how much wood would a wood chuck chuck if a wood chuck could chuck wood
a wood chuck would chuck all the wood it could chuck if a wood chuck could chuck wood

What could follow "could chuck"? We observe two possibilities: "wood" (appearing twice) and "if" (appearing once). So every time the model's last two generated words are "could chuck", it will randomly select "wood" (2/3 probability) or "if" (1/3 probability). The model will stop if it arrives at a context window it has not seen before. Generation could go like this:

could chuck (wood, if)
could chuck wood (a)
could chuck wood a (wood)
could chuck wood a wood (chuck)
could chuck wood a wood chuck (chuck, could, would)
could chuck wood a wood chuck could (chuck)
could chuck wood a wood chuck could chuck (wood, if)
could chuck wood a wood chuck could chuck wood (a)

And so on.
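The procedure above can be sketched in a few lines of Python. This is not the actual tlm implementation--just a minimal illustration of the idea: build a table mapping each two-word context to the words observed after it, then generate by repeatedly sampling from that table until an unseen context comes up.

```python
import random
from collections import defaultdict

# The tongue-twister corpus from above, joined into one sequence of words.
CORPUS = (
    "how much wood would a wood chuck chuck if a wood chuck could chuck wood "
    "a wood chuck would chuck all the wood it could chuck if a wood chuck "
    "could chuck wood"
)

def train(text, window=2):
    """Map each context (a tuple of `window` words) to the words that follow it."""
    words = text.split()
    table = defaultdict(list)
    for i in range(len(words) - window):
        context = tuple(words[i:i + window])
        table[context].append(words[i + window])
    return table

def generate(table, context, max_words=20):
    """Extend `context` one word at a time; stop at an unseen context."""
    output = list(context)
    for _ in range(max_words):
        key = tuple(output[-len(context):])
        if key not in table:
            break
        # random.choice on the raw list gives "wood" 2/3 of the time and
        # "if" 1/3 of the time for the context ("could", "chuck"), because
        # each observation appears once in the list.
        output.append(random.choice(table[key]))
    return " ".join(output)

table = train(CORPUS)
print(sorted(set(table[("could", "chuck")])))  # ['if', 'wood']
print(generate(table, ("could", "chuck")))
```

Note that probabilities fall out of simple counting: storing every observation in a list (rather than a set) means more frequent successors are proportionally more likely to be chosen.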

💻 Run tlm generate --text chuck.txt to generate your own tongue-twister.

💻 Run tlm generate --help to see all available options. Try them out.