Local Models
Lab setup
First, make sure you have completed the initial setup.
If you are part of a course
Open Terminal. Run the update command to make sure you have the latest code.
$ mwc update
Move to this lab's directory.
$ cd ~/Desktop/making_with_code/llm/labs/lab_local_models
If you are working on your own
Move to your MWC directory.
$ cd ~/Desktop/making_with_code
Get a copy of this lab's materials.
$ git clone https://git.makingwithcode.org/mwc/lab_local_models.git
In this lab, you will run a large language model directly on your own computer, write Python code to interact with it, and give it tools it can use to fetch real-world information. Along the way, you will learn how to browse the landscape of available models and understand the techniques researchers use to make models small enough to run on consumer hardware.
By the end of this lab, you will be able to:
- Host an LLM locally and call it from your own Python code.
- Give an LLM Python functions as tools, so it can look things up on your behalf.
- Browse available models and judge whether one will run on your device.
- Explain quantization and distillation, the two main ways model weights are compressed.
Running a model on your computer
Until now, every LLM you have used has run on someone else's server. When you send a message to ChatGPT or Claude, your text travels across the internet, gets processed by a powerful cluster of machines, and the response travels back to you. Today we will do the same thing locally: your computer will load the model's weights into memory and run the computation itself.
We will use Ollama, which makes it easy to download and run open-weight
models. Ollama runs as a background service on your computer; the ollama Python library
lets your code talk to it.
💻 Open a new terminal window and start Ollama:
$ ollama serve
Leave this running in the background for the rest of the lab.
💻 Pull the model we will use today.
$ ollama pull llama3.2:3b
llama3.2:3b is a 3-billion-parameter model from Meta that fits in about 2 GB of memory.
👁
Open goodmorning.py in your editor.
import ollama
MODEL = "llama3.2:3b"
SYSTEM_PROMPT = """
You are a cheerful morning assistant helping a high school student get ready for school.
Keep responses short and encouraging.
"""
def send_message(messages):
    """Sends a conversation to the model, streaming its reply to the screen."""
    print("\nAssistant: ", end="", flush=True)
    reply = ""
    for chunk in ollama.chat(model=MODEL, messages=messages, stream=True):
        piece = chunk.message.content
        print(piece, end="", flush=True)
        reply += piece
    print()
    return reply

def run():
    """Runs a chat loop, keeping track of the conversation."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Good morning!"},
    ]
    print("(Type 'quit' to exit.)")
    greeting = send_message(messages)
    messages.append({"role": "assistant", "content": greeting})
    while True:
        user_input = input("\nYou: ")
        if user_input.strip().lower() == "quit":
            break
        messages.append({"role": "user", "content": user_input})
        reply = send_message(messages)
        messages.append({"role": "assistant", "content": reply})

if __name__ == "__main__":
    run()
Let's walk through what is happening:
- MODEL is a string naming the model we want. When you call ollama.chat, Ollama looks up this model on your computer.
- messages is a list that grows with each turn of the conversation. Each entry is a dict with a "role" and "content". The role is "user" for your messages and "assistant" for the model's replies.
- send_message passes the full conversation history to the model. Rather than waiting for the complete response, it uses stream=True to receive the reply one small chunk at a time, printing each piece immediately as it arrives. This is why you see the response appear word by word instead of all at once.
- run keeps track of the whole conversation so the model can refer back to what was said earlier.
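The streaming pattern is easy to see in isolation. Here is a toy sketch with a plain Python generator standing in for ollama.chat(stream=True); the chunk text is made up for illustration, but the accumulate-and-print loop is the same shape as in send_message.

```python
def fake_stream():
    """Yields a reply in small pieces, the way a streaming API does."""
    for piece in ["Good ", "morning! ", "Ready ", "for ", "school?"]:
        yield piece

reply = ""
for piece in fake_stream():
    print(piece, end="", flush=True)  # show each piece as soon as it arrives
    reply += piece                    # accumulate the full reply for the history
print()
```

The loop both displays the reply incrementally and builds up the complete string, which run then appends to the conversation history.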
💻
Run goodmorning.py and have a brief conversation with it. (You can also
just run chat.)
$ python goodmorning.py
(Type 'quit' to exit.)
Notice that the model has no memory between separate runs of the script—it is stateless,
meaning it holds no information between calls. The memory lives entirely in the messages
list your code maintains. This is true of ChatGPT and every other LLM: when a service
appears to remember you across sessions, the application saved the conversation and passed
it back in. The model was just shown the history again.
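To make this concrete, here is a toy sketch in which a stand-in "model" can only see the messages list it is handed. The fake_model function is invented for illustration, but it shares the one property that matters: its only "memory" is the history you pass in.

```python
def fake_model(messages):
    """A stand-in model: it can only 'remember' what is in messages."""
    user_turns = [m for m in messages if m["role"] == "user"]
    return f"You have told me {len(user_turns)} thing(s) so far."

# One conversation: the history grows, so the model appears to remember.
messages = [{"role": "user", "content": "My name is Ada."}]
first = fake_model(messages)
messages.append({"role": "assistant", "content": first})
messages.append({"role": "user", "content": "I like Python."})
second = fake_model(messages)

# A fresh conversation: with a new history, everything is forgotten.
fresh = fake_model([{"role": "user", "content": "What's my name?"}])
```

Delete the list, and the "memory" is gone; pass the list back in, and it returns.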
The system prompt
Every message in the conversation has a role. You have seen "user" and "assistant".
There is a third role: "system". A system message is not part of the back-and-forth; instead,
it sets the stage before the conversation begins. The model treats the system prompt as standing
instructions it should follow throughout the conversation.
👁
Notice the SYSTEM_PROMPT variable and how it is inserted as the first
message in run.
messages = [{"role": "system", "content": SYSTEM_PROMPT}]
Because the system prompt appears before anything else, the model treats it as context for everything that follows. You can use it to define a persona, supply background information, or specify rules the model should follow. System prompts can be very long—it is common to write several paragraphs of instructions and background context.
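One way to see that the system prompt is just an ordinary first message is to build conversations with different personas from the same template. This helper is a sketch for illustration, not part of goodmorning.py:

```python
def new_conversation(system_prompt, first_user_message):
    """Starts a conversation list with a system prompt in the first slot."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": first_user_message},
    ]

pirate = new_conversation("You are a pirate. Answer in pirate speak.", "Good morning!")
tutor = new_conversation("You are a patient math tutor.", "Good morning!")
```

Only the system message differs between the two conversations; everything else about the chat loop stays the same.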
Tools
The model you just built is useful, but it has a significant limitation: it does not know anything about the real world right now. It cannot tell you what day it is or what the weather is like outside, because those things change and the model's knowledge was frozen when it was trained.
We can fix this by giving the model tools: Python functions it is allowed to call. When you send a message with tools attached, the model can decide to call one of those functions before it responds. Your code runs the function, sends the result back to the model, and the model uses that result to finish its reply.
👁
Open tools.py. It defines three functions the model will be able to use.
from datetime import datetime
import geocoder
import os
import requests
import subprocess
def day_of_week() -> str:
    """Returns the current day of the week.

    Returns:
        The name of the current day, such as "Monday" or "Friday".
    """
    return datetime.now().strftime("%A")

def local_weather() -> str:
    """Returns a brief description of today's weather at the user's location.

    Uses the device's IP address to estimate location, then fetches the
    National Weather Service forecast for that location.

    Returns:
        A short text description of the current weather conditions,
        or an error message if the weather cannot be fetched.
    """
    ...

def read_file(path: str) -> tuple[bool, str]:
    """Reads a file and returns its contents as plain text.

    Supports many file formats including PDF, Word documents, HTML, and Markdown,
    using pdftotext (for PDFs) and pandoc (for everything else). If the required
    program is not installed or conversion fails, returns a description of the error.

    Args:
        path: The path to the file to read.

    Returns:
        A tuple of (success, text). When successful, success is True and text
        contains the plain-text contents of the file. When unsuccessful,
        success is False and text explains what went wrong.
    """
    ...
read_file is not used in this lab, but it is a model for how to write tools that take arguments and return more than one value. You could use it to build an agent that interacts with files on your computer: for example, one that reads your writing and gives you feedback.
Before going further, notice something about these functions that looks different from Python you have written before: each one has type hints on its arguments and return value, and a detailed docstring describing what it does and what it returns. This is not just good style. Ollama uses the type hints and docstrings to describe each tool to the model, so the model knows what the tool does and when to call it.
Now let's update goodmorning.py to use day_of_week as a tool. The flow is slightly more
complex than before, because we need to handle the case where the model decides to call the
function.
💻
At the top of goodmorning.py, after import ollama, add:
from tools import day_of_week
TOOLS = {
"day_of_week": day_of_week,
}
TOOLS is a dict that maps each tool's name to the function itself. This makes it easy to
look up the right function when the model asks for one by name.
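The name-to-function lookup in TOOLS is ordinary Python dictionary dispatch. Here is a minimal sketch with a stand-in function (the real day_of_week lives in tools.py and returns the actual day):

```python
def day_of_week():
    """Stand-in for the real tool; always returns the same day for the demo."""
    return "Friday"

TOOLS = {"day_of_week": day_of_week}

# When the model asks for a tool by name, we look up the function and call it.
requested_name = "day_of_week"
result = TOOLS[requested_name]()
```

Because functions are ordinary values in Python, storing them in a dict and calling them later works just like looking up any other value.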
💻
Add a new function run_tools, and replace send_message with
this updated version:
def run_tools(tool_calls):
    """Runs any tool calls requested by the model and returns results."""
    results = []
    for call in tool_calls:
        func = TOOLS[call.function.name]
        result = func()
        results.append({"role": "tool", "content": str(result)})
    return results

def send_message(messages):
    """Sends a conversation to the model and streams its reply to the screen."""
    response = ollama.chat(model=MODEL, messages=messages, tools=list(TOOLS.values()))
    if response.message.tool_calls:
        messages.append(response.message)
        messages.extend(run_tools(response.message.tool_calls))
    print("\nAssistant: ", end="", flush=True)
    reply = ""
    for chunk in ollama.chat(model=MODEL, messages=messages, stream=True):
        piece = chunk.message.content
        print(piece, end="", flush=True)
        reply += piece
    print()
    return reply
Let's trace through what happens when the model decides to use a tool:
- ollama.chat is called with tools=list(TOOLS.values()). Ollama describes each tool to the model using the type hints and docstrings.
- The model replies with a request to call one of the tools instead of answering directly.
- run_tools looks up the function by name in the TOOLS dict and calls it.
- The result is added to messages with the role "tool".
- ollama.chat is called again with stream=True. Now the model has the tool's output and streams its final response word by word to the screen.
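You can exercise run_tools without running a model by faking the tool-call objects. In this sketch, SimpleNamespace stands in for Ollama's response objects, mimicking the call.function.name attribute path; the run_tools body matches the code above, and the stand-in tool always returns "Friday" so the result is predictable.

```python
from types import SimpleNamespace

def day_of_week():
    return "Friday"  # stand-in tool for the demo

TOOLS = {"day_of_week": day_of_week}

def run_tools(tool_calls):
    """Runs any tool calls requested by the model and returns results."""
    results = []
    for call in tool_calls:
        func = TOOLS[call.function.name]
        result = func()
        results.append({"role": "tool", "content": str(result)})
    return results

# Fake a tool-call request with the same shape as Ollama's: call.function.name
fake_call = SimpleNamespace(function=SimpleNamespace(name="day_of_week"))
results = run_tools([fake_call])
```

Testing the dispatch logic this way, separately from the model, makes it much easier to debug when a tool call misbehaves.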
💻
Update SYSTEM_PROMPT to tell the model to check the day before responding:
SYSTEM_PROMPT = """
You are a cheerful morning assistant helping a high school student get ready for school.
Keep responses short and encouraging.
When starting the conversation, use the day_of_week tool and greet the student with today's day.
"""
💻
Run the updated goodmorning.py and start a conversation. The model
should greet you with today's day before waiting for your input.
Choosing a model
So far we have used llama3.2:3b because it is small enough to run on most computers. But there
are thousands of open-weight models available, ranging from tiny models designed for embedded
devices to enormous models that rival the best commercial systems. When you are picking a
model for a project, two questions matter most.
What is this model good at?
A model's strengths depend on what data it was trained on and what task it was optimised for. A model trained mostly on scientific papers will write fluently about chemistry but may struggle with casual conversation. A model fine-tuned on code will write better Python than a general-purpose model of the same size.
Some examples of how training shapes capability:
- General chat models like llama3, mistral, and gemma are trained on broad web text and fine-tuned to follow instructions conversationally. They are good starting points for assistants.
- Code models like deepseek-coder and codellama are trained on large repositories of source code. They are much better at writing and debugging programs than general models.
- Reasoning models like qwen and phi are trained with an emphasis on multi-step problem solving and are good at mathematics and logic tasks.
- Embedding models like nomic-embed-text do not generate text at all: they turn text into vectors that can be compared for similarity (as you explored in the embeddings lab).
- Domain-specific models are fine-tuned on medical records, legal documents, or other specialised corpora, and perform well within their narrow domain.
The model card on Hugging Face or the description on the Ollama library page will tell you what a model was designed for and give benchmark results comparing it to other models.
How much memory does it require?
Every parameter in a neural network is a number. By default, that number is stored as a 32-bit floating-point value, taking 4 bytes of memory. A model with 7 billion parameters therefore needs about 28 GB of memory just to hold the weights—more than most laptops have.
Two techniques allow models to run on more modest hardware.
Quantization reduces memory by storing each parameter in fewer bits. The most common format is 4-bit quantization: each weight is rounded to the nearest value in a small set of 16 possible values and stored in half a byte. This cuts memory use by roughly 8× with only a modest drop in quality for most tasks. A 7-billion-parameter model in 4-bit quantization needs about 4–5 GB of memory, which fits in many consumer laptops.
On Hugging Face and in Ollama you will often see quantization levels like Q4_K_M or Q8_0
in model names. The number refers to how many bits are used per weight; higher numbers
preserve more precision but use more memory.
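The arithmetic behind these sizes is straightforward. This sketch estimates the memory needed just to hold the weights at different bit widths; it ignores overhead such as the context window, so real usage is somewhat higher.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Estimates GB needed to store the weights alone."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

fp32 = weight_memory_gb(7, 32)  # full precision: 4 bytes per weight
q8 = weight_memory_gb(7, 8)     # 8-bit quantization: 1 byte per weight
q4 = weight_memory_gb(7, 4)     # 4-bit quantization: half a byte per weight
```

This matches the 28 GB figure above for a 7-billion-parameter model at full precision, and shows why 4-bit quantization brings the weights down to about 3.5 GB before overhead.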
Distillation is a different approach. Instead of compressing an existing large model,
you train a new, smaller model to mimic the outputs of the large one. The large model acts
as a teacher; the small model is the student. Distilled models are genuinely smaller
architectures—not just a compressed version of the original. Many popular small models,
including some in the phi and qwen families, use distillation.
A useful rule of thumb: a 4-bit quantized model needs roughly 0.6–0.7 GB of RAM per billion parameters once you include overhead for the context window. A 7B model needs about 4–5 GB; a 13B model needs about 8–9 GB.
💻 Check how much memory your system has.
On macOS:
$ system_profiler SPHardwareDataType | grep Memory
Memory: 16 GB
On Linux (including Raspberry Pi):
$ free -h
total used free
Mem: 7.6G 2.1G 5.5G
A GPU can also make a large difference. GPUs are very well suited to the matrix arithmetic that drives LLMs and can run the same computation orders of magnitude faster than a CPU. If your computer has a discrete GPU, Ollama will use it automatically. Apple Silicon Macs use unified memory shared between the CPU and GPU, which is why they run local models well for their price.
Browsing models
Hugging Face hosts tens of thousands of open-weight models. You can filter by task, language, license, and size. Pay close attention to the model card, which describes what the model was trained on, what it excels at, and its limitations.
Ollama's curated library at ollama.com/library is a good starting point—every model there is already packaged for easy use, and each page lists available sizes and quantization levels.