Local Models
Lab setup
First, make sure you have completed the initial setup.
If you are part of a course
Open Terminal. Run the update command to make sure you have the latest code.
$ mwc update
Move to this lab's directory.
$ cd ~/Desktop/making_with_code/llm/labs/lab_local_models
If you are working on your own
Move to your MWC directory.
$ cd ~/Desktop/making_with_code
Get a copy of this lab's materials.
$ git clone https://git.makingwithcode.org/mwc/lab_local_models.git
In this lab, you will run a large language model directly on your own computer, write Python code to interact with it, and give it tools it can use to fetch real-world information. Along the way, you will learn how to browse the landscape of available models and understand the techniques researchers use to make models small enough to run on consumer hardware.
By the end of this lab, you will be able to:
- Host an LLM locally and call it from your own Python code.
- Give an LLM Python functions as tools, so it can look things up on your behalf.
- Browse available models and judge whether one will run on your device.
- Explain quantization and distillation, the two main ways model weights are compressed.
Running a model on your computer
Until now, every LLM you have used has run on someone else's server. When you send a message to ChatGPT or Claude, your text travels across the internet, gets processed by a powerful cluster of machines, and the response travels back to you. Today we will do the same thing locally: your computer will load the model's weights into memory and run the computation itself.
We will use Ollama, which makes it easy to download and run open-weight
models. Ollama runs as a background service on your computer; the ollama Python library
lets your code talk to it.
💻 Open a new terminal window and start Ollama:
$ ollama serve
Leave this running in the background for the rest of the lab.
💻 Pull the model we will use today.
$ ollama pull llama3.2:3b
llama3.2:3b is a 3-billion-parameter model from Meta that fits in about 2 GB of memory.
👁
Open goodmorning.py in your editor.
import ollama
MODEL = "llama3.2:3b"
SYSTEM_PROMPT = """
You are a cheerful morning assistant helping a high school student get ready for school.
Keep responses short and encouraging.
"""
def send_message(messages):
    """Sends a conversation to the model, streaming its reply to the screen."""
    print("\nAssistant: ", end="", flush=True)
    reply = ""
    for chunk in ollama.chat(model=MODEL, messages=messages, stream=True):
        piece = chunk.message.content
        print(piece, end="", flush=True)
        reply += piece
    print()
    return reply

def run():
    """Runs a chat loop, keeping track of the conversation."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Good morning!"},
    ]
    print("(Type 'quit' to exit.)")
    greeting = send_message(messages)
    messages.append({"role": "assistant", "content": greeting})
    while True:
        user_input = input("\nYou: ")
        if user_input.strip().lower() == "quit":
            break
        messages.append({"role": "user", "content": user_input})
        reply = send_message(messages)
        messages.append({"role": "assistant", "content": reply})

if __name__ == "__main__":
    run()
Let's walk through what is happening:
- MODEL is a string naming the model we want. When you call ollama.chat, Ollama looks up this model on your computer.
- messages is a list that grows with each turn of the conversation. Each entry is a dict with a "role" and "content". The role is "user" for your messages and "assistant" for the model's replies.
- send_message passes the full conversation history to the model. Rather than waiting for the complete response, it uses stream=True to receive the reply one small chunk at a time, printing each piece immediately as it arrives. This is why you see the response appear word by word instead of all at once.
- run keeps track of the whole conversation so the model can refer back to what was said earlier.
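The streaming pattern is easy to see in isolation. Here is a toy sketch with a plain Python generator standing in for ollama.chat(stream=True); the chunk text is made up for illustration, but the accumulate-and-print loop is the same shape as in send_message.

```python
def fake_stream():
    """Yields a reply in small pieces, the way a streaming API does."""
    for piece in ["Good ", "morning! ", "Ready ", "for ", "school?"]:
        yield piece

reply = ""
for piece in fake_stream():
    print(piece, end="", flush=True)  # show each piece as soon as it arrives
    reply += piece                    # accumulate the full reply for the history
print()
```

The loop both displays the reply incrementally and builds up the complete string, which run then appends to the conversation history.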
💻
Run goodmorning.py and have a brief conversation with it. (You can also
just run chat.)
$ python goodmorning.py
(Type 'quit' to exit.)
Notice that the model has no memory between separate runs of the script—it is stateless,
meaning it holds no information between calls. The memory lives entirely in the messages
list your code maintains. This is true of ChatGPT and every other LLM: when a service
appears to remember you across sessions, the application saved the conversation and passed
it back in. The model was just shown the history again.
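To make this concrete, here is a toy sketch in which a stand-in "model" can only see the messages list it is handed. The fake_model function is invented for illustration, but it shares the one property that matters: its only "memory" is the history you pass in.

```python
def fake_model(messages):
    """A stand-in model: it can only 'remember' what is in messages."""
    user_turns = [m for m in messages if m["role"] == "user"]
    return f"You have told me {len(user_turns)} thing(s) so far."

# One conversation: the history grows, so the model appears to remember.
messages = [{"role": "user", "content": "My name is Ada."}]
first = fake_model(messages)
messages.append({"role": "assistant", "content": first})
messages.append({"role": "user", "content": "I like Python."})
second = fake_model(messages)

# A fresh conversation: with a new history, everything is forgotten.
fresh = fake_model([{"role": "user", "content": "What's my name?"}])
```

Delete the list, and the "memory" is gone; pass the list back in, and it returns.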
The system prompt
Every message in the conversation has a role. You have seen "user" and "assistant".
There is a third role: "system". A system message is not part of the back-and-forth; instead,
it sets the stage before the conversation begins. The model treats the system prompt as standing
instructions it should follow throughout the conversation.
👁
Notice the SYSTEM_PROMPT variable and how it is inserted as the first
message in run.
messages = [{"role": "system", "content": SYSTEM_PROMPT}]
Because the system prompt appears before anything else, the model treats it as context for everything that follows. You can use it to define a persona, supply background information, or specify rules the model should follow. System prompts can be very long—it is common to write several paragraphs of instructions and background context.
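One way to see that the system prompt is just an ordinary first message is to build conversations with different personas from the same template. This helper is a sketch for illustration, not part of goodmorning.py:

```python
def new_conversation(system_prompt, first_user_message):
    """Starts a conversation list with a system prompt in the first slot."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": first_user_message},
    ]

pirate = new_conversation("You are a pirate. Answer in pirate speak.", "Good morning!")
tutor = new_conversation("You are a patient math tutor.", "Good morning!")
```

Only the system message differs between the two conversations; everything else about the chat loop stays the same.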
Tools
The model you just built is useful, but it has a significant limitation: it does not know anything about the real world right now. It cannot tell you what day it is or what the weather is like outside, because those things change and the model's knowledge was frozen when it was trained.
We can fix this by giving the model tools: Python functions it is allowed to call. When you send a message with tools attached, the model can decide to call one of those functions before it responds. Your code runs the function, sends the result back to the model, and the model uses that result to finish its reply.
👁
Open tools.py. It defines three functions the model will be able to use.
from datetime import datetime
import geocoder
import os
import requests
import subprocess
def day_of_week() -> str:
    """Returns the current day of the week.

    Returns:
        The name of the current day, such as "Monday" or "Friday".
    """
    return datetime.now().strftime("%A")

def local_weather() -> str:
    """Returns a brief description of today's weather at the user's location.

    Uses the device's IP address to estimate location, then fetches the
    National Weather Service forecast for that location.

    Returns:
        A short text description of the current weather conditions,
        or an error message if the weather cannot be fetched.
    """
    ...

def read_file(path: str) -> tuple[bool, str]:
    """Reads a file and returns its contents as plain text.

    Supports many file formats including PDF, Word documents, HTML, and Markdown,
    using pdftotext (for PDFs) and pandoc (for everything else). If the required
    program is not installed or conversion fails, returns a description of the error.

    Args:
        path: The path to the file to read.

    Returns:
        A tuple of (success, text). When successful, success is True and text
        contains the plain-text contents of the file. When unsuccessful,
        success is False and text explains what went wrong.
    """
    ...
read_file is not used in this lab, but it is a model for how to write tools that take arguments and return more than one value. You could use it to build an agent that interacts with files on your computer: for example, one that reads your writing and gives you feedback.
Before going further, notice something about these functions that looks different from Python you have written before: each one has type hints on its arguments and return value, and a detailed docstring describing what it does and what it returns. This is not just good style. Ollama uses the type hints and docstrings to describe each tool to the model, so the model knows what the tool does and when to call it.
Now let's update goodmorning.py to use day_of_week as a tool. The flow is slightly more
complex than before, because we need to handle the case where the model decides to call the
function.
💻
At the top of goodmorning.py, after import ollama, add:
from tools import day_of_week
TOOLS = {
"day_of_week": day_of_week,
}
TOOLS is a dict that maps each tool's name to the function itself. This makes it easy to
look up the right function when the model asks for one by name.
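The name-to-function lookup in TOOLS is ordinary Python dictionary dispatch. Here is a minimal sketch with a stand-in function (the real day_of_week lives in tools.py and returns the actual day):

```python
def day_of_week():
    """Stand-in for the real tool; always returns the same day for the demo."""
    return "Friday"

TOOLS = {"day_of_week": day_of_week}

# When the model asks for a tool by name, we look up the function and call it.
requested_name = "day_of_week"
result = TOOLS[requested_name]()
```

Because functions are ordinary values in Python, storing them in a dict and calling them later works just like looking up any other value.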
💻
Add a new function run_tools, and replace send_message with
this updated version:
def run_tools(tool_calls):
    """Runs any tool calls requested by the model and returns results."""
    results = []
    for call in tool_calls:
        func = TOOLS[call.function.name]
        result = func()
        results.append({"role": "tool", "content": str(result)})
    return results

def send_message(messages):
    """Sends a conversation to the model and streams its reply to the screen."""
    response = ollama.chat(model=MODEL, messages=messages, tools=list(TOOLS.values()))
    if response.message.tool_calls:
        messages.append(response.message)
        messages.extend(run_tools(response.message.tool_calls))
    print("\nAssistant: ", end="", flush=True)
    reply = ""
    for chunk in ollama.chat(model=MODEL, messages=messages, stream=True):
        piece = chunk.message.content
        print(piece, end="", flush=True)
        reply += piece
    print()
    return reply
Let's trace through what happens when the model decides to use a tool:
- ollama.chat is called with tools=list(TOOLS.values()). Ollama describes each tool to the model using the type hints and docstrings.
- The model replies with a request to call one of the tools instead of answering directly.
- run_tools looks up the function by name in the TOOLS dict and calls it.
- The result is added to messages with the role "tool".
- ollama.chat is called again with stream=True. Now the model has the tool's output and streams its final response word by word to the screen.
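You can exercise run_tools without running a model by faking the tool-call objects. In this sketch, SimpleNamespace stands in for Ollama's response objects, mimicking the call.function.name attribute path; the run_tools body matches the code above, and the stand-in tool always returns "Friday" so the result is predictable.

```python
from types import SimpleNamespace

def day_of_week():
    return "Friday"  # stand-in tool for the demo

TOOLS = {"day_of_week": day_of_week}

def run_tools(tool_calls):
    """Runs any tool calls requested by the model and returns results."""
    results = []
    for call in tool_calls:
        func = TOOLS[call.function.name]
        result = func()
        results.append({"role": "tool", "content": str(result)})
    return results

# Fake a tool-call request with the same shape as Ollama's: call.function.name
fake_call = SimpleNamespace(function=SimpleNamespace(name="day_of_week"))
results = run_tools([fake_call])
```

Testing the dispatch logic this way, separately from the model, makes it much easier to debug when a tool call misbehaves.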
💻
Update SYSTEM_PROMPT to tell the model to check the day before responding:
SYSTEM_PROMPT = """
You are a cheerful morning assistant helping a high school student get ready for school.
Keep responses short and encouraging.
When starting the conversation, use the day_of_week tool and greet the student with today's day.
"""
💻
Run the updated goodmorning.py and start a conversation. The model
should greet you with today's day before waiting for your input.
Choosing a model
So far we have used llama3.2:3b because it is small enough to run on most computers. But there
are thousands of open-weight models available, ranging from tiny models designed for embedded
devices to enormous models that rival the best commercial systems. When you are picking a
model for a project, two questions matter most.
What is this model good at?
A model's strengths depend on what data it was trained on and what task it was optimised for. A model trained mostly on scientific papers will write fluently about chemistry but may struggle with casual conversation. A model fine-tuned on code will write better Python than a general-purpose model of the same size.
Some examples of how training shapes capability:
- General chat models like llama3, mistral, and gemma are trained on broad web text and fine-tuned to follow instructions conversationally. They are good starting points for assistants.
- Code models like deepseek-coder and codellama are trained on large repositories of source code. They are much better at writing and debugging programs than general models.
- Reasoning models like qwen and phi are trained with an emphasis on multi-step problem solving and are good at mathematics and logic tasks.
- Embedding models like nomic-embed-text do not generate text at all: they turn text into vectors that can be compared for similarity (as you explored in the embeddings lab).
- Domain-specific models are fine-tuned on medical records, legal documents, or other specialised corpora, and perform well within their narrow domain.
The model card on Hugging Face or the description on the Ollama library page will tell you what a model was designed for and give benchmark results comparing it to other models.
How much memory does it require?
Every parameter in a neural network is a number. By default, that number is stored as a 32-bit floating-point value, taking 4 bytes of memory. A model with 7 billion parameters therefore needs about 28 GB of memory just to hold the weights—more than most laptops have.
Two techniques allow models to run on more modest hardware.
Quantization reduces memory by storing each parameter in fewer bits. The most common format is 4-bit quantization: each weight is rounded to the nearest value in a small set of 16 possible values and stored in half a byte. This cuts memory use by roughly 8× with only a modest drop in quality for most tasks. A 7-billion-parameter model in 4-bit quantization needs about 4–5 GB of memory, which fits in many consumer laptops.
On Hugging Face and in Ollama you will often see quantization levels like Q4_K_M or Q8_0
in model names. The number refers to how many bits are used per weight; higher numbers
preserve more precision but use more memory.
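The arithmetic behind these sizes is straightforward. This sketch estimates the memory needed just to hold the weights at different bit widths; it ignores overhead such as the context window, so real usage is somewhat higher.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Estimates GB needed to store the weights alone."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

fp32 = weight_memory_gb(7, 32)  # full precision: 4 bytes per weight
q8 = weight_memory_gb(7, 8)     # 8-bit quantization: 1 byte per weight
q4 = weight_memory_gb(7, 4)     # 4-bit quantization: half a byte per weight
```

This matches the 28 GB figure above for a 7-billion-parameter model at full precision, and shows why 4-bit quantization brings the weights down to about 3.5 GB before overhead.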
Distillation is a different approach. Instead of compressing an existing large model,
you train a new, smaller model to mimic the outputs of the large one. The large model acts
as a teacher; the small model is the student. Distilled models are genuinely smaller
architectures—not just a compressed version of the original. Many popular small models,
including some in the phi and qwen families, use distillation.
A useful rule of thumb: a 4-bit quantized model needs roughly 0.6–0.7 GB of RAM per billion parameters once you include overhead for the context window. A 7B model needs about 4–5 GB; a 13B model needs about 8–9 GB.
💻 Check how much memory your system has.
On macOS:
$ system_profiler SPHardwareDataType | grep Memory
Memory: 16 GB
On Linux (including Raspberry Pi):
$ free -h
total used free
Mem: 7.6G 2.1G 5.5G
A GPU can also make a large difference. GPUs are very well suited to the matrix arithmetic that drives LLMs and can run the same computation orders of magnitude faster than a CPU. If your computer has a discrete GPU, Ollama will use it automatically. Apple Silicon Macs use unified memory shared between the CPU and GPU, which is why they run local models well for their price.
Browsing models
Hugging Face hosts tens of thousands of open-weight models. You can filter by task, language, license, and size. Pay close attention to the model card, which describes what the model was trained on, what it excels at, and its limitations.
Ollama's curated library at ollama.com/library is a good starting point—every model there is already packaged for easy use, and each page lists available sizes and quantization levels.