Run GPT-OSS-20B Locally: The Ultimate Guide

Arfath Ahmed Syed

Want to run gpt-oss-20b fully local—no data leaving your machine? Good news: OpenAI’s new open-weight 20B model is designed for on-device and edge use, with practical memory needs and permissive licensing. Below is a clean, copy-pasteable guide with three easy paths: Python (Transformers), Ollama (GGUF), and LM Studio on Apple Silicon. We’ll also cover hardware, quantization, and tips for speed/quality trade-offs.

What is gpt-oss-20b?

gpt-oss-20b is an open-weight, mixture-of-experts (MoE) LLM released alongside a larger 120B variant. It targets lower-latency local and specialized use cases while keeping resource usage modest. It delivers performance comparable to o3-mini on common tasks and is intended for edge/on-device scenarios. Both gpt-oss models use a specific “harmony” response format.

License, weights & formats

  • License: Apache-2.0 (open weights).
  • Weights: official Hugging Face hub page: openai/gpt-oss-20b (download sketch below).
  • Quantized builds: Community GGUF (for llama.cpp/Ollama) and MLX 8-bit (Apple Silicon) are available.
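
To pre-download the official weights (useful for offline machines or shared caches), here is a minimal sketch using huggingface_hub; without extra arguments it simply fills the default Hugging Face cache:

from huggingface_hub import snapshot_download

# Pull the full openai/gpt-oss-20b checkpoint into the local Hugging Face cache.
local_dir = snapshot_download("openai/gpt-oss-20b")
print("Weights stored at:", local_dir)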

Hardware requirements (practical)

  • RAM/VRAM: Designed to run on devices with ~16 GB of memory; with quantization it fits on a single consumer GPU or in a modern laptop's memory (quick check below).
  • Best experience: NVIDIA GPU (≥ 12–16 GB VRAM) or Apple Silicon (M-series) with MLX quant.
  • CPU-only: Feasible with heavy quantization, but slower; consider Ollama/llama.cpp GGUF for simplicity.
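
Unsure what your machine offers? A quick Python check (assumes PyTorch is installed; psutil is only used for the RAM readout):

import torch
import psutil

# Report GPU VRAM and system RAM so you can pick a path before downloading weights.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.1f} GB")
else:
    print("No CUDA GPU detected; consider the Ollama/llama.cpp or Apple Silicon paths.")
print(f"System RAM: {psutil.virtual_memory().total / 1e9:.1f} GB")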

Path A — Pure Python (Transformers) on your GPU

This uses Hugging Face Transformers to load the official 20B weights. It’s the most flexible route (fine-tuning, custom pipelines) and works on Linux/Windows with CUDA, or on Apple Silicon via PyTorch’s MPS backend (slower; the MLX variant in Path C is usually faster on Macs).

# 1) Create env
conda create -n gptoss20b python=3.10 -y && conda activate gptoss20b

# 2) Install deps (CUDA build of PyTorch if on NVIDIA)
pip install --upgrade pip
pip install torch --index-url https://download.pytorch.org/whl/cu121  # adjust CUDA
pip install transformers accelerate bitsandbytes safetensors

# 3) Optional: 4-bit loading via bitsandbytes for lower VRAM (the script below enables it automatically when available)
# 4) Minimal chat script (Python)
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import torch

model_id = "openai/gpt-oss-20b"  # official weights
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

load_kwargs = {
    "torch_dtype": torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    "low_cpu_mem_usage": True,
    "trust_remote_code": True,
    "device_map": "auto",  # let accelerate place weights on GPU/CPU
}

# Optional: 4-bit quantization (skipped automatically if bitsandbytes is unavailable)
try:
    from transformers import BitsAndBytesConfig
    load_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True)
except ImportError:
    pass

# Note: with device_map="auto" (and with quantized models) do not call model.to("cuda") manually.
model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs).eval()

prompt = "You are a helpful assistant. Explain transformers in 3 bullets."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)
_ = model.generate(**inputs, max_new_tokens=300, do_sample=True, top_p=0.9, streamer=streamer)
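
For chat-style use, gpt-oss expects its harmony conversation format. The simplest way to produce it is the tokenizer's built-in chat template; a minimal sketch, assuming the hub tokenizer ships a chat template:

# Chat-style prompting: apply_chat_template renders the harmony conversation format for you.
messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Explain attention in two sentences."},
]
chat_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(chat_ids, max_new_tokens=200, do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0][chat_ids.shape[-1]:], skip_special_tokens=True))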

Troubleshooting (Transformers)

  • OOM? Enable 4-bit loading (explicit config sketched below) or reduce max_new_tokens.
  • Python 3.12 issues? Pin to 3.10/3.11 if some wheels lag behind.
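
If the automatic 4-bit path in the script above is not enough, you can spell the bitsandbytes config out explicitly. A sketch, assuming an NVIDIA GPU; the official checkpoint is already distributed in a compact format, so treat this as a starting point rather than a guaranteed win:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Explicit NF4 quantization with bf16 compute; double quantization saves a little more memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
).eval()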

Path B — Ollama (easiest local serve)

Prefer a one-line local server with GPU/CPU support and a simple /api/generate endpoint? Grab a community GGUF build and run it via Ollama (a Python client example follows the Modelfile below).

  1. Install Ollama (Windows/macOS/Linux) from ollama.com.
  2. Download a gpt-oss-20b GGUF quant from the community page.
  3. Create a Modelfile:
# Modelfile
FROM ./gpt-oss-20b.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 8192

# Build & run
ollama create gpt-oss-20b -f Modelfile
ollama run gpt-oss-20b "Summarize the key ideas in attention mechanisms."
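
Ollama serves the model over HTTP on localhost:11434 by default. A minimal Python sketch against /api/generate, using the model name created above:

import requests

# Query Ollama's local HTTP API (default port 11434); stream=False returns one JSON object.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gpt-oss-20b",  # the name passed to `ollama create`
        "prompt": "Give three practical uses of attention mechanisms.",
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["response"])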

Path C — LM Studio on Apple Silicon (MLX 8-bit)

  1. Install LM Studio (macOS).
  2. Open Models and search for “gpt-oss-20b-MLX-8bit”.
  3. Download the model, chat in the app, or enable the local server for API access (example request below).
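
LM Studio's local server speaks an OpenAI-compatible API. A minimal sketch, assuming the default port 1234 and the model identifier LM Studio shows for your download:

import requests

# Call LM Studio's OpenAI-compatible chat endpoint (enable the local server in the app first).
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "gpt-oss-20b-MLX-8bit",  # use the identifier shown in LM Studio
        "messages": [{"role": "user", "content": "Summarize mixture-of-experts in two sentences."}],
        "max_tokens": 200,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])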

Colab / quick cloud test (optional)

Want to test in the cloud before committing to local hardware? Use the official Colab recipe for gpt-oss-20b.

Performance tips

  • Quantization: 4-bit (Transformers) or GGUF Q-series (Ollama) for memory savings.
  • Context: Keep num_ctx reasonable to avoid RAM spikes.
  • Batching: Stream tokens (see the streaming sketch below) and use small batch sizes.
  • Format: Use “harmony” format for best reliability in structured outputs.
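
For responsive apps, stream tokens as they are generated instead of waiting for the whole completion. A sketch using Transformers' TextIteratorStreamer with the model and tokenizer from Path A:

from threading import Thread
from transformers import TextIteratorStreamer

# Run generate() in a background thread and print tokens as they arrive.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer("List three tips for prompt engineering.", return_tensors="pt").to(model.device)
thread = Thread(target=model.generate, kwargs=dict(**inputs, max_new_tokens=200, streamer=streamer))
thread.start()
for chunk in streamer:
    print(chunk, end="", flush=True)
thread.join()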

Quality & trade-offs

  • The 20B model targets speed and local deployment; the 120B variant offers stronger reasoning if you have the hardware for it.
  • Prompt style matters; experiment with system prompts and sampling params.

End-to-end local mini-app (FastAPI)

# app.py
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()
tok = AutoTokenizer.from_pretrained("openai/gpt-oss-20b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",  # place weights on GPU when available, otherwise CPU
    trust_remote_code=True,
).eval()

class ChatReq(BaseModel):
    prompt: str
    max_new_tokens: int = 256

@app.post("/generate")
def generate(req: ChatReq):
    inputs = tok(req.prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=req.max_new_tokens, do_sample=True, top_p=0.9)
    return {"text": tok.decode(out[0], skip_special_tokens=True)}

# Run the server
uvicorn app:app --reload

# Test it
curl -s http://127.0.0.1:8000/generate -H "Content-Type: application/json" \
  -d '{"prompt":"Write a 2-sentence summary of attention in transformers."}'


That’s it. Choose the path that matches your hardware and comfort level: Transformers for flexibility, Ollama for simplicity, or LM Studio for Mac ease-of-use. You’ll have gpt-oss-20b running locally in minutes with full control over your data.
