Run GPT-OSS-20B Locally: The Ultimate Guide


Want to run gpt-oss-20b fully local—no data leaving your machine? Good news: OpenAI’s new open-weight 20B model is designed for on-device and edge use, with practical memory needs and permissive licensing. Below is a clean, copy-pasteable guide with three easy paths: Python (Transformers), Ollama (GGUF), and LM Studio on Apple Silicon. We’ll also cover hardware, quantization, and tips for speed/quality trade-offs.
What is gpt-oss-20b?
gpt-oss-20b is an open-weight, mixture-of-experts (MoE) LLM released alongside a larger 120B variant. It targets lower-latency local and specialized use cases while keeping resource usage modest. It delivers performance comparable to o3-mini on common tasks and is intended for edge/on-device scenarios. Both gpt-oss models use a specific “harmony” response format.
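A quick note on the harmony format up front: when you load the model through Transformers, the tokenizer's chat template renders harmony for you, so in practice you just build an ordinary role/content messages list. A minimal sketch (the `build_messages` helper is ours for illustration, not part of any API):

```python
# Sketch: harmony chat reduces to a standard messages list when using
# Transformers — tokenizer.apply_chat_template(...) renders the harmony tokens.
def build_messages(system: str, user: str) -> list[dict]:
    """Assemble a chat in the role/content shape the chat template expects."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_messages(
    "You are a concise assistant.",
    "Explain attention in one sentence.",
)
# With a loaded tokenizer you would then do (not executed here):
# inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
#                                        return_tensors="pt")
print([m["role"] for m in messages])
```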
License, weights & formats
- License: Apache-2.0 (open weights).
- Weights: Official hub page: Hugging Face – openai/gpt-oss-20b.
- Quantized builds: Community GGUF (for llama.cpp/Ollama) and MLX 8-bit (Apple Silicon) are available.
Hardware requirements (practical)
- RAM/VRAM: Designed to run on devices with ~16 GB memory (fits a single consumer GPU or modern laptop memory with quantization).
- Best experience: NVIDIA GPU (≥ 12–16 GB VRAM) or Apple Silicon (M-series) with MLX quant.
- CPU-only: Feasible with heavy quantization, but slower; consider Ollama/llama.cpp GGUF for simplicity.
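As a rough sanity check on the ~16 GB figure, weight memory is approximately parameter count × bytes per parameter. A back-of-envelope sketch (gpt-oss-20b has roughly 21B total parameters; the 20% overhead factor for cache/activations is our assumption, not a measured number):

```python
def estimate_weight_gb(params_billions: float, bits_per_param: float,
                       overhead: float = 1.2) -> float:
    """Rough memory estimate: params x bytes/param, plus ~20% overhead
    (KV cache, activations, runtime buffers) -- the overhead is a guess."""
    bytes_total = params_billions * 1e9 * (bits_per_param / 8)
    return round(bytes_total * overhead / 1e9, 1)

# ~21B total parameters (MoE; only a fraction are active per token)
print(estimate_weight_gb(21, 4))   # 4-bit quant: fits a 16 GB budget
print(estimate_weight_gb(21, 16))  # bf16: far beyond a 16 GB card
```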
Path A — Pure Python (Transformers) on your GPU
This uses Hugging Face Transformers to load the official 20B weights. It’s the most flexible route (fine-tuning, custom pipelines) and works on Linux/Windows + CUDA or on Apple Silicon via Metal (slower unless using MLX variant).
```bash
# 1) Create env
conda create -n gptoss20b python=3.10 -y && conda activate gptoss20b

# 2) Install deps (CUDA build of PyTorch if on NVIDIA)
pip install --upgrade pip
pip install torch --index-url https://download.pytorch.org/whl/cu121  # adjust to your CUDA version
pip install transformers accelerate bitsandbytes safetensors
```

Then run a minimal chat script — it optionally enables 4-bit loading (bitsandbytes) for lower VRAM:
```python
# chat.py — minimal streaming chat
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import torch

model_id = "openai/gpt-oss-20b"  # official weights
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

load_kwargs = {
    "torch_dtype": torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    "low_cpu_mem_usage": True,
    "trust_remote_code": True,
}

# Optional: 4-bit quantization via bitsandbytes
try:
    from transformers import BitsAndBytesConfig
    load_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True)
except Exception:
    pass

model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs).eval()
# bitsandbytes places quantized weights on the GPU itself; calling .to("cuda")
# on a quantized model raises an error, so only move unquantized models.
if torch.cuda.is_available() and "quantization_config" not in load_kwargs:
    model.to("cuda")

prompt = "You are a helpful assistant. Explain transformers in 3 bullets."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)
_ = model.generate(**inputs, max_new_tokens=300, do_sample=True, top_p=0.9, streamer=streamer)
```
Troubleshooting (Transformers)
- OOM? Enable 4-bit quantization (above) or reduce `max_new_tokens`.
- Python 3.12 issues? Pin to 3.10/3.11 if some wheels lag behind.
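A common OOM source during generation is the KV cache growing with context length. A quick estimator sketch — the layer/head numbers in the example call are placeholders, not gpt-oss-20b's verified architecture; read the real values from `model.config` (`num_hidden_layers`, `num_key_value_heads`, `head_dim`):

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x bytes, per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return round(tokens * per_token / 1e9, 2)

# Placeholder architecture numbers — substitute your model.config values.
print(kv_cache_gb(tokens=8192, layers=24, kv_heads=8, head_dim=64))
```

Grouped-query attention (fewer KV heads than query heads) is what keeps this number small relative to the weights.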
Path B — Ollama (easiest local serve)
Prefer a one-line command that serves the model locally, with GPU/CPU support and a simple `/api/generate` endpoint? Use the community GGUF builds and run them via Ollama.
- Install Ollama (Windows/macOS/Linux) from ollama.com.
- Download a gpt-oss-20b GGUF quant from the community page.
- Create a `Modelfile`:

```text
# Modelfile
FROM ./gpt-oss-20b.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
```

- Build & run:

```bash
ollama create gpt-oss-20b -f Modelfile
ollama run gpt-oss-20b "Summarize the key ideas in attention mechanisms."
```
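Once the model is created, Ollama also exposes an HTTP API on port 11434. A minimal standard-library client sketch (the payload builder is split out so it can be sanity-checked without a running server):

```python
import json
import urllib.request

def build_generate_payload(model: str, prompt: str, stream: bool = False) -> dict:
    """Request body for Ollama's POST /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}

def ollama_generate(prompt: str, model: str = "gpt-oss-20b",
                    url: str = "http://127.0.0.1:11434/api/generate") -> str:
    payload = build_generate_payload(model, prompt)
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running Ollama instance:
# print(ollama_generate("Name one use of attention, in two sentences."))
```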
Path C — LM Studio on Apple Silicon (MLX 8-bit)
- Install LM Studio (macOS).
- Open Models and search for “gpt-oss-20b-MLX-8bit”.
- Download, chat, or enable the local server for API access.
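LM Studio's local server speaks the OpenAI-compatible API (default port 1234). A standard-library client sketch — the `"local-model"` model name is a placeholder, since LM Studio routes requests to whichever model is loaded:

```python
import json
import urllib.request

def build_chat_payload(model: str, user_msg: str, max_tokens: int = 256) -> dict:
    """OpenAI-style body for the /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "max_tokens": max_tokens,
    }

def lmstudio_chat(user_msg: str,
                  url: str = "http://127.0.0.1:1234/v1/chat/completions") -> str:
    payload = build_chat_payload("local-model", user_msg)  # placeholder name
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Requires LM Studio's local server to be enabled:
# print(lmstudio_chat("Summarize MoE routing in one sentence."))
```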
Colab / quick cloud test (optional)
Test in the cloud before going local using the official Colab recipe for gpt-oss-20b.
Performance tips
- Quantization: 4-bit (Transformers) or GGUF Q-series (Ollama) for memory savings.
- Context: Keep `num_ctx` reasonable to avoid RAM spikes.
- Batching: Stream tokens and use small batch sizes.
- Format: Use “harmony” format for best reliability in structured outputs.
Quality & trade-offs
- 20B targets speed/local deploy; 120B is for higher reasoning if you have the hardware.
- Prompt style matters; experiment with system prompts and sampling params.
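To make “experiment with sampling params” concrete, here are illustrative starting presets for `model.generate` — the values are our suggestions to A/B against your tasks, not official recommendations:

```python
# Hypothetical starting presets for generate(**inputs, **preset) — tune per task.
SAMPLING_PRESETS = {
    "precise":  {"do_sample": False},                            # greedy: repeatable answers
    "balanced": {"do_sample": True, "temperature": 0.7, "top_p": 0.9},
    "creative": {"do_sample": True, "temperature": 1.0, "top_p": 0.95},
}

def pick_preset(task: str) -> dict:
    """Map a coarse task type to a preset; unknown tasks fall back to 'balanced'."""
    table = {"code": "precise", "chat": "balanced", "brainstorm": "creative"}
    return SAMPLING_PRESETS[table.get(task, "balanced")]

print(pick_preset("code"))
```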
End-to-end local mini-app (FastAPI)
```python
# app.py — minimal local generation API
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()
tok = AutoTokenizer.from_pretrained("openai/gpt-oss-20b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b", torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().to("cuda" if torch.cuda.is_available() else "cpu")

class ChatReq(BaseModel):
    prompt: str
    max_new_tokens: int = 256

@app.post("/generate")
def generate(req: ChatReq):
    inputs = tok(req.prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=req.max_new_tokens,
                         do_sample=True, top_p=0.9)
    return {"text": tok.decode(out[0], skip_special_tokens=True)}
```

Start the server, then test it:

```bash
uvicorn app:app --reload
curl -s http://127.0.0.1:8000/generate -H "Content-Type: application/json" \
  -d '{"prompt":"Write a 2-sentence summary of attention in transformers."}'
```
References & further reading
- OpenAI announcement
- Model card
- Hugging Face weights
- Community GGUF builds
- MLX 8-bit builds
- OpenAI Colab example
That’s it. Choose the path that matches your hardware and comfort level: Transformers for flexibility, Ollama for simplicity, or LM Studio for Mac ease-of-use. You’ll have gpt-oss-20b running locally in minutes with full control over your data.
