In early 2024, choosing a local model over GPT-4o meant accepting a significant capability gap. In 2026, the picture is meaningfully different. The open-weight model ecosystem has matured rapidly, and the decision between local and cloud now involves genuine trade-offs rather than a clear winner.
GPT-4o: What You Are Paying For
GPT-4o leads on complex multi-step reasoning, mathematical problem-solving, and tasks that require deep world knowledge. Its instruction-following is more reliable on edge cases. For coding tasks involving obscure libraries, legal document analysis, or anything requiring nuanced judgment, GPT-4o still has a meaningful edge. You pay roughly $5-15 per million tokens via API, which adds up for high-volume use but is negligible for occasional queries.
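To make the pricing concrete, here is a quick back-of-the-envelope calculation. The $5 and $15 per-million-token figures are the range quoted above; the monthly token volumes are hypothetical illustrations:

```python
def api_cost_usd(tokens: int, price_per_million: float) -> float:
    """Rough API cost: tokens / 1e6 * price per million tokens."""
    return tokens / 1_000_000 * price_per_million

# Occasional use: ~200k tokens/month at the low end of the quoted range.
light = api_cost_usd(200_000, 5.0)        # $1.00/month

# High-volume use: 500M tokens/month at the high end.
heavy = api_cost_usd(500_000_000, 15.0)   # $7,500/month

print(f"light: ${light:.2f}/mo, heavy: ${heavy:,.0f}/mo")
```

At the light end the API is effectively free; at the heavy end, local hardware pays for itself within months.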
Llama 3.1 and the Local Contenders
Llama 3.1 70B (running on a machine with 48GB RAM, or aggressively quantised to fit in 24GB) matches or beats GPT-3.5-Turbo on nearly every benchmark and is competitive with GPT-4o on many everyday tasks. From a 7-13B model running on a 16GB MacBook, expect Claude-Instant-level quality: very good for drafting, summarising, explaining, and answering questions about provided documents.
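As a rough sanity check on those RAM figures: weight memory scales with parameter count times bits per weight. The sketch below ignores KV cache and runtime overhead, so real usage runs somewhat higher:

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate model weight size in GB (1 GB = 1e9 bytes).
    Ignores KV cache, context buffers, and runtime overhead."""
    return params_billions * bits_per_weight / 8

# 70B at 4-bit quantisation: ~35 GB of weights, so a 48GB machine
# has headroom for the KV cache and the OS.
print(weight_gb(70, 4))   # 35.0

# 8B at 4-bit: ~4 GB of weights, comfortable on a 16GB laptop.
print(weight_gb(8, 4))    # 4.0
```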
The models worth running locally in 2026: Llama 3.1 8B (16GB RAM, excellent for general use), Qwen 2.5 14B (~20GB RAM, notably strong on coding and analysis), Mistral 7B (fast, compact, reliable), and DeepSeek Coder specifically for programming tasks. All are installable via Ollama with a single command and can be connected to Skales in Settings. Read more about offline AI workflows.
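The list above can be turned into a trivial picker keyed on your RAM budget. The figures come from the text where given; the 8GB entry for Mistral 7B is an assumption (a 7B model at 4-bit quantisation), and actual requirements shift with quant level and context length:

```python
# Approximate minimum RAM (GB) per model, Ollama-style tags.
# llama3.1 and qwen2.5 figures are from the text above;
# the mistral figure is an assumption for a 4-bit 7B model.
MODELS = {
    "llama3.1:8b": 16,
    "qwen2.5:14b": 20,
    "mistral:7b": 8,
}

def runnable(ram_gb: int) -> list[str]:
    """Return the model tags that fit within the given RAM budget."""
    return sorted(m for m, need in MODELS.items() if need <= ram_gb)

print(runnable(16))  # ['llama3.1:8b', 'mistral:7b']
```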