Ollama is a free, open-source tool that makes running AI models on your own machine as simple as running any other software. No Docker, no Python environment, no GPU required (though having one helps). This guide takes you from zero to a running local AI model, including which models work best for which tasks and how to get a proper interface beyond the terminal.
What Ollama Is
Ollama is a runtime for open-weight AI models. Think of it as the engine: it handles downloading models, managing their memory usage, and serving them through a simple API that other software can connect to. Ollama is not itself an AI model; it is the infrastructure that makes running models easy. The models run on your hardware, completely offline, with no data leaving your machine.
Step 1: Install Ollama
Go to ollama.com and download the installer for your platform. On macOS, drag it to Applications and launch it; a small icon appears in your menu bar. On Windows, run the .exe installer. On Linux, the one-line install script in the documentation handles everything. Installation takes under two minutes.
Once installed, Ollama runs as a background service. You interact with it either through the terminal or through a compatible application. It listens on localhost:11434 by default.
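Because Ollama serves a plain HTTP API on localhost:11434, any language with an HTTP client can talk to it. As a minimal sketch using only the Python standard library, here is how you might ask the running service which models it has downloaded via its documented /api/tags endpoint:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local endpoint

def tags_url(base_url=OLLAMA_URL):
    """URL of the endpoint that lists downloaded models (GET /api/tags)."""
    return f"{base_url}/api/tags"

def list_local_models(base_url=OLLAMA_URL):
    """Ask the running Ollama service which models are available locally."""
    with urllib.request.urlopen(tags_url(base_url)) as resp:
        data = json.load(resp)
    return [m["name"] for m in data.get("models", [])]

if __name__ == "__main__":
    try:
        print(list_local_models())
    except OSError:
        # Raised when nothing is listening on localhost:11434
        print("No Ollama service found -- is it running?")
```

If the background service is running, this prints the names of every model you have pulled; otherwise it fails with a connection error, which is a quick way to verify the install.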
Step 2: Download a Model
Open your terminal and run: ollama pull llama3.1. This downloads the Llama 3.1 8B model (about 4.7GB) from Meta via Ollama's servers. The download takes 5-15 minutes depending on your connection. Once downloaded, the model is stored locally and never needs to be downloaded again.
You can test it immediately with: ollama run llama3.1. A prompt appears. Type anything and press Enter. You are running AI locally.
Choosing the Right Model for Your Hardware
Model selection depends primarily on how much RAM your machine has. Here are the practical recommendations:
8 GB RAM: Llama 3.2 3B (very fast, good for simple tasks), Phi-3 Mini (efficient, surprisingly capable), Mistral 7B Q4 (fits with some room to spare, good general quality). Expect 5-20 second response times on CPU-only hardware.
16 GB RAM: Llama 3.1 8B (the sweet spot for most users: good quality, reasonable speed), Mistral 7B at full precision, Gemma 2 9B. Expect 2-8 second response times. With an integrated GPU (Apple Silicon or AMD Radeon), responses are often under 3 seconds.
32 GB RAM: Qwen 2.5 14B (excellent for coding and analysis, and notably better than 8B-class models for complex reasoning), Gemma 2 27B (quantized), Mistral Nemo 12B. Responses in 1-4 seconds on Apple Silicon M-series.
64 GB RAM, or a 24GB-VRAM GPU with some layers offloaded to system RAM: Llama 3.1 70B quantized (approaches GPT-4 quality on many tasks), Qwen 2.5 32B, DeepSeek Coder 33B. Near-cloud speeds with local privacy.
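The hardware tiers above boil down to a simple threshold lookup. A sketch that encodes this guide's recommendations (the model tags are assumed to match Ollama's public model library; nothing in Ollama itself enforces these cutoffs):

```python
def recommend_models(ram_gb):
    """Map available RAM (GB) to the model tiers recommended in this guide."""
    tiers = [
        (64, ["llama3.1:70b", "qwen2.5:32b"]),
        (32, ["qwen2.5:14b", "mistral-nemo"]),
        (16, ["llama3.1:8b", "mistral:7b", "gemma2:9b"]),
        (8,  ["llama3.2:3b", "phi3:mini"]),
    ]
    for min_ram, models in tiers:
        if ram_gb >= min_ram:
            return models
    return []  # under 8 GB: stick to the smallest quantized models

print(recommend_models(16))  # -> ['llama3.1:8b', 'mistral:7b', 'gemma2:9b']
```

Treat the output as a starting point: a machine at the bottom of a tier, or one running other memory-hungry software, may be happier with a model from the tier below.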
Best Models by Task Type
General conversation and email: Llama 3.1 8B or Mistral 7B. Both handle natural-language tasks extremely well with modest hardware requirements.
Coding and technical tasks: Qwen 2.5 Coder, DeepSeek Coder V2. These are fine-tuned specifically for code generation and debugging. Noticeably better than general models for programming tasks.
Reasoning and analysis: Qwen 2.5 14B or 32B. These models perform disproportionately well on analytical tasks relative to their size.
Fast responses for high-volume use: Phi-3 Mini or Llama 3.2 3B. When you need quick responses and quality is secondary to speed.
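Because every request to Ollama's API names its model explicitly, you can switch models per task rather than settling on one. A small sketch encoding the task recommendations above (model tags assumed from Ollama's model library; verify what you actually have with ollama list):

```python
# Map the task categories from this section to Ollama model tags.
TASK_MODELS = {
    "conversation": "llama3.1",
    "coding": "qwen2.5-coder",
    "analysis": "qwen2.5:14b",
    "fast": "llama3.2:3b",
}

def model_for(task):
    """Pick a model tag for a task, falling back to a general-purpose model."""
    return TASK_MODELS.get(task, "llama3.1")

print(model_for("coding"))   # -> qwen2.5-coder
print(model_for("unknown"))  # -> llama3.1
```

The returned tag is what you would pass as the model field of an API request (or to ollama run). Note that each model you switch to must already be pulled, and loading a cold model into memory adds a few seconds to the first response.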
Step 3: Connect Ollama to Skales (Get a GUI)
The terminal interface is fine for testing, but for practical daily use you want a proper application. Skales connects to your local Ollama instance and gives you a full AI agent experience (email, calendar, file management, automation), all powered by your local model.
After installing Skales, go to Settings → Providers → Add Provider, select Ollama, enter http://localhost:11434 as the endpoint, select your downloaded model from the dropdown, and click Save. Your local model now powers all of Skales' features.
Read more about fully offline operation with Ollama or explore local AI use cases. Download Skales free; it takes less than two minutes to set up alongside Ollama.