Most people who use cloud AI via API pay more than they need to. Not because they are careless, but because routing everything through one model — typically the best available one — is the path of least resistance. The cost difference between an optimal routing strategy and a naive "always use GPT-4o" approach is often 70–90% for typical knowledge-worker workflows.
Why One Model Does Not Fit All Tasks
Consider the difference between these two tasks: "What is 2 + 2?" and "Analyse the contradictions in this 50-page legal brief and identify the three weakest points in the plaintiff's argument." The first requires essentially no reasoning — a model small enough to run on a smartphone would get it right. The second requires sophisticated analysis, extensive context handling, and nuanced judgment. Running both through GPT-4o at $5 per million tokens wastes money on the first task while delivering genuine value only on the second.
The cost spread across today's available models is enormous. GPT-4o mini costs approximately $0.15 per million input tokens — about 33x cheaper than GPT-4o. A well-tuned routing strategy that sends appropriate tasks to cheaper models while reserving expensive ones for complex reasoning can reduce a $50/month API bill to under $5 with no perceptible quality difference on the routed tasks.
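The savings are easy to sanity-check with a back-of-envelope calculation. The sketch below uses the prices quoted above and an assumed 10M-token monthly volume with 90% of traffic routed to the cheap tier; the numbers are illustrative, not a pricing API.

```python
# Illustrative cost comparison, input tokens only.
# Prices are the per-1M-token figures quoted in the article.
PRICE_PER_M = {
    "gpt-4o": 5.00,
    "gpt-4o-mini": 0.15,
}

def monthly_cost(tokens_m: float, model: str) -> float:
    """USD cost for tokens_m million tokens on the given model."""
    return tokens_m * PRICE_PER_M[model]

# Assumed workload: 10M tokens/month, everything through GPT-4o.
naive = monthly_cost(10, "gpt-4o")
# Same volume, 90% routed to GPT-4o mini, 10% kept on GPT-4o.
routed = monthly_cost(9, "gpt-4o-mini") + monthly_cost(1, "gpt-4o")

print(f"naive: ${naive:.2f}, routed: ${routed:.2f}")  # naive: $50.00, routed: $6.35
```

Under these assumptions routing cuts the bill by roughly 87%, squarely in the 70–90% range claimed earlier.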
A Practical Routing Tier System
Tier 1 — Local (free): Use Ollama with a 7–8B model for tasks that involve personal or sensitive data, or that do not require frontier-model capability. Template filling, polishing emails from rough drafts, simple summarisation of short content, answering questions about files you provide. Zero cost, full privacy.
Tier 2 — Cheap cloud ($0.15–$0.50 per 1M tokens): GPT-4o mini, Claude Haiku, Gemini Flash. Good for classification, content generation from a clear brief, research synthesis from provided sources, standard coding tasks. Handles the bulk of conversational AI use without needing frontier quality.
Tier 3 — Expensive cloud ($3–$15 per 1M tokens): GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro. Reserve for tasks that genuinely need deep reasoning: complex multi-document analysis, nuanced writing where tone and judgment matter, hard debugging problems, tasks where the cost of a wrong answer is high.
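The three tiers above can be captured as a small lookup table. This is a hypothetical sketch of how you might represent them in your own tooling; the model identifiers and the `Tier`/`pick_model` names are invented for illustration and are not a Skales API.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    models: list           # candidate models, cheapest-preferred first
    price_per_m: tuple     # (low, high) USD per 1M input tokens

# Mirrors the three-tier system described above; entries are illustrative.
TIERS = {
    1: Tier("local",       ["llama3:8b"],                                        (0.0, 0.0)),
    2: Tier("cheap-cloud", ["gpt-4o-mini", "claude-haiku", "gemini-flash"],      (0.15, 0.50)),
    3: Tier("frontier",    ["gpt-4o", "claude-3.5-sonnet", "gemini-1.5-pro"],    (3.0, 15.0)),
}

def pick_model(tier: int) -> str:
    """Return the first configured model for a tier."""
    return TIERS[tier].models[0]
```

Keeping the table explicit makes the cost boundaries auditable: you can see at a glance which tasks are paying frontier prices.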
How to Implement Routing in Skales
Skales supports multiple providers simultaneously. In Settings → Providers, you can add Ollama, OpenRouter, and direct API connections as separate providers. The routing can be configured at several levels:
Per-conversation: Click the provider selector in any chat window to choose which model handles that conversation. Use Ollama for a sensitive client email, switch to Claude 3.5 Sonnet for a complex strategy document.
Per-skill: When defining a custom skill, specify which provider it uses. Your "meeting notes" skill routes to Ollama. Your "legal contract analysis" skill routes to GPT-4o. The right model is selected automatically when you invoke the skill.
Smart routing (experimental): Skales can analyse the apparent complexity and sensitivity of a query and suggest routing. A brief query with no attachments routes to Tier 1. A query with multiple document attachments and analytical language routes to Tier 3. This is a guide, not an override — you remain in control of the final model selection.
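The routing heuristic described above can be sketched as a simple function. This is not Skales' actual logic — the function name, the keyword list, and the thresholds are all invented for illustration — but it shows the shape of the idea: suggest a tier, never force one.

```python
def suggest_tier(query: str, attachments: int, sensitive: bool) -> int:
    """Suggest a routing tier for a query (1=local, 2=cheap cloud, 3=frontier).

    Hypothetical heuristic mirroring the rules in the text;
    thresholds are illustrative, not Skales' implementation.
    """
    if sensitive:
        return 1  # personal or sensitive data stays local
    analytical = any(word in query.lower()
                     for word in ("analyse", "analyze", "compare", "contradiction"))
    if attachments >= 2 and analytical:
        return 3  # multi-document analytical work -> frontier model
    if attachments == 0 and len(query.split()) < 20:
        return 1  # brief query, no attachments -> local
    return 2      # everything else defaults to the cheap cloud tier

print(suggest_tier("What is 2 + 2?", attachments=0, sensitive=False))  # 1
```

Because the function only returns a suggestion, the caller (or the user) still makes the final model selection, matching the "guide, not an override" behaviour described above.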
The practical result for most users: local and cheap models handle 80–90% of daily tasks, and expensive frontier models are used deliberately for the 10–20% that benefit from them. Explore the provider settings in Skales, or download it free.