
Multi-Model AI Routing: Using the Right Model for Each Task

Mario Simic

5 min read

Most people who use cloud AI via API pay more than they need to. Not because they are careless, but because routing everything through one model — typically the best available one — is the path of least resistance. The cost difference between an optimal routing strategy and a naive "always use GPT-4o" approach is often 70–90% for typical knowledge-worker workflows.

Why One Model Does Not Fit All Tasks

Consider the difference between these two tasks: "What is 2 + 2?" and "Analyse the contradictions in this 50-page legal brief and identify the three weakest points in the plaintiff's argument." The first requires essentially no reasoning; a model small enough to run on a smartphone would get it right. The second requires sophisticated analysis, extensive context handling, and nuanced judgment. Running both through GPT-4o at $5 per million input tokens wastes significant money on the first task while arguably getting reasonable value from the second.

The cost spread across today's available models is enormous. GPT-4o mini costs approximately $0.15 per million input tokens — about 33x cheaper than GPT-4o. A well-tuned routing strategy that sends appropriate tasks to cheaper models while reserving expensive ones for complex reasoning can reduce a $50/month API bill to under $5 with no perceptible quality difference on the routed tasks.
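As a back-of-envelope illustration, here is the arithmetic behind that claim. The 10M-token monthly workload and the 6/3.2/0.8 split across tiers are hypothetical assumptions, not measured usage, and the prices are the input-token rates quoted above:

```python
# Illustrative per-1M-input-token prices (USD); check current provider pricing.
PRICES = {"gpt-4o": 5.00, "gpt-4o-mini": 0.15, "local": 0.00}

def monthly_cost(tokens_millions_by_model: dict) -> float:
    """Sum of (price per 1M tokens) * (millions of tokens sent to that model)."""
    return sum(PRICES[model] * millions
               for model, millions in tokens_millions_by_model.items())

# Naive: all 10M tokens go to the frontier model.
naive = monthly_cost({"gpt-4o": 10})
# Routed: a hypothetical post-tuning split (6M local, 3.2M cheap, 0.8M frontier).
routed = monthly_cost({"local": 6, "gpt-4o-mini": 3.2, "gpt-4o": 0.8})

print(f"naive:  ${naive:.2f}")   # naive:  $50.00
print(f"routed: ${routed:.2f}")  # routed: $4.48
```

The frontier model still handles the hard 8% of tokens; the savings come entirely from not sending it the other 92%.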

A Practical Routing Tier System

Tier 1 — Local (free): Use Ollama with a 7–8B model for tasks that involve personal or sensitive data, or that do not require frontier-model capability. Template filling, email drafting from a provided draft, simple summarisation of short content, answering questions about files you provide. Zero cost, full privacy.

Tier 2 — Cheap cloud ($0.15–$0.50 per 1M tokens): GPT-4o mini, Claude Haiku, Gemini Flash. Good for classification, content generation from a clear brief, research synthesis from provided sources, standard coding tasks. Handles the bulk of conversational AI use without needing frontier quality.

Tier 3 — Expensive cloud ($3–$15 per 1M tokens): GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro. Reserve for tasks that genuinely need deep reasoning: complex multi-document analysis, nuanced writing where tone and judgment matter, hard debugging problems, tasks where the cost of a wrong answer is high.
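The three tiers boil down to a small lookup plus one override. The task names and the `route` helper below are hypothetical, just a sketch of the decision rule the tiers imply, with sensitive data always pinned to Tier 1:

```python
from enum import IntEnum

class Tier(IntEnum):
    LOCAL = 1     # Ollama, free, private
    CHEAP = 2     # GPT-4o mini / Claude Haiku / Gemini Flash
    FRONTIER = 3  # GPT-4o / Claude 3.5 Sonnet / Gemini 1.5 Pro

# Hypothetical task categories, mapped per the tier descriptions above.
TASK_TIERS = {
    "template_fill": Tier.LOCAL,
    "email_draft": Tier.LOCAL,
    "classification": Tier.CHEAP,
    "research_synthesis": Tier.CHEAP,
    "multi_doc_analysis": Tier.FRONTIER,
    "hard_debugging": Tier.FRONTIER,
}

def route(task: str, sensitive: bool = False) -> Tier:
    """Sensitive data never leaves the machine; unknown tasks default to cheap cloud."""
    if sensitive:
        return Tier.LOCAL
    return TASK_TIERS.get(task, Tier.CHEAP)
```

Note the default: an unrecognised task goes to Tier 2, not Tier 3, so the expensive tier is only ever reached by an explicit decision.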

How to Implement Routing in Skales

Skales supports multiple providers simultaneously. In Settings → Providers, you can add Ollama, OpenRouter, and direct API connections as separate providers. The routing can be configured at several levels:

Per-conversation: Click the provider selector in any chat window to choose which model handles that conversation. Use Ollama for a sensitive client email, switch to Claude 3.5 for a complex strategy document.

Per-skill: When defining a custom skill, specify which provider it uses. Your "meeting notes" skill routes to Ollama. Your "legal contract analysis" skill routes to GPT-4o. The right model is selected automatically when you invoke the skill.

Smart routing (experimental): Skales can analyse the apparent complexity and sensitivity of a query and suggest routing. A brief query with no attachments routes to Tier 1. A query with multiple document attachments and analytical language routes to Tier 3. This is a guide, not an override — you remain in control of the final model selection.
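A minimal version of that heuristic might look like the sketch below. The thresholds and keyword list are assumptions chosen for illustration, not the actual Skales classifier:

```python
def suggest_tier(query: str, attachment_count: int = 0, sensitive: bool = False) -> int:
    """Suggest a routing tier (1 = local, 2 = cheap cloud, 3 = frontier cloud).

    Assumed heuristic: sensitive queries stay local; multiple attachments
    plus analytical language suggest the frontier tier; short, attachment-free
    queries stay local; everything else goes to the cheap cloud tier.
    """
    if sensitive:
        return 1
    analytical = any(word in query.lower()
                     for word in ("analyse", "analyze", "compare", "contradiction"))
    if attachment_count >= 2 and analytical:
        return 3
    if attachment_count == 0 and len(query.split()) <= 15:
        return 1
    return 2
```

Because it only suggests an integer, a UI layer can surface the suggestion while leaving the final model choice with the user, which is the behaviour described above.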

The practical result for most users: local and cheap models handle 80–90% of daily tasks, and expensive frontier models are used deliberately for the 10–20% that genuinely benefit from them. Explore the provider settings in Skales, or download it free.

Try it yourself 🦎

Skales is free for personal use. No Docker. No account.

Download Free →
