Concepts
The problem Minima solves
LLM workflows overspend by sending every call to a top-tier model when a cheaper model would do a portion of the work just as well. Token cost is the lever; model choice is the cheapest knob to turn. Minima turns that knob, per task, based on what models have actually done on similar tasks before.
Recommend-only, zero added latency
Minima only recommends. It does not proxy your call, execute a model, rewrite prompts, cache, or compress. You ask “which model should run this?”, it answers, and you run the model yourself in your own stack. Because Minima sits beside your call rather than in front of it, it adds zero latency to the actual LLM request. The only Minima round-trip is the recommendation lookup (typically ~100–300ms).
The loop
      ┌─────────────────────────────────────────────────────────┐
      │                                                         │
      ▼                                                         │
    POST /v1/recommend ──▶ you run the model ──▶ POST /v1/feedback
    (recall + rank)        (your stack)          (write outcome, reinforce memory)
      ▲                                                         │
      │           memory gets sharper for next time             │
      └─────────────────────────────────────────────────────────┘

- Recommend. Minima recalls similar past task → model → outcome records from Mubit, aggregates each candidate model’s empirical success rate, combines it with cost and capability priors, and returns the cheapest model expected to clear a quality bar.
- Run it yourself. Minima hands back a recommendation_id; you run the recommended model.
- Feed back. You report the outcome and a quality score. Minima writes the outcome to Mubit, reinforces the exact memories that drove the decision, and (on strong verified-in-production results) promotes a durable lesson.
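The three steps above can be sketched as one function. This is a minimal illustration, not the official client: the `post` transport, the `task` request field, and the `recommended_model` response field are assumptions made for the example. Only the two endpoint paths, `recommendation_id`, `cost_quality_tradeoff`, and the feedback fields come from this page.

```python
from typing import Any, Callable, Dict

# Hypothetical transport: takes an API path and a JSON-serializable payload and
# returns the decoded JSON response. In practice it would wrap an HTTP client.
Post = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def recommend_run_feedback(
    post: Post,
    run_model: Callable[[str, str], Dict[str, Any]],
    prompt: str,
) -> Dict[str, Any]:
    """One pass through the loop: recommend, run in your own stack, feed back."""
    # 1. Ask which model should run this task ("task" is an assumed field name).
    rec = post("/v1/recommend", {"task": prompt, "cost_quality_tradeoff": 5})

    # 2. Run the recommended model yourself. "recommended_model" is an assumed
    #    response field; run_model is your own code, outside Minima.
    result = run_model(rec["recommended_model"], prompt)

    # 3. Report the outcome so the memories behind this pick can be reinforced.
    feedback = {
        "recommendation_id": rec["recommendation_id"],
        "quality_score": result["quality_score"],
        "cost_usd": result["cost_usd"],
        "input_tokens": result["input_tokens"],
        "output_tokens": result["output_tokens"],
    }
    post("/v1/feedback", feedback)
    return feedback
```

Injecting `post` keeps the real model call entirely in your code, which mirrors the recommend-only design: Minima is consulted before and after the call, never during it.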
Why memory
The recommendation engine is non-parametric k-NN over history: recall similar past records, aggregate per-model success, pick the cheapest model clearing a threshold. Minima is backed by Mubit, which provides that substrate: semantic recall over server-side embeddings, per-entry reinforcement with a Bayesian reliability estimate, lesson promotion, and strategy surfacing for explainability. You don’t operate any of it; it’s part of the hosted service.
The recommendation algorithm
For each request Minima:
- Classifies the task. It uses your task_type / difficulty hints if given; otherwise a fast heuristic infers them. If the heuristic is uncertain and escalation is allowed, the cheap-LLM reasoner can refine the classification. From this it derives a task cluster (e.g. code:hard).
- Selects candidates. It starts from the full model catalog, applies your constraint filters (candidate_models, allowed_providers, excluded_models, require_prompt_caching, require_context_window), pre-ranks by capability prior, and caps to max_candidates.
- Recalls similar past outcome records scoped to your account (and namespace, if set), with a hard timeout. On timeout or no history, it falls back to the prior-only path.
- Aggregates per model. Each recalled neighbor is weighted by similarity × reliability × staleness_decay, then combined into a Beta-smoothed empirical success rate per candidate (so models with no neighbors fall back to their capability prior, not to 0.5). An inverse-propensity weighting step corrects for the selection bias introduced when you’ve historically sent certain task types to certain models.
- Scores each candidate by combining predicted success with estimated cost (see Cost-basis tiers below). The slider sets a quality threshold τ.
- Optimizes. Among models predicted to clear τ, it recommends the cheapest (tie-break: higher success, then higher confidence). If none clear τ, it recommends the highest-predicted-success model and warns no_model_meets_threshold. A fallback_model is chosen as a more reliable retry target.
- Escalates to a cheap-LLM reasoner when evidence is thin or conflicting (see below).
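The aggregation and optimization steps can be illustrated with a short sketch. It is not Minima's implementation: the `Neighbor` type, the `prior_strength` pseudo-count, and the function names are assumptions, and the inverse-propensity correction and the confidence tie-break are left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Neighbor:
    """One recalled outcome record (hypothetical shape)."""
    model: str
    success: bool       # did the past run meet its quality bar?
    similarity: float   # 0..1, from semantic recall
    reliability: float  # 0..1, per-entry reliability estimate
    decay: float        # 0..1, staleness decay


def smoothed_success(neighbors: List[Neighbor], model: str,
                     prior: float, prior_strength: float = 2.0) -> float:
    """Weighted success rate shrunk toward the model's capability prior."""
    wins = total = 0.0
    for n in neighbors:
        if n.model != model:
            continue
        weight = n.similarity * n.reliability * n.decay
        total += weight
        wins += weight if n.success else 0.0
    # Beta-style smoothing: the prior counts as prior_strength pseudo-observations,
    # so a model with no neighbors gets exactly its prior rather than 0.5.
    return (wins + prior * prior_strength) / (total + prior_strength)


def pick(candidates: Dict[str, Tuple[float, float]],
         tau: float) -> Tuple[str, Optional[str]]:
    """candidates maps model -> (predicted_success, est_cost).

    Returns (recommended_model, warning_or_None).
    """
    clearing = {m: v for m, v in candidates.items() if v[0] >= tau}
    if clearing:
        # Cheapest first; ties broken by higher predicted success.
        best = min(clearing, key=lambda m: (clearing[m][1], -clearing[m][0]))
        return best, None
    # Nothing clears the bar: take the most likely success and warn.
    best = max(candidates, key=lambda m: candidates[m][0])
    return best, "no_model_meets_threshold"
```

For example, with `{"big": (0.95, 10.0), "small": (0.80, 1.0)}` and `tau=0.735`, `pick` returns `small`, because both models clear the bar and `small` is cheaper. Raising `tau` to 0.9 flips the pick to `big`.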
The cost/quality slider
cost_quality_tradeoff (0–10, default 5) maps to a quality threshold:

    τ = τ_min + (cost_quality_tradeoff / 10) × (τ_max − τ_min)

with τ_min = 0.55 and τ_max = 0.92 by default. 0 means “cheapest model that’s acceptable”; 10 means “highest quality regardless of cost”. A request’s min_quality constraint raises the floor. The slider also shifts the ranking weight between predicted success and normalized cost.
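As a worked illustration of the mapping, the sketch below computes τ from the slider using the default bounds quoted above. The function name is hypothetical, and treating min_quality as a simple lower bound (`max`) is an assumption about how "raises the floor" is applied.

```python
from typing import Optional

TAU_MIN = 0.55  # default lower bound from the text
TAU_MAX = 0.92  # default upper bound from the text


def quality_threshold(cost_quality_tradeoff: float = 5,
                      min_quality: Optional[float] = None) -> float:
    """Map the 0-10 slider to a quality threshold tau."""
    if not 0 <= cost_quality_tradeoff <= 10:
        raise ValueError("cost_quality_tradeoff must be between 0 and 10")
    tau = TAU_MIN + (cost_quality_tradeoff / 10) * (TAU_MAX - TAU_MIN)
    # Assumption: min_quality acts as a floor on the computed threshold.
    if min_quality is not None:
        tau = max(tau, min_quality)
    return tau
```

With the defaults, the slider at 0 gives τ = 0.55, at 5 gives τ = 0.735, and at 10 gives τ = 0.92 (up to floating-point rounding).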
Cost-basis tiers (estimate → observed → rescaled)
This is the single most important accuracy mechanism. A flat token estimate assumes a fixed output length, so it ignores reasoning/thinking tokens, which mis-ranks a model with cheap list prices but heavy internal reasoning. Minima instead ranks candidates by what they actually cost.
One basis is chosen for the whole candidate set so all costs are compared like-for-like (choose_cost_basis), preferring the most grounded tier every candidate supports:
| Tier | Used when | How cost is computed | Breakdown key |
|---|---|---|---|
| rescaled | every candidate has enough observations carrying output_tokens | this_request_input_tokens × input_price + observed_median_output_tokens × output_price — size-exact and reasoning-aware | rescaled, obs_output_tokens |
| observed | every candidate has enough realized cost_usd observations | robust similarity-weighted median of realized cost_usd per call | observed_avg |
| estimate | cold start | input_tokens × input_price + output_tokens × output_price, using the request’s expected tokens or per-task-type defaults | input, output |
A small minimum number of observations per candidate (default 3) gates the observed and rescaled tiers. The chosen basis is reflected in each RankedModel.est_cost_breakdown, and the rationale tags the number obs (grounded) or est (cold). The realized cost_usd / input_tokens / output_tokens come from your POST /v1/feedback calls — so the more you feed back, the more the ranking climbs from estimate → observed → rescaled.
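The tier selection and the three cost formulas can be sketched as follows. Only the name choose_cost_basis, the tier names, and the default gate of 3 come from this page. The `CandidateStats` type and per-token price units are assumptions, and a plain median stands in for the similarity-weighted median described in the table.

```python
from dataclasses import dataclass, field
from statistics import median
from typing import Dict, List

MIN_OBS = 3  # default observation gate from the text


@dataclass
class CandidateStats:
    """Per-model prices and feedback history (hypothetical shape)."""
    input_price: float   # USD per input token (assumed unit)
    output_price: float  # USD per output token (assumed unit)
    observed_costs: List[float] = field(default_factory=list)       # realized cost_usd
    observed_output_tokens: List[int] = field(default_factory=list)


def choose_cost_basis(cands: Dict[str, CandidateStats],
                      min_obs: int = MIN_OBS) -> str:
    """Pick the most grounded tier that every candidate supports."""
    stats = list(cands.values())
    if not stats:
        return "estimate"
    if all(len(c.observed_output_tokens) >= min_obs for c in stats):
        return "rescaled"
    if all(len(c.observed_costs) >= min_obs for c in stats):
        return "observed"
    return "estimate"


def estimate_cost(c: CandidateStats, basis: str,
                  input_tokens: int, expected_output_tokens: int) -> float:
    """Cost of one call under the chosen basis."""
    if basis == "rescaled":
        # This request's input size, but output length as actually observed.
        return (input_tokens * c.input_price
                + median(c.observed_output_tokens) * c.output_price)
    if basis == "observed":
        # Simplification: unweighted median of realized cost per call.
        return median(c.observed_costs)
    # Cold start: flat estimate from expected tokens.
    return input_tokens * c.input_price + expected_output_tokens * c.output_price
```

Because one basis is applied to the whole candidate set, a single candidate with too few observations pulls every model down to the less grounded tier, which keeps the comparison like-for-like.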
Escalation to a cheap-LLM reasoner
When deterministic evidence is thin or conflicting, Minima can consult a cheap LLM (an inexpensive reasoning model such as Anthropic Haiku or Gemini Flash). It fires only when allow_llm_escalation is true and any of:
- thin evidence — too little recalled history overall, or too few candidate models with any neighbor;
- low confidence — the recommended model’s neighborhood confidence is too low;
- conflict/tie — the top two candidates’ scores are within a small margin.
On trigger, Minima builds a memory context block, asks the reasoner to rank the candidates with structured output, and blends the reasoner’s predicted success with the deterministic one. On any reasoner error or parse failure it falls back to the deterministic result and warns reasoner_failed. The reasoner is the explicit slow tier and never touches your real LLM call.
decision_basis on the response tells you which path won: memory, prior, or llm.
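The trigger conditions listed above can be expressed as a small predicate. This is a sketch only: the `Evidence` type, the reason labels, and every numeric threshold are hypothetical, since the page does not state the actual cut-offs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    """Summary of the deterministic result (hypothetical shape)."""
    total_neighbors: int        # recalled records overall
    models_with_neighbors: int  # candidates with at least one neighbor
    top_confidence: float       # neighborhood confidence of the recommended model
    scores: List[float]         # candidate scores, sorted descending


def should_escalate(ev: Evidence, allow_llm_escalation: bool,
                    min_neighbors: int = 5, min_models: int = 2,
                    min_confidence: float = 0.6,
                    tie_margin: float = 0.02) -> List[str]:
    """Return the reasons to consult the reasoner; empty means do not escalate."""
    if not allow_llm_escalation:
        return []
    reasons = []
    if ev.total_neighbors < min_neighbors or ev.models_with_neighbors < min_models:
        reasons.append("thin_evidence")
    if ev.top_confidence < min_confidence:
        reasons.append("low_confidence")
    if len(ev.scores) >= 2 and ev.scores[0] - ev.scores[1] < tie_margin:
        reasons.append("conflict")
    return reasons
```

Returning the list of reasons rather than a bare boolean makes it easy to surface why the slow tier fired.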
How it gets better over time
| Phase | What’s happening | Typical decision_basis |
|---|---|---|
| Cold start (day 0) | no history; leans on capability priors and flat estimates; reasoner fires often | prior (with cold_start) |
| Warming up | /feedback outcomes cross MIN_N; cost basis climbs estimate → observed → rescaled; reasoner fires less | mix of memory and prior |
| Mature | dense history; most picks are empirical; reflection has promoted durable lessons; selection bias in your routing history has been corrected | mostly memory |
New accounts start with a baseline of benchmark-derived history so picks are useful from day one; it’s progressively dominated by your own /v1/feedback outcomes as they accumulate.
The learning loop in detail
When POST /v1/feedback is called:
- Resolve the recommendation_id → the recalled neighbors, cluster, and scope. (Account-scoped: an id from another account resolves to nothing, so accounts can’t credit or poison each other.)
- Upsert one durable outcome record per (task cluster, model), carrying cost_usd, input_tokens, output_tokens, and quality_score.
- Credit the exact recalled neighbors that drove the pick, bumping their reinforcement counters and reliability.
- On a verified-in-production strong success, promote a durable lesson that feeds rule promotion.
- Periodically reflect (after a number of feedbacks, or on any verified-prod failure) to promote run → session → account-level lessons.
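The first three steps (resolve, upsert, credit) can be illustrated with an in-memory sketch. The `Store` and `Recommendation` types are hypothetical stand-ins for the hosted memory layer, and lesson promotion and reflection are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Recommendation:
    """What was remembered about one recommendation (hypothetical shape)."""
    account: str
    cluster: str             # task cluster, e.g. "code:hard"
    model: str
    neighbor_ids: List[str]  # the recalled records that drove the pick


@dataclass
class Store:
    recommendations: Dict[str, Recommendation] = field(default_factory=dict)
    outcomes: Dict[Tuple[str, str, str], dict] = field(default_factory=dict)
    reinforcement: Dict[str, int] = field(default_factory=dict)


def apply_feedback(store: Store, account: str,
                   recommendation_id: str, outcome: dict) -> bool:
    """Record an outcome and credit the neighbors behind the pick."""
    rec = store.recommendations.get(recommendation_id)
    # Account-scoped resolve: an id from another account resolves to nothing.
    if rec is None or rec.account != account:
        return False
    # Upsert: one durable record per (task cluster, model) within the account.
    store.outcomes[(account, rec.cluster, rec.model)] = dict(outcome)
    # Credit exactly the neighbors that were recalled for this decision.
    for nid in rec.neighbor_ids:
        store.reinforcement[nid] = store.reinforcement.get(nid, 0) + 1
    return True
```

Checking the account before any write is what prevents one account from crediting or poisoning another's memories.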
Degradation behavior
Minima is designed to keep serving when Mubit is slow or down:
| Condition | Behavior |
|---|---|
| Recall timeout | Prior-only recommendation + recall_timeout warning. |
| Mubit unavailable | Prior-only + memory_unavailable. |
| Stale prices | Last-good price snapshot used; catalog_stale: true + prices_stale. |
| Reasoner error | Deterministic result + reasoner_failed. |
| Reasoner not configured | Escalation surfaced as warning; deterministic result used. |
| No models match constraints | 422 NoCandidatesError. |
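The first two rows of the table follow a common pattern: bound the recall with a hard timeout and fall back to a prior-only answer with a warning. The sketch below shows that pattern using Python's standard `concurrent.futures`. The function name, the `MemoryUnavailable` exception, and the return shape are assumptions for illustration.

```python
import concurrent.futures as cf
from typing import Callable, List, Tuple


class MemoryUnavailable(Exception):
    """Hypothetical error raised when the memory backend cannot be reached."""


def recall_or_degrade(recall: Callable[[], List[dict]],
                      timeout_s: float) -> Tuple[List[dict], str, List[str]]:
    """Return (neighbors, decision path, warnings), never raising on recall trouble."""
    pool = cf.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(recall)
    try:
        neighbors = future.result(timeout=timeout_s)
    except cf.TimeoutError:
        # Hard timeout hit: serve from priors and say so.
        return [], "prior", ["recall_timeout"]
    except MemoryUnavailable:
        return [], "prior", ["memory_unavailable"]
    finally:
        # Do not block the response while a slow recall finishes in the background.
        pool.shutdown(wait=False)
    if not neighbors:
        # No history yet: prior-only path without a degradation warning.
        return [], "prior", []
    return neighbors, "memory", []
```

The caller always gets a usable answer plus a list of warnings, so a slow or unreachable memory layer degrades recommendation quality rather than availability.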