Concepts

The problem Minima solves

LLM workflows overspend by sending every call to a top-tier model when a cheaper model would do a portion of the work just as well. Token cost is the lever; model choice is the cheapest knob to turn. Minima turns that knob, per task, based on what models have actually done on similar tasks before.

Minima only recommends. It does not proxy your call, execute a model, rewrite prompts, cache, or compress. You ask “which model should run this?”, it answers, and you run the model yourself in your own stack. Because Minima sits beside your call rather than in front of it, it adds zero latency to the actual LLM request. The only Minima round-trip is the recommendation lookup (typically 100–300 ms).

The loop

        ┌─────────────────────────────────────────────────────────┐
        │                                                           │
        ▼                                                           │
  POST /v1/recommend  ──▶  you run the model  ──▶  POST /v1/feedback
   (recall + rank)          (your stack)            (write outcome,
                                                     reinforce memory)
        ▲                                                           │
        │              memory gets sharper for next time           │
        └─────────────────────────────────────────────────────────┘

Recommend. Minima recalls similar past task → model → outcome records from Mubit, aggregates each candidate model’s empirical success rate, combines it with cost and capability priors, and returns the cheapest model expected to clear a quality bar.
Run it yourself. Minima hands back a recommendation_id; you run the recommended model.
Feed back. You report the outcome and a quality score. Minima writes the outcome to Mubit, reinforces the exact memories that drove the decision, and (on strong verified-in-production results) promotes a durable lesson.
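
The three steps above can be sketched as one function. This is a minimal illustration, not the official client: the `post` transport, the `task` request field, and the `recommended_model` response field are assumptions made for the example. Only the two endpoint paths, `recommendation_id`, and `quality_score` come from this page.

```python
from typing import Any, Callable, Dict

# Hypothetical transport: any callable that POSTs a JSON body to a path
# and returns the decoded JSON response.
PostFn = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def run_with_minima(
    task: str,
    post: PostFn,
    run_model: Callable[[str, str], str],
    score: Callable[[str], float],
) -> str:
    """One pass around the loop: recommend, run the model yourself, feed back."""
    # 1. Recommend: ask which model should run this task.
    rec = post("/v1/recommend", {"task": task})  # "task" is an assumed field name
    model = rec["recommended_model"]             # assumed response field name

    # 2. Run it yourself: Minima is not in the path of this call.
    output = run_model(model, task)

    # 3. Feed back: report the outcome against the same recommendation_id.
    post("/v1/feedback", {
        "recommendation_id": rec["recommendation_id"],
        "quality_score": score(output),
    })
    return output
```

Passing the transport in as a callable keeps the sketch independent of any particular HTTP library and makes the loop easy to exercise with a stub.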

Why memory

The recommendation engine is non-parametric k-NN over history: recall similar past records, aggregate per-model success, pick the cheapest model clearing a threshold. Minima is backed by Mubit, which provides that substrate — semantic recall over server-side embeddings, per-entry reinforcement with a Bayesian reliability estimate, lesson promotion, and strategy surfacing for explainability. You don’t operate any of it; it’s part of the hosted service.
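
To make "non-parametric k-NN over history" concrete, here is a toy version of recall, aggregate, and pick. It is a sketch under simplifying assumptions: embeddings are plain lists compared by cosine similarity, every neighbor counts equally, and the record shape, `k`, and `threshold` values are illustrative rather than Minima's actual parameters.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

# (embedding, model, succeeded): a hypothetical shape for a past outcome record.
Record = Tuple[Sequence[float], str, bool]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_pick(
    query: Sequence[float],
    history: List[Record],
    price: Dict[str, float],
    k: int = 5,
    threshold: float = 0.7,
) -> Optional[str]:
    # Recall: keep the k records most similar to the query.
    neighbors = sorted(history, key=lambda r: cosine(query, r[0]), reverse=True)[:k]

    # Aggregate: unweighted success rate per model among those neighbors.
    wins: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for _, model, ok in neighbors:
        total[model] += 1
        wins[model] += int(ok)

    # Pick: cheapest model whose success rate clears the threshold, if any.
    passing = [m for m in total if wins[m] / total[m] >= threshold]
    return min(passing, key=lambda m: price[m]) if passing else None
```

There is no training step: adding a record to `history` changes the next answer immediately, which is why feedback sharpens recommendations without a model update.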

The recommendation algorithm

For each request Minima:

Classifies the task. It uses your task_type / difficulty hints if given; otherwise a fast heuristic infers them. If the heuristic is uncertain and escalation is allowed, the cheap-LLM reasoner can refine the classification. From this it derives a task cluster (e.g. code:hard).
Selects candidates. It starts from the full model catalog, applies your constraint filters (candidate_models, allowed_providers, excluded_models, require_prompt_caching, require_context_window), pre-ranks by capability prior, and caps to max_candidates.
Recalls similar past outcome records scoped to your account (and namespace, if set), with a hard timeout. On timeout or no history, it falls back to the prior-only path.
Aggregates per model. Each recalled neighbor is weighted by similarity × reliability × staleness_decay, then combined into a Beta-smoothed empirical success rate per candidate (so models with no neighbors fall back to their capability prior, not to 0.5). An inverse-propensity weighting step corrects for the selection bias introduced by having historically sent certain task types to certain models.
Scores each candidate by combining predicted success with estimated cost (see Cost-basis tiers below). The slider sets a quality threshold τ.
Optimizes. Among models predicted to clear τ, it recommends the cheapest (tie-break: higher success, then higher confidence). If none clear τ, it recommends the highest-predicted-success model and warns no_model_meets_threshold. A fallback_model is chosen as a more reliable retry target.
Escalates to a cheap-LLM reasoner when evidence is thin or conflicting (see Escalation to a cheap-LLM reasoner below).
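
The aggregation and optimization steps can be sketched as follows. This is an illustration under stated assumptions, not Minima's implementation: the exponential form of the staleness decay, the 30-day half-life, and `prior_strength` are hypothetical, the inverse-propensity correction is omitted, and the confidence tie-break is left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Neighbor:
    model: str
    success: bool
    similarity: float   # 0..1
    reliability: float  # 0..1
    age_days: float


def staleness_decay(age_days: float, half_life_days: float = 30.0) -> float:
    # Exponential decay is an assumption; the text only names the factor.
    return 0.5 ** (age_days / half_life_days)


def predicted_success(
    model: str,
    neighbors: List[Neighbor],
    prior: float,
    prior_strength: float = 2.0,  # hypothetical pseudo-count given to the prior
) -> float:
    """Beta-smoothed success rate that falls back to `prior` with no neighbors."""
    weighted_wins = 0.0
    weight_sum = 0.0
    for n in neighbors:
        if n.model != model:
            continue
        w = n.similarity * n.reliability * staleness_decay(n.age_days)
        weight_sum += w
        weighted_wins += w * float(n.success)
    return (prior * prior_strength + weighted_wins) / (prior_strength + weight_sum)


def optimize(
    success: Dict[str, float], cost: Dict[str, float], tau: float
) -> Tuple[str, List[str]]:
    """Cheapest model clearing tau; otherwise the best available plus a warning."""
    clearing = [m for m in success if success[m] >= tau]
    if clearing:
        # Tie-break on cost by preferring higher predicted success.
        return min(clearing, key=lambda m: (cost[m], -success[m])), []
    return max(success, key=lambda m: success[m]), ["no_model_meets_threshold"]
```

With zero matching neighbors the weighted sums vanish and the expression reduces to `prior`, which is the fallback behavior described in the aggregation step.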

The cost/quality slider

cost_quality_tradeoff (0–10, default 5) maps to a quality threshold:

τ = τ_min + (cost_quality_tradeoff / 10) × (τ_max − τ_min)

with τ_min = 0.55 and τ_max = 0.92 by default. 0 means “cheapest model that’s acceptable”; 10 means “highest quality regardless of cost”. A request’s min_quality constraint raises the floor. The slider also shifts the ranking weight between predicted success and normalized cost.
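
As a worked example, the default slider value of 5 gives τ = 0.55 + 0.5 × (0.92 − 0.55) = 0.735. The function below is a small sketch of the mapping. Treating `min_quality` as a simple lower bound via `max` is an assumed reading of "raises the floor", and the function name is hypothetical.

```python
def quality_threshold(
    cost_quality_tradeoff: float = 5,
    min_quality: float = 0.0,
    tau_min: float = 0.55,
    tau_max: float = 0.92,
) -> float:
    """Map the 0-10 slider to a quality threshold tau."""
    if not 0 <= cost_quality_tradeoff <= 10:
        raise ValueError("cost_quality_tradeoff must be between 0 and 10")
    tau = tau_min + (cost_quality_tradeoff / 10) * (tau_max - tau_min)
    # Assumption: min_quality acts as a lower bound on the resulting threshold.
    return max(tau, min_quality)
```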

Cost-basis tiers (estimate → observed → rescaled)

Cost basis is the single most important accuracy mechanism. A flat token estimate assumes a fixed output length, so it ignores reasoning/thinking tokens, which mis-ranks a model with cheap list prices but heavy internal reasoning. Minima instead ranks candidates by what they really cost.

One basis is chosen for the whole candidate set so all costs are compared like-for-like (choose_cost_basis), preferring the most grounded tier every candidate supports:

| Tier | Used when | How cost is computed | Breakdown key |
| --- | --- | --- | --- |
| rescaled | every candidate has enough observations carrying `output_tokens` | `this_request_input_tokens × input_price + observed_median_output_tokens × output_price`; size-exact and reasoning-aware | `rescaled`, `obs_output_tokens` |
| observed | every candidate has enough realized `cost_usd` observations | robust similarity-weighted median of realized `cost_usd` per call | `observed_avg` |
| estimate | cold start | `input_tokens × input_price + output_tokens × output_price`, using the request’s expected tokens or per-task-type defaults | `input`, `output` |

A small minimum number of observations per candidate (default 3) gates the observed and rescaled tiers. The chosen basis is reflected in each RankedModel.est_cost_breakdown, and the rationale tags the cost figure as obs (grounded) or est (cold). The realized cost_usd / input_tokens / output_tokens come from your POST /v1/feedback calls, so the more you feed back, the more the ranking climbs from estimate → observed → rescaled.
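
The tier selection can be sketched as below. Only the name `choose_cost_basis`, the tier names, and the default gate of 3 come from this page. The signature, the `CostHistory` shape, and the per-token price units are assumptions, and a plain median stands in for the similarity-weighted one.

```python
from dataclasses import dataclass, field
from statistics import median
from typing import Dict, List

MIN_OBSERVATIONS = 3  # default gate for the observed and rescaled tiers


@dataclass
class CostHistory:
    """Hypothetical per-candidate record of what feedback has reported."""
    cost_usd: List[float] = field(default_factory=list)
    output_tokens: List[int] = field(default_factory=list)


def choose_cost_basis(history: Dict[str, CostHistory]) -> str:
    """Pick one basis for the whole set: the most grounded tier all support."""
    if not history:
        return "estimate"
    if all(len(h.output_tokens) >= MIN_OBSERVATIONS for h in history.values()):
        return "rescaled"
    if all(len(h.cost_usd) >= MIN_OBSERVATIONS for h in history.values()):
        return "observed"
    return "estimate"


def rescaled_cost(
    input_tokens: int,
    input_price: float,   # assumed to be USD per token
    output_price: float,  # assumed to be USD per token
    observed_output_tokens: List[int],
) -> float:
    """This request's input size plus the observed median output length."""
    return input_tokens * input_price + median(observed_output_tokens) * output_price
```

A single candidate without enough observations pulls the whole set down a tier, which keeps every cost in the comparison on the same basis.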

Escalation to a cheap-LLM reasoner

When deterministic evidence is thin or conflicting, Minima can consult a cheap LLM (an inexpensive reasoning model such as Anthropic's Claude Haiku or Google's Gemini Flash). It fires only when allow_llm_escalation is true and at least one of the following holds:

thin evidence — too little recalled history overall, or too few candidate models with any neighbor;
low confidence — the recommended model’s neighborhood confidence is too low;
conflict/tie — the top two candidates’ scores are within a small margin.

On trigger, Minima builds a memory context block, asks the reasoner to rank the candidates with structured output, and blends the reasoner’s predicted success with the deterministic one. On any reasoner error or parse failure it falls back to the deterministic result and warns reasoner_failed. The reasoner is the explicit slow tier and never touches your real LLM call.
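
The trigger conditions and the blend can be expressed as a short predicate. The thresholds, the `Evidence` fields, and the linear blend with `reasoner_weight` are all hypothetical. The page only says that these three conditions exist and that the two predictions are blended.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the real values are not documented on this page.
MIN_NEIGHBORS = 5
MIN_MODELS_WITH_NEIGHBORS = 2
MIN_CONFIDENCE = 0.5
TIE_MARGIN = 0.02


@dataclass
class Evidence:
    total_neighbors: int
    models_with_neighbors: int
    top_confidence: float
    top_score: float
    runner_up_score: float


def should_escalate(e: Evidence, allow_llm_escalation: bool) -> bool:
    """True when escalation is allowed and any one trigger condition holds."""
    if not allow_llm_escalation:
        return False
    thin = (
        e.total_neighbors < MIN_NEIGHBORS
        or e.models_with_neighbors < MIN_MODELS_WITH_NEIGHBORS
    )
    low_confidence = e.top_confidence < MIN_CONFIDENCE
    tie = abs(e.top_score - e.runner_up_score) < TIE_MARGIN
    return thin or low_confidence or tie


def blend(deterministic: float, reasoner: float, reasoner_weight: float = 0.3) -> float:
    """Assumed linear blend of the two predicted success values."""
    return (1 - reasoner_weight) * deterministic + reasoner_weight * reasoner
```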

decision_basis on the response tells you which path won: memory, prior, or llm.

How it gets better over time

| Phase | What’s happening | Typical `decision_basis` |
| --- | --- | --- |
| Cold start (day 0) | no history; leans on capability priors and flat estimates; reasoner fires often | `prior` (with `cold_start`) |
| Warming up | `/feedback` outcomes cross `MIN_N`; cost basis climbs estimate → observed → rescaled; reasoner fires less | mix of `memory` and `prior` |
| Mature | dense history; most picks are empirical; reflection has promoted durable lessons; selection bias in your routing history has been corrected | mostly `memory` |

New accounts start with a baseline of benchmark-derived history so picks are useful from day one; it’s progressively dominated by your own /v1/feedback outcomes as they accumulate.

The learning loop in detail

When POST /v1/feedback is called:

Resolve the recommendation_id to the recalled neighbors, cluster, and scope. (Account-scoped: an id from another account resolves to nothing, so accounts can’t credit or poison each other.)
Upsert one durable outcome record per (task cluster, model), carrying cost_usd, input_tokens, output_tokens, and quality_score.
Credit the exact recalled neighbors that drove the pick, bumping their reinforcement counters and reliability.
On a verified-in-production strong success, promote a durable lesson that feeds rule promotion.
Periodically reflect (after a number of feedbacks, or on any verified-prod failure) to promote run → session → account-level lessons.
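
The first three steps (resolve, upsert, credit) might look like this in miniature. It is a sketch only: the in-memory `Store`, the Beta-count model of reliability, the `success_threshold`, and the choice to lower reliability on a poor outcome are assumptions for illustration. Lesson promotion and reflection are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Memory:
    # Assumed Beta(1, 1) prior over "this memory leads to good picks".
    successes: float = 1.0
    failures: float = 1.0
    reinforcements: int = 0

    @property
    def reliability(self) -> float:
        return self.successes / (self.successes + self.failures)


@dataclass
class Store:
    # recommendation_id -> (account, cluster, model, recalled neighbor ids)
    recommendations: Dict[str, Tuple[str, str, str, List[str]]] = field(default_factory=dict)
    memories: Dict[str, Memory] = field(default_factory=dict)
    outcomes: Dict[Tuple[str, str, str], dict] = field(default_factory=dict)


def handle_feedback(
    store: Store,
    account: str,
    recommendation_id: str,
    quality_score: float,
    cost_usd: float,
    input_tokens: int,
    output_tokens: int,
    success_threshold: float = 0.7,  # hypothetical cut-off for "success"
) -> bool:
    # Resolve: an id owned by another account resolves to nothing.
    rec = store.recommendations.get(recommendation_id)
    if rec is None or rec[0] != account:
        return False
    _, cluster, model, neighbor_ids = rec

    # Upsert: one outcome record per (task cluster, model) within the account.
    store.outcomes[(account, cluster, model)] = {
        "quality_score": quality_score,
        "cost_usd": cost_usd,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }

    # Credit: update exactly the neighbors that drove the pick.
    succeeded = quality_score >= success_threshold
    for nid in neighbor_ids:
        mem = store.memories[nid]
        mem.reinforcements += 1
        if succeeded:
            mem.successes += 1
        else:
            mem.failures += 1
    return True
```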

Degradation behavior

Minima is designed to keep serving when Mubit is slow or down:

| Condition | Behavior |
| --- | --- |
| Recall timeout | Prior-only recommendation + `recall_timeout` warning. |
| Mubit unavailable | Prior-only recommendation + `memory_unavailable` warning. |
| Stale prices | Last-good price snapshot used; `catalog_stale: true` + `prices_stale` warning. |
| Reasoner error | Deterministic result + `reasoner_failed` warning. |
| Reasoner not configured | Escalation surfaced as a warning; deterministic result used. |
| No models match constraints | `422 NoCandidatesError`. |
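
The first two rows follow a common pattern: bound the memory lookup with a hard timeout and degrade to the prior-only path instead of failing. The sketch below shows that pattern with Python's standard `concurrent.futures`. The function name and the use of `ConnectionError` to signal an unavailable backend are assumptions, and only the two warning strings come from the table.

```python
import concurrent.futures
from typing import Callable, List, Tuple


def recall_with_fallback(
    recall: Callable[[], List[dict]], timeout_s: float
) -> Tuple[List[dict], List[str]]:
    """Return (neighbors, warnings); empty neighbors means use the prior-only path."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(recall)
    try:
        return future.result(timeout=timeout_s), []
    except concurrent.futures.TimeoutError:
        # Hard timeout hit: answer from priors rather than wait on memory.
        return [], ["recall_timeout"]
    except ConnectionError:
        # Assumed signal that the memory backend is unreachable.
        return [], ["memory_unavailable"]
    finally:
        # Do not block the response on a lookup that is still running.
        pool.shutdown(wait=False)
```

The caller always gets a usable result, and the warning list tells it which degraded path produced the answer.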