Common questions

Is an 85%-90% input reduction guaranteed?

No. It is an upper range for suitable workloads. Actual reduction depends on the task and context.

Is compression the same as a cache hit?

No. Compression reduces submitted input. A cache hit reuses eligible context under the upstream provider's rules.

Does a 0.3x model make every request cost 3%-4.5% of official rates?

No. That range applies to the input-context component under the stated compression condition. Output and cache charges are separate.

Does compression have no effect on output quality?

No universal guarantee is made. Users should compare representative results, especially for tasks that depend on small details in a large context.

Are cache savings included in the table?

No. Actual cache hits are controlled by the model provider and request pattern, so the table leaves them out.

How context compression changes token use in long AI tasks

Start with the input context

An AI request can include the current instruction, conversation history, files, tool results, and other context. In a long coding or research session, that input can become much larger than the answer. Repeatedly submitting it is often a major part of token use.

Input-context compression removes or condenses material before the request reaches the model. SOTA Token Plan can reduce submitted input-token volume by up to 85%-90% for suitable workloads. That figure is an upper range, not a promise for every task.

Compression and caching are different

Compression reduces the token volume submitted for the current request. A cache hit reuses eligible context under the upstream model provider's rules. They can both lower cost, but the calculation is different.

Prehendo supplies a one-hour cache configuration where the selected model supports it. The model provider and the shape of later requests determine whether the cache is hit. Because that result is outside Prehendo's control, the examples below do not count a cache discount.

How the input-context calculation works

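If compression removes 85%-90% of the original input, 10%-15% remains. Multiply that remaining share by the displayed model multiplier. The result is the input-context component, expressed as a fraction of the comparable official input price.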

For a model displayed at 0.3x, the arithmetic is 0.3 × 0.15 = 0.045 and 0.3 × 0.10 = 0.03. Under those conditions, the input-context component is 3%-4.5% of the comparable official input price.
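The same arithmetic can be written as a short Python sketch. The function name `input_component_share` and its parameters are illustrative assumptions, not part of any Prehendo API. The sketch covers only the input-context component under a stated reduction.

```python
def input_component_share(multiplier: float, reduction: float) -> float:
    """Return the input-context component as a fraction of the official input price.

    multiplier: displayed model multiplier, e.g. 0.3 for a model shown at 0.3x
    reduction:  share of input removed by compression, e.g. 0.85 for 85%
    """
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    remaining_share = 1.0 - reduction  # share of the input still submitted
    return multiplier * remaining_share


for reduction in (0.85, 0.90):
    share = input_component_share(0.3, reduction)
    print(f"{reduction:.0%} reduction at 0.3x: {share:.1%} of official input price")
```

Passing 0.9 or 1.1 as the multiplier gives the other rows of the table in this article.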

Why total request cost varies

The 3%-4.5% figure is not the price of the whole request. Output tokens are not reduced by the input-compression ratio. Cache writes and cache reads may have separate rates. A response-heavy task will therefore have a different effective discount from a task dominated by a long repeated input.

Model choice, package terms, the input/output mix, actual compression, and cache reuse all affect the final amount. The honest way to compare cost is to look at a real workload and keep each component separate.
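As a rough illustration of why the mix matters, the Python sketch below keeps the input and output components separate for two workloads. Everything in it is an assumption made for the example: the function `cost_breakdown`, the per-token prices, the token counts, and in particular the choice to bill output at the same multiplier. Cache writes and reads are left out entirely. Check the actual package terms before using numbers like these.

```python
def cost_breakdown(
    input_tokens: int,
    output_tokens: int,
    official_input_price: float,
    official_output_price: float,
    multiplier: float,
    reduction: float,
) -> dict[str, float]:
    """Estimate each cost component separately; cache charges are not modeled."""
    submitted_input = input_tokens * (1.0 - reduction)
    input_cost = submitted_input * official_input_price * multiplier
    # Assumption: output is billed at the same multiplier and is never compressed.
    output_cost = output_tokens * official_output_price * multiplier
    official_cost = (
        input_tokens * official_input_price + output_tokens * official_output_price
    )
    total = input_cost + output_cost
    return {
        "input": input_cost,
        "output": output_cost,
        "total": total,
        "share_of_official": total / official_cost,
    }


# Hypothetical per-token prices, chosen only to make the arithmetic visible.
INPUT_PRICE = 3.0 / 1_000_000
OUTPUT_PRICE = 15.0 / 1_000_000

input_heavy = cost_breakdown(1_000_000, 10_000, INPUT_PRICE, OUTPUT_PRICE, 0.3, 0.90)
output_heavy = cost_breakdown(10_000, 100_000, INPUT_PRICE, OUTPUT_PRICE, 0.3, 0.90)

for name, result in (("input-heavy", input_heavy), ("output-heavy", output_heavy)):
    print(f"{name}: {result['share_of_official']:.1%} of the official total")
```

Under these made-up inputs, the input-heavy request lands much closer to the 3% input figure than the output-heavy one, because the output tokens are not reduced by compression.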

Check quality on your own workload

Compression is designed to retain context that matters to the task, but no compression method should be treated as invisible in every situation. A repository refactor, a legal document, and a long creative draft do not need the same details.

Test representative work before relying on a savings estimate. Compare the answer, not just the token count. If a task depends on small details spread across a large context, use a less aggressive setup or send the original material.
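One way to keep both measurements side by side is a small trial log, sketched below in Python. The `TrialResult` structure, the task names, and the token counts are hypothetical placeholders, and `answer_acceptable` stands for a reviewer's judgment made by comparing the answer with the one produced from the uncompressed input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialResult:
    task: str
    original_input_tokens: int
    compressed_input_tokens: int
    answer_acceptable: bool  # reviewer's judgment, not an automatic metric


def reduction(trial: TrialResult) -> float:
    """Measured input reduction for one trial, as a fraction of the original."""
    return 1.0 - trial.compressed_input_tokens / trial.original_input_tokens


def summarize(trials: list[TrialResult]) -> dict:
    """Report token reduction and answer quality together, never one alone."""
    return {
        "trials": len(trials),
        "mean_reduction": sum(reduction(t) for t in trials) / len(trials),
        "rejected_tasks": [t.task for t in trials if not t.answer_acceptable],
    }


# Placeholder numbers for illustration only, not measured results.
trials = [
    TrialResult("repository refactor", 100_000, 12_000, True),
    TrialResult("legal review", 80_000, 20_000, False),
]
summary = summarize(trials)
print(f"mean reduction: {summary['mean_reduction']:.1%}")
print(f"tasks needing a less aggressive setup: {summary['rejected_tasks']}")
```

Tasks that end up in `rejected_tasks` are the candidates for a less aggressive setup or for sending the original material.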

Theoretical input-context component after compression

Displayed multiplier	85% input reduction	90% input reduction
0.3x	4.5% of official input price	3% of official input price
0.9x	13.5% of official input price	9% of official input price
1.1x	16.5% of official input price	11% of official input price

These figures describe the input-context component, not the complete request. Total request cost varies.
