Context caching
What is context caching?
Context caching is the ability to cache the KV (key-value) computations of the same context between requests.
This feature can speed up inference by over 10 times when a request shares the same context as a previous one.
When benchmarked on evaluation datasets such as M3Exam, the speedup is about 2 times.
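The core idea can be illustrated with a toy sketch: the KV computation for a context prefix is stored once and reused by later requests that begin with the same prefix. The function names and the dict-based cache below are illustrative assumptions, not the production implementation.

```python
# Toy illustration of KV prefix caching (not a real inference engine).
kv_cache = {}

def compute_kv(tokens):
    """Stand-in for the expensive per-token key/value computation."""
    return [("kv", t) for t in tokens]

def kv_for_request(tokens, shared_len):
    """Reuse cached KV for the shared prefix; compute only the remainder."""
    prefix = tuple(tokens[:shared_len])
    if prefix not in kv_cache:
        # First request with this prefix pays the full cost once.
        kv_cache[prefix] = compute_kv(prefix)
    # Later requests skip the prefix and compute only their unique tokens.
    return kv_cache[prefix] + compute_kv(tokens[shared_len:])

# Two requests sharing a 3-token prefix: the second request only
# computes KV for its single unique token.
req_a = ["sys", "doc", "few", "q1"]
req_b = ["sys", "doc", "few", "q2"]
kv_a = kv_for_request(req_a, 3)
kv_b = kv_for_request(req_b, 3)
print(len(kv_a), len(kv_b), len(kv_cache))  # 4 4 1
```

The cache key here is the exact token prefix, which mirrors why requests must share an identical context for caching to apply.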
How does context caching work?
Context caching is triggered automatically when requests within the same batch share the same context.
For example, suppose an endpoint has a batch size of 8, each request contains 1,024 tokens, and the requests share the same context for 900 of those tokens.
Without caching, the system would compute 1,024 * 8 = 8,192 tokens.
With caching, it computes only ((1,024 - 900) * 8) + 1,024 = 2,016 tokens.
This significantly reduces the compute required and improves the endpoint's latency.
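The arithmetic above can be checked with a few lines of Python; the variable names are illustrative, and the formula follows the example's accounting (one full pass plus the non-shared tokens for each request in the batch).

```python
# Reproduce the token-count arithmetic from the example above.
batch_size = 8            # requests in the batch
tokens_per_request = 1024 # tokens in each request
shared_context = 900      # tokens of context shared across requests

# Without context caching: every request computes all of its tokens.
without_cache = tokens_per_request * batch_size  # 1,024 * 8

# With context caching: each request computes only its non-shared
# tokens, plus one full pass as in the example's formula.
unique_tokens = tokens_per_request - shared_context
with_cache = unique_tokens * batch_size + tokens_per_request

print(without_cache, with_cache)  # 8192 2016
```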
Use cases
RAG
Few-shot prompting
Code Co-pilot