Context caching
What is context caching?
Context caching is the ability to cache the KV (key-value) computations of the same context between requests.
This feature can speed up inference by over 10 times when a request shares the same context as a previous one.
When benchmarked on evaluation datasets such as M3Exam, the speedup is about 2 times.
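The core idea can be illustrated with a toy sketch: the KV computation for a context prefix is stored once and reused by later requests that begin with the same prefix. The function names and the dict-based cache below are illustrative assumptions, not the production implementation.

```python
# Toy illustration of KV prefix caching (not a real inference engine).
kv_cache = {}

def compute_kv(tokens):
    """Stand-in for the expensive per-token key/value computation."""
    return [("kv", t) for t in tokens]

def kv_for_request(tokens, shared_len):
    """Reuse cached KV for the shared prefix; compute only the remainder."""
    prefix = tuple(tokens[:shared_len])
    if prefix not in kv_cache:
        # First request with this prefix pays the full cost once.
        kv_cache[prefix] = compute_kv(prefix)
    # Later requests skip the prefix and compute only their unique tokens.
    return kv_cache[prefix] + compute_kv(tokens[shared_len:])

# Two requests sharing a 3-token prefix: the second request only
# computes KV for its single unique token.
req_a = ["sys", "doc", "few", "q1"]
req_b = ["sys", "doc", "few", "q2"]
kv_a = kv_for_request(req_a, 3)
kv_b = kv_for_request(req_b, 3)
print(len(kv_a), len(kv_b), len(kv_cache))  # 4 4 1
```

The cache key here is the exact token prefix, which mirrors why requests must share an identical context for caching to apply.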
How does context caching work?
Context caching is triggered automatically when requests within the same batch share the same context.
For example, suppose an endpoint has a batch size of 8, each request contains 1,024 tokens, and the requests share the same context for 900 of those tokens.
Without caching, the system would compute 1,024 * 8 = 8,192 tokens.
With caching, it computes only ((1,024 - 900) * 8) + 1,024 = 2,016 tokens.
This significantly reduces the compute required and improves the endpoint's latency.
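The arithmetic above can be checked with a few lines of Python; the variable names are illustrative, and the formula follows the example's accounting (one full pass plus the non-shared tokens for each request in the batch).

```python
# Reproduce the token-count arithmetic from the example above.
batch_size = 8            # requests in the batch
tokens_per_request = 1024 # tokens in each request
shared_context = 900      # tokens of context shared across requests

# Without context caching: every request computes all of its tokens.
without_cache = tokens_per_request * batch_size  # 1,024 * 8

# With context caching: each request computes only its non-shared
# tokens, plus one full pass as in the example's formula.
unique_tokens = tokens_per_request - shared_context
with_cache = unique_tokens * batch_size + tokens_per_request

print(without_cache, with_cache)  # 8192 2016
```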
Use cases
RAG
Few-shot prompting
Code Co-pilot