Features
All features are enabled by default to improve overall performance.
OpenAI Compatible
Supports OpenAI clients for chat and completion use cases, and Continue.dev for Copilot-style code-assistant use cases.
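For example, a standard OpenAI client can target the server by overriding its base URL. A minimal sketch, assuming the server listens at http://localhost:8000/v1 and serves a model named my-model (both are placeholders; substitute your deployment's values):

```python
# Minimal sketch: base_url, api_key, and model are placeholders;
# use the values your deployment actually exposes.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # OpenAI-compatible endpoint
    api_key="EMPTY",                      # many local servers ignore the key
)

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```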
Batch size, max input length, and number of tokens
Supports context lengths of up to 1M tokens and dynamic batching to improve hardware utilization.
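Dynamic batching is transparent to clients: requests that arrive close together are grouped into shared forward passes on the server. A sketch of issuing concurrent requests with the async OpenAI client so the server has the opportunity to batch them (endpoint and model name are the same placeholders as above):

```python
# Sketch: concurrent in-flight requests let the server form larger
# dynamic batches. base_url and model are placeholders, as above.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="my-model",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,  # cap output length per request
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Summarize topic {i} in one sentence." for i in range(8)]
    # All eight requests are in flight simultaneously, so the server
    # can group them into shared forward passes.
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for answer in answers:
        print(answer)

asyncio.run(main())
```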
Quantization
Reduces the memory footprint, allowing LLMs to be deployed at roughly half their original model size.
Improves inference speed by 2x.
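The memory saving comes from storing weights in a narrower numeric format. A toy illustration of the idea using symmetric INT8 quantization (not necessarily the scheme the server uses): an FP16 weight tensor becomes 8-bit integers plus one scale factor, halving its storage.

```python
# Toy illustration of symmetric INT8 weight quantization. Not the
# server's actual scheme, just the principle: 16-bit weights ->
# 8-bit integers + a per-tensor scale = roughly half the bytes.
import numpy as np

weights = np.random.randn(1024, 1024).astype(np.float16)

scale = float(np.abs(weights).max()) / 127.0   # map max magnitude to 127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

dequantized = q.astype(np.float16) * np.float16(scale)

print(f"fp16 size: {weights.nbytes / 2**20:.1f} MiB")
print(f"int8 size: {q.nbytes / 2**20:.1f} MiB")  # half the footprint
print(f"max abs error: {np.abs(weights - dequantized).max():.4f}")
```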
Context caching
Reduces redundant computation when requests share the same context.
Improves inference speed by 1.5x to 2x on evaluation datasets, and by more than 10x in ideal scenarios.
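Conceptually, the server keys cached attention state (the KV cache) by the request's shared context, so a repeated prefix is prefilled once and reused. A simplified sketch of the idea, not the actual implementation; compute_kv_cache below is a hypothetical stand-in for the real prefill step:

```python
# Simplified sketch of context caching: the result of "prefilling" a
# prompt prefix is keyed by its hash and reused on later requests.
import hashlib

_prefill_calls = 0
_cache: dict[str, str] = {}

def compute_kv_cache(prefix: str) -> str:
    """Hypothetical stand-in for the expensive prefill pass over the prefix."""
    global _prefill_calls
    _prefill_calls += 1
    return f"kv-state[{len(prefix)} chars]"

def get_kv_cache(prefix: str) -> str:
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key not in _cache:            # miss: run the prefill once
        _cache[key] = compute_kv_cache(prefix)
    return _cache[key]               # hit: reuse, skipping recomputation

shared = "You are a helpful assistant. <long shared instructions>"
for question in ["Q1", "Q2", "Q3"]:
    kv = get_kv_cache(shared)        # computed on the first request only
print(f"prefill ran {_prefill_calls} time(s) for 3 requests")  # -> 1
```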