Features

All features are enabled by default to improve overall performance.

OpenAI Compatible

Supports OpenAI clients for chat and completion use cases, and Continue.dev for Copilot-style use cases.
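Because the server speaks the OpenAI wire format, any OpenAI client can talk to it by pointing at the server's base URL. A minimal sketch, using only the standard library to build a `/v1/chat/completions` request body; the endpoint `http://localhost:8000/v1` and the model name `"my-model"` are placeholders, not values from this documentation — substitute your deployment's actual host and model name.

```python
import json

def chat_payload(prompt: str, model: str = "my-model", max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion request body.

    "my-model" is a placeholder; use the model name your server reports.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# POST this JSON to <base_url>/chat/completions, e.g. with urllib or the
# official openai package configured with base_url="http://localhost:8000/v1".
body = json.dumps(chat_payload("Hello!"))
```

The same base URL can be configured in Continue.dev as an OpenAI-compatible provider.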

Batch size, max input length, number of tokens

Supports long contexts up to 1M context length and dynamic batching to improve utilization.

Quantization

Reduces memory footprint, allowing LLMs to be deployed at roughly half the original model size.

Improves inference speed by up to 2 times.
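The halved footprint follows directly from the per-parameter storage cost: quantizing fp16 weights (2 bytes per parameter) to int8 (1 byte per parameter) cuts weight memory in half. A quick back-of-the-envelope check, using a 7B-parameter model as an illustrative example:

```python
def model_bytes(n_params: int, bytes_per_param: int) -> int:
    """Approximate weight memory, ignoring activations and KV cache."""
    return n_params * bytes_per_param

# 7B parameters: fp16 (2 bytes/param) vs int8 (1 byte/param)
fp16_gb = model_bytes(7_000_000_000, 2) / 1e9  # 14.0 GB
int8_gb = model_bytes(7_000_000_000, 1) / 1e9  # 7.0 GB
```

Smaller weights also mean less memory bandwidth per forward pass, which is where much of the inference speedup comes from.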

Context caching

Reduces redundant computation when requests have the same context.

Improves inference speed by 1.5 to 2 times on evaluation datasets and over 10 times in ideal scenarios.
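Context caching works by recognizing that requests sharing a token prefix can reuse the computation already done for that prefix, so only the new suffix needs a fresh forward pass. The sketch below illustrates the lookup idea with a plain dictionary; a real implementation stores KV-cache tensors rather than a placeholder string, and the function names here are hypothetical.

```python
def longest_cached_prefix(cache: dict, tokens: list):
    """Return the length of the longest cached token prefix and its state."""
    for n in range(len(tokens), 0, -1):
        key = tuple(tokens[:n])
        if key in cache:
            return n, cache[key]
    return 0, None

def run(cache: dict, tokens: list) -> int:
    """Process a request; return how many tokens needed fresh computation."""
    n, _state = longest_cached_prefix(cache, tokens)
    computed = len(tokens) - n          # tokens[:n] reuse cached work
    cache[tuple(tokens)] = "kv-state"   # placeholder for real KV tensors
    return computed

cache = {}
first = run(cache, [1, 2, 3])        # cache empty: all 3 tokens computed
second = run(cache, [1, 2, 3, 4, 5]) # prefix [1, 2, 3] cached: 2 computed
```

The "ideal scenario" speedup corresponds to requests whose context is almost entirely a cached prefix, e.g. a long shared system prompt with a short per-request question.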
