All features are enabled by default to improve overall performance.
Supports OpenAI clients for chat and completion use cases, and Continue.dev for Copilot-style code-assistant use cases.
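Because the server speaks the OpenAI API, an existing OpenAI client can be pointed at it by overriding the base URL. A minimal sketch using the official `openai` Python package; the endpoint `http://localhost:8080/v1` and the model name `my-model` are illustrative assumptions, not values from this documentation:

```python
# Reusing the official OpenAI Python client against an OpenAI-compatible
# server. The base_url, port, and model name are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # hypothetical local endpoint
    api_key="not-needed",                 # many local servers ignore the key
)

# Chat use case
chat = client.chat.completions.create(
    model="my-model",  # hypothetical model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(chat.choices[0].message.content)

# Completion use case
completion = client.completions.create(
    model="my-model",
    prompt="Once upon a time",
    max_tokens=32,
)
print(completion.choices[0].text)
```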
Supports context lengths of up to 1M tokens and dynamic batching to improve hardware utilization.
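Dynamic batching groups requests that arrive within a short window so the model runs one forward pass over many requests instead of one pass per request. A minimal sketch of the idea; the queue, window, and batch-size limits below are arbitrary illustrative choices, not this project's defaults:

```python
import queue
import time

# Collect requests that arrive within a short window (or until the batch
# is full) and hand them to the model as a single batch.
MAX_BATCH = 8    # illustrative limit
WINDOW_MS = 5    # illustrative batching window

requests: "queue.Queue[str]" = queue.Queue()

def next_batch() -> list[str]:
    batch = [requests.get()]  # block until at least one request arrives
    deadline = time.monotonic() + WINDOW_MS / 1000
    while len(batch) < MAX_BATCH and time.monotonic() < deadline:
        try:
            remaining = max(0.0, deadline - time.monotonic())
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch  # the server would now run the model once on this batch
```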
Reduces memory footprint, allowing LLMs to be deployed at half their original model size.
Improves inference speed by 2x.
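The halved footprint is consistent with storing weights in an 8-bit format instead of 16-bit, though the documentation only states the ratio. A back-of-the-envelope check, assuming a hypothetical 7B-parameter model and an FP16-to-INT8 conversion:

```python
# Back-of-the-envelope weight-memory estimate. The 7B parameter count and
# the FP16 -> INT8 conversion are illustrative assumptions; the docs only
# state that the footprint is halved.
params = 7_000_000_000

fp16_bytes = params * 2  # 2 bytes per parameter
int8_bytes = params * 1  # 1 byte per parameter

print(f"FP16:  {fp16_bytes / 2**30:.1f} GiB")    # ~13.0 GiB
print(f"INT8:  {int8_bytes / 2**30:.1f} GiB")    # ~6.5 GiB
print(f"ratio: {int8_bytes / fp16_bytes:.2f}")   # 0.50 -> half the size
```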
Reduces redundant computation when requests share the same context.
Improves inference speed by 1.5x to 2x on evaluation datasets, and by over 10x in ideal scenarios.
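Reuse of computation across requests with a shared context is commonly implemented as a prefix cache: the state computed for an already-seen prefix is looked up instead of recomputed. A minimal dictionary-style sketch of the idea; `compute_state` and `generate` are illustrative stand-ins, not this project's actual API:

```python
# Requests that share the same context (e.g. the same system prompt) reuse
# the state computed for that prefix rather than recomputing it.
from functools import lru_cache

@lru_cache(maxsize=128)
def compute_state(context: str) -> tuple:
    # Stand-in for the expensive prefill pass that builds the KV cache
    # for `context`; here we just tokenize into a dummy tuple.
    return tuple(context.split())

def generate(context: str, prompt: str) -> str:
    state = compute_state(context)  # cache hit if this context was seen before
    # Decoding would continue from `state`; we fake it with string assembly.
    return f"[{len(state)} cached context tokens] -> answer to {prompt!r}"

system = "You are a helpful assistant."
print(generate(system, "What is dynamic batching?"))
print(generate(system, "What is prefix caching?"))  # reuses the cached prefix
```

The first call pays the full cost of processing the shared context; every later request with the same prefix skips that work, which is where the large speedups in the ideal (highly repetitive) case come from.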