Quantization

Quantization is a technique to reduce the VRAM required for deploying LLMs.

Quantization involves several parameters that must be chosen carefully, because extreme quantization can lead to model collapse. Additionally, some quantization techniques, when paired with the wrong inference engine, can actually slow down the model's inference speed.

One Click Deploy sets the default quantization to 8-bit weights. This technique keeps the accuracy loss minimal while reducing the model size to half of the original.
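
To see why 8-bit weights halve the footprint: weight memory is simply parameter count times bytes per parameter. Here is a quick back-of-the-envelope sketch in Python (the 7B parameter count is a hypothetical example, not a Float16 default):

# Weight memory = number of parameters x bytes per parameter.
# Ignores KV cache and activation memory; 7B is a hypothetical model size.
def weight_vram_gb(num_params: float, bytes_per_param: float) -> float:
    return num_params * bytes_per_param / 1024**3

params = 7e9  # a 7B-parameter model
print(f"FP16: ~{weight_vram_gb(params, 2):.1f} GB")  # ~13.0 GB (2 bytes per weight)
print(f"INT8: ~{weight_vram_gb(params, 1):.1f} GB")  # ~6.5 GB, half the original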

For advanced quantization, we prioritize minimizing accuracy loss not only in a single language but also across multiple languages.

Evaluation Dataset

  • M3Exam (Multiple-choice, Accuracy)
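
Scoring on M3Exam is plain multiple-choice accuracy: the fraction of questions where the model's selected option matches the answer key. Here is a minimal sketch, assuming an illustrative question format and a placeholder model_choice function (neither is the actual evaluation harness):

# Multiple-choice accuracy: correct answers / total questions.
# `questions` and `model_choice` are illustrative stand-ins, not the real harness.
questions = [
    {"question": "...", "options": ["A", "B", "C", "D"], "answer": "B"},
    {"question": "...", "options": ["A", "B", "C", "D"], "answer": "D"},
]

def model_choice(q: dict) -> str:
    # Placeholder: in practice, prompt the deployed model and parse its pick.
    return "B"

correct = sum(model_choice(q) == q["answer"] for q in questions)
print(f"accuracy = {correct / len(questions):.2%}")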
