Quantization

Quantization is a technique to reduce the VRAM required for deploying LLMs.

Quantization involves several parameters that must be chosen carefully, because extreme quantization can lead to model collapse. Additionally, some quantization techniques, when paired with the wrong inference engine, can actually slow down the model's inference speed.

One Click Deploy sets the default quantization to 8-bit weights. This technique keeps the accuracy loss minimal while reducing the model size to half of the original.
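
To see why 8-bit weights halve the footprint: weight memory is simply parameter count times bytes per parameter. Here is a quick back-of-the-envelope sketch in Python (the 7B parameter count is a hypothetical example, not a Float16 default):

# Weight memory = number of parameters x bytes per parameter.
# Ignores KV cache and activation memory; 7B is a hypothetical model size.
def weight_vram_gb(num_params: float, bytes_per_param: float) -> float:
    return num_params * bytes_per_param / 1024**3

params = 7e9  # a 7B-parameter model
print(f"FP16: ~{weight_vram_gb(params, 2):.1f} GB")  # ~13.0 GB (2 bytes per weight)
print(f"INT8: ~{weight_vram_gb(params, 1):.1f} GB")  # ~6.5 GB, half the original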

For advanced quantization, we prioritize minimizing accuracy loss not only in a single language but also across multiple languages.

Evaluation Dataset

  • M3Exam (Multiple-choice, Accuracy)
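
Scoring on M3Exam is plain multiple-choice accuracy: the fraction of questions where the model's selected option matches the answer key. Here is a minimal sketch, assuming an illustrative question format and a placeholder model_choice function (neither is the actual evaluation harness):

# Multiple-choice accuracy: correct answers / total questions.
# `questions` and `model_choice` are illustrative stand-ins, not the real harness.
questions = [
    {"question": "...", "options": ["A", "B", "C", "D"], "answer": "B"},
    {"question": "...", "options": ["A", "B", "C", "D"], "answer": "D"},
]

def model_choice(q: dict) -> str:
    # Placeholder: in practice, prompt the deployed model and parse its pick.
    return "B"

correct = sum(model_choice(q) == q["answer"] for q in questions)
print(f"accuracy = {correct / len(questions):.2%}")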
