Features

All features are enabled by default to improve overall performance.

OpenAI Compatible

Supports OpenAI clients for chat and completion use cases, and Continue.dev for Co-Pilot use cases.
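For example, here is a minimal sketch using the official OpenAI Python client. The base URL, API key, and model name are placeholders, not actual Float16 values; use the details shown on your instance detail page.

```python
# A minimal sketch using the official OpenAI Python client.
# The base_url, api_key, and model values are placeholders -- use the
# endpoint and API key shown on your instance detail page.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-instance-endpoint>/v1",  # placeholder
    api_key="<your-api-key>",                        # placeholder
)

response = client.chat.completions.create(
    model="<deployed-model>",  # the model you deployed
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```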

Batch size, Max input length, Number of tokens

Supports long contexts of up to 1M tokens and dynamic batching to improve utilization.
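To illustrate where dynamic batching helps, here is a hedged sketch: requests sent concurrently give the scheduler a chance to group them into a single batched forward pass. The endpoint details are placeholders, as above.

```python
# Sketch: concurrent requests give the scheduler the chance to batch them.
# All endpoint details below are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://<your-instance-endpoint>/v1",  # placeholder
    api_key="<your-api-key>",                        # placeholder
)

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="<deployed-model>",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Summarize section {i} in one sentence." for i in range(8)]
    # Fired concurrently rather than one-by-one, so the server can
    # combine them into a single batched forward pass.
    for answer in await asyncio.gather(*(ask(p) for p in prompts)):
        print(answer)

asyncio.run(main())
```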

Quantization

Reduces the memory footprint, allowing LLMs to be deployed at half their original model size.

Improves inference speed by 2 times.
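As a rough illustration of the memory math (generic figures, not Float16-specific): weights dominate an LLM's footprint, so halving the bytes per parameter roughly halves the deployed size.

```python
# Back-of-the-envelope weight-memory estimate (generic, illustrative numbers).
def weight_footprint_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * 1e9 * bytes_per_param / 1024**3

print(f"{weight_footprint_gb(7, 2):.1f} GB")  # 7B params at 16-bit -> ~13.0 GB
print(f"{weight_footprint_gb(7, 1):.1f} GB")  # 7B params at 8-bit  -> ~6.5 GB
```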

Context caching

Reduces redundant computation when requests share the same context.

Improves inference speed by 1.5 to 2 times on evaluation datasets and over 10 times in ideal scenarios.
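Below is a sketch of a request pattern that benefits from context caching, assuming the cache applies to a shared prompt prefix. The endpoint details and the document path are placeholders.

```python
# Sketch: two requests share an identical long system prompt, so the
# server can reuse the cached prefix instead of recomputing it.
# Endpoint details and the document path are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-instance-endpoint>/v1",  # placeholder
    api_key="<your-api-key>",                        # placeholder
)

with open("contract.txt") as f:  # hypothetical long shared document
    shared_context = f.read()

for question in ["Who are the parties?", "What is the termination clause?"]:
    resp = client.chat.completions.create(
        model="<deployed-model>",
        messages=[
            {"role": "system", "content": shared_context},  # identical prefix
            {"role": "user", "content": question},
        ],
    )
    print(resp.choices[0].message.content)
```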
