
Mode

Serverless GPU Services


Last updated 1 month ago

The Serverless GPU service offers two operational modes: Development and Production. Each mode is designed for a different use case, helping you use GPU resources efficiently for your applications.

Development Mode

Development Mode is optimized for one-time GPU tasks that deliver results immediately, making it suitable for batch processing and analysis tasks. Execute a task with this command:

float16 run <your_app> --name <name>

Limitations

  • Maximum execution time: 60 seconds per task

  • Concurrent tasks: 1 task per user
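As a sketch, a one-off task script (the file name `my_task.py` is hypothetical) that fits these limits could be as simple as:

```python
# my_task.py -- hypothetical one-off Development Mode task.
# Does its work, prints a JSON result, and exits well under the 60-second limit.
import json
import time


def main():
    start = time.time()
    # Placeholder for real batch/analysis work:
    result = sum(i * i for i in range(1_000_000))
    print(json.dumps({"result": result, "elapsed_s": round(time.time() - start, 3)}))


if __name__ == "__main__":
    main()
```

You would then run it with `float16 run my_task.py --name my-task`. Anything the script prints to stdout becomes the task's output.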

Spot Mode

For tasks that need to run longer than 60 seconds, Development Mode offers an additional option: Spot Mode.

Usage:

float16 run --spot --name <task_name> --budget <budget_value>

In this mode, spot tasks can be interrupted by on-demand tasks if resources are insufficient. Once the on-demand task completes, the spot task automatically resumes.

The system does not handle task staging for you. Users must manage task staging (checkpointing) themselves, ensuring the task resumes from the last incomplete position.
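A minimal sketch of such self-managed staging, assuming the task processes a list of items and persists its progress to a local file (the file name `progress.json` and the helper names are hypothetical):

```python
# Checkpoint-based staging sketch: persist the index of the next unprocessed
# item after every step, so a resumed spot task skips completed work.
import json
import os

CHECKPOINT_FILE = "progress.json"  # hypothetical checkpoint path


def load_checkpoint():
    """Return the index of the next unprocessed item (0 if starting fresh)."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["next_index"]
    return 0


def save_checkpoint(next_index):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"next_index": next_index}, f)


def process(item):
    return item * item  # placeholder for real GPU work


def run(items):
    start = load_checkpoint()
    results = []
    for i in range(start, len(items)):
        results.append(process(items[i]))
        save_checkpoint(i + 1)  # persist progress after each item
    return results
```

If the spot task is interrupted mid-run, the next invocation of `run` picks up from the saved index instead of repeating finished work.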

Production Mode

Production mode enables continuous GPU application deployment with API endpoint access.

float16 deploy <your_app> --project-id <project_id>

After successful deployment, you will receive:

  • API endpoints for your application

  • API key for authentication

  • Two endpoint types available:

    • Function Endpoint: the container automatically stops after task completion

    • Server Endpoint: the container remains active for up to 30 seconds and supports multiple requests while running. Each request extends the timeout by another 30 seconds. The container automatically stops when the time limit expires.

Limitations

  • Applications must be built with the FastAPI framework.

  • Single API key provided per deployment (regeneration available)

  • Maximum execution time: 30 seconds per task

  • Concurrent tasks: 1 task per user
