Docs - Float16

Mode

Serverless GPU Services

The Serverless GPU service offers two operational modes: Development (run) and Production (deploy). Each mode is designed for a different use case to help you use GPU resources efficiently.

Development Mode

Development mode is optimized for one-time GPU tasks with immediate result delivery. It suits batch processing, short-term experiments, and analytical tasks.

Command:

float16 run <your_app> --name <name>

Limitations:

  • Max execution time: 60 seconds per task

  • Max concurrency: up to 8 tasks (not guaranteed; depends on system load)

Pricing:

  • On-demand: $0.006 per second (~$21.6/hour)
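A development-mode task is just a script that produces its result before the 60-second limit. A minimal sketch (the file name `hello.py` and the placeholder workload are illustrative, not from this page):

```python
# hello.py — a minimal one-shot task for development mode (illustrative).
# The task must finish within the 60-second execution limit; output is
# printed to stdout and returned when the task completes.
import time

def main():
    start = time.time()
    # Placeholder for real GPU work (e.g. a short inference batch).
    total = sum(i * i for i in range(1_000_000))
    elapsed = time.time() - start
    print(f"result={total} elapsed={elapsed:.2f}s")

if __name__ == "__main__":
    main()
```

You would then submit it with `float16 run hello.py --name <name>`.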

Spot Mode (Within development mode)

Spot mode is designed for tasks that need to run longer than 60 seconds, at a lower cost.

Note: Spot tasks can be interrupted by on-demand tasks and resumed automatically when resources free up.

Command:

float16 run --spot --name <task_name> --budget <budget_value>

Behavior:

  • Tasks may pause and resume based on resource availability

  • No automatic task staging – users must handle resume logic

Limitations:

  • No fixed execution time limit

  • Max concurrency: up to 8 tasks (not guaranteed)

Pricing:

  • Spot: $0.0012 per second (~$4.32/hour)
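Because spot tasks can be interrupted and there is no automatic task staging, the task itself must persist progress it can resume from. A minimal checkpointing sketch (the checkpoint file name and step loop are illustrative assumptions, not part of the Float16 CLI):

```python
# Checkpoint-and-resume sketch for a spot task (illustrative).
# On start, load the last saved step; after each unit of work, persist
# progress so an interrupted task can continue where it left off.
import json
import os

CHECKPOINT = "checkpoint.json"  # assumed path that survives a restart
TOTAL_STEPS = 10

def load_step() -> int:
    # Resume from the saved step, or start from zero on a fresh run.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["step"]
    return 0

def save_step(step: int) -> None:
    with open(CHECKPOINT, "w") as f:
        json.dump({"step": step}, f)

def run():
    step = load_step()
    while step < TOTAL_STEPS:
        # ... one resumable unit of work would go here ...
        step += 1
        save_step(step)  # persist after every step
    print(f"done at step {step}")

if __name__ == "__main__":
    run()
```

If the task is preempted mid-run, the next invocation picks up from the last saved step instead of repeating finished work.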

Production Mode

Production mode allows continuous deployment of GPU applications via API endpoints. Best suited for serving models and real-time inference.

Command:

float16 deploy <your_app> --project-id <project_id>

After deployment, you will receive:

  • API endpoints for your app

  • API key for authentication

Endpoint Types:

  • Function Endpoint

    • Executes your task and shuts down the container after completion

    • Max execution time: 60 seconds per request

  • Server Endpoint

    • Keeps the container running for real-time interaction

    • Initial active time: 60 seconds

  • Each incoming request extends the timeout by 30 seconds

    • Automatically shuts down when no request is received within the timeout window

Limitations:

  • Must be built with FastAPI

  • One API key per deployment (regeneration supported)

  • Max concurrency: up to 8 tasks (not guaranteed)

Pricing:

  • 💵 On-demand: $0.006 per second (~$21.6/hour)
