
LLM Dynamic Batching

Get Endpoint via Float16



This tutorial guides you through deploying a dynamic batching FastAPI application using Float16's deployment mode.

  • Float16 CLI installed

  • Logged into Float16 account

  • VSCode or preferred text editor recommended

What is dynamic batching ?

Deploying AI endpoints, also known as online serving, is important and challenging because it is difficult to keep GPU memory (VRAM) used efficiently.

During online serving, there are several techniques to improve GPU utilization and increase throughput.

One of the best-known techniques is 'dynamic batching'.

Dynamic batching helps maximize GPU utilization while mitigating the 'memory bound' issue.

Dynamic batching packs the requests that arrive within a specific time window, such as 1 or 2 seconds, into the same batch and runs inference on them together.

This increases throughput, at the cost of a slight increase in per-request latency.
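
As a rough illustration of the idea, here is a minimal, framework-free sketch (the helper names collect_window and fake_batch_inference are made up for this example, not Float16's implementation): every request that arrives within a one-second window is grouped and processed as one batch.

import asyncio
import time

async def fake_batch_inference(prompts):
    # Stand-in for a real model call: one forward pass over the whole batch
    await asyncio.sleep(0.5)
    return [f"answer to: {p}" for p in prompts]

async def collect_window(queue, window_sec=1.0):
    # Gather every request that arrives within `window_sec`, then run them as one batch
    batch = [await queue.get()]
    deadline = time.monotonic() + window_sec
    while (remaining := deadline - time.monotonic()) > 0:
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break
    return await fake_batch_inference(batch)

async def main():
    queue = asyncio.Queue()
    for prompt in ["Hi !!", "Who are you ?", "How about you ?"]:
        queue.put_nowait(prompt)  # three requests arriving in the same window
    print(await collect_window(queue))

asyncio.run(main())

The FastAPI script later in this tutorial implements the same pattern with a background batching loop.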

Step 1 : Download and Upload the Weights

We use Typhoon2-8b (a fine-tuned version of Llama3.1-8b) to demonstrate.

huggingface-cli download scb10x/llama3.1-typhoon2-8b-instruct --local-dir ./typhoon2-8b/

float16 storage upload -f ./typhoon2-8b -d weight-llm
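
As an optional sanity check (not part of the original steps), you can confirm the local download is complete by loading the tokenizer from the downloaded folder:

from transformers import AutoTokenizer

# Loads only if the tokenizer files were downloaded completely
tokenizer = AutoTokenizer.from_pretrained("./typhoon2-8b")
print(tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    tokenize=False,
    add_generation_prompt=True,
))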

Step 2 : Prepare Your Script

(server.py)

import os
import time
from typing import Optional
import uuid 
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import uvicorn
import asyncio

# Load the model and tokenizer once at startup.
# padding_side="left" is required for correct batched generation with decoder-only models.
start_load = time.time()
model_name = "../weight-llm/typhoon2-8b"  # weights uploaded in Step 1
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

app = FastAPI()

class ChatRequest(BaseModel):
    messages: str
    max_token: Optional[int] = 512  # accepted but not currently used by process_llm

def process_llm(batch_data, batch_id):
    global model
    # Apply the chat template to every prompt in the batch
    batch_tokenized = []
    for data in batch_data:
        _text_formated = [{"role": "user", "content": data}]
        _text_tokenized = tokenizer.apply_chat_template(
            _text_formated,
            tokenize=False,
            add_generation_prompt=True
        )
        batch_tokenized.append(_text_tokenized)

    # Tokenize the whole batch at once (left padding) and run a single generate() call over it
    model_inputs = tokenizer(batch_tokenized, return_tensors="pt", padding=True, truncation=True).to(model.device)
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
        pad_token_id=tokenizer.eos_token_id
    )
    # Strip the prompt tokens from each output, decode, and map results back to request ids
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    result_list = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    result_with_id = dict(zip(batch_id, result_list))
    return result_with_id

class BatchProcessor:
    def __init__(self):
        self.batch = []
        self.batch_id = []
        self.results = {}  # results of the most recently processed batch, keyed by request id
        self.lock = asyncio.Lock()
        self.event = asyncio.Event()

    async def add_to_batch(self, data, batch_id):
        # Queue the request under the lock so it joins the batch currently being collected
        async with self.lock:
            self.batch.append(data)
            self.batch_id.append(batch_id)

    async def process_batch(self):
        # Background loop: every second, take whatever has accumulated and run it as one batch
        while True:
            await asyncio.sleep(1)  # 1-second batching window
            async with self.lock:
                current_batch = self.batch.copy()
                current_batch_id = self.batch_id.copy()
                self.batch.clear()
                self.batch_id.clear()

            if current_batch:
                self.results = process_llm(current_batch, current_batch_id)
                # Wake every request that is waiting on this batch
                self.event.set()
                self.event.clear()

    async def get_result(self, batch_id):
        return self.results[batch_id]

main_batch = BatchProcessor()

@app.post("/chat")
async def chat(text_request: ChatRequest):
    batch_id = uuid.uuid4()
    await main_batch.add_to_batch(text_request.messages, batch_id)
    await main_batch.event.wait()
    result_text = await main_batch.get_result(batch_id)
    return JSONResponse(content={"response": result_text})

async def main():
    # Entrypoint expected by Float16 server mode: start the batching loop, then serve the app
    asyncio.create_task(main_batch.process_batch())
    config = uvicorn.Config(
        app, host="0.0.0.0", port=int(os.environ["PORT"])
    )
    server = uvicorn.Server(config)
    await server.serve()

  • Ensure the port is set to "port=int(os.environ['PORT'])"

  • Ensure the server is started from an "async def main" function (see the optional local-testing sketch below)
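
If you want to try the script locally before deploying, you can append a guard like this to the bottom of server.py. This is a local-testing addition, not part of the deployed example; Float16 normally provides PORT and runs "async def main" for you.

# Optional local entrypoint; os and asyncio are already imported at the top of server.py
if __name__ == "__main__":
    os.environ.setdefault("PORT", "8000")  # assumed default port for local runs only
    asyncio.run(main())

On Float16 this block should have no effect (the platform calls main itself), but if in doubt, remove it before deploying.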

Step 3 : Deploy Script

float16 deploy server.py

After successful deployment, you'll receive:

  • Function Endpoint

  • Server Endpoint

  • API Key

Example:

Function Endpoint: http://api.float16.cloud/task/run/function/x7x2DFl8zU   
Server Endpoint: http://api.float16.cloud/task/run/server/x7x2DFl8zU       
API Key: float16-r-QoZU7uNlgDIFJ5IMrBtOCjuzVBlC

## curl
curl -X POST "{SERVER-URL}/chat" -H "Authorization: Bearer {FLOAT16-ENDPOINT-TOKEN}" -H "Content-Type: application/json" -d '{ "messages": "YOUR_MESSAGE" }'

## Send both requests at (almost) the same time so they land in the same batch
curl -X POST "http://api.float16.cloud/task/run/server/x7x2DFl8zU/chat" -H "Authorization: Bearer float16-r-QoZU7uNlgDIFJ5IMrBtOCjuzVBlC" -H "Content-Type: application/json" -d '{ "messages": "Hi !! Who are you ?" }' &
curl -X POST "http://api.float16.cloud/task/run/server/x7x2DFl8zU/chat" -H "Authorization: Bearer float16-r-QoZU7uNlgDIFJ5IMrBtOCjuzVBlC" -H "Content-Type: application/json" -d '{ "messages": "How about you ?" }'

To pack requests into a batch, you need to use server mode.

This is because server mode starts the endpoint and keeps it alive for 30 seconds.

(You will only be billed for those 30 seconds.)

You are not charged based on the number of requests during the active time.

The server handles and processes the requests by itself, which makes this approach more cost-effective.

Congratulations! You've successfully deployed a dynamic batching endpoint with server mode on Float16's serverless GPU platform.

The full Typhoon2-8b example code is available at https://github.com/float16-cloud/examples/tree/main/official/deploy/fastapi-dynamic-batching-typhoon2-8b

Explore More

Learn how to use Float16 CLI for various use cases in our tutorials.

  • Hello World : Launch your first serverless GPU function and kickstart your journey.

  • Install new library : Enhance your toolkit by adding new libraries tailored to your project needs.

  • Copy output from remote : Efficiently transfer computation results from remote to your local storage.

  • Deploy FastAPI Helloworld : Quick start to deploy FastAPI without changing the code.

  • Upload and Download via CLI and Website : Directly upload and download file(s) to the server.

  • More examples : Open source from the community and the Float16 team.

Happy coding with Float16 Serverless GPU!