LLM Dynamic Batching
Get Endpoint via Float16
This tutorial guides you through deploying a FastAPI application with dynamic batching using Float16's deployment mode.
What is dynamic batching?
Deploying AI endpoints, also known as online serving, is critical but challenging, because mapping models onto GPU VRAM correctly is difficult and complex.
During online serving, several techniques can improve GPU utilization and increase throughput.
One of the best-known techniques is dynamic batching.
Dynamic batching helps maximize GPU utilization while easing the memory-bound nature of LLM inference.
It packs requests that arrive within a specific time window, such as 1 or 2 seconds, into the same batch and runs inference on them together.
This increases throughput, at the cost of a slight increase in per-request latency.
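As a toy illustration of the time-window idea (the function below is our own sketch for this explanation, not part of Float16's API), requests can be grouped by arrival time like this:

# Toy sketch: group request arrival times into batches that each span
# at most `window_s` seconds. Written for illustration only.
def group_into_batches(arrival_times_s, window_s=1.0):
    batches, current, window_start = [], [], None
    for t in sorted(arrival_times_s):
        if window_start is None or t - window_start > window_s:
            if current:
                batches.append(current)
            current, window_start = [], t
        current.append(t)
    if current:
        batches.append(current)
    return batches

# Six requests arriving over ~2.5 seconds become three batches,
# i.e. three batched forward passes instead of six individual ones.
print(group_into_batches([0.00, 0.15, 0.40, 1.20, 1.90, 2.50]))
# [[0.0, 0.15, 0.4], [1.2, 1.9], [2.5]]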
Step 1: Download and Upload the Weights
We use Typhoon2-8b (a fine-tuned version of Llama3.1-8b) to demonstrate.
huggingface-cli download scb10x/llama3.1-typhoon2-8b-instruct --local-dir ./typhoon2-8b/
float16 storage upload -f ./typhoon2-8b -d weight-llm
Step 2: Prepare Your Script (server.py)
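Below is a minimal sketch of what server.py could look like. It is not the exact script from this tutorial: batched_generate() is a hypothetical placeholder for loading the Typhoon2-8b weights uploaded in Step 1 and running a real batched generation call, and the route name and one-second window are assumptions.

# server.py - a minimal sketch of a FastAPI app with dynamic batching.
import asyncio
import time
from contextlib import asynccontextmanager

from fastapi import FastAPI
from pydantic import BaseModel

BATCH_WINDOW_SECONDS = 1.0          # pack requests arriving within this window
request_queue: asyncio.Queue = asyncio.Queue()

def batched_generate(prompts: list[str]) -> list[str]:
    # Placeholder: replace with a real batched inference call over the
    # Typhoon2-8b weights downloaded in Step 1.
    return [f"echo: {p}" for p in prompts]

async def batching_worker():
    """Collect requests for up to BATCH_WINDOW_SECONDS, then infer them together."""
    while True:
        prompt, future = await request_queue.get()
        batch = [(prompt, future)]
        deadline = time.monotonic() + BATCH_WINDOW_SECONDS
        while (remaining := deadline - time.monotonic()) > 0:
            try:
                batch.append(await asyncio.wait_for(request_queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        # One batched forward pass instead of len(batch) separate ones.
        outputs = batched_generate([p for p, _ in batch])
        for (_, fut), output in zip(batch, outputs):
            fut.set_result(output)

@asynccontextmanager
async def lifespan(app: FastAPI):
    worker = asyncio.create_task(batching_worker())
    yield
    worker.cancel()

app = FastAPI(lifespan=lifespan)

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
async def generate(req: GenerateRequest):
    # Enqueue the request and wait for the batching worker to fulfil it.
    future: asyncio.Future = asyncio.get_running_loop().create_future()
    await request_queue.put((req.prompt, future))
    return {"text": await future}

The key design choice is that the endpoint only enqueues work and awaits a future, while a single background worker decides when the time window has closed and sends the whole batch to the GPU.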
Step 3: Deploy the Script
After successful deployment, you'll receive:
Function Endpoint
Server Endpoint
API Key
To pack requests into batches, you must use server mode.
Server mode starts the endpoint and keeps it alive for 30 seconds at a time.
(You are billed only for those 30 seconds.)
Billing does not depend on the number of requests handled while the endpoint is active.
The server handles and processes the incoming requests itself, which makes this approach more cost-effective.
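For illustration, here is one way a client could send several requests to the deployed server endpoint at once; the endpoint URL, route, and Authorization header scheme below are placeholders, not values defined by Float16:

# Client sketch: fire several prompts concurrently at the server endpoint.
from concurrent.futures import ThreadPoolExecutor

import requests

SERVER_ENDPOINT = "https://<your-server-endpoint>/generate"  # placeholder
API_KEY = "<your-api-key>"                                   # placeholder

def ask(prompt: str) -> str:
    resp = requests.post(
        SERVER_ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},  # header scheme assumed
        json={"prompt": prompt},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["text"]

prompts = [f"Question {i}: what is dynamic batching?" for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    for answer in pool.map(ask, prompts):
        print(answer)

Requests that land within the same batching window are served by a single batched forward pass on the GPU.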
Congratulations! You've successfully deployed your first server-mode endpoint on Float16's serverless GPU platform.
Explore More
Learn how to use Float16 CLI for various use cases in our tutorials.
Happy coding with Float16 Serverless GPU!