One Click Deploy
What is One Click Deploy?
One Click Deploy is a service that lets you deploy LLMs without any configuration, using just a Hugging Face model repository. It allows you to focus on model development while avoiding the operational complexity of inference serving.
The One Click Deploy service relies on TensorRT-LLM and the Triton Inference Server. (NIM is also available; please reach out to us for more information.)
Why One Click Deploy with Float16?
Deploying LLMs requires consideration of more than just inference speed: batch size, maximum input length, number of tokens, quantization, context caching, and other factors all affect serving.
Float16 handles the complexity of configuring LLM deployment, ensuring you have the best experience with LLM serving.
Main features
OpenAI Compatible (see the example after this list)
Auto scheduler
Long context support (128k)
Quantization
Context caching
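Because the deployed endpoint is OpenAI compatible, the standard OpenAI Python client can talk to it directly. A minimal sketch, assuming a placeholder endpoint URL, API key, and model name (substitute the values from your own deployment):

```python
# pip install openai
from openai import OpenAI

# Placeholder values -- use the endpoint URL and key issued for
# your Float16 deployment.
client = OpenAI(
    base_url="https://<your-deployment>.example.com/v1",
    api_key="YOUR_FLOAT16_API_KEY",
)

response = client.chat.completions.create(
    model="your-deployed-model",  # the Hugging Face repo you deployed
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

Since the endpoint speaks the OpenAI protocol, any OpenAI-compatible SDK or tool can be pointed at it by changing only the base URL.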
Pricing
The One Click Deploy service charges by instance hour, like EC2, whether the instance is serving traffic or not.
GPU (number of cards) | Region | Price per hour
---|---|---
L40sx1 | N. Virginia (us-east-1) | $2.7 |
L40sx1 | Oregon (us-west-2) | $2.7 |
L4x1 | N. Virginia (us-east-1) | $1.2 |
L4x1 | Oregon (us-west-2) | $1.2 |
A10x1 | Sydney (ap-southeast-2) | $1.95 |
A10x1 | Jakarta (ap-southeast-3) | $2.1 |
A10x1 | Tokyo (ap-northeast-1) | $2.2 |
L4x1 means the instance has one NVIDIA L4 GPU; L4x4 means it has four NVIDIA L4 GPUs.
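Because billing is per instance hour regardless of load, estimating cost is straightforward arithmetic. A quick sketch using the L4x1 rate from the table above (the always-on assumption is illustrative):

```python
# Monthly cost estimate for an always-on deployment; instances are
# billed per hour whether or not they are serving traffic.
RATE_PER_HOUR = 1.2        # L4x1 in us-east-1, from the table above
HOURS_PER_MONTH = 24 * 30  # 720 hours

monthly_cost = RATE_PER_HOUR * HOURS_PER_MONTH
print(f"${monthly_cost:.2f} per month")  # $864.00 per month
```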
Use Case
Intensive workload
One Click Deploy provides a dedicated endpoint for you with no rate limits or additional costs.
This endpoint is private and exclusive to your workload, ensuring it is not shared with others.
RAG
Combine LLMs with vector search so the model can draw on external knowledge or your business's internal documents.
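A minimal RAG sketch against the deployed endpoint. The toy `search` retriever and the endpoint URL, key, and model name are placeholders; in practice the retriever would use embeddings plus a vector store:

```python
from openai import OpenAI

# Placeholder endpoint and key -- use your deployment's values.
client = OpenAI(
    base_url="https://<your-deployment>.example.com/v1",
    api_key="YOUR_FLOAT16_API_KEY",
)

DOCS = [
    "Refunds are issued within 14 days of purchase.",
    "Support hours are 9:00-18:00 ICT, Monday to Friday.",
]

def search(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query.
    Replace with embeddings + a vector store in production."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:k]

question = "What is the refund policy?"
context = "\n".join(search(question, DOCS))

response = client.chat.completions.create(
    model="your-deployed-model",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```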
Multilingual
Proprietary solutions are not suitable for low-resource and language-specific use cases. You can deploy models built for specific languages, such as SeaLLM for Southeast Asian languages, or Typhoon and OpenThaiGPT for Thai.
Code co-pilot
As an alternative to GitHub Copilot, you can deploy your own co-pilot, such as CodeQwen1.5-7B-Chat, and use it via Continue.dev for autocompletion and fill-in-the-middle coding tasks.