One Click Deploy

What is One Click Deploy?

One Click Deploy is a service that deploys LLMs with no configuration, starting from nothing more than a Hugging Face model repository. It lets you focus on model development instead of the operational complexity of inference.

The One Click Deploy service relies on TensorRT-LLM and the Triton Inference Server. (NIM is also available; please reach out to us for more information.)

Why One Click Deploy with Float16?

Deploying LLMs well means balancing many factors: inference speed, batch size, maximum input length, token limits, quantization, context caching, and more.

Float16 handles the complexity of configuring LLM deployment, ensuring you have the best experience with LLM serving.

Main features

  • OpenAI compatible (see the client sketch after this list)

  • Auto scheduler

  • Long context support (128k)

  • Quantization

  • Context caching
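
Because the endpoint is OpenAI compatible, any OpenAI client SDK can talk to it by overriding the base URL. Below is a minimal sketch using the official OpenAI Python SDK; the base URL, API key, and model name are hypothetical placeholders, not Float16's actual values.

```python
# Minimal sketch: calling an OpenAI-compatible endpoint with the OpenAI
# Python SDK. base_url, api_key, and model are hypothetical placeholders;
# substitute the values from your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-endpoint>/v1",  # hypothetical endpoint URL
    api_key="<your-float16-api-key>",       # hypothetical API key
)

response = client.chat.completions.create(
    model="<your-deployed-model>",          # e.g. the HF repo you deployed
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```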

Pricing

The One Click Deploy service is billed per instance-hour, like EC2, whether or not the instance is actively serving requests.

L4x1 means an instance with one NVIDIA L4 GPU.

L4x4 means an instance with four NVIDIA L4 GPUs.
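
As a worked example of instance-hour billing (the hourly rate below is a made-up placeholder, not a published price):

```python
# Hypothetical illustration of instance-hour billing. The rate is a
# placeholder; check Float16's pricing page for real numbers.
HOURLY_RATE_L4X1 = 1.00   # USD per instance-hour (hypothetical)
hours_running = 8         # instance uptime, busy or idle

# Billing follows uptime, not request volume (whether computing or not).
cost = HOURLY_RATE_L4X1 * hours_running
print(f"8 hours on L4x1 -> ${cost:.2f}")
```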

Use Cases

Intensive workloads

One Click Deploy provides a dedicated endpoint for you with no rate limits or additional costs.

This endpoint is private and exclusive to your workload, ensuring it is not shared with others.

RAG

Combine LLMs with vector search so the model can ground its answers in external knowledge, such as your business's internal documents. A minimal sketch follows.
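
Below is a minimal RAG sketch against an OpenAI-compatible endpoint. It assumes sentence-transformers for local embeddings; the endpoint URL, API key, model name, and documents are all hypothetical placeholders.

```python
# Minimal RAG sketch: embed documents locally, retrieve the closest one,
# and let the deployed LLM answer with that context. Endpoint details and
# the embedding model choice are assumptions, not Float16 specifics.
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am-6pm, Monday through Friday.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(documents, normalize_embeddings=True)

question = "When can customers get a refund?"
q_vec = embedder.encode([question], normalize_embeddings=True)[0]

# Cosine similarity: vectors are normalized, so a dot product suffices.
best_doc = documents[int(np.argmax(doc_vecs @ q_vec))]

client = OpenAI(base_url="https://<your-endpoint>/v1",  # hypothetical
                api_key="<your-float16-api-key>")
response = client.chat.completions.create(
    model="<your-deployed-model>",
    messages=[
        {"role": "system", "content": f"Answer using this context: {best_doc}"},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```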

Multilingual

Proprietary models often fall short on low-resource or language-specific use cases. With One Click Deploy you can serve models built for specific languages, such as SeaLLM for Southeast Asian languages, or Typhoon and OpenThaiGPT for Thai.

Code co-pilot

As an alternative to GitHub Copilot, you can deploy your own coding assistant, such as CodeQwen1.5-7B-Chat, and use it through Continue.dev for autocompletion and fill-in-the-middle (FIM) coding tasks. A FIM request sketch follows.
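
The sketch below sends a fill-in-the-middle request, assuming the endpoint exposes the OpenAI /v1/completions route. The FIM control tokens follow CodeQwen1.5's model card; verify them against the model you actually deploy. The URL, key, and model name are hypothetical placeholders.

```python
# Fill-in-the-middle sketch for a deployed CodeQwen1.5 model. Assumes the
# endpoint exposes the OpenAI /v1/completions route; base_url and api_key
# are hypothetical placeholders. FIM tokens follow the CodeQwen1.5 model
# card -- confirm them for your model before relying on this.
from openai import OpenAI

client = OpenAI(base_url="https://<your-endpoint>/v1",
                api_key="<your-float16-api-key>")

prefix = "def fibonacci(n):\n    "
suffix = "\n    return a"
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

response = client.completions.create(
    model="<your-deployed-model>",   # e.g. CodeQwen1.5-7B-Chat
    prompt=prompt,
    max_tokens=64,
)
print(response.choices[0].text)      # the model's proposed middle section
```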
