# One Click Deploy

{% hint style="danger" %}
This service is under maintenance.
{% endhint %}

## What is One Click Deploy?

One Click Deploy is a service that helps you **deploy LLMs without any configuration**, using just a Hugging Face model repository. It lets you focus on model development instead of wrestling with the inference stack.

The One Click Deploy service relies on **TensorRT-LLM** and the **Triton Inference Server**. (**NIM** is also available; please reach out to us for more information.)

## Why One Click Deploy with Float16?

Deploying LLMs requires consideration of not just inference speed, but also **batch size**, **maximum input length**, **number of tokens**, **quantization**, **context caching**, and other factors.

Float16 handles the complexity of configuring LLM deployment, ensuring you have the best experience with LLM serving.

### Main features

* OpenAI Compatible
* Auto scheduler
* Long context support (128k)
* Quantization
* Context caching
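Because the endpoint is OpenAI compatible, any client that speaks the OpenAI chat completions schema can talk to it. The sketch below builds such a request with only the standard library; the endpoint URL, API key, and model name are placeholders, not real values — substitute your own deployment's details.

```python
import json
from urllib import request

# Placeholder values -- replace with your deployment's endpoint and credentials.
BASE_URL = "https://your-endpoint.example.com/v1"  # hypothetical endpoint
API_KEY = "your-api-key"                           # hypothetical key

def build_chat_request(prompt: str, model: str = "my-deployed-model") -> dict:
    """Build a request body following the OpenAI chat completions schema."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

payload = build_chat_request("Hello!")

# To actually send the request (requires a live deployment):
# req = request.Request(
#     f"{BASE_URL}/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Authorization": f"Bearer {API_KEY}",
#              "Content-Type": "application/json"},
# )
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the request shape matches the OpenAI API, the official OpenAI SDK also works by pointing its `base_url` at your dedicated endpoint.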

## Pricing

The One Click Deploy service charges based on **instance hours**, like EC2, whether the instance is actively computing or idle.

| GPU (number of cards) | Region                   | Price per hour  |
| -------------------- | ------------------------ | --------------- |
| L40sx1               | N. Virginia (us-east-1)  | $2.7            |
| L40sx1               | Oregon (us-west-2)       | $2.7            |
| L4x1                 | N. Virginia (us-east-1)  | $1.2            |
| L4x1                 | Oregon (us-west-2)       | $1.2            |
| A10x1                | Sydney (ap-southeast-2)  | $1.95           |
| A10x1                | Jakarta (ap-southeast-3) | $2.1            |
| A10x1                | Tokyo (ap-northeast-1)   | $2.2            |

L4x1 means the instance has one NVIDIA L4 GPU.

L4x4 means the instance has four NVIDIA L4 GPUs.
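Since billing is per instance hour regardless of utilization, estimating cost is a simple multiplication of the hourly rate by the hours the instance stays up. A minimal sketch using the rates from the table above:

```python
# Hourly rates from the pricing table above (USD per instance hour).
RATES = {
    ("L40sx1", "us-east-1"): 2.70,
    ("L40sx1", "us-west-2"): 2.70,
    ("L4x1", "us-east-1"): 1.20,
    ("L4x1", "us-west-2"): 1.20,
    ("A10x1", "ap-southeast-2"): 1.95,
    ("A10x1", "ap-southeast-3"): 2.10,
    ("A10x1", "ap-northeast-1"): 2.20,
}

def instance_cost(gpu: str, region: str, hours: float) -> float:
    """Cost of keeping an instance up for `hours` -- billed whether busy or idle."""
    return round(RATES[(gpu, region)] * hours, 2)

# An L4x1 in us-east-1 running 24/7 for a 30-day month:
print(instance_cost("L4x1", "us-east-1", 24 * 30))  # 864.0
```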

## Use Case

#### Intensive workload

One Click Deploy provides a dedicated endpoint for you with **no rate limits** or additional costs.

This endpoint is private and exclusive to your workload, ensuring it is not shared with others.

#### RAG

Leverage LLMs and vector search together to empower LLMs to access external knowledge or use your business's internal documents.

#### Multilingual

Proprietary solutions are often not suitable for low-resource and language-specific use cases. You can deploy models for specific languages, such as SeaLLM for Southeast Asian languages, or Typhoon and OpenThaiGPT for Thai.

#### Code co-pilot

As an alternative to GitHub Copilot, you can deploy your own code assistant, such as CodeQwen1.5-7B-Chat, and use it via Continue.dev for autocompletion and fill-in-the-middle coding tasks.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.float16.cloud/getting-started/one-click-deploy.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
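The query URL shown above can be assembled with the standard library; the question text must be URL-encoded before being placed in the `ask` parameter. A minimal sketch (the example question is illustrative):

```python
from urllib.parse import quote

def ask_url(question: str) -> str:
    """Build the documentation-query URL with a URL-encoded question."""
    base = "https://docs.float16.cloud/getting-started/one-click-deploy.md"
    return f"{base}?ask={quote(question)}"

url = ask_url("What GPUs are available in the Tokyo region?")
# GET this URL with any HTTP client, e.g.:
# from urllib.request import urlopen
# print(urlopen(url).read().decode())
```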
