# Long context and Auto scheduler

One Click Deploy supports long contexts of up to 1M tokens.

### Batch size, max input length (context length), and number of tokens

<figure><img src="/files/Aq4S15ySflY43JIUcGGv" alt=""><figcaption></figcaption></figure>

Batch size, max input length, and number of tokens are interrelated and crucial when deploying LLMs by yourself.

#### A higher batch size increases the number of requests the LLM processes in parallel.

For example, a batch size of 1 means processing one request at a time; the next request is processed only after the first one is completed.

#### A longer max input length allows more tokens in the prompt.

For example, a max input length of 4,096 means each request can use a maximum of 4,096 tokens.

#### The number of tokens sets the total token budget shared by all parallel requests.

For example, with 16,384 tokens, a batch size of 8, and a max input length of 4,096, the endpoint can process a maximum of 8 requests simultaneously, provided the accumulated tokens of the 8 requests do not exceed 16,384. This means that when all 8 batch slots are in use, each request should average no more than 2,048 tokens.

If the tokens of an incoming request exceed the number of tokens currently available, the request is automatically put into a queue and processed once enough tokens are available.
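The arithmetic in the example above can be sketched in Python. This is a minimal illustration under stated assumptions, not One Click Deploy's implementation: the function name `can_admit`, its parameters, and the treatment of over-length prompts are all hypothetical.

```python
def can_admit(request_tokens, running_tokens, batch_size=8,
              num_tokens=16_384, max_input_length=4_096):
    """Return True if a new request can start next to the running ones.

    running_tokens: token counts of the requests currently being processed.
    All names and default values are illustrative assumptions.
    """
    if request_tokens > max_input_length:
        return False  # the request is longer than a single request may be
    if len(running_tokens) >= batch_size:
        return False  # every batch slot is already in use
    # The accumulated tokens of all parallel requests must stay within the budget.
    return sum(running_tokens) + request_tokens <= num_tokens


# Average token budget per request when all 8 batch slots are in use
average_per_request = 16_384 // 8
print(average_per_request)  # 2048
```

With seven running requests of 2,048 tokens each, `can_admit(2048, [2048] * 7)` returns `True`, while a ninth request is refused regardless of its size because no batch slot is free.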

### How does long context work?

#### 1. The maximum context length is read from `max_position_embeddings` in `config.json`.

If the model was trained with a long context, ensure that `max_position_embeddings` in the repository's `config.json` matches the maximum context length used during training.
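To check the value before deploying, you can read it from a local copy of the model repository. The sketch below is only an illustration; the helper name `read_max_context` and the directory layout are assumptions.

```python
import json
from pathlib import Path


def read_max_context(model_dir):
    """Return max_position_embeddings from <model_dir>/config.json, or None if absent."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("max_position_embeddings")
```

For example, `read_max_context("./my-model")` returns the configured value, which you can compare with the context length used during training.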

#### 2. Ensure the instance has sufficient VRAM.

Long contexts require more VRAM for inference.

The VRAM required for the context scales linearly with its length; doubling the context size requires double the VRAM for that context.
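As a rough illustration of this linear relationship, the sketch below estimates the size of the key/value cache, which is typically the part of inference memory that grows with context length. The formula is a common approximation for transformer models and is an assumption here, not something One Click Deploy documents; the model dimensions are hypothetical.

```python
def kv_cache_bytes(context_length, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV-cache size: one key and one value vector per token, per layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_length * batch_size


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 16-bit values
small = kv_cache_bytes(32_768, num_layers=32, num_kv_heads=8, head_dim=128)
large = kv_cache_bytes(65_536, num_layers=32, num_kv_heads=8, head_dim=128)
print(large / small)  # 2.0
```

Model weights and activations add a roughly fixed amount on top of this, so the total VRAM needed grows with context length but does not start from zero.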

#### 3. One Click Deploy automatically sets the maximum context length.

It estimates the maximum context length, batch size, and number of tokens using optimization techniques, model size, VRAM, and GPU compute compatibility.

### How does the auto scheduler work?

The auto scheduler triggers automatically when either of the following conditions is met:

#### 1. Exceeding batch size

For example, if your instance has a batch size of 8 and is already processing 8 requests simultaneously, a new request waits until a batch slot becomes available.

{% hint style="info" %}
The new request does not wait for every request in the batch to complete; as soon as any one of them completes, the new request automatically starts processing.
{% endhint %}

#### 2. Exceeding the number of tokens

The number of tokens is a hard cap on the total tokens that can be processed in parallel, which prevents out-of-memory (OOM) errors.
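The two conditions can be illustrated with a toy scheduler. This is a minimal sketch, not the actual auto scheduler: the class name `AutoSchedulerSketch`, its methods, and the first-in, first-out queue order are assumptions made for illustration.

```python
from collections import deque


class AutoSchedulerSketch:
    """Toy FIFO scheduler with a batch-size limit and a token budget."""

    def __init__(self, batch_size, num_tokens):
        self.batch_size = batch_size
        self.num_tokens = num_tokens
        self.running = {}      # request id -> token count
        self.queue = deque()   # waiting (request id, token count) pairs

    def _fits(self, tokens):
        has_slot = len(self.running) < self.batch_size
        has_tokens = sum(self.running.values()) + tokens <= self.num_tokens
        return has_slot and has_tokens

    def _drain(self):
        # Start queued requests in arrival order while the head of the queue fits.
        while self.queue and self._fits(self.queue[0][1]):
            request_id, tokens = self.queue.popleft()
            self.running[request_id] = tokens

    def submit(self, request_id, tokens):
        self.queue.append((request_id, tokens))
        self._drain()
        return request_id in self.running  # True if it started immediately

    def complete(self, request_id):
        # Finishing one request frees its slot and tokens for waiting requests.
        del self.running[request_id]
        self._drain()


scheduler = AutoSchedulerSketch(batch_size=2, num_tokens=4_096)
scheduler.submit("a", 2_000)   # starts immediately
scheduler.submit("b", 2_000)   # starts immediately
scheduler.submit("c", 1_000)   # queued: both batch slots are in use
scheduler.complete("a")        # "c" starts as soon as one request finishes
```

The same queue handles the token cap: a request that would push the running total above `num_tokens` waits until enough tokens are released, even when a batch slot is free.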

