Long context and Auto scheduler
One Click Deploy supports long contexts of up to 1M tokens.
Batch size, max input length, and number of tokens are interrelated and crucial to get right when deploying LLMs yourself.
i.e. a batch size of 1 means the endpoint processes one request at a time; the next request starts only after the first one completes.
i.e. a max input length of 4,096 means each request can use at most 4,096 tokens.
i.e. with 16,384 tokens, a batch size of 8, and a max input length of 4,096,
this endpoint can process at most 8 requests simultaneously, provided the combined tokens of the 8 requests do not exceed 16,384.
This means each request should average no more than 2,048 tokens.
If an incoming request would exceed the token budget, it is automatically queued and processed once enough tokens become available.
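To make the arithmetic concrete, here is a minimal sketch of such a capacity check. The limits mirror the example above; the function name and structure are hypothetical, not the endpoint's actual implementation:

```python
MAX_BATCH_SIZE = 8        # simultaneous requests
MAX_INPUT_LENGTH = 4_096  # tokens per request
TOKEN_BUDGET = 16_384     # combined tokens across all in-flight requests

def can_admit(in_flight_tokens: list[int], request_tokens: int) -> bool:
    """Return True if a new request fits all three limits; otherwise it must queue."""
    if request_tokens > MAX_INPUT_LENGTH:
        return False                                  # exceeds the per-request cap
    if len(in_flight_tokens) >= MAX_BATCH_SIZE:
        return False                                  # no free batch slot
    return sum(in_flight_tokens) + request_tokens <= TOKEN_BUDGET

# 8 requests averaging 2,048 tokens each exactly fill the 16,384-token budget.
print(can_admit([2_048] * 7, 2_048))  # True
print(can_admit([2_048] * 7, 4_096))  # False: would exceed the token budget, so it queues
```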
If the model repository was trained with a long context, ensure that max_position_embeddings in config.json matches the max context used during training.
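As a quick sanity check, you can inspect the value directly. A minimal sketch, assuming the model repository has been downloaded locally; the path and training context length below are placeholders:

```python
import json

# Placeholder path to a locally downloaded model repository.
with open("my-long-context-model/config.json") as f:
    config = json.load(f)

max_pos = config.get("max_position_embeddings")
trained_context = 131_072  # placeholder: the max context the model was trained with

if max_pos != trained_context:
    print(f"Mismatch: max_position_embeddings={max_pos}, expected {trained_context}")
else:
    print(f"OK: max_position_embeddings={max_pos}")
```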
Long contexts require more VRAM for inference.
The VRAM requirement scales linearly; doubling the context size requires double the VRAM.
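The context-dependent part of that cost is dominated by the KV cache, which grows in proportion to the number of tokens held in memory. Below is a back-of-the-envelope estimate of that portion; the layer count, head count, head size, and precision are illustrative placeholders, not any particular model's values:

```python
def kv_cache_gib(context_len: int, batch_size: int,
                 num_layers: int = 32, num_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Rough KV-cache size in GiB: 2 (K and V) * layers * kv_heads * head_dim bytes per token."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_len * batch_size / 1024**3

# Doubling the context length doubles the context-dependent memory.
print(kv_cache_gib(16_384, batch_size=1))   # 2.0 GiB with these placeholder dimensions
print(kv_cache_gib(32_768, batch_size=1))   # 4.0 GiB
```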
The auto scheduler estimates the maximum context length, batch size, and number of tokens based on the optimization techniques applied, the model size, the available VRAM, and the GPU's compute capability.
It triggers automatically when certain scenarios are met.
i.e. if your instance can process a batch size of 8 and you are already handling 8 requests simultaneously, a new request waits until a batch slot becomes available.
The new request does not wait for the entire batch to complete; as soon as one request in the batch finishes, the new request automatically starts processing.
The number of tokens is a hard cap on how many tokens can be processed in parallel at once; it exists to prevent OOM (Out-Of-Memory) errors.
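Putting the batch-slot and token-cap rules together, here is a self-contained, hypothetical sketch of the queueing behavior described above (the class name and structure are illustrative, not the actual scheduler):

```python
from collections import deque

class SchedulerSketch:
    def __init__(self, max_batch_size: int = 8, token_budget: int = 16_384):
        self.max_batch_size = max_batch_size
        self.token_budget = token_budget          # hard cap that guards against OOM
        self.in_flight: list[int] = []            # token counts of running requests
        self.waiting: deque[int] = deque()        # queued requests

    def submit(self, tokens: int) -> None:
        self.waiting.append(tokens)
        self._drain()

    def complete(self, tokens: int) -> None:
        # As soon as one request finishes, queued requests are re-checked immediately;
        # a new request never has to wait for the whole batch to complete.
        self.in_flight.remove(tokens)
        self._drain()

    def _drain(self) -> None:
        # Admit queued requests while both the batch-size and token caps allow it.
        while self.waiting:
            nxt = self.waiting[0]
            if (len(self.in_flight) < self.max_batch_size
                    and sum(self.in_flight) + nxt <= self.token_budget):
                self.in_flight.append(self.waiting.popleft())
            else:
                break
```

In this sketch, complete() frees a batch slot and its tokens, and queued requests start as soon as both limits allow, mirroring the behavior described above.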