Long context and Auto scheduler
One Click Deploy supports long contexts of up to 1M tokens.
Batch size, max input length, and number of tokens are interrelated and crucial to get right when deploying LLMs yourself.
i.e. a batch size of 1 means the endpoint processes one request at a time; the next request starts only after the first one completes.
i.e. a max input length of 4,096 means each request can use at most 4,096 tokens.
i.e. with 16,384 tokens, a batch size of 8, and a max input length of 4,096,
this endpoint can process at most 8 requests simultaneously, provided the combined tokens of the 8 requests do not exceed 16,384.
This means each request should average no more than 2,048 tokens.
If an incoming request would exceed the token budget, it is automatically queued and processed once enough tokens become available.
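To make the arithmetic concrete, here is a minimal sketch of such a capacity check. The limits mirror the example above; the function name and structure are hypothetical, not the endpoint's actual implementation:

```python
MAX_BATCH_SIZE = 8        # simultaneous requests
MAX_INPUT_LENGTH = 4_096  # tokens per request
TOKEN_BUDGET = 16_384     # combined tokens across all in-flight requests

def can_admit(in_flight_tokens: list[int], request_tokens: int) -> bool:
    """Return True if a new request fits all three limits; otherwise it must queue."""
    if request_tokens > MAX_INPUT_LENGTH:
        return False                                  # exceeds the per-request cap
    if len(in_flight_tokens) >= MAX_BATCH_SIZE:
        return False                                  # no free batch slot
    return sum(in_flight_tokens) + request_tokens <= TOKEN_BUDGET

# 8 requests averaging 2,048 tokens each exactly fill the 16,384-token budget.
print(can_admit([2_048] * 7, 2_048))  # True
print(can_admit([2_048] * 7, 4_096))  # False: would exceed the token budget, so it queues
```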
If the model repository was trained with a long context, ensure that max_position_embeddings in config.json matches the max context used during training.
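As a quick sanity check, you can inspect the value directly. A minimal sketch, assuming the model repository has been downloaded locally; the path and training context length below are placeholders:

```python
import json

# Placeholder path to a locally downloaded model repository.
with open("my-long-context-model/config.json") as f:
    config = json.load(f)

max_pos = config.get("max_position_embeddings")
trained_context = 131_072  # placeholder: the max context the model was trained with

if max_pos != trained_context:
    print(f"Mismatch: max_position_embeddings={max_pos}, expected {trained_context}")
else:
    print(f"OK: max_position_embeddings={max_pos}")
```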
Long contexts require more VRAM for inference.
The VRAM requirement scales linearly; doubling the context size requires double the VRAM.
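The context-dependent part of that cost is dominated by the KV cache, which grows in proportion to the number of tokens held in memory. Below is a back-of-the-envelope estimate of that portion; the layer count, head count, head size, and precision are illustrative placeholders, not any particular model's values:

```python
def kv_cache_gib(context_len: int, batch_size: int,
                 num_layers: int = 32, num_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Rough KV-cache size in GiB: 2 (K and V) * layers * kv_heads * head_dim bytes per token."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_len * batch_size / 1024**3

# Doubling the context length doubles the context-dependent memory.
print(kv_cache_gib(16_384, batch_size=1))   # 2.0 GiB with these placeholder dimensions
print(kv_cache_gib(32_768, batch_size=1))   # 4.0 GiB
```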
The auto scheduler estimates the maximum context length, batch size, and number of tokens based on the optimization techniques applied, the model size, the available VRAM, and the GPU's compute capability.
It triggers automatically when certain scenarios are met.
i.e. if your instance can process a batch size of 8 and you are already handling 8 requests simultaneously, a new request waits until a batch slot becomes available.
The new request does not wait for the entire batch to complete; as soon as one request in the batch finishes, the new request automatically starts processing.
The number of tokens is a hard cap on how many tokens can be processed in parallel at once; it exists to prevent OOM (Out-Of-Memory) errors.
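Putting the batch-slot and token-cap rules together, here is a self-contained, hypothetical sketch of the queueing behavior described above (the class name and structure are illustrative, not the actual scheduler):

```python
from collections import deque

class SchedulerSketch:
    def __init__(self, max_batch_size: int = 8, token_budget: int = 16_384):
        self.max_batch_size = max_batch_size
        self.token_budget = token_budget          # hard cap that guards against OOM
        self.in_flight: list[int] = []            # token counts of running requests
        self.waiting: deque[int] = deque()        # queued requests

    def submit(self, tokens: int) -> None:
        self.waiting.append(tokens)
        self._drain()

    def complete(self, tokens: int) -> None:
        # As soon as one request finishes, queued requests are re-checked immediately;
        # a new request never has to wait for the whole batch to complete.
        self.in_flight.remove(tokens)
        self._drain()

    def _drain(self) -> None:
        # Admit queued requests while both the batch-size and token caps allow it.
        while self.waiting:
            nxt = self.waiting[0]
            if (len(self.in_flight) < self.max_batch_size
                    and sum(self.in_flight) + nxt <= self.token_budget):
                self.in_flight.append(self.waiting.popleft())
            else:
                break
```

In this sketch, complete() frees a batch slot and its tokens, and queued requests start as soon as both limits allow, mirroring the behavior described above.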