Local
This runtime provides an interface to deploy models via different backends optimized for running your own LLMs on the serving infrastructure of your choice.
Modes
Depending on your AI application's needs, you can choose between chat and text generation with our HuggingFace Transformers, vLLM, and DeepSpeed backends.
Chat
Chat mode is for instruction-trained or fine-tuned models that expect messages from different roles to be combined via a prompt template. You can customize the prompt usage and provide a prompt template as plain text or Jinja2. Templates are rendered with a list of messages, where each message is an object with a role, content, and type. You can optionally provide tokens that will be rendered inside the template.
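As a rough illustration of how such a template works, the sketch below renders a list of role-based messages with Jinja2. The template text, the message fields shown, and the optional bos_token are assumptions for illustration only; see the reference for the exact variables the runtime exposes.

```python
# Minimal sketch: rendering a chat prompt from a list of messages with Jinja2.
# The template text and the optional special token below are illustrative
# assumptions, not the runtime's built-in defaults.
from jinja2 import Template

template = Template(
    "{{ bos_token }}"
    "{% for message in messages %}"
    "{{ message.role }}: {{ message.content }}\n"
    "{% endfor %}"
    "assistant:"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant.", "type": "text"},
    {"role": "user", "content": "Summarise continuous batching in one sentence.", "type": "text"},
]

# Optional tokens can be passed in and referenced inside the template.
prompt = template.render(messages=messages, bos_token="<s>")
print(prompt)
```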
Please visit the reference and the chat example for further details.
Text
Our prompting functionality also enables you to use the LLM for text generation, otherwise known as text completion.
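To make the distinction with chat concrete, the sketch below shows plain text completion using the Hugging Face transformers library directly; the model name is only an example, and this is not the runtime's own request format.

```python
# Plain text completion: the model simply continues a single prompt string.
# This uses the transformers library directly, purely as an illustration;
# the model name is an example and this is not the runtime's request API.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
output = generator("The main advantage of continuous batching is", max_new_tokens=30)
print(output[0]["generated_text"])
```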
Please visit the reference and the text example for further details.
Backends
The three backends we currently support are vLLM, DeepSpeed, and a custom backend leveraging the HuggingFace Transformers library. While these are fundamentally different tools, our local runtime provides a common model-settings.json configuration file and a common inference request-response format for all of them. You can still configure the parameters that are specific to each backend, all within the same .json file.
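As a rough sketch of what this could look like, the snippet below writes a hypothetical model-settings.json that selects a backend and passes a backend-specific option. The key names used here (backend, tensor_parallel_size, and so on) are assumptions for illustration only; consult the reference for the fields the runtime actually accepts.

```python
# Hypothetical example of a model-settings.json combining common settings with
# a backend-specific option. All key names below are illustrative assumptions;
# check the reference for the schema the runtime actually uses.
import json

settings = {
    "name": "my-llm",
    "parameters": {
        "extra": {
            "backend": "vllm",                               # assumed: which backend to load
            "model": "mistralai/Mistral-7B-Instruct-v0.2",   # example model id
            "tensor_parallel_size": 2,                       # assumed: vLLM-specific setting
        }
    },
}

with open("model-settings.json", "w") as f:
    json.dump(settings, f, indent=2)
```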
In the table below, you can see some of the serving optimizations provided by each backend. Check out our examples for how to serve models across multiple GPUs here, and for how to quantize your models here.
| Feature | Optimizes | DeepSpeed | vLLM | Transformers | Configuration |
| --- | --- | --- | --- | --- | --- |
| Multi-GPU Serving | Latency, Memory Usage | ✅ | ✅ | 🟠 (available but not optimal) | Define in model-settings.json |
| Quantization (before load) | Memory Usage | ❌ | ✅ (awq, gptq) | ✅ (awq, gptq) | Define in model-settings.json |
| Quantization (upon load) | Memory Usage | ✅ | ❌ | ✅ (bitsandbytes) | Define in model-settings.json |
| Continuous Batching | Latency | ✅ | ✅ | ✅ | Default serving behaviour |
| Attention | Latency | ✅ | ✅ | ✅ | Default serving behaviour |
| K-V Caching | Latency | ✅ (blocked K-V caching) | ✅ (paged attention) | ✅ (zero padding K-V caching) | Default serving behaviour |
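To make two of the table rows concrete for the Transformers backend, the sketch below shows what load-time quantization with bitsandbytes and multi-GPU placement look like in the underlying Hugging Face transformers API. The model name is an example; how these options are exposed through model-settings.json is covered in the quantization and multi-GPU examples linked above.

```python
# Illustration of "Quantization (upon load)" and multi-GPU placement using the
# Hugging Face transformers API that the Transformers backend builds on.
# The model id is an example; this requires CUDA GPUs and the bitsandbytes package.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # quantize while loading
    device_map="auto",  # spread layers across the available GPUs
)
```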
For more information on each solution, see the respective documentation below:
vLLM ↗ is "a high-throughput and memory-efficient inference and serving engine for LLMs". It requires access to one or more GPUs.
DeepSpeed ↗ is "a deep learning optimization library that makes distributed training and inference easy, efficient, and effective". It requires access to one or more GPUs.
Transformers ↗ offers "State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX". It can be used on CPUs or GPUs.