Local

This runtime provides an interface to deploy models via different backends optimized for running your own LLMs on the serving infrastructure of your choice.

Modes

Depending on the needs of your AI application, you can choose between chat and text generation, served with our HuggingFace Transformers, vLLM, and DeepSpeed backends.

Chat

Chat is for instruction-tuned or fine-tuned models that expect messages with different roles to be combined through a prompt template. You can customize how the prompt is built and provide a prompt template as plain text or Jinja2. Templates are rendered with a list of messages, where each message is an object with a role, content, and type. You can optionally provide tokens that will be rendered inside the template.
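
As a minimal sketch of how such a template is rendered, the snippet below uses Jinja2 directly. The role markers and the template text are illustrative assumptions, not the runtime's defaults; see the reference for how to register a template with your model.

```python
from jinja2 import Template

# Hypothetical chat template: the role markers below are placeholders,
# not the runtime's built-in format.
template = Template(
    "{% for message in messages %}"
    "<|{{ message.role }}|>\n"
    "{{ message.content }}\n"
    "{% endfor %}"
    "<|assistant|>\n"
)

# Each message is an object with a role, content, and type.
messages = [
    {"role": "system", "content": "You are a helpful assistant.", "type": "text"},
    {"role": "user", "content": "Summarise the plot of Hamlet.", "type": "text"},
]

print(template.render(messages=messages))
```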

Please visit the reference and the chat example for further details.

Text

Our prompting functionality also enables you to use the LLM for text generation, otherwise known as text completion.
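
For a rough picture of what text completion means at the backend level, the snippet below calls the HuggingFace Transformers text-generation pipeline directly; with this runtime you would instead send the prompt in an inference request. The model name is only an example.

```python
from transformers import pipeline

# Plain completion: the model simply continues the prompt,
# with no roles or chat template involved.
generator = pipeline("text-generation", model="gpt2")

result = generator("The quick brown fox", max_new_tokens=20, do_sample=False)
print(result[0]["generated_text"])
```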

Please visit the reference and the text example for further details.

Backends

The three backends we currently support are vLLM, DeepSpeed, and a custom backend leveraging the HuggingFace Transformers library. While these are fundamentally different tools, our local runtime provides a common model-settings.json configuration file and a common inference request-response format for all of them. You can still configure the parameters that are specific to each backend, all within the same .json file.
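
As an illustrative sketch only, the snippet below writes out a minimal model-settings.json. The top-level layout assumes an MLServer-style configuration file, but the implementation string and the keys under extra are hypothetical placeholders; check the reference for the exact names each backend accepts.

```python
import json

# Hypothetical configuration: the implementation string and the keys
# under "extra" are placeholders, not the documented schema.
model_settings = {
    "name": "my-llm",
    "implementation": "my_runtime.LLMRuntime",  # placeholder class path
    "parameters": {
        "extra": {
            "backend": "vllm",  # or "deepspeed" / "transformers"
            "model": "mistralai/Mistral-7B-Instruct-v0.2",
            "tensor_parallel_size": 2,  # multi-GPU serving
            "quantization": "awq",      # quantization before load
        }
    },
}

with open("model-settings.json", "w") as f:
    json.dump(model_settings, f, indent=2)
```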

The table below summarises some of the serving optimizations provided by each backend. Check out our examples on serving models across multiple GPUs here, and on quantizing your models here.

| Optimization | Goal | DeepSpeed | vLLM | HF Transformers | How to Use |
| --- | --- | --- | --- | --- | --- |
| Multi-GPU Serving | Latency, Memory Usage |  |  | 🟠 (available but not optimal) | Define in model-settings.json |
| Quantization (before load) | Memory Usage |  | ✅ (awq, gptq) | ✅ (awq, gptq) | Define in model-settings.json |
| Quantization (upon load) | Memory Usage |  |  | ✅ (bitsandbytes) | Define in model-settings.json |
| Continuous Batching | Latency |  |  |  | Default serving behaviour |
| Attention | Latency |  |  |  | Default serving behaviour |
| K-V Caching | Latency | ✅ (blocked K-V caching) | ✅ (paged attention) | ✅ (zero padding K-V caching) | Default serving behaviour |
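
To make the multi-GPU and quantization rows more concrete, the sketch below shows how the underlying libraries expose these optimizations when called directly; with this runtime you would set the equivalent options in model-settings.json rather than writing this code yourself. Model names and option values are examples only.

```python
from vllm import LLM, SamplingParams
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# vLLM: shard an AWQ-quantized model across two GPUs
# (quantization before load + multi-GPU serving).
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    quantization="awq",
    tensor_parallel_size=2,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=20))
print(outputs[0].outputs[0].text)

# HF Transformers: quantize upon load with bitsandbytes, materialising
# 4-bit weights while the checkpoint is loaded, spread across GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```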

For more information on each solution, see their respective documentation below:

  • vLLM ↗ is a "high-throughput and memory-efficient inference and serving engine for LLMs". It requires access to one or more GPUs.

  • DeepSpeed ↗ is a "deep learning optimization library that makes distributed training and inference easy, efficient, and effective". It requires access to one or more GPUs.

  • Transformers ↗ offers "State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX". It can be used on CPUs or GPUs.
