
Local

This runtime provides an interface to deploy models via different backends optimized for running your own LLMs on the serving infrastructure of your choice.

Modes

Depending on your AI application's needs, you can choose between chat and text generation modes, served with our HuggingFace Transformers, vLLM, or DeepSpeed backends.

Chat

Chat is for instruction-trained or fine-tuned models that expect messages with different roles to be combined through a prompt template. You can customize the prompt usage and provide a prompt template as plain text or Jinja2. Templates are rendered with a list of messages, where each message is an object with a role, content, and type. You can optionally provide tokens that will be rendered inside the template.

Please visit the reference and the chat example for further details.
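
As an illustration of how chat templates work, the sketch below renders a Jinja2 template over a list of role/content messages. The template text, the end-of-turn token, and the rendering context are assumptions made here for illustration only; the runtime's actual template variables are described in the reference.

```python
# Illustrative sketch only: the template, message fields, and special token below are
# assumptions; see the reference for the runtime's actual rendering context.
from jinja2 import Template

# A hypothetical chat template that formats each message by role and appends a final
# assistant turn for the model to complete.
chat_template = Template(
    "{% for message in messages %}"
    "<|{{ message.role }}|>\n{{ message.content }}{{ eot_token }}\n"
    "{% endfor %}"
    "<|assistant|>\n"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant.", "type": "text"},
    {"role": "user", "content": "Summarise Hamlet in one sentence.", "type": "text"},
]

# Render the template with the message list and an optional special token.
print(chat_template.render(messages=messages, eot_token="</s>"))
```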

Text

Our prompting functionality also enables you to use the LLM for text generation, otherwise known as text completion.

Please visit the reference and the text example for further details.

Backends

The three backends we currently support are vLLM, DeepSpeed, and a custom backend leveraging the HuggingFace Transformers library. While these three are fundamentally different tools, our local runtime provides a common model-settings.json configuration file and a common inference request-response format for all of them. You can still configure the parameters that are specific to each backend, all within the same .json file.
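
For example, because every backend sits behind the same request-response interface, a chat request could look roughly like the sketch below. It assumes an MLServer-style Open Inference Protocol endpoint, a model named local-llm, and a single BYTES input carrying the JSON-encoded message list; the exact input names and payload encoding used by the runtime are described in the reference.

```python
# Sketch only: assumes an MLServer-style Open Inference Protocol (V2) endpoint and a
# model named "local-llm"; the input name and payload encoding are assumptions here.
import json
import requests

messages = [
    {"role": "user", "content": "What is continuous batching?", "type": "text"},
]

payload = {
    "inputs": [
        {
            "name": "messages",  # hypothetical input name
            "shape": [1],
            "datatype": "BYTES",
            "data": [json.dumps(messages)],
        }
    ]
}

response = requests.post(
    "http://localhost:8080/v2/models/local-llm/infer",  # default MLServer HTTP port
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json()["outputs"][0]["data"])
```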

In the table below, you can see some of the serving optimizations provided by each backend option. Check out our examples on how to serve models across multiple GPUs here, and on how to quantize your models here.

| Optimization | Goal | DeepSpeed | vLLM | HF Transformers | How to Use |
| --- | --- | --- | --- | --- | --- |
| Multi-GPU Serving | Latency, Memory Usage | ✅ | ✅ | 🟠 (available but not optimal) | Define in model-settings.json |
| Quantization (before load) | Memory Usage | | ✅ (awq, gptq) | ✅ (awq, gptq) | Define in model-settings.json |
| Quantization (upon load) | Memory Usage | | | ✅ (bitsandbytes) | Define in model-settings.json |
| Continuous Batching | Latency | | ✅ | | Default serving behaviour |
| Attention | Latency | | | | Default serving behaviour |
| K-V Caching | Latency | ✅ (blocked K-V caching) | ✅ (paged attention) | ✅ (zero padding K-V caching) | Default serving behaviour |
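
As a rough illustration of the "Define in model-settings.json" entries above, the sketch below writes a hypothetical configuration for the vLLM backend that enables multi-GPU serving through tensor parallelism and loads a checkpoint quantized before load with AWQ. The implementation string and the placement of the backend-specific keys are assumptions made here; the exact schema for each backend is covered in the reference.

```python
# Hypothetical sketch of a model-settings.json for the vLLM backend. Only
# tensor_parallel_size and quantization are standard vLLM engine arguments; the
# implementation string and where backend-specific keys live are assumptions.
import json

model_settings = {
    "name": "local-llm",
    "implementation": "...",  # the runtime class for the vLLM backend (see the reference)
    "parameters": {
        "extra": {
            # Backend-specific serving optimizations from the table above.
            "tensor_parallel_size": 2,  # multi-GPU serving across two GPUs
            "quantization": "awq",      # serve a checkpoint quantized before load
        }
    },
}

# Write the configuration next to the model so the runtime can pick it up.
with open("model-settings.json", "w") as f:
    json.dump(model_settings, f, indent=2)
```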

For more information on each solution, see its respective documentation below:

  • vLLM is "a high-throughput and memory-efficient inference and serving engine for LLMs". It requires access to GPU(s).

  • DeepSpeed is "a deep learning optimization library that makes distributed training and inference easy, efficient, and effective". It requires access to GPU(s).

  • Transformers is "State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX". It can be used on CPUs or GPUs.
