# Local

This runtime provides an interface to deploy models via different backends optimized for running your own LLMs on the serving infrastructure of your choice.

## Modes

Depending on your AI application needs, you can choose from chat and text generation with our `HuggingFace Transformers`, `vLLM`, and `DeepSpeed` backends.

### Chat

Chat is for instruction-trained or fine-tuned models that expect the usage of different roles with a prompt template. You can customize the prompt usage and provide a prompt template using plain text or [Jinja2](https://jinja.palletsprojects.com/). Templates will be rendered with a list of `messages` where each one is an object with a `role`, `content`, and `type`. You can optionally provide tokens that will be rendered inside the template.

Please visit the [reference](https://github.com/SeldonIO/llm-runtimes/blob/master/docs-gb/models/reference/local/README.md) and the [chat example](https://github.com/SeldonIO/llm-runtimes/blob/master/docs-gb/models/examples/local/transformers-chat/README.md) for further details.

### Text

Our prompting functionality also enables you to use the LLM for text generation, otherwise known as text completion.

Please visit the [reference](https://github.com/SeldonIO/llm-runtimes/blob/master/docs-gb/models/reference/local/README.md) and the [text example](https://github.com/SeldonIO/llm-runtimes/blob/master/docs-gb/models/examples/local/transformers-text/README.md) for further details.

## Backends

The three backends we currently support are [vLLM](https://docs.vllm.ai/en/latest/), [DeepSpeed](https://www.deepspeed.ai), and a custom backend leveraging the [HuggingFace Transformers](https://huggingface.co/docs/transformers/index) library. While these three are fundamentally different tools, our local runtime provides a common `model-settings.json` configuration file and an inference request-response for them. You are still able to configure the specific parameters that differ between each runtime, all within the same `.json` file.

In the table below, you can see some of the serving optimizations provided by each backend option. Check out our examples on how you can serve models across multiple GPUs [here](https://github.com/SeldonIO/llm-runtimes/blob/master/docs-gb/models/examples/local/tensor-parallelism/README.md), or for how to quantize your models see [here](https://github.com/SeldonIO/llm-runtimes/blob/master/docs-gb/models/examples/local/quantization/README.md).

| Optimization                         | Goal                  | DeepSpeed               | vLLM                | HF Transformers                | How to Use                      |
| ------------------------------------ | --------------------- | ----------------------- | ------------------- | ------------------------------ | ------------------------------- |
| Multi-GPU Serving                    | Latency, Memory Usage | ✅                       | ✅                   | 🟠 (available but not optimal) | Define in `model-settings.json` |
| <p>Quantization<br>(before load)</p> | Memory Usage          | ❌                       | ✅ (awq, gptq)       | ✅ (awq, gptq)                  | Define in `model-settings.json` |
| <p>Quantization<br>(upon load)</p>   | Memory Usage          | ✅                       | ❌                   | ✅ (bitsandbytes)               | Define in `model-settings.json` |
| Continuous Batching                  | Latency               | ✅                       | ✅                   | ✅                              | Default serving behaviour       |
| Attention                            | Latency               | ✅                       | ✅                   | ✅                              | Default serving behaviour       |
| K-V Caching                          | Latency               | ✅ (blocked K-V caching) | ✅ (paged attention) | ✅ (zero padding K-V caching)   | Default serving behaviour       |

For more information on each solution, see their respective documentations below:

* [vLLM ↗](https://github.com/vllm-project/vllm) is a ".. high-throughput and memory-efficient inference and serving engine for LLMs". It requires GPU(s) access.
* [DeepSpeed ↗](https://github.com/microsoft/DeepSpeed) is a ".. deep learning optimization library that makes distributed training and inference easy, efficient, and effective". It requires GPU(s) access.
* [Transformers ↗](https://github.com/huggingface/transformers) is a ".. State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX". It can be used on CPUs or GPUs.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.seldon.ai/llm-module/components/models/local.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
