# Local

## Transformers

The following settings can be set in the `model-settings.json` file, inside the `model_settings` object. The example below extends a previous example with the `device` and `max_tokens` parameters:

```json
{
    "name": "Llama-2-7B-chat-AWQ",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "transformers",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "TheBloke/Llama-2-7B-chat-AWQ",
                    "device": "cuda",
                    "max_tokens": -1
                }
            }
        }
    }
}
```

For more information on Transformers-specific settings, see the `from_pretrained` documentation for [Causal LMs](https://huggingface.co/docs/transformers/v4.44.0/en/model_doc/auto#transformers.AutoModelForCausalLM.from_pretrained) and for [Seq2Seq LMs](https://huggingface.co/docs/transformers/v4.44.0/en/model_doc/auto#transformers.AutoModelForSeq2SeqLM.from_pretrained).
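
If the settings file is generated from a script, the nesting shown in the example above can be reproduced programmatically. The following is a minimal sketch; the helper name `build_model_settings` is hypothetical and not part of the runtime, and the layout simply mirrors the JSON example above:

```python
import json


def build_model_settings(name, backend, model_settings, model_type="chat.completions"):
    """Return a dict shaped like the model-settings.json example above.

    Hypothetical helper for illustration only; it is not part of the runtime.
    """
    return {
        "name": name,
        "implementation": "mlserver_llm_local.runtime.Local",
        "parameters": {
            "extra": {
                "backend": backend,
                "config": {
                    "model_type": model_type,
                    # Backend-specific settings documented on this page go here.
                    "model_settings": dict(model_settings),
                },
            }
        },
    }


settings = build_model_settings(
    name="Llama-2-7B-chat-AWQ",
    backend="transformers",
    model_settings={
        "model": "TheBloke/Llama-2-7B-chat-AWQ",
        "device": "cuda",
        "max_tokens": -1,
    },
)

# Serialise with the same indentation as the examples on this page.
print(json.dumps(settings, indent=4))
```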

### Model Settings

#### **model: `str`**

**Description:** Model name or path of the HuggingFace model to be deployed.

#### **load\_kwargs: `dict`**

**Description:** Extra arguments to be passed to the `from_pretrained` method of the HuggingFace model.\
**Default:** `{}`

#### **device: `Literal["cpu", "cuda", "auto"]`**

**Description:** Device to be used for the model. If `auto`, the device will be automatically selected based on the availability of GPUs.\
**Default:** `"auto"`

#### **dtype: `str`**

**Description:** Data type to be used for the model. If `auto`, the data type is selected based on the device: `torch.float32` when the device is `cpu` and `torch.float16` when the device is `cuda`.\
**Default:** `"auto"`
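
The `auto` rule described above can be sketched as a small function. This is an illustration of the documented behaviour, not the runtime's code: the function name is hypothetical, data types are returned as plain strings rather than `torch` objects, and it assumes the device has already been resolved to `cpu` or `cuda`.

```python
def resolve_dtype(dtype: str, device: str) -> str:
    """Illustrative only: an explicit dtype is kept, "auto" follows the device."""
    if dtype != "auto":
        # Assumption: any explicitly configured value is used unchanged.
        return dtype
    # Documented rule: float16 on GPU, float32 on CPU.
    return "float16" if device == "cuda" else "float32"


print(resolve_dtype("auto", "cpu"))
print(resolve_dtype("auto", "cuda"))
```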

#### **tensor\_parallel\_size: `int`**

**Description:** Number of GPUs to be used for tensor parallelism. If `1`, tensor parallelism will not be used.\
**Default:** `1`

#### **pipeline\_parallel\_size: `int`**

**Description:** Number of GPUs to be used for pipeline parallelism. If `1`, pipeline parallelism will not be used. This is not supported yet.\
**Default:** `1`

#### **enable\_profile: `bool`**

**Description:** Whether to enable profiling of the model.\
**Default:** `False`

#### **profile\_kwargs: `Optional["ProfilerSettings"]`**

**Description:** Profiling settings. If `None`, default settings will be used.\
**Default:** `None`

#### **enable\_optimisation: `bool`**

**Description:** Whether to enable optimisation of the model.\
**Default:** `False`

#### **optimisation\_kwargs: `Optional["OptimisationSettings"]`**

**Description:** Optimisation settings. If `None`, default settings will be used.\
**Default:** `None`

#### **config: `Optional["PretrainedConfig"]`**

**Description:** Model configuration. If `None`, the model configuration will be automatically loaded from the model path.\
**Default:** `None`

#### **max\_model\_len: `Optional[int]`**

**Description:** Maximum sequence length the model can handle. If `None`, the maximum model length will be automatically inferred from the model configuration.\
**Default:** `None`

#### **max\_tokens: `Optional[int]`**

**Description:** Maximum number of tokens which can be processed by the model at a time. If `None`, the maximum number of tokens is set to -1 if the device is CPU or inferred from `gpu_memory_utilization` if the device is GPU. Note that -1 means that there is no limit on the number of tokens.\
**Default:** `None`

#### **max\_num\_seqs: `int`**

**Description:** Maximum number of sequences in a batch. Setting it to -1 means that there is no limit on the number of sequences.\
**Default:** `256`

#### **max\_paddings: `int`**

**Description:** Maximum number of paddings in a batch. Setting it to -1 means that there is no limit on the amount of padding.\
**Default:** `2048`

#### **gpu\_memory\_utilization: `float`**

**Description:** The fraction of the GPU memory to be used.\
**Default:** `0.9`

#### **default\_generate\_kwargs: `Dict[str, Any]`**

**Description:** Dictionary of default values for the generate request kwargs.\
**Default:** `{}`
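
One way to think about `default_generate_kwargs` is as a base dictionary that per-request values are layered on top of. The sketch below assumes that values sent with a request take precedence over the configured defaults; that precedence is an assumption made for illustration and is not confirmed by this page, and the function name is hypothetical.

```python
def effective_generate_kwargs(defaults: dict, request_kwargs: dict) -> dict:
    """Combine configured defaults with per-request kwargs.

    Assumption: request values override defaults when both are present.
    """
    merged = dict(defaults)        # copy so the configured defaults stay untouched
    merged.update(request_kwargs)  # request-level values win in this sketch
    return merged


defaults = {"max_tokens": 256, "temperature": 0.7}
print(effective_generate_kwargs(defaults, {"temperature": 0.0}))
```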

#### **stream: `bool`**

**Description:** Whether to stream the output.\
**Default:** `False`

#### **skip\_special\_tokens: `bool`**

**Description:** Whether to skip special tokens. Potentially useful when used in conjunction with the `ignore_eos` sampling flag.\
**Default:** `True`

#### **max\_tokens: `int`**

**Description:** Maximum number of tokens to generate per output sequence.\
**Default:** `20`

#### **ignore\_eos: `bool`**

**Description:** Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.\
**Default:** `False`

#### **temperature: `float`**

**Description:** Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.\
**Default:** `1.0`
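
The effect of `temperature` can be illustrated with a temperature-scaled softmax over a toy set of logits. This is a generic sketch of the technique, not the backend's implementation, and the function names are hypothetical. A temperature of zero is handled separately as greedy selection, since dividing by zero is undefined.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("use greedy_token() for a temperature of zero")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_token(logits):
    """Temperature zero: always pick the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    # Lower temperatures concentrate probability on the top token.
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```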

#### **repetition\_penalty: `float`**

**Description:** Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.\
**Default:** `1.0`
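
A common way to implement this kind of penalty is to shrink the logits of tokens that have already appeared: positive logits are divided by the penalty and negative logits are multiplied by it. The sketch below shows that scheme on plain Python lists; it is an illustration under that assumption, with a hypothetical function name, and the backend may differ in detail.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Return a copy of logits with already-seen tokens penalised."""
    out = list(logits)
    for token_id in set(seen_token_ids):
        if out[token_id] > 0:
            out[token_id] /= penalty  # penalty > 1 lowers a positive score
        else:
            out[token_id] *= penalty  # penalty > 1 pushes a negative score further down
    return out


# Tokens 0 and 1 appeared in the prompt or the output so far; token 2 did not.
print(apply_repetition_penalty([2.0, -1.0, 0.5], seen_token_ids=[0, 1], penalty=2.0))
```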

#### **top\_p: `float`**

**Description:** Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.\
**Default:** `1.0`

#### **top\_k: `int`**

**Description:** Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens.\
**Default:** `-1`
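
The interaction between `top_k` and `top_p` can be sketched on a toy probability distribution. The example below keeps the `top_k` most likely tokens, renormalises, and then keeps the smallest prefix whose cumulative probability reaches `top_p`. It is a simplified illustration with a hypothetical function name; real backends operate on logits and may order the two filters differently.

```python
def top_k_top_p_filter(probs, top_k=-1, top_p=1.0):
    """Return the token ids that survive top-k and then top-p filtering."""
    # Token ids ordered from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k != -1:
        order = order[:top_k]  # -1 means "consider all tokens"

    total = sum(probs[i] for i in order)  # renormalise over the remaining tokens
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / total
        if cumulative >= top_p:
            break
    return kept


probs = [0.5, 0.25, 0.125, 0.125]
print(top_k_top_p_filter(probs))              # defaults keep every token
print(top_k_top_p_filter(probs, top_k=2))     # only the two most likely tokens
print(top_k_top_p_filter(probs, top_p=0.75))  # smallest set covering 75% of the mass
```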

## vLLM

The following settings can be set in the `model-settings.json` file, inside the `model_settings` object. The example below extends a previous example with the `gpu_memory_utilization` parameter:

```json
{
    "name": "tinyllama-1-1b-chat",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                    "gpu_memory_utilization": 0.4
                }
            }
        }
    }
}
```

{% hint style="info" %}
A comprehensive list of vLLM model settings is available [here](https://docs.vllm.ai/en/latest/models/engine_args.html). All of those parameters can be passed through the `model-settings.json` file, as in the example above.
{% endhint %}

#### **default\_generate\_kwargs: `Dict[str, Any]`**

**Description:** Dictionary of default values for the generate request kwargs. For the supported parameters, see the [vLLM sampling parameters](https://docs.vllm.ai/en/latest/dev/sampling_params.html).\
**Default:** `{}`

#### **stream: `bool`**

**Description:** Whether to stream the output.\
**Default:** `False`

#### **skip\_special\_tokens: `bool`**

**Description:** Whether to skip special tokens. Potentially useful when used in conjunction with the `ignore_eos` sampling flag.\
**Default:** `True`

## DeepSpeed

The following settings can be set in the `model-settings.json` file, inside the `model_settings` object. The example below extends a previous example with the `inference_engine_config` object:

```json
{
    "name": "opt-2.7b-text",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "deepspeed",
            "model_settings": {
                "model": "facebook/opt-2.7b",
                "inference_engine_config": {
                    "state_manager": {
                        "memory_config": {
                            "mode": "allocate",
                            "size": 50
                        }
                    }
                }
            }
        }
    }
}
```

For more information on DeepSpeed-specific settings, see the DeepSpeed source definitions of the [inference engine configuration](https://github.com/microsoft/DeepSpeed/blob/ade7149db491cec71d49814013f1be2d7041dbdc/deepspeed/inference/v2/config_v2.py#L29), the [state manager configuration](https://github.com/microsoft/DeepSpeed/blob/ade7149db491cec71d49814013f1be2d7041dbdc/deepspeed/inference/v2/ragged/manager_configs.py#L137), the [memory configuration](https://github.com/microsoft/DeepSpeed/blob/ade7149db491cec71d49814013f1be2d7041dbdc/deepspeed/inference/v2/ragged/manager_configs.py#L120), and the [quantization configuration](https://github.com/microsoft/DeepSpeed/blob/ade7149db491cec71d49814013f1be2d7041dbdc/deepspeed/inference/v2/config_v2.py#L19).

### Model Settings

#### **model: `str`**

**Description:** Model name or path of the HuggingFace model to be deployed.

#### **tokenizer: `Optional[str]`**

**Description:** Name or path of the HuggingFace tokenizer to be used.\
**Default:** `None`

#### **load\_kwargs: `dict`**

**Description:** Extra arguments to be passed to the `from_pretrained` method of the HuggingFace model, such as `trust_remote_code` and `revision` used to load the model configuration.\
**Default:** `{}`

#### **device: `Literal["cuda"]`**

**Description:** Device to be used for the model. Only supports `cuda`.\
**Default:** `"cuda"`

#### **tensor\_parallel\_size: `int`**

**Description:** Tensor parallelism to use for a model (i.e., how many GPUs to shard a model across). This defaults to the `WORLD_SIZE` environment variable, or a value of 1 if that variable is not set. This value is also propagated to the `inference_engine_config`.\
**Default:** `1`
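
The fallback described above can be sketched as follows. The function name is hypothetical and the snippet only illustrates the documented rule of reading `WORLD_SIZE` and falling back to 1; it is not the runtime's code.

```python
import os


def default_tensor_parallel_size(env=None) -> int:
    """Read WORLD_SIZE from the environment, falling back to 1 when it is unset."""
    env = os.environ if env is None else env
    return int(env.get("WORLD_SIZE", 1))


print(default_tensor_parallel_size({"WORLD_SIZE": "4"}))
print(default_tensor_parallel_size({}))
```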

#### **inference\_engine\_config: `Dict[str, Any]`**

**Description:** DeepSpeed inference engine config. This is automatically generated, but you can provide a set of custom configs.\
**Default:** `{}`

#### **torch\_dist\_port: `int`**

**Description:** Torch distributed port to be used. This also serves as a base port when multiple replicas are deployed. For example, if there are 2 replicas, the first will use port 29500 and the second will use port 29600.\
**Default:** `29500`
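
The port assignment in the example above can be sketched as a base port plus a per-replica offset. The stride of 100 is inferred from the two ports mentioned in the description (29500 and 29600) and is an assumption, as is the function name.

```python
def replica_port(base_port: int = 29500, replica_index: int = 0, stride: int = 100) -> int:
    """Port for a given replica, assuming a fixed stride between replicas."""
    return base_port + replica_index * stride


# Two replicas, matching the ports given in the description above.
print([replica_port(replica_index=i) for i in range(2)])
```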

#### **max\_model\_len: `Optional[int]`**

**Description:** The maximum number of tokens DeepSpeed-Inference can work with, including the input and output tokens.\
**Default:** `None`

#### **worker\_use\_ray: `bool`**

**Description:** Whether to initialise the worker in a separate process when `tensor_parallel_size` equals 1.\
**Default:** `False`

#### **config: `Optional["PretrainedConfig"]`**

**Description:** Model configuration. If `None`, the model configuration will be automatically loaded from the model path.\
**Default:** `None`

#### **quantization\_mode: `Optional[str]`**

**Description:** The quantization mode in string format. The supported modes are as follows:

* `'wf6af16'`: Weight-only quantization with FP6 weight and FP16 activation.

**Default:** `None`

#### **default\_generate\_kwargs: `Dict[str, Any]`**

**Description:** Dictionary of default values for the generate request kwargs.\
**Default:** `{}`

#### **stream: `bool`**

**Description:** Whether to stream the output.\
**Default:** `False`

#### **skip\_special\_tokens: `bool`**

**Description:** Whether to skip special tokens. Potentially useful when used in conjunction with the `ignore_eos` sampling flag.\
**Default:** `True`

#### **max\_tokens: `int`**

**Description:** Maximum number of tokens to generate per output sequence.\
**Default:** `20`

#### **ignore\_eos: `bool`**

**Description:** Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.\
**Default:** `False`

#### **temperature: `float`**

**Description:** Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.\
**Default:** `1.0`

#### **repetition\_penalty: `float`**

**Description:** Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.\
**Default:** `1.0`

#### **top\_p: `float`**

**Description:** Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.\
**Default:** `1.0`

#### **top\_k: `int`**

**Description:** Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens.\
**Default:** `-1`

