Local

Transformers

The following settings can be passed through the model-settings.json file within the model object. For example, the configuration below adds the device and max_tokens parameters to a previous example:

{
    "name": "Llama-2-7B-chat-AWQ",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "transformers",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "TheBloke/Llama-2-7B-chat-AWQ",
                    "device": "cuda",
                    "max_tokens": -1
                }
            }
        }
    }
}

For more information on Transformers-specific settings, see here for Causal LMs and here for Seq2Seq LMs.

Model Settings

model: str

Description: Model name or path of the HuggingFace model to be deployed.

load_kwargs: dict

Description: Extra arguments to be passed to the from_pretrained method of the HuggingFace model. Default: {}
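
For example, a sketch forwarding trust_remote_code and a pinned revision to from_pretrained might look like the following (the values shown are illustrative only):

{
    "name": "Llama-2-7B-chat-AWQ",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "transformers",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "TheBloke/Llama-2-7B-chat-AWQ",
                    "load_kwargs": {
                        "trust_remote_code": true,
                        "revision": "main"
                    }
                }
            }
        }
    }
}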

device: Literal["cpu", "cuda", "auto"]

Description: Device to be used for the model. If auto, the device will be automatically selected based on the availability of GPUs. Default: "auto"

dtype: str = "auto"

Description: Data type to be used for the model. If auto, the data type will be automatically selected based on the availability of GPUs. If cpu is selected, the data type will be torch.float32. If cuda is selected, the data type will be torch.float16. Default: "auto"

tensor_parallel_size: int

Description: Number of GPUs to be used for tensor parallelism. If 1, tensor parallelism will not be used. Default: 1
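
As a sketch, pinning the model to GPU with half precision and sharding it across two GPUs might be configured as follows (the "float16" dtype string and the GPU count are assumptions for illustration):

{
    "name": "Llama-2-7B-chat-AWQ",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "transformers",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "TheBloke/Llama-2-7B-chat-AWQ",
                    "device": "cuda",
                    "dtype": "float16",
                    "tensor_parallel_size": 2
                }
            }
        }
    }
}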

pipeline_parallel_size: int

Description: Number of GPUs to be used for pipeline parallelism. If 1, pipeline parallelism will not be used. This is not supported yet. Default: 1

enable_profile: bool

Description: Whether to enable profiling of the model. Default: False

profile_kwargs: Optional["ProfilerSettings"]

Description: Profiling settings. If None, default settings will be used. Default: None

enable_optimisation: bool

Description: Whether to enable optimisation of the model. Default: False

optimisation_kwargs: Optional["OptimisationSettings"]

Description: Optimisation settings. If None, default settings will be used. Default: None

config: Optional["PretrainedConfig"]

Description: Model configuration. If None, the model configuration will be automatically loaded from the model path. Default: None

max_model_len: Optional[int]

Description: Maximum sequence length the model can handle. If None, the maximum model length will be automatically inferred from the model configuration. Default: None

max_tokens: Optional[int]

Description: Maximum number of tokens which can be processed by the model at a time. If None, the maximum number of tokens is set to -1 if the device is CPU or inferred from gpu_memory_utilization if the device is GPU. Note that -1 means that there is no limit on the number of tokens. Default: None

max_num_seqs: int

Description: Maximum number of sequences in a batch. Setting it to -1 means that there is no limit on the number of sequences. Default: 256

max_paddings: int

Description: Maximum number of paddings in a batch. Setting it to -1 means that there is no limit on the amount of padding. Default: 2048

gpu_memory_utilization: float

Description: The fraction of the GPU memory to be used. Default: 0.9

default_generate_kwargs: Dict[str, Any]

Description: Dictionary of default values for the generate request kwargs. Default: {}

stream: bool

Description: Whether to stream the output. Default: False

skip_special_tokens: bool

Description: Whether to skip special tokens. Potentially useful when used in conjunction with the ignore_eos sampling flag. Default: True

max_tokens: int

Description: Maximum number of tokens to generate per output sequence. Default: 20

ignore_eos: bool

Description: Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. Default: False

temperature: float

Description: Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling. Default: 1.0

repetition_penalty: float

Description: Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. Default: 1.0

top_p: float

Description: Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. Default: 1.0

top_k: int

Description: Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens. Default: -1
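
Putting the generation settings above together, server-side defaults can be sketched via default_generate_kwargs as follows (this assumes the flags above are accepted as keys of default_generate_kwargs; the values are arbitrary):

{
    "name": "Llama-2-7B-chat-AWQ",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "transformers",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "TheBloke/Llama-2-7B-chat-AWQ",
                    "default_generate_kwargs": {
                        "max_tokens": 256,
                        "temperature": 0.7,
                        "top_p": 0.9,
                        "top_k": 50
                    }
                }
            }
        }
    }
}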

vLLM

The following settings can be passed through the model-settings.json file within the model object. For example, the configuration below adds the gpu_memory_utilization parameter to a previous example:

{
    "name": "tinyllama-1-1b-chat",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
             "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                    "gpu_memory_utilization": 0.4
                }
            }
        }
    }
}

A comprehensive list of the vLLM model settings that can be passed can be viewed here. All of these parameters can be passed via the model-settings.json file as shown above.

default_generate_kwargs: Dict[str, Any]

Description: Dictionary of default values for the generate request kwargs. The parameters that can be passed can be seen here. Default: {}

stream: bool

Description: Whether to stream the output. Default: False

skip_special_tokens: bool

Description: Whether to skip special tokens. Potentially useful when used in conjunction with the ignore_eos sampling flag. Default: True
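
For example, a sketch extending the earlier vLLM example with streaming and default generation kwargs might look like the following (max_tokens and temperature are standard vLLM sampling parameters; the values are illustrative):

{
    "name": "tinyllama-1-1b-chat",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                    "gpu_memory_utilization": 0.4,
                    "stream": true,
                    "default_generate_kwargs": {
                        "max_tokens": 128,
                        "temperature": 0.5
                    }
                }
            }
        }
    }
}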

DeepSpeed

The following settings can be passed through the model-settings.json file within the model object. For example, the configuration below adds the inference_engine_config object to a previous example:

{
    "name": "opt-2.7b-text",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "deepspeed",
            "model_settings": {
                "model": "facebook/opt-2.7b",
                "inference_engine_config": {
                    "state_manager": {
                        "memory_config": {
                            "mode": "allocate",
                            "size": 50
                        }
                    }
                }
            }
        }
    }
}

For more information on DeepSpeed-specific settings, see here for inference engine configuration, here for state manager configuration, here for memory configuration, and here for quantization configuration.

Model Settings

model: str

Description: Model name or path of the HuggingFace model to be deployed.

tokenizer: Optional[str]

Description: Name or path of the HuggingFace tokenizer to be used. Default: None

load_kwargs: dict

Description: Extra arguments to be passed to the from_pretrained method of the HuggingFace model, such as trust_remote_code and revision used to load the model configuration. Default: {}

device: Literal["cuda"]

Description: Device to be used for the model. Only supports cuda. Default: "cuda"

tensor_parallel_size: int

Description: Tensor parallelism to use for a model (i.e., how many GPUs to shard a model across). This defaults to the WORLD_SIZE environment variable, or a value of 1 if that variable is not set. This value is also propagated to the inference_engine_config. Default: 1
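
For example, a sketch sharding the earlier OPT model across two GPUs might look like the following (two GPUs is an assumption for illustration; the value is propagated to the inference_engine_config automatically):

{
    "name": "opt-2.7b-text",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "deepspeed",
            "model_settings": {
                "model": "facebook/opt-2.7b",
                "tensor_parallel_size": 2
            }
        }
    }
}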

inference_engine_config: Dict[str, Any]

Description: DeepSpeed inference engine config. This is automatically generated, but you can provide a set of custom configs. Default: {}

torch_dist_port: int

Description: Torch distributed port to be used. This also serves as a base port when multiple replicas are deployed. For example, if there are 2 replicas, the first will use port 29500 and the second will use port 29600. Default: 29500

max_model_len: Optional[int]

Description: The maximum number of tokens DeepSpeed-Inference can work with, including the input and output tokens. Default: None

worker_use_ray: bool

Description: Whether to initialise the worker in a separate process when tensor_parallel_size equals 1. Default: False

config: Optional["PretrainedConfig"]

Description: Model configuration. If None, the model configuration will be automatically loaded from the model path. Default: None

quantization_mode: Optional[str]

Description: The quantization mode in string format. The supported modes are as follows:

  • 'wf6af16': Weight-only quantization with FP6 weight and FP16 activation.

Default: None
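
A minimal sketch enabling this mode on the earlier example might look like the following (support for wf6af16 depends on the model and the DeepSpeed version in use):

{
    "name": "opt-2.7b-text",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "deepspeed",
            "model_settings": {
                "model": "facebook/opt-2.7b",
                "quantization_mode": "wf6af16"
            }
        }
    }
}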

default_generate_kwargs: Dict[str, Any]

Description: Dictionary of default values for the generate request kwargs. Default: {}

stream: bool

Description: Whether to stream the output. Default: False

skip_special_tokens: bool

Description: Whether to skip special tokens. Potentially useful when used in conjunction with the ignore_eos sampling flag. Default: True

max_tokens: int

Description: Maximum number of tokens to generate per output sequence. Default: 20

ignore_eos: bool

Description: Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. Default: False

temperature: float

Description: Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling. Default: 1.0

repetition_penalty: float

Description: Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. Default: 1.0

top_p: float

Description: Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. Default: 1.0

top_k: int

Description: Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens. Default: -1
