Local
Transformers
The following settings can be passed through the model-settings.json file within the model object. For example, the configuration below builds on a previous example, adding the device and max_tokens parameters:
{
  "name": "Llama-2-7B-chat-AWQ",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "transformers",
      "config": {
        "model_type": "chat.completions",
        "model_settings": {
          "model": "TheBloke/Llama-2-7B-chat-AWQ",
          "device": "cuda",
          "max_tokens": -1
        }
      }
    }
  }
}
For more information on Transformers-specific settings see here for Causal LMs and here for Seq2Seq LMs.
Model Settings
model: str
Description: Model name or path of the HuggingFace model to be deployed.
load_kwargs: dict
Description: Extra arguments to be passed to the from_pretrained method of the HuggingFace model.
Default: {}
device: Literal["cpu", "cuda", "auto"]
Description: Device to be used for the model. If auto, the device will be automatically selected based on the availability of GPUs.
Default: "auto"
dtype: str = "auto"
Description: Data type to be used for the model. If auto, the data type will be automatically selected based on the availability of GPUs. If the device is cpu, the data type will be torch.float32. If the device is cuda, the data type will be torch.float16.
Default: "auto"
tensor_parallel_size: int
Description: Number of GPUs to be used for tensor parallelism. If 1, tensor parallelism will not be used.
Default: 1
pipeline_parallel_size: int
Description: Number of GPUs to be used for pipeline parallelism. If 1, pipeline parallelism will not be used. This is not supported yet.
Default: 1
enable_profile: bool
Description: Whether to enable profiling of the model.
Default: False
profile_kwargs: Optional["ProfilerSettings"]
Description: Profiling settings. If None, default settings will be used.
Default: None
enable_optimisation: bool
Description: Whether to enable optimisation of the model.
Default: False
optimisation_kwargs: Optional["OptimisationSettings"]
Description: Optimisation settings. If None, default settings will be used.
Default: None
config: Optional["PretrainedConfig"]
Description: Model configuration. If None, the model configuration will be automatically loaded from the model path.
Default: None
max_model_len: Optional[int]
Description: Maximum sequence length the model can handle. If None, the maximum model length will be automatically inferred from the model configuration.
Default: None
max_tokens: Optional[int]
Description: Maximum number of tokens which can be processed by the model at a time. If None, the maximum number of tokens is set to -1 if the device is CPU or inferred from gpu_memory_utilization if the device is GPU. Note that -1 means that there is no limit on the number of tokens.
Default: None
max_num_seqs: int
Description: Maximum number of sequences in a batch. Setting it to -1 means that there is no limit on the number of sequences.
Default: 256
max_paddings: int
Description: Maximum number of paddings in a batch. Setting it to -1 means that there is no limit on the amount of padding.
Default: 2048
gpu_memory_utilization: float
Description: The fraction of the GPU memory to be used.
Default: 0.9
default_generate_kwargs: Dict[str, Any]
Description: Dictionary of default values for the generate request kwargs.
Default: {}
stream: bool
Description: Whether to stream the output.
Default: False
skip_special_tokens: bool
Description: Whether to skip special tokens. Potentially useful when used in conjunction with the ignore_eos sampling flag.
Default: True
max_tokens: int
Description: Maximum number of tokens to generate per output sequence.
Default: 20
ignore_eos: bool
Description: Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.
Default: False
temperature: float
Description: Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.
Default: 1.0
repetition_penalty: float
Description: Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.
Default: 1.0
top_p: float
Description: Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
Default: 1.0
top_k: int
Description: Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens.
Default: -1
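As a minimal sketch, several of the settings above can be combined in a single model-settings.json entry. The example below reuses the Llama-2 model from earlier; the load_kwargs values and the contents of default_generate_kwargs are illustrative assumptions (they assume the generation parameters listed above are valid keys for default_generate_kwargs), not required values:
{
  "name": "Llama-2-7B-chat-AWQ",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "transformers",
      "config": {
        "model_type": "chat.completions",
        "model_settings": {
          "model": "TheBloke/Llama-2-7B-chat-AWQ",
          "device": "cuda",
          "dtype": "auto",
          "load_kwargs": {
            "trust_remote_code": true,
            "revision": "main"
          },
          "default_generate_kwargs": {
            "max_tokens": 64,
            "temperature": 0.7,
            "top_p": 0.9
          }
        }
      }
    }
  }
}
Generation parameters supplied on an individual request would be expected to take precedence over these defaults.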
vLLM
The following settings can be passed through the model-settings.json file within the model object. For example, the configuration below builds on a previous example, adding the gpu_memory_utilization parameter:
{
  "name": "tinyllama-1-1b-chat",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "vllm",
      "config": {
        "model_type": "chat.completions",
        "model_settings": {
          "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          "gpu_memory_utilization": 0.4
        }
      }
    }
  }
}
default_generate_kwargs: Dict[str, Any]
Description: Dictionary of default values for the generate request kwargs. The parameters that can be passed can be seen here.
Default: {}
stream: bool
Description: Whether to stream the output.
Default: False
skip_special_tokens: bool
Description: Whether to skip special tokens. Potentially useful when used in conjunction with the ignore_eos sampling flag.
Default: True
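A hedged sketch of a fuller vLLM entry, extending the earlier example with the three settings above; the temperature value inside default_generate_kwargs is illustrative and assumes it is among the vLLM sampling parameters linked above:
{
  "name": "tinyllama-1-1b-chat",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "vllm",
      "config": {
        "model_type": "chat.completions",
        "model_settings": {
          "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          "gpu_memory_utilization": 0.4,
          "stream": true,
          "skip_special_tokens": true,
          "default_generate_kwargs": {
            "temperature": 0.7
          }
        }
      }
    }
  }
}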
DeepSpeed
The following settings can be passed through the model-settings.json file within the model object. For example, the configuration below builds on a previous example, adding the inference_engine_config object:
{
  "name": "opt-2.7b-text",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "deepspeed",
      "model_settings": {
        "model": "facebook/opt-2.7b",
        "inference_engine_config": {
          "state_manager": {
            "memory_config": {
              "mode": "allocate",
              "size": 50
            }
          }
        }
      }
    }
  }
}
For more information on DeepSpeed-specific settings see here for inference engine configuration, here for state manager configuration, here for memory configuration, and here for quantization configuration.
Model Settings
model: str
Description: Model name or path of the HuggingFace model to be deployed.
tokenizer: Optional[str]
Description: Name or path of the HuggingFace tokenizer to be used.
Default: None
load_kwargs: dict
Description: Extra arguments to be passed to the from_pretrained method of the HuggingFace model, such as trust_remote_code and revision used to load the model configuration.
Default: {}
device: Literal["cuda"]
Description: Device to be used for the model. Only supports cuda.
Default: "cuda"
tensor_parallel_size: int
Description: Tensor parallelism to use for a model (i.e., how many GPUs to shard a model across). This defaults to the WORLD_SIZE environment variable, or a value of 1 if that variable is not set. This value is also propagated to the inference_engine_config.
Default: 1
inference_engine_config: Dict[str, Any]
Description: DeepSpeed inference engine config. This is automatically generated, but you can provide a set of custom configs.
Default: {}
torch_dist_port: int
Description: Torch distributed port to be used. This also serves as a base port when multiple replicas are deployed. For example, if there are 2 replicas, the first will use port 29500 and the second will use port 29600.
Default: 29500
max_model_len: Optional[int]
Description: The maximum number of tokens DeepSpeed-Inference can work with, including the input and output tokens.
Default: None
worker_use_ray: bool
Description: Whether to initialise the worker in a separate process when tensor_parallel equals 1.
Default: False
config: Optional["PretrainedConfig"]
Description: Model configuration. If None, the model configuration will be automatically loaded from the model path.
Default: None
quantization_mode: Optional[str]
Description: The quantization mode in string format. The supported modes are as follows:
'wf6af16': Weight-only quantization with FP6 weight and FP16 activation.
Default: None
default_generate_kwargs: Dict[str, Any]
Description: Dictionary of default values for the generate request kwargs.
Default: {}
stream: bool
Description: Whether to stream the output.
Default: False
skip_special_tokens: bool
Description: Whether to skip special tokens. Potentially useful when used in conjunction with the ignore_eos sampling flag.
Default: True
max_tokens: int
Description: Maximum number of tokens to generate per output sequence.
Default: 20
ignore_eos: bool
Description: Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.
Default: False
temperature: float
Description: Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.
Default: 1.0
repetition_penalty: float
Description: Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.
Default: 1.0
top_p: float
Description: Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
Default: 1.0
top_k: int
Description: Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens.
Default: -1
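As with the other backends, these settings can be combined in one model-settings.json entry. The sketch below extends the earlier OPT example and mirrors its structure; the tensor_parallel_size, max_model_len, and default_generate_kwargs values are illustrative assumptions, not recommended settings:
{
  "name": "opt-2.7b-text",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "deepspeed",
      "model_settings": {
        "model": "facebook/opt-2.7b",
        "tensor_parallel_size": 2,
        "max_model_len": 2048,
        "default_generate_kwargs": {
          "max_tokens": 64,
          "temperature": 0.8
        }
      }
    }
  }
}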