Local
Transformers
The following settings can be passed through the model-settings.json file within the model object. For example, the snippet below extends a previous example with the device and max_tokens parameters:
{
  "name": "Llama-2-7B-chat-AWQ",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "transformers",
      "config": {
        "model_type": "chat.completions",
        "model_settings": {
          "model": "TheBloke/Llama-2-7B-chat-AWQ",
          "device": "cuda",
          "max_tokens": -1
        }
      }
    }
  }
}
For more information on Transformers-specific settings see here for Causal LMs and here for Seq2Seq LMs.
Model Settings
model: str
Description: Model name or path of the HuggingFace model to be deployed.
load_kwargs: dict
Description: Extra arguments to be passed to the from_pretrained method of the HuggingFace model.
Default: {}
device: Literal["cpu", "cuda", "auto"]
Description: Device to be used for the model. If auto, the device will be automatically selected based on the availability of GPUs.
Default: "auto"
dtype: str = "auto"
Description: Data type to be used for the model. If auto, the data type will be automatically selected based on the availability of GPUs. If cpu is selected, the data type will be torch.float32. If cuda is selected, the data type will be torch.float16.
Default: "auto"
tensor_parallel_size: int
Description: Number of GPUs to be used for tensor parallelism. If 1, tensor parallelism will not be used.
Default: 1
pipeline_parallel_size: int
Description: Number of GPUs to be used for pipeline parallelism. If 1, pipeline parallelism will not be used. This is not supported yet.
Default: 1
enable_profile: bool
Description: Whether to enable profiling of the model.
Default: False
profile_kwargs: Optional["ProfilerSettings"]
Description: Profiling settings. If None, default settings will be used.
Default: None
enable_optimisation: bool
Description: Whether to enable optimisation of the model.
Default: False
optimisation_kwargs: Optional["OptimisationSettings"]
Description: Optimisation settings. If None, default settings will be used.
Default: None
config: Optional["PretrainedConfig"]
Description: Model configuration. If None, the model configuration will be automatically loaded from the model path.
Default: None
max_model_len: Optional[int]
Description: Maximum sequence length the model can handle. If None, the maximum model length will be automatically inferred from the model configuration.
Default: None
max_tokens: Optional[int]
Description: Maximum number of tokens which can be processed by the model at a time. If None, the maximum number of tokens is set to -1 if the device is CPU, or inferred from gpu_memory_utilization if the device is GPU. Note that -1 means that there is no limit on the number of tokens.
Default: None
max_num_seqs: int
Description: Maximum number of sequences in a batch. Setting it to -1 means that there is no limit on the number of sequences.
Default: 256
max_paddings: int
Description: Maximum number of paddings in a batch. Setting it to -1 means that there is no limit on the amount of padding.
Default: 2048
gpu_memory_utilization: float
Description: The fraction of the GPU memory to be used.
Default: 0.9
default_generate_kwargs: Dict[str, Any]
Description: Dictionary of default values for the generate request kwargs.
Default: {}
stream: bool
Description: Whether to stream the output.
Default: False
skip_special_tokens: bool
Description: Whether to skip special tokens. Potentially useful when used in conjunction with the ignore_eos sampling flag.
Default: True
max_tokens: int
Description: Maximum number of tokens to generate per output sequence.
Default: 20
ignore_eos: bool
Description: Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.
Default: False
temperature: float
Description: Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.
Default: 1.0
repetition_penalty: float
Description: Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.
Default: 1.0
top_p: float
Description: Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
Default: 1.0
top_k: int
Description: Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens.
Default: -1
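The generation defaults above can be set alongside the model settings in the same model-settings.json. The sketch below is illustrative only: the sampling values are arbitrary, and it assumes that the sampling settings listed above are the keys accepted inside default_generate_kwargs:
{
  "name": "Llama-2-7B-chat-AWQ",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "transformers",
      "config": {
        "model_type": "chat.completions",
        "model_settings": {
          "model": "TheBloke/Llama-2-7B-chat-AWQ",
          "device": "cuda",
          "default_generate_kwargs": {
            "max_tokens": 100,
            "temperature": 0.7,
            "top_p": 0.9
          }
        }
      }
    }
  }
}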
vLLM
The following settings can be passed through the model-settings.json file within the model object. For example, the snippet below extends a previous example with the gpu_memory_utilization parameter:
{
  "name": "tinyllama-1-1b-chat",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "vllm",
      "config": {
        "model_type": "chat.completions",
        "model_settings": {
          "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          "gpu_memory_utilization": 0.4
        }
      }
    }
  }
}
default_generate_kwargs: Dict[str, Any]
Description: Dictionary of default values for the generate request kwargs. The parameters that can be passed can be seen here.
Default: {}
stream: bool
Description: Whether to stream the output.
Default: False
skip_special_tokens: bool
Description: Whether to skip special tokens. Potentially useful when used in conjunction with the ignore_eos sampling flag.
Default: True
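As with the Transformers backend, the sketch below shows how these settings might be combined in model-settings.json. It is illustrative only: the values are arbitrary, and the keys inside default_generate_kwargs (temperature, max_tokens) are standard vLLM sampling parameters rather than settings defined in this section:
{
  "name": "tinyllama-1-1b-chat",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "vllm",
      "config": {
        "model_type": "chat.completions",
        "model_settings": {
          "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          "gpu_memory_utilization": 0.4,
          "default_generate_kwargs": {
            "temperature": 0.7,
            "max_tokens": 256
          },
          "stream": false
        }
      }
    }
  }
}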
DeepSpeed
The following settings can be passed through the model-settings.json file within the model object. For example, the snippet below extends a previous example with the inference_engine_config object:
{
  "name": "opt-2.7b-text",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "deepspeed",
      "model_settings": {
        "model": "facebook/opt-2.7b",
        "inference_engine_config": {
          "state_manager": {
            "memory_config": {
              "mode": "allocate",
              "size": 50
            }
          }
        }
      }
    }
  }
}
For more information on DeepSpeed-specific settings see here for inference engine configuration, here for state manager configuration, here for memory configuration, and here for quantization configuration.
Model Settings
model: str
Description: Model name or path of the HuggingFace model to be deployed.
tokenizer: Optional[str]
Description: Name or path of the HuggingFace tokenizer to be used.
Default: None
load_kwargs: dict
Description: Extra arguments to be passed to the from_pretrained method of the HuggingFace model, such as trust_remote_code and revision used to load the model configuration.
Default: {}
device: Literal["cuda"]
Description: Device to be used for the model. Only supports cuda.
Default: "cuda"
tensor_parallel_size: int
Description: Tensor parallelism to use for a model (i.e., how many GPUs to shard a model across). This defaults to the WORLD_SIZE environment variable, or a value of 1 if that variable is not set. This value is also propagated to the inference_engine_config.
Default: 1
inference_engine_config: Dict[str, Any]
Description: DeepSpeed inference engine config. This is automatically generated, but you can provide a set of custom configs.
Default: {}
torch_dist_port: int
Description: Torch distributed port to be used. This also serves as a base port when multiple replicas are deployed. For example, if there are 2 replicas, the first will use port 29500 and the second will use port 29600.
Default: 29500
max_model_len: Optional[int]
Description: The maximum number of tokens DeepSpeed-Inference can work with, including the input and output tokens.
Default: None
worker_use_ray: bool
Description: Whether to initialise the worker in a separate process when tensor_parallel_size equals 1.
Default: False
config: Optional["PretrainedConfig"]
Description: Model configuration. If None, the model configuration will be automatically loaded from the model path.
Default: None
quantization_mode: Optional[str]
Description: The quantization mode in string format. The supported modes are as follows:
'wf6af16': Weight-only quantization with FP6 weight and FP16 activation.
Default: None
default_generate_kwargs: Dict[str, Any]
Description: Dictionary of default values for the generate request kwargs.
Default: {}
stream: bool
Description: Whether to stream the output.
Default: False
skip_special_tokens: bool
Description: Whether to skip special tokens. Potentially useful when used in conjunction with the ignore_eos sampling flag.
Default: True
max_tokens: int
Description: Maximum number of tokens to generate per output sequence.
Default: 20
ignore_eos: bool
Description: Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.
Default: False
temperature: float
Description: Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.
Default: 1.0
repetition_penalty: float
Description: Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.
Default: 1.0
top_p: float
Description: Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
Default: 1.0
top_k: int
Description: Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens.
Default: -1
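Putting several of these settings together, a purely illustrative sketch of a DeepSpeed configuration might look as follows; the tensor_parallel_size, max_model_len, and sampling values are arbitrary, and it is assumed that generation defaults are nested under default_generate_kwargs as described above:
{
  "name": "opt-2.7b-text",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "deepspeed",
      "model_settings": {
        "model": "facebook/opt-2.7b",
        "tensor_parallel_size": 2,
        "max_model_len": 2048,
        "default_generate_kwargs": {
          "max_tokens": 100,
          "temperature": 0.7
        }
      }
    }
  }
}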