Quantization
The following demonstrates how to use the quantization capabilities of the Local runtime on top of Seldon Core 2 (SC2) to serve your LLM on limited hardware (i.e., when the fp16 model does not fit on your GPUs). We will showcase how to deploy quantized models using all of our backends. To run this example you will need SC2 up and running, the Local runtime deployed on SC2, and a GPU with compute capability >= 8.0 for the deepspeed backend.
In order to start serving LLMs you have to create a secret for the Huggingface token and deploy the Local Runtime server. Please check our installation tutorial to see how to do so.
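A minimal sketch of creating the token secret is shown below; the secret name huggingface-token and key token are placeholders, so use the names expected by your installation as described in the tutorial.
!kubectl create secret generic huggingface-token --from-literal=token=<YOUR_HF_TOKEN> -n seldon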
We are now ready to deploy our LLMs on SC2, but before that, let us define the inference request we will send to our models.
import pprint
import requests
import subprocess
def get_mesh_ip():
    # Fetch the external IP of the seldon-mesh LoadBalancer service.
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')
inference_request = {
"inputs": [
{
"name": "role",
"shape": [1],
"datatype": "BYTES",
"data": ["user"]
},
{
"name": "content",
"shape": [1],
"datatype": "BYTES",
"data": ["What is the capital of France?"],
},
{
"name": "type",
"shape": [1],
"datatype": "BYTES",
"data": ["text"],
}
],
}
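The three BYTES tensors above together encode a single chat message. A small helper that builds the same payload (purely illustrative, not part of the runtime API):
def build_chat_request(role: str, content: str, type_: str = "text") -> dict:
    # Each field is sent as a 1-element BYTES tensor, matching the request above.
    return {
        "inputs": [
            {"name": "role", "shape": [1], "datatype": "BYTES", "data": [role]},
            {"name": "content", "shape": [1], "datatype": "BYTES", "data": [content]},
            {"name": "type", "shape": [1], "datatype": "BYTES", "data": [type_]},
        ],
    }

assert build_chat_request("user", "What is the capital of France?") == inference_request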
Quantization with Transformers
We begin with the transformers backend, for which we deploy a llama2-7b model with different quantization methods.
AWQ
Loading an awq quantized model is as simple as setting the model to an already awq quantized model on the Huggingface model hub or on local disk. Note that the scheduler is deactivated by setting "max_tokens": -1, as it may not work properly with awq quantized models.
The model-settings.json file for awq is the following:
!cat models/transformers/llama-awq/model-settings.json
{
"name": "llama-awq",
"implementation": "mlserver_llm_local.runtime.Local",
"parameters": {
"extra": {
"backend": "transformers",
"config": {
"model_type": "chat.completions",
"model_settings": {
"model": "TheBloke/Llama-2-7B-chat-AWQ",
"device": "cuda",
"max_tokens": -1
}
}
}
}
}
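For context, the AWQ checkpoint referenced above can also be loaded directly with plain transformers; a rough, illustrative sketch (assumes the autoawq package is installed; this is not the runtime's implementation):
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-chat-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# transformers picks up the AWQ quantization config stored in the checkpoint
# and loads the pre-quantized 4-bit weights onto the GPU.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")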
To deploy the model on SC2, we need to apply the following manifest file:
!cat manifests/transformers/llama-awq.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: llama-awq
spec:
storageUri: "gs://seldon-models/llm-runtimes-settings/examples/models/local/quantization/models/transformers/llama-awq"
requirements:
- llm-local
To apply the manifest file, run the following command:
!kubectl apply -f manifests/transformers/llama-awq.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/llama-awq created
model.mlops.seldon.io/llama-awq condition met
Once the model is deployed, we can send an inference request:
endpoint = f"http://{get_mesh_ip()}/v2/models/llama-awq/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': 'feb2f362-2cd3-44b1-a250-d3d89af44146',
'model_name': 'llama-awq_1',
'outputs': [{'data': ['assistant'],
'datatype': 'BYTES',
'name': 'role',
'parameters': {'content_type': 'str'},
'shape': [1, 1]},
{'data': [' The capital of France is Paris.'],
'datatype': 'BYTES',
'name': 'content',
'parameters': {'content_type': 'str'},
'shape': [1, 1]},
{'data': ['text'],
'datatype': 'BYTES',
'name': 'type',
'parameters': {'content_type': 'str'},
'shape': [1, 1]}],
'parameters': {}}
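If you only need the generated text, you can pick the content output out of the Open Inference (V2) response shown above:
outputs = {out["name"]: out["data"][0] for out in response.json()["outputs"]}
print(outputs["content"])  # ' The capital of France is Paris.'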
To unload the model, run the following command:
!kubectl delete -f manifests/transformers/llama-awq.yaml -n seldon
model.mlops.seldon.io "llama-awq" deleted
GPTQ
Similarly, loading a gptq quantized model is as simple as setting the model to an already gptq quantized model on the Huggingface model hub or on local disk. Note that the scheduler is deactivated by setting "max_tokens": -1, as it may not work properly with gptq quantized models.
The model-settings.json file for gptq is the following:
!cat models/transformers/llama-gptq/model-settings.json
{
"name": "llama-gptq",
"implementation": "mlserver_llm_local.runtime.Local",
"parameters": {
"uri": "TheBloke/Llama-2-7B-chat-GPTQ",
"extra": {
"backend": "transformers",
"config": {
"model_type": "chat.completions",
"model_settings": {
"model": "TheBloke/Llama-2-7B-chat-GPTQ",
"device": "cuda",
"max_tokens": -1
}
}
}
}
}
To deploy the model on SC2, we need to apply the following manifest file:
!cat manifests/transformers/llama-gptq.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: llama-gptq
spec:
storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/quantization/models/transformers/llama-gptq"
requirements:
- llm-local
To apply the manifest file, run the following command:
!kubectl apply -f manifests/transformers/llama-gptq.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/llama-gptq created
model.mlops.seldon.io/llama-gptq condition met
Once the model is deployed, we can send an inference request:
endpoint = f"http://{get_mesh_ip()}/v2/models/llama-gptq/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': '1f96a023-d14b-4f6c-a169-eec2ddfae959',
'model_name': 'llama-gptq_1',
'outputs': [{'data': ['assistant'],
'datatype': 'BYTES',
'name': 'role',
'parameters': {'content_type': 'str'},
'shape': [1, 1]},
{'data': [' The capital of France is Paris (French: Paris).'],
'datatype': 'BYTES',
'name': 'content',
'parameters': {'content_type': 'str'},
'shape': [1, 1]},
{'data': ['text'],
'datatype': 'BYTES',
'name': 'type',
'parameters': {'content_type': 'str'},
'shape': [1, 1]}],
'parameters': {}}
To unload the model, run the following command:
!kubectl delete -f manifests/transformers/llama-gptq.yaml -n seldon
model.mlops.seldon.io "llama-gptq" deleted
BitsAndBytes quantization
The transformers backend also allows you to perform quantization at load time using bitsandbytes.
The model-settings.json file for bnb is the following:
!cat models/transformers/llama-bnb/model-settings.json
{
"name": "llama-bnb",
"implementation": "mlserver_llm_local.runtime.Local",
"parameters": {
"extra": {
"backend": "transformers",
"config": {
"model_type": "chat.completions",
"model_settings": {
"model": "meta-llama/Llama-2-7b-chat-hf",
"device": "cuda",
"max_tokens": -1,
"load_kwargs": {
"load_in_4bit": true
}
}
}
}
}
}
You can also specify more advanced quantization settings as follows:
!cat models/transformers/llama-bnb-config/model-settings.json
{
"name": "llama-bnb-config",
"implementation": "mlserver_llm_local.runtime.Local",
"parameters": {
"extra": {
"backend": "transformers",
"config": {
"model_type": "chat.completions",
"model_settings": {
"model": "meta-llama/Llama-2-7b-chat-hf",
"device": "cuda",
"max_tokens": -1,
"load_kwargs": {
"quantization_config": {
"load_in_4bit": true,
"bnb_4bit_quant_type": "nf4",
"bnb_4bit_use_double_quant": true,
"bnb_4bit_compute_dtype": "float16"
}
}
}
}
}
}
}
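The load_kwargs above mirror the arguments accepted by Huggingface transformers when loading a model; for reference, the equivalent of the advanced configuration when using transformers directly (an illustrative sketch, not the runtime's code) is:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization and float16 compute,
# matching the quantization_config in the model-settings.json above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
    quantization_config=bnb_config,
)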
To deploy the model on SC2, we need to apply the following manifest file:
!cat manifests/transformers/llama-bnb-config.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: llama-bnb-config
spec:
storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/quantization/models/transformers/llama-bnb-config"
requirements:
- llm-local
To apply the manifest file, run the following command:
!kubectl apply -f manifests/transformers/llama-bnb-config.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/llama-bnb-config created
model.mlops.seldon.io/llama-bnb-config condition met
Once the model is deployed, we can send an inference request:
endpoint = f"http://{get_mesh_ip()}/v2/models/llama-bnb-config/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': 'cd581c0c-6869-4c22-99c2-05200a5b957c',
'model_name': 'llama-bnb-config_1',
'outputs': [{'data': ['assistant'],
'datatype': 'BYTES',
'name': 'role',
'parameters': {'content_type': 'str'},
'shape': [1, 1]},
{'data': [' The capital of France is Paris. капитал of France is '
'Paris. Paris is located in the'],
'datatype': 'BYTES',
'name': 'content',
'parameters': {'content_type': 'str'},
'shape': [1, 1]},
{'data': ['text'],
'datatype': 'BYTES',
'name': 'type',
'parameters': {'content_type': 'str'},
'shape': [1, 1]}],
'parameters': {}}
To unload the model, run the following command:
!kubectl delete -f manifests/transformers/llama-bnb-config.yaml -n seldon
model.mlops.seldon.io "llama-bnb-config" deleted
Quantization with vLLM
AWQ
Loading an awq quantized model is as simple as setting the model to an already awq quantized model on the Huggingface model hub or on local disk, and setting the quantization parameter to the appropriate method.
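For reference, these settings closely mirror vLLM's own engine arguments; a direct-vLLM sketch of loading the same AWQ checkpoint (illustrative only, not the runtime's code):
from vllm import LLM, SamplingParams

# vLLM loads the pre-quantized AWQ weights when quantization="awq" is set.
llm = LLM(model="TheBloke/Llama-2-7B-chat-AWQ", quantization="awq")
result = llm.generate(["What is the capital of France?"], SamplingParams(max_tokens=32))
print(result[0].outputs[0].text)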
The model-settings.json file for awq is the following:
!cat models/vllm/llama-awq/model-settings.json
{
"name": "llama-awq",
"implementation": "mlserver_llm_local.runtime.Local",
"parameters": {
"extra": {
"backend": "vllm",
"config": {
"model_type": "chat.completions",
"model_settings": {
"model": "TheBloke/Llama-2-7B-chat-AWQ",
"quantization": "awq"
}
}
}
}
}
To deploy the model on SC2, we need to apply the following manifest file:
!cat manifests/vllm/llama-awq.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: llama-awq
spec:
storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/quantization/models/vllm/llama-awq"
requirements:
- llm-local
To apply the manifest file, run the following command:
!kubectl apply -f manifests/vllm/llama-awq.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/llama-awq created
model.mlops.seldon.io/llama-awq condition met
Once the model is deployed, we can send an inference request:
endpoint = f"http://{get_mesh_ip()}/v2/models/llama-awq/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': '8de1bbd4-e77a-499e-9466-141f9202ef22',
'model_name': 'llama-awq_1',
'outputs': [{'data': ['assistant'],
'datatype': 'BYTES',
'name': 'role',
'parameters': {'content_type': 'str'},
'shape': [1, 1]},
{'data': [' The capital of France is Paris.'],
'datatype': 'BYTES',
'name': 'content',
'parameters': {'content_type': 'str'},
'shape': [1, 1]},
{'data': ['text'],
'datatype': 'BYTES',
'name': 'type',
'parameters': {'content_type': 'str'},
'shape': [1, 1]}],
'parameters': {}}
To unload the model, run the following command:
!kubectl delete -f manifests/vllm/llama-awq.yaml -n seldon
model.mlops.seldon.io "llama-awq" deleted
GPTQ
Similarly, loading a gptq quantized model is as simple as setting the model to an already gptq quantized model on the Huggingface model hub or on local disk, and setting the quantization parameter to the appropriate method.
The model-settings.json file for gptq is the following:
!cat models/vllm/llama-gptq/model-settings.json
{
"name": "llama-gptq",
"implementation": "mlserver_llm_local.runtime.Local",
"parameters": {
"extra": {
"backend": "vllm",
"config": {
"model_type": "chat.completions",
"model_settings": {
"model": "TheBloke/Llama-2-7B-chat-GPTQ",
"quantization": "gptq"
}
}
}
}
}
To deploy the model on SC2, we need to apply the following manifest file:
!cat manifests/vllm/llama-gptq.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: llama-gptq
spec:
storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/quantization/models/vllm/llama-gptq"
requirements:
- llm-local
To apply the manifest file, run the following command:
!kubectl apply -f manifests/vllm/llama-gptq.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/llama-gptq created
model.mlops.seldon.io/llama-gptq condition met
Once the model is deployed, we can send an inference request:
endpoint = f"http://{get_mesh_ip()}/v2/models/llama-gptq/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': 'd1b7d0b8-af0d-4ffe-998d-9b14e43ef822',
'model_name': 'llama-gptq_1',
'outputs': [{'data': ['assistant'],
'datatype': 'BYTES',
'name': 'role',
'parameters': {'content_type': 'str'},
'shape': [1, 1]},
{'data': [' The capital of France is Paris (French: Paris).'],
'datatype': 'BYTES',
'name': 'content',
'parameters': {'content_type': 'str'},
'shape': [1, 1]},
{'data': ['text'],
'datatype': 'BYTES',
'name': 'type',
'parameters': {'content_type': 'str'},
'shape': [1, 1]}],
'parameters': {}}
To unload the model, run the following command:
!kubectl delete -f manifests/vllm/llama-gptq.yaml -n seldon
model.mlops.seldon.io "llama-gptq" deleted
Quantization with DeepSpeed
wf6af16
The deepspeed backend allows you to perform wf6af16 quantization at load time, but only for models that run in float16 by default.
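Since wf6af16 requires a GPU with compute capability >= 8.0, it is worth checking the device before deploying; a quick local check with PyTorch (assumes torch with CUDA is available wherever you run it):
import torch

# (major, minor) compute capability of the first visible GPU, e.g. (8, 0) for an A100.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
assert (major, minor) >= (8, 0), "wf6af16 quantization requires compute capability >= 8.0"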
The model-settings.json file for wf6af16 is the following:
!cat models/deepspeed/llama-wf6af16/model-settings.json
{
"name": "llama-wf6af16",
"implementation": "mlserver_llm_local.runtime.Local",
"parameters": {
"extra": {
"backend": "deepspeed",
"config": {
"model_type": "chat.completions",
"model_settings": {
"model": "meta-llama/Llama-2-7b-chat-hf",
"quantization_mode": "wf6af16"
}
}
}
}
}
To deploy the model on SC2, we need to apply the following manifest file:
!cat manifests/deepspeed/llama-wf6af16.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: llama-wf6af16
spec:
storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/quantization/models/deepspeed/llama-wf6af16"
requirements:
- llm-local
To apply the manifest file, run the following command:
!kubectl apply -f manifests/deepspeed/llama-wf6af16.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
Once the model is deployed, we can send an inference request:
# TODO: run this on a GPU with compute capability >= 8.0
endpoint = f"http://{get_mesh_ip()}/v2/models/llama-wf6af16/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
To unload the model, run the following command:
!kubectl delete -f manifests/deepspeed/llama-wf6af16.yaml -n seldon