# Quantization

This example demonstrates how to use the quantization capabilities of the Local runtime on top of Seldon Core 2 (SC2) to serve your LLM on limited hardware (i.e., when the fp16 model does not fit on your GPUs). We will showcase how to deploy quantized models using all of our backends. To run this example you will need SC2 up and running, the Local runtime deployed on SC2, and a GPU with compute capability >= 8.0 for the `deepspeed` backend.

To start serving LLMs, you have to create a secret for the Hugging Face token and deploy the Local runtime server. Please check our [installation tutorial](/llm-module/introduction/installation.md) to see how to do so.

{% hint style="info" %}
Note that the quantization capabilities of the `transformers` backend are currently restricted to a single GPU. Any attempt to use quantization with multiple GPUs (i.e., for tensor parallelism) might therefore result in undefined behaviour, depending on the backend you are using.
{% endhint %}

We are now ready to deploy our LLMs on SC2, but before that, let us define the inference request we will send to our models.

```python
import pprint
import requests
import subprocess

def get_mesh_ip():
    cmd = f"kubectl get svc seldon-mesh -n seldon -o jsonpath='{{.status.loadBalancer.ingress[0].ip}}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')


inference_request = {
    "inputs": [
            {
                "name": "role", 
                "shape": [1], 
                "datatype": "BYTES", 
                "data": ["user"]
            },
            {
                "name": "content",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["What is the capital of France?"],
            },
            {
                "name": "type",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["text"],
            }
        ],
}
```
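If you want to send prompts other than the fixed one above, the request can be built by a small helper. This is a minimal sketch: `build_chat_request` is a hypothetical name, not part of the runtime, and it simply reproduces the three one-element `BYTES` inputs (`role`, `content`, `type`) used in the request above.

```python
def build_chat_request(content, role="user"):
    """Build a V2 inference request carrying a single chat message."""

    def text_input(name, value):
        # Each field is a one-element BYTES tensor, mirroring the request above.
        return {"name": name, "shape": [1], "datatype": "BYTES", "data": [value]}

    return {
        "inputs": [
            text_input("role", role),
            text_input("content", content),
            text_input("type", "text"),
        ]
    }


request = build_chat_request("What is the capital of France?")
```

The resulting dictionary can be passed to `requests.post(endpoint, json=request)` in the same way as `inference_request`.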

## Quantization with Transformers

We begin with the `transformers` backend for which we deploy a `llama2-7b` model with different quantization methods.
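To get a feel for why quantization helps on limited hardware, the back-of-the-envelope calculation below estimates the memory taken by the weights alone at different precisions. It is a rough sketch: the 7 billion parameter count is a nominal figure assumed for a 7B model, and the estimate ignores the KV cache, activations, and the extra scales and zero-points that quantized formats store.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


params = 7e9  # nominal parameter count for a 7B model (assumption)
for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(params, bits):.1f} GiB")
```

Under these assumptions the weights need roughly 13 GiB in fp16 and a little over 3 GiB at 4 bits, which is the difference between needing a large GPU and fitting on a much smaller one.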

### AWQ

Loading an `awq` quantized model is as simple as setting `model` to an already `awq` quantized model on the Hugging Face model hub or on the local disk. Note that the scheduler is deactivated by setting `"max_tokens": -1`, as it may not work properly with `awq` quantized models.

The `model-settings.json` file for `awq` is the following:

```python
!cat models/transformers/llama-awq/model-settings.json
```

```
{
    "name": "llama-awq",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "transformers",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "TheBloke/Llama-2-7B-chat-AWQ",
                    "device": "cuda",
                    "max_tokens": -1
                }
            }
        }
    }
}
```
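Since the settings files in this example share the same layout, they can also be generated programmatically. The sketch below is one possible way to do that; `local_model_settings` is a hypothetical helper, and it only reproduces the structure shown above without validating the keys passed under `model_settings`.

```python
import json


def local_model_settings(name, backend, model, **model_settings):
    """Assemble a model-settings.json payload in the layout shown above."""
    return {
        "name": name,
        "implementation": "mlserver_llm_local.runtime.Local",
        "parameters": {
            "extra": {
                "backend": backend,
                "config": {
                    "model_type": "chat.completions",
                    # Backend-specific keys (device, max_tokens, ...) are passed through as-is.
                    "model_settings": {"model": model, **model_settings},
                },
            }
        },
    }


settings = local_model_settings(
    "llama-awq",
    "transformers",
    "TheBloke/Llama-2-7B-chat-AWQ",
    device="cuda",
    max_tokens=-1,
)
print(json.dumps(settings, indent=4))
```

Writing the printed JSON to `model-settings.json` gives a file equivalent to the one above.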

To deploy the model on SC2, we need to apply the following manifest file:

```python
!cat manifests/transformers/llama-awq.yaml
```

```
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: llama-awq
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/examples/models/local/quantization/models/transformers/llama-awq"
  requirements:
  - llm-local
```
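The manifest can likewise be produced from Python when you deploy several variants. The sketch below builds the same resource as a dictionary and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. `model_manifest` is a hypothetical helper and the bucket path is a placeholder, not a real location.

```python
import json


def model_manifest(name, storage_uri, requirements=("llm-local",)):
    """Build a Seldon Core 2 Model resource matching the manifest above."""
    return {
        "apiVersion": "mlops.seldon.io/v1alpha1",
        "kind": "Model",
        "metadata": {"name": name},
        "spec": {
            "storageUri": storage_uri,
            "requirements": list(requirements),
        },
    }


# Placeholder URI: point this at the folder holding your model-settings.json.
manifest = model_manifest("llama-awq", "gs://my-bucket/path/to/llama-awq")
print(json.dumps(manifest, indent=2))
```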

To apply the manifest file, run the following command:

```python
!kubectl apply -f manifests/transformers/llama-awq.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
```

```
model.mlops.seldon.io/llama-awq created
model.mlops.seldon.io/llama-awq condition met
```
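The `--all` selector waits on every model in the namespace. If other models are deployed alongside this one, you may prefer to wait on a single model. The following is a small sketch of how that could be wrapped in Python; the function names are hypothetical, and the defaults mirror the namespace and timeout used above.

```python
import subprocess


def wait_for_model_cmd(name, namespace="seldon", timeout_s=600):
    """Return the kubectl argv that waits for one Model to become ready."""
    return [
        "kubectl", "wait", "--for=condition=ready",
        f"--timeout={timeout_s}s", f"model/{name}", "-n", namespace,
    ]


def wait_for_model(name, **kwargs):
    # check=True raises CalledProcessError if kubectl exits non-zero (e.g. on timeout).
    subprocess.run(wait_for_model_cmd(name, **kwargs), check=True)
```

Calling `wait_for_model("llama-awq")` then blocks until that model is ready or the timeout expires.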

Once the model is deployed, we can send an inference request:

```python
endpoint = f"http://{get_mesh_ip()}/v2/models/llama-awq/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
```

```
{'id': 'feb2f362-2cd3-44b1-a250-d3d89af44146',
 'model_name': 'llama-awq_1',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['  The capital of France is Paris.'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}
```
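The response carries one output tensor per field, so pulling out the generated text takes a little unpacking. The sketch below shows one way to do it; `extract_outputs` is a hypothetical helper, and `sample` is a trimmed copy of the response above rather than a live result.

```python
def extract_outputs(response_json):
    """Map each output tensor name to its first data element."""
    return {out["name"]: out["data"][0] for out in response_json["outputs"]}


# Trimmed copy of the response shown above.
sample = {
    "model_name": "llama-awq_1",
    "outputs": [
        {"name": "role", "data": ["assistant"]},
        {"name": "content", "data": ["  The capital of France is Paris."]},
        {"name": "type", "data": ["text"]},
    ],
}

message = extract_outputs(sample)
print(message["role"], "->", message["content"].strip())
# prints: assistant -> The capital of France is Paris.
```

With a live request, calling `response.raise_for_status()` before `extract_outputs(response.json())` surfaces HTTP errors early instead of failing on a missing `outputs` key.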

To unload the model, run the following command:

```python
!kubectl delete -f manifests/transformers/llama-awq.yaml -n seldon
```

```
model.mlops.seldon.io "llama-awq" deleted
```

### GPTQ

Similarly, loading a `gptq` quantized model is as simple as setting `model` to an already `gptq` quantized model on the Hugging Face model hub or on the local disk. Note that the scheduler is deactivated by setting `"max_tokens": -1`, as it may not work properly with `gptq` quantized models.

The `model-settings.json` file for `gptq` is the following:

```python
!cat models/transformers/llama-gptq/model-settings.json
```

```
{
    "name": "llama-gptq",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "uri": "TheBloke/Llama-2-7B-chat-GPTQ",
        "extra": {
            "backend": "transformers",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "TheBloke/Llama-2-7B-chat-GPTQ",
                    "device": "cuda",
                    "max_tokens": -1
                }
            }
        }
    }
}
```

To deploy the model on SC2, we need to apply the following manifest file:

```python
!cat manifests/transformers/llama-gptq.yaml
```

```
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: llama-gptq
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/quantization/models/transformers/llama-gptq"
  requirements:
  - llm-local
```

To apply the manifest file, run the following command:

```python
!kubectl apply -f manifests/transformers/llama-gptq.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
```

```
model.mlops.seldon.io/llama-gptq created
model.mlops.seldon.io/llama-gptq condition met
```

Once the model is deployed, we can send an inference request:

```python
endpoint = f"http://{get_mesh_ip()}/v2/models/llama-gptq/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
```

```
{'id': '1f96a023-d14b-4f6c-a169-eec2ddfae959',
 'model_name': 'llama-gptq_1',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['  The capital of France is Paris (French: Paris).'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}
```

To unload the model, run the following command:

```python
!kubectl delete -f manifests/transformers/llama-gptq.yaml -n seldon
```

```
model.mlops.seldon.io "llama-gptq" deleted
```

### BitsAndBytes quantization

The `transformers` backend can also quantize a model on load using bitsandbytes.

The `model-settings.json` file for `bnb` is the following:

```python
!cat models/transformers/llama-bnb/model-settings.json
```

```
{
    "name": "llama-bnb",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "transformers",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "meta-llama/Llama-2-7b-chat-hf",
                    "device": "cuda",
                    "max_tokens": -1,
                    "load_kwargs": {
                        "load_in_4bit": true
                    }
                }  
            }
        }
    }
}
```

You can also specify more advanced quantization settings as follows:

```python
!cat models/transformers/llama-bnb-config/model-settings.json
```

```
{
    "name": "llama-bnb-config",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "transformers",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "meta-llama/Llama-2-7b-chat-hf",
                    "device": "cuda",
                    "max_tokens": -1,
                    "load_kwargs": {
                        "quantization_config": {
                            "load_in_4bit": true,
                            "bnb_4bit_quant_type": "nf4",
                            "bnb_4bit_use_double_quant": true,
                            "bnb_4bit_compute_dtype": "float16"
                        }
                    }
                }  
            }
        }
    }
}
```

To deploy the model on SC2, we need to apply the following manifest file:

```python
!cat manifests/transformers/llama-bnb-config.yaml
```

```
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: llama-bnb-config
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/quantization/models/transformers/llama-bnb-config"
  requirements:
  - llm-local
```

To apply the manifest file, run the following command:

```python
!kubectl apply -f manifests/transformers/llama-bnb-config.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
```

```
model.mlops.seldon.io/llama-bnb-config created
model.mlops.seldon.io/llama-bnb-config condition met
```

Once the model is deployed, we can send an inference request:

```python
endpoint = f"http://{get_mesh_ip()}/v2/models/llama-bnb-config/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
```

```
{'id': 'cd581c0c-6869-4c22-99c2-05200a5b957c',
 'model_name': 'llama-bnb-config_1',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['  The capital of France is Paris. капитал of France is '
                       'Paris. Paris is located in the'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}
```

To unload the model, run the following command:

```python
!kubectl delete -f manifests/transformers/llama-bnb-config.yaml -n seldon
```

```
model.mlops.seldon.io "llama-bnb-config" deleted
```

## Quantization with vLLM

### AWQ

Loading an `awq` quantized model is as simple as setting `model` to an already `awq` quantized model on the Hugging Face model hub or on the local disk and setting `quantization` to the matching method.

The `model-settings.json` file for `awq` is the following:

```python
!cat models/vllm/llama-awq/model-settings.json
```

```
{
    "name": "llama-awq",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "TheBloke/Llama-2-7B-chat-AWQ",
                    "quantization": "awq"
                }
            }
        }
    }
}
```

To deploy the model on SC2, we need to apply the following manifest file:

```python
!cat manifests/vllm/llama-awq.yaml
```

```
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: llama-awq
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/quantization/models/vllm/llama-awq"
  requirements:
  - llm-local
```

To apply the manifest file, run the following command:

```python
!kubectl apply -f manifests/vllm/llama-awq.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
```

```
model.mlops.seldon.io/llama-awq created
model.mlops.seldon.io/llama-awq condition met
```

Once the model is deployed, we can send an inference request:

```python
endpoint = f"http://{get_mesh_ip()}/v2/models/llama-awq/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
```

```
{'id': '8de1bbd4-e77a-499e-9466-141f9202ef22',
 'model_name': 'llama-awq_1',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['  The capital of France is Paris.'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}
```

To unload the model, run the following command:

```python
!kubectl delete -f manifests/vllm/llama-awq.yaml -n seldon
```

```
model.mlops.seldon.io "llama-awq" deleted
```

### GPTQ

Similarly, loading a `gptq` quantized model is as simple as setting `model` to an already `gptq` quantized model on the Hugging Face model hub or on the local disk and setting `quantization` to the matching method.

The `model-settings.json` file for `gptq` is the following:

```python
!cat models/vllm/llama-gptq/model-settings.json
```

```
{   
    "name": "llama-gptq",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "TheBloke/Llama-2-7B-chat-GPTQ",
                    "quantization": "gptq"
                }
            }
        }
    }
}
```

To deploy the model on SC2, we need to apply the following manifest file:

```python
!cat manifests/vllm/llama-gptq.yaml
```

```
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: llama-gptq
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/quantization/models/vllm/llama-gptq"
  requirements:
  - llm-local
```

To apply the manifest file, run the following command:

```python
!kubectl apply -f manifests/vllm/llama-gptq.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
```

```
model.mlops.seldon.io/llama-gptq created
model.mlops.seldon.io/llama-gptq condition met
```

Once the model is deployed, we can send an inference request:

```python
endpoint = f"http://{get_mesh_ip()}/v2/models/llama-gptq/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
```

```
{'id': 'd1b7d0b8-af0d-4ffe-998d-9b14e43ef822',
 'model_name': 'llama-gptq_1',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['  The capital of France is Paris (French: Paris).'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}
```

To unload the model, run the following command:

```python
!kubectl delete -f manifests/vllm/llama-gptq.yaml -n seldon
```

```
model.mlops.seldon.io "llama-gptq" deleted
```

## Quantization with DeepSpeed

### wf6af16

The `deepspeed` backend can apply `wf6af16` quantization on load, but only to models that run in `float16` by default.
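Because the `deepspeed` backend needs a GPU with compute capability >= 8.0, it can be worth checking the hardware before deploying. The sketch below is one way to do so on the GPU node itself. It assumes an `nvidia-smi` version recent enough to support the `compute_cap` query field, and all function names are hypothetical.

```python
import subprocess


def parse_compute_cap(text):
    """Turn a string such as '8.6' into a comparable (major, minor) tuple."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def gpu_compute_caps():
    # Assumption: this nvidia-smi supports the compute_cap query field.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        text=True,
    )
    return [parse_compute_cap(line) for line in out.splitlines() if line.strip()]


def meets_minimum(caps, minimum=(8, 0)):
    """True only if at least one GPU was found and every GPU meets the minimum."""
    return bool(caps) and all(cap >= minimum for cap in caps)
```

On the node, `meets_minimum(gpu_compute_caps())` returns `False` when any visible GPU is below 8.0 or when no GPU is reported.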

The `model-settings.json` file for `wf6af16` is the following:

```python
!cat models/deepspeed/llama-wf6af16/model-settings.json
```

```
{
    "name": "llama-wf6af16",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "deepspeed",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "meta-llama/Llama-2-7b-chat-hf",
                    "quantization_mode": "wf6af16"
                }
            }
        }
    }
}
```

To deploy the model on SC2, we need to apply the following manifest file:

```python
!cat manifests/deepspeed/llama-wf6af16.yaml
```

```
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: llama-wf6af16
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/quantization/models/deepspeed/llama-wf6af16"
  requirements:
  - llm-local
```

To apply the manifest file, run the following command:

```python
!kubectl apply -f manifests/deepspeed/llama-wf6af16.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
```

Once the model is deployed, we can send an inference request:

```python
# TODO: run this on a GPU with compute capability 8.0
endpoint = f"http://{get_mesh_ip()}/v2/models/llama-wf6af16/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
```

To unload the model, run the following command:

```python
!kubectl delete -f manifests/deepspeed/llama-wf6af16.yaml -n seldon
```

