Quantization

The following demonstrates how to use the quantization capabilities of the Local runtime on top of Seldon Core 2 (SC2) to serve your LLM on limited hardware (i.e., when the fp16 model does not fit on your GPUs). We will showcase how to deploy quantized models using all of our backends. To run this example you will need SC2 up and running, the Local runtime deployed on SC2, and a GPU with compute capability >= 8.0 for the deepspeed backend.

To start serving LLMs, you first have to create a secret for the Huggingface token and deploy the Local Runtime server. Please check our installation tutorial to see how to do so.

We want to emphasize that the quantization capabilities of the transformers backend are currently restricted to a single GPU. Thus, any attempt to use quantization with multiple GPUs (i.e., for tensor parallelism) might result in undefined behaviour, depending on the backend you are using.

We are now ready to deploy our LLMs on SC2, but before that, let us define the inference request we will send to our models.

import pprint
import requests
import subprocess


def get_mesh_ip():
    # Fetch the external IP of the seldon-mesh service, which exposes the SC2 inference endpoints.
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode("utf-8")


# Open Inference Protocol (V2) request sent to every model deployed in this example.
inference_request = {
    "inputs": [
        {
            "name": "role",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["user"],
        },
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["What is the capital of France?"],
        },
        {
            "name": "type",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["text"],
        },
    ],
}

Quantization with Transformers

We begin with the transformers backend, for which we deploy a llama2-7b model with different quantization methods.

AWQ

Loading an awq-quantized model is as simple as setting the uri to an already awq-quantized model on the Huggingface model hub or on local disk. Note that the scheduler is deactivated by setting "max_tokens": -1, as it may not work properly with awq-quantized models.

The model-settings.json file for awq is the following:
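The exact schema of this file is defined by the Local runtime, so the snippet below is only an illustrative sketch: the implementation class, the backend/uri field names, and the checkpoint are placeholders to replace with the values from your installation. It reflects the constraints above, i.e., the uri points at an awq-quantized checkpoint and "max_tokens": -1 deactivates the scheduler.

{
    "name": "llama2-7b-awq",
    "implementation": "<local-runtime-implementation-class>",
    "parameters": {
        "extra": {
            "backend": "transformers",
            "uri": "<awq-quantized-llama2-7b-checkpoint>",
            "max_tokens": -1
        }
    }
}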

To deploy the model on SC2, we need to apply the following manifest file:
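As a sketch, the manifest is a standard SC2 Model resource; the storageUri should point at the folder containing the model-settings.json above, and the requirements entry must match the capability advertised by your Local runtime server (both shown as placeholders here).

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: llama2-7b-awq
  namespace: seldon
spec:
  storageUri: <uri-of-the-folder-containing-model-settings.json>
  requirements:
  - <capability-advertised-by-the-local-runtime-server>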

To apply the manifest file, run the following command:
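Assuming the manifest above is saved as llama2-7b-awq.yaml (a hypothetical file name), something along these lines applies it and waits for the model to become ready:

kubectl apply -f llama2-7b-awq.yaml -n seldon
kubectl wait --for condition=ready --timeout=600s model llama2-7b-awq -n seldon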

Once the model is deployed, we can send an inference request:
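For example, using the helper and the inference_request defined at the top of this page, the request can be sent through the seldon-mesh service via the Open Inference Protocol (the model name below must match the one used in the manifest):

# Route the request to our model through the seldon-mesh ingress.
endpoint = f"http://{get_mesh_ip()}/v2/models/llama2-7b-awq/infer"
headers = {"Content-Type": "application/json", "Seldon-Model": "llama2-7b-awq"}

response = requests.post(endpoint, json=inference_request, headers=headers)
pprint.pprint(response.json())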

To unload the model, run the following command:
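For instance, deleting the Model resource created above unloads it:

kubectl delete -f llama2-7b-awq.yaml -n seldon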

GPTQ

Similarly, loading a gptq-quantized model is as simple as setting the model to an already gptq-quantized model on the Huggingface model hub or on local disk. Note that the scheduler is deactivated by setting "max_tokens": -1, as it may not work properly with gptq-quantized models.

The model-settings.json file for gptq is the following:
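The sketch below mirrors the awq one, with the uri now pointing at a gptq-quantized checkpoint; as before, the implementation class and field names are placeholders to adapt to your installation.

{
    "name": "llama2-7b-gptq",
    "implementation": "<local-runtime-implementation-class>",
    "parameters": {
        "extra": {
            "backend": "transformers",
            "uri": "<gptq-quantized-llama2-7b-checkpoint>",
            "max_tokens": -1
        }
    }
}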

To deploy the model on SC2, we need to apply the following manifest file:

To apply the manifest file, run the following command:

Once the model is deployed, we can send an inference request:

To unload the model, run the following command:

BitsAndBytes quantization

The transformers backend also allows you to perform quantization at load time using bitsandbytes.

The model-settings.json file for bnb is the following:
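As a sketch, a minimal configuration could simply request 4-bit loading (the load_in_4bit flag comes from transformers' bitsandbytes integration; how it is nested under the model settings, like the implementation class, is a placeholder/assumption):

{
    "name": "llama2-7b-bnb",
    "implementation": "<local-runtime-implementation-class>",
    "parameters": {
        "extra": {
            "backend": "transformers",
            "uri": "<llama2-7b-fp16-checkpoint>",
            "quantization": {
                "load_in_4bit": true
            }
        }
    }
}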

You can also specify more advanced quantization settings as follows:
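For instance, the sketch below exposes the same options as transformers' BitsAndBytesConfig (quantization type, compute dtype, and double quantization); the nesting under the model settings is again an assumption.

{
    "name": "llama2-7b-bnb",
    "implementation": "<local-runtime-implementation-class>",
    "parameters": {
        "extra": {
            "backend": "transformers",
            "uri": "<llama2-7b-fp16-checkpoint>",
            "quantization": {
                "load_in_4bit": true,
                "bnb_4bit_quant_type": "nf4",
                "bnb_4bit_compute_dtype": "bfloat16",
                "bnb_4bit_use_double_quant": true
            }
        }
    }
}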

To deploy the model on SC2, we need to apply the following manifest file:

To apply the manifest file, run the following command:

Once the model is deployed, we can send an inference request:

To unload the model, run the following command:

Quantization with vLLM

AWQ

Loading an awq-quantized model is as simple as setting the model to an already awq-quantized model on the Huggingface model hub or on local disk and specifying the appropriate quantization method in the settings.

The model-settings.json file for awq is the following:
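As a sketch, the backend is switched to vllm and the quantization method is set to "awq", matching vLLM's quantization engine argument; the implementation class and field nesting are placeholders/assumptions.

{
    "name": "llama2-7b-awq-vllm",
    "implementation": "<local-runtime-implementation-class>",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "uri": "<awq-quantized-llama2-7b-checkpoint>",
            "quantization": "awq"
        }
    }
}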

To deploy the model on SC2, we need to apply the following manifest file:

To apply the manifest file, run the following command:

Once the model is deployed, we can send an inference request:

To unload the model, run the following command:

GPTQ

Similarly, loading a gptq-quantized model is as simple as setting the model to an already gptq-quantized model on the Huggingface model hub or on local disk and specifying the appropriate quantization method in the settings.

The model-settings.json file for gptq is the following:
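The sketch below is analogous to the awq one, with the quantization method set to "gptq" (again, field names are placeholders/assumptions).

{
    "name": "llama2-7b-gptq-vllm",
    "implementation": "<local-runtime-implementation-class>",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "uri": "<gptq-quantized-llama2-7b-checkpoint>",
            "quantization": "gptq"
        }
    }
}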

To deploy the model on SC2, we need to apply the following manifest file:

To apply the manifest file, run the following command:

Once the model is deployed, we can send an inference request:

To unload the model, run the following command:

Quantization with DeepSpeed

wf6af16

The DeepSpeed backend allows wf6af16 quantization to be performed at load time, and only for models that run in float16 by default.

The model-settings.json file for wf6af16 is the following:
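As a sketch, the backend is set to deepspeed and wf6af16 (FP6 weights, FP16 activations) is requested as the quantization mode; the implementation class and field names are placeholders/assumptions, and the uri should point at a float16 checkpoint as noted above.

{
    "name": "llama2-7b-wf6af16",
    "implementation": "<local-runtime-implementation-class>",
    "parameters": {
        "extra": {
            "backend": "deepspeed",
            "uri": "<llama2-7b-fp16-checkpoint>",
            "quantization": "wf6af16"
        }
    }
}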

To deploy the model on SC2, we need to apply the following manifest file:

To apply the manifest file, run the following command:

Once the model is deployed, we can send an inference request:

To unload the model, run the following command:
