Quantization
import pprint
import requests
import subprocess


def get_mesh_ip():
    # Look up the external IP of the seldon-mesh LoadBalancer service.
    cmd = f"kubectl get svc seldon-mesh -n seldon -o jsonpath='{{.status.loadBalancer.ingress[0].ip}}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')


# A single chat message, sent as three BYTES tensors: role, content and type.
inference_request = {
    "inputs": [
        {
            "name": "role",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["user"]
        },
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["What is the capital of France?"],
        },
        {
            "name": "type",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["text"],
        }
    ],
}
Quantization with Transformers
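The request defined above is never sent anywhere in this snippet. The following is a minimal sketch of how it could be posted to a deployed model, assuming the model is reachable through seldon-mesh at the Open Inference Protocol path /v2/models/<model name>/infer over plain HTTP. The helper names (build_infer_url, build_infer_request, send_inference_request) and the model name used below are hypothetical, not part of the original page. The sketch uses only the standard library so it has no extra dependencies; requests.post(url, json=payload) would serve the same purpose.

```python
import json
import urllib.request


def build_infer_url(mesh_ip: str, model_name: str) -> str:
    """Build the inference URL (assumed Open Inference Protocol path)."""
    # Strip stray whitespace or a trailing newline from the kubectl output.
    return f"http://{mesh_ip.strip()}/v2/models/{model_name}/infer"


def build_infer_request(mesh_ip: str, model_name: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST request carrying the payload."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        build_infer_url(mesh_ip, model_name),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_inference_request(mesh_ip: str, model_name: str, payload: dict, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_infer_request(mesh_ip, model_name, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a model deployed under a name such as "my-llm" (a placeholder), the call would look like pprint.pprint(send_inference_request(get_mesh_ip(), "my-llm", inference_request)).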
AWQ
GPTQ
BitsAndBytes quantization
Quantization with vLLM
AWQ
GPTQ
Quantization with DeepSpeed
wf6af16