
# Serving HuggingFace Transformer Models

Out of the box, MLServer supports the deployment and serving of HuggingFace Transformer models with the following features:

* Loading of Transformer Model artifacts from the Hugging Face Hub.
* Model quantization & optimization using the Hugging Face Optimum library.
* Request batching for GPU optimization (via adaptive batching and request batching).

In this example, we will showcase some of these features using an example model.

```python
# Import required dependencies
import requests
```

## Serving

Since we're using a pretrained model, we can skip straight to serving.

### `model-settings.json`

```python
%%writefile ./model-settings.json
{
    "name": "transformer",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "parameters": {
        "extra": {
            "task": "text-generation",
            "pretrained_model": "distilgpt2"
        }
    }
}
```

```
Overwriting ./model-settings.json
```

Now that we have our config in place, we can start the server by running `mlserver start .`. This needs to be run either from the directory that contains our config files or by pointing to the folder where they are.

```shell
mlserver start .
```

Since this command starts the server and blocks the terminal while it waits for requests, it needs to be run in the background or in a separate terminal.
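Because the server starts in a separate process, it can be useful to wait until it is ready before sending requests. The snippet below is a minimal sketch, not part of the original notebook. It assumes the server exposes the V2 inference protocol readiness endpoint at `/v2/health/ready` on `localhost:8080`; the helper names `server_is_ready` and `wait_until` are hypothetical.

```python
import http.client
import time
import urllib.request


def server_is_ready(base_url: str = "http://localhost:8080") -> bool:
    """Return True if the (assumed) V2 readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=2) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or non-2xx status: treat as not ready yet.
        return False


def wait_until(check, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Call check() repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False
```

Calling `wait_until(server_is_ready)` before the first inference request avoids connection errors while the model is still loading.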

### Send Test Inference Request

```python
inference_request = {
    "inputs": [
        {
            "name": "args",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["this is a test"],
        }
    ]
}

requests.post(
    "http://localhost:8080/v2/models/transformer/infer", json=inference_request
).json()
```

```
{'model_name': 'transformer',
 'id': 'eb160c6b-8223-4342-ad92-6ac301a9fa5d',
 'parameters': {},
 'outputs': [{'name': 'output',
   'shape': [1, 1],
   'datatype': 'BYTES',
   'parameters': {'content_type': 'hg_jsonlist'},
   'data': ['{"generated_text": "this is a testnet with 1-3,000-bit nodes as nodes."}']}]}
```
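Each entry in the output `data` list is a JSON-encoded string, as the `hg_jsonlist` content type suggests. The following sketch shows one way to decode it on the client side. It works on a trimmed copy of the response above, and the helper name `decode_hg_outputs` is hypothetical.

```python
import json


def decode_hg_outputs(response: dict) -> dict:
    """Map each output name to its list of JSON-decoded entries."""
    decoded = {}
    for output in response["outputs"]:
        decoded[output["name"]] = [json.loads(item) for item in output["data"]]
    return decoded


# Trimmed copy of the response shown above.
sample_response = {
    "model_name": "transformer",
    "outputs": [
        {
            "name": "output",
            "shape": [1, 1],
            "datatype": "BYTES",
            "parameters": {"content_type": "hg_jsonlist"},
            "data": [
                '{"generated_text": "this is a testnet with 1-3,000-bit nodes as nodes."}'
            ],
        }
    ],
}

generated = decode_hg_outputs(sample_response)["output"][0]["generated_text"]
print(generated)
```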

### Using Optimum Optimized Models

We can also leverage the Optimum library that allows us to access quantized and optimized models.

We can download pretrained optimized models from the hub if available by enabling the `optimum_model` flag:

```python
%%writefile ./model-settings.json
{
    "name": "transformer",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "parameters": {
        "extra": {
            "task": "text-generation",
            "pretrained_model": "distilgpt2",
            "optimum_model": true
        }
    }
}
```

```
Overwriting ./model-settings.json
```

Once again, you are able to run the model using the MLServer CLI. As before, this needs to be run either from the directory that contains our config files or by pointing to the folder where they are.

```shell
mlserver start .
```

### Send Test Request to Optimum Optimized Model

The request uses the same structure as before, but it is now served by the optimized model for better performance.

```python
inference_request = {
    "inputs": [
        {
            "name": "args",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["this is a test"],
        }
    ]
}

requests.post(
    "http://localhost:8080/v2/models/transformer/infer", json=inference_request
).json()
```

```
{'model_name': 'transformer',
 'id': '9c482c8d-b21e-44b1-8a42-7650a9dc01ef',
 'parameters': {},
 'outputs': [{'name': 'output',
   'shape': [1, 1],
   'datatype': 'BYTES',
   'parameters': {'content_type': 'hg_jsonlist'},
   'data': ['{"generated_text": "this is a test of the \\"safe-code-safe-code-safe-code\\" approach. The method only accepts two parameters as parameters: the code. The parameter \'unsafe-code-safe-code-safe-code\' should"}']}]}
```

## Testing Supported Tasks

MLServer supports other transformer tasks besides text generation. Below are examples for a few of them.

### Question Answering

```python
%%writefile ./model-settings.json
{
    "name": "transformer",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "parameters": {
        "extra": {
            "task": "question-answering"
        }
    }
}
```

```
Overwriting ./model-settings.json
```

Once again, you are able to run the model using the MLServer CLI.

```shell
mlserver start .
```

```python
inference_request = {
    "inputs": [
        {
            "name": "question",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["what is your name?"],
        },
        {
            "name": "context",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Hello, I am Seldon, how is it going"],
        },
    ]
}

requests.post(
    "http://localhost:8080/v2/models/transformer/infer", json=inference_request
).json()
```

```
{'model_name': 'transformer',
 'id': '4efac938-86d8-41a1-b78f-7690b2dcf197',
 'parameters': {},
 'outputs': [{'name': 'output',
   'shape': [1, 1],
   'datatype': 'BYTES',
   'parameters': {'content_type': 'hg_jsonlist'},
   'data': ['{"score": 0.9869915843009949, "start": 12, "end": 18, "answer": "Seldon"}']}]}
```
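The request payloads in this page all follow the same pattern: one `BYTES` input per named pipeline argument. The helper below is a small sketch of how that pattern could be factored out. The function name `build_inference_request` is hypothetical, and setting `shape` to the number of items is an assumption based on the single-item examples shown here.

```python
def build_inference_request(**named_inputs) -> dict:
    """Build a V2 inference payload with one BYTES input per keyword argument."""
    inputs = []
    for name, value in named_inputs.items():
        # Accept either a single string or a list of strings.
        data = [value] if isinstance(value, str) else list(value)
        inputs.append(
            {
                "name": name,
                "shape": [len(data)],
                "datatype": "BYTES",
                "data": data,
            }
        )
    return {"inputs": inputs}


# Rebuilds the question-answering request used above.
qa_request = build_inference_request(
    question="what is your name?",
    context="Hello, I am Seldon, how is it going",
)
```

The resulting `qa_request` can be passed as the `json` argument of `requests.post`, exactly like the hand-written dictionary.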

### Sentiment Analysis

```python
%%writefile ./model-settings.json
{
    "name": "transformer",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "parameters": {
        "extra": {
            "task": "text-classification"
        }
    }
}
```

```
Overwriting ./model-settings.json
```

Once again, you are able to run the model using the MLServer CLI.

```shell
mlserver start .
```

```python
inference_request = {
    "inputs": [
        {
            "name": "args",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["This is terrible!"],
        }
    ]
}

requests.post(
    "http://localhost:8080/v2/models/transformer/infer", json=inference_request
).json()
```

```
{'model_name': 'transformer',
 'id': '835eabbd-daeb-4423-a64f-a7c4d7c60a9b',
 'parameters': {},
 'outputs': [{'name': 'output',
   'shape': [1, 1],
   'datatype': 'BYTES',
   'parameters': {'content_type': 'hg_jsonlist'},
   'data': ['{"label": "NEGATIVE", "score": 0.9996137022972107}']}]}
```

## GPU Acceleration

We can also evaluate GPU acceleration by comparing inference speed on CPU and GPU using the following parameters.

### Testing with CPU

We first measure the time taken with `device=-1`, which runs the model on the CPU.

```python
%%writefile ./model-settings.json
{
    "name": "transformer",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "max_batch_size": 128,
    "max_batch_time": 1,
    "parameters": {
        "extra": {
            "task": "text-generation",
            "device": -1
        }
    }
}
```

```
Overwriting ./model-settings.json
```

Once again, you are able to run the model using the MLServer CLI.

```shell
mlserver start .
```

```python
inference_request = {
    "inputs": [
        {
            "name": "text_inputs",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["This is a generation for the work" for i in range(512)],
        }
    ]
}

# Benchmark time
import time

start_time = time.monotonic()

requests.post(
    "http://localhost:8080/v2/models/transformer/infer", json=inference_request
)

print(f"Elapsed time: {time.monotonic() - start_time}")
```

```
Elapsed time: 66.42268538899953
```

We can see that it takes about 66 seconds, which is roughly 6 times longer than the GPU example below.

### Testing with GPU

IMPORTANT: Running the code below requires a machine with a GPU configured correctly to work with TensorFlow/PyTorch.

Now we'll run the benchmark with the GPU configured, which we can do by setting `device=0`.

```python
%%writefile ./model-settings.json
{
    "name": "transformer",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "parameters": {
        "extra": {
            "task": "text-generation",
            "device": 0
        }
    }
}
```

```
Overwriting ./model-settings.json
```

```python
inference_request = {
    "inputs": [
        {
            "name": "text_inputs",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["This is a generation for the work" for i in range(512)],
        }
    ]
}

# Benchmark time
import time

start_time = time.monotonic()

requests.post(
    "http://localhost:8080/v2/models/transformer/infer", json=inference_request
)

print(f"Elapsed time: {time.monotonic() - start_time}")
```

```
Elapsed time: 11.27933280000434
```

We can see that the elapsed time is roughly 6 times shorter than the CPU version!
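The speedup can be computed directly from the two elapsed times printed above. This is a small worked example using those recorded values; results will differ on other hardware.

```python
# Elapsed times copied from the CPU and GPU benchmark outputs above.
cpu_elapsed_s = 66.42268538899953
gpu_elapsed_s = 11.27933280000434

speedup = cpu_elapsed_s / gpu_elapsed_s
print(f"GPU speedup over CPU: {speedup:.1f}x")
```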

### Adaptive Batching with GPU

We can also see how adaptive batching enables GPU acceleration by grouping multiple incoming requests so that they are processed on the GPU as a single batch.

In our case, we enable adaptive batching through `max_batch_size`, which we will set to 128.

We will also configure `max_batch_time`, which specifies the maximum amount of time the MLServer orchestrator will wait before sending the batch for inference.

```python
%%writefile ./model-settings.json
{
    "name": "transformer",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "max_batch_size": 128,
    "max_batch_time": 1,
    "parameters": {
        "extra": {
            "task": "text-generation",
            "pretrained_model": "distilgpt2",
            "device": 0
        }
    }
}
```

```
Overwriting ./model-settings.json
```
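To build intuition for how `max_batch_size` and `max_batch_time` interact, the sketch below simulates a simplified batching policy. It is a toy model under stated assumptions, not MLServer's actual implementation: a batch opens when its first request arrives and closes once it is full or once `max_batch_time` (assumed here to be in seconds) has passed. The function name `simulate_batches` is hypothetical.

```python
def simulate_batches(arrival_times, max_batch_size, max_batch_time):
    """Group sorted arrival times (in seconds) into batches.

    Toy policy: a batch opens with its first request and closes when it
    holds max_batch_size requests or when max_batch_time has elapsed
    since it opened.
    """
    batches = []
    current = []
    for t in arrival_times:
        window_expired = bool(current) and (t - current[0]) >= max_batch_time
        batch_full = len(current) >= max_batch_size
        if window_expired or batch_full:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# 50 requests per second for 3 seconds, matching the load test below.
arrivals = [i / 50 for i in range(150)]
batches = simulate_batches(arrivals, max_batch_size=128, max_batch_time=1)
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

Under this simplified policy, the time window rather than the size limit closes each batch at this request rate. A higher rate would hit `max_batch_size` first.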

To reach the required throughput of 50 requests per second, we will use `vegeta`, a load testing tool.

We can now see that the requests are batched and that we receive a 100% success rate, even though the requests are sent one by one.

```bash
%%bash
jq -ncM '{"method": "POST", "header": {"Content-Type": ["application/json"] }, "url": "http://localhost:8080/v2/models/transformer/infer", "body": "{\"inputs\":[{\"name\":\"text_inputs\",\"shape\":[1],\"datatype\":\"BYTES\",\"data\":[\"test\"]}]}" | @base64 }' \
          | vegeta \
                -cpus="2" \
                attack \
                -duration="3s" \
                -rate="50" \
                -format=json \
          | vegeta \
                report \
                -type=text
```

```
Requests      [total, rate, throughput]         150, 50.34, 22.28
Duration      [total, attack, wait]             6.732s, 2.98s, 3.753s
Latencies     [min, mean, 50, 90, 95, 99, max]  1.975s, 3.168s, 3.22s, 4.065s, 4.183s, 4.299s, 4.318s
Bytes In      [total, mean]                     60978, 406.52
Bytes Out     [total, mean]                     12300, 82.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:150  
Error Set:
```
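The `rate` and `throughput` figures in the report can be reproduced from the other values it prints. The sketch below assumes vegeta's usual definitions: rate is requests sent divided by the attack duration, and throughput is successful requests divided by the total duration (attack plus wait). Because the printed durations are rounded, the results are approximate.

```python
# Values copied from the vegeta report above.
total_requests = 150
success_ratio = 1.0   # 100.00%
attack_s = 2.98       # time spent sending requests
total_s = 6.732       # attack plus wait for the last responses

rate = total_requests / attack_s
throughput = total_requests * success_ratio / total_s

print(f"rate={rate:.2f} req/s, throughput={throughput:.2f} req/s")
```

The gap between the two numbers reflects the time spent waiting for batched responses after the last request was sent.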

