OpenAI

The following demonstrates how to run an API Runtime instance locally to perform inference with OpenAI models, and illustrates the different ways it can be used.

This example only showcases the API Runtime as a stand-alone component; for a more integrated example, check out the chatbot demo.

To get up and running, we first need to pull the runtime Docker image. Pulling the image requires authentication; check our installation tutorial to see how to authenticate with the Docker CLI.

docker pull \
    europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-llm-api:0.7.0

Before we can start the runtime, we need to create a model-settings.json file, which tells it which model to run:

!cat models/openai-chat-completions/model-settings.json
{
  "name": "openai-chat-completions",
  "implementation": "mlserver_llm_api.LLMRuntime",
  "parameters": {
    "extra": {
      "provider_id": "openai",
      "config": {
        "model_id": "gpt-3.5-turbo",
        "model_type": "chat.completions"
      }
    }
  }
}

In the above settings, the runtime config is specified in the parameters JSON field:

  1. provider_id selects the provider - currently, "openai" and "gemini" are supported.

  2. config selects the model and API - here we've chosen the gpt-3.5-turbo model and the chat.completions API.
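
Note that MLServer discovers models by scanning for model-settings.json files under the model repository, so for the commands below this file is expected to live in a models/ directory, with one sub-folder per model:

models/
└── openai-chat-completions/
    └── model-settings.json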

Starting the Runtime

Finally, to start the server run:

docker run -it --rm -p 8080:8080 \
  -v ${PWD}/models:/models \
  -e MLSERVER_MODEL_OPENAI_API_KEY=<your_openai_key> \
  europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-llm-api:0.7.0 \
  mlserver start /models
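
Once the container is running, you can check that the model has been loaded and is ready to serve requests via the V2 readiness endpoint:

import requests

# Returns HTTP 200 once the openai-chat-completions model has been loaded
ready = requests.get("http://localhost:8080/v2/models/openai-chat-completions/ready")
print(ready.status_code)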

Sending Requests

To send our first request to the chat.completions endpoint that we are now serving via MLServer, we use the following:

import json
import pprint
import requests


inference_request = {
    "inputs": [
        {
            "name": "role", 
            "shape": [2], 
            "datatype": "BYTES", 
            "data": ["system", "user"]
        },
        {
            "name": "content",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["You are a helpful assistant"]), 
                json.dumps(["Hello from MLServer!"]),
            ],
        },
        {
            "name": "type",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["text"]), 
                json.dumps(["text"]),
            ]
        }
    ]
}

endpoint = "http://localhost:8080/v2/models/openai-chat-completions/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': 'aa65a0ac-6aea-4284-9e55-7c74e299390e',
 'model_name': 'openai-chat-completions',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['Hello! How can I assist you today?'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['{\n'
                       '  "id": "chatcmpl-AYD7ygNL8rqda0t5mG691k1aXpGLC",\n'
                       '  "choices": [\n'
                       '    {\n'
                       '      "finish_reason": "stop",\n'
                       '      "index": 0,\n'
                       '      "logprobs": null,\n'
                       '      "message": {\n'
                       '        "content": "Hello! How can I assist you '
                       'today?",\n'
                       '        "refusal": null,\n'
                       '        "role": "assistant"\n'
                       '      }\n'
                       '    }\n'
                       '  ],\n'
                       '  "created": 1732716978,\n'
                       '  "model": "gpt-3.5-turbo-0125",\n'
                       '  "object": "chat.completion",\n'
                       '  "system_fingerprint": null,\n'
                       '  "usage": {\n'
                       '    "completion_tokens": 9,\n'
                       '    "prompt_tokens": 21,\n'
                       '    "total_tokens": 30,\n'
                       '    "completion_tokens_details": {\n'
                       '      "reasoning_tokens": 0,\n'
                       '      "audio_tokens": 0,\n'
                       '      "accepted_prediction_tokens": 0,\n'
                       '      "rejected_prediction_tokens": 0\n'
                       '    },\n'
                       '    "prompt_tokens_details": {\n'
                       '      "cached_tokens": 0,\n'
                       '      "audio_tokens": 0\n'
                       '    }\n'
                       '  }\n'
                       '}'],
              'datatype': 'BYTES',
              'name': 'output_all',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}

Note that we've sent three tensors: a "role", a "content", and a "type" tensor. The "role" tensor tells the model who is speaking; in this case, it includes a "system" role and a "user" role. The "system" role sets the context of the interaction, while the "user" role indicates that the matching content was sent by a user. In the above, the system content is "You are a helpful assistant" and the user content is "Hello from MLServer!". The "type" tensor indicates that the content we sent is text.

The endpoint responds with its own "role", "content" and "type" tensors. Its "role" is given as "assistant" and the "content" it returns is "Hello! How can I assist you today?". In addition, the server returns the full response received from OpenAI via the "output_all" tensor.
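
If you only need the generated reply, it can be read directly out of the response payload. The snippet below (reusing the response object from the request above) extracts the assistant content and parses the raw OpenAI response carried by the "output_all" tensor:

import json

outputs = {o["name"]: o for o in response.json()["outputs"]}

# The assistant's reply is the first element of the "content" tensor
print(outputs["content"]["data"][0])

# "output_all" holds the full OpenAI response as a JSON string
full_response = json.loads(outputs["output_all"]["data"][0])
print(full_response["usage"]["total_tokens"])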

Requests with Parameters

We can also add parameters to the request to specify the number of generations, the temperature, and so on. For a list of all available parameters, see the OpenAI documentation for their API. The following sets the temperature and the maximum number of tokens to generate:

import pprint
import requests


inference_request = {
    "inputs": [
        {
            "name": "role", 
            "shape": [1], 
            "datatype": "BYTES", 
            "data": ["user"]
        },
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Hello from MLServer!"],
        },
        {
            "name": "type",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["text"]
        }
    ],
    "parameters": {
        "llm_parameters": {
            "temperature": 0.5,
            "max_tokens": 100,
        }
    },
}

endpoint = "http://localhost:8080/v2/models/openai-chat-completions/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': '6993f40a-46f3-4aca-ab0e-38b1d64b3c7b',
 'model_name': 'openai-chat-completions',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['Hello! How can I assist you today?'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['{\n'
                       '  "id": "chatcmpl-AYD86FsU7OSnNFhPUk8SBL7GtJIRS",\n'
                       '  "choices": [\n'
                       '    {\n'
                       '      "finish_reason": "stop",\n'
                       '      "index": 0,\n'
                       '      "logprobs": null,\n'
                       '      "message": {\n'
                       '        "content": "Hello! How can I assist you '
                       'today?",\n'
                       '        "refusal": null,\n'
                       '        "role": "assistant"\n'
                       '      }\n'
                       '    }\n'
                       '  ],\n'
                       '  "created": 1732716986,\n'
                       '  "model": "gpt-3.5-turbo-0125",\n'
                       '  "object": "chat.completion",\n'
                       '  "system_fingerprint": null,\n'
                       '  "usage": {\n'
                       '    "completion_tokens": 9,\n'
                       '    "prompt_tokens": 12,\n'
                       '    "total_tokens": 21,\n'
                       '    "completion_tokens_details": {\n'
                       '      "reasoning_tokens": 0,\n'
                       '      "audio_tokens": 0,\n'
                       '      "accepted_prediction_tokens": 0,\n'
                       '      "rejected_prediction_tokens": 0\n'
                       '    },\n'
                       '    "prompt_tokens_details": {\n'
                       '      "cached_tokens": 0,\n'
                       '      "audio_tokens": 0\n'
                       '    }\n'
                       '  }\n'
                       '}'],
              'datatype': 'BYTES',
              'name': 'output_all',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}

Note that if you are sending a single message, you must not encode the content and type as JSON; they are passed as plain strings, as in the example above.

Adding Prompts

Prompting is a pivotal part of using large language models. It allows developers to embed messages sent by a user within a wider textual context that provides more information about the task the model should perform. For instance, we use the following prompt to give the model more context about the kind of question it is expected to answer:

!cat models/openai-prompt-chat-completions/math.jinja
Translate a math problem into Python code that can be executed in Python 3 REPL. Use the output of running this code to answer the question.

Question: "Question with math problem."
```python
"Code that solves the problem and prints the solution"
```
```output
"Output of running the code"
```
Answer: "Answer"

Begin.

Question: What is 37593 * 67?

```python
print(37593 * 67)
```
```output
2518731
```
Answer: 2518731

Question: {{ question[0] }}

In the above, the content sent by the user is inserted at the {{ question }} placeholder. A developer specifies which content to insert by giving the corresponding request tensor the name "question". To start with, we need to create a new model-settings.json file. This is the same as the previous one, but it also specifies the prompt template to use via the prompt_utils settings.

!cat models/openai-prompt-chat-completions/model-settings.json
{
  "name": "openai-prompt-chat-completions",
  "implementation": "mlserver_llm_api.LLMRuntime",
  "parameters": {
    "extra": {
      "provider_id": "openai",
      "config": {
        "model_id": "gpt-3.5-turbo",
        "model_type": "chat.completions"
      },
      "prompt_utils": {
        "prompt_options": {
          "uri": "math.jinja"
        }
      }
    }
  }
}

We can test this using the following:

import pprint
import requests

inference_request = {
    "inputs": [
        {
            "name": "role",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["user"],
        },
        {
            "name": "question",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["What is 100 * 2?"],
        },
    ]
}

endpoint = "http://localhost:8080/v2/models/openai-prompt-chat-completions/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': '396ef236-7ccb-4d57-aa1c-b44c0fc1302d',
 'model_name': 'openai-prompt-chat-completions',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['```python\n'
                       'print(100 * 2)\n'
                       '```\n'
                       '```output\n'
                       '200\n'
                       '```\n'
                       'Answer: 200'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['{\n'
                       '  "id": "chatcmpl-AYD8UsZUETVwrZeaQJ19Y4uUpojjz",\n'
                       '  "choices": [\n'
                       '    {\n'
                       '      "finish_reason": "stop",\n'
                       '      "index": 0,\n'
                       '      "logprobs": null,\n'
                       '      "message": {\n'
                       '        "content": "```python\\nprint(100 * '
                       '2)\\n```\\n```output\\n200\\n```\\nAnswer: 200",\n'
                       '        "refusal": null,\n'
                       '        "role": "assistant"\n'
                       '      }\n'
                       '    }\n'
                       '  ],\n'
                       '  "created": 1732717010,\n'
                       '  "model": "gpt-3.5-turbo-0125",\n'
                       '  "object": "chat.completion",\n'
                       '  "system_fingerprint": null,\n'
                       '  "usage": {\n'
                       '    "completion_tokens": 23,\n'
                       '    "prompt_tokens": 129,\n'
                       '    "total_tokens": 152,\n'
                       '    "completion_tokens_details": {\n'
                       '      "reasoning_tokens": 0,\n'
                       '      "audio_tokens": 0,\n'
                       '      "accepted_prediction_tokens": 0,\n'
                       '      "rejected_prediction_tokens": 0\n'
                       '    },\n'
                       '    "prompt_tokens_details": {\n'
                       '      "cached_tokens": 0,\n'
                       '      "audio_tokens": 0\n'
                       '    }\n'
                       '  }\n'
                       '}'],
              'datatype': 'BYTES',
              'name': 'output_all',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}

Note that we sent a single tensor named "question", whose content was inserted into the prompt template.
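
If you want to inspect exactly what prompt the runtime builds before it is sent to OpenAI, you can render the template locally. This is just a sketch for inspection (it assumes the jinja2 package is installed); the actual rendering is performed server-side by the runtime:

from jinja2 import Template

with open("models/openai-prompt-chat-completions/math.jinja") as f:
    template = Template(f.read())

# The template indexes question[0], so the value is passed as a single-element list
print(template.render(question=["What is 100 * 2?"]))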

Embedding Text

The OpenAI runtime also provides an embedding API that allows you to encode a body of text into a high-dimensional vector representation. Developers can then use this to perform vector searches over similarly encoded text. Here is an example of a model-settings.json:

!cat models/openai-embeddings/model-settings.json
{
  "name": "openai-embeddings",
  "implementation": "mlserver_llm_api.LLMRuntime",
  "parameters": {
    "extra": {
      "provider_id": "openai",
      "config": {
        "model_id": "text-embedding-ada-002",
        "model_type": "embeddings"
      }
    }
  }
}

We can then send the texts we want to embed via the "input" tensor:

import pprint
import requests


inference_request = {
    "inputs": [
        {
            "name": "input",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                "At the British Academy Film Awards, Oppenheimer wins Best Film and six other awards, including Best Actor for Cillian Murphy",
                "In American football, the Kansas City Chiefs defeat the San Francisco 49ers to win the Super Bowl.",
            ],
        },
    ]
}

endpoint = "http://localhost:8080/v2/models/openai-embeddings/infer"
response = requests.post(endpoint, json=inference_request)

pprint.pprint(response.json(), depth=3)
print('\nShape of vectors', response.json()["outputs"][0]["shape"])
{'id': 'e747c598-bf96-455e-a018-6762560f3bfd',
 'model_name': 'openai-embeddings',
 'outputs': [{'data': [...],
              'datatype': 'FP64',
              'name': 'embedding',
              'parameters': {...},
              'shape': [...]},
             {'data': [...],
              'datatype': 'BYTES',
              'name': 'output_all',
              'parameters': {...},
              'shape': [...]}],
 'parameters': {}}

Shape of vectors [2, 1536]

The above output is truncated because the embedding vectors are quite long; you can see that there are two of them and each has length 1536. These vectors can be used to search a similarly vectorized text corpus for related content.
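
As a sketch of how such a search works, the snippet below computes the cosine similarity between the two returned embeddings using numpy; in a real search, you would embed a query in the same way and rank stored vectors by this score:

import numpy as np

output = response.json()["outputs"][0]
# The V2 protocol returns the data flattened, so reshape it back to [2, 1536]
embeddings = np.array(output["data"]).reshape(output["shape"])

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings[0], embeddings[1]))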

Image Generation

The OpenAI runtime also provides access to the DALL·E image generations endpoint; the model-settings.json is similar to the previous cases.

!cat models/openai-image-generations/model-settings.json
{
  "name": "openai-image-generations",
  "implementation": "mlserver_llm_api.LLMRuntime",
  "parameters": {
    "extra": {
      "provider_id": "openai",
      "config": {
        "model_type": "images.generations"
      }
    }
  }
}

When calling this endpoint, we send a "prompt" tensor. Let's try to depict something nice and cheery!

import pprint
import requests

inference_request = {
    "inputs": [
        {
            "name": "prompt",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["A tropical paradise"],
        },
    ]
}

endpoint = "http://localhost:8080/v2/models/openai-image-generations/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': '251e104b-ca3d-4a6a-b53d-dc902be24dfc',
 'model_name': 'openai-image-generations',
 'outputs': [{'data': ['https://oaidalleapiprodscus.blob.core.windows.net/private/org-RpvLIXbLFPC4TppFUFheeITu/user-KNvzxivr23yj7hkdCOKiXwVT/img-RMzanu241DxCfRUaZGE9PtbI.png?st=2024-11-27T13%3A19%3A28Z&se=2024-11-27T15%3A19%3A28Z&sp=r&sv=2024-08-04&sr=b&rscd=inline&rsct=image/png&skoid=d505667d-d6c1-4a0a-bac7-5c84a87759f8&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2024-11-27T00%3A20%3A13Z&ske=2024-11-28T00%3A20%3A13Z&sks=b&skv=2024-08-04&sig=ShKcwDQ8P%2B6xarlK0J1MpEpl%2Bl1uEap/UEDvAWpW19U%3D'],
              'datatype': 'BYTES',
              'name': 'image',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['{\n'
                       '  "created": 1732717168,\n'
                       '  "data": [\n'
                       '    {\n'
                       '      "url": '
                       '"https://oaidalleapiprodscus.blob.core.windows.net/private/org-RpvLIXbLFPC4TppFUFheeITu/user-KNvzxivr23yj7hkdCOKiXwVT/img-RMzanu241DxCfRUaZGE9PtbI.png?st=2024-11-27T13%3A19%3A28Z&se=2024-11-27T15%3A19%3A28Z&sp=r&sv=2024-08-04&sr=b&rscd=inline&rsct=image/png&skoid=d505667d-d6c1-4a0a-bac7-5c84a87759f8&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2024-11-27T00%3A20%3A13Z&ske=2024-11-28T00%3A20%3A13Z&sks=b&skv=2024-08-04&sig=ShKcwDQ8P%2B6xarlK0J1MpEpl%2Bl1uEap/UEDvAWpW19U%3D"\n'
                       '    }\n'
                       '  ]\n'
                       '}'],
              'datatype': 'BYTES',
              'name': 'output_all',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}

The images are returned as URLs pointing to their location, which can be used to download them.

import os
import urllib.request

# Create a local directory to store the downloaded image
if not os.path.exists("images"):
    os.mkdir("images")

# Download the generated image (a PNG) from the returned URL
url = response.json()['outputs'][0]['data'][0]
urllib.request.urlretrieve(url, "images/generated-image.png");
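
Since this example runs in a notebook, the downloaded file can then be displayed inline:

from IPython.display import Image

Image(filename="images/generated-image.png")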

Azure Deployments

Our runtime can also be set up to integrate with OpenAI deployments on Azure. To set up such an Azure OpenAI deployment, see here.

The following is an example model-settings.json file.

!cat models/azure-chat-completions/model-settings.json
{
  "name": "azure-openai-chat-model",
  "implementation": "mlserver_llm_api.LLMRuntime",
  "parameters": {
    "extra": {
      "provider_id": "openai",
      "config": {
        "api_type": "azure",
        "azure_endpoint": "https://mlinfra-openai-test.openai.azure.com/",
        "model_id": "mlinfra-gpt-35-turbo-test",
        "api_version": "2024-08-01-preview",
        "model_type": "chat.completions"
      }
    }
  }
}

In the above JSON definition

  • api_type is "azure" or "azure_ad"

  • azure_endpoint is the deployment endpoint

  • model_id is the deployment name

  • api_version is "2024-08-01-preview" and could change in the future

You can call the models deployed on Azure as before - everything else stays the same.
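
For example, assuming the runtime has been started with the settings above (and the Azure credentials configured as described in the installation tutorial), a chat request is sent exactly as before; only the model name in the URL changes:

import pprint
import requests


inference_request = {
    "inputs": [
        {"name": "role", "shape": [1], "datatype": "BYTES", "data": ["user"]},
        {"name": "content", "shape": [1], "datatype": "BYTES", "data": ["Hello from MLServer!"]},
        {"name": "type", "shape": [1], "datatype": "BYTES", "data": ["text"]},
    ]
}

# The model name matches the "name" field in the Azure model-settings.json
endpoint = "http://localhost:8080/v2/models/azure-openai-chat-model/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)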

Deploying on Seldon Core 2

We will now demonstrate how to deploy the chat completions model on Seldon Core 2. All the other models can be deployed following the same steps.

While the runtime image can be used as a stand-alone server, in most cases you'll want to deploy it as part of a Kubernetes cluster. This section assumes you have a Kubernetes cluster running with Seldon Core 2 installed in the seldon namespace. In order to start serving OpenAI models, you first have to create a secret for the OpenAI API key and deploy the API Runtime server. Please check our installation tutorial to see how you can do so.

To deploy the chat completions model, we will need to create the associated manifest file.

!cat manifests/openai-chat-completions.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: openai-chat-completions
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/examples/api/openai/models/openai_chat_completions"
  requirements:
  - openai

To load the model in Seldon Core 2, run:

!kubectl apply -f manifests/openai-chat-completions.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/openai-chat-completions created
model.mlops.seldon.io/openai-chat-completions condition met

Before sending the actual request, we need to get the mesh IP. The following utility function retrieves the correct IP:

import subprocess

def get_mesh_ip():
    # Retrieve the external IP of the seldon-mesh LoadBalancer service
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode("utf-8")

As before, we can now send a request to the model:

import json
import pprint
import requests


inference_request = {
    "inputs": [
        {
            "name": "role", 
            "shape": [2], 
            "datatype": "BYTES", 
            "data": ["system", "user"]
        },
        {
            "name": "content",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["You are a helpful assistant"]), 
                json.dumps(["Hello from MLServer!"]),
            ],
        },
        {
            "name": "type",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["text"]), 
                json.dumps(["text"]),
            ]
        }
    ]
}

endpoint = f"http://{get_mesh_ip()}/v2/models/openai-chat-completions/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': 'd5325b77-ad4d-4389-8add-f5728b32a493',
 'model_name': 'openai-chat-completions_1',
 'model_version': '1',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['Hello! How can I assist you today?'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['{\n'
                       '  "id": "chatcmpl-AYDDh6QqLGR94agCusfrFDuOvunw5",\n'
                       '  "choices": [\n'
                       '    {\n'
                       '      "finish_reason": "stop",\n'
                       '      "index": 0,\n'
                       '      "logprobs": null,\n'
                       '      "message": {\n'
                       '        "content": "Hello! How can I assist you '
                       'today?",\n'
                       '        "refusal": null,\n'
                       '        "role": "assistant"\n'
                       '      }\n'
                       '    }\n'
                       '  ],\n'
                       '  "created": 1732717333,\n'
                       '  "model": "gpt-3.5-turbo-0125",\n'
                       '  "object": "chat.completion",\n'
                       '  "system_fingerprint": null,\n'
                       '  "usage": {\n'
                       '    "completion_tokens": 9,\n'
                       '    "prompt_tokens": 21,\n'
                       '    "total_tokens": 30,\n'
                       '    "completion_tokens_details": {\n'
                       '      "reasoning_tokens": 0,\n'
                       '      "audio_tokens": 0,\n'
                       '      "accepted_prediction_tokens": 0,\n'
                       '      "rejected_prediction_tokens": 0\n'
                       '    },\n'
                       '    "prompt_tokens_details": {\n'
                       '      "cached_tokens": 0,\n'
                       '      "audio_tokens": 0\n'
                       '    }\n'
                       '  }\n'
                       '}'],
              'datatype': 'BYTES',
              'name': 'output_all',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}

You now have a deployed model in Seldon Core 2, ready and available for requests! To unload the model, you can run the following command:

!kubectl delete -f manifests/openai-chat-completions.yaml -n seldon
model.mlops.seldon.io "openai-chat-completions" deleted
