OpenAI Function Calling

In this example we demonstrate how to use function calling with the OpenAI runtime on Seldon Core 2. We will closely follow the function calling guide from the OpenAI docs.

We begin by deploying an OpenAI model on Seldon Core 2. The namespace we are using in this example is seldon, but feel free to replace it based on your configuration.

For this tutorial you will need to create a secret with the OpenAI API key and deploy the API Runtime server. Please check our installation tutorial to see how you can do so.

We can deploy the OpenAI model using the following manifest file:

!cat manifests/openai-chat-completions.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: openai-chat-completions
spec:
  storageUri: "gs://seldon-models/llm-runtimes/examples/api/openai-func-call/models/openai_chat_completions"
  requirements:
  - openai

The MLServer model-settings file is:

!cat models/openai_chat_completions/model-settings.json
{
  "name": "openai_chat_completions",
  "implementation": "mlserver_llm_api.LLMRuntime",
  "parameters": {
    "extra": {
      "provider_id": "openai",
      "config": {
        "model_id": "gpt-4o",
        "model_type": "chat.completions"
      }
    }
  }
}

As you can see, we are deploying a "gpt-4o" model to be used for chat completions.

To deploy the model, run the following command:
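Assuming the manifest above is saved under manifests/ and that you are deploying into the seldon namespace, you can apply it and wait for the model to become ready as follows (adjust the path and namespace to match your setup):

!kubectl apply -f manifests/openai-chat-completions.yaml -n seldon
!kubectl wait --for condition=ready --timeout=300s model openai-chat-completions -n seldon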

Function calling

Now that we have our model deployed, we are ready to send some requests.

We begin by defining the actual function we are about to call and the associated tools definition, which is a list of JSON objects.
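As a minimal sketch, following the delivery-date example from the OpenAI function calling guide, the function and tools definition could look as follows (the get_delivery_date implementation is a stand-in that simply returns a date two days from now):

import json
from datetime import datetime, timedelta


def get_delivery_date(order_id: str) -> str:
    # Stand-in implementation: in a real system you would look the order up
    # in your order database. Here we simply return a date two days from now.
    return (datetime.now() + timedelta(days=2)).strftime("%Y-%m-%d")


tools = [
    {
        "type": "function",
        "function": {
            "name": "get_delivery_date",
            "description": "Get the delivery date for a customer's order.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "description": "The customer's order ID.",
                    }
                },
                "required": ["order_id"],
                "additionalProperties": False,
            },
        },
    }
]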

We can now construct the first request in our conversation with "gpt-4o".
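What follows is a sketch rather than a definitive request format: it assumes the model is reachable through the Seldon mesh at MESH_IP, that requests are routed with the Seldon-Model header, and that the runtime accepts Open Inference Protocol BYTES tensors named role, content, tools and tool_choice (the tool_choice tensor is used later in this example). Check the LLM runtime documentation for the exact request schema of your version.

import requests

MESH_IP = "0.0.0.0"          # replace with the IP/hostname of your seldon-mesh service
MODEL_NAME = "openai-chat-completions"
ENDPOINT = f"http://{MESH_IP}/v2/models/{MODEL_NAME}/infer"
HEADERS = {"Content-Type": "application/json", "Seldon-Model": MODEL_NAME}


def build_request(messages, tools=None, tool_choice=None):
    """Pack a chat history (plus optional tools) into an Open Inference Protocol payload."""
    inputs = [
        {
            "name": "role",
            "shape": [len(messages)],
            "datatype": "BYTES",
            "data": [m["role"] for m in messages],
        },
        {
            "name": "content",
            "shape": [len(messages)],
            "datatype": "BYTES",
            "data": [m["content"] for m in messages],
        },
    ]
    if tools is not None:
        inputs.append(
            {
                "name": "tools",
                "shape": [len(tools)],
                "datatype": "BYTES",
                "data": [json.dumps(tool) for tool in tools],
            }
        )
    if tool_choice is not None:
        inputs.append(
            {
                "name": "tool_choice",
                "shape": [1],
                "datatype": "BYTES",
                "data": [tool_choice],
            }
        )
    return {"inputs": inputs}


messages = [
    {
        "role": "system",
        "content": "You are a helpful customer support assistant. "
        "Use the supplied tools to assist the user.",
    },
    {"role": "user", "content": "Hi, can you tell me the delivery date for my order?"},
]

response = requests.post(ENDPOINT, headers=HEADERS, json=build_request(messages, tools))
print(response.json())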

We construct and send another request in which we append the LLM's answer and the user's reply to the conversation.
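Continuing the sketch, we pull the assistant's text out of the previous response (the output tensor names below are assumptions; print response.json() to inspect the actual layout), append it together with the user's reply containing the order ID, and send the conversation again:

# Extract the assistant's reply from the previous Open Inference Protocol response.
outputs = {output["name"]: output["data"] for output in response.json()["outputs"]}
assistant_reply = outputs["content"][0]

messages += [
    {"role": "assistant", "content": assistant_reply},
    {"role": "user", "content": "I think it is order_12345."},
]

response = requests.post(ENDPOINT, headers=HEADERS, json=build_request(messages, tools))
print(response.json())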

The model has now responded with tool_calls, which contains the JSON-encoded arguments we have to call our function with. We decode the inference response and call our function to get the result.
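Assuming the tool calls come back in a BYTES output tensor named tool_calls holding the JSON produced by the model, decoding it and calling our function could look like this:

outputs = {output["name"]: output["data"] for output in response.json()["outputs"]}
tool_call = json.loads(outputs["tool_calls"][0])

# The function arguments are returned as a JSON string matching our tools schema.
arguments = json.loads(tool_call["function"]["arguments"])
delivery_date = get_delivery_date(order_id=arguments["order_id"])
print(delivery_date)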

Now that we have the delivery date, we feed it back to the LLM.
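Following the OpenAI guide, we append the assistant's tool call and a tool message carrying the function result (referencing the tool call id), and send the conversation one more time. How tool messages are encoded in the role and content tensors is runtime specific, so treat the layout below as an assumption:

messages += [
    # The assistant message that requested the tool call...
    {"role": "assistant", "content": json.dumps({"tool_calls": [tool_call]})},
    # ...followed by the tool result, referencing the id of the tool call it answers.
    {
        "role": "tool",
        "content": json.dumps(
            {
                "order_id": arguments["order_id"],
                "delivery_date": delivery_date,
                "tool_call_id": tool_call["id"],
            }
        ),
    },
]

response = requests.post(ENDPOINT, headers=HEADERS, json=build_request(messages, tools))
outputs = {output["name"]: output["data"] for output in response.json()["outputs"]}
print(outputs["content"][0])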

We can observe that the date we provided is included in the LLM's response. One inconvenience with the requests above is the need to keep track of the conversation history ourselves and to introduce some sort of padding. These two problems are addressed by the memory component, which can keep track of the conversation history automatically.

Parallel function calling

Besides single function calls, the OpenAI runtime also supports parallel function calling, in which the model can request multiple tool calls in one response, as sketched below.
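As a sketch, we can define a get_current_weather tool and ask about several cities in a single message, mirroring the parallel function calling example in the OpenAI guide; the response should then contain one tool call per city, which can be executed and fed back exactly as in the previous section:

weather_tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city, e.g. Paris",
                    }
                },
                "required": ["location"],
            },
        },
    }
]

weather_messages = [
    {"role": "user", "content": "What is the weather like in Paris, Tokyo and New York?"}
]

response = requests.post(
    ENDPOINT, headers=HEADERS, json=build_request(weather_messages, weather_tools)
)
print(response.json())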

Note that the LLM provides an answer for all of the provided cities.

Tool choice

In addition to parallel function calling, you can configure the function calling behaviour through the tool_choice tensor. For example, if you want to disable function calling and force the model to generate only a user-facing message, you can either provide no tools or set tool_choice to "none".
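For instance, reusing the helper from above, the sketch below resends the weather question with tool_choice set to "none", so the model should reply with a plain, user-facing message instead of requesting tool calls:

response = requests.post(
    ENDPOINT,
    headers=HEADERS,
    json=build_request(weather_messages, weather_tools, tool_choice="none"),
)
outputs = {output["name"]: output["data"] for output in response.json()["outputs"]}
print(outputs["content"][0])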

Congrats, you've now leveraged an OpenAI model deployed in Kubernetes to call functions! To remove the model, run the following command:
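Assuming the same manifest and namespace used for deployment:

!kubectl delete -f manifests/openai-chat-completions.yaml -n seldon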
