Tool Use
In the OpenAI Function Calling tutorial, we showed how we can combine the OpenAI Runtime with the Memory Runtime to perform function calling without worrying about managing the conversation history. Although this lifts the burden of managing the history from the user, it still has the caveat that the actual call to the tool has to be made locally by the user, and the response then has to be provided back to the LLM so it can return its final answer.
In this tutorial, we will show how to automate the call to the actual function by wrapping it in a custom MLServer model deployment. We will also demonstrate how to add support for multiple functions and route the data flow through the pipeline depending on which function the LLM chooses to call.
In this example, we will define two functions, get_temperature and get_wind_speed, which return, respectively, the temperature and the wind speed for a geographical location defined by its latitude and longitude.
The pipeline looks like this:
We see three routes that the data flow can take:
- temperature route, taken when the model is asked about the current temperature in a specific location - triggered when the get_temperature tensor is present.
- wind speed route, taken when the model is asked about the current wind speed in a specific location - triggered when the get_wind_speed tensor is present.
- default route, which immediately returns the response of the model if none of the above routes are taken.
The triggering mechanism is a feature in Seldon Core 2 (SC2). For further details on the triggering mechanism in SC2, we invite the reader to check our docs here.
One may wonder how those routing tensors are extracted. To extract them, the OpenAI Runtime looks at the response content and the tool calls tensors returned by the LLM and tries to extract the values of particular keys that the user provides in the model-settings.json file. For example, the tool_calls tensor may look something like:
"tool_calls": [
{
"id": "call_62136354",
"type": "function",
"function": {
"arguments": "{'order_id': 'order_12345'}",
"name": "get_temperature",
},
}
]In this case, it will be enough to specify to extract the value under the "name" key, which will be "get_temperature". After extracting the value, the OpenAI runtime appends to the inference response outputs list an empty tensor with the name get_temperature. The presence of the get_temperature tensor will then trigger only the route responsible for fetching the temperature for the queried location.
Note that the LLM can also be asked questions completely unrelated to the temperature or the wind speed at a location. In that case, we should not call any function and should return the answer immediately by routing the data through the default route. The OpenAI runtime will then return an empty tensor called default, which is used to trigger the default route.
Custom MLServer models
Now that we have an idea of what the pipeline should look like, let's begin by implementing the custom MLServer models that will return the temperature and the wind speed.
We begin by defining the actual calls to the meteo API:
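Below is a minimal sketch of what these calls could look like, assuming the free Open-Meteo forecast endpoint; the exact API, parameters, and field names used in the tutorial may differ:

```python
import requests

API_URL = "https://api.open-meteo.com/v1/forecast"  # assumption: Open-Meteo is used


def get_data(latitude: float, longitude: float) -> dict:
    """Make the general call to the API and return the current weather block."""
    params = {"latitude": latitude, "longitude": longitude, "current_weather": True}
    response = requests.get(API_URL, params=params)
    response.raise_for_status()
    return response.json()["current_weather"]


def get_temperature(latitude: float, longitude: float) -> float:
    """Extract the temperature field from the API response."""
    return get_data(latitude, longitude)["temperature"]


def get_wind_speed(latitude: float, longitude: float) -> float:
    """Extract the wind speed field from the API response."""
    return get_data(latitude, longitude)["windspeed"]
```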
In the code above, we defined the get_data function, which makes the general call to the API, and then get_temperature and get_wind_speed, which extract the appropriate fields from the response.
The custom MLServer model that wraps the actual function of interest will only receive the tool_calls tensor from the OpenAI Model. This suffices because all the arguments that should be passed to the function are stored in the tool_calls tensor. Thus, we need to define some functions that parse the content of the tool_calls tensor.
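A sketch of such parsing helpers is shown below; it assumes the tool_calls tensor carries a list of JSON-encoded tool calls (the exact serialization used by the OpenAI Runtime may differ):

```python
import json

from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest


def get_tool_calls(payload: InferenceRequest) -> list:
    """Find the `tool_calls` tensor and deserialize its entries (assumed to be JSON strings)."""
    for request_input in payload.inputs:
        if request_input.name == "tool_calls":
            return [json.loads(item) for item in StringCodec.decode_input(request_input)]
    raise ValueError("No `tool_calls` tensor found in the request.")


def get_arguments(tool_call: dict) -> dict:
    """Parse the JSON-encoded arguments of a single tool call."""
    return json.loads(tool_call["function"]["arguments"])
```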
We will also define a helper function, which constructs the output response in the format expected by the Memory Runtime component:
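A possible implementation is sketched below; the tensor names ("role", "content", "tool_call_id", "tools") are assumptions about the message format the Memory Runtime expects:

```python
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput


def build_response(
    model_name: str,
    result: str,
    tool_call_id: str,
    payload: InferenceRequest,
) -> InferenceResponse:
    """Package the function result as a tool message for the Memory Runtime (assumed format)."""
    outputs = [
        StringCodec.encode_output(name="role", payload=["tool"]),
        StringCodec.encode_output(name="content", payload=[result]),
        StringCodec.encode_output(name="tool_call_id", payload=[tool_call_id]),
    ]

    # Forward the `tools` tensor untouched, if present.
    for request_input in payload.inputs:
        if request_input.name == "tools":
            outputs.append(
                ResponseOutput(
                    name=request_input.name,
                    shape=request_input.shape,
                    datatype=request_input.datatype,
                    data=request_input.data,
                )
            )

    return InferenceResponse(model_name=model_name, outputs=outputs)
```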
In essence, the function above only returns the response of the function we called. Some additional tensors (e.g., tools), which are not essential for this tutorial but will be used in the following ones, are also forwarded.
With all the helper functions in place, we can define the two custom MLServer models as follows:
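A sketch of the two models, building on the helpers above (the argument names latitude and longitude match the tools definition used later):

```python
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse


class TemperatureModel(MLModel):
    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Parse the tool call chosen by the LLM and execute the actual function.
        tool_call = get_tool_calls(payload)[0]
        args = get_arguments(tool_call)
        temperature = get_temperature(args["latitude"], args["longitude"])
        return build_response(self.name, str(temperature), tool_call["id"], payload)


class WindSpeedModel(MLModel):
    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        tool_call = get_tool_calls(payload)[0]
        args = get_arguments(tool_call)
        wind_speed = get_wind_speed(args["latitude"], args["longitude"])
        return build_response(self.name, str(wind_speed), tool_call["id"], payload)
```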
We can now load the two models in SC2. The model-settings.json for the temperature model looks as follows:
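Assuming the classes above live in a module called models.py (a hypothetical name), the settings file could look like this:

```json
{
    "name": "temperature",
    "implementation": "models.TemperatureModel"
}
```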
Similarly, for the wind speed model, we have the following model-settings.json file:
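Under the same assumption about the module name:

```json
{
    "name": "wind-speed",
    "implementation": "models.WindSpeedModel"
}
```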
For the two models, we will define a single manifest file which looks like this:
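A sketch of such a manifest is shown below; the storageUri values are placeholders for wherever the model artifacts are uploaded, and the server/requirements selection (which depends on how the custom runtime is packaged) is omitted:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: temperature
spec:
  storageUri: "gs://<your-bucket>/tool-use/temperature"  # placeholder
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: wind-speed
spec:
  storageUri: "gs://<your-bucket>/tool-use/wind-speed"  # placeholder
```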
We can load the models by running the following command:
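Assuming the manifest above was saved as function-models.yaml and SC2 runs in the seldon-mesh namespace (both assumptions):

```bash
kubectl apply -f function-models.yaml -n seldon-mesh
```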
At this point, we have our function models deployed. There is one more custom MLServer model that we need to deploy before starting to create the pipelines - the tail model. All the tail model does is receive an inference request containing the conversation history, parse it, and return the last entry. The tail model looks like this:
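A sketch, assuming the history arrives as a list of messages in a tensor named "history" (the actual tensor name and format produced by the pipeline may differ):

```python
from mlserver import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class TailModel(MLModel):
    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Assumption: the conversation history is a list of strings in a "history" tensor.
        for request_input in payload.inputs:
            if request_input.name == "history":
                last_entry = StringCodec.decode_input(request_input)[-1]
                return InferenceResponse(
                    model_name=self.name,
                    outputs=[StringCodec.encode_output(name="content", payload=[last_entry])],
                )
        raise ValueError("No `history` tensor found in the request.")
```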
The model-settings.json file for the tail model is the following:
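Again assuming the class is defined in models.py:

```json
{
    "name": "tail",
    "implementation": "models.TailModel"
}
```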
The associated manifest file is:
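A sketch, with a placeholder storageUri:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: tail
spec:
  storageUri: "gs://<your-bucket>/tool-use/tail"  # placeholder
```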
We can now deploy the tail model by running the following command:
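Assuming the manifest above was saved as tail-model.yaml:

```bash
kubectl apply -f tail-model.yaml -n seldon-mesh
```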
LLM models
At this point, we are done with the custom MLServer models. We can now move on to deploying the ChatGPT/LLM and the memory/history models.
We begin by deploying the LLM models. The model settings for the LLM model look like this:
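The snippet below is an illustrative sketch only: the implementation path, provider configuration, and model id are assumptions, and the exact schema should be taken from the LLM module documentation. The part this tutorial relies on is the "extract_tensors" field with its "key" list containing "name":

```json
{
    "name": "llm",
    "implementation": "mlserver_llm_api.runtime.OpenAIRuntime",
    "parameters": {
        "extra": {
            "provider_id": "openai",
            "config": {
                "model_id": "gpt-4o",
                "model_type": "chat.completions"
            },
            "extract_tensors": {
                "key": ["name"]
            }
        }
    }
}
```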
What is important to note in the model settings above is the "extract_tensors" field, which contains a "key" field pointing to a list containing "name". This informs the OpenAI runtime to look for the key "name" in the output of the LLM and to use the corresponding value as the name of an empty tensor added to the inference response output list. This tensor serves as a conditional tensor that triggers different routes in the pipeline.
As can be seen in the pipeline, we will have to deploy two instances of the same LLM. The manifest file looks as follows:
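A sketch of the manifest, again with placeholder storageUri values; both instances point at the same model settings:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: llm-1
spec:
  storageUri: "gs://<your-bucket>/tool-use/llm"  # placeholder
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: llm-2
spec:
  storageUri: "gs://<your-bucket>/tool-use/llm"  # placeholder
```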
We can deploy the models above by running the following command:
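Assuming the manifest was saved as llm-models.yaml:

```bash
kubectl apply -f llm-models.yaml -n seldon-mesh
```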
Memory components
Finally, we can deploy the memory components. The model-settings.json file for the memory component looks like this:
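An illustrative sketch only; the implementation path and parameter names are assumptions chosen to match the description below (a window of the 100 latest messages plus the function-calling-specific fields to store):

```json
{
    "name": "memory",
    "implementation": "mlserver_memory.ConversationalMemory",
    "parameters": {
        "extra": {
            "window_size": 100,
            "tensor_names": ["role", "content", "tool_calls", "tool_call_id"]
        }
    }
}
```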
In the above configuration, we specify that we want to keep a window of the 100 latest messages and that we want to store some fields specific to the OpenAI function calling API.
As before, we will have to deploy multiple instances of the memory component, 6 in this case. The manifest file looks like this:
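A sketch of the manifest; the remaining instances follow the same pattern:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: memory-1
spec:
  storageUri: "gs://<your-bucket>/tool-use/memory"  # placeholder
---
# ... memory-2 through memory-6 are defined analogously
```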
To deploy the memory components, run the following commands:
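Assuming the manifest was saved as memory-models.yaml:

```bash
kubectl apply -f memory-models.yaml -n seldon-mesh
```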
Pipelines
We are now ready to define the pipeline in the schema above. Note that we separated components with comments to make it easier to visualize them.
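The fragment below is a heavily simplified sketch that shows only the triggering pattern for the three routes; the full pipeline in this tutorial also wires in the memory components, the second LLM instance, and the tail model's place in the flow, and the step names are assumptions:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: function-calling
spec:
  steps:
    # --- LLM (the memory steps preceding it are omitted in this sketch) ---
    - name: llm-1
      inputs:
        - function-calling.inputs
    # --- temperature route ---
    - name: temperature
      inputs:
        - llm-1.outputs.tool_calls
      triggers:
        - llm-1.outputs.get_temperature
    # --- wind speed route ---
    - name: wind-speed
      inputs:
        - llm-1.outputs.tool_calls
      triggers:
        - llm-1.outputs.get_wind_speed
    # --- default route ---
    - name: tail
      inputs:
        - llm-1.outputs
      triggers:
        - llm-1.outputs.default
  output:
    steps:
      - temperature
      - wind-speed
      - tail
    stepsJoin: any
```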
To deploy the pipeline, run the following commands:
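Assuming the pipeline was saved as pipeline.yaml:

```bash
kubectl apply -f pipeline.yaml -n seldon-mesh
```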
Before sending the requests to the pipeline, we define a helper function to extract the IP address of the seldon-mesh service. This will help us define the endpoint to which we want to send the requests.
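One way to do this, assuming SC2 is installed in the seldon-mesh namespace and the service is exposed via a LoadBalancer:

```python
import subprocess


def get_mesh_ip() -> str:
    """Return the external IP of the seldon-mesh service."""
    cmd = [
        "kubectl", "get", "svc", "seldon-mesh",
        "-n", "seldon-mesh",
        "-o", "jsonpath={.status.loadBalancer.ingress[0].ip}",
    ]
    return subprocess.check_output(cmd).decode().strip()
```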
To inform the LLM about the functions we want to use (i.e., get_temperature and get_wind_speed), we define the tools object, which contains all the metadata needed by the LLM (e.g., name, description, arguments, etc.) to be able to provide the appropriate arguments to those functions. See OpenAI tutorial on function calling for more details.
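The tools definition follows the standard OpenAI function-calling schema; the descriptions below are illustrative:

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_temperature",
            "description": "Get the current temperature for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "latitude": {"type": "number", "description": "Latitude of the location."},
                    "longitude": {"type": "number", "description": "Longitude of the location."},
                },
                "required": ["latitude", "longitude"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_wind_speed",
            "description": "Get the current wind speed for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "latitude": {"type": "number", "description": "Latitude of the location."},
                    "longitude": {"type": "number", "description": "Longitude of the location."},
                },
                "required": ["latitude", "longitude"],
            },
        },
    },
]
```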
We are now ready to start interacting with the pipeline. We begin by defining the memory UUID which will uniquely identify a chat history:
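For example:

```python
import uuid

memory_id = str(uuid.uuid4())
```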
We also define a helper function to send requests to the pipeline to avoid repetitive code:
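A sketch of such a helper; the tensor names, the way the memory id is passed, and the exact endpoint format are assumptions about how the pipeline is wired:

```python
import json

import requests


def send_message(question: str, memory_id: str, mesh_ip: str, pipeline: str = "function-calling") -> str:
    """Send a user message to the pipeline and return the generated answer (assumed format)."""
    inference_request = {
        # Assumption: the memory id is passed as a request-level parameter.
        "parameters": {"memory_id": memory_id},
        "inputs": [
            {"name": "role", "shape": [1], "datatype": "BYTES", "data": ["user"]},
            {"name": "content", "shape": [1], "datatype": "BYTES", "data": [question]},
            {"name": "tools", "shape": [1], "datatype": "BYTES", "data": [json.dumps(tools)]},
        ],
    }
    endpoint = f"http://{mesh_ip}/v2/models/{pipeline}/infer"
    headers = {"Content-Type": "application/json", "Seldon-Model": f"{pipeline}.pipeline"}
    response = requests.post(endpoint, json=inference_request, headers=headers)
    response.raise_for_status()
    return response.json()["outputs"][0]["data"][0]


# Example usage (hypothetical):
# answer = send_message("What is the current temperature in Bucharest?", memory_id, get_mesh_ip())
```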
We begin by asking a general question to the LLM. In this case, no function should be called, and the answer should be provided through the default branch (i.e., general knowledge of the LLM):
We see that the LLM provided the right answer. Let us now ask about the current temperature in Bucharest. In this case, the model should call the temperature model, and the data should flow through the temperature branch.
We can also ask about the wind speed, which will go through the wind speed branch.
Note that we didn't have to specify that we are talking about Bucharest. The model is able to understand that from the context provided by the memory component.
We can query the model about a different location, in this case London.
Once again, the data flows through the temperature and wind speed branches, respectively, to retrieve the results.
We can continue the conversation with other questions completely unrelated to the previous ones:
Or we can ask about what has been discussed until now:
To unload the pipeline and the models, run the following commands:
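Assuming the manifest file names used above:

```bash
kubectl delete -f pipeline.yaml -n seldon-mesh
kubectl delete -f memory-models.yaml -f llm-models.yaml -n seldon-mesh
kubectl delete -f tail-model.yaml -f function-models.yaml -n seldon-mesh
```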
In case of a pipeline failure (e.g., a model inference fails), the pipeline will stop and return the error message to the user. Consistency of the chat history might not be guaranteed in this case, so it is recommended to check the conversation history by inspecting the memory component and to generate a new memory_id if needed.
Congrats! You've just deployed an agentic pipeline with the LLM module!