Planning

To run this tutorial, you have to build Seldon Core v2 from the v2 branch.

In the previous tutorial, we showcased how to deploy an agentic workflow that implements a Tool Use pattern. Remember that in a tool use pattern, an agent receives a query, and based on the content of the query it decides whether or not it should call a tool. In the case that an answer can be provided based on the general knowledge of the LLM gathered during training, the answer is returned immediately. Otherwise, a tool must be called, and then the answer is provided to the user.

Although the Tool Use pattern can be handy to automate some tasks, it might still require an exchange of messages between the user and the agent to complete more complicated tasks. In our previous example, we defined two tools fetching the temperature and the wind speed for a specific location, denoted as get_temperature and get_wind_speed. Let us now imagine that we have a third tool that returns what the weather is like given the temperature and the wind speed. For example, we can imagine if the temperature is greater than 30 degrees Celsius and the speed is lower than 10 the weather tool can return that it is a hot day with low wind speed. Let's denote the weather tool as get_weather. What we are trying to emphasize with this example is the dependence of get_weather on the information provided by get_temperature and get_wind_speed. So, to call get_weather we must already have the answers from get_temperature and get_wind_speed. This means, that in the Tool Use pattern, we must first ask the agent to provide the temperature for a specific location, then the wind speed for the same location, and finally ask about the weather given the previous context. Those are three interactions that the user must have with the agent to fetch information about the weather in a specific location.

In this tutorial we introduce the Planning Pattern which will allow the agent to solve the dependent tasks in one go through planning. To achieve this, the LLM module makes use of cyclic pipelines supported in Seldon Core 2. Thus, we can deploy a cyclic pipeline that has a feedback loop allowing the agent to plan, call multiple tools, gather the results returned by those tools, call the tool that provides the final answer, and then exit the loop and return the response to the user.

The planning pipeline for our example looks like this:

We see four routes that the data flow can take:

Temperature when the model is asked about the current temperature in a specific location - triggered when the get_temperature tensor is present.
Wind Speed when the model is asked about the current wind speed in a specific location - triggered when the get_wind_speed tensor is present.
Weather when the model is asked about the current weather for a specific location - triggered when the get_weather tensor is present.
Default which immediately returns the response of the model if none of the above routes are taken.

For more details on how the triggering tensors work and what happens when the model is asked a question for which it can provide the answer based on its internal knowledge, see the previous tutorial

Custom MLServer models

For the get_temperature and get_wind_speed, we use the same models as in the previous tutorial. In this example, we add the get_weather model that depends on the result provided by get_temperature and get_windspeed defined as:

def get_weather(temperature: float, windspeed: float) -> str:
    if temperature > 30 and windspeed < 10:
        return "hot day with low wind speed."
    elif temperature < 10 and windspeed > 20:
        return "cold day with high wind speed."
    elif temperature > 20 and windspeed > 20:
        return "warm day with high wind speed."
    elif temperature < 20 and windspeed < 10:
        return "cool day with low wind speed."
    else:
        return "moderate with normal wind speed."

Here we define a helper function, which constructs the output response in the format expected by the Memory Runtime component:

import json
from typing import Any, Dict, Union, List

from mlserver.codecs import StringCodec
from mlserver.types import InferenceResponse

def construct_response(
    name: str, 
    version: str, 
    response: Union[str, Dict[str, Any]], 
    tool_call_id: str, 
    memory_id: str,
    tools: List[Dict[str, Any]]
) -> InferenceResponse:
    return InferenceResponse(
            model_name=name,
            model_version=version,
            outputs=[
                StringCodec.encode_output(
                    name="role",
                    payload=["tool"]
                ),
                StringCodec.encode_output(
                    name="content",
                    payload=[json.dumps(response) if not isinstance(response, str) else response]
                ),
                StringCodec.encode_output(
                    name="type",
                    payload=["text"]
                ),
                StringCodec.encode_output(
                    name="tool_call_id",
                    payload=[tool_call_id]
                ),
                StringCodec.encode_output(
                    name="memory_id",
                    payload=[memory_id]
                ),
                StringCodec.encode_output(
                    name="tools",
                    payload=[json.dumps(tool) for tool in tools]
                ),
            ]
        )

We also define some helper functions which will be used in the custom MLServer model:

from typing import Optional, Tuple
from mlserver.types import (
    InferenceRequest, 
    RequestInput,
    ResponseOutput
)


def extract_tensor_by_name(
    tensors: Union[List[RequestInput], List[ResponseOutput]], name: str
) -> Optional[Union[RequestInput, ResponseOutput]]:
    for tensor in tensors:
        if tensor.name == name:
            return tensor

    return None


def get_history(payload: InferenceRequest) -> List[Dict[str, Any]]:
    tensor = extract_tensor_by_name(payload.inputs, "history")
    tensor = StringCodec.decode_input(tensor)
    return [json.loads(tc) for tc in tensor]


def get_memory_id(payload: InferenceRequest) -> str:
    memory_id = extract_tensor_by_name(payload.inputs, "memory_id")
    return StringCodec.decode_input(memory_id)[0]


def get_tools(payload: InferenceRequest) -> List[str]:
    tools = StringCodec.decode_input(extract_tensor_by_name(payload.inputs, "tools"))
    return [json.loads(tc) for tc in tools]


def get_args(history: List[Dict[str, Any]], label: str) -> Tuple[str, Dict[str, Any]]:
    tool_calls = history[-1]['tool_calls']
    tool_calls = [json.loads(tc) for tc in tool_calls]
    args = None

    for tool_call in tool_calls:
        if tool_call['function']['name'] == label:
            tool_call_id = tool_call['id']
            args = json.loads(tool_call['function']['arguments'])
            break

    return tool_call_id, args

With all the helper functions in place, we can define the custom MLServer model as follows:

from mlserver.model import MLModel
from mlserver.types import InferenceRequest, InferenceResponse


class GetWeather(MLModel):
    async def load(self) -> bool:
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        memory_id = get_memory_id(payload)
        tools = get_tools(payload)

        history = get_history(payload)
        tool_call_id, args = get_args(history, "get_weather")
        weather = get_weather(args['temperature'], args['windspeed'])
        
        return construct_response(
            name=self.name,
            version=self.version,
            response=weather,
            tool_call_id=tool_call_id,
            memory_id=memory_id,
            tools=tools
        )

The model-settings.json for the weather model looks as follows:

!cat models/get-weather/model-settings.json

{
    "name": "get-weather",
    "implementation": "model.GetWeather",
    "parameters": {
        "version": "v0.1.0"
    }
}

Here define a tail model, which extracts the last entry from the provided history. The model definition is the following:

import json
from mlserver import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class TailModel(MLModel):
    async def load(self) -> bool:
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        response = json.loads(
            StringCodec.decode_input(
                extract_tensor_by_name(payload.inputs, "history")
            )[-1]
        )
        return InferenceResponse(
            model_name=self.name,
            model_version=self.version,
            outputs=[
                StringCodec.encode_output(
                    name="role",
                    payload=[response["role"]]
                ),
                StringCodec.encode_output(
                    name="content",
                    payload=[response["content"][0]["value"]]
                ),
                StringCodec.encode_output(
                    name="type",
                    payload=[response["content"][0]["type"]]
                ),
            ]
        )

The corresponding model-settings.json file is the following:

!cat models/tail/model-settings.json

{
    "name": "get-weather",
    "implementation": "model.TailModel",
    "parameters": {
        "version": "v0.1.0"
    }
}

There is one last model in the pipeline, which is specific to the cyclic pipelines in Core 2 with the following definition:

import asyncio
from mlserver import MLModel, ModelSettings
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput


class IdentityModel(MLModel):
    def __init__(self, settings: ModelSettings):
        super().__init__(settings)
        self.params = settings.parameters
        self.extra = self.params.extra if self.params is not None else None
        self.delay = self.extra.get("delay")
        

    async def load(self) -> bool:
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        if self.delay:
            await asyncio.sleep(self.delay)
        
        return InferenceResponse(
            model_name=self.name,
            model_version=self.version,
            outputs=[
                ResponseOutput(
                    name=request_input.name,
                    shape=request_input.shape,
                    datatype=request_input.datatype,
                    parameters=request_input.parameters,
                    data=request_input.data
                ) for request_input in payload.inputs
            ]
        )

The identity model simply forwards all the inputs and adds a custom delay. In our case, the custom delay is 1 millisecond which corresponds to the joining window interval from Kafka Streams. This delay is required to avoid infinite loops due to joining operations in Kakfa Stream. For further details about how Kafka Streams works, see the following link.

The model-settings.json file for the identity model is the following:

!cat models/identity/model-settings.json

{
    "name": "identity",
    "implementation": "model.IdentityModel",
    "parameters": {
        "version": "v0.1.0",
        "extra": {
            "delay": 0.001
        }
    }
}

We are now ready to deploy all the models. The manifest file for all the models is the following:

!cat manifest/models.yaml

---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: get-temperature
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/agentic-workflows/planning/get-temperature"
  requirements:
  - mlserver
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: get-wind-speed
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/agentic-workflows/planning/get-wind-speed"
  requirements:
  - mlserver
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: get-weather
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/agentic-workflows/planning/get-weather"
  requirements:
  - mlserver
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: default-tail
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/agentic-workflows/planning/tail"
  requirements:
  - mlserver
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: default-llm
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/agentic-workflows/planning/llm"
  requirements:
  - openai
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: history-1
spec:
  storageUri: "gs://seldon-models/llm-runtimes/loop-test/memory"
  requirements:
  - memory
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: history-2
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/agentic-workflows/planning/memory"
  requirements:
  - memory
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: history-3
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/agentic-workflows/planning/memory"
  requirements:
  - memory
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: history-4
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/agentic-workflows/planning/memory"
  requirements:
  - memory
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: history-5
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/agentic-workflows/planning/memory"
  requirements:
  - memory
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: history-6
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/agentic-workflows/planning/memory"
  requirements:
  - memory
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: identity-loop
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/agentic-workflows/planning/identity"
  requirements:
  - mlserver

We can load all the models by running the following command:

!kubectl apply -f manifest/models.yaml -n seldon
!kubectl wait --for condition=ready --timeout=300s models --all -n seldon

model.mlops.seldon.io/get-temperature created
model.mlops.seldon.io/get-wind-speed created
model.mlops.seldon.io/get-weather created
model.mlops.seldon.io/default-tail created
model.mlops.seldon.io/default-llm created
model.mlops.seldon.io/history-1 created
model.mlops.seldon.io/history-2 created
model.mlops.seldon.io/history-3 created
model.mlops.seldon.io/history-4 created
model.mlops.seldon.io/history-5 created
model.mlops.seldon.io/history-6 created
model.mlops.seldon.io/identity-loop created
model.mlops.seldon.io/default-llm condition met
model.mlops.seldon.io/default-tail condition met
model.mlops.seldon.io/get-temperature condition met
model.mlops.seldon.io/get-weather condition met
model.mlops.seldon.io/get-wind-speed condition met
model.mlops.seldon.io/history-1 condition met
model.mlops.seldon.io/history-2 condition met
model.mlops.seldon.io/history-3 condition met
model.mlops.seldon.io/history-4 condition met
model.mlops.seldon.io/history-5 condition met
model.mlops.seldon.io/history-6 condition met
model.mlops.seldon.io/identity-loop condition met

At this point, we have all the models up and running and we are ready to define the computation pipeline. The manifest file for the pipeline is the following:

!cat manifest/pipeline.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: planning-pipeline
spec:
  allowCycles: true
  maxStepRevisits: 3
  steps:
  - name: history-1
    inputs:
    - planning-pipeline.inputs.memory_id
    - planning-pipeline.inputs.role
    - planning-pipeline.inputs.content
    - planning-pipeline.inputs.type
    - planning-pipeline.inputs.tool_calls
    - planning-pipeline.inputs.tools
  - name: default-llm
    inputsJoinType: any
    inputs:
    - history-1.outputs
    - identity-loop.outputs
  # This step is added to the pipeline to demonstrate the use of the `tail` tool
  - name: history-2
    joinWindowMs: 1
    inputs:
    - default-llm.outputs.memory_id
    - default-llm.outputs.role
    - default-llm.outputs.content
    - default-llm.outputs.type
    - default-llm.outputs.tool_calls
    - default-llm.outputs.tools
    triggers:
    - default-llm.outputs.default
  - name: default-tail
    joinWindowMs: 1
    inputs:
    - history-2.outputs
  # This step is added to the pipeline to demonstrate the use of the `get_temperature` tool
  - name: history-3
    joinWindowMs: 1
    inputs:
    - default-llm.outputs.memory_id
    - default-llm.outputs.role
    - default-llm.outputs.content
    - default-llm.outputs.type
    - default-llm.outputs.tool_calls
    - default-llm.outputs.tools
    triggers:
    - default-llm.outputs.get_temperature
  - name: get-temperature
    joinWindowMs: 1
    inputs:
    - history-3.outputs
  # This step is added to the pipeline to demonstrate the use of the `get_wind_speed` tool
  - name: history-4
    joinWindowMs: 1
    inputs:
    - default-llm.outputs.memory_id
    - default-llm.outputs.role
    - default-llm.outputs.content
    - default-llm.outputs.type
    - default-llm.outputs.tool_calls
    - default-llm.outputs.tools
    triggers:
    - default-llm.outputs.get_wind_speed
  - name: get-wind-speed
    joinWindowMs: 1
    inputs:
    - history-4.outputs
  # This step is added to the pipeline to demonstrate the use of the `get_weather` tool
  - name: history-5
    joinWindowMs: 1
    inputs:
    - default-llm.outputs.memory_id
    - default-llm.outputs.role
    - default-llm.outputs.content
    - default-llm.outputs.type
    - default-llm.outputs.tool_calls
    - default-llm.outputs.tools
    triggers:
    - default-llm.outputs.get_weather
  - name: get-weather
    joinWindowMs: 1
    inputs:
    - history-5.outputs
  # Gathering all the outputs from the tool calls
  - name: history-6
    inputsJoinType: any
    inputs:
    - get-temperature.outputs.memory_id
    - get-temperature.outputs.role
    - get-temperature.outputs.content
    - get-temperature.outputs.type
    - get-temperature.outputs.tool_call_id
    - get-temperature.outputs.tool_calls
    - get-temperature.outputs.tools

    - get-wind-speed.outputs.memory_id
    - get-wind-speed.outputs.role
    - get-wind-speed.outputs.content
    - get-wind-speed.outputs.type
    - get-wind-speed.outputs.tool_call_id
    - get-wind-speed.outputs.tool_calls
    - get-wind-speed.outputs.tools

    - get-weather.outputs.memory_id
    - get-weather.outputs.role
    - get-weather.outputs.content
    - get-weather.outputs.type
    - get-weather.outputs.tool_call_id
    - get-weather.outputs.tool_calls
    - get-weather.outputs.tools
  - name: identity-loop
    joinWindowMs: 1
    inputs:
    - history-6.outputs
  # This is where the output is defined
  output:
    steps:
    - default-tail

Note that we set allowCycles to true and the maxStepRevisits to 3. The allowCycles flag is required to enable cyclic pipelines in Seldon Core 2, and the maxStepRevisits defines how many times the pipeline can revisit the same step before it stops. In our case, we set it to 3, because we expect the pipeline to revisit the get_temperature, get_wind_speed, and get_weather steps at most once each.

We can now deploy the pipeline by running the following commands:

!kubectl apply -f manifest/pipeline.yaml -n seldon
!kubectl wait --for condition=ready --timeout=300s pipelines --all -n seldon

pipeline.mlops.seldon.io/planning-pipeline created
pipeline.mlops.seldon.io/planning-pipeline condition met

Before sending the requests to the pipeline, we define a helper function to extract the IP address of the seldon-mesh service. This will help us define the endpoint to which we want to send the requests.

import requests
import subprocess

def get_mesh_ip():
    cmd = f"kubectl get svc seldon-mesh -n seldon -o jsonpath='{{.status.loadBalancer.ingress[0].ip}}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')

To inform the LLM about the functions we want to use (i.e., get_temperature, get_wind_speed, and get_weather), we define the tools object, which contains all the metadata needed by the LLM (e.g., name, description, arguments, etc.) to be able to provide the appropriate arguments to those functions. See OpenAI tutorial on function calling for more details.

tools = [
    json.dumps({
        "type": "function",
        "function": {
            "name": "get_temperature",
            "description": "Get current temperature for provided coordinates.",
            "parameters": {
                "type": "object",
                "properties": {
                    "latitude": {"type": "number"},
                    "longitude": {"type": "number"}
                },
                "required": ["latitude", "longitude"],
                "additionalProperties": False
            },
            "strict": True
        }
    }),
    json.dumps({
        "type": "function",
        "function": {
            "name": "get_wind_speed",
            "description": "Get current wind speed for provided coordinates.",
            "parameters": {
                "type": "object",
                "properties": {
                    "latitude": {"type": "number"},
                    "longitude": {"type": "number"}
                },
                "required": ["latitude", "longitude"],
                "additionalProperties": False
            },
            "strict": True
        }
    }),
    json.dumps({
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather based on the proviced temperature and windspeed.",
            "parameters": {
                "type": "object",
                "properties": {
                    "temperature": {"type": "number"},
                    "windspeed": {"type": "number"}
                },
                "required": ["temperature", "windspeed"],
                "additionalProperties": False
            },
            "strict": True
        }
    }),
]

We are now ready to start interacting with the pipeline. We begin by defining the memory UUID which will uniquely identify a chat history:

from uuid import uuid4
memory_id = str(uuid4())

We also define a helper function to send requests to the pipeline to avoid repetitive code:

def send_request(content: str):
    inference_request = {
        "inputs": [
            {
                "name": "memory_id",
                "shape": [1],
                "datatype": "BYTES",
                "data": [memory_id],
                "parameters": {"content_type": "str"},
            },
            {
                "name": "role",
                "shape": [2],
                "datatype": "BYTES",
                "parameters": {"content_type": "str"},
                "data": ["system", "user"],
            },
            {
                "name": "content",
                "shape": [2],
                "datatype": "BYTES",
                "parameters": {"content_type": "str"},
                "data": [
                    json.dumps([
                        "You are a helpful assistant who has access to multiple tools to solve a task. "
                        "A task can be solved by using one or more tools. You can use the tools to get "
                        "information needed to provide the final answer. "
                    ]),
                    json.dumps([
                        content
                    ])
                ]
            },
            {
                "name": "type",
                "shape": [2],
                "datatype": "BYTES",
                "parameters": {"content_type": "str"},
                "data": [
                    json.dumps(["text"]),
                    json.dumps(["text"])
                ]
            },
            {
                "name": "tools",
                "shape": [3],
                "datatype": "BYTES",
                "parameters": {"content_type": "str"},
                "data": tools
            },
        ]
    }

    endpoint = f"http://{get_mesh_ip()}/v2/pipelines/planning-pipeline/infer"
    return requests.post(endpoint, json=inference_request)

We can now ask the agent to provide details about the weather in a particular location:

response = send_request("How is the weather in Bucharest? Please provide the temperature and wind speed in your answer.")
print(response.json()['outputs'][1]['data'][0])

The current weather in Bucharest is a cool day with a temperature of 18°C and a wind speed of 8.3 km/h.

As we can see, the answer shows that the model can plan the necessary steps to answer our question by fetching the temperature and the wind speed to respond to the weather.

We can check that the tools are called by inspecting the history of the conversation:

def send_memory_request():
    inference_request = {
        "inputs": [
            {
                "name": "memory_id",
                "shape": [1],
                "datatype": "BYTES",
                "data": [memory_id],
                "parameters": {"content_type": "str"},
            },
        ]
    }

    endpoint = f"http://{get_mesh_ip()}/v2/models/history-1/infer"
    return requests.post(endpoint, json=inference_request)

send_memory_request().json()['outputs']

[{'name': 'history',
  'shape': [9, 1],
  'datatype': 'BYTES',
  'parameters': {'content_type': 'str'},
  'data': ['{"role": "system", "content": [{"type": "text", "value": "You are a helpful assistant who has access to multiple tools to solve a task. A task can be solved by using one or more tools. You can use the tools to get information needed to provide the final answer. "}]}',
   '{"role": "user", "content": [{"type": "text", "value": "How is the weather in Bucharest? Please provide the temperature and wind speed in your answer."}]}',
   '{"role": "assistant", "content": [{"type": "text", "value": ""}], "tool_calls": ["{\\"id\\":\\"call_d1Vs6iXM4qW35hDqFIv5g5HR\\",\\"function\\":{\\"arguments\\":\\"{\\\\\\"latitude\\\\\\":44.4268,\\\\\\"longitude\\\\\\":26.1025}\\",\\"name\\":\\"get_temperature\\"},\\"type\\":\\"function\\"}"]}',
   '{"role": "tool", "content": [{"type": "text", "value": "{\\"temperature\\": 18.0}"}], "tool_call_id": "call_d1Vs6iXM4qW35hDqFIv5g5HR"}',
   '{"role": "assistant", "content": [{"type": "text", "value": ""}], "tool_calls": ["{\\"id\\":\\"call_P3skGK6A3az2BZDEGNspQLek\\",\\"function\\":{\\"arguments\\":\\"{\\\\\\"latitude\\\\\\":44.4268,\\\\\\"longitude\\\\\\":26.1025}\\",\\"name\\":\\"get_wind_speed\\"},\\"type\\":\\"function\\"}"]}',
   '{"role": "tool", "content": [{"type": "text", "value": "{\\"wind_speed\\": 8.3}"}], "tool_call_id": "call_P3skGK6A3az2BZDEGNspQLek"}',
   '{"role": "assistant", "content": [{"type": "text", "value": ""}], "tool_calls": ["{\\"id\\":\\"call_DTSaVIUaMH5uzRcDvMAMzjL5\\",\\"function\\":{\\"arguments\\":\\"{\\\\\\"temperature\\\\\\":18.0,\\\\\\"windspeed\\\\\\":8.3}\\",\\"name\\":\\"get_weather\\"},\\"type\\":\\"function\\"}"]}',
   '{"role": "tool", "content": [{"type": "text", "value": "cool day with low wind speed."}], "tool_call_id": "call_DTSaVIUaMH5uzRcDvMAMzjL5"}',
   '{"role": "assistant", "content": [{"type": "text", "value": "The current weather in Bucharest is a cool day with a temperature of 18\\u00b0C and a wind speed of 8.3 km/h."}]}']},
 {'name': 'memory_id',
  'shape': [1],
  'datatype': 'BYTES',
  'parameters': {'content_type': 'str'},
  'data': ['a5d9515d-dae4-4970-913f-642d7820c645']}]

In the history of the conversation, we can see exactly what all three tools are called.

One can still ask simpler questions which require calling a single tool or none:

response = send_request("What is the temperature in Bucharest?")
print(response.json()['outputs'][1]['data'][0])

The current temperature in Bucharest is 18.0°C.

response = send_request("What is the capital of Romania?")
print(response.json()['outputs'][1]['data'][0])

The capital of Romania is Bucharest.

In our example, the tools return real time data from the Open-Meteo API, so the answers will change depending on the current weather in the specified location.

To unload the pipeline and the models, run the following commands:

!kubectl delete pipelines --all -n seldon
!kubectl delete models --all -n seldon

pipeline.mlops.seldon.io "planning-pipeline" deleted
model.mlops.seldon.io "default-llm" deleted
model.mlops.seldon.io "default-tail" deleted
model.mlops.seldon.io "get-temperature" deleted
model.mlops.seldon.io "get-weather" deleted
model.mlops.seldon.io "get-wind-speed" deleted
model.mlops.seldon.io "history-1" deleted
model.mlops.seldon.io "history-2" deleted
model.mlops.seldon.io "history-3" deleted
model.mlops.seldon.io "history-4" deleted
model.mlops.seldon.io "history-5" deleted
model.mlops.seldon.io "history-6" deleted
model.mlops.seldon.io "identity-loop" deleted

If the pipeline encounters a failure (e.g., a model inference error), it will halt and return an error message to the user. Since the system prompt is included with every request to the LLM, there's no risk of the chat history becoming inconsistent due to a failure. This is because the memory component retrieves all previous messages up to the last system prompt to provide context to the LLM.

Congrats! You've just deployed an agentic worfklow with the LLM module!

PreviousTool Use NextMonitoring

Last updated 22 days ago

Was this helpful?