Streaming

Seldon Core 2 (SC2) supports streaming over HTTP/REST via Server-Sent Events (SSE) and over gRPC. Streaming is particularly useful for LLM-based applications, where full-response latency can be significant (up to several seconds, depending on model size, available hardware, response length, etc.). Generating the response incrementally provides a much better user experience: users can start reading or processing the response as it is generated, rather than waiting for the entire response to be ready.

Streaming is currently supported only for single-model deployments, not for pipelines.

To enable streaming support, define the following environment variables in the server configuration file:

- name: MLSERVER_PARALLEL_WORKERS
  value: "0"
- name: MLSERVER_GZIP_ENABLED
  value: "false"

For this tutorial, you will need the API and local servers up and running. See the installation steps for further details.

OpenAI

We first demonstrate how to use streaming with the OpenAI runtime. In this example, we deploy a gpt-3.5-turbo model, defined by the following model-settings.json file:

!cat models/openai-chat-completions/model-settings.json
{
  "name": "openai-chat-completions",
  "implementation": "mlserver_llm_api.LLMRuntime",
  "parameters": {
    "extra": {
      "provider_id": "openai",
      "config": {
        "model_id": "gpt-3.5-turbo",
        "model_type": "chat.completions"
      }
    }
  }
}

The corresponding manifest file is the following:

!cat manifests/openai-chat-completions.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: openai-chat-completions
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/api/openai/models/openai-chat-completions"
  requirements:
  - openai

To deploy the model, run the following commands:

!kubectl apply -f manifests/openai-chat-completions.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/openai-chat-completions created
model.mlops.seldon.io/openai-chat-completions condition met

We define a helper function to get the mesh IP, which we will use to construct the endpoint for reaching the model.

import subprocess

def get_mesh_ip():
    # read the external IP of the seldon-mesh LoadBalancer service
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode("utf-8")
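As an optional sanity check, you can verify that a deployed model is reachable before streaming to it. This sketch assumes the standard Open Inference Protocol readiness route is exposed through the mesh:

import httpx

# the V2 inference protocol exposes a per-model readiness route
endpoint = f"http://{get_mesh_ip()}/v2/models/openai-chat-completions/ready"
print(httpx.get(endpoint).status_code)  # 200 means the model is ready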

As briefly mentioned above, MLServer streams responses over SSE, so the client needs to open an SSE connection. We use the httpx-sse package for this; run the following command to install it:

!pip install httpx-sse -q

With our model deployed and our dependencies installed, we are ready to send a request to the model.

import time
import json

import httpx
from httpx_sse import connect_sse
from mlserver.types import InferenceResponse
from mlserver.codecs import StringCodec


inference_request = {
    "inputs": [
        {
            "name": "role", 
            "shape": [2], 
            "datatype": "BYTES", 
            "data": ["system", "user"]
        },
        {
            "name": "content",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["You are a helpful assistant"]), 
                json.dumps(["Does bad weather imporve productivity?"]),
            ],
        },
        {
            "name": "type",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["text"]), 
                json.dumps(["text"]),
            ]
        }
    ], 
}
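The payload packs the chat conversation column-wise: each input carries one field (role, content, type) across all messages, with content and type JSON-encoded as single-element lists. The helper below is our own illustration (build_request is not part of the runtime) and simply reproduces the structure above:

import json

def build_request(messages: list[dict]) -> dict:
    # pack a list of {"role": ..., "content": ...} chat messages into
    # the column-wise V2 payload shown above
    n = len(messages)
    return {
        "inputs": [
            {"name": "role", "shape": [n], "datatype": "BYTES",
             "data": [m["role"] for m in messages]},
            {"name": "content", "shape": [n], "datatype": "BYTES",
             "data": [json.dumps([m["content"]]) for m in messages]},
            {"name": "type", "shape": [n], "datatype": "BYTES",
             "data": [json.dumps(["text"]) for _ in messages]},
        ]
    }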

endpoint = f"http://{get_mesh_ip()}/v2/models/openai-chat-completions/infer_stream"

with httpx.Client() as client:
    with connect_sse(client, "POST", endpoint, json=inference_request) as event_source:
        for sse in event_source.iter_sse():
            response = InferenceResponse.model_validate_json(sse.data)
            print(StringCodec.decode_output(response.outputs[1])[0], end='')
            time.sleep(0.1)  # sleep for a bit for demonstration purposes
Bad weather can have both positive and negative effects on productivity, depending on the individual and the situation. Some people may find bad weather to be a hindrance to productivity, as it can lead to transportation delays, discomfort, and difficulty focusing. On the other hand, bad weather can also provide opportunities for increased productivity, such as staying indoors and focusing on work without distractions from outdoor activities. Ultimately, the impact of bad weather on productivity will vary from person to person and may also depend on the specific tasks being performed.

As you can see, the response is streamed back token by token. Note that we hit the infer_stream endpoint instead of infer (generate_stream and generate are also available and are equivalent to infer_stream and infer, respectively).
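For comparison, the same request can be sent to the non-streaming infer endpoint, in which case the full response arrives as a single payload. The sketch below reuses the inference_request defined above and assumes the output layout matches the streaming responses:

import httpx
from mlserver.types import InferenceResponse
from mlserver.codecs import StringCodec

# blocking variant: POST to /v2/models/<name>/infer and wait for the whole response
endpoint = f"http://{get_mesh_ip()}/v2/models/openai-chat-completions/infer"
raw = httpx.post(endpoint, json=inference_request, timeout=60.0)
response = InferenceResponse.model_validate_json(raw.content)
print(StringCodec.decode_output(response.outputs[1])[0])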

To delete the model, run the following command:

!kubectl delete -f manifests/openai-chat-completions.yaml -n seldon
model.mlops.seldon.io "openai-chat-completions" deleted

Local

We now demonstrate how to stream the response for a model deployed with the Local Runtime.

We deploy the microsoft/Phi-3.5-mini-instruct model with the following model-settings.json file:

!cat models/local-chat-completions/model-settings.json
{
    "name": "local-chat-completions",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "microsoft/Phi-3.5-mini-instruct",
                    "tensor_parallel_size": 4,
                    "dtype": "float16",
                    "gpu_memory_utilization": 0.8,
                    "max_model_len": 1024,
                    "default_generate_kwargs": {
                        "max_tokens": 256
                    }
                }
            }
        }
    }
}

The associated manifest file is the following:

!cat manifests/local-chat-completions.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-chat-completions
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/chat/models/local-chat-completions"
  requirements:
  - llm-local

To deploy the model, run the following commands:

!kubectl apply -f manifests/local-chat-completions.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/local-chat-completions created
model.mlops.seldon.io/local-chat-completions condition met

With the model deployed, we can now send a request to the model using the following code:

import time
import httpx
from httpx_sse import connect_sse
from mlserver.types import InferenceResponse
from mlserver.codecs import StringCodec


inference_request = {
    "inputs": [
        {
            "name": "role", 
            "shape": [1], 
            "datatype": "BYTES",
            "data": ["user"]
        },
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Does money buy happiness?"],
        },
        {
            "name": "type",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["text"]
        }
    ],
}

endpoint = f"http://{get_mesh_ip()}/v2/models/local-chat-completions/infer_stream"
with httpx.Client() as client:
    with connect_sse(client, "POST", endpoint, json=inference_request) as event_source:
        for sse in event_source.iter_sse():
            response = InferenceResponse.model_validate_json(sse.data)
            print(StringCodec.decode_output(response.outputs[1])[0], end='')
            time.sleep(0.1)  # sleep for a bit for demonstration purposes
 The relationship between money and happiness has been extensively studied in psychology, sociology, and economics. However, the answer to whether money can buy happiness is complex and not straightforward. Here are several key points to consider:

1. Basic needs: According to Maslow's hierarchy of needs, people need basic necessities like food, water, shelter, and healthcare to survive. Money can help in acquiring these essentials, potentially reducing stress and providing the foundation for happiness.

2. Financial security: Having sufficient money can provide a sense of security and lessen worries about the future, such as debt or job stability. Moreover, financial security allows individuals to focus on personal growth, family, and hobbies, which can contribute to overall happiness.

3. Pleasure and comfort: Increased disposable income allows people to enjoy their free time, focus on relaxation, leisure activities, and hobbies, which can contribute to feeling happier. For some, the comfort and luxury provided by an abundance of money can enhance one's quality of life.

On the other hand, research also indicates that beyond a certain point, the curve of happiness related

As before, the response is streamed back token by token (it stops mid-sentence because max_tokens is capped at 256 in default_generate_kwargs). Note that we again hit the infer_stream endpoint instead of infer.
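If your application is asynchronous, the same pattern works with httpx.AsyncClient and the aconnect_sse helper from httpx-sse. A minimal sketch, assuming the same inference_request payload as above:

import asyncio

import httpx
from httpx_sse import aconnect_sse
from mlserver.types import InferenceResponse
from mlserver.codecs import StringCodec

async def stream_response(inference_request: dict) -> None:
    endpoint = f"http://{get_mesh_ip()}/v2/models/local-chat-completions/infer_stream"
    async with httpx.AsyncClient() as client:
        # aconnect_sse is the async counterpart of connect_sse
        async with aconnect_sse(client, "POST", endpoint, json=inference_request) as event_source:
            async for sse in event_source.aiter_sse():
                response = InferenceResponse.model_validate_json(sse.data)
                print(StringCodec.decode_output(response.outputs[1])[0], end="")

# in a notebook, where an event loop is already running, use
# `await stream_response(inference_request)` instead
asyncio.run(stream_response(inference_request))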

To unload the model, run the following command:

!kubectl delete -f manifests/local-chat-completions.yaml -n seldon
model.mlops.seldon.io "local-chat-completions" deleted

Congrats! You've just used the streaming functionality of SC2!

The gRPC example is coming soon!
