# Streaming

Seldon Core 2 (SC2) supports streaming via HTTP/REST, using Server-Sent Events (SSE), and via gRPC. Streaming is particularly useful for LLM-based applications because response latency can be significant (e.g., up to a few seconds, depending on the model size, available hardware, response size, etc.). Generating the response incrementally provides a much better user experience, as users can start reading or processing the response as it is generated rather than waiting for the entire response to be ready.

{% hint style="info" %}
Currently, we support streaming only for single-model deployments, not for pipelines.
{% endhint %}

To enable streaming support, define the following environment variables in the server configuration file:

```yaml
- name: MLSERVER_PARALLEL_WORKERS
  value: "0"
- name: MLSERVER_GZIP_ENABLED
  value: "false"
```

For this tutorial, you will need the `api` and `local` servers up and running. See the [installation steps](/llm-module/introduction/installation.md) for further details.

## OpenAI

We first demonstrate how to use streaming with the `OpenAI` runtime. In this example, we deploy a `gpt-3.5-turbo` model, defined by the following `model-settings.json` file:

```python
!cat models/openai-chat-completions/model-settings.json
```

```
{
  "name": "openai-chat-completions",
  "implementation": "mlserver_llm_api.LLMRuntime",
  "parameters": {
    "extra": {
      "provider_id": "openai",
      "config": {
        "model_id": "gpt-3.5-turbo",
        "model_type": "chat.completions"
      }
    }
  }
}
```

The corresponding manifest file is the following:

```python
!cat manifests/openai-chat-completions.yaml
```

```
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: openai-chat-completions
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/api/openai/models/openai-chat-completions"
  requirements:
  - openai
```

To deploy the model, run the following commands:

```python
!kubectl apply -f manifests/openai-chat-completions.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
```

```
model.mlops.seldon.io/openai-chat-completions created
model.mlops.seldon.io/openai-chat-completions condition met
```

We define a helper function to get the mesh IP. This function will help us construct the endpoint to reach the model.

```python
import subprocess

def get_mesh_ip():
    """Return the external IP of the `seldon-mesh` LoadBalancer service."""
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode("utf-8").strip()
```

As mentioned above, MLServer uses SSE under the hood, so the client needs to open an SSE connection. We use the `httpx-sse` package for this. Run the following command to install it:

```python
!pip install httpx-sse -q
```
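To make the wire format more concrete, the following is a minimal sketch of how the body of an SSE stream breaks down into events, which is roughly the work `httpx-sse` does for us. The `parse_sse` helper and the sample stream are made up for illustration and only handle `data:` fields and comments; a real client should rely on the library instead.

```python
def parse_sse(raw: str):
    """Yield the data payload of each event in a raw SSE stream body.

    Simplified illustration: only `data:` fields and comments are handled.
    """
    data_lines = []
    for line in raw.splitlines():
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield "\n".join(data_lines)
                data_lines = []
        elif line.startswith(":"):
            # Lines starting with a colon are comments (often keep-alives).
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            data_lines.append(value[1:] if value.startswith(" ") else value)


# Made-up stream body with two events and one comment line.
raw_stream = 'data: {"chunk": "Hello"}\n\n: keep-alive\n\ndata: {"chunk": " world"}\n\n'
events = list(parse_sse(raw_stream))
print(events)
```

Each event carries one payload in its `data` field. In the examples on this page, that payload is a JSON-encoded inference response.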

With our model deployed and our dependencies installed, we are ready to send a request to the model.

```python
import time
import json

import httpx
from httpx_sse import connect_sse
from mlserver.types import InferenceResponse
from mlserver.codecs import StringCodec


inference_request = {
    "inputs": [
        {
            "name": "role", 
            "shape": [2], 
            "datatype": "BYTES", 
            "data": ["system", "user"]
        },
        {
            "name": "content",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["You are a helpful assistant"]), 
                json.dumps(["Does bad weather improve productivity?"]),
            ],
        },
        {
            "name": "type",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["text"]), 
                json.dumps(["text"]),
            ]
        }
    ], 
}

endpoint = f"http://{get_mesh_ip()}/v2/models/openai-chat-completions/infer_stream"

with httpx.Client() as client:
    # Open an SSE connection and print each chunk as soon as it arrives.
    with connect_sse(client, "POST", endpoint, json=inference_request) as event_source:
        for sse in event_source.iter_sse():
            response = InferenceResponse.model_validate_json(sse.data)
            print(StringCodec.decode_output(response.outputs[1])[0], end='')
            time.sleep(0.1)  # sleep for a bit for demonstration purposes
```

```
Bad weather can have both positive and negative effects on productivity, depending on the individual and the situation. Some people may find bad weather to be a hindrance to productivity, as it can lead to transportation delays, discomfort, and difficulty focusing. On the other hand, bad weather can also provide opportunities for increased productivity, such as staying indoors and focusing on work without distractions from outdoor activities. Ultimately, the impact of bad weather on productivity will vary from person to person and may also depend on the specific tasks being performed.
```

As you can see, the response is streamed back token by token. Note that we hit the `infer_stream` endpoint instead of `infer`. The `generate_stream` and `generate` endpoints are also available and are equivalent to `infer_stream` and `infer`, respectively.
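The request payload used above follows a regular pattern: one `role`, one `content`, and one `type` tensor, each with one entry per message. As a rough sketch, it could be assembled from a list of messages with a small helper. Note that `build_chat_request` is a hypothetical function, not part of any Seldon or MLServer API, and it assumes the same encoding as the OpenAI example (JSON-encoded lists for `content` and `type`).

```python
import json


def build_chat_request(messages):
    """Build an inference request from (role, text) pairs.

    Hypothetical helper that mirrors the tensor layout of the OpenAI example.
    """
    roles = [role for role, _ in messages]
    # Assumption: each content entry is a JSON-encoded list of parts.
    contents = [json.dumps([text]) for _, text in messages]
    types = [json.dumps(["text"]) for _ in messages]

    def tensor(name, data):
        # Every tensor is a 1-D BYTES tensor with one entry per message.
        return {"name": name, "shape": [len(data)], "datatype": "BYTES", "data": data}

    return {
        "inputs": [
            tensor("role", roles),
            tensor("content", contents),
            tensor("type", types),
        ]
    }


request = build_chat_request([
    ("system", "You are a helpful assistant"),
    ("user", "Does bad weather improve productivity?"),
])
print(json.dumps(request, indent=2))
```

The resulting dictionary can be passed as the `json` argument of `connect_sse`, in the same way as `inference_request`.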

To delete the model, run the following command:

```python
!kubectl delete -f manifests/openai-chat-completions.yaml -n seldon
```

```
model.mlops.seldon.io "openai-chat-completions" deleted
```

## Local

We now demonstrate how to stream the response for a model deployed with the Local Runtime.

We deploy the `microsoft/Phi-3.5-mini-instruct` model with the following `model-settings.json` file:

```python
!cat models/local-chat-completions/model-settings.json
```

```
{
    "name": "local-chat-completions",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "microsoft/Phi-3.5-mini-instruct",
                    "tensor_parallel_size": 4,
                    "dtype": "float16",
                    "gpu_memory_utilization": 0.8,
                    "max_model_len": 1024,
                    "default_generate_kwargs": {
                        "max_tokens": 256
                    }
                }
            }
        }
    }
}
```

The associated manifest file is the following:

```python
!cat manifests/local-chat-completions.yaml
```

```
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-chat-completions
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/chat/models/local-chat-completions"
  requirements:
  - llm-local
```

To deploy the model, run the following command:

```python
!kubectl apply -f manifests/local-chat-completions.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
```

```
model.mlops.seldon.io/local-chat-completions created
model.mlops.seldon.io/local-chat-completions condition met
```

With the model deployed, we can now send a request to the model using the following code:

```python
import time
import httpx
from httpx_sse import connect_sse
from mlserver.types import InferenceResponse
from mlserver.codecs import StringCodec


inference_request = {
    "inputs": [
        {
            "name": "role", 
            "shape": [1], 
            "datatype": "BYTES",
            "data": ["user"]
        },
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Does money buy happiness?"],
        },
        {
            "name": "type",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["text"]
        }
    ],
}

endpoint = f"http://{get_mesh_ip()}/v2/models/local-chat-completions/infer_stream"
with httpx.Client() as client:
    with connect_sse(client, "POST", endpoint, json=inference_request) as event_source:
        for sse in event_source.iter_sse():
            response = InferenceResponse.model_validate_json(sse.data)
            print(StringCodec.decode_output(response.outputs[1])[0], end='')
            time.sleep(0.1)  # sleep for a bit for demonstration purposes
```

```
 The relationship between money and happiness has been extensively studied in psychology, sociology, and economics. However, the answer to whether money can buy happiness is complex and not straightforward. Here are several key points to consider:

1. Basic needs: According to Maslow's hierarchy of needs, people need basic necessities like food, water, shelter, and healthcare to survive. Money can help in acquiring these essentials, potentially reducing stress and providing the foundation for happiness.

2. Financial security: Having sufficient money can provide a sense of security and lessen worries about the future, such as debt or job stability. Moreover, financial security allows individuals to focus on personal growth, family, and hobbies, which can contribute to overall happiness.

3. Pleasure and comfort: Increased disposable income allows people to enjoy their free time, focus on relaxation, leisure activities, and hobbies, which can contribute to feeling happier. For some, the comfort and luxury provided by an abundance of money can enhance one's quality of life.

On the other hand, research also indicates that beyond a certain point, the curve of happiness related
```

As before, the response is streamed back token by token. Note that we also hit the `infer_stream` endpoint instead of `infer`.
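In practice, you often want the complete text once streaming has finished, in addition to printing chunks as they arrive. The following is a minimal sketch of that pattern. The `collect_text` and `fake_payload` helpers are hypothetical, the payloads are made up (including the output names), and the sketch assumes, as in the examples on this page, that the generated text is the first element of `outputs[1]`.

```python
import json


def collect_text(sse_payloads, output_index=1):
    """Print each streamed chunk as it arrives and return the full text."""
    chunks = []
    for payload in sse_payloads:
        response = json.loads(payload)
        # Assumption: the generated text lives in outputs[output_index].
        chunk = response["outputs"][output_index]["data"][0]
        print(chunk, end="", flush=True)
        chunks.append(chunk)
    return "".join(chunks)


def fake_payload(text):
    """Made-up stand-in for an `sse.data` string; names are illustrative only."""
    return json.dumps({
        "model_name": "local-chat-completions",
        "outputs": [
            {"name": "role", "shape": [1], "datatype": "BYTES", "data": ["assistant"]},
            {"name": "content", "shape": [1], "datatype": "BYTES", "data": [text]},
        ],
    })


full_text = collect_text(fake_payload(t) for t in ["Money ", "can ", "help."])
```

With a live connection, the same function could be fed the real events, for example `collect_text(sse.data for sse in event_source.iter_sse())`.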

To unload the model, run the following command:

```python
!kubectl delete -f manifests/local-chat-completions.yaml -n seldon
```

```
model.mlops.seldon.io "local-chat-completions" deleted
```

Congrats! You've just used the streaming functionality from SC2!

The gRPC example is coming soon!

