Streaming
Seldon Core 2 (SC2) supports streaming via HTTP/REST Server-Sent Events (SSE) and gRPC. Streaming is particularly useful for LLM-based applications, since response latency can be significant (up to several seconds, depending on model size, available hardware, response length, etc.). Generating the response incrementally provides a much better user experience: users can start reading or processing the response as it is generated rather than waiting for the entire response to be ready.
To enable streaming support, define the following environment variables in the server configuration file:
- name: MLSERVER_PARALLEL_WORKERS
  value: "0"
- name: MLSERVER_GZIP_ENABLED
  value: "false"
For this tutorial you will need the API and Local servers up and running. See the installation steps for further details.
OpenAI
We first demonstrate how to use streaming with the OpenAI runtime. In this example, we deploy a gpt-3.5-turbo model, defined by the following model-settings.json file:
!cat models/openai-chat-completions/model-settings.json
{
  "name": "openai-chat-completions",
  "implementation": "mlserver_llm_api.LLMRuntime",
  "parameters": {
    "extra": {
      "provider_id": "openai",
      "config": {
        "model_id": "gpt-3.5-turbo",
        "model_type": "chat.completions"
      }
    }
  }
}
The corresponding manifest file is the following:
!cat manifests/openai-chat-completions.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: openai-chat-completions
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/api/openai/models/openai-chat-completions"
  requirements:
    - openai
To deploy the model, run the following commands:
!kubectl apply -f manifests/openai-chat-completions.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/openai-chat-completions created
model.mlops.seldon.io/openai-chat-completions condition met
We define a helper function to get the mesh IP. This function will help us construct the endpoint to reach the model.
import subprocess


def get_mesh_ip():
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode("utf-8")
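Note that if the kubectl context does not point at the right cluster, or the seldon-mesh LoadBalancer has not yet been assigned an external IP, the jsonpath query returns an empty string. A quick sanity check before sending requests could look like the following (a minimal sketch; the assertion message is only illustrative):
# Sanity-check the mesh IP before sending requests: an empty string usually
# means the LoadBalancer has no external IP yet.
mesh_ip = get_mesh_ip()
assert mesh_ip, "seldon-mesh has no external IP yet; check `kubectl get svc seldon-mesh -n seldon`"
print(f"Base inference URL: http://{mesh_ip}/v2")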
As briefly mentioned before, MLServer uses SSE under the hood, so we need to initiate an SSE connection. To do so, we install the httpx-sse package. Run the following command to install it:
!pip install httpx-sse -q
With our model deployed and our dependencies installed, we are ready to send a request to the model.
import time
import json
import httpx
from httpx_sse import connect_sse
from mlserver.types import InferenceResponse
from mlserver.codecs import StringCodec

inference_request = {
    "inputs": [
        {
            "name": "role",
            "shape": [2],
            "datatype": "BYTES",
            "data": ["system", "user"]
        },
        {
            "name": "content",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["You are a helpful assistant"]),
                json.dumps(["Does bad weather improve productivity?"]),
            ],
        },
        {
            "name": "type",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["text"]),
                json.dumps(["text"]),
            ]
        }
    ],
}

endpoint = f"http://{get_mesh_ip()}/v2/models/openai-chat-completions/infer_stream"

buffer = []  # collects the streamed tokens
with httpx.Client() as client:
    with connect_sse(client, "POST", endpoint, json=inference_request) as event_source:
        for sse in event_source.iter_sse():
            response = InferenceResponse.model_validate_json(sse.data)
            token = StringCodec.decode_output(response.outputs[1])[0]
            buffer.append(token)
            print(token, end='')
            time.sleep(0.1)  # sleep for a bit for demonstration purposes
Bad weather can have both positive and negative effects on productivity, depending on the individual and the situation. Some people may find bad weather to be a hindrance to productivity, as it can lead to transportation delays, discomfort, and difficulty focusing. On the other hand, bad weather can also provide opportunities for increased productivity, such as staying indoors and focusing on work without distractions from outdoor activities. Ultimately, the impact of bad weather on productivity will vary from person to person and may also depend on the specific tasks being performed.
As you can see, the response is streamed back token by token. Note that we hit the infer_stream endpoint instead of infer (generate_stream and generate are also available and are equivalent to infer_stream and infer, respectively).
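For comparison, the same request can be sent to the blocking infer endpoint, which returns the complete response in a single payload. The following is a minimal sketch that assumes the inference_request and get_mesh_ip defined above, and that the output layout matches the streaming response (the generated content in outputs[1]):
# Non-streaming request against the `infer` endpoint: the full response
# arrives at once instead of token by token.
import httpx
from mlserver.types import InferenceResponse
from mlserver.codecs import StringCodec

endpoint = f"http://{get_mesh_ip()}/v2/models/openai-chat-completions/infer"
response = httpx.post(endpoint, json=inference_request, timeout=60.0)
inference_response = InferenceResponse.model_validate_json(response.text)
print(StringCodec.decode_output(inference_response.outputs[1])[0])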
To delete the model, run the following command:
!kubectl delete -f manifests/openai-chat-completions.yaml -n seldon
model.mlops.seldon.io "openai-chat-completions" deleted
Local
We now demonstrate how to stream the response for a model deployed with the Local Runtime.
We deploy the microsoft/Phi-3.5-mini-instruct model with the following model-settings.json file:
!cat models/local-chat-completions/model-settings.json
{
  "name": "local-chat-completions",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "vllm",
      "config": {
        "model_type": "chat.completions",
        "model_settings": {
          "model": "microsoft/Phi-3.5-mini-instruct",
          "tensor_parallel_size": 4,
          "dtype": "float16",
          "gpu_memory_utilization": 0.8,
          "max_model_len": 1024,
          "default_generate_kwargs": {
            "max_tokens": 256
          }
        }
      }
    }
  }
}
The associated manifest file is the following:
!cat manifests/local-chat-completions.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-chat-completions
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/chat/models/local-chat-completions"
  requirements:
    - llm-local
To deploy the model, run the following commands:
!kubectl apply -f manifests/local-chat-completions.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/local-chat-completions created
model.mlops.seldon.io/local-chat-completions condition met
With the model deployed, we can now send a request to the model using the following code:
import time
import httpx
from httpx_sse import connect_sse
from mlserver.types import InferenceResponse
from mlserver.codecs import StringCodec

inference_request = {
    "inputs": [
        {
            "name": "role",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["user"]
        },
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Does money buy happiness?"],
        },
        {
            "name": "type",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["text"]
        }
    ],
}

endpoint = f"http://{get_mesh_ip()}/v2/models/local-chat-completions/infer_stream"

with httpx.Client() as client:
    with connect_sse(client, "POST", endpoint, json=inference_request) as event_source:
        for sse in event_source.iter_sse():
            response = InferenceResponse.model_validate_json(sse.data)
            print(StringCodec.decode_output(response.outputs[1])[0], end='')
            time.sleep(0.1)  # sleep for a bit for demonstration purposes
The relationship between money and happiness has been extensively studied in psychology, sociology, and economics. However, the answer to whether money can buy happiness is complex and not straightforward. Here are several key points to consider:
1. Basic needs: According to Maslow's hierarchy of needs, people need basic necessities like food, water, shelter, and healthcare to survive. Money can help in acquiring these essentials, potentially reducing stress and providing the foundation for happiness.
2. Financial security: Having sufficient money can provide a sense of security and lessen worries about the future, such as debt or job stability. Moreover, financial security allows individuals to focus on personal growth, family, and hobbies, which can contribute to overall happiness.
3. Pleasure and comfort: Increased disposable income allows people to enjoy their free time, focus on relaxation, leisure activities, and hobbies, which can contribute to feeling happier. For some, the comfort and luxury provided by an abundance of money can enhance one's quality of life.
On the other hand, research also indicates that beyond a certain point, the curve of happiness related
As before, the response is streamed back token by token. Note that we also hit the infer_stream endpoint instead of infer.
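If you prefer an asynchronous client, the httpx-sse package also provides aconnect_sse, which pairs with httpx.AsyncClient. Below is a minimal sketch under the same assumptions as above (the deployed local-chat-completions model, the inference_request, and get_mesh_ip):
import asyncio
import httpx
from httpx_sse import aconnect_sse
from mlserver.types import InferenceResponse
from mlserver.codecs import StringCodec

async def stream_response():
    # Open an SSE connection asynchronously and print tokens as they arrive.
    endpoint = f"http://{get_mesh_ip()}/v2/models/local-chat-completions/infer_stream"
    async with httpx.AsyncClient() as client:
        async with aconnect_sse(client, "POST", endpoint, json=inference_request) as event_source:
            async for sse in event_source.aiter_sse():
                response = InferenceResponse.model_validate_json(sse.data)
                print(StringCodec.decode_output(response.outputs[1])[0], end='')

# In a notebook, run with: await stream_response()
# In a plain script, run with: asyncio.run(stream_response())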
To unload the model, run the following command:
!kubectl delete -f manifests/local-chat-completions.yaml -n seldon
model.mlops.seldon.io "local-chat-completions" deleted
Congrats! You've just used the streaming functionality from SC2!
The gRPC example is coming soon!