Seldon Core 2 (SC2) supports streaming via HTTP/REST Server-Sent Events (SSE) and gRPC. Streaming is extremely useful for LLM-based applications since response latency can be significant (e.g., up to a few seconds, depending on the model size, available hardware, response size, etc.). Generating the response incrementally provides a much better user experience, as users can start processing or reading the response as it is generated rather than waiting for the entire response to be ready.
Currently, we support streaming only for single-model deployments, not for pipelines.
To enable streaming support, define the following environment variables in the server configuration file:
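A minimal sketch, assuming the standard MLServer settings that affect streaming (parallel workers must be disabled and the gzip middleware turned off; check your MLServer version's documentation for the authoritative list):

MLSERVER_PARALLEL_WORKERS="0"    # streaming is not supported with parallel workers
MLSERVER_GZIP_ENABLED="false"    # gzip buffers the response body, which breaks SSE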
For this tutorial, you will need the API and local servers up and running. See the installation steps for further details.
OpenAI
We first demonstrate how to use streaming with the OpenAI runtime. In this example, we deploy a gpt-3.5-turbo model, defined by the following model-settings.json file:
We define a helper function to get the mesh IP. This function will help us construct the endpoint to reach the model.
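This helper is environment specific; below is a minimal sketch, assuming SC2 exposes a seldon-mesh LoadBalancer service in the seldon-mesh namespace (both names are assumptions to adjust for your installation):

import subprocess

def get_mesh_ip(namespace: str = "seldon-mesh") -> str:
    # Query the external IP assigned to the seldon-mesh LoadBalancer service.
    cmd = [
        "kubectl", "get", "svc", "seldon-mesh",
        "-n", namespace,
        "-o", "jsonpath={.status.loadBalancer.ingress[0].ip}",
    ]
    return subprocess.check_output(cmd).decode().strip()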
As briefly mentioned before, MLServer uses SSE under the hood. Thus, we need to initiate an SSE connection. To do so, we install the httpx-sse package. Run the following command to install it:
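pip install httpx-sse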
With our model deployed and our dependencies installed, we are ready to send a request to the model.
As you can see, the response is streamed back token by token. Note that we hit the infer_stream endpoint instead of infer (generate_stream and generate are also available and are equivalent to infer_stream and infer, respectively).
To delete the model, run the following command:
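For example, assuming the Model resource lives in the seldon-mesh namespace:

kubectl delete model openai-chat-completions -n seldon-mesh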
Local
We now demonstrate how to stream the response for a model deployed with the Local Runtime.
We deploy the microsoft/Phi-3.5-mini-instruct model with the following model-settings.json file:
The associated manifest file is the following:
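The original manifest is not reproduced here; below is a minimal sketch of what an SC2 Model resource for this deployment might look like, where the storageUri value is a placeholder for the storage location holding the model-settings.json above:

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-chat-completions
spec:
  # placeholder: replace with the storage location of the model artifacts
  storageUri: "gs://<your-bucket>/local-chat-completions"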
To deploy the model, run the following command:
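Assuming the manifest above is saved as local-chat-completions.yaml (the filename is an assumption), deployment is an apply followed by a wait, which matches the "created" and "condition met" output shown below:

kubectl apply -f local-chat-completions.yaml
kubectl wait --for condition=ready --timeout=300s model local-chat-completions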
With the model deployed, we can now send a request to the model using the following code:
As before, the response is streamed back token by token. Note that we also hit the infer_stream endpoint instead of infer.
To unload the model, run the following command:
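For example, mirroring the delete command used for the OpenAI model (the namespace is an assumption):

kubectl delete model local-chat-completions -n seldon-mesh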
Congrats! You've just used the streaming functionality from SC2!
import time
import json
import httpx
from httpx_sse import connect_sse
from mlserver.types import InferenceResponse
from mlserver.codecs import StringCodec
inference_request = {
    "inputs": [
        {
            "name": "role",
            "shape": [2],
            "datatype": "BYTES",
            "data": ["system", "user"]
        },
        {
            "name": "content",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["You are a helpful assistant"]),
                json.dumps(["Does bad weather improve productivity?"]),
            ],
        },
        {
            "name": "type",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["text"]),
                json.dumps(["text"]),
            ]
        }
    ],
}
endpoint = f"http://{get_mesh_ip()}/v2/models/openai-chat-completions/infer_stream"
with httpx.Client() as client:
    with connect_sse(client, "POST", endpoint, json=inference_request) as event_source:
        for sse in event_source.iter_sse():
            response = InferenceResponse.model_validate_json(sse.data)
            print(StringCodec.decode_output(response.outputs[1])[0], end='')
            time.sleep(0.1)  # sleep for a bit for demonstration purposes
Bad weather can have both positive and negative effects on productivity, depending on the individual and the situation. Some people may find bad weather to be a hindrance to productivity, as it can lead to transportation delays, discomfort, and difficulty focusing. On the other hand, bad weather can also provide opportunities for increased productivity, such as staying indoors and focusing on work without distractions from outdoor activities. Ultimately, the impact of bad weather on productivity will vary from person to person and may also depend on the specific tasks being performed.
model.mlops.seldon.io/local-chat-completions created
model.mlops.seldon.io/local-chat-completions condition met
import time
import httpx
from httpx_sse import connect_sse
from mlserver.types import InferenceResponse
from mlserver.codecs import StringCodec
inference_request = {
    "inputs": [
        {
            "name": "role",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["user"]
        },
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Does money buy happiness?"],
        },
        {
            "name": "type",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["text"]
        }
    ],
}
endpoint = f"http://{get_mesh_ip()}/v2/models/local-chat-completions/infer_stream"
with httpx.Client() as client:
    with connect_sse(client, "POST", endpoint, json=inference_request) as event_source:
        for sse in event_source.iter_sse():
            response = InferenceResponse.model_validate_json(sse.data)
            print(StringCodec.decode_output(response.outputs[1])[0], end='')
            time.sleep(0.1)  # sleep for a bit for demonstration purposes
The relationship between money and happiness has been extensively studied in psychology, sociology, and economics. However, the answer to whether money can buy happiness is complex and not straightforward. Here are several key points to consider:
1. Basic needs: According to Maslow's hierarchy of needs, people need basic necessities like food, water, shelter, and healthcare to survive. Money can help in acquiring these essentials, potentially reducing stress and providing the foundation for happiness.
2. Financial security: Having sufficient money can provide a sense of security and lessen worries about the future, such as debt or job stability. Moreover, financial security allows individuals to focus on personal growth, family, and hobbies, which can contribute to overall happiness.
3. Pleasure and comfort: Increased disposable income allows people to enjoy their free time, focus on relaxation, leisure activities, and hobbies, which can contribute to feeling happier. For some, the comfort and luxury provided by an abundance of money can enhance one's quality of life.
On the other hand, research also indicates that beyond a certain point, the curve of happiness related