Streaming

Seldon Core 2 (SC2) supports streaming via Server-Sent Events (SSE) over HTTP/REST and via gRPC. Streaming is particularly useful for LLM-based applications, where response latency can be significant (up to a few seconds, depending on the model size, available hardware, response size, etc.). Generating the response incrementally provides a much better user experience: users can start reading or processing the response as it is generated rather than waiting for the entire response to be ready.

Currently, we support streaming only for single-model deployments, not for pipelines.

To enable streaming support, define the following environment variables in the server configuration file:

- name: MLSERVER_PARALLEL_WORKERS
  value: "0"
- name: MLSERVER_GZIP_ENABLED
  value: "false"

For this tutorial, you will need both the API and Local servers up and running. See the installation steps for further details.

OpenAI

We first demonstrate how to use streaming with the OpenAI runtime. In this example, we deploy a gpt-3.5-turbo model, defined by the following model-settings.json file:

!cat models/openai-chat-completions/model-settings.json
{
  "name": "openai-chat-completions",
  "implementation": "mlserver_llm_api.LLMRuntime",
  "parameters": {
    "extra": {
      "provider_id": "openai",
      "config": {
        "model_id": "gpt-3.5-turbo",
        "model_type": "chat.completions"
      }
    }
  }
}

The corresponding manifest file is the following:
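The manifest is an SC2 Model resource pointing at the artifact folder that holds the model-settings.json above. A minimal sketch, in which the storageUri path and the requirements tag are placeholders to adjust for your storage location and the capability name advertised by your LLM API server:

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: openai-chat-completions
spec:
  storageUri: "gs://my-bucket/models/openai-chat-completions"
  requirements:
  - llm-api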

To deploy the model, run the following commands:
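For example, assuming the manifest above is saved as manifests/openai-chat-completions.yaml (a hypothetical path) and SC2 is installed in the seldon-mesh namespace:

!kubectl apply -f manifests/openai-chat-completions.yaml -n seldon-mesh
!kubectl wait --for condition=ready --timeout=300s model openai-chat-completions -n seldon-mesh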

We define a helper function to get the mesh IP. This function will help us construct the endpoint to reach the model.
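A minimal sketch of such a helper, assuming SC2 runs in the seldon-mesh namespace and the seldon-mesh service is exposed through a LoadBalancer:

import subprocess

def get_mesh_ip(namespace: str = "seldon-mesh") -> str:
    """Return the external IP of the seldon-mesh service."""
    cmd = (
        f"kubectl get svc seldon-mesh -n {namespace} "
        "-o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    )
    return subprocess.check_output(cmd, shell=True).decode().strip()

mesh_ip = get_mesh_ip()
print(mesh_ip)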

As briefly mentioned before, MLServer uses SSE under the hood. Thus, we need to initiate an SSE connection. To do so, we install the httpx-sse package. Run the following command to install it:
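For example, in a notebook cell:

!pip install httpx-sse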

With our model deployed and our dependencies installed, we are ready to send a request to the model.
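A sketch of a streaming request with httpx-sse, reusing mesh_ip from the helper above. It assumes the chat-completions runtime accepts role/content BYTES inputs in the Open Inference (V2) format and returns V2 response chunks; the prompt text is arbitrary:

import json

import httpx
from httpx_sse import connect_sse

inference_request = {
    "inputs": [
        {"name": "role", "shape": [1], "datatype": "BYTES", "data": ["user"]},
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["What is streaming and why is it useful?"],
        },
    ]
}

# timeout=None avoids read timeouts while waiting for the first token
with httpx.Client(timeout=None) as client:
    # infer_stream is the streaming counterpart of the standard V2 infer endpoint
    with connect_sse(
        client,
        "POST",
        f"http://{mesh_ip}/v2/models/openai-chat-completions/infer_stream",
        json=inference_request,
        headers={"Seldon-Model": "openai-chat-completions"},
    ) as event_source:
        for sse in event_source.iter_sse():
            # each SSE event carries a V2 response chunk with the next token(s)
            chunk = json.loads(sse.data)
            print(chunk["outputs"][0]["data"][0], end="", flush=True)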

As you can see, the response is streamed back token by token. Note that we hit the infer_stream endpoint instead of infer (generate_stream and generate are also available and are equivalent to infer_stream and infer, respectively).

To delete the model, run the following command:
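Assuming the same namespace as before:

!kubectl delete model openai-chat-completions -n seldon-mesh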

Local

We now demonstrate how to stream the response for a model deployed with the Local Runtime.

We deploy the microsoft/Phi-3.5-mini-instruct model with the following model-settings.json file:
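The exact contents depend on the Local Runtime's configuration schema, so treat the following as a sketch only: local-phi-3.5 is a name chosen for this walkthrough, the implementation class is a placeholder to be taken from the Local Runtime documentation, and the provider/config keys mirror the OpenAI example above:

{
  "name": "local-phi-3.5",
  "implementation": "<local-runtime-implementation-class>",
  "parameters": {
    "extra": {
      "provider_id": "local",
      "config": {
        "model_id": "microsoft/Phi-3.5-mini-instruct",
        "model_type": "chat.completions"
      }
    }
  }
}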

The associated manifest file is the following:
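Again a sketch, using the same hypothetical local-phi-3.5 name; the storageUri path and the requirements tag are placeholders to adjust for your storage location and the capability name advertised by the Local server:

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-phi-3.5
spec:
  storageUri: "gs://my-bucket/models/local-phi-3.5"
  requirements:
  - llm-local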

To deploy the model, run the following command:
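For example, assuming the manifest is saved as manifests/local-phi-3.5.yaml (a hypothetical path); the longer timeout leaves room for the model weights to be downloaded and loaded:

!kubectl apply -f manifests/local-phi-3.5.yaml -n seldon-mesh
!kubectl wait --for condition=ready --timeout=600s model local-phi-3.5 -n seldon-mesh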

With the model deployed, we can now send a request to the model using the following code:
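A sketch analogous to the OpenAI request, reusing the imports and mesh_ip from above and again assuming role/content BYTES inputs and V2 response chunks:

inference_request = {
    "inputs": [
        {"name": "role", "shape": [1], "datatype": "BYTES", "data": ["user"]},
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Tell me a short story about streaming."],
        },
    ]
}

with httpx.Client(timeout=None) as client:
    with connect_sse(
        client,
        "POST",
        f"http://{mesh_ip}/v2/models/local-phi-3.5/infer_stream",
        json=inference_request,
        headers={"Seldon-Model": "local-phi-3.5"},
    ) as event_source:
        for sse in event_source.iter_sse():
            # print each streamed chunk as it arrives
            chunk = json.loads(sse.data)
            print(chunk["outputs"][0]["data"][0], end="", flush=True)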

As before, the response is streamed back token by token. Note that we also hit the infer_stream endpoint instead of infer.

To unload the model, run the following command:
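Assuming the same namespace and model name as above:

!kubectl delete model local-phi-3.5 -n seldon-mesh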

Congrats! You've just used the streaming functionality from SC2!

The gRPC example is coming soon!
