The `mlserver` package comes with built-in support for streaming data. This allows you to process data in real time, without having to wait for the entire response to be available. It supports both the REST and gRPC APIs.
In this example, we create a simple Identity Text Model that splits the input text into words and returns them one by one. We will use this model to demonstrate how to stream the response from the server to the client. This example is a good starting point for building more complex streaming models, such as those based on Large Language Models (LLMs), where streaming is essential for hiding the latency of the model.
The next step will be to serve our model using `mlserver`. For that, we will first implement an extension that serves as the runtime to perform inference using our custom `TextModel`.
This is a trivial model to demonstrate streaming support. The model simply splits the input text into words and returns them one by one. In this example we do the following:
- split the text into words, using white space as the delimiter.
- wait 0.5 seconds between each word to simulate a slow model.
- return each word one by one.
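The three steps above can be sketched with a plain async generator. This is a minimal illustration, not MLServer's runtime code: `TextRequest` and `TextResponse` are hypothetical stand-ins for `InferenceRequest` and `InferenceResponse`, and the `delay` parameter is added here only so the pause can be shortened when experimenting.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, List


@dataclass
class TextRequest:
    """Hypothetical stand-in for an inference request holding one text."""
    text: str


@dataclass
class TextResponse:
    """Hypothetical stand-in for an inference response holding one word."""
    word: str


async def predict_stream(
    payloads: AsyncIterator[TextRequest], delay: float = 0.5
) -> AsyncIterator[TextResponse]:
    async for payload in payloads:
        # Step 1: split the text on white space.
        for word in payload.text.split():
            # Step 2: pause to simulate a slow model.
            await asyncio.sleep(delay)
            # Step 3: hand back one word at a time.
            yield TextResponse(word=word)


async def single(request: TextRequest) -> AsyncIterator[TextRequest]:
    # Wrap one request so it can be consumed as an async iterator.
    yield request


async def collect(text: str, delay: float = 0.0) -> List[str]:
    stream = predict_stream(single(TextRequest(text)), delay)
    return [response.word async for response in stream]
```

Calling `asyncio.run(collect("hello streaming world"))` gathers the words in order. In a real runtime the same loop would live in a method on the model class and build proper inference responses.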
As can be seen, the `predict_stream` method receives an `AsyncIterator` of `InferenceRequest` as input and returns an `AsyncIterator` of `InferenceResponse`. This definition covers all possible input-output combinations for streaming: unary-stream, stream-unary, and stream-stream. It is up to the client and server to send and receive the appropriate number of requests and responses, which should be known in advance.
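To see why a single iterator-in, iterator-out signature is enough, the sketch below expresses the three combinations as async generators over plain strings. The handler names and their toy behaviour are assumptions made for illustration; none of them come from MLServer.

```python
import asyncio
from typing import AsyncIterator, Callable, List


async def as_stream(items: List[str]) -> AsyncIterator[str]:
    # Turn a list into an async iterator of requests.
    for item in items:
        yield item


async def unary_stream(requests: AsyncIterator[str]) -> AsyncIterator[str]:
    # One request in, many responses out.
    async for text in requests:
        for word in text.split():
            yield word


async def stream_unary(requests: AsyncIterator[str]) -> AsyncIterator[str]:
    # Many requests in, a single response out.
    parts = [text async for text in requests]
    yield " ".join(parts)


async def stream_stream(requests: AsyncIterator[str]) -> AsyncIterator[str]:
    # Many requests in, one response per request out.
    async for text in requests:
        yield text.upper()


async def run(
    handler: Callable[[AsyncIterator[str]], AsyncIterator[str]],
    inputs: List[str],
) -> List[str]:
    return [output async for output in handler(as_stream(inputs))]
```

All three handlers share one signature; only the number of items consumed and produced differs, which is why both sides need to agree on those counts beforehand.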
Note that although unary-unary can also be covered by the `predict_stream` method, `mlserver` already covers that case through the `predict` method.
One important limitation to keep in mind is that, for the REST API, the client cannot send a stream of requests. The client has to send a single request with the entire input text, and the server then streams the response back to the client. The gRPC API, on the other hand, supports all of the streaming types listed above.
The next step will be to create two configuration files:
- `settings.json`: holds the configuration of our server (e.g. ports, log level, etc.).
- `model-settings.json`: holds the configuration of our model (e.g. input type, runtime to use, etc.).
Note that there are currently three main limitations of the streaming support in MLServer:
- distributed workers are not supported (i.e., the `parallel_workers` setting should be set to `0`)
- `gzip` middleware is not supported for REST (i.e., the `gzip_enabled` setting should be set to `false`)
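As a rough sketch of what the two files might contain, the snippet below writes them from Python. The `parallel_workers` and `gzip_enabled` values follow the limitations above; the model name `text-model` and the import path `text_model.TextModel` are assumptions about how the runtime module is named, and `write_configs` is a hypothetical helper.

```python
import json
from pathlib import Path

# Server-level settings: streaming needs workers and gzip turned off.
SERVER_SETTINGS = {
    "debug": False,
    "parallel_workers": 0,
    "gzip_enabled": False,
}

# Model-level settings: name and import path are assumed for this sketch.
MODEL_SETTINGS = {
    "name": "text-model",
    "implementation": "text_model.TextModel",
}


def write_configs(folder: Path) -> None:
    """Write settings.json and model-settings.json into `folder`."""
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "settings.json").write_text(
        json.dumps(SERVER_SETTINGS, indent=2)
    )
    (folder / "model-settings.json").write_text(
        json.dumps(MODEL_SETTINGS, indent=2)
    )
```

Calling `write_configs(Path("."))` would place both files in the current directory, next to the runtime module.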
Now that we have our config in place, we can start the server by running `mlserver start .`. This command needs to be run either from the directory that contains our config files or with the path to that folder as its argument.
Since this command will start the server and block the terminal while waiting for requests, it will need to be run in the background or in a separate terminal.
To test our model, we will use the following inference request:
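A request of this kind could be built as follows. This is a sketch in the shape of a V2 inference protocol payload; the input name `prompt` and the sample text are assumptions, and `build_request` is a hypothetical helper.

```python
import json


def build_request(text: str) -> dict:
    """Build a single-input text request (input name is assumed)."""
    return {
        "inputs": [
            {
                "name": "prompt",
                "shape": [1, 1],
                "datatype": "BYTES",
                "data": [text],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_request("What is the capital of France?"), indent=2))
```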
To send a REST streaming request to the server, we will use the following Python code:
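One way to do this with only the standard library is sketched below. It assumes the server answers with server-sent events, one `data:` line per streamed response, and that the streaming endpoint lives at a URL such as `http://localhost:8080/v2/models/text-model/infer_stream`; both are assumptions to check against your MLServer version. `iter_sse_data` and `stream_infer` are hypothetical helpers.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the payload of every `data:` line in an event stream.

    Simplification: multi-line events are not merged.
    """
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if line.startswith("data:"):
            yield line[len("data:"):].strip()


def stream_infer(url: str, payload: dict) -> Iterator[dict]:
    """POST one request and yield each streamed response as it arrives."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # The response object can be iterated line by line.
        for data in iter_sse_data(response):
            yield json.loads(data)
```

Note that a single request carries the whole input text; only the response side is streamed, matching the REST limitation described earlier.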
To send a gRPC streaming request to the server, we will use the following Python code:
Note that for gRPC, the request is transformed into an async generator, which is then passed to the `ModelStreamInfer` method. The response is also an async generator, which can be iterated over to get the individual responses.
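The pattern described here, wrapping the request in an async generator and iterating over an async response, can be sketched without a running server. In the snippet below `fake_model_stream_infer` is a hypothetical stand-in for the stub's `ModelStreamInfer` call, and plain strings stand in for the gRPC request and response messages.

```python
import asyncio
from typing import AsyncIterator, List


async def request_stream(request: str) -> AsyncIterator[str]:
    # Wrap the single request so it can be sent as a stream.
    yield request


async def fake_model_stream_infer(
    requests: AsyncIterator[str],
) -> AsyncIterator[str]:
    # Stand-in for the streaming call: echoes each request word by word.
    async for request in requests:
        for word in request.split():
            yield word


async def main(text: str) -> List[str]:
    words = []
    # The response is itself an async generator; consume it item by item.
    async for response in fake_model_stream_infer(request_stream(text)):
        words.append(response)
    return words
```

With a real client, the stand-in would be replaced by the streaming method on the gRPC stub, called over an async channel, while the `request_stream` wrapper and the `async for` loop keep the same shape.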