Inference
This section describes how to make inference calls against your Seldon models or pipelines.
You can make synchronous inference requests via REST or gRPC or asynchronous requests via Kafka topics. The content of your request should be an inference v2 protocol payload:
REST payloads will generally be in the JSON v2 protocol format.
gRPC and Kafka payloads must be in the Protobuf v2 protocol format.
Synchronous Requests
For making synchronous requests, the process will generally be:
Find the appropriate service endpoint (IP address and port) for accessing the installation of Seldon Core 2.
Determine the appropriate headers/metadata for the request.
Make requests via REST or gRPC.
Find the Seldon Service Endpoint
````{tab} Docker Compose
In the default Docker Compose setup, container ports are accessible from the host machine.
This means you can use `localhost` or `0.0.0.0` as the hostname.
The default port for sending inference requests to the Seldon system is `9000`.
This is controlled by the `ENVOY_DATA_PORT` environment variable for Compose.
Putting this together, you can send inference requests to `0.0.0.0:9000`.
````
````{tab} Kubernetes
In Kubernetes, Seldon creates a single `Service` called `seldon-mesh` in the namespace it is installed into.
By default, this namespace is also called `seldon-mesh`.
If this `Service` is exposed via a load balancer, the external IP address can be found via:
```bash
kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```
If you are not using a `LoadBalancer` for the `seldon-mesh` `Service`, you can still send inference requests.
For development and testing purposes, you can port-forward the `Service` locally with the command below; inference requests can then be sent to `localhost:8080`.
```bash
kubectl port-forward svc/seldon-mesh -n seldon-mesh 8080:80
```
If you are using a service mesh like Istio or Ambassador, you will need to use the IP address of the service mesh ingress and determine the appropriate port.
````
Make Inference Requests
Let us imagine making inference requests to a model called `iris`.
This iris model has a schema that can be set in a `model-settings.json` file for MLServer.
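The exact schema depends on how the model was trained and packaged. As an illustrative sketch for a scikit-learn iris classifier (the implementation class, tensor names, datatypes, and shapes below are assumptions, not taken from this guide), the file might look like:

```json
{
  "name": "iris",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "inputs": [
    { "name": "predict", "datatype": "FP64", "shape": [-1, 4] }
  ],
  "outputs": [
    { "name": "predict", "datatype": "INT64", "shape": [-1] }
  ]
}
```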
Examples are given below for some common tools for making requests.
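For example, a REST request with curl might look like the sketch below. The endpoint address is the Docker Compose one found above, the tensor name, datatype, and shape follow the illustrative schema, and the `Seldon-Model` header is assumed to be Seldon's routing header, discussed under Request Routing below.

```bash
curl -s http://0.0.0.0:9000/v2/models/iris/infer \
  -H "Content-Type: application/json" \
  -H "Seldon-Model: iris" \
  -d '{"inputs": [{"name": "predict", "datatype": "FP64", "shape": [1, 4], "data": [5.1, 3.5, 1.4, 0.2]}]}'
```

The Seldon CLI provides a shorthand for the same call (for example, `seldon model infer iris '{...}'` with the same JSON payload), and gRPC clients can invoke the `inference.GRPCInferenceService/ModelInfer` method with the Protobuf form of the same payload.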
Request Routing
Seldon Routes
Seldon needs to determine where to route requests, because models and pipelines might share the same name. There are two ways of doing this: header-based routing (preferred) and path-based routing.
Extending the examples from above, requests using header-based routing may look like the following.
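This is a sketch: the pipeline name `mypipeline` is illustrative, and the `Seldon-Model` header with a `.pipeline` suffix is assumed to be how Seldon distinguishes a pipeline from a model of the same name.

```bash
# Route to the model named "iris"
curl -s http://0.0.0.0:9000/v2/models/iris/infer \
  -H "Content-Type: application/json" \
  -H "Seldon-Model: iris" \
  -d '{"inputs": [{"name": "predict", "datatype": "FP64", "shape": [1, 4], "data": [5.1, 3.5, 1.4, 0.2]}]}'

# Route to the pipeline named "mypipeline" (note the ".pipeline" suffix)
curl -s http://0.0.0.0:9000/v2/models/mypipeline/infer \
  -H "Content-Type: application/json" \
  -H "Seldon-Model: mypipeline.pipeline" \
  -d '{"inputs": [{"name": "predict", "datatype": "FP64", "shape": [1, 4], "data": [5.1, 3.5, 1.4, 0.2]}]}'
```

For gRPC, the same information is sent as request metadata (a `seldon-model` key) rather than an HTTP header.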
Ingress Routes
If you are using an ingress controller to make inference requests with Seldon, you will need to configure the routing rules correctly.
There are many ways to do this, but custom path prefixes will not work with gRPC. This is because gRPC derives the request path from the service and method names in the Protobuf definition. Some gRPC implementations permit manipulating paths when sending requests, but this is by no means universal.
If you want to expose your inference endpoints via gRPC and REST in a consistent way, you should use virtual hosts, subdomains, or headers.
The downside of using only paths is that you cannot differentiate between different installations of Seldon Core 2, or between traffic to Seldon and traffic to any other inference endpoints exposed via the same ingress.
You might want to use a mixture of these methods; the choice is yours.
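For instance, if your ingress exposed Seldon Core 2 on a dedicated virtual host, a request through the ingress might look like the sketch below; the hostname `seldon.example.org` and the address placeholders are hypothetical, and the exact routing rules depend on your ingress controller.

```bash
# INGRESS_IP and INGRESS_PORT are placeholders for your ingress controller's external address
curl -s http://$INGRESS_IP:$INGRESS_PORT/v2/models/iris/infer \
  -H "Host: seldon.example.org" \
  -H "Content-Type: application/json" \
  -H "Seldon-Model: iris" \
  -d '{"inputs": [{"name": "predict", "datatype": "FP64", "shape": [1, 4], "data": [5.1, 3.5, 1.4, 0.2]}]}'
```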
Asynchronous Requests
The Seldon architecture uses Kafka, so asynchronous requests can be made by pushing inference v2 protocol payloads to the appropriate topic. Topic names have the following form:
`seldon.<namespace>.<model|pipeline>.<name>.<inputs|outputs>`
Model Inference
For a local install, if you have a model named `iris`, you can send a prediction request by pushing to the topic `seldon.default.model.iris.inputs`.
The response will appear on `seldon.default.model.iris.outputs`.
For a Kubernetes install in the `seldon-mesh` namespace, if you have a model named `iris`, you can send a prediction request by pushing to the topic `seldon.seldon-mesh.model.iris.inputs`.
The response will appear on `seldon.seldon-mesh.model.iris.outputs`.
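As a quick check that responses are flowing, you can consume the output topic with a tool such as kcat. The broker address below is an assumption for a local install, and remember that the messages are binary v2 Protobuf payloads, so they will not be human-readable JSON.

```bash
# Consume responses for the iris model from the latest offset (broker address assumed)
kcat -b localhost:9092 -t seldon.default.model.iris.outputs -C -o end
```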
Pipeline Inference
For a local install, if you have a pipeline named `mypipeline`, you can send a prediction request by pushing to the topic `seldon.default.pipeline.mypipeline.inputs`. The response will appear on `seldon.default.pipeline.mypipeline.outputs`.
For a Kubernetes install in the `seldon-mesh` namespace, if you have a pipeline named `mypipeline`, you can send a prediction request by pushing to the topic `seldon.seldon-mesh.pipeline.mypipeline.inputs`. The response will appear on `seldon.seldon-mesh.pipeline.mypipeline.outputs`.
Pipeline Metadata
It may be useful to send metadata alongside your inference.
If you are using Kafka directly, as described above, you can attach Kafka metadata to your request and it will be passed around the graph. You can also pass metadata when making synchronous REST or gRPC requests to your pipeline:
For REST requests, add HTTP headers prefixed with `X-`.
For gRPC requests, add metadata with keys starting with `X-`.
You can also do this with the Seldon CLI by setting headers with the `--header` argument (and show response headers with the `--show-headers` argument).
For pipeline inference, the response also contains an `x-pipeline-version` header, indicating which version of the pipeline handled the inference.
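For example, with curl you can pass a custom header and print the response headers with `-i`; the pipeline name and the `X-experiment` header are illustrative, and the routing header follows the convention shown earlier.

```bash
curl -si http://0.0.0.0:9000/v2/models/mypipeline/infer \
  -H "Content-Type: application/json" \
  -H "Seldon-Model: mypipeline.pipeline" \
  -H "X-experiment: run-42" \
  -d '{"inputs": [{"name": "predict", "datatype": "FP64", "shape": [1, 4], "data": [5.1, 3.5, 1.4, 0.2]}]}'
```

The response headers should include `x-pipeline-version` and `x-request-id` (discussed below).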
Request IDs
For both model and pipeline requests, the response will contain an `x-request-id` response header. For pipeline requests this can be used to inspect the pipeline steps via the CLI, e.g.:
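A sketch, assuming a pipeline named `mypipeline`; you can match the `x-request-id` from your response against the messages shown, and the `--offset` flag is described below.

```bash
seldon pipeline inspect mypipeline --offset 5
```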
The `--offset` parameter specifies how many messages (counting back from the latest) to search for your request. If not specified, the last request is shown.
`x-request-id` will also appear in tracing spans.
If `x-request-id` is passed in by the caller, it will be used; it is the caller's responsibility to ensure it is unique.
The IDs generated by Seldon are XIDs.