Python Serialization Cost Benchmark

Prerequisites

  • An authenticated Kubernetes cluster with Istio and Seldon Core installed

    • You can use the Ansible seldon-core playbook at https://github.com/SeldonIO/ansible-k8s-collection

  • The vegeta and ghz benchmarking tools

Port-forward to the Istio ingress gateway

kubectl port-forward $(kubectl get pods -l istio=ingressgateway -n istio-system -o jsonpath='{.items[0].metadata.name}') -n istio-system 8003:8080

Tests

  • Large Batch Size

    • predict method with:

      • REST

        • ndarray

        • tensor

        • tftensor

      • gRPC

        • ndarray

        • tensor

        • tftensor

    • predict_raw method with:

      • REST

        • ndarray

        • tensor

        • tftensor

      • gRPC

        • ndarray

        • tensor

        • tftensor

  • Small Batch Size

    • predict method with:

      • REST

        • ndarray

        • tensor

        • tftensor

      • gRPC

        • ndarray

        • tensor

        • tftensor

TLDR

  • gRPC is faster than REST

  • tftensor is best for large batch size

  • ndarray with gRPC is bad for large batch size

  • The simpler tensor/ndarray payloads are better for small batch sizes

from IPython.core.magic import register_line_cell_magic


@register_line_cell_magic
def writetemplate(line, cell):
    with open(line, "w") as f:
        f.write(cell.format(**globals()))
VERSION = !cat ../../../version.txt
VERSION = VERSION[0]
VERSION
'1.10.0-dev'
!kubectl create namespace seldon
Error from server (AlreadyExists): namespaces "seldon" already exists
!helm upgrade --install seldon-core seldon-core-operator --repo https://storage.googleapis.com/seldon-charts --version 1.9.0 --namespace seldon-system --set istio.enabled="true" --set istio.gateway="seldon-gateway.istio-system.svc.cluster.local"
Release "seldon-core" has been upgraded. Happy Helming!
NAME: seldon-core
LAST DEPLOYED: Thu Jul  1 14:03:55 2021
NAMESPACE: seldon-system
STATUS: deployed
REVISION: 2
TEST SUITE: None

Test with Predict method on Large Batch Size

The seldontest_predict image exposes a simple predict method that runs a loop with a configurable number of iterations (default 1) to simulate work. The iteration count can be set as a Seldon parameter, but since we want to benchmark the serialization/deserialization cost we keep the amount of work minimal.
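
For context, a minimal sketch of what such a model class might look like, assuming the standard Seldon Python wrapper interface (this is a hypothetical reconstruction, not the actual seldontest_predict source):

# Hypothetical sketch of the benchmark model: a plain class with a predict
# method, as expected by the Seldon Python wrapper.
class Model:
    def __init__(self, loops: int = 1):
        # "loops" is the configurable iteration count; the default of 1
        # keeps compute negligible so serialization cost dominates.
        self.loops = int(loops)

    def predict(self, X, names=None, meta=None):
        result = 0
        for _ in range(self.loops):
            result += 1  # trivial simulated work
        return [result]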

%%writetemplate model.yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: seldon-model
  namespace: seldon
spec:
  predictors:
  - annotations:
      seldon.io/no-engine: "true"
    componentSpecs:
    - spec:
        containers:
        - image: seldonio/seldontest_predict:{VERSION}
          imagePullPolicy: IfNotPresent
          name: classifier
          resources:
            requests:
              cpu: 1
            limits:
              cpu: 1
          env:
          - name: GUNICORN_WORKERS
            value: "1"
          - name: GUNICORN_THREADS
            value: "1"
        tolerations:
        - key: model
          operator: Exists
          effect: NoSchedule
    graph:
      children: []
      name: classifier
      type: MODEL
    name: default
    replicas: 1
!kubectl apply -f model.yaml
seldondeployment.machinelearning.seldon.io/seldon-model created
!kubectl wait --for condition=ready --timeout=600s pods --all -n seldon
pod/seldon-model-default-0-classifier-5445bd4ccf-c2vdr condition met

Create payloads and associated vegeta configurations for

  1. ndarray

  2. tensor

  3. tftensor

We will create an array of 100,000 consecutive integers.

import json

sz = 100000
vals = list(range(sz))
valStr = f"{vals}"  # renders as "[0, 1, 2, ...]"
# ndarray payload: a single row of sz values
payload = '{"data": {"ndarray": [' + valStr + "]}}"
with open("data_ndarray.json", "w") as f:
    f.write(payload)
# tensor payload: the same values with an explicit [1, sz] shape
payload_tensor = (
    '{"data":{"tensor":{"shape":[1,' + str(sz) + '],"values":' + valStr + "}}}"
)
with open("data_tensor.json", "w") as f:
    f.write(payload_tensor)
import numpy as np
import tensorflow as tf
from google.protobuf import json_format

# tftensor payload: encode the array as a TensorProto and embed its JSON form
array = np.array(vals)
tftensor = tf.make_tensor_proto(array)
jStrTensor = json_format.MessageToJson(tftensor)
jTensor = json.loads(jStrTensor)
payload_tftensor = (
    '{"data":{"tftensor":' + json.dumps(jTensor, separators=(",", ":")) + "}}"
)
with open("data_tftensor.json", "w") as f:
    f.write(payload_tftensor)
import base64
import json

# vegeta's JSON target format expects the request body base64-encoded
sample_string_bytes = payload_tensor.encode("ascii")
base64_bytes = base64.b64encode(sample_string_bytes)
base64_string = base64_bytes.decode("ascii")
jqPayload = {
    "method": "POST",
    "url": "http://localhost:8003/seldon/seldon/seldon-model/api/v1.0/predictions",
    "body": base64_string,
    "header": {"Content-Type": ["application/json"]},
}
with open("vegeta_tensor.json", "w") as f:
    f.write(json.dumps(jqPayload, separators=(",", ":")))
    f.write("\n")

sample_string_bytes = payload.encode("ascii")
base64_bytes = base64.b64encode(sample_string_bytes)
base64_string = base64_bytes.decode("ascii")
jqPayload = {
    "method": "POST",
    "url": "http://localhost:8003/seldon/seldon/seldon-model/api/v1.0/predictions",
    "body": base64_string,
    "header": {"Content-Type": ["application/json"]},
}
with open("vegeta_ndarray.json", "w") as f:
    f.write(json.dumps(jqPayload, separators=(",", ":")))
    f.write("\n")


sample_string_bytes = payload_tftensor.encode("ascii")
base64_bytes = base64.b64encode(sample_string_bytes)
base64_string = base64_bytes.decode("ascii")
jqPayload = {
    "method": "POST",
    "url": "http://localhost:8003/seldon/seldon/seldon-model/api/v1.0/predictions",
    "body": base64_string,
    "header": {"Content-Type": ["application/json"]},
}
with open("vegeta_tftensor.json", "w") as f:
    f.write(json.dumps(jqPayload, separators=(",", ":")))
    f.write("\n")
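
As a quick sanity check, we can decode a target's base64 body and confirm it round-trips back to the original payload:

# Optional sanity check: the base64 body in the vegeta target should decode
# back to the payload we serialized above.
import base64
import json

with open("vegeta_tensor.json") as f:
    target = json.loads(f.readline())
assert base64.b64decode(target["body"]).decode("ascii") == payload_tensor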

Smoke-test the port-forward to check everything is working

!curl -X POST -H 'Content-Type: application/json' \
   -d '@./data_ndarray.json' \
    http://localhost:8003/seldon/seldon/seldon-model/api/v1.0/predictions
{"data":{"names":[],"ndarray":[1]},"meta":{"requestPath":{"classifier":"seldonio/seldontest_predict:1.10.0-dev"}}}
!curl -X POST -H 'Content-Type: application/json' \
   -d '@./data_tensor.json' \
    http://localhost:8003/seldon/seldon/seldon-model/api/v1.0/predictions
{"data":{"names":[],"tensor":{"shape":[1],"values":[1]}},"meta":{"requestPath":{"classifier":"seldonio/seldontest_predict:1.10.0-dev"}}}
!curl -X POST -H 'Content-Type: application/json' \
   -d '@./data_tftensor.json' \
    http://localhost:8003/seldon/seldon/seldon-model/api/v1.0/predictions
{"data":{"names":[],"tftensor":{"dtype":"DT_INT64","int64Val":["1"],"tensorShape":{"dim":[{"size":"1"}]}}},"meta":{"requestPath":{"classifier":"seldonio/seldontest_predict:1.10.0-dev"}}}

Test REST

  1. ndarray

  2. tensor

  3. tftensor

This can be run locally, as we are interested in the relative differences between payload types rather than highly accurate absolute timings.

%%bash
vegeta attack -format=json -duration=10s -rate=0 -max-workers=1 -targets=vegeta_ndarray.json | 
  vegeta report -type=text
Requests      [total, rate, throughput]         518, 51.76, 51.66
Duration      [total, attack, wait]             10.027s, 10.008s, 19.333ms
Latencies     [min, mean, 50, 90, 95, 99, max]  17.337ms, 19.355ms, 19.136ms, 20.336ms, 21.214ms, 24.886ms, 27.831ms
Bytes In      [total, mean]                     59570, 115.00
Bytes Out     [total, mean]                     356857970, 688915.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:518  
Error Set:
%%bash
vegeta attack -format=json -duration=10s -rate=0 -max-workers=1 -targets=vegeta_tensor.json | 
  vegeta report -type=text
Requests      [total, rate, throughput]         504, 50.35, 50.25
Duration      [total, attack, wait]             10.03s, 10.01s, 19.353ms
Latencies     [min, mean, 50, 90, 95, 99, max]  17.885ms, 19.897ms, 19.616ms, 21.1ms, 22.205ms, 25.498ms, 34.99ms
Bytes In      [total, mean]                     69048, 137.00
Bytes Out     [total, mean]                     347225760, 688940.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:504  
Error Set:
%%bash
vegeta attack -format=json -duration=10s -rate=0 -max-workers=1 -targets=vegeta_tftensor.json | 
  vegeta report -type=text
Requests      [total, rate, throughput]         636, 63.55, 63.45
Duration      [total, attack, wait]             10.023s, 10.008s, 14.782ms
Latencies     [min, mean, 50, 90, 95, 99, max]  13.646ms, 15.756ms, 15.461ms, 17.41ms, 18.729ms, 20.628ms, 23.465ms
Bytes In      [total, mean]                     118932, 187.00
Bytes Out     [total, mean]                     678466356, 1066771.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:636  
Error Set:

Example results

  • ndarray: 19.8ms

  • tensor: 19.7ms

  • tftensor: 16.2ms

Test gRPC

  1. ndarray

  2. tensor

  3. tftensor

%%bash
ghz \
    --insecure \
    --proto ../../../proto/prediction.proto \
    --call seldon.protos.Seldon/Predict \
    --data-file=./data_ndarray.json \
    --qps=0 \
    --cpus=1 \
    --concurrency=1 \
    --duration="10s" \
    --format summary \
    --metadata='{"seldon": "seldon-model", "namespace": "seldon"}' \
    localhost:8003
Summary:
  Count:	24
  Total:	10.13 s
  Slowest:	278.81 ms
  Fastest:	242.25 ms
  Average:	244.06 ms
  Requests/sec:	2.37

Response time histogram:
  242.253 [1]	|∎∎∎∎∎∎∎
  245.909 [2]	|∎∎∎∎∎∎∎∎∎∎∎∎∎
  249.564 [4]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  253.219 [6]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  256.874 [4]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  260.530 [1]	|∎∎∎∎∎∎∎
  264.185 [1]	|∎∎∎∎∎∎∎
  267.840 [2]	|∎∎∎∎∎∎∎∎∎∎∎∎∎
  271.496 [0]	|
  275.151 [1]	|∎∎∎∎∎∎∎
  278.806 [1]	|∎∎∎∎∎∎∎

Latency distribution:
  10 % in 247.44 ms 
  25 % in 249.47 ms 
  50 % in 252.85 ms 
  75 % in 260.70 ms 
  90 % in 272.55 ms 
  95 % in 278.81 ms 
  0 % in 0 ns 

Status code distribution:
  [OK]         23 responses   
  [Canceled]   1 responses    

Error distribution:
  [1]   rpc error: code = Canceled desc = grpc: the client connection is closing   
%%bash
ghz \
    --insecure \
    --proto ../../../proto/prediction.proto \
    --call seldon.protos.Seldon/Predict \
    --data-file=./data_tensor.json \
    --qps=0 \
    --cpus=1 \
    --concurrency=1 \
    --duration="10s" \
    --format summary \
    --metadata='{"seldon": "seldon-model", "namespace": "seldon"}' \
    localhost:8003
Summary:
  Count:	92
  Total:	10.10 s
  Slowest:	21.23 ms
  Fastest:	4.91 ms
  Average:	7.58 ms
  Requests/sec:	9.11

Response time histogram:
  4.906 [1]	|∎
  6.539 [55]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  8.171 [17]	|∎∎∎∎∎∎∎∎∎∎∎∎
  9.804 [4]	|∎∎∎
  11.436 [0]	|
  13.069 [4]	|∎∎∎
  14.701 [3]	|∎∎
  16.334 [3]	|∎∎
  17.966 [0]	|
  19.599 [2]	|∎
  21.232 [2]	|∎

Latency distribution:
  10 % in 5.51 ms 
  25 % in 5.70 ms 
  50 % in 6.14 ms 
  75 % in 7.09 ms 
  90 % in 14.14 ms 
  95 % in 18.77 ms 
  0 % in 0 ns 

Status code distribution:
  [OK]         91 responses   
  [Canceled]   1 responses    

Error distribution:
  [1]   rpc error: code = Canceled desc = grpc: the client connection is closing   
%%bash
ghz \
    --insecure \
    --proto ../../../proto/prediction.proto \
    --call seldon.protos.Seldon/Predict \
    --data-file=./data_tftensor.json \
    --qps=0 \
    --cpus=1 \
    --concurrency=1 \
    --duration="10s" \
    --format summary \
    --metadata='{"seldon": "seldon-model", "namespace": "seldon"}' \
    localhost:8003
Summary:
  Count:	425
  Total:	10.04 s
  Slowest:	16.38 ms
  Fastest:	3.97 ms
  Average:	5.33 ms
  Requests/sec:	42.31

Response time histogram:
  3.970 [1]	|
  5.211 [281]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  6.452 [91]	|∎∎∎∎∎∎∎∎∎∎∎∎∎
  7.692 [25]	|∎∎∎∎
  8.933 [8]	|∎
  10.174 [6]	|∎
  11.415 [7]	|∎
  12.656 [2]	|
  13.896 [1]	|
  15.137 [1]	|
  16.378 [1]	|

Latency distribution:
  10 % in 4.34 ms 
  25 % in 4.54 ms 
  50 % in 4.89 ms 
  75 % in 5.52 ms 
  90 % in 6.79 ms 
  95 % in 8.30 ms 
  99 % in 11.71 ms 

Status code distribution:
  [OK]         424 responses   
  [Canceled]   1 responses     

Error distribution:
  [1]   rpc error: code = Canceled desc = grpc: the client connection is closing   

Example results

  • ndarray: 253ms

  • tensor: 8.4ms

  • tftensor: 5.5ms

Conclusions

  • gRPC is generally faster than REST, except for ndarray, which is much worse and should not be used with gRPC; the ndarray payload is encoded as a generic protobuf ListValue, which is far more expensive to deserialize than the packed numeric values used by tensor and tftensor

  • tftensor is fastest

!kubectl delete -f model.yaml
seldondeployment.machinelearning.seldon.io "seldon-model" deleted

Test Predict Raw
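
Unlike predict, predict_raw receives the request payload as-is, so the Python wrapper skips its usual parsing of the data into numpy arrays. A minimal sketch of such a model follows (hypothetical; the actual seldontest_predict_raw source lives in the Seldon Core repo):

# Hypothetical sketch of a predict_raw model. The wrapper hands over the raw
# request (a dict for REST, a protobuf message for gRPC) without parsing it.
class Model:
    def predict_raw(self, msg):
        # Ignore the payload and return a trivial response, so timings
        # reflect transport cost plus the serialization work now skipped.
        return [1]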

%%writetemplate model.yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: seldon-model
  namespace: seldon
spec:
  predictors:
  - annotations:
      seldon.io/no-engine: "true"
    componentSpecs:
    - spec:
        containers:
        - image: seldonio/seldontest_predict_raw:{VERSION}
          imagePullPolicy: IfNotPresent
          name: classifier
          resources:
            requests:
              cpu: 1
            limits:
              cpu: 1
          env:
          - name: GUNICORN_WORKERS
            value: "1"
          - name: GUNICORN_THREADS
            value: "1"
        tolerations:
        - key: model
          operator: Exists
          effect: NoSchedule
    graph:
      children: []
      name: classifier
      type: MODEL
    name: default
    replicas: 1
!kubectl apply -f model.yaml
seldondeployment.machinelearning.seldon.io/seldon-model created
!kubectl wait --for condition=ready --timeout=600s pods --all -n seldon
pod/seldon-model-default-0-classifier-5dc8fbd597-kk7td condition met

Smoke-test the port-forward to check everything is working

!curl -X POST -H 'Content-Type: application/json' \
   -d '@./data_tftensor.json' \
    http://localhost:8003/seldon/seldon/seldon-model/api/v1.0/predictions
[1]

Test REST

  1. ndarray

  2. tensor

  3. tftensor

This can be run locally, as we are interested in the relative differences between payload types rather than highly accurate absolute timings.

%%bash
vegeta attack -format=json -duration=10s -rate=0 -max-workers=1 -targets=vegeta_ndarray.json | 
  vegeta report -type=text
Requests      [total, rate, throughput]         724, 72.35, 72.25
Duration      [total, attack, wait]             10.021s, 10.007s, 14.458ms
Latencies     [min, mean, 50, 90, 95, 99, max]  12.228ms, 13.838ms, 13.683ms, 14.641ms, 15.489ms, 17.888ms, 22.263ms
Bytes In      [total, mean]                     2896, 4.00
Bytes Out     [total, mean]                     498774460, 688915.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:724  
Error Set:
%%bash
vegeta attack -format=json -duration=10s -rate=0 -max-workers=1 -targets=vegeta_tensor.json | 
  vegeta report -type=text
Requests      [total, rate, throughput]         724, 72.32, 72.22
Duration      [total, attack, wait]             10.025s, 10.011s, 14.307ms
Latencies     [min, mean, 50, 90, 95, 99, max]  12.362ms, 13.844ms, 13.701ms, 14.655ms, 15.493ms, 17.976ms, 18.802ms
Bytes In      [total, mean]                     2896, 4.00
Bytes Out     [total, mean]                     498792560, 688940.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:724  
Error Set:
%%bash
vegeta attack -format=json -duration=10s -rate=0 -max-workers=1 -targets=vegeta_tftensor.json | 
  vegeta report -type=text
Requests      [total, rate, throughput]         901, 90.04, 89.93
Duration      [total, attack, wait]             10.018s, 10.007s, 11.64ms
Latencies     [min, mean, 50, 90, 95, 99, max]  8.955ms, 11.116ms, 10.994ms, 12.099ms, 12.721ms, 15.208ms, 19.918ms
Bytes In      [total, mean]                     3604, 4.00
Bytes Out     [total, mean]                     961160671, 1066771.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:901  
Error Set:

Example results

  • ndarray: 13.3ms

  • tensor: 13.3ms

  • tftensor: 11.1ms

Test gRPC

  1. ndarray

  2. tensor

  3. tftensor

%%bash
ghz \
    --insecure \
    --proto ../../../proto/prediction.proto \
    --call seldon.protos.Seldon/Predict \
    --data-file=./data_ndarray.json \
    --qps=0 \
    --cpus=1 \
    --concurrency=1 \
    --duration="10s" \
    --format summary \
    --metadata='{"seldon": "seldon-model", "namespace": "seldon"}' \
    localhost:8003
Summary:
  Count:	44
  Total:	10.04 s
  Slowest:	69.07 ms
  Fastest:	44.44 ms
  Average:	46.03 ms
  Requests/sec:	4.38

Response time histogram:
  44.440 [1]	|∎
  46.904 [31]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  49.367 [6]	|∎∎∎∎∎∎∎∎
  51.831 [2]	|∎∎∎
  54.294 [2]	|∎∎∎
  56.758 [0]	|
  59.221 [0]	|
  61.684 [0]	|
  64.148 [0]	|
  66.611 [0]	|
  69.075 [1]	|∎

Latency distribution:
  10 % in 45.05 ms 
  25 % in 45.40 ms 
  50 % in 46.30 ms 
  75 % in 47.34 ms 
  90 % in 50.16 ms 
  95 % in 53.38 ms 
  0 % in 0 ns 

Status code distribution:
  [OK]         43 responses   
  [Canceled]   1 responses    

Error distribution:
  [1]   rpc error: code = Canceled desc = grpc: the client connection is closing   
%%bash
ghz \
    --insecure \
    --proto ../../../proto/prediction.proto \
    --call seldon.protos.Seldon/Predict \
    --data-file=./data_tensor.json \
    --qps=0 \
    --cpus=1 \
    --concurrency=1 \
    --duration="10s" \
    --format summary \
    --metadata='{"seldon": "seldon-model", "namespace": "seldon"}' \
    localhost:8003
Summary:
  Count:	92
  Total:	10.10 s
  Slowest:	19.81 ms
  Fastest:	4.93 ms
  Average:	7.91 ms
  Requests/sec:	9.11

Response time histogram:
  4.932 [1]	|∎
  6.419 [53]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  7.907 [12]	|∎∎∎∎∎∎∎∎∎
  9.395 [5]	|∎∎∎∎
  10.882 [4]	|∎∎∎
  12.370 [1]	|∎
  13.858 [3]	|∎∎
  15.346 [3]	|∎∎
  16.833 [2]	|∎∎
  18.321 [3]	|∎∎
  19.809 [4]	|∎∎∎

Latency distribution:
  10 % in 5.21 ms 
  25 % in 5.68 ms 
  50 % in 6.04 ms 
  75 % in 8.27 ms 
  90 % in 15.77 ms 
  95 % in 19.04 ms 
  0 % in 0 ns 

Status code distribution:
  [OK]         91 responses   
  [Canceled]   1 responses    

Error distribution:
  [1]   rpc error: code = Canceled desc = grpc: the client connection is closing   
%%bash
ghz \
    --insecure \
    --proto ../../../proto/prediction.proto \
    --call seldon.protos.Seldon/Predict \
    --data-file=./data_tftensor.json \
    --qps=0 \
    --cpus=1 \
    --concurrency=1 \
    --duration="10s" \
    --format summary \
    --metadata='{"seldon": "seldon-model", "namespace": "seldon"}' \
    localhost:8003
Summary:
  Count:	426
  Total:	10.03 s
  Slowest:	11.74 ms
  Fastest:	3.67 ms
  Average:	5.02 ms
  Requests/sec:	42.48

Response time histogram:
  3.668 [1]	|
  4.475 [174]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  5.282 [141]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  6.089 [43]	|∎∎∎∎∎∎∎∎∎∎
  6.897 [30]	|∎∎∎∎∎∎∎
  7.704 [16]	|∎∎∎∎
  8.511 [6]	|∎
  9.318 [8]	|∎∎
  10.126 [2]	|
  10.933 [1]	|
  11.740 [3]	|∎

Latency distribution:
  10 % in 4.08 ms 
  25 % in 4.27 ms 
  50 % in 4.61 ms 
  75 % in 5.30 ms 
  90 % in 6.62 ms 
  95 % in 7.66 ms 
  99 % in 10.26 ms 

Status code distribution:
  [OK]         425 responses   
  [Canceled]   1 responses     

Error distribution:
  [1]   rpc error: code = Canceled desc = grpc: the client connection is closing   

Example results

ndarray
tensor
tftensor

46ms

7.9ms

5.0ms

Conclusions

  • predict_raw is faster than predict, but you will need to handle the serialization/deserialization yourself (see the sketch below), which may make the two approaches equivalent in practice unless specific techniques can be applied for your use case.
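
For example, a predict_raw model that needs the tftensor data as a numpy array has to decode the protobuf itself. A sketch, assuming the REST path where the message arrives as a parsed JSON dict:

# Sketch of manual deserialization inside predict_raw (REST path assumed:
# msg is the parsed JSON request, e.g. {"data": {"tftensor": {...}}}).
import tensorflow as tf
from google.protobuf import json_format
from tensorflow.core.framework.tensor_pb2 import TensorProto

def decode_tftensor(msg: dict):
    proto = json_format.ParseDict(msg["data"]["tftensor"], TensorProto())
    return tf.make_ndarray(proto)  # back to a numpy array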

Test with Predict method on Small Batch Size

The seldontest_predict image exposes a simple predict method that runs a loop with a configurable number of iterations (default 1) to simulate work. The iteration count can be set as a Seldon parameter, but since we want to benchmark the serialization/deserialization cost we keep the amount of work minimal.

%%writetemplate model.yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: seldon-model
  namespace: seldon
spec:
  predictors:
  - annotations:
      seldon.io/no-engine: "true"
    componentSpecs:
    - spec:
        containers:
        - image: seldonio/seldontest_predict:{VERSION}
          imagePullPolicy: IfNotPresent
          name: classifier
          resources:
            requests:
              cpu: 1
            limits:
              cpu: 1
          env:
          - name: GUNICORN_WORKERS
            value: "1"
          - name: GUNICORN_THREADS
            value: "1"
        tolerations:
        - key: model
          operator: Exists
          effect: NoSchedule
    graph:
      children: []
      name: classifier
      type: MODEL
    name: default
    replicas: 1
!kubectl apply -f model.yaml
seldondeployment.machinelearning.seldon.io/seldon-model configured
!kubectl wait --for condition=ready --timeout=600s pods --all -n seldon
pod/seldon-model-default-0-classifier-5445bd4ccf-bgkcm condition met

Create payloads and associated vegeta configurations for

  1. ndarray

  2. tensor

  3. tftensor

This time we will create an array containing a single integer (sz = 1).

import json

sz = 1
vals = list(range(sz))
valStr = f"{vals}"
payload = '{"data": {"ndarray": [' + valStr + "]}}"
with open("data_ndarray.json", "w") as f:
    f.write(payload)
payload_tensor = (
    '{"data":{"tensor":{"shape":[1,' + str(sz) + '],"values":' + valStr + "}}}"
)
with open("data_tensor.json", "w") as f:
    f.write(payload_tensor)
import numpy as np
import tensorflow as tf
from google.protobuf import json_format

array = np.array(vals)
tftensor = tf.make_tensor_proto(array)
jStrTensor = json_format.MessageToJson(tftensor)
jTensor = json.loads(jStrTensor)
payload_tftensor = (
    '{"data":{"tftensor":' + json.dumps(jTensor, separators=(",", ":")) + "}}"
)
with open("data_tftensor.json", "w") as f:
    f.write(payload_tftensor)
import base64
import json

sample_string_bytes = payload_tensor.encode("ascii")
base64_bytes = base64.b64encode(sample_string_bytes)
base64_string = base64_bytes.decode("ascii")
jqPayload = {
    "method": "POST",
    "url": "http://localhost:8003/seldon/seldon/seldon-model/api/v1.0/predictions",
    "body": base64_string,
    "header": {"Content-Type": ["application/json"]},
}
with open("vegeta_tensor.json", "w") as f:
    f.write(json.dumps(jqPayload, separators=(",", ":")))
    f.write("\n")

sample_string_bytes = payload.encode("ascii")
base64_bytes = base64.b64encode(sample_string_bytes)
base64_string = base64_bytes.decode("ascii")
jqPayload = {
    "method": "POST",
    "url": "http://localhost:8003/seldon/seldon/seldon-model/api/v1.0/predictions",
    "body": base64_string,
    "header": {"Content-Type": ["application/json"]},
}
with open("vegeta_ndarray.json", "w") as f:
    f.write(json.dumps(jqPayload, separators=(",", ":")))
    f.write("\n")


sample_string_bytes = payload_tftensor.encode("ascii")
base64_bytes = base64.b64encode(sample_string_bytes)
base64_string = base64_bytes.decode("ascii")
jqPayload = {
    "method": "POST",
    "url": "http://localhost:8003/seldon/seldon/seldon-model/api/v1.0/predictions",
    "body": base64_string,
    "header": {"Content-Type": ["application/json"]},
}
with open("vegeta_tftensor.json", "w") as f:
    f.write(json.dumps(jqPayload, separators=(",", ":")))
    f.write("\n")

Smoke-test the port-forward to check everything is working

!curl -X POST -H 'Content-Type: application/json' \
   -d '@./data_tensor.json' \
    http://localhost:8003/seldon/seldon/seldon-model/api/v1.0/predictions
{"data":{"names":[],"tensor":{"shape":[1],"values":[1]}},"meta":{"requestPath":{"classifier":"seldonio/seldontest_predict:1.10.0-dev"}}}

Test REST

  1. ndarray

  2. tensor

  3. tftensor

This can be run locally, as we are interested in the relative differences between payload types rather than highly accurate absolute timings.

%%bash
vegeta attack -format=json -duration=10s -rate=0 -max-workers=1 -targets=vegeta_ndarray.json | 
  vegeta report -type=text
Requests      [total, rate, throughput]         5538, 553.80, 553.67
Duration      [total, attack, wait]             10.002s, 10s, 2.364ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.569ms, 1.804ms, 1.739ms, 1.984ms, 2.198ms, 2.861ms, 6.62ms
Bytes In      [total, mean]                     636870, 115.00
Bytes Out     [total, mean]                     155064, 28.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:5538  
Error Set:
%%bash
vegeta attack -format=json -duration=10s -rate=0 -max-workers=1 -targets=vegeta_tensor.json | 
  vegeta report -type=text
Requests      [total, rate, throughput]         5557, 555.65, 555.55
Duration      [total, attack, wait]             10.003s, 10.001s, 1.753ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.578ms, 1.798ms, 1.74ms, 1.925ms, 2.119ms, 2.981ms, 5.968ms
Bytes In      [total, mean]                     761309, 137.00
Bytes Out     [total, mean]                     266736, 48.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:5557  
Error Set:
%%bash
vegeta attack -format=json -duration=10s -rate=0 -max-workers=1 -targets=vegeta_tftensor.json | 
  vegeta report -type=text
Requests      [total, rate, throughput]         4548, 454.75, 454.65
Duration      [total, attack, wait]             10.003s, 10.001s, 2.141ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.937ms, 2.197ms, 2.138ms, 2.351ms, 2.482ms, 3.215ms, 9.424ms
Bytes In      [total, mean]                     850476, 187.00
Bytes Out     [total, mean]                     436608, 96.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:4548  
Error Set:

Example results

  • ndarray: 1.8ms

  • tensor: 1.8ms

  • tftensor: 2.1ms

Test gRPC

  1. ndarray

  2. tensor

  3. tftensor

%%bash
ghz \
    --insecure \
    --proto ../../../proto/prediction.proto \
    --call seldon.protos.Seldon/Predict \
    --data-file=./data_ndarray.json \
    --qps=0 \
    --cpus=1 \
    --concurrency=1 \
    --duration="10s" \
    --format summary \
    --metadata='{"seldon": "seldon-model", "namespace": "seldon"}' \
    localhost:8003
Summary:
  Count:	6506
  Total:	10.01 s
  Slowest:	18.58 ms
  Fastest:	1.26 ms
  Average:	1.46 ms
  Requests/sec:	650.23

Response time histogram:
  1.260 [1]	|
  2.992 [6465]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  4.724 [30]	|
  6.456 [5]	|
  8.187 [2]	|
  9.919 [1]	|
  11.651 [0]	|
  13.382 [0]	|
  15.114 [0]	|
  16.846 [0]	|
  18.578 [1]	|

Latency distribution:
  10 % in 1.33 ms 
  25 % in 1.36 ms 
  50 % in 1.39 ms 
  75 % in 1.45 ms 
  90 % in 1.58 ms 
  95 % in 1.79 ms 
  99 % in 2.50 ms 

Status code distribution:
  [OK]            6505 responses   
  [Unavailable]   1 responses      

Error distribution:
  [1]   rpc error: code = Unavailable desc = transport is closing   
%%bash
ghz \
    --insecure \
    --proto ../../../proto/prediction.proto \
    --call seldon.protos.Seldon/Predict \
    --data-file=./data_tensor.json \
    --qps=0 \
    --cpus=1 \
    --concurrency=1 \
    --duration="10s" \
    --format summary \
    --metadata='{"seldon": "seldon-model", "namespace": "seldon"}' \
    localhost:8003
Summary:
  Count:	6429
  Total:	10.01 s
  Slowest:	16.30 ms
  Fastest:	1.29 ms
  Average:	1.49 ms
  Requests/sec:	642.56

Response time histogram:
  1.287 [1]	|
  2.789 [6375]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  4.290 [36]	|
  5.792 [11]	|
  7.293 [2]	|
  8.795 [1]	|
  10.296 [0]	|
  11.798 [1]	|
  13.299 [0]	|
  14.801 [0]	|
  16.303 [1]	|

Latency distribution:
  10 % in 1.36 ms 
  25 % in 1.38 ms 
  50 % in 1.42 ms 
  75 % in 1.48 ms 
  90 % in 1.60 ms 
  95 % in 1.80 ms 
  99 % in 2.67 ms 

Status code distribution:
  [OK]            6428 responses   
  [Unavailable]   1 responses      

Error distribution:
  [1]   rpc error: code = Unavailable desc = transport is closing   
%%bash
ghz \
    --insecure \
    --proto ../../../proto/prediction.proto \
    --call seldon.protos.Seldon/Predict \
    --data-file=./data_tftensor.json \
    --qps=0 \
    --cpus=1 \
    --concurrency=1 \
    --duration="10s" \
    --format summary \
    --metadata='{"seldon": "seldon-model", "namespace": "seldon"}' \
    localhost:8003
Summary:
  Count:	6066
  Total:	10.01 s
  Slowest:	9.38 ms
  Fastest:	1.39 ms
  Average:	1.57 ms
  Requests/sec:	606.20

Response time histogram:
  1.387 [1]	|
  2.187 [5945]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  2.986 [84]	|∎
  3.785 [20]	|
  4.585 [7]	|
  5.384 [2]	|
  6.183 [4]	|
  6.983 [0]	|
  7.782 [0]	|
  8.582 [1]	|
  9.381 [1]	|

Latency distribution:
  10 % in 1.46 ms 
  25 % in 1.48 ms 
  50 % in 1.52 ms 
  75 % in 1.57 ms 
  90 % in 1.66 ms 
  95 % in 1.81 ms 
  99 % in 2.61 ms 

Status code distribution:
  [OK]            6065 responses   
  [Unavailable]   1 responses      

Error distribution:
  [1]   rpc error: code = Unavailable desc = transport is closing   

Example results

  • ndarray: 1.46ms

  • tensor: 1.49ms

  • tftensor: 1.57ms

Conclusions

  • gRPC is generally faster than REST

  • There is very little difference between payload types, with the simpler tensor/ndarray formats probably being slightly faster

!kubectl delete -f model.yaml
seldondeployment.machinelearning.seldon.io "seldon-model" deleted
