Batch Processing with Argo Workflows and HDFS

In this notebook we will dive into how you can run batch processing with Argo Workflows, HDFS, and Seldon Core.

Dependencies:

  • Seldon Core installed as per the docs with an ingress

  • An HDFS namenode/datanode accessible from your cluster (here an in-cluster installation is used for the demo)

  • Argo Workflows installed in the cluster (and the argo CLI for running commands)

  • The Python hdfscli library for interacting with the installed HDFS instance

Setup

Install Seldon Core

Use the notebook to set up Seldon Core with Ambassador or Istio ingress.

Note: if running with KIND you need to make sure you follow these steps as a workaround to the known /.../docker.sock issue:

kubectl patch -n argo configmap workflow-controller-configmap \
    --type merge -p '{"data": {"config": "containerRuntimeExecutor: k8sapi"}}'

Install HDFS

For this example we will need running HDFS storage. We can use the Helm charts from Gradiant.

helm repo add gradiant https://gradiant.github.io/charts/
kubectl create namespace hdfs-system || echo "namespace hdfs-system already exists"
helm install hdfs gradiant/hdfs --namespace hdfs-system

Once installation is complete, run a port-forward command in a separate terminal so that we can push/pull batch data.
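
For example, assuming the Gradiant chart exposes an HttpFS service named hdfs-httpfs on port 14000 (check kubectl get svc -n hdfs-system for the actual service names in your installation):

kubectl port-forward -n hdfs-system svc/hdfs-httpfs 14000:14000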

Install and configure hdfscli

In this example we will be using the hdfscli Python library for interacting with HDFS. It supports the WebHDFS (and HttpFS) API as well as Kerberos authentication (the latter is not covered in this example).

You can install it with:
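
The library is published on PyPI under the package name hdfs (the CLI entry point it provides is called hdfscli):

pip install hdfs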

To be able to put input-data.txt for our batch job into HDFS we need to configure the client:
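
A minimal ~/.hdfscli.cfg sketch, assuming the port-forward from the previous step and the hdfs user (the alias name dev is arbitrary):

[global]
default.alias = dev

[dev.alias]
url = http://localhost:14000
user = hdfs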

Install Argo Workflows

You can follow the instructions from the official Argo Workflows Documentation.
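
For reference, a typical install looks like the following; the release version is only an example, so pick a current one from the Argo Workflows releases page:

kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.4.4/install.yaml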

You also need to make sure that Argo has permissions to create Seldon deployments. For this you can create a Role, a ServiceAccount, and a RoleBinding, as sketched below.
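
A sketch of these resources, assuming the workflow runs in the default namespace and uses a ServiceAccount named workflow (adjust names and namespace to your setup):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workflow-seldon-role
  namespace: default
rules:
- apiGroups: ["machinelearning.seldon.io"]
  resources: ["seldondeployments"]
  verbs: ["get", "list", "watch", "create", "patch", "delete"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: workflow
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: workflow-seldon-rolebinding
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: workflow-seldon-role
subjects:
- kind: ServiceAccount
  name: workflow
  namespace: default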

Create Seldon Deployment

For the purpose of this batch example we will assume that the Seldon Deployment is created independently of the workflow logic.
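
As an illustration, a minimal SeldonDeployment using the prepackaged SKLearn server with the public iris model could look like this (the name, namespace, and model URI are only examples):

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: sklearn
  namespace: default
spec:
  name: sklearn
  predictors:
  - name: default
    replicas: 1
    graph:
      name: classifier
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/sklearn/iris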

Create Input Data
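
A minimal sketch: we generate a file of Seldon ndarray payloads (four features per row, matching the iris example above) and push it to HDFS with the hdfscli Python client. The endpoint URL, user, and /batch-data path are assumptions consistent with the earlier port-forward:

import json
import random

from hdfs import InsecureClient  # Python client from the hdfscli (hdfs) library

# Generate a small batch of Seldon ndarray payloads, one JSON document per line.
with open("input-data.txt", "w") as f:
    for _ in range(100):
        payload = {"data": {"ndarray": [[random.random() for _ in range(4)]]}}
        f.write(json.dumps(payload) + "\n")

# Upload the file through the port-forwarded HttpFS/WebHDFS endpoint.
# URL, user, and target path are assumptions - adjust to your installation.
client = InsecureClient("http://localhost:14000", user="hdfs")
client.makedirs("/batch-data")
client.upload("/batch-data/input-data.txt", "input-data.txt", overwrite=True)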

Prepare HDFS config / client image

To connect to HDFS from inside the cluster we will use the same hdfscli tool that we used above to put the data there.

We will configure hdfscli using an hdfscli.cfg file stored inside a Kubernetes secret:
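
A sketch of the in-cluster configuration; the HttpFS service name, port, and hdfs user are assumptions based on the Gradiant chart defaults:

[global]
default.alias = batch

[batch.alias]
url = http://hdfs-httpfs.hdfs-system.svc.cluster.local:14000
user = hdfs

We then store it as a secret (the secret name hdfscli-config is an assumption that the workflow below refers to):

kubectl create secret generic hdfscli-config --from-file=hdfscli.cfg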

For the client image we will use the following minimal Dockerfile:
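
A sketch of what such an image could look like; the exact Dockerfile behind the published image may differ:

FROM python:3.8-slim

# Install the hdfscli client library and CLI
RUN pip install --no-cache-dir hdfs

# The workflow mounts hdfscli.cfg from the secret at this path (an assumption)
ENV HDFSCLI_CONFIG=/etc/hdfs/hdfscli.cfg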

It is built and published as seldonio/hdfscli:1.6.0-dev.

Create Workflow

This simple workflow will consist of three stages (a full manifest sketch follows the list):

  • download-input-data

  • process-batch-inputs

  • upload-output-data
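
A sketch of such a workflow. The images, the workflow ServiceAccount, the hdfscli-config secret, the HDFS paths, the Istio ingress host, and the seldon-batch-processor flags are all assumptions based on the pieces above and on the Seldon batch processor documentation, so verify them against your Seldon Core and Argo versions:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: seldon-batch-hdfs-
spec:
  entrypoint: seldon-batch-process
  serviceAccountName: workflow
  volumes:
  - name: hdfscli-config
    secret:
      secretName: hdfscli-config
  volumeClaimTemplates:
  - metadata:
      name: seldon-job-pvc
    spec:
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 1Gi
  templates:
  - name: seldon-batch-process
    steps:
    - - name: download-input-data
        template: download-input-data
    - - name: process-batch-inputs
        template: process-batch-inputs
    - - name: upload-output-data
        template: upload-output-data
  # Pull the input file from HDFS onto the shared volume
  - name: download-input-data
    container:
      image: seldonio/hdfscli:1.6.0-dev
      command: [hdfscli]
      args: [download, /batch-data/input-data.txt, /assets/input-data.txt]
      env:
      - name: HDFSCLI_CONFIG
        value: /etc/hdfs/hdfscli.cfg
      volumeMounts:
      - {name: seldon-job-pvc, mountPath: /assets}
      - {name: hdfscli-config, mountPath: /etc/hdfs}
  # Run the Seldon batch processor against the sklearn deployment through the ingress
  - name: process-batch-inputs
    container:
      image: seldonio/seldon-core-s2i-python37:1.6.0-dev
      command: [seldon-batch-processor]
      args: [--deployment-name, sklearn, --namespace, default,
             --host, istio-ingressgateway.istio-system.svc.cluster.local:80,
             --input-data-path, /assets/input-data.txt,
             --output-data-path, /assets/output-data.txt,
             --workers, "10"]
      volumeMounts:
      - {name: seldon-job-pvc, mountPath: /assets}
  # Push the predictions back to HDFS
  - name: upload-output-data
    container:
      image: seldonio/hdfscli:1.6.0-dev
      command: [hdfscli]
      args: [upload, /assets/output-data.txt, /batch-data/output-data.txt]
      env:
      - name: HDFSCLI_CONFIG
        value: /etc/hdfs/hdfscli.cfg
      volumeMounts:
      - {name: seldon-job-pvc, mountPath: /assets}
      - {name: hdfscli-config, mountPath: /etc/hdfs}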

Pull output-data from HDFS
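
Once the workflow has completed we can fetch the results back through the port-forward. A sketch using the same local assumptions as before (HttpFS endpoint on localhost:14000, hdfs user, /batch-data path):

from hdfs import InsecureClient

# Same local assumptions as before: port-forwarded HttpFS endpoint and the hdfs user
client = InsecureClient("http://localhost:14000", user="hdfs")
client.download("/batch-data/output-data.txt", "output-data.txt", overwrite=True)

# Preview the first few predictions
with open("output-data.txt") as f:
    for line in f.readlines()[:3]:
        print(line.strip())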
