Quickstart

In this tutorial we will show you how to create a Machine Learning pipeline for loan classification. The pipeline is composed of a Large Language Model (LLM) prompted for entity extraction, a custom MLServer model that preprocesses the JSON returned by the LLM, and a Random Forest classifier that outputs the Loan Status (i.e., a binary variable, either "Yes" (1) or "No" (0)).

Our pipeline will look like this:

[Diagram: loan application text → LLM (entity extraction) → preprocessor → Random Forest classifier]

The workflow for the diagram above is the following:

  1. The input to the pipeline is natural text: a loan application in which the applicant provides the information our pipeline uses to determine whether or not to automatically approve the loan.

  2. Then the LLM processes that text, extracting the information we are interested in and returning the relevant fields in JSON format (see the example after this list). Note that the LLM knows what to extract based on the system prompt we provide.

  3. The JSON is then passed to the preprocessor model, which parses it and converts it into a NumPy array for the Random Forest classifier.

  4. Finally, the Random Forest classifier outputs the prediction. We return both the preprocessed JSON and the classification label.
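
To make the data flow concrete, the JSON object the LLM is asked to produce contains the fields below, shown here as a Python dict. The exact keys are fixed by the system prompt defined later in this tutorial; the values are illustrative:

# Hypothetical example of the fields the LLM is asked to extract
# (values are illustrative only):
extracted_fields = {
    "ApplicantIncome": 50000,
    "LoanAmount": 100000,
    "Loan_Amount_Term": 3650,
    "Gender": "Male",
    "Married": "Yes",
    "Dependents": "2",
    "Education": "Graduate",
    "Self_Employed": "No",
    "Property_Area": "Urban",
}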

Now that we have clarified the task, we are ready to start deploying the models and constructing the computational pipeline. For this, you will need an up-and-running MLServer instance (included with the Seldon Core 2 installation) to serve the preprocessor model and the Random Forest classifier, as well as an instance of the Local Runtime to serve the LLM. We will load a mistralai/Mistral-7B-Instruct-v0.2 model on our infrastructure, so make sure you have enough resources to do so.

To run this tutorial we used 4 NVIDIA GeForce RTX 2080 Ti GPUs, but other configurations are also possible.
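
If you want to check how many GPUs are visible on your machine before loading the LLM, a quick sketch is shown below. This assumes PyTorch is installed in your local environment; it is not required by the tutorial itself:

import torch

# Report the GPUs visible to this environment; the LLM below is sharded
# across 4 of them via tensor parallelism.
print("Visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  {i}: {torch.cuda.get_device_name(i)}")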

We begin by defining the model settings for the Random Forest classifier. Note that the classifier is already trained, so you don't need to worry about that step (a sketch of how such a model could be trained is included after the settings file below). The associated model-settings.json file is:

!cat models/loan-model/model-settings.json
{
    "name": "loan-model",
    "implementation": "mlserver_sklearn.SKLearnModel",
    "parameters": {
        "uri": "./loan-predictor-rf.joblib",
        "version": "v0.1.0"
    }
}
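
For completeness, a classifier compatible with this setup could have been produced with a script along the following lines. This is only a sketch: the training data file and the hyperparameters are hypothetical, and the real artifact is the loan-predictor-rf.joblib file already stored in the bucket.

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data: one column per entry in the one-hot feature
# layout used by the preprocessor (see feature_list later in this tutorial),
# plus the binary Loan_Status target.
df = pd.read_csv("loan_training_data.csv")  # hypothetical file
X = df.drop(columns=["Loan_Status"])
y = df["Loan_Status"]

# Illustrative hyperparameters only.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# Serialize the model under the uri referenced in model-settings.json.
joblib.dump(clf, "loan-predictor-rf.joblib")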

The associated manifest file is:

!cat manifests/loan-model.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: loan-model
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/getting-started/quickstart/models/loan-model"
  requirements:
  - sklearn

To deploy the Random Forest classifier, run the following command:

!kubectl apply -f manifests/loan-model.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/loan-model created
model.mlops.seldon.io/loan-model condition met

Now we can move on to the preprocessor model. Since this is a custom model, we first need to provide an implementation for it. Our implementation is the following:

import json
from json import JSONDecodeError

import numpy as np
from mlserver import MLModel
from mlserver.logging import logger
from mlserver.codecs import StringCodec, NumpyCodec
from mlserver.types import InferenceRequest, InferenceResponse


enc_np = lambda k, v: NumpyCodec.encode_output(k, np.array([v]))
enc_str = lambda k, v: StringCodec.encode_output(k, [v])


class PreProcessorModel(MLModel):
    # Fields we expect the LLM to extract.
    features = [
        "ApplicantIncome",
        "LoanAmount",
        "Loan_Amount_Term",
        "Gender",
        "Married",
        "Dependents",
        "Education",
        "Self_Employed",
        "Property_Area",
    ]

    async def load(self) -> bool:
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        logger.info("in preprocessor")
        data = self._unpack_input(payload)
        return self._build_response(data)

    def _unpack_input(self, payload: InferenceRequest):
        # Decode the string inputs, then pull out the JSON object produced by
        # the LLM (everything between the first "{" and the last "}").
        request_data = {}
        for inp in payload.inputs:
            request_data[inp.name] = StringCodec.decode_input(inp)
        try:
            logger.info(request_data)
            data = request_data["content"][0]
            data = data[data.index("{") : data.rindex("}") + 1]
            logger.info(data)
            data = json.loads(data)
        except JSONDecodeError as err:
            # Fall back to an empty dict if the LLM did not return valid JSON.
            logger.info("error decoding json!")
            logger.info(request_data["content"][0])
            logger.info(err)
            data = {}
        logger.info(data)
        return data

    def _build_response(self, state):
        # Replace missing (None) values with the string "none", encode the
        # feature vector under the output name "predict", and also return the
        # raw extracted JSON under the output name "data".
        none_to_string = lambda x: x if x is not None else "none"
        data = {key: none_to_string(value) for key, value in state.items()}
        logger.info(data)
        data = self._preprocess_input(data)
        inference_response = InferenceResponse(
            model_name="preprocessor",
            outputs=[data, StringCodec.encode_output("data", [json.dumps(state)])],
        )
        return inference_response

    def _preprocess_input(self, vals):
        # Copy numeric fields through and one-hot encode categorical fields,
        # following the column order given by feature_list (defined below).
        tensor = np.array([0] * len(feature_list))
        for ind, feature in enumerate(feature_list):
            data = self._get_feature_value(feature, vals)
            tensor[ind] = data
        return NumpyCodec.encode_output("predict", np.array([tensor]))

    def _get_feature_value(self, feature, vals):
        # Entries of the form "Feature-Category" become 1 when the extracted
        # value matches the category; plain entries are passed through as-is.
        feature, *category = feature.split("-")
        if not category:
            value = vals[feature]
        else:
            value = 1 if (category.pop() == vals[feature]) else 0
        return value


# Column layout expected by the Random Forest classifier: numeric features
# first, followed by one-hot indicators of the form "Feature-Category".
feature_list = [
    "ApplicantIncome",
    "LoanAmount",
    "Loan_Amount_Term",
    "Gender-Female",
    "Gender-Male",
    "Married-No",
    "Married-Yes",
    "Dependents-0",
    "Dependents-1",
    "Dependents-2",
    "Dependents-3+",
    "Education-Graduate",
    "Education-Not Graduate",
    "Self_Employed-No",
    "Self_Employed-Yes",
    "Property_Area-Rural",
    "Property_Area-Semiurban",
    "Property_Area-Urban",
]

You don't have to do anything with the code above, since we have already uploaded all the artifacts to the Google Cloud Storage bucket; it is shown here only for clarity.
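
To make the preprocessing step concrete, here is a small standalone sketch (not part of the deployed artifact) of how the feature_list layout turns the extracted fields into the numeric vector consumed by the classifier. The example values are the same illustrative ones used earlier:

import numpy as np

extracted = {
    "ApplicantIncome": 50000, "LoanAmount": 100000, "Loan_Amount_Term": 3650,
    "Gender": "Male", "Married": "Yes", "Dependents": "2",
    "Education": "Graduate", "Self_Employed": "No", "Property_Area": "Urban",
}

feature_list = [
    "ApplicantIncome", "LoanAmount", "Loan_Amount_Term",
    "Gender-Female", "Gender-Male",
    "Married-No", "Married-Yes",
    "Dependents-0", "Dependents-1", "Dependents-2", "Dependents-3+",
    "Education-Graduate", "Education-Not Graduate",
    "Self_Employed-No", "Self_Employed-Yes",
    "Property_Area-Rural", "Property_Area-Semiurban", "Property_Area-Urban",
]

def encode(vals):
    # Numeric features are copied through; "Feature-Category" entries become
    # 1 when the extracted value matches the category and 0 otherwise.
    row = []
    for entry in feature_list:
        feature, *category = entry.split("-")
        if not category:
            row.append(vals[feature])
        else:
            row.append(1 if vals[feature] == category[0] else 0)
    return np.array([row])

vector = encode(extracted)
# The resulting (1, 18) row is:
# [50000, 100000, 3650, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1]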

The model-settings.json for the preprocessor model looks like this:

!cat models/preprocessor/model-settings.json
{
    "name": "preprocessor",
    "implementation": "model.PreProcessorModel",
    "parameters": {
        "version": "v0.1.0"
    }
}

The associated manifest file is:

!cat manifests/preprocessor.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: preprocessor
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/getting-started/quickstart/models/preprocessor"
  requirements:
  - mlserver

To deploy the preprocessor, run the following command:

!kubectl apply -f manifests/preprocessor.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/preprocessor created
model.mlops.seldon.io/loan-model condition met
model.mlops.seldon.io/preprocessor condition met

We are now ready to deploy the last model, which is the actual LLM. The model-settings.json is:

!cat models/mistral/model-settings.json
{
    "name": "mistral",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "mistralai/Mistral-7B-Instruct-v0.2",
                    "tensor_parallel_size": 4,
                    "dtype": "float16",
                    "max_model_len": 4096,
                    "default_generate_kwargs": {
                        "max_tokens": 1024
                    }
                }
            }
        }
    }
}

For this model we specified the following:

  1. The vllm backend to be used by the inference engine.

  2. The model_type, which specifies that the model is going to be used for chat completions.

  3. The model, which references the model name on the HuggingFace model hub.

  4. We shard the model across 4 GPUs by specifying tensor_parallel_size=4.

  5. Our hardware only supports float16 for storing the weights; feel free to remove this setting or replace it with bfloat16 if your hardware supports it.

  6. We limit the maximum context length (the number of tokens the model can process per request) through max_model_len.

  7. Finally, the maximum number of tokens to be generated is set to 1024 through max_tokens.

The manifest file for the LLM is:

!cat manifests/mistral.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: mistral
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/getting-started/quickstart/models/mistral"
  requirements:
  - llm-local

To deploy the model, run the following command:

!kubectl apply -f manifests/mistral.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/mistral created
model.mlops.seldon.io/loan-model condition met
model.mlops.seldon.io/mistral condition met
model.mlops.seldon.io/preprocessor condition met

Now that all our models are deployed, we can define and deploy the pipeline. The manifest file which contains the definition of the pipeline is the following:

!cat manifests/pipeline.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: entity-extraction
spec:
  steps:
    - name: mistral
    - name: preprocessor
      inputs:
      - mistral
    - name: loan-model
      inputs:
      - preprocessor.outputs.predict
  output:
    steps:
    - loan-model
    - preprocessor.outputs.data

The preprocessor takes the LLM's output as its input, its "predict" output tensor is routed to the Random Forest classifier, and the pipeline returns both the classifier's prediction and the preprocessor's "data" tensor (the raw extracted JSON). To deploy the pipeline, run the following commands:

!kubectl apply -f manifests/pipeline.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s pipeline --all -n seldon
pipeline.mlops.seldon.io/entity-extraction created
pipeline.mlops.seldon.io/entity-extraction condition met

We are now ready to perform inference through our pipeline. Before that, we define a helper function to get the IP of the Seldon mesh service:

import subprocess

def get_mesh_ip():
    cmd = f"kubectl get svc seldon-mesh -n seldon -o jsonpath='{{.status.loadBalancer.ingress[0].ip}}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')
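
As a quick sanity check before sending requests, you can confirm that the service has been assigned an external IP. If the printed string is empty, your seldon-mesh service may not be exposed through a LoadBalancer in your cluster:

mesh_ip = get_mesh_ip()
assert mesh_ip, "No external IP found for the seldon-mesh service"
print("Mesh IP:", mesh_ip)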

We also define the system prompt for the LLM, which describes the behaviour we expect from the model:

system_prompt = """\
You are a helpful data entry assistant whose responsibility is extracting data from a message sent by a user. The following is
such a message. Please extract the user's details and return them in a JSON dict with keys: "ApplicantIncome", "LoanAmount",
"Loan_Amount_Term", "Gender", "Married", "Dependents", "Education", "Self_Employed", "Property_Area"

Please ensure that "ApplicantIncome" is an integer greater than 0
Please ensure that "LoanAmount" is an integer greater than 0
Please ensure that "Loan_Amount_Term" is an integer value corresponding to number of days, do not provide an expression.
Please ensure that "Gender" is either "Male" or "Female"
Please ensure that "Married" is either "No" or "Yes"
Please ensure that "Dependents" is one of "0", "1", "2" or "3+"
Please ensure that "Education" is one of "Graduate" or "Not_Graduate"
Please ensure that "Self_Employed" is one of "Yes" or "No"
Please ensure that "Property_Area" is one of "Rural", "Semiurban" or "Urban"

Please only return JSON do not add any other text! If values are missing set them to a string: "none"."""
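
Because LLM output is not guaranteed to follow the schema, you may optionally want to validate the extracted JSON on the client side before trusting the prediction. The helper below is a minimal sketch based only on the constraints listed in the system prompt; it is not part of the deployed pipeline:

def validate_extraction(fields: dict) -> list:
    """Return a list of constraint violations (empty if the fields look valid)."""
    allowed = {
        "Gender": {"Male", "Female", "none"},
        "Married": {"Yes", "No", "none"},
        "Dependents": {"0", "1", "2", "3+", "none"},
        # "Not Graduate" matches the "Education-Not Graduate" column used by
        # the preprocessor's feature_list.
        "Education": {"Graduate", "Not Graduate", "none"},
        "Self_Employed": {"Yes", "No", "none"},
        "Property_Area": {"Rural", "Semiurban", "Urban", "none"},
    }
    errors = []
    for key in ("ApplicantIncome", "LoanAmount", "Loan_Amount_Term"):
        value = fields.get(key)
        # Numeric fields should be positive integers, unless missing ("none").
        if value != "none" and (not isinstance(value, int) or value <= 0):
            errors.append(f"{key} should be a positive integer, got {value!r}")
    for key, values in allowed.items():
        if fields.get(key) not in values:
            errors.append(f"{key} should be one of {sorted(values)}, got {fields.get(key)!r}")
    return errors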

The prompt we will send to the model, i.e. the actual text to be processed and from which we want to extract the information, is the following:

prompt = """Hi,
I am looking for a loan to buy a new car.
I am a software engineer and I'm employed by Google. I am married and have a wife and 2 kids. I have a house in Central London and My income is 50,000.
I would like a loan amount of 100,000 and I am looking for a term of 10 years. I'm a graduate.
Mr John Doe"""

We can now send the request:

import json
import requests

inference_request = {
    "inputs": [
        {
            "name": "role",
            "shape": [2],
            "datatype": "BYTES",
            "data": ["system", "user"],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "content",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps([system_prompt]),
                json.dumps([prompt]),
            ],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "type",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["text"]),
                json.dumps(["text"]),
            ],
            "parameters": {"content_type": "str"},
        }
    ],
}

endpoint = f"http://{get_mesh_ip()}/v2/pipelines/entity-extraction/infer"
response = requests.post(endpoint, json=inference_request)

The outputs we get are the following:

print("Prediction:", response.json()['outputs'][0]['data'][0])
print("Data:", response.json()['outputs'][1]['data'][0])
Prediction: 1
Data: {"ApplicantIncome": 50000, "LoanAmount": 100000, "Loan_Amount_Term": 3650, "Gender": "Male", "Married": "Yes", "Dependents": "2", "Education": "Graduate", "Self_Employed": "No", "Property_Area": "Urban"}

We can see that the LLM successfully generated valid JSON, which was then processed and sent to the Random Forest classifier for prediction.
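
If you want to work with the extracted fields programmatically, note that the second pipeline output is a JSON string which can be parsed back into a dictionary:

result = response.json()
prediction = result["outputs"][0]["data"][0]
extracted = json.loads(result["outputs"][1]["data"][0])

print("Loan approved:", bool(prediction))
print("Applicant income:", extracted["ApplicantIncome"])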

Congrats! You just deployed a computational pipeline which combines an LLM with a classical machine learning model.

To delete the pipeline and unload the models, run the following commands:

!kubectl delete -f manifests/pipeline.yaml -n seldon
!kubectl delete -f manifests/mistral.yaml -n seldon
!kubectl delete -f manifests/preprocessor.yaml -n seldon
!kubectl delete -f manifests/loan-model.yaml -n seldon
pipeline.mlops.seldon.io "entity-extraction" deleted
model.mlops.seldon.io "mistral" deleted
model.mlops.seldon.io "preprocessor" deleted
model.mlops.seldon.io "loan-model" deleted
