LogoLogo
  • MLServer
  • Getting Started
  • User Guide
    • Content Types (and Codecs)
    • OpenAPI Support
    • Parallel Inference
    • Adaptive Batching
    • Custom Inference Runtimes
    • Metrics
    • Deployment
      • Seldon Core
      • KServe
    • Streaming
  • Inference Runtimes
    • SKLearn
    • XGBoost
    • MLFlow
    • Spark MlLib
    • LightGBM
    • Catboost
    • Alibi-Detect
    • Alibi-Explain
    • HuggingFace
    • Custom
  • Reference
    • MLServer Settings
    • Model Settings
    • MLServer CLI
    • Python API
      • MLModel
      • Types
      • Codecs
      • Metrics
  • Examples
    • Serving Scikit-Learn models
    • Serving XGBoost models
    • Serving LightGBM models
    • Serving MLflow models
    • Serving a custom model
    • Serving Alibi-Detect models
    • Serving HuggingFace Transformer Models
    • Multi-Model Serving
    • Model Repository API
    • Content Type Decoding
    • Custom Conda environments in MLServer
    • Serving a custom model with JSON serialization
    • Serving models through Kafka
    • Streaming
    • Deploying a Custom Tensorflow Model with MLServer and Seldon Core
  • Changelog
Powered by GitBook
On this page
  • Usage
  • Content Types
  • Settings
  • Loading models
  • Reference

Was this helpful?

Edit on GitHub
Export as PDF
  1. Inference Runtimes

HuggingFace

PreviousAlibi-ExplainNextCustom

Last updated 7 months ago

Was this helpful?

This package provides a MLServer runtime compatible with HuggingFace Transformers.

Usage

You can install the runtime, alongside mlserver, as:

pip install mlserver mlserver-huggingface

For further information on how to use MLServer with HuggingFace, you can check out this .

Content Types

The HuggingFace runtime will always decode the input request using its own built-in codec. Therefore, at the request level will be ignored. Note that this doesn't include annotations, which will be respected as usual.

Settings

The HuggingFace runtime exposes a couple extra parameters which can be used to customise how the runtime behaves. These settings can be added under the parameters.extra section of your model-settings.json file, e.g.

---
emphasize-lines: 5-8
---
{
  "name": "qa",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "question-answering",
      "optimum_model": true
    }
  }
}
These settings can also be injected through environment variables prefixed with `MLSERVER_MODEL_HUGGINGFACE_`, e.g.

```bash
MLSERVER_MODEL_HUGGINGFACE_TASK="question-answering"
MLSERVER_MODEL_HUGGINGFACE_OPTIMUM_MODEL=true
```

Loading models

Local models

It is possible to load a local model into a HuggingFace pipeline by specifying the model artefact folder path in parameters.uri in model-settings.json.

HuggingFace models

Models in the HuggingFace hub can be loaded by specifying their name in parameters.extra.pretrained_model in model-settings.json.

If `parameters.extra.pretrained_model` is specified, it takes precedence over `parameters.uri`.

Reference

You can find the full reference of the accepted extra settings for the HuggingFace runtime below:


.. autopydantic_settings:: mlserver_huggingface.settings.HuggingFaceSettings
worked out example
content type annotations
input-level content type