Prompting

The Prompt Runtime enables the deployment of models dedicated to generating prompts. It takes input tensors, compiles them into prompts using Jinja templates, and forwards the resulting prompts to a locally deployed large language model (LLM) for completion. While compiling prompts directly inside the LLM deployment is generally more efficient, it becomes restrictive when the same model has to serve several tasks or appear at multiple points in a pipeline, and redeploying the model for each task is impractical given the resource demands of large language models. The Prompt Runtime offers a flexible alternative: the same local LLM can be reused across prompts with minimal overhead, at the cost of only one extra inference request. It is intended to complement the Local Runtime in the following manner:

  • A local LLM is deployed using the default chat template included in its config.json (see HuggingFace Models); a deployment sketch follows this list.

  • Multiple prompts can then be deployed on top of it, each referencing in its model-settings.json the LLM that should perform the completion; see the second sketch below.
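
As a concrete illustration of the first step, the model-settings.json for the local LLM deployment might look something like the following. This is a minimal sketch: the implementation class path (`mlserver_llm.LocalRuntime`) and the `model_id` key are illustrative assumptions, not the runtime's confirmed schema, so consult the runtime's reference documentation for the exact field names. Because the chat template is picked up from the model's HuggingFace configuration, no template needs to be specified here.

```json
{
  "name": "local-llm",
  "implementation": "mlserver_llm.LocalRuntime",
  "parameters": {
    "extra": {
      "model_id": "mistralai/Mistral-7B-Instruct-v0.2"
    }
  }
}
```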
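
For the second step, each prompt deployment would then name the LLM that should complete its prompts. Again a hedged sketch: the `PromptRuntime` class path, the use of `uri` to locate the Jinja template, and the `llm_name` key are assumptions made for illustration only.

```json
{
  "name": "summarise-prompt",
  "implementation": "mlserver_llm.PromptRuntime",
  "parameters": {
    "uri": "./summarise.jinja",
    "extra": {
      "llm_name": "local-llm"
    }
  }
}
```

The referenced Jinja template could be as simple as the following, where `text` and `max_words` are hypothetical input tensors supplied in the inference request:

```jinja
Summarise the following text in at most {{ max_words }} words:

{{ text }}
```

Because every prompt deployment points at the same local-llm, switching tasks only means deploying another lightweight prompt model; the LLM itself stays in place, which is the one-extra-inference-request trade-off described above.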
