
Job Types

Overview

Replicate supports three distinct job kinds — predictions, training, and procedures (pipelines) — all running through the same Director/cog infrastructure but with different queue prefixes, cog endpoints, timeout behaviors, and webhook structures. Workers AI has a single job type: inference. Task-type differentiation (text-generation, image-to-text, etc.) exists only as an input/output schema concern in worker-constellation-entry; everything downstream is task-agnostic.


Replicate Job Types

Director supports three job kinds, configured per-deployment via DIRECTOR_JOB_KIND (director/config.go:37):

  • prediction (default)
  • training
  • procedure

The job kind determines which cog HTTP endpoint Director calls, how webhooks are structured, and what timeout behavior applies.

How Job Kind Is Set

Cluster sets DIRECTOR_JOB_KIND when generating the Director pod spec (pkg/kubernetes/deployable.go:1268-1280):

  • isProcedureMode() → procedure
  • isTraining() → training
  • Neither → default prediction
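The selection order above can be sketched as follows. This is an illustrative Python rendering of the branch in pkg/kubernetes/deployable.go (the real code is Go); the function name is mine:

```python
def director_job_kind(is_procedure_mode: bool, is_training: bool) -> str:
    """Mirror of the pod-spec generation branch: procedure mode is
    checked first, then training, with prediction as the default."""
    if is_procedure_mode:
        return "procedure"
    if is_training:
        return "training"
    return "prediction"
```

Cluster then sets the result as DIRECTOR_JOB_KIND in the Director pod's environment.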


DeployableKind (Web Side)

The web app distinguishes four deployable kinds (deployable_config.py:114-120):

DeployableKind | Director JobKind | Redis Queue Pattern | Key Pattern
--- | --- | --- | ---
DEPLOYMENT_PREDICTION | prediction | input:prediction:{key} | dp-{uuid4_hex}
FUNCTION_PREDICTION | procedure | input:prediction:{key} (or UNIFIED_PIPELINE_QUEUE) | fp-{hash}
VERSION_PREDICTION | prediction or procedure | input:prediction:{key} (or UNIFIED_PIPELINE_QUEUE) | vp-{docker_image_id[:32]}
VERSION_TRAINING | training | input:training:{key} | vt-{docker_image_id[:32]}

Key prefixes: dp- = deployment, fp- = function/pipeline, vp- = version prediction, vt- = version training. FUNCTION_PREDICTION is used for pipelines (procedures) — see the Procedures section below.

Queue names are generated by DeployableConfig.queue (deployable_config.py:565-589).
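The table's queue scheme can be sketched as a prefix lookup. This is a simplified stand-in for DeployableConfig.queue, not the real implementation (which also handles the UNIFIED_PIPELINE_QUEUE override):

```python
# Queue prefix per DeployableKind, per the table above.
QUEUE_PREFIX = {
    "DEPLOYMENT_PREDICTION": "input:prediction:",
    "FUNCTION_PREDICTION": "input:prediction:",
    "VERSION_PREDICTION": "input:prediction:",
    "VERSION_TRAINING": "input:training:",
}

def queue_name(kind: str, key: str) -> str:
    """Combine the kind-specific prefix with the deployable key."""
    return QUEUE_PREFIX[kind] + key
```

For example, a version training with key vt-abc123 lands on input:training:vt-abc123.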

OfficialModel and ModelDeployment

Some models expose a “versionless API” — users run predictions against a model name (POST /v1/models/{owner}/{name}/predictions) rather than a specific version hash.

OfficialModel (current production) — Ties a Model to a backing Deployment and a BillingConfig (official_model.py). From the end user’s perspective, an OfficialModel behaves like a Deployment — it has scaling, hardware selection, and version pinning — but the user interacts with it via the model name, not a deployment name. The actual infrastructure (K8s deployment, Director pods, Redis queues) runs through the backing Deployment.

ModelDeployment (stalled replacement) — Intended to replace OfficialModel by associating a Model with a Deployment (and optionally a Procedure) via a labeled relationship (model_deployment.py). The migration from OfficialModel to ModelDeployment was started but the work stalled and was rolled back. Both entities exist in the codebase and the dispatch logic checks for a default ModelDeployment first, falling back to OfficialModel (api_internal_views.py:303-326). TODO comments in the code (e.g. “Drop is_official from here once OfficialModels are gone” at model.py:1322) reflect the intended direction.

A model uses_versionless_api when it has either an OfficialModel or a default ModelDeployment (model.py:1321-1323).
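The precedence between the two entities can be sketched like this (an illustrative rendering of model.py:1321-1323 and the dispatch in api_internal_views.py:303-326; argument names are mine):

```python
def uses_versionless_api(official_model, default_model_deployment) -> bool:
    """A model is addressable by name when either backing entity exists."""
    return official_model is not None or default_model_deployment is not None

def backing_entity(official_model, default_model_deployment):
    """Dispatch precedence: a default ModelDeployment wins;
    OfficialModel is the fallback."""
    if default_model_deployment is not None:
        return default_model_deployment
    return official_model
```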

Predictions

The default job kind. Director sends PUT /predictions/{id} to cog (cog/client.go:183-184).

Three prediction subtypes exist, distinguished by how the deployable is created:

Version predictions — Direct runs against a model version. Created when a user runs a prediction against a specific version ID. Key: vp-{docker_image_id[:32]} (version.py:618-619).

Deployment predictions — Runs through a named deployment. The deployment owns the scaling config, hardware selection, and version pinning. Kind is DEPLOYMENT_PREDICTION with key prefix dp- (deployment.py:611, deployment.py:1024-1027).

Hotswap predictions — A version that has additional_weights and a base_version. The base container stays running and loads different weights (e.g. LoRA adapters) at runtime instead of booting a new container per version.

Hotswap requirements (version.py:586-594):

  • Version has additional_weights
  • Version has a base_version
  • Version’s base docker image matches the base version’s docker image
  • Base version’s model is public, image is non-virtual, and image accepts replicate weights

Hotswap key: hp-{base_version.docker_image_id[:32]} (version.py:605-608). Multiple hotswappable versions sharing the same base version share the same key, so they share the same Director pool.
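The key derivation is simple enough to show directly (a Python sketch of version.py:605-608):

```python
def hotswap_key(base_docker_image_id: str) -> str:
    """hp-{base image id, truncated to 32 chars}. Because the key is
    derived from the *base* version's image, every hotswappable version
    built on that base produces the same key and shares a Director pool."""
    return "hp-" + base_docker_image_id[:32]
```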

Hotswap uses prefer_same_stream=True (logic.py:1243), which tells Director’s Redis message queue to prefer consuming from the same stream shard repeatedly (redis/queue.go:248-251). This increases the chance that a Director instance serves the same version’s weights consecutively, avoiding unnecessary weight reloads. PreferSameStream is incompatible with batching (MaxConcurrentPredictions > 1) (config.go:90-92).
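The incompatibility guard can be sketched as a one-line validation (the real check is Go in config.go:90-92; this Python rendering is illustrative):

```python
def validate_director_config(prefer_same_stream: bool,
                             max_concurrent_predictions: int) -> None:
    """Same-stream preference only makes sense when the Director
    consumes one prediction at a time; batching defeats the point."""
    if prefer_same_stream and max_concurrent_predictions > 1:
        raise ValueError(
            "PreferSameStream is incompatible with MaxConcurrentPredictions > 1"
        )
```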

Training

DIRECTOR_JOB_KIND=training. Director sends PUT /trainings/{id} (if cog >= 0.11.6), otherwise falls back to PUT /predictions/{id} (cog/client.go:279-311).

Training-specific behaviors:

  • Queue prefix is input:training: instead of input:prediction:
  • Timeout uses the higher of the prediction timeout and TrainTimeoutSeconds (24 hours) (deployable.go:1901)
  • Prediction.Destination field is present (specifies where trained weights go) (cog/types.go:182)
  • Concurrency is always 1 (deployable_config.py:417-418)
  • Webhook payload wraps the prediction in a Job with Kind: "training" and the prediction in the Training field instead of Prediction (worker.go:746-749)
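Two of these behaviors, the timeout rule and the webhook wrapping, can be sketched as follows. Field names follow the Go structs described above, but the Python itself is illustrative:

```python
TRAIN_TIMEOUT_SECONDS = 24 * 60 * 60  # 24 hours

def training_timeout(prediction_timeout_seconds: int) -> int:
    """deployable.go:1901: the higher of the two timeouts wins."""
    return max(prediction_timeout_seconds, TRAIN_TIMEOUT_SECONDS)

def webhook_job(kind: str, payload: dict) -> dict:
    """worker.go:746-749: trainings are wrapped in a Job whose payload
    lives in the Training field rather than Prediction."""
    field = "training" if kind == "training" else "prediction"
    return {"kind": kind, field: payload}
```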

Training is set via DeployableConfig.training_mode which maps to DeployableKind.VERSION_TRAINING (deployable_config.py:591-595).

Procedures

DIRECTOR_JOB_KIND=procedure. Director sends PUT /procedures/{id} (cog/client.go:54).

Procedures are multi-step workflows (pipelines) that run user-provided Python code on a shared runtime image (pipelines-runtime). Unlike predictions and trainings, the container image is not user-built — it’s a standard runtime that loads user code at execution time.

Procedure source loading:

Before executing, Director downloads the procedure source archive (tar.gz) from a signed URL, extracts it to a local cache directory, and rewrites the procedure_source_url in the prediction context to a file:// path (worker.go:540-548). The procedures.Manager handles download, caching (keyed by URL hash), and extraction (internal/procedures/manager.go).
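The cache-and-rewrite step can be sketched as below. The real implementation is Go (internal/procedures/manager.go); the hash algorithm and directory layout here are assumptions, kept only to show the shape of the flow:

```python
import hashlib
import os

def cached_source_dir(url: str, cache_dir: str) -> str:
    """Local extraction directory keyed by a hash of the signed URL
    (hash choice is illustrative, not the Manager's actual scheme)."""
    digest = hashlib.sha256(url.encode()).hexdigest()
    return os.path.join(cache_dir, digest)

def rewrite_source_url(context: dict, cache_dir: str) -> dict:
    """Replace the remote archive URL in the prediction context with a
    file:// path, as Director does before calling cog."""
    local = cached_source_dir(context["procedure_source_url"], cache_dir)
    return {**context, "procedure_source_url": "file://" + local}
```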

Webhook compatibility hack:

When sending webhooks back to the API, Director overwrites the procedure job kind to prediction because the API doesn’t understand the procedure kind yet (worker.go:734-741).
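The shim amounts to a single rewrite on the outgoing kind (a Python sketch of the Go logic at worker.go:734-741):

```python
def outgoing_webhook_kind(job_kind: str) -> str:
    """The API doesn't understand 'procedure' yet, so procedures are
    reported back as predictions; other kinds pass through unchanged."""
    return "prediction" if job_kind == "procedure" else job_kind
```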

Procedure subtypes in web:

  • Procedure — persistent, owned by a user/org, tied to a model. Uses DeployableKind.FUNCTION_PREDICTION (procedure.py:230).
  • EphemeralProcedure — temporary/draft, used for iteration. Source archives expire after 12 hours. Uses DeployableKind.VERSION_PREDICTION (procedure.py:42).

Both can route to the UNIFIED_PIPELINE_QUEUE if configured (deployable_config.py:573-584), which is a shared warm compute pool.

Workers AI Job Types

Workers AI has a single job type: inference. There is no training, no procedure/pipeline execution, and no multi-step workflow at the request level.

Task Types (Schema-Level Only)

worker-constellation-entry defines 14 task types in taskMappings (catalog.ts):

  • text-generation, text-classification, text-embeddings, text-to-image, text-to-speech
  • automatic-speech-recognition, image-classification, image-to-text, image-text-to-text
  • object-detection, translation, summarization, multimodal-embeddings
  • dumb-pipe (passthrough, no input/output transformation)

Each task type is a class that defines:

  • Input schema validation and preprocessing
  • Payload generation (transforming user input to backend format)
  • Output postprocessing
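The real task classes are TypeScript in worker-constellation-entry's catalog.ts; a hypothetical one might look like this Python sketch (class name, schema, and backend payload shape are all assumptions):

```python
class TextClassificationTask:
    """Illustrative stand-in for a taskMappings entry: validate input,
    build the backend payload, postprocess the raw output. Everything
    downstream of these three hooks is task-agnostic."""

    def validate(self, inputs: dict) -> None:
        if "text" not in inputs:
            raise ValueError("'text' is required")

    def payload(self, inputs: dict) -> dict:
        # Transform user input into the backend's expected format.
        return {"inputs": [inputs["text"]]}

    def postprocess(self, raw: list) -> list:
        # Normalize raw backend output for the caller.
        return [{"label": r["label"], "score": round(r["score"], 4)} for r in raw]
```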

The task type is determined by the model’s task_slug from the config API. The infrastructure path (constellation-entry → constellation-server → GPU container) is identical regardless of task type. Task types are purely a worker-constellation-entry concern — constellation-entry and constellation-server are task-agnostic.

LoRA at Inference Time

constellation-server supports LoRA adapter loading for Triton-backed models (inference/triton.rs:142-214). When a request includes a lora parameter:

  1. Finetune config is fetched from the config API (Deus) via the model-repository crate
  2. Config is validated: finetune must exist for the current model, all required assets must be present
  3. lora_config and lora_name are injected as additional Triton inference inputs

This is inference-time adapter loading, not training. The base model stays loaded and the adapter is applied per-request. There is no LoRA-aware routing or stickiness — constellation-server’s endpoint selection and permit system are unaware of the lora field, so consecutive requests for the same adapter may hit different GPU containers.

Async Queue

Two conditions must be met for async execution:

  1. The model must have use_async_queue enabled in the config API (Deus) — a per-model property, default false (config.rs:134)
  2. The requester must opt in per-request via options.queueRequest: true in the JSON body (ai.ts:133-144)

When both are true, the request is POSTed to the async-queue-worker service binding (queue.ts:23). The response is either 200 (completed immediately) or 202 (queued). To retrieve results, the requester sends a follow-up request with inputs.request_id set, which triggers a GET to the queue service (ai.ts:148-153).
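The two-condition gate can be sketched as a routing function (an illustrative Python rendering of the TypeScript logic in ai.ts/queue.ts):

```python
def route_request(model_use_async_queue: bool, body: dict) -> str:
    """Both the per-model flag and the per-request opt-in must be set
    for the async-queue path; otherwise inference runs synchronously."""
    opted_in = body.get("options", {}).get("queueRequest") is True
    if model_use_async_queue and opted_in:
        return "async-queue-worker"  # POST; responds 200 (done) or 202 (queued)
    return "direct-inference"
```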

Limitations:

  • Polling only — no webhook/callback support
  • No streaming — inputs.stream is explicitly deleted when queueRequest is true (ai.ts:135)

This contrasts with Replicate where every prediction is async by default (webhook-driven), and Prefer: wait is the opt-in for synchronous behavior. Workers AI is synchronous by default, with async as an opt-in that requires both model-level enablement and per-request signaling.

Key Differences

Job type diversity:

  • Replicate: Three distinct job kinds (prediction, training, procedure) with different cog endpoints, timeout behaviors, queue prefixes, and webhook structures
  • Workers AI: Single job type (inference) with task-type differentiation only at the input/output schema level

Training support:

  • Replicate: First-class training support with dedicated queue prefix, extended timeouts (24h), destination field for trained weights, and separate cog endpoint
  • Workers AI: No training capability. LoRA adapters are loaded at inference time from pre-trained weights, not trained on the platform

Multi-step workflows:

  • Replicate: Procedures allow user-provided Python code to run on a shared runtime, enabling multi-model orchestration and custom logic. Source code is downloaded and cached per-execution
  • Workers AI: No equivalent. Multi-step orchestration is left to the caller

Weight loading model:

  • Replicate: Hotswap mechanism allows a base container to serve multiple versions by loading different weights at runtime. Uses prefer_same_stream to optimize for cache locality across a shared Director pool
  • Workers AI: LoRA adapters are injected per-request at the Triton level. No concept of a shared base container serving multiple “versions” — each model ID maps to a fixed deployment

Job type awareness:

  • Replicate: The same infrastructure (cluster, Director, cog) handles all three job kinds. Job kind is a configuration axis that changes queue prefixes, cog endpoint paths, timeout calculations, and webhook payload structure — but the systems are the same
  • Workers AI: Task type is a thin layer in worker-constellation-entry only. Everything downstream (constellation-entry, constellation-server, GPU containers) is task-agnostic