# Job Types

## Overview
Replicate supports three distinct job kinds — predictions, training, and procedures (pipelines) — all running through the same Director/cog infrastructure but with different queue prefixes, cog endpoints, timeout behaviors, and webhook structures. Workers AI has a single job type: inference. Task-type differentiation (text-generation, image-to-text, etc.) exists only as an input/output schema concern in worker-constellation-entry; everything downstream is task-agnostic.
## Replicate Job Types

Director supports three job kinds, configured per-deployment via `DIRECTOR_JOB_KIND`
(director/config.go:37):

- `prediction` (default)
- `training`
- `procedure`
The job kind determines which cog HTTP endpoint Director calls, how webhooks are structured, and what timeout behavior applies.
### How Job Kind Is Set
Cluster sets DIRECTOR_JOB_KIND when generating the Director pod spec
(pkg/kubernetes/deployable.go:1268-1280):
- `isProcedureMode()` → `procedure`
- `isTraining()` → `training`
- Neither → default `prediction`

Where:

- `isProcedureMode()` = `UseProcedureMode || UseUnifiedPipelineNodePool` (deployable_config.go:234)
- `isTraining()` = `TrainingMode != nil && *TrainingMode` (deployable_config.go:321)
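The resolution order above can be sketched in Python. This is an illustrative mirror of the Go logic in pkg/kubernetes/deployable.go, not the real types; the dataclass fields and defaults here are assumptions.

```python
# Hypothetical sketch of DIRECTOR_JOB_KIND resolution; field names mirror
# the Go source but the dataclass itself is illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeployableConfig:
    use_procedure_mode: bool = False
    use_unified_pipeline_node_pool: bool = False
    training_mode: Optional[bool] = None  # tri-state, like *bool in Go

    def is_procedure_mode(self) -> bool:
        return self.use_procedure_mode or self.use_unified_pipeline_node_pool

    def is_training(self) -> bool:
        return self.training_mode is not None and self.training_mode

def director_job_kind(cfg: DeployableConfig) -> str:
    """Resolve the DIRECTOR_JOB_KIND value for the Director pod spec."""
    if cfg.is_procedure_mode():
        return "procedure"   # procedure mode checked before training
    if cfg.is_training():
        return "training"
    return "prediction"      # default
```

Note that procedure mode is checked first, so a deployable with both flags set would run as a procedure under this reading.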
### DeployableKind (Web Side)
The web app distinguishes four deployable kinds
(deployable_config.py:114-120):
| DeployableKind | Director JobKind | Redis Queue Pattern | Key Pattern |
|---|---|---|---|
| `DEPLOYMENT_PREDICTION` | `prediction` | `input:prediction:{key}` | `dp-{uuid4_hex}` |
| `FUNCTION_PREDICTION` | `procedure` | `input:prediction:{key}` (or `UNIFIED_PIPELINE_QUEUE`) | `fp-{hash}` |
| `VERSION_PREDICTION` | `prediction` or `procedure` | `input:prediction:{key}` (or `UNIFIED_PIPELINE_QUEUE`) | `vp-{docker_image_id[:32]}` |
| `VERSION_TRAINING` | `training` | `input:training:{key}` | `vt-{docker_image_id[:32]}` |
Key prefixes: dp- = deployment, fp- = function/pipeline, vp- = version prediction,
vt- = version training. FUNCTION_PREDICTION is used for pipelines (procedures) — see
the Procedures section below.
Queue names are generated by DeployableConfig.queue
(deployable_config.py:565-589).
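The queue patterns in the table reduce to a prefix-plus-key scheme. A minimal sketch, assuming the mapping above; the real `DeployableConfig.queue` (deployable_config.py:565-589) also handles the unified pipeline queue, which this sketch omits.

```python
# Illustrative queue-name derivation from DeployableKind; the
# UNIFIED_PIPELINE_QUEUE override for procedures is not modeled here.
QUEUE_PREFIX = {
    "DEPLOYMENT_PREDICTION": "input:prediction:",
    "FUNCTION_PREDICTION": "input:prediction:",
    "VERSION_PREDICTION": "input:prediction:",
    "VERSION_TRAINING": "input:training:",
}

def queue_name(kind: str, key: str) -> str:
    """Build the Redis queue name for a deployable (sketch)."""
    return QUEUE_PREFIX[kind] + key
```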
### OfficialModel and ModelDeployment
Some models expose a “versionless API” — users run predictions against a model name
(POST /v1/models/{owner}/{name}/predictions) rather than a specific version hash.
OfficialModel (current production) — Ties a Model to a backing Deployment and a
BillingConfig
(official_model.py).
From the end user’s perspective, an OfficialModel behaves like a Deployment — it has
scaling, hardware selection, and version pinning — but the user interacts with it via the
model name, not a deployment name. The actual infrastructure (K8s deployment, Director
pods, Redis queues) runs through the backing Deployment.
ModelDeployment (stalled replacement) — Intended to replace OfficialModel by
associating a Model with a Deployment (and optionally a Procedure) via a labeled
relationship
(model_deployment.py).
The migration from OfficialModel to ModelDeployment was started but the work stalled and
was rolled back. Both entities exist in the codebase and the dispatch logic checks for a
default ModelDeployment first, falling back to OfficialModel
(api_internal_views.py:303-326).
TODO comments in the code (e.g. “Drop is_official from here once OfficialModels are
gone” at
model.py:1322)
reflect the intended direction.
A model uses_versionless_api when it has either an OfficialModel or a default
ModelDeployment
(model.py:1321-1323).
### Predictions

The default job kind. Director sends `PUT /predictions/{id}` to cog
(cog/client.go:183-184).
Three prediction subtypes exist, distinguished by how the deployable is created:
Version predictions — Direct runs against a model version. Created when a user runs a
prediction against a specific version ID. Key: vp-{docker_image_id[:32]}
(version.py:618-619).
Deployment predictions — Runs through a named deployment. The deployment owns the
scaling config, hardware selection, and version pinning. Kind is DEPLOYMENT_PREDICTION
with key prefix dp-
(deployment.py:611,
deployment.py:1024-1027).
Hotswap predictions — A version that has additional_weights and a base_version.
The base container stays running and loads different weights (e.g. LoRA adapters) at
runtime instead of booting a new container per version.
Hotswap requirements
(version.py:586-594):
- Version has `additional_weights`
- Version has a `base_version`
- Version’s base docker image matches the base version’s docker image
- Base version’s model is public, image is non-virtual, and image accepts replicate weights
Hotswap key: hp-{base_version.docker_image_id[:32]}
(version.py:605-608).
Multiple hotswappable versions sharing the same base version share the same key, so they
share the same Director pool.
Hotswap uses prefer_same_stream=True
(logic.py:1243),
which tells Director’s Redis message queue to prefer consuming from the same stream shard
repeatedly
(redis/queue.go:248-251).
This increases the chance that a Director instance serves the same version’s weights
consecutively, avoiding unnecessary weight reloads. PreferSameStream is incompatible
with batching (MaxConcurrentPredictions > 1)
(config.go:90-92).
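The key-sharing behavior falls out of the derivation itself. Here is a reduced sketch of that derivation; the real checks in version.py:586-608 also cover image matching and base-model visibility, which this sketch leaves out, and the dict field names are assumptions.

```python
# Sketch of hotswap key derivation: hotswappable versions key off the
# *base* version's docker image ID, so versions sharing a base land in
# the same Director pool. Only the key logic is modeled here.
def deployable_key(version: dict) -> str:
    base = version.get("base_version")
    if version.get("additional_weights") and base:
        return "hp-" + base["docker_image_id"][:32]   # shared with siblings
    return "vp-" + version["docker_image_id"][:32]    # ordinary version
```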
### Training

`DIRECTOR_JOB_KIND=training`. Director sends `PUT /trainings/{id}` (if cog >= 0.11.6),
otherwise falls back to `PUT /predictions/{id}`
(cog/client.go:279-311).
Training-specific behaviors:

- Queue prefix is `input:training:` instead of `input:prediction:`
- Timeout uses the higher of the prediction timeout and `TrainTimeoutSeconds` (24 hours) (deployable.go:1901)
- `Prediction.Destination` field is present (specifies where trained weights go) (cog/types.go:182)
- Concurrency is always 1 (deployable_config.py:417-418)
- Webhook payload wraps the prediction in a `Job` with `Kind: "training"` and the prediction in the `Training` field instead of `Prediction` (worker.go:746-749)
Training is set via DeployableConfig.training_mode which maps to
DeployableKind.VERSION_TRAINING
(deployable_config.py:591-595).
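The webhook envelope difference described above can be sketched as follows. This assumes the JSON keys are the lowercased Go field names (`kind`, `training`, `prediction`); the actual serialization in worker.go may differ.

```python
# Illustrative sketch of the training webhook envelope: trainings carry
# the prediction object under "training" rather than "prediction".
def webhook_payload(job_kind: str, prediction: dict) -> dict:
    job = {"kind": job_kind}
    if job_kind == "training":
        job["training"] = prediction     # Training field for trainings
    else:
        job["prediction"] = prediction   # Prediction field otherwise
    return job
```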
### Procedures

`DIRECTOR_JOB_KIND=procedure`. Director sends `PUT /procedures/{id}`
(cog/client.go:54).
Procedures are multi-step workflows (pipelines) that run user-provided Python code on a
shared runtime image (pipelines-runtime). Unlike predictions and trainings, the
container image is not user-built — it’s a standard runtime that loads user code at
execution time.
Procedure source loading:
Before executing, Director downloads the procedure source archive (tar.gz) from a signed
URL, extracts it to a local cache directory, and rewrites the
procedure_source_url in the prediction context to a file:// path
(worker.go:540-548).
The procedures.Manager handles download, caching (keyed by URL hash), and extraction
(internal/procedures/manager.go).
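The download-cache-rewrite flow can be sketched like this. The cache directory layout and the choice of SHA-256 for the URL hash are assumptions; the real `procedures.Manager` is Go and its keying scheme may differ.

```python
# Minimal sketch of procedure source fetching: cache entries are keyed
# by a hash of the signed URL, and the result is a file:// path that
# replaces procedure_source_url in the prediction context.
import hashlib
import os
import tarfile
import urllib.request

CACHE_DIR = "/tmp/procedure-cache"   # illustrative location

def fetch_procedure_source(signed_url: str) -> str:
    key = hashlib.sha256(signed_url.encode()).hexdigest()
    dest = os.path.join(CACHE_DIR, key)
    if not os.path.isdir(dest):                 # cache miss: download + extract
        os.makedirs(dest, exist_ok=True)
        archive, _ = urllib.request.urlretrieve(signed_url)
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest)
    return "file://" + dest                     # rewritten source URL
```

On a cache hit (same signed URL) no download happens, which is what makes repeated executions of the same procedure cheap.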
Webhook compatibility hack:
When sending webhooks back to the API, Director overwrites the procedure job kind to
prediction because the API doesn’t understand the procedure kind yet
(worker.go:734-741).
Procedure subtypes in web:

- `Procedure` — persistent, owned by a user/org, tied to a model. Uses `DeployableKind.FUNCTION_PREDICTION` (procedure.py:230).
- `EphemeralProcedure` — temporary/draft, used for iteration. Source archives expire after 12 hours. Uses `DeployableKind.VERSION_PREDICTION` (procedure.py:42).
Both can route to the UNIFIED_PIPELINE_QUEUE if configured
(deployable_config.py:573-584),
which is a shared warm compute pool.
## Workers AI Job Types
Workers AI has a single job type: inference. There is no training, no procedure/pipeline execution, and no multi-step workflow at the request level.
### Task Types (Schema-Level Only)
worker-constellation-entry defines 14 task types in taskMappings
(catalog.ts):
- `text-generation`, `text-classification`, `text-embeddings`, `text-to-image`, `text-to-speech`
- `automatic-speech-recognition`, `image-classification`, `image-to-text`, `image-text-to-text`
- `object-detection`, `translation`, `summarization`, `multimodal-embeddings`
- `dumb-pipe` (passthrough, no input/output transformation)
Each task type is a class that defines:
- Input schema validation and preprocessing
- Payload generation (transforming user input to backend format)
- Output postprocessing
The task type is determined by the model’s task_slug from the config API. The
infrastructure path (constellation-entry → constellation-server → GPU container) is
identical regardless of task type. Task types are purely a worker-constellation-entry
concern — constellation-entry and constellation-server are task-agnostic.
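The three responsibilities of a task class can be sketched as a simple interface. The class and method names here are illustrative Python, not the real TypeScript API in catalog.ts, and the field names (`prompt`, `text_input`, `text_output`) are assumptions.

```python
# Illustrative shape of a task-type class: it owns input validation,
# payload generation, and output postprocessing; nothing downstream
# needs to know the task type.
class TextGenerationTask:
    def validate(self, inputs: dict) -> None:
        if "prompt" not in inputs and "messages" not in inputs:
            raise ValueError("prompt or messages required")

    def to_payload(self, inputs: dict) -> dict:
        # transform user input into the backend request format
        return {"text_input": inputs.get("prompt", "")}

    def from_response(self, raw: dict) -> dict:
        # postprocess backend output into the public schema
        return {"response": raw.get("text_output", "")}
```

A `dumb-pipe` task under this shape would simply return its inputs and outputs unchanged.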
### LoRA at Inference Time
constellation-server supports LoRA adapter loading for Triton-backed models
(inference/triton.rs:142-214).
When a request includes a `lora` parameter:

- Finetune config is fetched from the config API (Deus) via the `model-repository` crate
- Config is validated: the finetune must exist for the current model, and all required assets must be present
- `lora_config` and `lora_name` are injected as additional Triton inference inputs
This is inference-time adapter loading, not training. The base model stays loaded and the
adapter is applied per-request. There is no LoRA-aware routing or stickiness —
constellation-server’s endpoint selection and permit system are unaware of the lora
field, so consecutive requests for the same adapter may hit different GPU containers.
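The injection step can be sketched as follows. Only the input names `lora_name` and `lora_config` come from the text above; the lookup structure, input dict shape, and error handling are assumptions (the real code is Rust in inference/triton.rs:142-214).

```python
# Illustrative per-request LoRA injection: when a lora field is present,
# the validated finetune's name and config are appended as extra
# inference inputs alongside the base model inputs.
from typing import Optional

def build_triton_inputs(base_inputs: list, lora: Optional[str],
                        finetunes: dict) -> list:
    inputs = list(base_inputs)
    if lora is not None:
        ft = finetunes.get(lora)
        if ft is None:  # validation: finetune must exist for this model
            raise ValueError(f"finetune {lora!r} not found for this model")
        inputs.append({"name": "lora_name", "data": ft["name"]})
        inputs.append({"name": "lora_config", "data": ft["config"]})
    return inputs
```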
### Async Queue
Two conditions must be met for async execution:
- The model must have `use_async_queue` enabled in the config API (Deus) — a per-model property, default `false` (config.rs:134)
- The requester must opt in per-request via `options.queueRequest: true` in the JSON body (ai.ts:133-144)
When both are true, the request is POSTed to the async-queue-worker service binding
(queue.ts:23).
The response is either 200 (completed immediately) or 202 (queued). To retrieve results,
the requester sends a follow-up request with inputs.request_id set, which triggers a GET
to the queue service
(ai.ts:148-153).
Limitations:

- Polling only — no webhook/callback support
- No streaming — `inputs.stream` is explicitly deleted when `queueRequest` is true (ai.ts:135)
This contrasts with Replicate where every prediction is async by default (webhook-driven),
and Prefer: wait is the opt-in for synchronous behavior. Workers AI is synchronous by
default, with async as an opt-in that requires both model-level enablement and per-request
signaling.
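The double opt-in and the stream-stripping behavior can be sketched together. The routing targets and body shape here are illustrative; only `use_async_queue`, `options.queueRequest`, and the deletion of `inputs.stream` come from the source.

```python
# Sketch of the async routing decision: both the per-model flag and the
# per-request opt-in must be set, and streaming is dropped when queued.
def route_request(model_cfg: dict, body: dict) -> str:
    queued = (model_cfg.get("use_async_queue", False)
              and body.get("options", {}).get("queueRequest", False))
    if queued:
        body.get("inputs", {}).pop("stream", None)  # no streaming when queued
        return "async-queue-worker"
    return "sync"
```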
## Key Differences
Job type diversity:
- Replicate: Three distinct job kinds (prediction, training, procedure) with different cog endpoints, timeout behaviors, queue prefixes, and webhook structures
- Workers AI: Single job type (inference) with task-type differentiation only at the input/output schema level
Training support:
- Replicate: First-class training support with dedicated queue prefix, extended timeouts (24h), destination field for trained weights, and separate cog endpoint
- Workers AI: No training capability. LoRA adapters are loaded at inference time from pre-trained weights, not trained on the platform
Multi-step workflows:
- Replicate: Procedures allow user-provided Python code to run on a shared runtime, enabling multi-model orchestration and custom logic. Source code is downloaded and cached per-execution
- Workers AI: No equivalent. Multi-step orchestration is left to the caller
Weight loading model:
- Replicate: Hotswap mechanism allows a base container to serve multiple versions by loading different weights at runtime. Uses `prefer_same_stream` to optimize for cache locality across a shared Director pool
- Workers AI: LoRA adapters are injected per-request at the Triton level. No concept of a shared base container serving multiple “versions” — each model ID maps to a fixed deployment
Job type awareness:
- Replicate: The same infrastructure (cluster, Director, cog) handles all three job kinds. Job kind is a configuration axis that changes queue prefixes, cog endpoint paths, timeout calculations, and webhook payload structure — but the systems are the same
- Workers AI: Task type is a thin layer in worker-constellation-entry only. Everything downstream (constellation-entry, constellation-server, GPU containers) is task-agnostic