Workload Types in Replicate Stack

The Replicate Stack handles several distinct workload types, each with its own characteristics and configuration. Understanding these distinctions is essential when comparing the stack with Workers AI, which has a more uniform workload model.

Overview

Replicate defines four primary workload types via the DeployableKind enum (replicate/web/models/models/deployable_config.py:114):

class DeployableKind(models.TextChoices):
    DEPLOYMENT_PREDICTION = "deployment-prediction", "deployment-prediction"
    FUNCTION_PREDICTION = "function-prediction", "function-prediction"
    VERSION_PREDICTION = "version-prediction", "version-prediction"
    VERSION_TRAINING = "version-training", "version-training"

Each workload type has its own deployable_metadata_for_* function in replicate/web/models/logic.py that generates the appropriate configuration.
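
The one-function-per-kind layout can be pictured as a lookup from kind to builder. The sketch below is illustrative only: `DeployableKind` is re-declared as a plain `Enum` so the example runs without Django, and `_builder`, `METADATA_BUILDERS`, and `metadata_for` are hypothetical names. Whether logic.py uses a dispatch table at all is not stated in the source.

```python
from enum import Enum
from typing import Callable, Dict


class DeployableKind(str, Enum):
    """Plain-Enum stand-in for the Django TextChoices class above."""

    DEPLOYMENT_PREDICTION = "deployment-prediction"
    FUNCTION_PREDICTION = "function-prediction"
    VERSION_PREDICTION = "version-prediction"
    VERSION_TRAINING = "version-training"


def _builder(kind: DeployableKind) -> Callable[[str], dict]:
    # Hypothetical placeholder for a deployable_metadata_for_* function.
    def build(subject: str) -> dict:
        return {"kind": kind.value, "subject": subject}

    return build


# One builder per workload kind (assumed structure, for illustration).
METADATA_BUILDERS: Dict[DeployableKind, Callable[[str], dict]] = {
    kind: _builder(kind) for kind in DeployableKind
}


def metadata_for(kind_value: str, subject: str) -> dict:
    kind = DeployableKind(kind_value)  # raises ValueError for an unknown kind
    return METADATA_BUILDERS[kind](subject)
```

Calling `metadata_for("version-training", "example-version")` selects the training builder, while an unrecognized kind string fails early with `ValueError`.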

1. Deployment Predictions

What: Predictions running on a Deployment - a stable, long-lived identifier that routes to a backing model version. The backing version can be changed over time and doesn’t need to be owned by the same account as the deployment.

Characteristics:

  • Persistent deployment entity (configuration/routing), but infrastructure can scale to 0 replicas
  • Custom configuration per deployment that can override many version-level settings
  • Uses dedicated deployment key for consistent routing

Configuration: deployable_metadata_for_deployment_prediction()

Code references:

  • Deployment model: replicate/web/models/models/deployment.py
  • Kind validation: logic.py:1156 - asserts kind == DeployableKind.DEPLOYMENT_PREDICTION and not deployment.used_by_model

Queue behavior: Standard shuffle-sharded queues per deployment
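
For readers unfamiliar with the term, shuffle sharding assigns each key a small, stable subset of the available queues, so that a noisy key only affects the queues it shares with others. The following is a generic sketch of that idea and not Replicate's implementation: `shard_assignment`, the shard counts, and the hashing scheme are all assumptions made for illustration.

```python
import hashlib
import random
from typing import List


def shard_assignment(key: str, total_shards: int, shards_per_key: int) -> List[int]:
    """Pick a deterministic subset of shard indices for a deployment key."""
    # Derive a stable seed from the key so the same key always maps
    # to the same shards, independent of process or machine.
    seed = int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_shards), shards_per_key))


shards_a = shard_assignment("deployment-a", total_shards=16, shards_per_key=4)
shards_b = shard_assignment("deployment-b", total_shards=16, shards_per_key=4)
```

Two different keys usually overlap on only some of their shards, which is what limits the blast radius of a single busy deployment.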

2. Function Predictions (Pipelines/Procedures)

What: Predictions for multi-step workflows (Replicate Pipelines). Function predictions run on a shared container image with CPU-only hardware that “swaps” in procedure source code at prediction time. This is similar to hotswaps, except that what gets swapped is CPU-only Python code rather than GPU model weights.

Characteristics:

  • Run on CPU hardware (no GPU)
  • Share a base container image across procedures
  • Procedure source code is swapped in at runtime (analogous to weight swapping in hotswaps)
  • Part of a larger multi-step workflow
  • Uses AbstractProcedure model, not Version

Configuration: deployable_metadata_for_procedure_prediction()

Code references:

  • Procedure model: replicate/web/models/models/procedure.py
  • Kind validation: logic.py:1165 - when kind == DeployableKind.FUNCTION_PREDICTION, asserts deployment.used_by_model
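
Read together with the deployment-prediction check in section 1, the two kind validations split deployments on `used_by_model`: a deployment prediction requires it to be false, a function prediction requires it to be true. A minimal sketch of that rule follows. `check_deployment_kind` is a hypothetical helper, the enum is a plain-`Enum` stand-in, and it raises `ValueError` where the source describes assertions.

```python
from enum import Enum


class DeployableKind(str, Enum):
    # Stand-in covering only the two kinds that run through a deployment.
    DEPLOYMENT_PREDICTION = "deployment-prediction"
    FUNCTION_PREDICTION = "function-prediction"


def check_deployment_kind(kind: DeployableKind, used_by_model: bool) -> None:
    """Raise ValueError if the kind does not match the deployment's role."""
    if kind == DeployableKind.DEPLOYMENT_PREDICTION and used_by_model:
        raise ValueError("deployment predictions require used_by_model to be False")
    if kind == DeployableKind.FUNCTION_PREDICTION and not used_by_model:
        raise ValueError("function predictions require used_by_model to be True")


# Both valid combinations pass silently.
check_deployment_kind(DeployableKind.DEPLOYMENT_PREDICTION, used_by_model=False)
check_deployment_kind(DeployableKind.FUNCTION_PREDICTION, used_by_model=True)
```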

Director configuration: Director runs procedures with DIRECTOR_JOB_KIND=procedure (director/config.go:37)

Queue behavior: Standard queues, no special routing

3. Version Predictions

What: Predictions running directly on a model version (not through a deployment).

Characteristics:

  • Ephemeral infrastructure (scaled up/down based on demand)
  • Configuration comes from version’s current_prediction_deployable_config
  • Two sub-types: normal and hotswap (see below)

Configuration: deployable_metadata_for_version_prediction()

3a. Normal Version Predictions

What: Standard version predictions without hotswapping.

Characteristics:

  • One container image per version
  • Standard queue routing
  • prefer_same_stream = False (default)

Code: logic.py:1214-1222

if not version.is_hotswappable:
    metadata = DeployableConfigSerializer(
        version.current_prediction_deployable_config
    ).data

3b. Hotswap Version Predictions

What: Versions that share a base container but load different weights at runtime. Multiple “hotswap versions” can run on the same pod by swapping weights instead of restarting containers.

Characteristics:

  • Multiple versions share the same base Docker image
  • Each version has additional_weights (weights loaded at runtime)
  • Base version must be public, non-virtual, and accept Replicate weights
  • prefer_same_stream = True - workers preferentially consume from the same stream to optimize weight locality
  • Uses base version’s deployment key for infrastructure sharing

When a version is hotswappable (version.py:586):

def is_hotswappable(self) -> bool:
    if not self.additional_weights:
        return False
    if not self.base_version:
        return False
    if self.base_docker_image_id != self.base_version.docker_image_relation_id:
        return False
    return self.base_version.is_valid_hotswap_base
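
To see how the four checks interact, the sketch below re-creates the method on small dataclass stand-ins and evaluates a few cases. The `Version` and `BaseVersion` classes here are simplified assumptions, not the real models, and exposing `is_hotswappable` as a property is inferred from the `version.is_hotswappable` usage in logic.py rather than confirmed by the excerpt.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BaseVersion:
    docker_image_relation_id: int
    is_valid_hotswap_base: bool


@dataclass
class Version:
    additional_weights: Optional[str]
    base_version: Optional[BaseVersion]
    base_docker_image_id: Optional[int]

    @property  # assumption: attribute-style access, as in `version.is_hotswappable`
    def is_hotswappable(self) -> bool:
        if not self.additional_weights:
            return False  # nothing to swap in at runtime
        if not self.base_version:
            return False  # no base container to share
        if self.base_docker_image_id != self.base_version.docker_image_relation_id:
            return False  # built against a different image than the base
        return self.base_version.is_valid_hotswap_base


base = BaseVersion(docker_image_relation_id=7, is_valid_hotswap_base=True)
fine_tune = Version("weights.tar", base, 7)       # all checks pass
no_weights = Version(None, base, 7)               # fails the first check
image_mismatch = Version("weights.tar", base, 8)  # fails the image check
```

Only `fine_tune` is hotswappable. The other two fall back to the normal version prediction path.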

Configuration: logic.py:1229-1244

deployable_config_fields = {
    ...
    "docker_image": version.base_version.docker_image_relation,
    "fuse_config": None,
    "key": version.base_version.key_for_hotswap_base_predictions,
    "prefer_same_stream": True,  # KEY DIFFERENCE
}

Queue behavior:

  • Shuffle-sharded queues (like all workloads)
  • DIRECTOR_PREFER_SAME_STREAM=true set for hotswap versions
  • Director workers repeatedly consume from the same stream before checking others
  • Optimizes for weight caching locality (reduces weight download and loading overhead)

Why prefer_same_stream matters:

  • Hotswap versions load different weights into the same container
  • If a worker has already loaded weights for version A, it’s faster to keep processing version A predictions
  • Without prefer_same_stream, workers round-robin across all streams, losing weight locality benefits
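
The effect can be illustrated with a toy simulation of a single worker draining several streams. This is a simplified model, not Director's actual consumption loop: `count_weight_loads` is a hypothetical function, and it assumes each stream carries predictions for one version and that switching versions always costs one weight load.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


def count_weight_loads(streams: Dict[str, List[str]], prefer_same_stream: bool) -> int:
    """Simulate one worker draining all streams; return the number of weight loads."""
    queues: Dict[str, Deque[str]] = {name: deque(items) for name, items in streams.items()}
    order = list(queues)
    loaded: Optional[str] = None  # version whose weights are currently in memory
    loads = 0
    idx = 0  # index of the stream the worker reads next
    while any(queues.values()):
        # Skip over streams that have been fully drained.
        while not queues[order[idx]]:
            idx = (idx + 1) % len(order)
        version = queues[order[idx]].popleft()
        if version != loaded:
            loaded = version
            loads += 1
        # Round-robin moves on after every message; prefer_same_stream
        # only moves on once the current stream is empty.
        if not prefer_same_stream or not queues[order[idx]]:
            idx = (idx + 1) % len(order)
    return loads


streams = {"stream-1": ["A", "A", "A"], "stream-2": ["B", "B", "B"]}
sticky_loads = count_weight_loads(streams, prefer_same_stream=True)
round_robin_loads = count_weight_loads(streams, prefer_same_stream=False)
```

With these inputs the sticky worker loads weights twice, once per version, while the round-robin worker reloads on every prediction, six times in total.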

Code references:

4. Version Trainings

What: Training jobs for versions marked as trainable.

Characteristics:

  • Uses separate current_training_deployable_config (not prediction config)
  • Longer timeouts and different resource requirements
  • Different billing model
  • Director runs with DIRECTOR_JOB_KIND=training

Configuration: deployable_metadata_for_version_training()

Code references:

Summary Table

| Workload Type | Infrastructure | Config Source | prefer_same_stream | Director Job Kind |
|---|---|---|---|---|
| Deployment Prediction | Persistent | Deployment config | False | prediction |
| Function Prediction | Ephemeral | Procedure config | False | procedure |
| Version Prediction (normal) | Ephemeral | Version prediction config | False | prediction |
| Version Prediction (hotswap) | Shared base | Base version config + weights | True | prediction |
| Version Training | Ephemeral | Version training config | False | training |
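
The same summary can be held as data and used to derive the Director settings named earlier. In this sketch `WorkloadProfile`, `WORKLOADS`, and `director_env` are hypothetical names. The variable names `DIRECTOR_JOB_KIND` and `DIRECTOR_PREFER_SAME_STREAM` come from the sections above, but emitting an explicit "false" for non-hotswap workloads is an assumption.

```python
from typing import Dict, NamedTuple


class WorkloadProfile(NamedTuple):
    infrastructure: str
    config_source: str
    prefer_same_stream: bool
    director_job_kind: str


# Rows copied from the summary table.
WORKLOADS: Dict[str, WorkloadProfile] = {
    "deployment-prediction": WorkloadProfile("Persistent", "Deployment config", False, "prediction"),
    "function-prediction": WorkloadProfile("Ephemeral", "Procedure config", False, "procedure"),
    "version-prediction-normal": WorkloadProfile("Ephemeral", "Version prediction config", False, "prediction"),
    "version-prediction-hotswap": WorkloadProfile("Shared base", "Base version config + weights", True, "prediction"),
    "version-training": WorkloadProfile("Ephemeral", "Version training config", False, "training"),
}


def director_env(workload: str) -> Dict[str, str]:
    """Build the Director environment settings implied by a table row."""
    profile = WORKLOADS[workload]
    return {
        "DIRECTOR_JOB_KIND": profile.director_job_kind,
        # Assumption: the flag is written out as "true"/"false" for every workload.
        "DIRECTOR_PREFER_SAME_STREAM": "true" if profile.prefer_same_stream else "false",
    }


hotswap_env = director_env("version-prediction-hotswap")
```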

Key Takeaways

  1. Replicate has explicit workload type separation with different configurations and behaviors per type
  2. Hotswap predictions are unique - they’re the only workload using prefer_same_stream=True for weight locality optimization
  3. Function predictions use code swapping - similar to hotswaps but for CPU-only Python code instead of GPU model weights
  4. Each workload type has distinct characteristics - different infrastructure models, configuration sources, and queue behaviors

These workload distinctions will be referenced throughout the document when discussing queue behavior, timeouts, scaling, and other operational characteristics.