Deployment Management

Overview

Replicate’s autoscaler is reactive: predictions create K8s Deployments on demand, queue depth drives scaling at 1-second resolution, and idle deployments get pruned automatically. Workers AI’s ai-scheduler is proactive: models are pre-provisioned with minimum instance counts, scaling adjusts within configured bounds at 5-minute resolution, and there is no idle-based pruning. The systems reflect fundamentally different assumptions — bursty ephemeral workloads vs. an always-available model catalog.


Replicate: Autoscaler

The autoscaler runs in every GPU model-serving Kubernetes cluster. It manages K8s Deployment objects in the models or serving namespaces — creating them when prediction traffic appears, scaling replica counts based on queue depth, and pruning idle ones.

Source: cluster/pkg/autoscaler/autoscaler.go (fa8042d)

Concurrent loops

Loop                     Interval              Purpose
Deployment Dispatcher    event-driven (BRPOP)  Creates/updates K8s Deployments when predictions arrive
Scaler                   1 second              Adjusts replica counts based on queue depth
Scale State Snapshotter  1 second              Captures queue lengths + replica counts into Redis
Queue Pruner             1 hour                Deletes stuck requests older than 24 hours
Deployment Pruner        1 minute              Deletes K8s Deployments that have been idle too long

Deployment Dispatcher

┌───────────────┐
│ replicate/api │
└──────┬────────┘
       │ LPUSH
       ▼
┌──────────────────────┐
│ prediction-versions  │
│ (Redis list)         │
└──────┬───────────────┘
       │ BRPOP
       ▼
┌──────────────────────┐
│ Deployment           │
│ Dispatcher           │
│ (event-driven)       │
└──────┬───────────────┘
       │ create/update
       ▼
┌──────────────────────┐
│ K8s Deployments      │
│ (models / serving)   │
└──────────────────────┘

startDeploymentDispatcher BRPOPs from the prediction-versions Redis queue. Each prediction triggers ensureDeployableDeployment, which creates or updates the K8s Deployment if the config has changed. Change detection uses a config hash comparison plus a template version serial.

Key behavior:

  • Deployments are created on demand — the first prediction for a version/deployment triggers K8s Deployment creation
  • Rate-limited to avoid overwhelming the K8s API
  • Config hash comparison means no-op if nothing changed
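
The no-op behavior can be illustrated with a config-hash comparison. This is a hypothetical Go sketch, not the autoscaler's actual code; `DeployableConfig`, `configHash`, and `needsUpdate` are illustrative names standing in for the real change-detection logic.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// DeployableConfig is a stand-in for the fields that feed change detection,
// including the template version serial mentioned above.
type DeployableConfig struct {
	Image          string
	TemplateSerial int
}

// configHash derives a stable hash over the config fields.
func configHash(c DeployableConfig) string {
	h := sha256.Sum256([]byte(fmt.Sprintf("%s|%d", c.Image, c.TemplateSerial)))
	return hex.EncodeToString(h[:])
}

// needsUpdate compares the desired config hash against the hash recorded on
// the live Deployment; equal hashes make the dispatch a no-op.
func needsUpdate(liveHash string, desired DeployableConfig) bool {
	return liveHash != configHash(desired)
}

func main() {
	cfg := DeployableConfig{Image: "model:v1", TemplateSerial: 3}
	live := configHash(cfg)
	fmt.Println(needsUpdate(live, cfg)) // false: nothing changed, no-op
	fmt.Println(needsUpdate(live, DeployableConfig{Image: "model:v2", TemplateSerial: 3})) // true: image changed
}
```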

Scaler

┌──────────────────────┐
│ K8s Deployments      │
│ (models / serving)   │
└──────┬───────────────┘
       │ read replicas + queue lengths
       ▼
┌──────────────────────┐
│ Snapshotter (1s)     │
└──────┬───────────────┘
       │ write
       ▼
┌──────────────────────┐
│ Redis cache          │
│ (scale state)        │
└──────┬───────────────┘
       │ read
       ▼
┌──────────────────────┐
│ Scaler (1s)          │
│ computeNewReplica    │
│ CountV2              │
└──────┬───────────────┘
       │ patch replicas
       ▼
┌──────────────────────┐
│ K8s Deployments      │
│ (models / serving)   │
└──────────────────────┘

startScaler runs every 1 second. For each deployable found in K8s, it loads the scale state from Redis cache and calls scaleDeployable().

The scaling algorithm (computeNewReplicaCountV2) is HPA-inspired:

desiredReplicas = currentReplicas × (metricValue / targetMetricValue)

Where the metric is backlog-per-instance:

backlogPerInstance = (queueLength + queueHeadroom) / instances

Three dampening mechanisms prevent oscillation:

  1. Scaling policies — rate limits on scale-out and scale-in. Default scale-out: allow +5 count or +100% per minute (whichever is larger). Default scale-in: unrestricted rate.
  2. Stabilization windows — the algorithm considers the min (for scale-in) or max (for scale-out) desired replica count over a time window. Default scale-out: 30 seconds. Default scale-in: 2 minutes.
  3. Hysteresis — ignore small oscillations below a threshold (default 0.02).
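
Putting the formula and the hysteresis check together, a simplified Go sketch (scaling policies and stabilization windows are omitted; the function name is illustrative, not the real computeNewReplicaCountV2):

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the HPA-style ratio with a hysteresis deadband of
// 0.02, as described above.
func desiredReplicas(current int, queueLength, queueHeadroom, target float64) int {
	if current == 0 {
		current = 1 // avoid division by zero; the real scale-from-zero path is separate
	}
	backlogPerInstance := (queueLength + queueHeadroom) / float64(current)
	ratio := backlogPerInstance / target
	if math.Abs(ratio-1.0) <= 0.02 {
		return current // hysteresis: ignore small oscillations
	}
	return int(math.Ceil(float64(current) * ratio))
}

func main() {
	fmt.Println(desiredReplicas(4, 80, 0, 10)) // backlog 20/instance vs target 10: doubles to 8
	fmt.Println(desiredReplicas(4, 40, 0, 10)) // exactly at target: stays at 4
}
```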

Additional scaling features:

  • Slow start: cap at 5 replicas until the first pod reports ready
  • Scale-to-zero: supported, with configurable idle timeout delay. Gated by scale-to-zero-delay kill switch.
  • Override min replicas: per-deployable minimum via DeployableConfig
  • Emergency cap: cap-max-replicas feature flag

Scaling configuration

scaling.Config holds per-deployable scaling parameters:

Field              Default                             Source
MetricTarget       config.BacklogPerInstance flag      CLI flag
Hysteresis         0.02                                hardcoded
MinReplicas        0                                   DeployableConfig.ScalingConfig
MaxReplicas        config.MaxReplicas flag             CLI flag
ScaleOut behavior  30s stabilization, +5 or +100%/min  algorithm_v2_defaults.go
ScaleIn behavior   2min stabilization, no rate limit   algorithm_v2_defaults.go

Per-deployable overrides come from deployable.ScalingConfig, which is set via the web’s DeployableConfig and serialized into K8s annotations.

Deployment Pruner

┌──────────────────────┐
│ Redis cache          │
│ (last-request-time)  │
└──────┬───────────────┘
       │ read
       ▼
┌──────────────────────┐
│ Deployment Pruner    │
│ (1min)               │
└──────┬───────────────┘
       │ delete idle
       ▼
┌──────────────────────┐
│ K8s Deployments      │
│ (models / serving)   │
└──────────────────────┘

startDeploymentPruner runs every 1 minute. It deletes K8s Deployments that haven’t received a prediction since DeployableDeploymentDeleteInactiveAfter. Max 100 deletions per cycle. Fails safe if the latest request time is missing from Redis (skips the deployment rather than deleting it).

Queue Pruner

startQueuePruner runs every 1 hour. Deletes stuck requests older than 24 hours from per-deployable Redis streams. The API’s sweeper also cleans these streams at much shorter intervals (30s). See Queue Management for details on both cleanup mechanisms and how they overlap.


Workers AI: ai-scheduler

Workers AI deployment management is split across multiple systems. ai-scheduler is a Rust binary deployed to a core datacenter K8s cluster (pdx-c). It has multiple subcommands, each run as a separate K8s deployment: AutoScaler, Scheduler, AdminAPI, ReleaseManagerWatcher, ExternalNodesWatcher, RoutingUpdater. They share the scheduling crate as a library.

Source: ai-scheduler/ (89f8e0d)

Architecture overview

┌──────────────────────────────────────────────────────────┐
│                      ai-scheduler                        │
│                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────┐  │
│  │ auto_scaling │  │  admin_api   │  │ release_mgr    │  │
│  │ (5min loop)  │  │ (manual ops) │  │ _watcher       │  │
│  └──────┬───────┘  └──────┬───────┘  └───────┬────────┘  │
│         │                 │                  │           │
│         └────────┬────────┴──────────────────┘           │
│                  ▼                                       │
│         ┌──────────────┐                                 │
│         │  scheduling  │  (action execution)             │
│         └──────┬───────┘                                 │
│                │                                         │
└────────────────┼─────────────────────────────────────────┘
                 │
        ┌────────┴────────┐
        ▼                 ▼
  Cloud Chamber      External K8s
  (internal GPU)     (OKE, CoreWeave, Nebius,
                      Lambda, GCP, Crusoe)
                          │
                          ▼
                  inference-kubernetes-engine
                  (Model CRD operator)

Three entry points produce scheduling actions:

  • auto_scaling — utilization-based autoscaler loop
  • admin_api — manual scaling endpoints for humans
  • release_manager_watcher — watches for software version changes, triggers rolling updates

All actions flow through the scheduling app, which applies them to either Cloud Chamber (internal capacity) or external K8s clusters via the external_nodes module.

Autoscaler (auto_scaling)

The autoscaler runs every 5 minutes. Each cycle:

  1. Fetches model usage from ClickHouse (request counts, inference time per minute over 15-minute windows)
  2. Fetches usage “forecast” from ClickHouse (same-time-last-week data, not a forecasting model)
  3. Fetches utilization metrics from Prometheus (soft — errors produce empty data, not failures)
  4. Loads model configuration from Config API
  5. Gets current application state from Cloud Chamber
  6. Fetches external endpoint health from Quicksilver and counts healthy deployments from Cloud Chamber
  7. Handles external capacity scheduling (may emit ScheduleExternalModelApplications)
  8. Computes desired instance count per model
  9. Emits UpscaleModelApplications or DownscaleModelApplications actions

Scaling algorithms

Three algorithms coexist. Selection depends on per-model Config API properties:

1. Request-count-based (default fallback):

instances = avg_count_per_min × autoscaling_factor

Default autoscaling_factor is 0.8. This assumes each inference request takes ~1 minute. Crude, but serves as the baseline.

2. Utilization-based (model_utilization_autoscaler = true):

Computes utilization from cumulative inference time:

required_to_fit = inference_time_secs_per_min / 60 / max_concurrent_requests
utilization = required_to_fit / model_instances

Uses a deadband instead of dampening:

  • If utilization < 1 / (factor + 0.15) → downscale
  • If utilization > 1 / factor → upscale
  • Otherwise → no change

Minimum factor is clamped to 1.2 (20% overprovisioning floor). Takes the max of current and forecast inference time.
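
The two formulas and the deadband combine into a decision like the following Go sketch (the real implementation lives in the auto_scaling module; the function name and signature here are illustrative):

```go
package main

import "fmt"

// decide computes utilization from cumulative inference time, then applies
// the deadband with the autoscaling factor clamped to the 1.2 floor
// (20% overprovisioning).
func decide(inferenceSecsPerMin, maxConcurrent float64, instances int, factor float64) string {
	if factor < 1.2 {
		factor = 1.2 // minimum factor clamp
	}
	requiredToFit := inferenceSecsPerMin / 60 / maxConcurrent
	utilization := requiredToFit / float64(instances)
	switch {
	case utilization > 1/factor:
		return "upscale"
	case utilization < 1/(factor+0.15):
		return "downscale"
	default:
		return "no change" // inside the deadband
	}
}

func main() {
	// 720 inference-seconds per minute at 2 concurrent requests per instance
	// means required_to_fit = 720/60/2 = 6 instances' worth of work.
	fmt.Println(decide(720, 2, 7, 1.2)) // utilization ~0.86 > 1/1.2: upscale
	fmt.Println(decide(720, 2, 9, 1.2)) // utilization ~0.67 < 1/1.35: downscale
	fmt.Println(decide(720, 2, 8, 1.2)) // utilization 0.75, inside deadband: no change
}
```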

3. EZ utilization-based (scaling_config.utilization_based.active = true):

The newest algorithm. Configurable per-model via ScalingConfig:

scaling_config:
  disabled: false
  utilization_based:
    active: true
    min_utilization: 0.3
    max_utilization: 0.8
    utilization_scale: 1.0
    use_forecast: true
    use_out_of_cap: true
    prometheus: false

Features:

  • Configurable min/max utilization bounds (replaces hardcoded deadband)
  • Optional forecast-based scaling (use_forecast)
  • Out-of-capacity adjustment (use_out_of_cap): inflates measured utilization by (successful + out_of_cap) / successful to account for requests turned away. successful_per_min floored at 0.1 to avoid division by zero.
  • Safety cap: when OOC adjustment is active, measured utilization capped at 2× the sans-OOC value
  • Prometheus metrics as an alternative utilization source
  • Asymmetric instance base: upscale decisions use model_healthy_instances, downscale uses model_requested_instances
  • Unhealthy instance handling: adds +1 buffer when upscaling with unhealthy instances; gradually decrements requested count when >1 unhealthy instances exist
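
The out-of-capacity adjustment described above can be sketched as follows (a hypothetical Go illustration; `oocAdjust` is not the real function name):

```go
package main

import (
	"fmt"
	"math"
)

// oocAdjust inflates measured utilization by (successful + out_of_cap) /
// successful to account for requests turned away, flooring successful_per_min
// at 0.1 and capping the result at 2x the unadjusted value.
func oocAdjust(measured, successfulPerMin, outOfCapPerMin float64) float64 {
	s := math.Max(successfulPerMin, 0.1) // floor avoids division by zero
	adjusted := measured * (s + outOfCapPerMin) / s
	return math.Min(adjusted, 2*measured) // safety cap
}

func main() {
	fmt.Println(oocAdjust(0.5, 30, 15)) // a third turned away: 0.5 * 1.5 = 0.75
	fmt.Println(oocAdjust(0.5, 1, 100)) // would be 50.5, safety-capped at 1.0
}
```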

When active, this algorithm overrides the request-count-based algorithm. However, if model_utilization_autoscaler is also true, the older utilization-based algorithm takes final precedence. Precedence increases left to right: request-count → EZ utilization → old utilization.
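
The precedence collapses to a small selection sketch (flag names follow the Config API properties; the function itself is illustrative):

```go
package main

import "fmt"

// selectAlgorithm mirrors the priority ordering described above: the old
// utilization algorithm wins when its flag is set, otherwise EZ utilization
// when active, otherwise the request-count fallback.
func selectAlgorithm(modelUtilizationAutoscaler, ezUtilizationActive bool) string {
	switch {
	case modelUtilizationAutoscaler:
		return "old utilization"
	case ezUtilizationActive:
		return "EZ utilization"
	default:
		return "request count"
	}
}

func main() {
	fmt.Println(selectAlgorithm(false, false)) // request count
	fmt.Println(selectAlgorithm(false, true))  // EZ utilization
	fmt.Println(selectAlgorithm(true, true))   // old utilization wins
}
```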

Instance bounds

All algorithms clamp results to [min_count, max_count] from Config API:

  • Default min_count: 5 (set per autoscaler instance via CLI flag default_min_count)
  • Default max_count: 100
  • Downscale batch size capped at 10 per cycle

There is no idle-based scale-to-zero. Models stay provisioned at min_count (default 5, but can be set to 0 per-model).
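
The bounds and the downscale batch cap can be sketched together. A hypothetical Go illustration, assuming the batch cap is applied before the clamp:

```go
package main

import "fmt"

// clampInstances limits downscales to a batch of 10 per cycle, then clamps
// the result to [minCount, maxCount] from Config API.
func clampInstances(current, desired, minCount, maxCount int) int {
	if desired < current-10 {
		desired = current - 10 // downscale batch size cap
	}
	if desired < minCount {
		desired = minCount
	}
	if desired > maxCount {
		desired = maxCount
	}
	return desired
}

func main() {
	fmt.Println(clampInstances(50, 2, 5, 100))   // batch cap: only down to 40 this cycle
	fmt.Println(clampInstances(10, 500, 5, 100)) // max clamp: 100
	fmt.Println(clampInstances(6, 1, 5, 100))    // min clamp: 5
}
```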

Kill switches

  • disable_scaling — global Config API property, disables all scaling
  • scaling_config.disabled — per-model disable
  • Tier change grace period — skips models that recently changed tiers
  • Tier selector — autoscaler instances can be scoped to specific tier ranges

Scheduling actions

Actions are applied via Cloud Chamber API or external K8s:

Action                               Target         Effect
UpscaleModelApplications             Cloud Chamber  PATCH application instance count up
DownscaleModelApplications           Cloud Chamber  PATCH application instance count down
ScheduleExternalModelApplications    External K8s   Create/patch Model CRD replicas
CreateModelApplication               Cloud Chamber  Create new CC application + set instances
DeployModelApplicationToTiger        Cloud Chamber  Deploy to Tiger (canary) environment
DeleteDeployment                     Cloud Chamber  Delete specific deployment
RemoveModelApplication               Cloud Chamber  Delete entire application
ModifyApplicationRegions             Cloud Chamber  Modify region constraints
ModifyApplicationSchedulingPriority  Cloud Chamber  Modify scheduling priority
ModifyApplicationAffinities          Cloud Chamber  Modify colocation/affinity constraints

Instance count is clamped to 0–1400 per application. Up/downscale distributes changes round-robin across applications.

Actions are defined in the Action enum.
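
The round-robin distribution with the per-application clamp can be sketched as follows (illustrative Go; the real logic lives in the Rust scheduling crate):

```go
package main

import "fmt"

// distributeUpscale spreads an instance delta round-robin across a model's
// applications, honoring the 1400 per-application cap.
func distributeUpscale(counts []int, delta int) []int {
	out := append([]int(nil), counts...)
	for delta > 0 {
		progressed := false
		for i := range out {
			if delta == 0 {
				break
			}
			if out[i] < 1400 {
				out[i]++
				delta--
				progressed = true
			}
		}
		if !progressed {
			break // every application is at the 1400 cap
		}
	}
	return out
}

func main() {
	fmt.Println(distributeUpscale([]int{3, 3}, 3)) // one extra lands on the first app
}
```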

External capacity

External capacity (OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe) has a separate management model controlled by the ExternalCapacity config (Management enum):

pub enum Management {
    Disabled,   // no auto-management
    Manual,     // humans manage it entirely
    Upscaling,  // autoscaler can only scale UP
    Full,       // autoscaler can scale both directions
}

The autoscaler code has this comment:

NOTE: currently auto-scaler can only scale up external instances but not scale down. In order to scale down, use admin api endpoint: /models/schedule_externally

This comment may be stale — Management::Full supports both directions in the code. The real constraint is that the autoscaler’s scaling algorithms don’t dynamically compute external replica targets; external capacity is config-driven (expected_replicas), not utilization-driven.

The external path works as follows:

  1. Infrastructure: K8s clusters provisioned via Terraform (oci-terraform/ repo — OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe)
  2. Model operator: inference-kubernetes-engine watches Model CRDs and reconciles K8s Deployments/Services to match
  3. Scaling up: autoscaler patches Model CRD spec.replicas via KubernetesProvider.schedule()
  4. Scaling down: manual via admin API or the scale-down CLI tool in inference-kubernetes-engine, which gradually decrements replicas with a configurable interval and batch size
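
The gradual decrement in step 4 can be sketched as a sequence of replica targets (a hypothetical Go illustration of the scale-down tool's behavior; the per-step sleep interval is omitted):

```go
package main

import "fmt"

// scaleDownSteps walks replicas toward the target by a fixed batch size,
// one step per configured interval, never overshooting the target.
func scaleDownSteps(current, target, batch int) []int {
	var steps []int
	for current > target {
		current -= batch
		if current < target {
			current = target
		}
		steps = append(steps, current)
	}
	return steps
}

func main() {
	fmt.Println(scaleDownSteps(10, 3, 3)) // 10 -> 7 -> 4 -> 3
}
```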

Admin API

Manual scaling endpoints behind is_workers_ai_team() access check:

  • GET /models/scale — upscale a model by amount (creates UpscaleModelApplications action; notably a GET for a mutating op)
  • POST /models/schedule — run the full scheduler loop for a single model
  • POST /models/schedule_externally — set external replica count
  • POST /models/schedule_externally_on_specific_provider — target specific provider
  • POST /models/remove — delete specific deployment by colo/metal
  • POST /models/delete_applications — delete all applications for a model, including external CRDs
  • GET /models/status — model status/debugging
  • Tiger management — create, list, delete canary deployments

Model provisioning

Unlike Replicate, Workers AI models are pre-provisioned rather than created on demand from inference traffic:

  1. Model is registered in Config API with properties (min_count, max_count, software, gpu_memory, etc.)
  2. release_manager_watcher detects the software version and creates Cloud Chamber applications
  3. Autoscaler maintains instance count within [min_count, max_count] based on utilization
  4. There is no equivalent of Replicate’s deployment pruner — models stay provisioned at min_count until manually removed

Config API properties (scaling-relevant)

Property                      Default              Description
min_count                     5                    Minimum instances
max_count                     100                  Maximum instances
scaling_factor                0.8                  Autoscaling factor
model_utilization_autoscaler  false                Enable utilization-based algorithm
scaling_config                none                 YAML blob with disabled, utilization_based sub-config
disable_scaling               false                Kill switch (also available as global property)
external_capacity             none                 External provider config with management mode
gpu_memory                    22                   GPU memory request (GB)
gpu_model                     none                 Specific GPU model requirement
dual_gpu                      false                Requires two GPUs
colo_tier                     none                 Restrict to colo tier
colo_region                   none                 Restrict to colo region
tier                          “unknown-scheduler”  Model tier (Tier-0, Tier-1, Tier-2)

Key Differences

Aspect                  Replicate                                       Workers AI
Scaling signal          Queue depth (backlog per instance), real-time   Inference-time utilization, 15-minute
                                                                        ClickHouse windows
Loop frequency          1 second                                        5 minutes
Deployment creation     On demand from prediction traffic               Pre-provisioned via Config API + release
                                                                        manager
Scale-to-zero           Yes, with idle timeout                          No idle-based scale-to-zero; min_count
                                                                        defaults to 5 but can be 0 per-model
Deployment pruning      Automatic (idle deployments deleted             None; models stay provisioned until
                        after timeout)                                  manually removed
Dampening               HPA-style: scaling policies, stabilization      Deadband: upper/lower utilization bounds
                        windows, hysteresis
Orchestrator            Direct K8s API (Deployments)                    Cloud Chamber (internal) + K8s operator
                                                                        (external)
External capacity       N/A (single K8s cluster)                        Multi-provider (OKE, CoreWeave, Nebius,
                                                                        Lambda, GCP, Crusoe) with per-model
                                                                        management mode
Manual controls         Feature flags (cap-max-replicas,                Admin API endpoints, disable_scaling kill
                        scale-to-zero-delay)                            switch, Management::Manual mode
Scale-down on external  N/A                                             Config-driven; Management::Full supports
                                                                        both directions but targets are set
                                                                        manually, not utilization-driven
Algorithm selection     Single algorithm (HPA-inspired)                 Three algorithms with priority ordering:
                                                                        request-count → EZ utilization → old
                                                                        utilization
Config source           K8s annotations (from web’s DeployableConfig)   Config API properties (key-value store)

Architectural contrast

Replicate’s autoscaler is reactive and fine-grained: predictions create deployments, queue depth drives scaling at 1-second resolution, idle deployments get pruned. The system assumes workloads are bursty and ephemeral.

Workers AI’s ai-scheduler is proactive and coarse-grained: models are pre-provisioned with minimum instance counts, scaling adjusts within configured bounds at 5-minute resolution, and external capacity management is heavily human-assisted. The system assumes a catalog of always-available models.