Deployment Management

Overview

Replicate’s autoscaler is reactive: predictions create K8s Deployments on demand, queue depth drives scaling at 1-second resolution, and idle deployments get pruned automatically. Workers AI’s ai-scheduler is proactive: models are pre-provisioned with minimum instance counts, scaling adjusts within configured bounds at 5-minute resolution, and there is no idle-based pruning. The systems reflect fundamentally different assumptions — bursty ephemeral workloads vs. an always-available model catalog.


Replicate: Autoscaler

The autoscaler runs in every GPU model-serving Kubernetes cluster. It manages K8s Deployment objects in the models or serving namespaces — creating them when prediction traffic appears, scaling replica counts based on queue depth, and pruning idle ones.

Source: cluster/pkg/autoscaler/autoscaler.go (fa8042d)

Concurrent loops

Loop                     Interval              Purpose
Deployment Dispatcher    event-driven (BRPOP)  Creates/updates K8s Deployments when predictions arrive
Scaler                   1 second              Adjusts replica counts based on queue depth
Scale State Snapshotter  1 second              Captures queue lengths + replica counts into Redis
Queue Pruner             1 hour                Deletes stuck requests older than 24 hours
Deployment Pruner        1 minute              Deletes K8s Deployments that have been idle too long

Deployment Dispatcher

┌───────────────┐
│ replicate/api │
└──────┬────────┘
       │ LPUSH
       ▼
┌──────────────────────┐
│ prediction-versions  │
│ (Redis list)         │
└──────┬───────────────┘
       │ BRPOP
       ▼
┌──────────────────────┐
│ Deployment           │
│ Dispatcher           │
│ (event-driven)       │
└──────┬───────────────┘
       │ create/update
       ▼
┌──────────────────────┐
│ K8s Deployments      │
│ (models / serving)   │
└──────────────────────┘

startDeploymentDispatcher BRPOPs from the prediction-versions Redis queue. Each prediction triggers ensureDeployableDeployment, which creates or updates the K8s Deployment if the config has changed. Change detection uses a config hash comparison plus a template version serial.

Key behavior:

  • Deployments are created on demand — the first prediction for a version/deployment triggers K8s Deployment creation
  • Rate-limited to avoid overwhelming the K8s API
  • Config hash comparison means no-op if nothing changed
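
The no-op behavior can be illustrated with a config-hash comparison. This is a hypothetical Go sketch, not the autoscaler's actual code; `DeployableConfig`, `configHash`, and `needsUpdate` are illustrative names standing in for the real change-detection logic.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// DeployableConfig is a stand-in for the fields that feed change detection,
// including the template version serial mentioned above.
type DeployableConfig struct {
	Image          string
	TemplateSerial int
}

// configHash derives a stable hash over the config fields.
func configHash(c DeployableConfig) string {
	h := sha256.Sum256([]byte(fmt.Sprintf("%s|%d", c.Image, c.TemplateSerial)))
	return hex.EncodeToString(h[:])
}

// needsUpdate compares the desired config hash against the hash recorded on
// the live Deployment; equal hashes make the dispatch a no-op.
func needsUpdate(liveHash string, desired DeployableConfig) bool {
	return liveHash != configHash(desired)
}

func main() {
	cfg := DeployableConfig{Image: "model:v1", TemplateSerial: 3}
	live := configHash(cfg)
	fmt.Println(needsUpdate(live, cfg)) // false: nothing changed, no-op
	fmt.Println(needsUpdate(live, DeployableConfig{Image: "model:v2", TemplateSerial: 3})) // true: image changed
}
```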

Scaler

┌──────────────────────┐
│ K8s Deployments      │
│ (models / serving)   │
└──────┬───────────────┘
       │ read replicas + queue lengths
       ▼
┌──────────────────────┐
│ Snapshotter (1s)     │
└──────┬───────────────┘
       │ write
       ▼
┌──────────────────────┐
│ Redis cache          │
│ (scale state)        │
└──────┬───────────────┘
       │ read
       ▼
┌──────────────────────┐
│ Scaler (1s)          │
│ computeNewReplica    │
│ CountV2              │
└──────┬───────────────┘
       │ patch replicas
       ▼
┌──────────────────────┐
│ K8s Deployments      │
│ (models / serving)   │
└──────────────────────┘

startScaler runs every 1 second. For each deployable found in K8s, it loads the scale state from Redis cache and calls scaleDeployable().

The scaling algorithm (computeNewReplicaCountV2) is HPA-inspired:

desiredReplicas = currentReplicas × (metricValue / targetMetricValue)

Where the metric is backlog-per-instance:

backlogPerInstance = (queueLength + queueHeadroom) / instances

Three dampening mechanisms prevent oscillation:

  1. Scaling policies — rate limits on scale-out and scale-in. Default scale-out: allow +5 count or +100% per minute (whichever is larger). Default scale-in: unrestricted rate.
  2. Stabilization windows — the algorithm considers the min (for scale-in) or max (for scale-out) desired replica count over a time window. Default scale-out: 30 seconds. Default scale-in: 2 minutes.
  3. Hysteresis — ignore small oscillations below a threshold (default 0.02).
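
Putting the formula and the hysteresis check together, a simplified Go sketch (scaling policies and stabilization windows are omitted; the function name is illustrative, not the real computeNewReplicaCountV2):

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the HPA-style ratio with a hysteresis deadband of
// 0.02, as described above.
func desiredReplicas(current int, queueLength, queueHeadroom, target float64) int {
	if current == 0 {
		current = 1 // avoid division by zero; the real scale-from-zero path is separate
	}
	backlogPerInstance := (queueLength + queueHeadroom) / float64(current)
	ratio := backlogPerInstance / target
	if math.Abs(ratio-1.0) <= 0.02 {
		return current // hysteresis: ignore small oscillations
	}
	return int(math.Ceil(float64(current) * ratio))
}

func main() {
	fmt.Println(desiredReplicas(4, 80, 0, 10)) // backlog 20/instance vs target 10: doubles to 8
	fmt.Println(desiredReplicas(4, 40, 0, 10)) // exactly at target: stays at 4
}
```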

Additional scaling features:

  • Slow start: cap at 5 replicas until the first pod reports ready
  • Scale-to-zero: supported, with configurable idle timeout delay. Gated by scale-to-zero-delay kill switch.
  • Override min replicas: per-deployable minimum via DeployableConfig
  • Emergency cap: cap-max-replicas feature flag

Scaling configuration

scaling.Config holds per-deployable scaling parameters:

Field              Default                             Source
MetricTarget       config.BacklogPerInstance flag      CLI flag
Hysteresis         0.02                                hardcoded
MinReplicas        0                                   DeployableConfig.ScalingConfig
MaxReplicas        config.MaxReplicas flag             CLI flag
ScaleOut behavior  30s stabilization, +5 or +100%/min  algorithm_v2_defaults.go
ScaleIn behavior   2min stabilization, no rate limit   algorithm_v2_defaults.go

Per-deployable overrides come from deployable.ScalingConfig, which is set via the web’s DeployableConfig and serialized into K8s annotations.

Deployment Pruner

┌──────────────────────┐
│ Redis cache          │
│ (last-request-time)  │
└──────┬───────────────┘
       │ read
       ▼
┌──────────────────────┐
│ Deployment Pruner    │
│ (1min)               │
└──────┬───────────────┘
       │ delete idle
       ▼
┌──────────────────────┐
│ K8s Deployments      │
│ (models / serving)   │
└──────────────────────┘

startDeploymentPruner runs every 1 minute. It deletes K8s Deployments that haven’t received a prediction since DeployableDeploymentDeleteInactiveAfter. Max 100 deletions per cycle. Fails safe if the latest request time is missing from Redis (skips the deployment rather than deleting it).

Queue Pruner

startQueuePruner runs every 1 hour. Deletes stuck requests older than 24 hours from per-deployable Redis streams. The API’s sweeper also cleans these streams at much shorter intervals (30s). See Queue Management for details on both cleanup mechanisms and how they overlap.


Workers AI: ai-scheduler

Workers AI deployment management is split across multiple systems. ai-scheduler is a Rust binary deployed to a core datacenter K8s cluster (pdx-c). It has multiple subcommands, each run as a separate K8s deployment: AutoScaler, Scheduler, AdminAPI, ReleaseManagerWatcher, ExternalNodesWatcher, RoutingUpdater. They share the scheduling crate as a library.

Source: ai-scheduler/ (89f8e0d)

Architecture overview

┌──────────────────────────────────────────────────────────┐
│                      ai-scheduler                        │
│                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────┐  │
│  │ auto_scaling │  │  admin_api   │  │ release_mgr    │  │
│  │ (5min loop)  │  │ (manual ops) │  │ _watcher       │  │
│  └──────┬───────┘  └──────┬───────┘  └───────┬────────┘  │
│         │                 │                  │           │
│         └────────┬────────┴──────────────────┘           │
│                  ▼                                       │
│         ┌──────────────┐                                 │
│         │  scheduling  │  (action execution)             │
│         └──────┬───────┘                                 │
│                │                                         │
└────────────────┼─────────────────────────────────────────┘
                 │
        ┌────────┴────────┐
        ▼                 ▼
  Cloud Chamber      External K8s
  (internal GPU)     (OKE, CoreWeave, Nebius,
                      Lambda, GCP, Crusoe)
                          │
                          ▼
                  inference-kubernetes-engine
                  (Model CRD operator)

Three entry points produce scheduling actions:

  • auto_scaling — utilization-based autoscaler loop
  • admin_api — manual scaling endpoints for humans
  • release_manager_watcher — watches for software version changes, triggers rolling updates

All actions flow through the scheduling app, which applies them to either Cloud Chamber (internal capacity) or external K8s clusters via the external_nodes module.

Autoscaler (auto_scaling)

The autoscaler runs every 5 minutes. Each cycle:

  1. Fetches model usage from ClickHouse (request counts, inference time per minute over 15-minute windows)
  2. Fetches usage “forecast” from ClickHouse (same-time-last-week data, not a forecasting model)
  3. Fetches utilization metrics from Prometheus (soft — errors produce empty data, not failures)
  4. Loads model configuration from Config API
  5. Gets current application state from Cloud Chamber
  6. Fetches external endpoint health from Quicksilver and counts healthy deployments from Cloud Chamber
  7. Handles external capacity scheduling (may emit ScheduleExternalModelApplications)
  8. Computes desired instance count per model
  9. Emits UpscaleModelApplications or DownscaleModelApplications actions

Scaling algorithms

Three algorithms coexist. Selection depends on per-model Config API properties:

1. Request-count-based (default fallback):

instances = avg_count_per_min × autoscaling_factor

Default autoscaling_factor is 0.8. This assumes each inference request takes ~1 minute. Crude, but serves as the baseline.

2. Utilization-based (model_utilization_autoscaler = true):

Computes utilization from cumulative inference time:

required_to_fit = inference_time_secs_per_min / 60 / max_concurrent_requests
utilization = required_to_fit / model_instances

Uses a deadband instead of dampening:

  • If utilization < 1 / (factor + 0.15) → downscale
  • If utilization > 1 / factor → upscale
  • Otherwise → no change

Minimum factor is clamped to 1.2 (20% overprovisioning floor). Takes the max of current and forecast inference time.
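
The two formulas and the deadband combine into a decision like the following Go sketch (the real implementation lives in the auto_scaling module; the function name and signature here are illustrative):

```go
package main

import "fmt"

// decide computes utilization from cumulative inference time, then applies
// the deadband with the autoscaling factor clamped to the 1.2 floor
// (20% overprovisioning).
func decide(inferenceSecsPerMin, maxConcurrent float64, instances int, factor float64) string {
	if factor < 1.2 {
		factor = 1.2 // minimum factor clamp
	}
	requiredToFit := inferenceSecsPerMin / 60 / maxConcurrent
	utilization := requiredToFit / float64(instances)
	switch {
	case utilization > 1/factor:
		return "upscale"
	case utilization < 1/(factor+0.15):
		return "downscale"
	default:
		return "no change" // inside the deadband
	}
}

func main() {
	// 720 inference-seconds per minute at 2 concurrent requests per instance
	// means required_to_fit = 720/60/2 = 6 instances' worth of work.
	fmt.Println(decide(720, 2, 7, 1.2)) // utilization ~0.86 > 1/1.2: upscale
	fmt.Println(decide(720, 2, 9, 1.2)) // utilization ~0.67 < 1/1.35: downscale
	fmt.Println(decide(720, 2, 8, 1.2)) // utilization 0.75, inside deadband: no change
}
```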

3. EZ utilization-based (scaling_config.utilization_based.active = true):

The newest algorithm. Configurable per-model via ScalingConfig:

scaling_config:
  disabled: false
  utilization_based:
    active: true
    min_utilization: 0.3
    max_utilization: 0.8
    utilization_scale: 1.0
    use_forecast: true
    use_out_of_cap: true
    prometheus: false

Features:

  • Configurable min/max utilization bounds (replaces hardcoded deadband)
  • Optional forecast-based scaling (use_forecast)
  • Out-of-capacity adjustment (use_out_of_cap): inflates measured utilization by (successful + out_of_cap) / successful to account for requests turned away. successful_per_min floored at 0.1 to avoid division by zero.
  • Safety cap: when OOC adjustment is active, measured utilization capped at 2× the sans-OOC value
  • Prometheus metrics as an alternative utilization source
  • Asymmetric instance base: upscale decisions use model_healthy_instances, downscale uses model_requested_instances
  • Unhealthy instance handling: adds +1 buffer when upscaling with unhealthy instances; gradually decrements requested count when >1 unhealthy instances exist
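
The out-of-capacity adjustment described above can be sketched as follows (a hypothetical Go illustration; `oocAdjust` is not the real function name):

```go
package main

import (
	"fmt"
	"math"
)

// oocAdjust inflates measured utilization by (successful + out_of_cap) /
// successful to account for requests turned away, flooring successful_per_min
// at 0.1 and capping the result at 2x the unadjusted value.
func oocAdjust(measured, successfulPerMin, outOfCapPerMin float64) float64 {
	s := math.Max(successfulPerMin, 0.1) // floor avoids division by zero
	adjusted := measured * (s + outOfCapPerMin) / s
	return math.Min(adjusted, 2*measured) // safety cap
}

func main() {
	fmt.Println(oocAdjust(0.5, 30, 15)) // a third turned away: 0.5 * 1.5 = 0.75
	fmt.Println(oocAdjust(0.5, 1, 100)) // would be 50.5, safety-capped at 1.0
}
```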

When active, this algorithm overrides the request-count-based algorithm. However, if model_utilization_autoscaler is also true, the older utilization-based algorithm takes final precedence. Precedence increases left to right: request-count → EZ utilization → old utilization.
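
The precedence collapses to a small selection sketch (flag names follow the Config API properties; the function itself is illustrative):

```go
package main

import "fmt"

// selectAlgorithm mirrors the priority ordering described above: the old
// utilization algorithm wins when its flag is set, otherwise EZ utilization
// when active, otherwise the request-count fallback.
func selectAlgorithm(modelUtilizationAutoscaler, ezUtilizationActive bool) string {
	switch {
	case modelUtilizationAutoscaler:
		return "old utilization"
	case ezUtilizationActive:
		return "EZ utilization"
	default:
		return "request count"
	}
}

func main() {
	fmt.Println(selectAlgorithm(false, false)) // request count
	fmt.Println(selectAlgorithm(false, true))  // EZ utilization
	fmt.Println(selectAlgorithm(true, true))   // old utilization wins
}
```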

Instance bounds

All algorithms clamp results to [min_count, max_count] from Config API:

  • Default min_count: 5 (set per autoscaler instance via CLI flag default_min_count)
  • Default max_count: 100
  • Downscale batch size capped at 10 per cycle

There is no idle-based scale-to-zero. Models stay provisioned at min_count (default 5, but can be set to 0 per-model).
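
The bounds and the downscale batch cap can be sketched together. A hypothetical Go illustration, assuming the batch cap is applied before the clamp:

```go
package main

import "fmt"

// clampInstances limits downscales to a batch of 10 per cycle, then clamps
// the result to [minCount, maxCount] from Config API.
func clampInstances(current, desired, minCount, maxCount int) int {
	if desired < current-10 {
		desired = current - 10 // downscale batch size cap
	}
	if desired < minCount {
		desired = minCount
	}
	if desired > maxCount {
		desired = maxCount
	}
	return desired
}

func main() {
	fmt.Println(clampInstances(50, 2, 5, 100))   // batch cap: only down to 40 this cycle
	fmt.Println(clampInstances(10, 500, 5, 100)) // max clamp: 100
	fmt.Println(clampInstances(6, 1, 5, 100))    // min clamp: 5
}
```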

Kill switches

  • disable_scaling — global Config API property, disables all scaling
  • scaling_config.disabled — per-model disable
  • Tier change grace period — skips models that recently changed tiers
  • Tier selector — autoscaler instances can be scoped to specific tier ranges

Scheduling actions

Actions are applied via Cloud Chamber API or external K8s:

Action                               Target         Effect
UpscaleModelApplications             Cloud Chamber  PATCH application instance count up
DownscaleModelApplications           Cloud Chamber  PATCH application instance count down
ScheduleExternalModelApplications    External K8s   Create/patch Model CRD replicas
CreateModelApplication               Cloud Chamber  Create new CC application + set instances
DeployModelApplicationToTiger        Cloud Chamber  Deploy to Tiger (canary) environment
DeleteDeployment                     Cloud Chamber  Delete specific deployment
RemoveModelApplication               Cloud Chamber  Delete entire application
ModifyApplicationRegions             Cloud Chamber  Modify region constraints
ModifyApplicationSchedulingPriority  Cloud Chamber  Modify scheduling priority
ModifyApplicationAffinities          Cloud Chamber  Modify colocation/affinity constraints

Instance count is clamped to 0–1400 per application. Up/downscale distributes changes round-robin across applications.

Actions are defined in the Action enum.
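
The round-robin distribution with the per-application clamp can be sketched as follows (illustrative Go; the real logic lives in the Rust scheduling crate):

```go
package main

import "fmt"

// distributeUpscale spreads an instance delta round-robin across a model's
// applications, honoring the 1400 per-application cap.
func distributeUpscale(counts []int, delta int) []int {
	out := append([]int(nil), counts...)
	for delta > 0 {
		progressed := false
		for i := range out {
			if delta == 0 {
				break
			}
			if out[i] < 1400 {
				out[i]++
				delta--
				progressed = true
			}
		}
		if !progressed {
			break // every application is at the 1400 cap
		}
	}
	return out
}

func main() {
	fmt.Println(distributeUpscale([]int{3, 3}, 3)) // one extra lands on the first app
}
```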

External capacity

External capacity (OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe) has a separate management model controlled by the ExternalCapacity config (Management enum):

pub enum Management {
    Disabled,   // no auto-management
    Manual,     // humans manage it entirely
    Upscaling,  // autoscaler can only scale UP
    Full,       // autoscaler can scale both directions
}

The autoscaler code has this comment:

NOTE: currently auto-scaler can only scale up external instances but not scale down. In order to scale down, use admin api endpoint: /models/schedule_externally

This comment may be stale — Management::Full supports both directions in the code. The real constraint is that the autoscaler’s scaling algorithms don’t dynamically compute external replica targets; external capacity is config-driven (expected_replicas), not utilization-driven.

The external path works as follows:

  1. Infrastructure: K8s clusters provisioned via Terraform (oci-terraform/ repo — OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe)
  2. Model operator: inference-kubernetes-engine watches Model CRDs and reconciles K8s Deployments/Services to match
  3. Scaling up: autoscaler patches Model CRD spec.replicas via KubernetesProvider.schedule()
  4. Scaling down: manual via admin API or the scale-down CLI tool in inference-kubernetes-engine, which gradually decrements replicas with a configurable interval and batch size
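
The gradual decrement in step 4 can be sketched as a sequence of replica targets (a hypothetical Go illustration of the scale-down tool's behavior; the per-step sleep interval is omitted):

```go
package main

import "fmt"

// scaleDownSteps walks replicas toward the target by a fixed batch size,
// one step per configured interval, never overshooting the target.
func scaleDownSteps(current, target, batch int) []int {
	var steps []int
	for current > target {
		current -= batch
		if current < target {
			current = target
		}
		steps = append(steps, current)
	}
	return steps
}

func main() {
	fmt.Println(scaleDownSteps(10, 3, 3)) // 10 -> 7 -> 4 -> 3
}
```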

Admin API

Manual scaling endpoints behind is_workers_ai_team() access check:

  • GET /models/scale — upscale a model by amount (creates UpscaleModelApplications action; notably a GET for a mutating op)
  • POST /models/schedule — run the full scheduler loop for a single model
  • POST /models/schedule_externally — set external replica count
  • POST /models/schedule_externally_on_specific_provider — target specific provider
  • POST /models/remove — delete specific deployment by colo/metal
  • POST /models/delete_applications — delete all applications for a model, including external CRDs
  • GET /models/status — model status/debugging
  • Tiger management — create, list, delete canary deployments

Model provisioning

Unlike Replicate, Workers AI models are pre-provisioned rather than created on demand from inference traffic:

  1. Model is registered in Config API with properties (min_count, max_count, software, gpu_memory, etc.)
  2. release_manager_watcher detects the software version and creates Cloud Chamber applications
  3. Autoscaler maintains instance count within [min_count, max_count] based on utilization
  4. There is no equivalent of Replicate’s deployment pruner — models stay provisioned at min_count until manually removed

Config API properties (scaling-relevant)

Property                      Default              Description
min_count                     5                    Minimum instances
max_count                     100                  Maximum instances
scaling_factor                0.8                  Autoscaling factor
model_utilization_autoscaler  false                Enable utilization-based algorithm
scaling_config                none                 YAML blob with disabled, utilization_based sub-config
disable_scaling               false                Kill switch (also available as global property)
external_capacity             none                 External provider config with management mode
gpu_memory                    22                   GPU memory request (GB)
gpu_model                     none                 Specific GPU model requirement
dual_gpu                      false                Requires two GPUs
colo_tier                     none                 Restrict to colo tier
colo_region                   none                 Restrict to colo region
tier                          “unknown-scheduler”  Model tier (Tier-0, Tier-1, Tier-2)

Key Differences

Aspect                  Replicate                                       Workers AI
Scaling signal          Queue depth (backlog per instance), real-time   Inference-time utilization, 15-minute
                                                                        ClickHouse windows
Loop frequency          1 second                                        5 minutes
Deployment creation     On demand from prediction traffic               Pre-provisioned via Config API + release
                                                                        manager
Scale-to-zero           Yes, with idle timeout                          No idle-based scale-to-zero; min_count
                                                                        defaults to 5 but can be 0 per-model
Deployment pruning      Automatic (idle deployments deleted             None; models stay provisioned until
                        after timeout)                                  manually removed
Dampening               HPA-style: scaling policies, stabilization      Deadband: upper/lower utilization bounds
                        windows, hysteresis
Orchestrator            Direct K8s API (Deployments)                    Cloud Chamber (internal) + K8s operator
                                                                        (external)
External capacity       N/A (single K8s cluster)                        Multi-provider (OKE, CoreWeave, Nebius,
                                                                        Lambda, GCP, Crusoe) with per-model
                                                                        management mode
Manual controls         Feature flags (cap-max-replicas,                Admin API endpoints, disable_scaling kill
                        scale-to-zero-delay)                            switch, Management::Manual mode
Scale-down on external  N/A                                             Config-driven; Management::Full supports
                                                                        both directions but targets are set
                                                                        manually, not utilization-driven
Algorithm selection     Single algorithm (HPA-inspired)                 Three algorithms with priority ordering:
                                                                        request-count → EZ utilization → old
                                                                        utilization
Config source           K8s annotations (from web’s DeployableConfig)   Config API properties (key-value store)

Architectural contrast

Replicate’s autoscaler is reactive and fine-grained: predictions create deployments, queue depth drives scaling at 1-second resolution, idle deployments get pruned. The system assumes workloads are bursty and ephemeral.

Workers AI’s ai-scheduler is proactive and coarse-grained: models are pre-provisioned with minimum instance counts, scaling adjusts within configured bounds at 5-minute resolution, and external capacity management is heavily human-assisted. The system assumes a catalog of always-available models.