Deployment Management
Overview
Replicate’s autoscaler is reactive: predictions create K8s Deployments on demand, queue depth drives scaling at 1-second resolution, and idle deployments get pruned automatically. Workers AI’s ai-scheduler is proactive: models are pre-provisioned with minimum instance counts, scaling adjusts within configured bounds at 5-minute resolution, and there is no idle-based pruning. The systems reflect fundamentally different assumptions — bursty ephemeral workloads vs. an always-available model catalog.
Replicate: Autoscaler
The autoscaler runs in every GPU model-serving Kubernetes cluster. It manages K8s
Deployment objects in the models or serving namespaces — creating them when prediction
traffic appears, scaling replica counts based on queue depth, and pruning idle ones.
Source: cluster/pkg/autoscaler/autoscaler.go
(fa8042d)
Concurrent loops
| Loop | Interval | Purpose |
|---|---|---|
| Deployment Dispatcher | event-driven (BRPOP) | Creates/updates K8s Deployments when predictions arrive |
| Scaler | 1 second | Adjusts replica counts based on queue depth |
| Scale State Snapshotter | 1 second | Captures queue lengths + replica counts into Redis |
| Queue Pruner | 1 hour | Deletes stuck requests older than 24 hours |
| Deployment Pruner | 1 minute | Deletes K8s Deployments that have been idle too long |
Deployment Dispatcher
┌───────────────┐
│ replicate/api │
└──────┬────────┘
│ LPUSH
▼
┌──────────────────────┐
│ prediction-versions │
│ (Redis list) │
└──────┬───────────────┘
│ BRPOP
▼
┌──────────────────────┐
│ Deployment │
│ Dispatcher │
│ (event-driven) │
└──────┬───────────────┘
│ create/update
▼
┌──────────────────────┐
│ K8s Deployments │
│ (models / serving) │
└──────────────────────┘
startDeploymentDispatcher BRPOPs from the
prediction-versions Redis queue. Each prediction triggers
ensureDeployableDeployment, which creates or
updates the K8s Deployment if the config has changed. Change detection uses a
config hash comparison plus a template version serial.
Key behavior:
- Deployments are created on demand — the first prediction for a version/deployment triggers K8s Deployment creation
- Rate-limited to avoid overwhelming the K8s API
- Config hash comparison means no-op if nothing changed
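The hash-based change detection can be sketched as follows. This is a minimal Go sketch with invented type and field names (the real config struct and `ensureDeployableDeployment` live in cluster/pkg/autoscaler); it only illustrates the no-op-when-unchanged idea.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// DeployableConfig is a stand-in for whatever the dispatcher actually hashes;
// the field set here is hypothetical.
type DeployableConfig struct {
	Image           string `json:"image"`
	Hardware        string `json:"hardware"`
	TemplateVersion int    `json:"template_version"`
}

// configHash produces a stable digest of the config. Comparing it against the
// hash recorded on the existing K8s Deployment (e.g. in an annotation) makes
// the update a no-op when nothing changed.
func configHash(c DeployableConfig) string {
	b, _ := json.Marshal(c)
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// needsUpdate mirrors the described change detection: a config hash
// comparison plus a template version serial.
func needsUpdate(existingHash string, existingSerial int, desired DeployableConfig) bool {
	return configHash(desired) != existingHash || desired.TemplateVersion != existingSerial
}

func main() {
	cfg := DeployableConfig{Image: "r8.im/foo:v1", Hardware: "a100", TemplateVersion: 3}
	// Same config and serial: the dispatcher can skip the K8s API call.
	fmt.Println(needsUpdate(configHash(cfg), 3, cfg))
}
```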
Scaler
┌──────────────────────┐
│ K8s Deployments │
│ (models / serving) │
└──────┬───────────────┘
│ read replicas + queue lengths
▼
┌──────────────────────┐
│ Snapshotter (1s) │
└──────┬───────────────┘
│ write
▼
┌──────────────────────┐
│ Redis cache │
│ (scale state) │
└──────┬───────────────┘
│ read
▼
┌──────────────────────┐
│ Scaler (1s) │
│ computeNewReplica │
│ CountV2 │
└──────┬───────────────┘
│ patch replicas
▼
┌──────────────────────┐
│ K8s Deployments │
│ (models / serving) │
└──────────────────────┘
startScaler runs every 1 second. For each deployable found
in K8s, it loads the scale state from Redis cache and calls
scaleDeployable().
The scaling algorithm (computeNewReplicaCountV2)
is HPA-inspired:
desiredReplicas = currentReplicas × (metricValue / targetMetricValue)
Where the metric is backlog-per-instance:
backlogPerInstance = (queueLength + queueHeadroom) / instances
Three dampening mechanisms prevent oscillation:
- Scaling policies — rate limits on scale-out and scale-in. Default scale-out: allow +5 count or +100% per minute (whichever is larger). Default scale-in: unrestricted rate.
- Stabilization windows — the algorithm considers the min (for scale-in) or max (for scale-out) desired replica count over a time window. Default scale-out: 30 seconds. Default scale-in: 2 minutes.
- Hysteresis — ignore small oscillations below a threshold (default 0.02).
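The core formula plus the hysteresis check can be sketched in Go as follows. Names are invented, and the real computeNewReplicaCountV2 additionally applies the scaling policies and stabilization windows described above; the scale-from-zero branch here is purely illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas sketches the HPA-style core: desired = current × (metric /
// target), where the metric is backlog per instance.
func desiredReplicas(current int, queueLength, queueHeadroom, target, hysteresis float64) int {
	if current == 0 {
		// Illustrative scale-from-zero: any backlog wakes one replica.
		if queueLength > 0 {
			return 1
		}
		return 0
	}
	backlogPerInstance := (queueLength + queueHeadroom) / float64(current)
	ratio := backlogPerInstance / target
	// Hysteresis: ignore small oscillations around the target (default 0.02).
	if math.Abs(ratio-1.0) <= hysteresis {
		return current
	}
	return int(math.Ceil(float64(current) * ratio))
}

func main() {
	// 4 replicas, 40 queued items, target backlog of 5 per instance → 8.
	fmt.Println(desiredReplicas(4, 40, 0, 5, 0.02))
}
```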
Additional scaling features:
- Slow start: cap at 5 replicas until the first pod reports ready
- Scale-to-zero: supported, with a configurable idle timeout delay. Gated by the scale-to-zero-delay kill switch.
- Override min replicas: per-deployable minimum via DeployableConfig
- Emergency cap: cap-max-replicas feature flag
Scaling configuration
scaling.Config holds per-deployable scaling parameters:
| Field | Default | Source |
|---|---|---|
| MetricTarget | config.BacklogPerInstance flag | CLI flag |
| Hysteresis | 0.02 | hardcoded |
| MinReplicas | 0 | DeployableConfig.ScalingConfig |
| MaxReplicas | config.MaxReplicas flag | CLI flag |
| ScaleOut behavior | 30s stabilization, +5 or +100%/min | algorithm_v2_defaults.go |
| ScaleIn behavior | 2min stabilization, no rate limit | algorithm_v2_defaults.go |
Per-deployable overrides come from deployable.ScalingConfig,
which is set via the web’s DeployableConfig and serialized
into K8s annotations.
Deployment Pruner
┌──────────────────────┐
│ Redis cache │
│ (last-request-time) │
└──────┬───────────────┘
│ read
▼
┌──────────────────────┐
│ Deployment Pruner │
│ (1min) │
└──────┬───────────────┘
│ delete idle
▼
┌──────────────────────┐
│ K8s Deployments │
│ (models / serving) │
└──────────────────────┘
startDeploymentPruner runs every 1 minute. It
deletes K8s Deployments that haven’t received a prediction since
DeployableDeploymentDeleteInactiveAfter.
Max 100 deletions per cycle. Fails safe if the latest request time is missing
from Redis (skips the deployment rather than deleting it).
Queue Pruner
startQueuePruner runs every 1 hour.
Deletes stuck requests older than 24 hours from per-deployable
Redis streams. The API’s sweeper also cleans these streams at much
shorter intervals (30s). See
Queue Management for details on both
cleanup mechanisms and how they overlap.
Workers AI: ai-scheduler
Workers AI deployment management is split across multiple systems.
ai-scheduler is a Rust binary deployed to a core datacenter K8s
cluster (pdx-c). It has multiple subcommands, each run as a
separate K8s deployment: AutoScaler, Scheduler, AdminAPI,
ReleaseManagerWatcher, ExternalNodesWatcher, RoutingUpdater.
They share the scheduling crate as a library.
Source: ai-scheduler/ (89f8e0d)
Architecture overview
┌──────────────────────────────────────────────────────────┐
│ ai-scheduler │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────┐ │
│ │ auto_scaling │ │ admin_api │ │ release_mgr │ │
│ │ (5min loop) │ │ (manual ops) │ │ _watcher │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬─────────┘ │
│ │ │ │ │
│ └────────┬────────┴──────────────────┘ │
│ ▼ │
│ ┌──────────────┐ │
│ │ scheduling │ (action execution) │
│ └──────┬───────┘ │
│ │ │
└────────────────┼─────────────────────────────────────────┘
│
┌────────┴────────┐
▼ ▼
Cloud Chamber External K8s
(internal GPU) (OKE, CoreWeave, Nebius,
Lambda, GCP, Crusoe)
│
▼
inference-kubernetes-engine
(Model CRD operator)
Three entry points produce scheduling actions:
- auto_scaling — utilization-based autoscaler loop
- admin_api — manual scaling endpoints for humans
- release_manager_watcher — watches for software version changes, triggers rolling updates
All actions flow through the scheduling app, which applies them to either
Cloud Chamber (internal capacity) or external K8s clusters via the
external_nodes module.
Autoscaler (auto_scaling)
The autoscaler runs every 5 minutes. Each cycle:
- Fetches model usage from ClickHouse (request counts, inference time per minute over 15-minute windows)
- Fetches usage “forecast” from ClickHouse (same-time-last-week data, not a forecasting model)
- Fetches utilization metrics from Prometheus (soft — errors produce empty data, not failures)
- Loads model configuration from Config API
- Gets current application state from Cloud Chamber
- Fetches external endpoint health from Quicksilver and counts healthy deployments from Cloud Chamber
- Handles external capacity scheduling (may emit ScheduleExternalModelApplications)
- Computes desired instance count per model
- Emits UpscaleModelApplications or DownscaleModelApplications actions
Scaling algorithms
Three algorithms coexist. Selection depends on per-model Config API properties:
1. Request-count-based (default fallback):
instances = avg_count_per_min × autoscaling_factor
Default autoscaling_factor is 0.8. This assumes each inference request takes
~1 minute. Crude, but serves as the baseline.
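Under that one-request-per-instance-minute assumption, the baseline reduces to a single multiplication (a Go sketch; the rounding mode is a guess):

```go
package main

import (
	"fmt"
	"math"
)

// requestCountInstances sketches the baseline algorithm: average requests per
// minute times the autoscaling factor (default 0.8), assuming each request
// occupies one instance for about a minute.
func requestCountInstances(avgCountPerMin, factor float64) int {
	return int(math.Ceil(avgCountPerMin * factor))
}

func main() {
	// 50 req/min × 0.8 → 40 instances.
	fmt.Println(requestCountInstances(50, 0.8))
}
```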
2. Utilization-based (model_utilization_autoscaler = true):
Computes utilization from cumulative inference time:
required_to_fit = inference_time_secs_per_min / 60 / max_concurrent_requests
utilization = required_to_fit / model_instances
Uses a deadband instead of dampening:
- If utilization < 1 / (factor + 0.15) → downscale
- If utilization > 1 / factor → upscale
- Otherwise → no change
Minimum factor is clamped to 1.2 (20% overprovisioning floor). Takes the max of current and forecast inference time.
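Putting the utilization formula and deadband together, the decision can be sketched as (a Go sketch with invented names; the real implementation also takes the max of current and forecast inference time):

```go
package main

import "fmt"

// utilizationDecision sketches the deadband: utilization is cumulative
// inference time converted into "instances required to fit", divided by the
// current instance count, with asymmetric thresholds derived from the factor.
func utilizationDecision(inferenceSecsPerMin, maxConcurrent float64, instances int, factor float64) string {
	if factor < 1.2 {
		factor = 1.2 // 20% overprovisioning floor
	}
	requiredToFit := inferenceSecsPerMin / 60 / maxConcurrent
	utilization := requiredToFit / float64(instances)
	switch {
	case utilization < 1/(factor+0.15):
		return "downscale"
	case utilization > 1/factor:
		return "upscale"
	default:
		return "no change" // inside the deadband
	}
}

func main() {
	// 600s of inference per minute, 5 concurrent requests per instance,
	// 4 instances → utilization 0.5, below the lower bound.
	fmt.Println(utilizationDecision(600, 5, 4, 1.2))
}
```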
3. EZ utilization-based (scaling_config.utilization_based.active = true):
The newest algorithm. Configurable per-model via ScalingConfig:
scaling_config:
disabled: false
utilization_based:
active: true
min_utilization: 0.3
max_utilization: 0.8
utilization_scale: 1.0
use_forecast: true
use_out_of_cap: true
prometheus: false
Features:
- Configurable min/max utilization bounds (replaces the hardcoded deadband)
- Optional forecast-based scaling (use_forecast)
- Out-of-capacity adjustment (use_out_of_cap): inflates measured utilization by (successful + out_of_cap) / successful to account for requests turned away. successful_per_min is floored at 0.1 to avoid division by zero.
- Safety cap: when OOC adjustment is active, measured utilization is capped at 2× the sans-OOC value
- Prometheus metrics as an alternative utilization source
- Asymmetric instance base: upscale decisions use model_healthy_instances, downscale uses model_requested_instances
- Unhealthy instance handling: adds a +1 buffer when upscaling with unhealthy instances; gradually decrements the requested count when more than one unhealthy instance exists
When active, this algorithm overrides the request-count-based
algorithm. However, if model_utilization_autoscaler is also true,
the older utilization-based algorithm takes final precedence — from
lowest to highest, the priority order is: request-count → EZ utilization → old utilization.
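The out-of-capacity adjustment above is the most intricate piece of arithmetic here, so a sketch may help (Go, with assumed parameter names):

```go
package main

import "fmt"

// oocAdjustedUtilization sketches the use_out_of_cap adjustment: measured
// utilization is inflated by (successful + out_of_cap) / successful, with
// successful_per_min floored at 0.1 and the result capped at 2× the
// unadjusted value.
func oocAdjustedUtilization(utilization, successfulPerMin, outOfCapPerMin float64) float64 {
	if successfulPerMin < 0.1 {
		successfulPerMin = 0.1 // floor: avoid division by zero
	}
	adjusted := utilization * (successfulPerMin + outOfCapPerMin) / successfulPerMin
	if adjusted > 2*utilization {
		adjusted = 2 * utilization // safety cap at 2× the sans-OOC value
	}
	return adjusted
}

func main() {
	// 20 requests/min turned away on top of 100 served: utilization
	// 0.5 is inflated by 120/100 to 0.6.
	fmt.Println(oocAdjustedUtilization(0.5, 100, 20))
}
```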
Instance bounds
All algorithms clamp results to [min_count, max_count] from Config API:
- Default min_count: 5 (set per autoscaler instance via the default_min_count CLI flag)
- Default max_count: 100
- Downscale batch size capped at 10 per cycle
There is no idle-based scale-to-zero. Models stay provisioned at
min_count (default 5, but can be set to 0 per-model).
Kill switches
- disable_scaling — global Config API property, disables all scaling
- scaling_config.disabled — per-model disable
- Tier change grace period — skips models that recently changed tiers
- Tier selector — autoscaler instances can be scoped to specific tier ranges
Scheduling actions
Actions are applied via Cloud Chamber API or external K8s:
| Action | Target | Effect |
|---|---|---|
| UpscaleModelApplications | Cloud Chamber | PATCH application instance count up |
| DownscaleModelApplications | Cloud Chamber | PATCH application instance count down |
| ScheduleExternalModelApplications | External K8s | Create/patch Model CRD replicas |
| CreateModelApplication | Cloud Chamber | Create new CC application + set instances |
| DeployModelApplicationToTiger | Cloud Chamber | Deploy to Tiger (canary) environment |
| DeleteDeployment | Cloud Chamber | Delete a specific deployment |
| RemoveModelApplication | Cloud Chamber | Delete entire application |
| ModifyApplicationRegions | Cloud Chamber | Modify region constraints |
| ModifyApplicationSchedulingPriority | Cloud Chamber | Modify scheduling priority |
| ModifyApplicationAffinities | Cloud Chamber | Modify colocation/affinity constraints |
Instance count is clamped to 0–1400 per application. Up/downscale distributes changes round-robin across applications.
Actions are defined in the Action enum.
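The round-robin distribution with the per-application clamp can be sketched as follows (a Go sketch; the exact iteration order and tie-breaking in the real scheduler may differ):

```go
package main

import "fmt"

// distributeUpscale spreads an instance-count increase round-robin across a
// model's applications, respecting the 1400-per-application ceiling.
func distributeUpscale(current []int, delta int) []int {
	const maxPerApp = 1400
	out := append([]int(nil), current...)
	for i := 0; delta > 0 && len(out) > 0; i = (i + 1) % len(out) {
		if out[i] < maxPerApp {
			out[i]++
			delta--
		} else if allFull(out, maxPerApp) {
			break // every application is at the cap; stop distributing
		}
	}
	return out
}

func allFull(counts []int, limit int) bool {
	for _, c := range counts {
		if c < limit {
			return false
		}
	}
	return true
}

func main() {
	// +5 instances across three applications of 10 each.
	fmt.Println(distributeUpscale([]int{10, 10, 10}, 5))
}
```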
External capacity
External capacity (OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe) has a separate
management model controlled by the ExternalCapacity config
(Management enum):
pub enum Management {
    Disabled,  // no auto-management
    Manual,    // humans manage it entirely
    Upscaling, // autoscaler can only scale UP
    Full,      // autoscaler can scale both directions
}
The autoscaler code has this comment:
NOTE: currently auto-scaler can only scale up external instances but not scale down. In order to scale down, use admin api endpoint: /models/schedule_externally
This comment may be stale — Management::Full supports both
directions in the code. The real constraint is that the autoscaler’s
scaling algorithms don’t dynamically compute external replica targets;
external capacity is config-driven (expected_replicas), not
utilization-driven.
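What each mode permits the autoscaler to do can be summarized in two predicates (a Go restatement of the Rust enum's semantics as described above, not the actual implementation):

```go
package main

import "fmt"

// Management mirrors the four modes of the Rust Management enum.
type Management int

const (
	Disabled  Management = iota // no auto-management
	Manual                      // humans manage it entirely
	Upscaling                   // autoscaler can only scale up
	Full                        // autoscaler can scale both directions
)

func canScaleUp(m Management) bool   { return m == Upscaling || m == Full }
func canScaleDown(m Management) bool { return m == Full }

func main() {
	// Only Full permits automated scale-down — hence the stale-looking
	// comment about the admin API being the scale-down path.
	fmt.Println(canScaleUp(Upscaling), canScaleDown(Upscaling))
}
```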
The external path works as follows:
- Infrastructure: K8s clusters provisioned via Terraform (oci-terraform/ repo — OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe)
- Model operator: inference-kubernetes-engine watches Model CRDs and reconciles K8s Deployments/Services to match
- Scaling up: autoscaler patches Model CRD spec.replicas via KubernetesProvider.schedule()
- Scaling down: manual via the admin API or the scale-down CLI tool in inference-kubernetes-engine, which gradually decrements replicas with a configurable interval and batch size
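The gradual decrement can be sketched as the sequence of replica targets the tool would patch, one per interval (a Go sketch; flag names and exact pacing are assumptions, and the sleep between patches is omitted):

```go
package main

import "fmt"

// scaleDownSteps computes the replica counts a gradual scale-down would step
// through: decrement by batchSize per interval until the target is reached,
// never undershooting it.
func scaleDownSteps(current, target, batchSize int) []int {
	var steps []int
	for current > target {
		current -= batchSize
		if current < target {
			current = target
		}
		steps = append(steps, current)
	}
	return steps
}

func main() {
	// From 27 replicas down to 0 in batches of 10: 17, 7, 0.
	fmt.Println(scaleDownSteps(27, 0, 10))
}
```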
Admin API
Manual scaling endpoints behind is_workers_ai_team()
access check:
- GET /models/scale — upscale a model by an amount (creates an UpscaleModelApplications action; notably a GET for a mutating op)
- POST /models/schedule — run the full scheduler loop for a single model
- POST /models/schedule_externally — set external replica count
- POST /models/schedule_externally_on_specific_provider — target a specific provider
- POST /models/remove — delete a specific deployment by colo/metal
- POST /models/delete_applications — delete all applications for a model, including external CRDs
- GET /models/status — model status/debugging
- Tiger management — create, list, delete canary deployments
Model provisioning
Unlike Replicate, Workers AI models are pre-provisioned rather than created on demand from inference traffic:
- Model is registered in Config API with properties (min_count, max_count, software, gpu_memory, etc.)
- release_manager_watcher detects the software version and creates Cloud Chamber applications
- Autoscaler maintains instance count within [min_count, max_count] based on utilization
- There is no equivalent of Replicate's deployment pruner — models stay provisioned at min_count until manually removed
Config API properties (scaling-relevant)
| Property | Default | Description |
|---|---|---|
| min_count | 5 | Minimum instances |
| max_count | 100 | Maximum instances |
| scaling_factor | 0.8 | Autoscaling factor |
| model_utilization_autoscaler | false | Enable utilization-based algorithm |
| scaling_config | none | YAML blob with disabled, utilization_based sub-config |
| disable_scaling | false | Kill switch (also available as a global property) |
| external_capacity | none | External provider config with management mode |
| gpu_memory | 22 | GPU memory request (GB) |
| gpu_model | none | Specific GPU model requirement |
| dual_gpu | false | Requires two GPUs |
| colo_tier | none | Restrict to colo tier |
| colo_region | none | Restrict to colo region |
| tier | "unknown-scheduler" | Model tier (Tier-0, Tier-1, Tier-2) |
Key Differences
| Aspect | Replicate | Workers AI |
|---|---|---|
| Scaling signal | Queue depth (backlog per instance) — real-time | Inference time utilization — 15-minute ClickHouse windows |
| Loop frequency | 1 second | 5 minutes |
| Deployment creation | On demand from prediction traffic | Pre-provisioned via Config API + release manager |
| Scale-to-zero | Yes, with idle timeout | No idle-based scale-to-zero; min_count defaults to 5 but can be 0 per-model |
| Deployment pruning | Automatic (idle deployments deleted after timeout) | None — models stay provisioned until manually removed |
| Dampening | HPA-style: scaling policies, stabilization windows, hysteresis | Deadband: upper/lower utilization bounds |
| Orchestrator | Direct K8s API (Deployments) | Cloud Chamber (internal) + K8s operator (external) |
| External capacity | N/A (single K8s cluster) | Multi-provider (OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe) with per-model management mode |
| Manual controls | Feature flags (cap-max-replicas, scale-to-zero-delay) | Admin API endpoints, disable_scaling kill switch, Management::Manual mode |
| Scale-down on external | N/A | Config-driven; Management::Full supports both directions but targets are set manually, not utilization-driven |
| Algorithm selection | Single algorithm (HPA-inspired) | Three algorithms with priority ordering: request-count → EZ utilization → old utilization |
| Config source | K8s annotations (from web’s DeployableConfig) | Config API properties (key-value store) |
Architectural contrast
Replicate’s autoscaler is reactive and fine-grained: predictions create deployments, queue depth drives scaling at 1-second resolution, idle deployments get pruned. The system assumes workloads are bursty and ephemeral.
Workers AI’s ai-scheduler is proactive and coarse-grained: models are pre-provisioned with minimum instance counts, scaling adjusts within configured bounds at 5-minute resolution, and external capacity management is heavily human-assisted. The system assumes a catalog of always-available models.