Introduction

This is a technical comparison of Replicate and Cloudflare Workers AI. It covers how each system handles request routing, autoscaling, resource management, model loading, and observability. When possible, links to source code are included.

Audience

Engineers and engineering leaders on the Workers AI and adjacent ai-platform teams. Sections are written to be useful both as reference material for people building these systems and as context for people making decisions about them.

How to read this

Each section covers both platforms, ending with a “Key Differences” summary. Sections are self-contained — read them in order or jump to what’s relevant.

Scope

This comparison covers the inference serving path and supporting infrastructure: how requests arrive, get routed to models, execute, and return results. In general, it does not cover ancillary tooling or browser-level user experience topics.

Request Lifecycle

Request Lifecycle Management

Overview

A given prediction or inference request touches different components on each platform. The diagrams below show the (simplified) full path from client to inference server and back.

Replicate

graph TD
    Client([Client])
    Web[web]

    subgraph GPU cluster
        API[api]
        Redis[(redis)]
        Director[director]
        Cog[cog]
    end

    Client -->|HTTP| API
    Web -.->|playground| API
    API -->|LPUSH / XADD| Redis
    Redis -->|XREADGROUP| Director
    Director -->|HTTP| Cog
    Cog -->|response| Director
    Director -->|webhook| API
    API -->|poll / webhook| Client

Key points: the request is asynchronous by default. The client submits a prediction, the API enqueues it, and Director picks it up later. The client polls, receives a webhook when the result is ready, or opts into “blocking” mode at the per-request level. Streaming predictions respond over SSE but still enter through the same queue.
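The asynchronous flow can be sketched as a client-side polling loop. This is illustrative only — `get_status` stands in for a GET against the prediction's URL, and the real Replicate SDKs wrap this logic:

```python
import time

def poll_prediction(get_status, interval_s=1.0, max_polls=100):
    """Poll a prediction until it reaches a terminal state.

    `get_status` is any callable returning the prediction's current
    state string; here it stands in for an HTTP GET on the prediction.
    """
    terminal = {"succeeded", "failed", "canceled"}
    for _ in range(max_polls):
        state = get_status()
        if state in terminal:
            return state
        time.sleep(interval_s)
    raise TimeoutError("prediction did not reach a terminal state")
```

A webhook subscription or per-request blocking mode replaces this loop entirely.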

Workers AI

graph TD
    Client([Client])
    WCE[worker-constellation-entry]
    CE[constellation-entry]
    CS[constellation-server]

    subgraph GPU cluster
        GPU[GPU container]
    end

    Client -->|HTTP| WCE
    WCE -->|HTTP| CE
    CE -->|pipefitter / HTTP| CS
    CS -->|HTTP| GPU
    GPU -->|response| CS
    CS -->|response| CE
    CE -->|response| WCE
    WCE -->|response| Client

Key points: the request is synchronous. The client holds an open HTTP connection while the request flows through the stack to a GPU container and back. There is no persistent queue — requests wait in bounded in-memory queues or get rejected with a 429 if no capacity is available.


Replicate Request Processing

The Replicate request path has three major phases: API-side routing and enqueue, transient queuing in Redis, and Director-side dequeue and execution. Durable state lives in PostgreSQL (via web), not in the queue. The sections below cover each phase in detail.

Behavior varies by workload type (see Workload Types appendix). All workload types share the same core queue and worker infrastructure but differ in configuration.

Queue and Message Handling

Each GPU provider’s managed Kubernetes cluster runs its own Redis cluster (with Redis Sentinel for HA). Director uses Redis Streams with consumer groups for request queuing. Each deployable has its own stream — the API pushes prediction requests to the stream, and Director pods consume from it. Queue names combine a workload-type prefix (input:prediction:dp-, input:prediction:fp-, etc.) with the deployable’s key.

Queues use shuffle sharding by default (since October 2024) — each tenant is assigned a subset of stream shards, providing fairness by preventing one account’s batch from blocking others. Hotswap predictions enable PreferSameStream for weight cache locality.
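The fairness property of shuffle sharding can be illustrated with a deterministic hash-based assignment. This is a sketch of the general technique, not Replicate's actual shard-selection code; the shard counts are assumed:

```python
import hashlib
import random

def shards_for_tenant(tenant_id: str, total_shards: int = 8, subset_size: int = 2):
    """Assign a tenant a fixed, deterministic subset of stream shards.

    Two tenants rarely share their entire subset, so one tenant's burst
    can saturate at most its own shards while others keep making progress.
    """
    seed = int.from_bytes(hashlib.sha256(tenant_id.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_shards), subset_size))
```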

See Queue Management for full details on queue internals: MessageQueue fetcher/claimer goroutines, sharded queues, blue/green migration via MultiMessageQueue, and stuck message cleanup.

Worker Loop

Each Director instance runs worker goroutines that poll the queue. The number of workers is controlled by DIRECTOR_CONCURRENT_PREDICTIONS (default: 1). This is set by cluster when the model’s Cog config specifies concurrency.max > 1 (cluster/pkg/kubernetes/deployable.go:1149-1153). Most models run with the default of 1 worker.

The worker loop (director/worker.go:134) continuously:

  1. Polls Redis Stream for messages using consumer groups
  2. Checks if prediction execution is allowed (spending validation, principal checks)
  3. Calculates effective deadline from prediction metadata, deployment lifetime, and timeout config
  4. Creates a Tracker instance for state management
  5. Executes prediction via Executor interface (which calls into Cog)
  6. Handles terminal states (success, failure, cancellation)
  7. Acknowledges message in Redis
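The steps above can be sketched as a single loop iteration. Structure only — `queue.poll`, `executor.run`, and the message fields are stand-ins for the real Redis and Cog calls:

```python
def worker_iteration(queue, executor, tracker_factory, now):
    """One pass of the Director worker loop (simplified sketch)."""
    msg = queue.poll()                       # 1. XREADGROUP on the stream
    if msg is None:
        return None
    if not msg.allowed:                      # 2. spending / principal checks
        queue.ack(msg)
        return "rejected"
    # 3. effective deadline: earliest of per-prediction deadline,
    #    deployment lifetime, and the timeout fallback
    candidates = [d for d in (msg.deadline, msg.lifetime_deadline, now + msg.timeout)
                  if d is not None]
    deadline = min(candidates)
    tracker = tracker_factory(msg)           # 4. state management
    result = executor.run(msg, deadline=deadline, tracker=tracker)  # 5. into Cog
    tracker.complete(result)                 # 6. terminal state handling
    queue.ack(msg)                           # 7. XACK in Redis
    return result
```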

The worker loop behavior is the same across all workload types. What differs is the job kind passed to the executor:

  • Predictions (deployment, version): DIRECTOR_JOB_KIND=prediction (default)
  • Trainings: DIRECTOR_JOB_KIND=training
  • Procedures (function predictions): DIRECTOR_JOB_KIND=procedure

Note: Internally, Director converts procedure to prediction when sending to Cog because the API doesn’t yet support the procedure value (director/worker.go:731-737).

Tracker and State Transitions

The Tracker (director/tracker.go) manages state transitions and notifications. It defines four webhook event types:

  • start - Prediction begins execution
  • output - Intermediate outputs during execution
  • logs - Log messages during execution
  • completed - Prediction reaches terminal state

The Tracker maintains:

  • Prediction state (struct at director/tracker.go:36-69)
  • Subscriber functions that receive state updates
  • Pending webhook events (if notifications are suppressed temporarily)
  • Upload tracking for async file completions
  • Moderation state (if content moderation is enabled)

Webhooks are sent to the Replicate API at state transitions. The webhook sender uses a PreemptingExecutor (director/executor.go) that may drop intermediate webhooks if new updates arrive before the previous one is sent - this is intentional since the latest state supersedes earlier ones. Terminal webhooks are always delivered.
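The drop-stale-updates behavior can be modeled with a tiny coalescing sender. This is a sketch of the idea, not the Go implementation; `deliver` stands in for the HTTP POST to the API:

```python
class PreemptingSender:
    """Coalesce intermediate webhook events: only the newest unsent
    update survives, while terminal events are always delivered."""

    def __init__(self, deliver):
        self.deliver = deliver      # callable standing in for the HTTP POST
        self.pending = None

    def submit(self, event, terminal=False):
        if terminal:
            self.flush()
            self.deliver(event)     # terminal webhooks are never dropped
        else:
            self.pending = event    # newer update preempts an unsent older one

    def flush(self):
        """Called when the sender is free to make another request."""
        if self.pending is not None:
            self.deliver(self.pending)
            self.pending = None
```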

Streaming

Director supports Server-Sent Events (SSE) for streaming outputs. Streaming is enabled when the API includes a stream_publish_url in the prediction’s internal metadata (cog/types.go:230). The StreamClient (director/streaming.go) publishes output events to this URL as they occur.

Concurrency Control

Director supports concurrent predictions within a single container via DIRECTOR_CONCURRENT_PREDICTIONS. See Concurrency Deep Dive for details on configuration and usage patterns.

Workers AI Request Processing

Routing Architecture

Workers AI requests flow through multiple components:

  1. worker-constellation-entry: Pipeline Worker (TypeScript) that receives API requests from SDK or REST API. Handles input validation, model lookup, and request formatting before forwarding to constellation-entry (apps/worker-constellation-entry/src/worker.ts).

  2. constellation-entry: Rust service running on every metal. Fetches model config from QS, selects target colos using routing algorithms, resolves constellation-server addresses, and forwards via pipefitter or HTTP (gitlab.cfdata.org/cloudflare/ai/constellation-entry).

  3. constellation-server: Colo-level load balancer (Rust). Manages permits and request queues per model, selects GPU container via service discovery, handles inference forwarding and cross-colo failover (gitlab.cfdata.org/cloudflare/ai/constellation-server).

  4. GPU container: Inference server running the model.

constellation-entry Routing

constellation-entry fetches model configuration from Quicksilver (distributed KV store), which includes the list of colos where the model is deployed (src/model_config.rs). It then applies a routing algorithm to select target colos. Routing algorithms include:

  • Random: Random selection among available colos (default)
  • LatencySensitive: Prefer nearby colos, with step-based selection
  • ForcedColo: Override to a specific colo (for debugging/testing)
  • ForcedResourceFlow: Use resource-flow routing (src/resource_flow.rs)

Routing uses a step-based algorithm (StepBasedRouter) that generates a prioritized list of colos. The algorithm can be customized per-model via routing_params in the model config.
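One plausible reading of step-based selection is latency banding: bucket colos into latency "steps", then randomize within each band. Illustrative only — the actual StepBasedRouter logic and its parameters live in constellation-entry:

```python
import random

def prioritized_colos(colo_latencies, step_ms=20, rng=None):
    """Order colos into latency bands of width `step_ms`, shuffling
    within each band so load spreads across similarly-close colos."""
    rng = rng or random.Random()
    buckets = {}
    for colo, latency_ms in colo_latencies.items():
        buckets.setdefault(latency_ms // step_ms, []).append(colo)
    ordered = []
    for step in sorted(buckets):
        band = buckets[step]
        rng.shuffle(band)
        ordered.extend(band)
    return ordered
```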

Once colos are selected, constellation-entry resolves constellation-server addresses via QS or DNS (src/cs_resolver.rs) and forwards the request via pipefitter or HTTP (src/pipefitter_transport.rs).

constellation-server Load Balancing

constellation-server runs in each colo and handles:

Request handling (src/main.rs, server/src/lib.rs):

  • Listens on Unix domain socket (for pipefitter) and TCP port (src/cli.rs:23,31)
  • Extracts routing metadata: model ID, account ID, fallback routing info
  • Tracks distributed tracing spans
  • Emits metrics to Ready Analytics

Endpoint load balancing and concurrency (server/src/endpoint_lb.rs):

Service discovery (service-discovery/src/lib.rs):

Inference forwarding (server/src/inference/mod.rs):

  • Regular HTTP requests: Buffers and forwards to GPU container with retry logic
  • WebSocket requests: Upgrades connection and relays bidirectionally (no retries)
  • Backend-specific handlers for Triton, TGI, TEI, and generic HTTP (PipeHttp — used for vLLM and others) in server/src/inference/

Failover and forwarding (server/src/colo_forwarding.rs):

  • If target returns error or is unavailable, may forward to different colo
  • Tracks inflight requests via headers (cf-ai-requests-inflight) to coordinate across instances
  • For pipefitter: Returns 500 with headers telling pipefitter to retry different colos
  • Handles graceful draining: continues accepting requests while failing health checks

Queuing and Backpressure

When GPU containers are busy, backpressure propagates through the stack:

  1. constellation-server: find_upstream() attempts to acquire a permit for a GPU endpoint. If all permits are taken, the request enters a bounded local colo queue (ModelRequestQueue). The queue is a future that resolves when a permit frees up (via tokio::sync::Notify). If the queue is also full, the request gets NoSlotsAvailableServerError::OutOfCapacity. OOC may also trigger forwarding to another constellation-server in the same colo (server/src/endpoint_lb/endpoint_group.rs:200-277).

  2. constellation-entry: Receives OOC from constellation-server (error codes 3033, 3040, 3047, 3048). Has a retry loop with separate budgets for errors (up to 3) and OOC (varies by request size). For small requests, pipefitter handles cross-colo forwarding. For large requests (>256 KiB), constellation-entry retries across target colos itself. All OOC variants are mapped to generic 3040 before returning to the client (proxy_server.rs:1509-1600, errors.rs:62-67).

  3. worker-constellation-entry: Has an optional retry loop on 429/424 responses, but outOfCapacityRetries defaults to 0 (configurable per-model via sdkConfig). By default, OOC goes straight to the client (ai/session.ts:101,137-141).

  4. Client: Receives HTTP 429 with "Capacity temporarily exceeded, please try again." (internal code 3040).

There is no persistent queue in the Workers AI stack. Backpressure is handled through permit-based admission control, bounded in-memory queues, cross-colo retry, and ultimately rejection to the client. This contrasts with Replicate’s Redis Streams-based queue, where requests persist until a Director instance dequeues them.
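The admission sequence at constellation-server can be sketched as a permit pool with a bounded waiter queue. Illustrative only — the real implementation is async Rust built on tokio::sync::Notify:

```python
from collections import deque

class ModelAdmission:
    """Permit-based admission with a bounded local queue, as in
    constellation-server's find_upstream path (simplified)."""

    def __init__(self, permits: int, queue_limit: int):
        self.permits = permits
        self.queue = deque()
        self.queue_limit = queue_limit

    def admit(self, request):
        if self.permits > 0:
            self.permits -= 1
            return "run"
        if len(self.queue) < self.queue_limit:
            self.queue.append(request)
            return "queued"
        return "out_of_capacity"    # surfaces to the client as a 429 (code 3040)

    def release(self):
        """A finished request frees its permit; a queued waiter takes it."""
        if self.queue:
            return ("run", self.queue.popleft())
        self.permits += 1
        return None
```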

State and Observability

Workers AI observability has two main channels from constellation-server:

Ready Analytics (ready-analytics/src/): Per-inference events sent via logfwdr (Unix socket). Each InferenceEvent includes:

  • Request timing: request_time_ms, inference_time_ms, queuing_time_ms, time_to_first_token
  • Billing: neurons, cost metric values
  • Routing: colo_id, source_colo_id, colo_path, upstream_address
  • Queue state: local_queue_size
  • Flags: streamed, beta, external endpoint, queued in colo

See ready-analytics/src/inference_event.rs for full field list.

Prometheus metrics (metrics/src/lib.rs): Local metrics scraped by monitoring. Includes:

  • requests (by status), errors (by internal/HTTP code)
  • total_permits, mean_used_permits, mean_queue_size, full_fraction (per model)
  • infer_per_backend, infer_per_health_state
  • forward_failures (per target colo)
  • backoff_removed_endpoints (endpoints in backoff state)

Note: “Workers Analytics” in the public docs refers to analytics for Workers code execution, not GPU inference. GPU inference observability flows through Ready Analytics.

Streaming

constellation-server supports two streaming mechanisms:

HTTP streaming (server/src/inference/pipe_http.rs): For PipeHttp backends, the response body can be InferenceBody::Streaming (line 222-230), which passes the Incoming body through without buffering via chunked transfer encoding. The permit is held until the connection completes (line 396-397 for HTTP/1, line 472 for HTTP/2). If the backend sends SSE-formatted data (e.g., TGI’s streaming responses), constellation-server passes it through transparently without parsing.

WebSocket (server/src/inference/mod.rs:180-400): Bidirectional WebSocket connections are handled by handle_ws_inference_request. After the HTTP upgrade, handle_websocket_request relays messages between client and inference server. A max connection duration is enforced (max_connection_duration_s, line 332-338).

Loopback SSE (server/src/loopback/): Separate from client-facing streaming. constellation-server maintains SSE connections to inference servers that support async batch processing. These connections receive completion events (InferenceSuccess, InferenceError, WebsocketFinished, GpuFree) which trigger permit release and async queue acknowledgment. See loopback/sse.rs for the SSE client implementation.

Concurrency Control

constellation-server uses permit-based concurrency control per model via EndpointLoadBalancer (server/src/endpoint_lb.rs). Concurrency limits are configured in Config API. When local queuing is enabled, requests wait in an explicit ModelRequestQueue (FIFO or Fair algorithm) until a permit becomes available. Different models can have different concurrency limits on the same server.
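The Fair algorithm can be sketched as per-account round-robin over waiting requests. A sketch of the idea only; the actual ModelRequestQueue lives in server/src/endpoint_lb.rs:

```python
from collections import OrderedDict, deque

class FairQueue:
    """Round-robin across accounts so one account's backlog cannot
    starve others waiting on the same model's permits."""

    def __init__(self):
        self.by_account = OrderedDict()

    def push(self, account, request):
        self.by_account.setdefault(account, deque()).append(request)

    def pop(self):
        if not self.by_account:
            return None
        account, q = next(iter(self.by_account.items()))
        request = q.popleft()
        del self.by_account[account]
        if q:
            self.by_account[account] = q    # rotate to the back
        return request
```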


Key Differences

| Aspect | Replicate | Workers AI |
| --- | --- | --- |
| Request model | Asynchronous by default (poll/webhook); sync opt-in via Prefer: wait | Synchronous by default; async opt-in via queueRequest |
| Queue persistence | Redis Streams — requests survive component restarts | No persistent queue — bounded in-memory queues, rejection on overflow |
| Backpressure | Queue absorbs bursts; Director dequeues when ready | Permit-based admission → local queue → cross-colo retry → 429 to client |
| State management | Director Tracker with webhook notifications (start, output, logs, completed) | Stateless request-response; observability via Ready Analytics events |
| Streaming | SSE via stream_publish_url; Director publishes output events | HTTP chunked transfer passthrough; WebSocket relay; loopback SSE for async batch |
| Routing | API routes to correct GPU cluster’s Redis; Director pulls from local queue | constellation-entry selects colos via routing algorithms; constellation-server selects GPU endpoint |
| Concurrency | DIRECTOR_CONCURRENT_PREDICTIONS per container (default 1) | Permit-based per model in constellation-server; FIFO or Fair queue algorithm |
| Fairness | Per-tenant shuffle sharding at queue level | Per-account round-robin in fair queue; per-account rate limiting |
| Transport | HTTP between Director and Cog | Pipefitter (small requests) or HTTP (>256 KiB) between constellation-entry and constellation-server |
| Failure handling | Webhook terminal states (failed/canceled); message ACK in Redis | Error codes propagated through stack; OOC triggers cross-colo retry |

Timeouts & Deadlines


Overview

Both platforms enforce timeouts on inference requests, but the designs reflect different assumptions about workload duration. Replicate has a multi-phase timeout system with separate controls for model setup, prediction execution, and overall lifetime — designed for workloads ranging from seconds to 24-hour training runs. Workers AI has a two-layer system: a per-model inference timeout in constellation-server (default 30 seconds) under a hard 4-minute ceiling in constellation-entry — optimized for fast, predictable inference.


Replicate Timeout System

Timeout behavior spans multiple components:

  • replicate/web - defines timeout values per workload type in deployable metadata
  • replicate/api - can inject per-prediction deadlines into prediction metadata
  • replicate/cluster - translates deployable config into DIRECTOR_* environment variables when creating K8s deployments; provides default values when the deployable config doesn’t specify them (pkg/config/config.go)
  • director - reads env vars and enforces timeouts during execution

Configuration

Three timeout parameters control prediction lifecycle. Each flows from replicate/web through replicate/cluster to director as an environment variable. Director reads them as flags (director/config.go:52-58).

Model Setup Timeout (DIRECTOR_MODEL_SETUP_TIMEOUT)

Time allowed for model container initialization. Applied during health check polling while waiting for the container to become ready. Measures time from StatusStarting to completion. If exceeded, Director fails the prediction with setup logs and reports to the setup run endpoint (director/director.go:718-830).

Resolution chain:

  1. web: Model.setup_timeout property — returns DEFAULT_MODEL_SETUP_TIMEOUT (10 minutes) when the underlying DB field is None (models/model.py:1038-1042). Stored on DeployableConfig.setup_timeout, then serialized as model_setup_timeout with a multiplier and bonus applied (api_serializers.py:1678-1681).
  2. cluster: Reads deployable.ModelSetupTimeout. If nonzero, uses it; otherwise falls back to config.ModelSetupTimeout (10 minutes) (pkg/config/config.go:74, pkg/kubernetes/deployable.go:1191-1201).
  3. director: Reads DIRECTOR_MODEL_SETUP_TIMEOUT env var. Own flag default is 0 (disabled), but cluster always sets it.

Prediction Timeout (DIRECTOR_PREDICT_TIMEOUT)

Time allowed for prediction execution, starting from when execution begins (not including queue time). Acts as a duration-based fallback when no explicit deadline is set. Director enforces a minimum of 30 minutes if configured to 0 or negative (director/director.go:285-288).

Resolution chain:

  1. web: Model.prediction_timeout (nullable duration). When set, stored on DeployableConfig.run_timeout and serialized as prediction_timeout (models/model.py:1010-1017, api_serializers.py:1702-1703). When None, the serialized prediction_timeout is omitted.

  2. cluster: predictTimeout() resolves the value with this priority (pkg/kubernetes/deployable.go:1886-1906):

    1. deployable.PredictionTimeout (from web, if set)
    2. Hardcoded per-account override map userSpecificTimeouts (pkg/kubernetes/deployable.go:88-116)
    3. config.PredictTimeoutSeconds (30 minutes) (pkg/config/config.go:71)
    4. For training workloads, uses the higher of the above and TrainTimeoutSeconds (24 hours)

    A mirrored per-account map exists in web’s USER_SPECIFIC_TIMEOUTS (models/prediction.py:60-107) to prevent terminate_stuck_predictions from killing predictions that have these extended timeouts.

  3. director: Reads DIRECTOR_PREDICT_TIMEOUT env var. Own flag default is 1800s (30 minutes), but cluster always sets it.

Max Run Lifetime (DIRECTOR_MAX_RUN_LIFETIME)

Maximum time for the entire prediction lifecycle including queue time and setup. Calculated from prediction.CreatedAt timestamp.

There are two paths that produce a max run lifetime constraint — one via the deployable config (env var on the Director pod) and one via a per-request header that becomes a deadline on the prediction itself.

Deployable config path (env var):

  1. web: DeployableConfig.max_run_lifetime — defaults to DEFAULT_MAX_RUN_LIFETIME (24 hours) at the DB level (models/deployable_config.py:259-261). For deployment predictions, deployment.max_run_lifetime can override this (logic.py:1175-1176). Serialized as max_run_lifetime (api_serializers.py:1675-1676).
  2. cluster: Reads deployable.MaxRunLifetime. If nonzero, uses it; otherwise falls back to config.MaxRunLifetime (24 hours) (pkg/config/config.go:75, pkg/kubernetes/deployable.go:1203-1213).
  3. director: Reads DIRECTOR_MAX_RUN_LIFETIME env var. Own flag default is 0 (disabled), but cluster always sets it.

Per-request path (Cancel-After header):

  1. api: The Cancel-After HTTP header on a prediction request is parsed as a duration (Go-style like 5m or bare seconds like 300). Minimum 5 seconds (server/v1_prediction_handler.go:233-276).
  2. api: calculateEffectiveDeadline() picks the shorter of the request’s Cancel-After value and the deployable metadata’s max_run_lifetime, computes an absolute deadline from prediction creation time, and sets it on prediction.InternalMetadata.Deadline (logic/prediction.go:1076-1109).
  3. director: This deadline is the “per-prediction deadline” checked first in the deadline priority (see Deadline Calculation below).
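The two API-side steps above can be sketched as follows. Helper names are hypothetical, and clamping to the 5-second minimum is one reading of "Minimum 5 seconds" — the API may instead reject smaller values:

```python
import re

def parse_cancel_after(value: str) -> float:
    """Parse bare seconds ("300") or a Go-style duration ("1h30m")."""
    if re.fullmatch(r"\d+(\.\d+)?", value):
        seconds = float(value)
    else:
        units = {"h": 3600, "m": 60, "s": 1}
        parts = re.findall(r"(\d+(?:\.\d+)?)([hms])", value)
        seconds = sum(float(n) * units[u] for n, u in parts)
    return max(seconds, 5.0)    # enforce the 5-second minimum (clamping here)

def effective_deadline(created_at: float, cancel_after_s: float,
                       max_run_lifetime_s: float) -> float:
    """Shorter of Cancel-After and max_run_lifetime, as an absolute
    deadline anchored at prediction creation time."""
    return created_at + min(cancel_after_s, max_run_lifetime_s)
```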

Deadline Calculation and Enforcement

Director calculates an effective deadline at dequeue time using this priority (director/worker.go:63-107):

  1. Per-prediction deadline (prediction.InternalMetadata.Deadline)
  2. Deployment deadline (prediction.CreatedAt + MaxRunLifetime, if configured)
  3. Prediction timeout (fallback)

The earliest applicable deadline wins. At execution time (director/worker.go:516-528):

deadline, source, ok := getEffectiveDeadline(prediction)
var timeoutDuration time.Duration

if ok {
    timeoutDuration = time.Until(deadline)
} else {
    timeoutDuration = d.predictionTimeout
}

timeoutTimer := time.After(timeoutDuration)

Timeout outcomes vary based on when and why the deadline is exceeded:

Before execution (deadline already passed at dequeue):

  • Per-prediction deadline → aborted
  • Deployment deadline → failed

During execution (timer fires while running):

  • Per-prediction deadline → canceled
  • Deployment deadline → failed
  • Prediction timeout (fallback) → failed

Timeout Behavior Examples

Scenario 1: Version prediction with deployment deadline

  • DIRECTOR_MAX_RUN_LIFETIME=300 (5 minutes)
  • DIRECTOR_PREDICT_TIMEOUT=1800 (30 minutes)
  • Prediction created at T=0, dequeued at T=60s
  • Effective deadline: T=0 + 300s = T=300s
  • Execution timeout: 300s - 60s = 240s remaining

Scenario 2: Prediction with explicit deadline

  • Per-prediction deadline: T=180s (3 minutes from creation)
  • DIRECTOR_MAX_RUN_LIFETIME=600 (10 minutes)
  • Prediction created at T=0, dequeued at T=30s
  • Effective deadline: T=180s (per-prediction deadline wins)
  • Execution timeout: 180s - 30s = 150s remaining

Scenario 3: Training with no deadlines

  • DIRECTOR_MAX_RUN_LIFETIME=0 (disabled)
  • DIRECTOR_PREDICT_TIMEOUT=1800 (30 minutes)
  • No per-prediction deadline
  • Effective deadline: None (uses fallback)
  • Execution timeout: 1800s from execution start

Workers AI Timeout System

Workers AI has a simpler timeout model than Replicate, but the request passes through multiple layers, each with its own timeout behavior.

Request Path Layers

A request from the internet traverses three services before reaching a GPU container:

  1. worker-constellation-entry (TypeScript Worker): The SDK-facing entry point. Calls constellation-entry via a Workers service binding (this.binding.fetch(...)) with no explicit timeout (ai/session.ts:131). See note below on Workers runtime limits.
  2. constellation-entry (Rust, runs on metal): Wraps the entire request to constellation-server in a hardcoded 4-minute tokio::time::timeout (proxy_server.rs:99, proxy_server.rs:831-845). On timeout, returns EntryError::ForwardTimeout. This applies to both HTTP and WebSocket paths.
  3. constellation-server (Rust, runs on GPU nodes): Enforces per-model inference timeout (see below).

The 4-minute constellation-entry timeout acts as a hard ceiling. If a model’s inference_timeout in constellation-server exceeds 240 seconds, constellation-entry will kill the connection before constellation-server’s own timeout fires. This means constellation-entry is the effective upper bound for any single inference request, unless constellation-entry’s constant is changed.
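The interaction reduces to two operations; the constants come from the text above, and the function name is purely illustrative:

```python
ENTRY_CEILING_S = 240   # hardcoded 4-minute timeout in constellation-entry

def observed_timeout(model_inference_timeout_s: int, server_default_s: int = 30) -> int:
    """What the client actually experiences: constellation-server takes
    the max of model config and its default, but constellation-entry
    cuts the connection at its own ceiling regardless."""
    cs_timeout = max(model_inference_timeout_s, server_default_s)
    return min(cs_timeout, ENTRY_CEILING_S)
```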

Workers runtime limits: worker-constellation-entry is a standard Worker deployed on Cloudflare’s internal AI account (not a pipeline First Party Worker). It does not set limits.cpu_ms in its wrangler.toml, so it gets the default 30-second CPU time limit. However, CPU time only counts active processing — time spent awaiting the service binding fetch to constellation-entry is I/O wait and does not count (docs). Since the Worker is almost entirely I/O-bound, the CPU limit is unlikely to be reached. Wall-clock duration has no hard cap for HTTP requests, though Workers can be evicted during runtime restarts (~once/week, 30-second drain). In practice, the Workers runtime layer is transparent for timeout purposes — constellation-entry’s 4-minute timeout fires well before any platform limit would.

Configuration Levels

constellation-server Default (constellation-server/src/cli.rs:107-109):

  • --infer-request-timeout-secs: CLI argument, default 30 seconds
  • Acts as the minimum timeout floor for all inference requests
  • Applied globally across all models served by the constellation-server instance

Per-Model Configuration (model-repository/src/config.rs:143):

  • inference_timeout: Model-specific timeout in seconds (default: 0)
  • Configured via Config API (Deus) and fetched from Quicksilver (distributed KV store)
  • Part of WorkersAiModelConfig structure
  • Zero value means “use constellation-server default”

Timeout Resolution

When processing an inference request, constellation-server resolves the timeout (server/src/lib.rs):

let min_timeout = state.infer_request_timeout_secs as u64;
let mut inference_timeout = Duration::from_secs(state.infer_request_timeout_secs as u64);

if let Some(model_config) = state.endpoint_lb.get_model_config(&model_id) {
    inference_timeout = Duration::from_secs(
        model_config.ai_config.inference_timeout.max(min_timeout)
    );
}

The effective timeout is: max(model_config.inference_timeout, infer_request_timeout_secs). This ensures:

  • Models can extend their timeout beyond the 30-second default
  • Models cannot reduce their timeout below the constellation-server minimum
  • Unconfigured models (inference_timeout=0) use the 30-second default

Enforcement

constellation-server enforces the timeout by wrapping the entire inference call in tokio::time::timeout(inference_timeout, handle_inference_request(...)) (server/src/lib.rs:588-599). When the timeout elapses, the future is dropped, which closes the connection to the GPU container. The error maps to ServerError::InferenceTimeout (line 603).

The other end of the dropped connection varies by model backend (service-discovery/src/lib.rs:39-47):

  • Triton: Upstream NVIDIA tritonserver binary (launched via model-greeter). Disconnect behavior depends on Triton’s own implementation.
  • TGI/TEI: Upstream HuggingFace binaries. Disconnect behavior depends on the upstream Rust server implementation.
  • PipeHttp/PipeHttpLlm: Custom Python framework (WorkersAiApp in ai_catalog_common, FastAPI-based). The synchronous HTTP path (_generate) does not explicitly cancel the raw_generate coroutine on client disconnect — the inference runs to completion (catalog/common/.../workers_ai_app/app.py:880-896). WebSocket connections do cancel processing tasks in their finally block (catalog/common/.../workers_ai_app/app.py:536-553).

The constellation-server timeout covers:

  • Time waiting in the constellation-server queue (permit acquisition)
  • Forwarding to the GPU container
  • Inference execution in the backend (Triton, TGI, TEI)
  • Response transmission back through constellation-server

But constellation-entry’s 4-minute timeout wraps the entire round-trip to constellation-server, so it is the effective ceiling regardless of per-model config.

Unlike Director’s system, there is no separate setup timeout. Model containers are managed by the orchestrator (K8s) independently of request processing. Container initialization and readiness are handled by service discovery and health checks, not request-level timeouts.

Timeout Behavior Examples

Scenario 1: Text generation model with default timeout

  • Model config: inference_timeout=0 (not configured)
  • constellation-server: infer_request_timeout_secs=30
  • Effective timeout: 30 seconds

Scenario 2: Image generation model with extended timeout

  • Model config: inference_timeout=120 (2 minutes)
  • constellation-server: infer_request_timeout_secs=30
  • Effective timeout: 120 seconds (model config wins)

Scenario 3: Model timeout exceeds constellation-entry ceiling

  • Model config: inference_timeout=300 (5 minutes)
  • constellation-server: infer_request_timeout_secs=30
  • constellation-entry: 240 seconds (hardcoded)
  • constellation-server effective timeout: 300 seconds
  • Actual outcome: constellation-entry kills the connection at 240 seconds

Scenario 4: Attempted timeout reduction (not allowed)

  • Model config: inference_timeout=10 (10 seconds)
  • constellation-server: infer_request_timeout_secs=30
  • Effective timeout: 30 seconds (constellation-server minimum enforced)

Key Differences

Complexity and Granularity:

  • Director: Multi-phase timeout system with separate controls for setup (model initialization) and execution (inference), plus deployment-level lifetime limits
  • Workers AI: Two-layer timeout — constellation-entry imposes a hard 4-minute ceiling, constellation-server applies per-model timeouts within that ceiling

Deadline Calculation:

  • Director: Priority-based deadline system with per-prediction, deployment, and fallback timeouts. Calculates earliest applicable deadline at dequeue time, accounting for time already spent in queue
  • Workers AI: Simple max() operation between model config and constellation-server default, applied at request receipt time

Queue Time Handling:

  • Director: MaxRunLifetime includes queue time; execution timeout accounts for time already elapsed since creation
  • Workers AI: Timeout starts when constellation-server receives the request, inherently includes queue time

Setup vs Runtime:

  • Director: Explicit ModelSetupTimeout for container initialization, separate from prediction execution timeout
  • Workers AI: No request-level setup timeout; container lifecycle managed independently by orchestrator

Configuration Source:

  • Director: Environment variables set by cluster service during deployment creation
  • Workers AI: CLI argument (constellation-server) + distributed config store (Quicksilver/Config API)

Default Values:

  • Director: 30 minutes prediction timeout (generous for long-running inference), setup and lifetime timeouts disabled by default
  • Workers AI: 30 seconds base timeout (optimized for fast inference), extended per-model via config, hard ceiling of 4 minutes at constellation-entry

Minimum Enforcement:

  • Director: 30-minute minimum enforced for prediction timeout if misconfigured
  • Workers AI: constellation-server minimum enforced as floor, models can only extend

Use Case Alignment:

  • Director: Designed for variable-length workloads including multi-hour training runs; flexible per-deployment configuration
  • Workers AI: Optimized for inference workloads with known latency profiles; centralized model-specific configuration

Job Types


Overview

Replicate supports three distinct job kinds — predictions, training, and procedures (pipelines) — all running through the same Director/cog infrastructure but with different queue prefixes, cog endpoints, timeout behaviors, and webhook structures. Workers AI has a single job type: inference. Task-type differentiation (text-generation, image-to-text, etc.) exists only as an input/output schema concern in worker-constellation-entry; everything downstream is task-agnostic.


Replicate Job Types

Director supports three job kinds, configured per-deployment via DIRECTOR_JOB_KIND (director/config.go:37):

  • prediction (default)
  • training
  • procedure

The job kind determines which cog HTTP endpoint Director calls, how webhooks are structured, and what timeout behavior applies.
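The endpoint-per-kind dispatch described in this chapter can be restated as a small sketch (Director itself is Go; this Python version is purely illustrative). The version gate for trainings follows the cog >= 0.11.6 fallback covered in the Training section.

```python
def cog_endpoint(job_kind, job_id, cog_version=(0, 11, 6)):
    """Map a Director job kind to the cog HTTP endpoint it PUTs to."""
    if job_kind == "procedure":
        return f"/procedures/{job_id}"
    if job_kind == "training" and cog_version >= (0, 11, 6):
        return f"/trainings/{job_id}"
    # Default kind, and the fallback for trainings on older cog versions.
    return f"/predictions/{job_id}"
```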

How Job Kind Is Set

Cluster sets DIRECTOR_JOB_KIND when generating the Director pod spec (pkg/kubernetes/deployable.go:1268-1280):

  • isProcedureMode() → procedure
  • isTraining() → training
  • Neither → default prediction


DeployableKind (Web Side)

The web app distinguishes four deployable kinds (deployable_config.py:114-120):

| DeployableKind | Director JobKind | Redis Queue Pattern | Key Pattern |
|---|---|---|---|
| DEPLOYMENT_PREDICTION | prediction | input:prediction:{key} | dp-{uuid4_hex} |
| FUNCTION_PREDICTION | procedure | input:prediction:{key} (or UNIFIED_PIPELINE_QUEUE) | fp-{hash} |
| VERSION_PREDICTION | prediction or procedure | input:prediction:{key} (or UNIFIED_PIPELINE_QUEUE) | vp-{docker_image_id[:32]} |
| VERSION_TRAINING | training | input:training:{key} | vt-{docker_image_id[:32]} |

Key prefixes: dp- = deployment, fp- = function/pipeline, vp- = version prediction, vt- = version training. FUNCTION_PREDICTION is used for pipelines (procedures) — see the Procedures section below.

Queue names are generated by DeployableConfig.queue (deployable_config.py:565-589).

OfficialModel and ModelDeployment

Some models expose a “versionless API” — users run predictions against a model name (POST /v1/models/{owner}/{name}/predictions) rather than a specific version hash.

OfficialModel (current production) — Ties a Model to a backing Deployment and a BillingConfig (official_model.py). From the end user’s perspective, an OfficialModel behaves like a Deployment — it has scaling, hardware selection, and version pinning — but the user interacts with it via the model name, not a deployment name. The actual infrastructure (K8s deployment, Director pods, Redis queues) runs through the backing Deployment.

ModelDeployment (stalled replacement) — Intended to replace OfficialModel by associating a Model with a Deployment (and optionally a Procedure) via a labeled relationship (model_deployment.py). The migration from OfficialModel to ModelDeployment was started but the work stalled and was rolled back. Both entities exist in the codebase and the dispatch logic checks for a default ModelDeployment first, falling back to OfficialModel (api_internal_views.py:303-326). TODO comments in the code (e.g. “Drop is_official from here once OfficialModels are gone” at model.py:1322) reflect the intended direction.

A model uses_versionless_api when it has either an OfficialModel or a default ModelDeployment (model.py:1321-1323).

Predictions

The default job kind. Director sends PUT /predictions/{id} to cog (cog/client.go:183-184).

Three prediction subtypes exist, distinguished by how the deployable is created:

Version predictions — Direct runs against a model version. Created when a user runs a prediction against a specific version ID. Key: vp-{docker_image_id[:32]} (version.py:618-619).

Deployment predictions — Runs through a named deployment. The deployment owns the scaling config, hardware selection, and version pinning. Kind is DEPLOYMENT_PREDICTION with key prefix dp- (deployment.py:611, deployment.py:1024-1027).

Hotswap predictions — A version that has additional_weights and a base_version. The base container stays running and loads different weights (e.g. LoRA adapters) at runtime instead of booting a new container per version.

Hotswap requirements (version.py:586-594):

  • Version has additional_weights
  • Version has a base_version
  • Version’s base docker image matches the base version’s docker image
  • Base version’s model is public, image is non-virtual, and image accepts replicate weights

Hotswap key: hp-{base_version.docker_image_id[:32]} (version.py:605-608). Multiple hotswappable versions sharing the same base version share the same key, so they share the same Director pool.

Hotswap uses prefer_same_stream=True (logic.py:1243), which tells Director’s Redis message queue to prefer consuming from the same stream shard repeatedly (redis/queue.go:248-251). This increases the chance that a Director instance serves the same version’s weights consecutively, avoiding unnecessary weight reloads. PreferSameStream is incompatible with batching (MaxConcurrentPredictions > 1) (config.go:90-92).
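The eligibility rules and key derivation above can be sketched as a single predicate. The dict fields here are simplified stand-ins for the real Django model attributes, not the actual schema.

```python
def hotswap_key(version):
    """Return the shared hp- key if the version is hotswappable, else None."""
    base = version.get("base_version")
    eligible = (
        bool(version.get("additional_weights"))
        and base is not None
        and version["docker_image"] == base["docker_image"]
        and base["model_is_public"]
        and not base["image_is_virtual"]
        and base["accepts_replicate_weights"]
    )
    if not eligible:
        return None  # falls back to an ordinary version prediction
    # All versions sharing a base share the same key -> same Director pool.
    return "hp-" + base["docker_image_id"][:32]
```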

Training

DIRECTOR_JOB_KIND=training. Director sends PUT /trainings/{id} (if cog >= 0.11.6), otherwise falls back to PUT /predictions/{id} (cog/client.go:279-311).

Training-specific behaviors:

  • Queue prefix is input:training: instead of input:prediction:
  • Timeout uses the higher of the prediction timeout and TrainTimeoutSeconds (24 hours) (deployable.go:1901)
  • Prediction.Destination field is present (specifies where trained weights go) (cog/types.go:182)
  • Concurrency is always 1 (deployable_config.py:417-418)
  • Webhook payload wraps the prediction in a Job with Kind: "training" and the prediction in the Training field instead of Prediction (worker.go:746-749)

Training is set via DeployableConfig.training_mode which maps to DeployableKind.VERSION_TRAINING (deployable_config.py:591-595).

Procedures

DIRECTOR_JOB_KIND=procedure. Director sends PUT /procedures/{id} (cog/client.go:54).

Procedures are multi-step workflows (pipelines) that run user-provided Python code on a shared runtime image (pipelines-runtime). Unlike predictions and trainings, the container image is not user-built — it’s a standard runtime that loads user code at execution time.

Procedure source loading:

Before executing, Director downloads the procedure source archive (tar.gz) from a signed URL, extracts it to a local cache directory, and rewrites the procedure_source_url in the prediction context to a file:// path (worker.go:540-548). The procedures.Manager handles download, caching (keyed by URL hash), and extraction (internal/procedures/manager.go).

Webhook compatibility hack:

When sending webhooks back to the API, Director overwrites the procedure job kind to prediction because the API doesn’t understand the procedure kind yet (worker.go:734-741).

Procedure subtypes in web:

  • Procedure — persistent, owned by a user/org, tied to a model. Uses DeployableKind.FUNCTION_PREDICTION (procedure.py:230).
  • EphemeralProcedure — temporary/draft, used for iteration. Source archives expire after 12 hours. Uses DeployableKind.VERSION_PREDICTION (procedure.py:42).

Both can route to the UNIFIED_PIPELINE_QUEUE if configured (deployable_config.py:573-584), which is a shared warm compute pool.

Workers AI Job Types

Workers AI has a single job type: inference. There is no training, no procedure/pipeline execution, and no multi-step workflow at the request level.

Task Types (Schema-Level Only)

worker-constellation-entry defines 14 task types in taskMappings (catalog.ts):

  • text-generation, text-classification, text-embeddings, text-to-image, text-to-speech
  • automatic-speech-recognition, image-classification, image-to-text, image-text-to-text
  • object-detection, translation, summarization, multimodal-embeddings
  • dumb-pipe (passthrough, no input/output transformation)

Each task type is a class that defines:

  • Input schema validation and preprocessing
  • Payload generation (transforming user input to backend format)
  • Output postprocessing

The task type is determined by the model’s task_slug from the config API. The infrastructure path (constellation-entry → constellation-server → GPU container) is identical regardless of task type. Task types are purely a worker-constellation-entry concern — constellation-entry and constellation-server are task-agnostic.

LoRA at Inference Time

constellation-server supports LoRA adapter loading for Triton-backed models (inference/triton.rs:142-214). When a request includes a lora parameter:

  1. Finetune config is fetched from the config API (Deus) via the model-repository crate
  2. Config is validated: finetune must exist for the current model, all required assets must be present
  3. lora_config and lora_name are injected as additional Triton inference inputs

This is inference-time adapter loading, not training. The base model stays loaded and the adapter is applied per-request. There is no LoRA-aware routing or stickiness — constellation-server’s endpoint selection and permit system are unaware of the lora field, so consecutive requests for the same adapter may hit different GPU containers.
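A hedged sketch of the per-request adapter injection described above: the injected input names (lora_config, lora_name) follow the text, while the finetune lookup shape is an assumption.

```python
def apply_lora(inputs, lora, finetunes, model_id):
    """Validate the finetune for this model and inject adapter inputs."""
    if lora is None:
        return inputs  # base model only
    config = finetunes.get((model_id, lora))
    if config is None or not config.get("assets_present"):
        raise ValueError(f"finetune {lora!r} unavailable for {model_id}")
    # The adapter is applied per-request as extra Triton inference inputs.
    return {**inputs, "lora_config": config["config"], "lora_name": lora}
```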

Async Queue

Two conditions must be met for async execution:

  1. The model must have use_async_queue enabled in the config API (Deus) — a per-model property, default false (config.rs:134)
  2. The requester must opt in per-request via options.queueRequest: true in the JSON body (ai.ts:133-144)

When both are true, the request is POSTed to the async-queue-worker service binding (queue.ts:23). The response is either 200 (completed immediately) or 202 (queued). To retrieve results, the requester sends a follow-up request with inputs.request_id set, which triggers a GET to the queue service (ai.ts:148-153).
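The double opt-in can be sketched as a routing decision: the model property and the request option must both be set. Key names mirror the text (use_async_queue, options.queueRequest); the function itself is illustrative.

```python
def route_request(model_config, body):
    """Decide whether a request goes to the async queue or runs sync."""
    wants_queue = body.get("options", {}).get("queueRequest") is True
    if model_config.get("use_async_queue") and wants_queue:
        # Streaming is deleted when queueing (no SSE through the queue).
        body.get("inputs", {}).pop("stream", None)
        return "async-queue"
    return "sync"
```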

Limitations:

  • Polling only — no webhook/callback support
  • No streaming — inputs.stream is explicitly deleted when queueRequest is true (ai.ts:135)

This contrasts with Replicate where every prediction is async by default (webhook-driven), and Prefer: wait is the opt-in for synchronous behavior. Workers AI is synchronous by default, with async as an opt-in that requires both model-level enablement and per-request signaling.

Key Differences

Job type diversity:

  • Replicate: Three distinct job kinds (prediction, training, procedure) with different cog endpoints, timeout behaviors, queue prefixes, and webhook structures
  • Workers AI: Single job type (inference) with task-type differentiation only at the input/output schema level

Training support:

  • Replicate: First-class training support with dedicated queue prefix, extended timeouts (24h), destination field for trained weights, and separate cog endpoint
  • Workers AI: No training capability. LoRA adapters are loaded at inference time from pre-trained weights, not trained on the platform

Multi-step workflows:

  • Replicate: Procedures allow user-provided Python code to run on a shared runtime, enabling multi-model orchestration and custom logic. Source code is downloaded and cached per-execution
  • Workers AI: No equivalent. Multi-step orchestration is left to the caller

Weight loading model:

  • Replicate: Hotswap mechanism allows a base container to serve multiple versions by loading different weights at runtime. Uses prefer_same_stream to optimize for cache locality across a shared Director pool
  • Workers AI: LoRA adapters are injected per-request at the Triton level. No concept of a shared base container serving multiple “versions” — each model ID maps to a fixed deployment

Job type awareness:

  • Replicate: The same infrastructure (cluster, Director, cog) handles all three job kinds. Job kind is a configuration axis that changes queue prefixes, cog endpoint paths, timeout calculations, and webhook payload structure — but the systems are the same
  • Workers AI: Task type is a thin layer in worker-constellation-entry only. Everything downstream (constellation-entry, constellation-server, GPU containers) is task-agnostic

Deployment Management


Overview

Replicate’s autoscaler is reactive: predictions create K8s Deployments on demand, queue depth drives scaling at 1-second resolution, and idle deployments get pruned automatically. Workers AI’s ai-scheduler is proactive: models are pre-provisioned with minimum instance counts, scaling adjusts within configured bounds at 5-minute resolution, and there is no idle-based pruning. The systems reflect fundamentally different assumptions — bursty ephemeral workloads vs. an always-available model catalog.


Replicate: Autoscaler

The autoscaler runs in every GPU model serving kubernetes cluster. It manages k8s Deployment objects in the models or serving namespaces — creating them when prediction traffic appears, scaling replica counts based on queue depth, and pruning idle ones.

Source: cluster/pkg/autoscaler/autoscaler.go (fa8042d)

Concurrent loops

| Loop | Interval | Purpose |
|---|---|---|
| Deployment Dispatcher | event-driven (BRPOP) | Creates/updates K8s Deployments when predictions arrive |
| Scaler | 1 second | Adjusts replica counts based on queue depth |
| Scale State Snapshotter | 1 second | Captures queue lengths + replica counts into Redis |
| Queue Pruner | 1 hour | Deletes stuck requests older than 24 hours |
| Deployment Pruner | 1 minute | Deletes K8s Deployments that have been idle too long |

Deployment Dispatcher

┌───────────────┐
│ replicate/api │
└──────┬────────┘
       │ LPUSH
       ▼
┌──────────────────────┐
│ prediction-versions  │
│ (Redis list)         │
└──────┬───────────────┘
       │ BRPOP
       ▼
┌──────────────────────┐
│ Deployment           │
│ Dispatcher           │
│ (event-driven)       │
└──────┬───────────────┘
       │ create/update
       ▼
┌──────────────────────┐
│ K8s Deployments      │
│ (models / serving)   │
└──────────────────────┘

startDeploymentDispatcher BRPOPs from the prediction-versions Redis queue. Each prediction triggers ensureDeployableDeployment, which creates or updates the K8s Deployment if the config has changed. Change detection uses a config hash comparison plus a template version serial.

Key behavior:

  • Deployments are created on demand — the first prediction for a version/deployment triggers K8s Deployment creation
  • Rate-limited to avoid overwhelming the K8s API
  • Config hash comparison means no-op if nothing changed
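The no-op detection can be sketched as follows: hash the rendered config deterministically and compare against what is recorded on the existing Deployment. The annotation keys here are assumptions for illustration, not the real ones.

```python
import hashlib
import json

def needs_update(config, template_serial, annotations):
    """True if the config hash or template serial differs from the
    values recorded on the existing K8s Deployment."""
    digest = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()
    return (annotations.get("config-hash") != digest
            or annotations.get("template-serial") != str(template_serial))
```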

Scaler

┌──────────────────────┐
│ K8s Deployments      │
│ (models / serving)   │
└──────┬───────────────┘
       │ read replicas + queue lengths
       ▼
┌──────────────────────┐
│ Snapshotter (1s)     │
└──────┬───────────────┘
       │ write
       ▼
┌──────────────────────┐
│ Redis cache          │
│ (scale state)        │
└──────┬───────────────┘
       │ read
       ▼
┌──────────────────────┐
│ Scaler (1s)          │
│ computeNewReplica    │
│ CountV2              │
└──────┬───────────────┘
       │ patch replicas
       ▼
┌──────────────────────┐
│ K8s Deployments      │
│ (models / serving)   │
└──────────────────────┘

startScaler runs every 1 second. For each deployable found in K8s, it loads the scale state from Redis cache and calls scaleDeployable().

The scaling algorithm (computeNewReplicaCountV2) is HPA-inspired:

desiredReplicas = currentReplicas × (metricValue / targetMetricValue)

Where the metric is backlog-per-instance:

backlogPerInstance = (queueLength + queueHeadroom) / instances

Three dampening mechanisms prevent oscillation:

  1. Scaling policies — rate limits on scale-out and scale-in. Default scale-out: allow +5 count or +100% per minute (whichever is larger). Default scale-in: unrestricted rate.
  2. Stabilization windows — the algorithm considers the min (for scale-in) or max (for scale-out) desired replica count over a time window. Default scale-out: 30 seconds. Default scale-in: 2 minutes.
  3. Hysteresis — ignore small oscillations below a threshold (default 0.02).
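A minimal sketch of the backlog-per-instance calculation with the hysteresis deadband; the stabilization windows and rate-limit policies listed above are omitted for brevity, and the scale-to-zero wake-up behavior is an assumption.

```python
import math

def desired_replicas(current, queue_length, queue_headroom,
                     target_backlog, hysteresis=0.02):
    """HPA-style: desired = current * (metric / target), with a deadband."""
    if current == 0:
        return 1 if queue_length > 0 else 0  # wake from scale-to-zero
    backlog_per_instance = (queue_length + queue_headroom) / current
    ratio = backlog_per_instance / target_backlog
    if abs(ratio - 1.0) <= hysteresis:
        return current  # ignore small oscillations
    return math.ceil(current * ratio)
```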

Additional scaling features:

  • Slow start: cap at 5 replicas until the first pod reports ready
  • Scale-to-zero: supported, with configurable idle timeout delay. Gated by scale-to-zero-delay kill switch.
  • Override min replicas: per-deployable minimum via DeployableConfig
  • Emergency cap: cap-max-replicas feature flag

Scaling configuration

scaling.Config holds per-deployable scaling parameters:

| Field | Default | Source |
|---|---|---|
| MetricTarget | config.BacklogPerInstance flag | CLI flag |
| Hysteresis | 0.02 | hardcoded |
| MinReplicas | 0 | DeployableConfig.ScalingConfig |
| MaxReplicas | config.MaxReplicas flag | CLI flag |
| ScaleOut behavior | 30s stabilization, +5 or +100%/min | algorithm_v2_defaults.go |
| ScaleIn behavior | 2min stabilization, no rate limit | algorithm_v2_defaults.go |

Per-deployable overrides come from deployable.ScalingConfig, which is set via the web’s DeployableConfig and serialized into K8s annotations.

Deployment Pruner

┌──────────────────────┐
│ Redis cache          │
│ (last-request-time)  │
└──────┬───────────────┘
       │ read
       ▼
┌──────────────────────┐
│ Deployment Pruner    │
│ (1min)               │
└──────┬───────────────┘
       │ delete idle
       ▼
┌──────────────────────┐
│ K8s Deployments      │
│ (models / serving)   │
└──────────────────────┘

startDeploymentPruner runs every 1 minute. It deletes K8s Deployments that haven’t received a prediction since DeployableDeploymentDeleteInactiveAfter. Max 100 deletions per cycle. Fails safe if the latest request time is missing from Redis (skips the deployment rather than deleting it).
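The fail-safe rule can be sketched as a single predicate: a deployment is deleted only when its last-request timestamp is known AND older than the idle threshold. The one-hour default here is illustrative, not the real DeployableDeploymentDeleteInactiveAfter value.

```python
from datetime import datetime, timedelta, timezone

def should_prune(last_request_at, now, idle_after=timedelta(hours=1)):
    """Prune only with positive evidence of idleness."""
    if last_request_at is None:
        return False  # missing data in Redis -> skip, never delete
    return now - last_request_at > idle_after
```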

Queue Pruner

startQueuePruner runs every 1 hour. Deletes stuck requests older than 24 hours from per-deployable Redis streams. The API’s sweeper also cleans these streams at much shorter intervals (30s). See Queue Management for details on both cleanup mechanisms and how they overlap.


Workers AI: ai-scheduler

Workers AI deployment management is split across multiple systems. ai-scheduler is a Rust binary deployed to a core datacenter K8s cluster (pdx-c). It has multiple subcommands, each run as a separate K8s deployment: AutoScaler, Scheduler, AdminAPI, ReleaseManagerWatcher, ExternalNodesWatcher, RoutingUpdater. They share the scheduling crate as a library.

Source: ai-scheduler/ (89f8e0d)

Architecture overview

┌─────────────────────────────────────────────────────────┐
│                      ai-scheduler                       │
│                                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │ auto_scaling │  │  admin_api   │  │ release_mgr   │  │
│  │ (5min loop)  │  │ (manual ops) │  │ _watcher      │  │
│  └──────┬───────┘  └──────┬───────┘  └──────┬────────┘  │
│         │                 │                 │           │
│         └────────┬────────┴─────────────────┘           │
│                  ▼                                      │
│         ┌──────────────┐                                │
│         │  scheduling  │  (action execution)            │
│         └──────┬───────┘                                │
│                │                                        │
└────────────────┼────────────────────────────────────────┘
                 │
        ┌────────┴────────┐
        ▼                 ▼
  Cloud Chamber      External K8s
  (internal GPU)     (OKE, CoreWeave, Nebius,
                      Lambda, GCP, Crusoe)
                          │
                          ▼
                  inference-kubernetes-engine
                  (Model CRD operator)

Three entry points produce scheduling actions:

  • auto_scaling — utilization-based autoscaler loop
  • admin_api — manual scaling endpoints for humans
  • release_manager_watcher — watches for software version changes, triggers rolling updates

All actions flow through the scheduling app, which applies them to either Cloud Chamber (internal capacity) or external K8s clusters via the external_nodes module.

Autoscaler (auto_scaling)

The autoscaler runs every 5 minutes. Each cycle:

  1. Fetches model usage from ClickHouse (request counts, inference time per minute over 15-minute windows)
  2. Fetches usage “forecast” from ClickHouse (same-time-last-week data, not a forecasting model)
  3. Fetches utilization metrics from Prometheus (soft — errors produce empty data, not failures)
  4. Loads model configuration from Config API
  5. Gets current application state from Cloud Chamber
  6. Fetches external endpoint health from Quicksilver and counts healthy deployments from Cloud Chamber
  7. Handles external capacity scheduling (may emit ScheduleExternalModelApplications)
  8. Computes desired instance count per model
  9. Emits UpscaleModelApplications or DownscaleModelApplications actions

Scaling algorithms

Three algorithms coexist. Selection depends on per-model Config API properties:

1. Request-count-based (default fallback):

instances = avg_count_per_min × autoscaling_factor

Default autoscaling_factor is 0.8. This assumes each inference request takes ~1 minute. Crude, but serves as the baseline.

2. Utilization-based (model_utilization_autoscaler = true):

Computes utilization from cumulative inference time:

required_to_fit = inference_time_secs_per_min / 60 / max_concurrent_requests
utilization = required_to_fit / model_instances

Uses a deadband instead of dampening:

  • If utilization < 1 / (factor + 0.15) → downscale
  • If utilization > 1 / factor → upscale
  • Otherwise → no change

Minimum factor is clamped to 1.2 (20% overprovisioning floor). Takes the max of current and forecast inference time.
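The two algorithms above can be sketched together. The 0.8 factor, the 1.2 floor, and the +0.15 deadband offset follow the text; everything else (names, return conventions) is illustrative.

```python
import math

def request_count_instances(avg_count_per_min, factor=0.8):
    """Algorithm 1: assumes each request occupies an instance ~1 minute."""
    return math.ceil(avg_count_per_min * factor)

def utilization_decision(inference_secs_per_min, max_concurrent,
                         instances, factor):
    """Algorithm 2: utilization deadband instead of dampening."""
    factor = max(factor, 1.2)  # 20% overprovisioning floor
    required_to_fit = inference_secs_per_min / 60 / max_concurrent
    utilization = required_to_fit / instances
    if utilization < 1 / (factor + 0.15):
        return "downscale"
    if utilization > 1 / factor:
        return "upscale"
    return "hold"  # inside the deadband
```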

3. EZ utilization-based (scaling_config.utilization_based.active = true):

The newest algorithm. Configurable per-model via ScalingConfig:

scaling_config:
  disabled: false
  utilization_based:
    active: true
    min_utilization: 0.3
    max_utilization: 0.8
    utilization_scale: 1.0
    use_forecast: true
    use_out_of_cap: true
    prometheus: false

Features:

  • Configurable min/max utilization bounds (replaces hardcoded deadband)
  • Optional forecast-based scaling (use_forecast)
  • Out-of-capacity adjustment (use_out_of_cap): inflates measured utilization by (successful + out_of_cap) / successful to account for requests turned away. successful_per_min floored at 0.1 to avoid division by zero.
  • Safety cap: when OOC adjustment is active, measured utilization capped at 2× the sans-OOC value
  • Prometheus metrics as an alternative utilization source
  • Asymmetric instance base: upscale decisions use model_healthy_instances, downscale uses model_requested_instances
  • Unhealthy instance handling: adds +1 buffer when upscaling with unhealthy instances; gradually decrements requested count when >1 unhealthy instances exist

When active, this algorithm overrides the request-count-based algorithm. However, if model_utilization_autoscaler is also true, the older utilization-based algorithm takes final precedence — the priority order is: request-count → EZ utilization → old utilization.
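The precedence can be restated as: the request-count result is the base, EZ utilization overrides it, and the old utilization algorithm overrides both when its flag is set. The config key names mirror the Config API properties named in the text; the function is a sketch.

```python
def select_algorithm(props):
    """Pick the scaling algorithm for a model's Config API properties."""
    algo = "request-count"
    ez = props.get("scaling_config", {}).get("utilization_based", {})
    if ez.get("active"):
        algo = "ez-utilization"
    if props.get("model_utilization_autoscaler"):
        algo = "old-utilization"  # takes final precedence
    return algo
```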

Instance bounds

All algorithms clamp results to [min_count, max_count] from Config API:

  • Default min_count: 5 (set per autoscaler instance via CLI flag default_min_count)
  • Default max_count: 100
  • Downscale batch size capped at 10 per cycle

There is no idle-based scale-to-zero. Models stay provisioned at min_count (default 5, but can be set to 0 per-model).
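The clamping can be sketched as follows; the min/max defaults and the downscale batch cap of 10 per cycle follow the text, while the exact order of operations is an assumption.

```python
def apply_bounds(current, desired, min_count=5, max_count=100,
                 downscale_batch=10):
    """Clamp a desired instance count to [min, max] and cap downscales."""
    desired = max(min_count, min(desired, max_count))
    if desired < current:
        # Never remove more than the batch cap in one 5-minute cycle.
        desired = max(desired, current - downscale_batch)
    return desired
```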

Kill switches

  • disable_scaling — global Config API property, disables all scaling
  • scaling_config.disabled — per-model disable
  • Tier change grace period — skips models that recently changed tiers
  • Tier selector — autoscaler instances can be scoped to specific tier ranges

Scheduling actions

Actions are applied via Cloud Chamber API or external K8s:

| Action | Target | Effect |
|---|---|---|
| UpscaleModelApplications | Cloud Chamber | PATCH application instances count up |
| DownscaleModelApplications | Cloud Chamber | PATCH application instances count down |
| ScheduleExternalModelApplications | External K8s | Create/patch Model CRD replicas |
| CreateModelApplication | Cloud Chamber | Create new CC application + set instances |
| DeployModelApplicationToTiger | Cloud Chamber | Deploy to Tiger (canary) environment |
| DeleteDeployment | Cloud Chamber | Delete specific deployment |
| RemoveModelApplication | Cloud Chamber | Delete entire application |
| ModifyApplicationRegions | Cloud Chamber | Modify region constraints |
| ModifyApplicationSchedulingPriority | Cloud Chamber | Modify scheduling priority |
| ModifyApplicationAffinities | Cloud Chamber | Modify colocation/affinity constraints |

Instance count is clamped to 0–1400 per application. Up/downscale distributes changes round-robin across applications.

Actions are defined in the Action enum.

External capacity

External capacity (OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe) has a separate management model controlled by the ExternalCapacity config (Management enum):

pub enum Management {
    Disabled,   // no auto-management
    Manual,     // humans manage it entirely
    Upscaling,  // autoscaler can only scale UP
    Full,       // autoscaler can scale both directions
}

The autoscaler code has this comment:

NOTE: currently auto-scaler can only scale up external instances but not scale down. In order to scale down, use admin api endpoint: /models/schedule_externally

This comment may be stale — Management::Full supports both directions in the code. The real constraint is that the autoscaler’s scaling algorithms don’t dynamically compute external replica targets; external capacity is config-driven (expected_replicas), not utilization-driven.

The external path works as follows:

  1. Infrastructure: K8s clusters provisioned via Terraform (oci-terraform/ repo — OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe)
  2. Model operator: inference-kubernetes-engine watches Model CRDs and reconciles K8s Deployments/Services to match
  3. Scaling up: autoscaler patches Model CRD spec.replicas via KubernetesProvider.schedule()
  4. Scaling down: manual via admin API or the scale-down CLI tool in inference-kubernetes-engine, which gradually decrements replicas with a configurable interval and batch size

Admin API

Manual scaling endpoints behind is_workers_ai_team() access check:

  • GET /models/scale — upscale a model by amount (creates UpscaleModelApplications action; notably a GET for a mutating op)
  • POST /models/schedule — run the full scheduler loop for a single model
  • POST /models/schedule_externally — set external replica count
  • POST /models/schedule_externally_on_specific_provider — target specific provider
  • POST /models/remove — delete specific deployment by colo/metal
  • POST /models/delete_applications — delete all applications for a model, including external CRDs
  • GET /models/status — model status/debugging
  • Tiger management — create, list, delete canary deployments

Model provisioning

Unlike Replicate, Workers AI models are pre-provisioned rather than created on demand from inference traffic:

  1. Model is registered in Config API with properties (min_count, max_count, software, gpu_memory, etc.)
  2. release_manager_watcher detects the software version and creates Cloud Chamber applications
  3. Autoscaler maintains instance count within [min_count, max_count] based on utilization
  4. There is no equivalent of Replicate’s deployment pruner — models stay provisioned at min_count until manually removed

Config API properties (scaling-relevant)

| Property | Default | Description |
|---|---|---|
| min_count | 5 | Minimum instances |
| max_count | 100 | Maximum instances |
| scaling_factor | 0.8 | Autoscaling factor |
| model_utilization_autoscaler | false | Enable utilization-based algorithm |
| scaling_config | none | YAML blob with disabled, utilization_based sub-config |
| disable_scaling | false | Kill switch (also available as global property) |
| external_capacity | none | External provider config with management mode |
| gpu_memory | 22 | GPU memory request (GB) |
| gpu_model | none | Specific GPU model requirement |
| dual_gpu | false | Requires two GPUs |
| colo_tier | none | Restrict to colo tier |
| colo_region | none | Restrict to colo region |
| tier | "unknown-scheduler" | Model tier (Tier-0, Tier-1, Tier-2) |

Key Differences

| Aspect | Replicate | Workers AI |
|---|---|---|
| Scaling signal | Queue depth (backlog per instance) — real-time | Inference time utilization — 15-minute ClickHouse windows |
| Loop frequency | 1 second | 5 minutes |
| Deployment creation | On demand from prediction traffic | Pre-provisioned via Config API + release manager |
| Scale-to-zero | Yes, with idle timeout | No idle-based scale-to-zero; min_count defaults to 5 but can be 0 per-model |
| Deployment pruning | Automatic (idle deployments deleted after timeout) | None — models stay provisioned until manually removed |
| Dampening | HPA-style: scaling policies, stabilization windows, hysteresis | Deadband: upper/lower utilization bounds |
| Orchestrator | Direct K8s API (Deployments) | Cloud Chamber (internal) + K8s operator (external) |
| External capacity | N/A (single K8s cluster) | Multi-provider (OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe) with per-model management mode |
| Manual controls | Feature flags (cap-max-replicas, scale-to-zero-delay) | Admin API endpoints, disable_scaling kill switch, Management::Manual mode |
| Scale-down on external | N/A | Config-driven; Management::Full supports both directions but targets are set manually, not utilization-driven |
| Algorithm selection | Single algorithm (HPA-inspired) | Three algorithms with priority ordering: request-count → EZ utilization → old utilization |
| Config source | K8s annotations (from web's DeployableConfig) | Config API properties (key-value store) |

Architectural contrast

Replicate’s autoscaler is reactive and fine-grained: predictions create deployments, queue depth drives scaling at 1-second resolution, idle deployments get pruned. The system assumes workloads are bursty and ephemeral.

Workers AI’s ai-scheduler is proactive and coarse-grained: models are pre-provisioned with minimum instance counts, scaling adjusts within configured bounds at 5-minute resolution, and external capacity management is heavily human-assisted. The system assumes a catalog of always-available models.

Resource Allocation


Overview

Both platforms need to map model requirements to GPU hardware, but they approach it from opposite directions. Replicate exposes hardware selection to users as a first-class concept — users pick a SKU, and the platform provisions dedicated resources. Workers AI abstracts hardware entirely — operators configure GPU memory requirements per model in Config API, and the platform handles placement across a shared fleet.


Replicate Resource Model

Hardware SKUs

The Hardware Django model (web/models/hardware.py) defines every available hardware option as a SKU. Each SKU combines a compute unit (GPU type) with a compute units per instance count (how many GPUs per pod).

Current compute units (cluster/pkg/kubernetes/compute_unit.go):

| Compute Unit | GPU | Status |
|---|---|---|
| cpu | None | Active |
| gpu-t4 | Nvidia T4 (16 GB) | Active |
| gpu-a100 | Nvidia A100 (40 GB) | Active |
| gpu-a100-80g | Nvidia A100 (80 GB) | Active |
| gpu-h100 | Nvidia H100 (80 GB) | Active |
| gpu-h200 | Nvidia H200 (141 GB) | Active |
| gpu-l40s | Nvidia L40S (48 GB) | Active |
| gpu-a40, gpu-a40-small, gpu-a40-large | Nvidia A40 (48 GB) | Legacy |
| gpu-t4-highmem, gpu-t4-lowmem | Nvidia T4 variants | Legacy |
| gpu-rtx-a4000, gpu-rtx-a5000, gpu-rtx-a6000 | Nvidia RTX Axxxx | Legacy |
| gpu-flex-ampere-min-40g | Any Ampere ≥40 GB | Legacy |

Multi-GPU SKUs use the same compute unit with a higher compute_units_per_instance. For example, gpu-2x-a100 is compute unit gpu-a100 with compute_units_per_instance=2. The Hardware model tracks this as a separate SKU with its own pricing.

Hardware availability is gated by flags on the model: allow_for_models, allow_for_deployments, is_legacy, is_preview. The HardwareQuerySet methods (available_for_models(), available_for_deployments()) filter to billable, non-legacy, public hardware (web/models/hardware.py:162-166).

K8s Resource Limits

The computeUnitLimits map (cluster/pkg/kubernetes/resources.go) defines CPU, memory, and GPU limits per compute unit:

| Compute Unit | CPU | Memory | GPUs |
|---|---|---|---|
| cpu | 1 | 2 Gi | 0 |
| gpu-t4 | 4 | 16 Gi | 1 |
| gpu-a100 | 10 | 72 Gi | 1 |
| gpu-a100-80g | 10 | 144 Gi | 1 |
| gpu-h100 | 13 | 144 Gi | 1 |
| gpu-h200 | 13 | 144 Gi | 1 |
| gpu-l40s | 10 | 72 Gi | 1 |

For multi-GPU pods, limits are multiplied by compute_units_per_instance with special cases for 8x configurations:

  • 8x A40/A40-Large: 6 CPU, 85 Gi per unit (reduced from 10/72)
  • 8x A100-80G: 10 CPU, 120 Gi per unit (reduced from 10/144)
  • T4 requests: memory request fudged to 13 Gi (limit stays 16 Gi) to fit 4x T4 on a node
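The multiplication and its 8x special cases can be sketched as follows. The per-unit values come from the limits table and bullets above; the function and dictionary names are illustrative (the real logic lives in the Go computeUnitLimits map), and the T4 memory-request fudge is not modeled.

```python
# Sketch of per-pod resource limit computation. Values per compute unit are
# (cpu, memory_gi, gpus); names here are hypothetical, not Replicate's code.
BASE_LIMITS = {
    "cpu":          (1, 2, 0),
    "gpu-t4":       (4, 16, 1),
    "gpu-a100":     (10, 72, 1),
    "gpu-a100-80g": (10, 144, 1),
    "gpu-h100":     (13, 144, 1),
    "gpu-h200":     (13, 144, 1),
    "gpu-l40s":     (10, 72, 1),
    "gpu-a40":      (10, 72, 1),   # legacy; base inferred from "reduced from 10/72"
}

# Reduced per-unit (cpu, memory_gi) for specific 8x configurations.
EIGHT_X_OVERRIDES = {
    "gpu-a40":      (6, 85),
    "gpu-a100-80g": (10, 120),
}

def pod_limits(unit: str, count: int) -> tuple[int, int, int]:
    """Total (cpu, memory_gi, gpus) limits for a pod of `count` compute units."""
    cpu, mem, gpus = BASE_LIMITS[unit]
    if count == 8 and unit in EIGHT_X_OVERRIDES:
        cpu, mem = EIGHT_X_OVERRIDES[unit]
    return (cpu * count, mem * count, gpus * count)
```

For example, an 8x A100-80G pod lands at 80 CPU / 960 Gi rather than the naive 80 CPU / 1152 Gi, which is what lets it fit on a node.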

Director’s sidecar container adds its own overhead: 256 Mi for most GPUs, 512 Mi for L40S/H100/A100/A100-80G, or 1 Gi when DirectorExtraMemory is set (deployable.go:1450-1465).

The model container also gets /dev/shm sized to 50% of its memory limit (deployable.go:681-686).

Node Placement and Bin-Packing

Replicate runs on three cloud providers (config/config.go):

  • CoreWeave CKS — primary, most models
  • Nebius MK8S — H200 capacity
  • GKE — T4 capacity

On CoreWeave CKS, placement uses a combination of node affinity, bin-packing, priority classes, and topology spread (deployable_cks.go):

Node affinity (required): pods are pinned to nodes matching the GPU class label (gpu.nvidia.com/class). CPU models go on GPU nodes but are excluded from H100/H200 nodes to preserve expensive capacity. Procedure (pipeline) workloads go to a dedicated customer-cpu-nodepool.

Bin-packing: GPU models use CoreWeave’s binpack-scheduler to pack pods tightly, leaving room for 4x and 8x pods. Pod affinity preferences (weight 10) group pods of the same compute unit on the same node. A second pod affinity groups pods by grace period (standard vs extended predict timeout) so that preemption doesn’t kill long-running predictions.

Priority node pools: 8x GPU pods get a node preference (weight 100) for dedicated priority node pools. Non-8x pods prefer to avoid these pools. On-demand node pools are also deprioritized (weight 100 against).

Priority classes:

  • r8-high — 4x+ GPU pods that are allowed to preempt others
  • r8-high-no-preempt — 4x+ GPU pods that won’t preempt
  • r8-cpu-model — CPU models, lowest priority, preemptible by GPU

Topology spread: CPU models on CKS use TopologySpreadConstraints with maxSkew=2 to prevent bunching on a single node and starving GPU models of CPU resources.

On Nebius MK8S, placement is similar to CKS but narrower in scope (deployable_mk8s.go):

GPU support: H200 only. The switch statement handles ComputeUnitCPU and ComputeUnitH200 — everything else gets a NO_SUCH_GPU sentinel that prevents scheduling. Node affinity uses the standard K8s label node.kubernetes.io/instance-type (value gpu-h200-sxm).

Bin-packing: Same pod affinity logic as CKS — non-8x GPU pods get compute-unit bin-packing (weight 10) and grace-period bin-packing (weight 10). No custom scheduler though — uses the default K8s scheduler.

CPU models: Excluded from H200 nodes via NodeSelectorOpNotIn on the GPU class label, but no dedicated CPU node pool — they land on whatever non-H200 nodes are available.

On GKE, placement is minimal (deployable_gke.go):

Node affinity (required): pods are pinned to nodes matching a replicate/role label with value model-{compute_unit} (e.g. model-gpu-t4). This is a Replicate-defined label on the node pool, not a vendor GPU class label.

Tolerations: Three tolerations per pod — the default GKE nvidia.com/gpu taint, a legacy model-hardware taint, and a newer model-compute-unit taint. The code comments indicate the compute-unit taint is the intended replacement for the hardware taint.

Tenancy Model

The isolation boundary is the deployable (model version or deployment), not the account. Each deployable gets its own K8s Deployment with dedicated pods and exclusive GPU access. But multiple accounts can send prediction requests to the same deployable — the GPU pods themselves are shared across all authorized requesters.

Public models and deployments are multi-tenant: any account can run predictions against them, and all requests land on the same pool of Director pods. This is the common case for official models.

Private models and deployments are single-tenant by default: only the owning account can run predictions. Access can be granted to other accounts via an allow-list, similar to GitHub’s private repository permissions model. The access control is enforced at the API layer (web/api check permissions before enqueuing) — there is no infrastructure-level isolation (no separate namespaces, no network policies between tenants).

The “shared” vs “dedicated” billing distinction (instance_tenancy in Metronome pricing config) is orthogonal to tenancy. “Shared” means serverless pay-per-run pricing; “dedicated” means the customer pays for reserved uptime on a deployment. Both billing models serve requests from potentially multiple accounts on the same GPU pods.


Workers AI Resource Model

GPU Types and Memory Sizing

Workers AI models specify GPU requirements in Config API (config_api/src/lib.rs):

| Field | Default | Description |
|---|---|---|
| gpu_memory | 22 (GB) | GPU memory requested |
| cpu_memory | same as gpu_memory | CPU memory override |
| vcpu | (none) | CPU core request |
| dual_gpu | false | Needs 2 GPUs |
| gpu_model | (none) | GPU model override, e.g. "NVIDIA H100" |

The platform supports three GPU types (cloudchamber/src/application.rs:259-263):

  • L4 — Nvidia L4 (internal colos)
  • H100 — Nvidia H100 80 GB (internal colos + external capacity)
  • H200 — Nvidia H200 141 GB (external capacity)

GPU type is inferred from the gpu_model string in the application config. If no model is specified, H100 is assumed (application.rs:286-291).

Multi-GPU allocation is calculated from total gpu_memory divided by per-GPU maximum: 80 GB for H100, 141 GB for H200. The result is clamped to 1–8 GPUs (external_nodes/.../model.rs:27-34).
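As a rough sketch of that calculation — assuming ceiling division, since the exact rounding in model.rs is not shown here:

```python
import math

# Per-GPU memory maxima from the section above; the helper name is illustrative.
PER_GPU_MEMORY_GB = {"H100": 80, "H200": 141}

def gpu_count(gpu_memory_gb: int, gpu_type: str = "H100") -> int:
    """GPUs needed for a model: total gpu_memory / per-GPU max, clamped to 1-8."""
    needed = math.ceil(gpu_memory_gb / PER_GPU_MEMORY_GB[gpu_type])
    return max(1, min(8, needed))
```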

Unlike Replicate, users never select hardware. The model’s GPU requirements are set by Workers AI operators via Config API, and the platform handles placement.

Placement and Scheduling

Workers AI uses two scheduling paths:

Internal capacity (Cloudchamber): Models are scheduled via Cloudchamber, Cloudflare’s internal container orchestrator. The create_application_request function (application.rs:247-255) maps model config to a Cloudchamber application with SchedulingPolicy::Gpu, placement constraints (colo tier, region, PoPs), and optional scheduling priority.

Cloudchamber’s placement is controlled by constraints on the application object. These map from Config API properties to ApplicationConstraints fields (application.rs:216-233):

Colo tier (colo_tier): Cloudflare datacenters are grouped into tiers by GPU capacity. Setting colo_tier restricts a model’s instances to datacenters at that tier. For example, when deploying in “tiger mode” (canary), default single-GPU models are pinned to tier 3 (application.rs:231-232).

Colo region (colo_region): Restricts instances to datacenters in specific geographic regions. Accepts a comma-separated list.

Colo PoPs (colo_pops): The most specific constraint — restricts instances to named Points of Presence (individual datacenters).

Scheduling priority (scheduling_priority): Sets the Cloudchamber scheduling priority for the application. Default is 50. Currently the only non-default value is Leonardo = 75, used for Leonardo partnership models to give them preferential placement (scheduling_priority.rs).

Separately, blacklisted_colos removes specific colos from the model’s routing table (not scheduling). This is consumed by the routing app when building the colo list that constellation-entry uses for request forwarding — it doesn’t affect where Cloudchamber places instances (routing/src/lib.rs:201-208).

External capacity (Kubernetes): For external GPU providers (OCI, etc.), ai-scheduler creates a Model custom resource (external_nodes/.../model.rs:73-150) that the IKE operator reconciles into K8s Deployments. GPU limits are set as nvidia.com/gpu resource requests when >1 GPU is needed.

Tenancy Model

Workers AI is multi-tenant at the model level. The default is that all accounts share the same model instances — a single GPU container serving @cf/meta/llama-3.1-8b-instruct handles requests from every account. The isolation boundary is the model ID, not the caller.

However, models can be restricted to specific accounts. Two mechanisms exist in worker-constellation-entry (ai.ts:49-64):

  • allowed_accounts — a comma-separated list of account IDs in Config API. When set, only listed accounts can use the model. Requests from other accounts get a 403. This is how partnership models (e.g. @cf/leonardo/phoenix-1.0) are restricted to the partner’s account.
  • is_private — a boolean flag marking a model as private.

Both are enforced at the API layer in worker-constellation-entry, not at the infrastructure level. A “private” Leonardo model still runs on the same GPU fleet as public models — the access gate is just earlier in the request path. This parallels Replicate’s approach where private model access control is also API-layer only.

Since all accounts share the same GPU instances, fairness is enforced at the request level — per-account fair queuing and rate limiting in constellation-server. See Queue Management for details.


Key Differences

| Aspect | Replicate | Workers AI |
|---|---|---|
| Hardware selection | User-facing SKU picker | Operator-configured, abstracted from users |
| GPU types | T4, A100 (40/80), H100, H200, L40S + legacy | L4, H100, H200 |
| Multi-GPU | compute_units_per_instance (1–8) | gpu_memory / per-GPU max (1–8) |
| Resource limits | Explicit per-unit CPU/mem/GPU in Go map | GPU memory-based, CPU/mem optional |
| Isolation boundary | Deployable (model version or deployment) — dedicated GPU per deployable, multiple accounts share it | Model ID — all accounts share the same instances, no per-account isolation |
| Access control | API-layer: public models open to all, private models gated by allow-list | API-layer: public models open to all, allowed_accounts / is_private restrict partnership and private models |
| Fairness | No fairness controls — all requests to a deployable are equal | Per-account fair queuing + rate limiting (see Queue Management) |
| Scheduling | K8s with bin-pack scheduler + affinity | Cloudchamber (internal) + K8s via IKE (external) |
| Cloud providers | CoreWeave CKS, Nebius, GKE (legacy) | Cloudflare internal colos + external (OCI, etc.) |
| Preemption | Priority classes, configurable per-deployable | Scheduling priority in Cloudchamber |
| Placement constraints | GPU class node affinity, priority node pools | Colo tier/region/PoP (scheduling); blacklisted colos (routing only) |

The fundamental architectural difference is where the isolation boundary sits. Replicate isolates at the deployable level — each model version or deployment gets dedicated GPU pods, but multiple accounts can share those pods (public models) or access is allow-listed (private models). Workers AI isolates at the model level — all accounts share the same instances for a given model, with fairness enforced at the request level via queuing and rate limiting rather than resource partitioning.

Queue Management

Queue Management

Overview

Both platforms queue inference requests when backend capacity is saturated, but the queue architectures are fundamentally different. Replicate uses Redis Streams as a durable, distributed queue between the API layer and Director workers — requests persist across process restarts and can be claimed by any Director instance. Workers AI uses bounded in-memory queues local to each constellation-server instance — requests exist only in the process that received them and are lost if it restarts.


Replicate: Redis Streams with Consumer Groups

Queue Naming

Every deployable gets its own Redis Stream. The queue name is derived from the deployable’s key prefix and kind (deployable_config.py:565-589):

| DeployableKind | Key Prefix | Queue Name |
|---|---|---|
| deployment-prediction | dp- | input:prediction:dp-{hex} |
| function-prediction | fp- | input:prediction:fp-{hex} |
| version-prediction | vp- | input:prediction:vp-{hex} |
| version-training | vt- | input:training:vt-{hex} |

Function and version predictions that use the unified pipeline cluster config share a single queue: input:prediction:fp-{0*32} (deployable_config.py:44-45).
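The mapping can be sketched as a small lookup — a hypothetical helper mirroring deployable_config.py, not its actual code:

```python
# Kind -> (key prefix, workload segment), per the table above.
KEY_PREFIXES = {
    "deployment-prediction": ("dp", "prediction"),
    "function-prediction":   ("fp", "prediction"),
    "version-prediction":    ("vp", "prediction"),
    "version-training":      ("vt", "training"),
}

def queue_name(kind: str, deployable_hex: str) -> str:
    """Derive the Redis Stream name for a deployable's queue."""
    prefix, workload = KEY_PREFIXES[kind]
    return f"input:{workload}:{prefix}-{deployable_hex}"
```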

MessageQueue

Director’s MessageQueue (director/redis/queue.go) wraps a Redis Streams consumer group. Each Director pod is a consumer in the "director" consumer group. The queue runs two goroutines via errgroup:

  • Fetcher: Waits on a request channel, calls XREAD (via the QueueClient.Read interface) with the configured block duration, and sends the result back on a response channel. The fetcher is the only goroutine that reads from Redis — GetMessage sends a request struct and waits for the response (queue.go:131-148).

  • Claimer: Runs every 2 seconds (claimInterval), calling XCLAIM on all unacked messages to maintain ownership. This prevents Redis from reassigning messages to other consumers in the group while the current Director is still processing them. After the fetcher stops (shutdown requested), the claimer continues until all messages are acked or the context is canceled (queue.go:150-191).

Messages are tracked in an unackedMessages map keyed by unique ID (stream name + message ID). When a Director worker finishes processing a prediction, it calls Ack on the message, which removes it from the map. If a Director pod crashes, its unacked messages remain in the consumer group’s pending entries list (PEL). The API sweeper reclaims these orphaned messages via XPENDING + XCLAIM after 30 seconds of idle time, marks the predictions as failed, and deletes the messages. See Queue Sweeping below.

Sharded Queues

Each deployable’s queue is sharded across multiple Redis streams (:s0, :s1, etc.). The ShardedQueueClient (queue.go:317-356) manages these shards, using "director" as the consumer group name across all of them.

Every ack pipelines XACK + XDEL — the message is both acknowledged in the consumer group and deleted from the stream in a single round-trip (queue.go:329-341).

preferSameStream

When preferSameStream is enabled, the MessageQueue remembers which stream shard it last read from and passes it as PreferStream to the next Read call (queue.go:246-252). This is a hotswap optimization — by preferring the same shard, the Director pod improves weight cache locality (the model weights are already loaded for predictions from that shard’s deployable).

The streamChanged flag on the message indicates when the Director switched to a different stream, which signals the worker loop that it may need to handle a model swap.

MultiMessageQueue (Blue/Green)

MultiMessageQueue (queue.go:73-77) wraps multiple Queue backends and rotates between them on each GetMessage call. When multiple backends are present, the block duration is hard-coded to 100ms to stay responsive regardless of which backend has messages (queue.go:300-308).

This supports blue/green Redis migration — during a migration, both the old and new Redis instances are active, and the MultiMessageQueue polls both. Messages carry an explicit reference to the queue they came from so that acks and claims go to the correct backend.

Queue Pruning (Autoscaler)

The autoscaler’s startQueuePruner (autoscaler.go:299-311) runs hourly, scanning every deployable’s queue for stuck messages. A message is “stuck” if it’s older than 24 hours (stuckTimeout) (prune.go:25).

The pruner checks two sources per queue (prune.go:118-148):

  • XPENDING: Messages that were delivered to a consumer but never acknowledged — these were likely picked up by a Director that crashed or got stuck during processing.
  • XRANGE: Messages still in the stream that were never delivered to any consumer — these were likely enqueued when no Director pods were running (e.g. setup failed and the deployment scaled to zero).

Stuck messages are deleted in chunks of 50 via pipelined XACK + XDEL (prune.go:151-161).

Queue Sweeping (API)

The API service runs its own sweeper (api/internal/sweeper/sweeper.go) that overlaps with the autoscaler’s pruner but operates differently:

  • Frequency: Every ~30 seconds (with ±50% jitter), vs the pruner’s hourly cycle.
  • Scope: Scans all streams matching each deployable kind prefix (deployment-prediction, function-prediction, hotswap-prediction, version-prediction, version-training).
  • Mechanism: Uses ClaimStaleMessages (XPENDING filtered by 30-second idle threshold, then XCLAIM) to take ownership of messages that a Director claimed but stopped processing — e.g. because the pod crashed. Then sends a failure update to the web service and deletes the message.
  • Purpose: Catches predictions where the Director that claimed them is gone. The sweeper marks them as failed so the user gets a response rather than waiting indefinitely. Messages that were never claimed by any Director (sitting unclaimed in the stream) are not visible to the sweeper — those are handled by the autoscaler’s pruner after 24 hours.

The sweeper also runs two deadline-related loops:

  • Cordon loop (~1 second): Queries the meta:deadlines sorted set for predictions whose deadline has passed within the last 10 seconds. Deadlines are stored as scores at enqueue time. Matched predictions are moved to a presorted garbage set and terminated with status aborted via TerminatePrediction. This handles predictions still sitting in the queue (or still running) when their time expires.
  • GC loop (~30 seconds): Processes the presorted garbage set, deleting the actual stream messages for predictions that were cordoned or otherwise terminated.

These three loops — sweep, cordon, GC — cover different failure modes. The sweep loop handles crashed Directors (idle PEL entries). The cordon loop enforces prediction deadlines regardless of Director state. The GC loop cleans up stream messages after either of the other two has terminated the prediction. The autoscaler’s hourly pruner is a slower safety net for messages that somehow survive all three.
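A minimal sketch of one cordon-loop iteration, modeling the meta:deadlines sorted set as a dict of prediction ID to deadline timestamp (Redis stores the deadline as the member's score; the function name and structure here are illustrative):

```python
def cordon_pass(deadlines: dict, garbage: set, now: float, window: float = 10.0):
    """One cordon iteration: move predictions whose deadline passed within the
    last `window` seconds into the garbage set for the GC loop to clean up.
    Entries older than the window are assumed handled by earlier passes."""
    expired = [pid for pid, ts in deadlines.items() if now - window <= ts <= now]
    for pid in expired:
        del deadlines[pid]
        garbage.add(pid)   # real impl also terminates the prediction as aborted
    return expired
```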

Consumer Pruning (API)

A separate ConsumerPruner (api/internal/sweeper/consumer_pruner.go) removes stale consumers from consumer groups. Consumers that haven’t claimed a message in 24 hours (staleConsumerTimeout) are pruned. This cleans up after Director pods that were terminated without gracefully leaving the consumer group.


Workers AI: Bounded In-Memory Queues

ModelRequestQueue

ModelRequestQueue (queue.rs) is an in-memory queue per model per constellation-server instance. It’s created when an EndpointGroup is initialized and holds requests that couldn’t immediately acquire a permit on any endpoint.

The queue supports two algorithm variants, selectable per model via Config API:

  • Fifo (fifo_queue.rs): Simple VecDeque<(account_id, Sender)>. Requests are processed in arrival order. All accounts share a single queue.
  • Fair (fair_queue.rs): Per-account VecDeque<Sender> queues stored in a DashMap, with a round-robin VecDeque<account_id> that rotates through accounts. An account is added to the rotation on its first enqueue and removed when its queue empties.

Both variants retain only live senders — abandoned slots (where the requester timed out or disconnected) are cleaned up on each enqueue call via retain(|q| !q.is_closed()).
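The Fair variant's round-robin behavior can be sketched like this — a simplified Python model of the Rust per-account-queues-plus-rotation structure, ignoring sender liveness and capacity:

```python
from collections import deque

class FairQueue:
    """Sketch of the Fair algorithm: per-account queues plus a round-robin
    rotation of account IDs. Names are illustrative, not the Rust API."""
    def __init__(self):
        self.queues = {}          # account_id -> deque of requests
        self.rotation = deque()   # account_ids in round-robin order

    def enqueue(self, account_id, request):
        if account_id not in self.queues:
            self.queues[account_id] = deque()
            self.rotation.append(account_id)   # first enqueue joins the rotation
        self.queues[account_id].append(request)

    def dequeue(self):
        if not self.rotation:
            return None
        account_id = self.rotation.popleft()
        request = self.queues[account_id].popleft()
        if self.queues[account_id]:
            self.rotation.append(account_id)   # still has work: rotate to the back
        else:
            del self.queues[account_id]        # queue emptied: leave the rotation
        return request
```

With two requests from account a1 queued ahead of one from a2, dequeue order interleaves the accounts (r1, r3, r2) rather than draining a1 first as FIFO would.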

Capacity

Queue capacity is calculated from model config (service-discovery/lib.rs:79-94):

capacity = max(queue_config.capacity, colo_queue_size)
         × sum(endpoint.max_concurrent_requests)

The result is clamped to [0, 1000]. In practice, only one of queue_config.capacity or colo_queue_size is set — the max simply picks whichever is non-zero.

For the Fair queue, capacity is enforced per account — each account can have up to capacity items in its individual queue. For the Fifo queue, capacity is a global limit on total queue size.
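The formula above, as a small sketch (function name illustrative):

```python
def queue_capacity(queue_config_capacity: int, colo_queue_size: int,
                   endpoint_concurrencies: list[int]) -> int:
    """Sketch of the capacity formula from service-discovery/lib.rs:
    max of the two config values, times total endpoint concurrency,
    clamped to [0, 1000]."""
    per_slot = max(queue_config_capacity, colo_queue_size)
    return max(0, min(1000, per_slot * sum(endpoint_concurrencies)))
```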

Permit System

The PermitManager (permits.rs) tracks concurrency per endpoint. Each endpoint has a concurrency_limit (from EndpointConfig.max_concurrent_requests) and a map of active permits keyed by request ID.

try_acquire_permit() is non-blocking — it checks active_permits() < concurrency_limit and either returns a Permit or None. Permits have release-on-drop semantics: when a Permit is dropped, it calls release() which removes the entry from the map and calls notify.notify_one() to wake the queue processor (permits.rs:376-383).

There are three permit types:

  • Permit: Assigned to a specific request ID. Released on drop (unless detached).
  • UnassignedPermit: Holds a concurrency slot without a request ID. Used by the queue processor to reserve capacity before matching to a queued request. Decrements unassigned_count on drop (permits.rs:308-315).
  • DetachedPermit: A permit that has been persisted elsewhere (e.g. Consul KV) and should NOT be released on drop. Created via permit.detach().
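Release-on-drop maps naturally onto a Python context manager. A rough model of the permit lifecycle — class and method names are illustrative, and the Notify wakeup is reduced to a comment:

```python
class PermitManager:
    """Sketch: non-blocking permit acquisition against a concurrency limit."""
    def __init__(self, concurrency_limit: int):
        self.concurrency_limit = concurrency_limit
        self.active = {}   # request_id -> Permit

    def try_acquire(self, request_id):
        if len(self.active) >= self.concurrency_limit:
            return None                      # caller must queue or reject
        permit = Permit(self, request_id)
        self.active[request_id] = permit
        return permit

class Permit:
    def __init__(self, manager, request_id):
        self.manager, self.request_id = manager, request_id
        self.detached = False

    def detach(self):
        self.detached = True                 # persisted elsewhere; skip release

    def release(self):
        self.manager.active.pop(self.request_id, None)
        # real impl: notify.notify_one() wakes the queue processor here

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        if not self.detached:                # drop semantics: release unless detached
            self.release()
```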

The Three Outcomes

EndpointGroup.find_best_available_endpoint (endpoint_group.rs:149-288) is the core dispatch function. It tries endpoints in priority order and produces one of three outcomes:

  1. Leased: A permit was acquired immediately on a healthy endpoint. The request proceeds to inference with no queuing.
  2. Queued: All endpoints are at capacity, but the queue has room. A oneshot::Receiver is returned — the caller awaits it and gets an endpoint+permit when the queue processor fulfills the request.
  3. NoSlotsAvailable: All endpoints are at capacity AND the queue is full. The request is rejected with an out-of-capacity error.

Background Processor

Each ModelRequestQueue spawns a long-running tokio task (queue.rs:252-264) that:

  1. Waits on a Notify signal (fired when a permit is released or config changes).
  2. Calls fulfill_next_queued() in a loop until it returns false.
  3. fulfill_next_queued() gets the next queued request (respecting the Fair/Fifo algorithm), tries to acquire an unassigned permit on a healthy endpoint, and if successful, sends the (Endpoint, UnassignedPermit) pair through the oneshot channel to the waiting request.

If no healthy endpoints exist when the processor runs, the entire queue is cleared — all waiting requests receive a NoLocalResourcesAvailable error (queue.rs:203-209).

Hot-Swappable Algorithm

When the queue algorithm changes (Fair↔Fifo) via a config update, set_config (queue.rs:88-158) performs a hot swap:

  1. Creates a new queue implementation.
  2. Drains all pending requests from the old queue.
  3. Re-enqueues them into the new queue (skipping closed senders).
  4. Spawns a new processor task and aborts the old one.

Requests that can’t be re-enqueued (e.g. new queue is smaller) receive a NoLocalResourcesAvailable error.


Key Differences

| Aspect | Replicate | Workers AI |
|---|---|---|
| Queue backing | Redis Streams (durable, distributed) | In-memory VecDeque (per-process, volatile) |
| Scope | One stream per deployable, shared across all Director pods | One queue per model per constellation-server instance |
| Durability | Messages survive process restarts and Redis failovers | Messages lost on process restart |
| Consumer model | Consumer groups — any Director can claim any message | Direct dispatch — queue processor matches to local endpoints only |
| Fairness | None — all requests to a deployable are equal | Per-account fair queuing (round-robin) or FIFO, configurable per model |
| Capacity control | Unbounded stream (pruned after 24h) | Bounded: colo_queue_size × total_concurrency, clamped to 1000 |
| Backpressure | None at queue level — requests accumulate until pruned | Immediate rejection (NoSlotsAvailable) when queue is full |
| Stuck message handling | Autoscaler pruner (hourly, 24h threshold) + API sweeper (30s, claims and fails) | Not needed — in-memory queue with sender liveness checks |
| Concurrency control | Director processes one prediction at a time per worker goroutine | Permit-based: PermitManager per endpoint with configurable concurrency_limit |
| Blue/green migration | MultiMessageQueue rotates across Redis backends | Not applicable — no external queue to migrate |

The most significant architectural difference is durability vs bounded backpressure. Replicate’s Redis Streams are durable — a prediction enqueued during a Director outage will be picked up when a new pod starts. But this durability comes with unbounded queue growth and requires two separate cleanup mechanisms (pruner + sweeper) to handle stuck messages. Workers AI’s in-memory queues are volatile but provide immediate backpressure — when capacity is exhausted, the caller gets an error right away rather than waiting for a timeout or pruner cycle.

Model Storage & Images

Model Storage & Images

Overview

Both platforms need to get model weights and inference code onto GPU nodes before serving requests. The approaches diverge sharply. Replicate builds Docker images via cog, then optionally layers on a multi-tier caching system (pget → Hermes → object store) to get weights close to GPU nodes. Workers AI stores model weights in R2 and fetches them at container startup via a dedicated model-greeter init container, with a disk-reaper sidecar managing local cache on each GPU node.


Replicate: Images, Weights, and Caches

Container Image Resolution

Every deployable has a docker_image URI set during cog push (e.g. r8.im/user/model@sha256:...). At pod creation time, the cluster autoscaler resolves this to an internal registry URI (deployable.go:1999-2040):

The r8.im/ prefix is stripped and replaced with the internal MODEL_ARTIFACT_REGISTRY_BASE (an unauthenticated Artifact Registry mirror). Non-r8.im URIs are passed through unmodified.

Two-Container Pods

Every model pod has two containers (deployable.go:460-476):

  1. director — orchestrates the model lifecycle (queue consumption, health polling, state reporting, cache restore/persist). Image comes from the services registry.
  2. model — runs the cog server. Image is the resolved model image.

They interact primarily over HTTP, but also share a supervisor-ipc volume (a 50 MB tmpfs) in some cases, plus a run-cache volume for runtime state.

Standard Startup Path

The model container’s entrypoint (ModelEntrypointScript.sh) runs:

  1. Updates pget binary from PGET_DOWNLOAD_URL (or uses the monobase-bundled version if available).
  2. Optionally upgrades cog (via ENTRYPOINT_COG_OVERRIDE_PACKAGE) or installs hf_transfer.
  3. Starts the cog HTTP server (python -m cog.server.http).

The model’s setup() method runs inside the cog server; this is often the step where weights are downloaded, via pget proxied through Hermes as a pull-through cache.

FUSE (deprecated)

Replicate also built a FUSE-based path (fuse/fuse.go) that separated weights from code entirely: a host-level FUSE daemon served weights on demand, and the model container ran a lightweight monobase image instead of a Docker image with weights baked in. Director acted as a gRPC client to the FUSE mounter, managing mount lifecycle via Start/Heartbeat/Stop RPCs. This eliminated the weight download step from cold starts — weights were read lazily from the FUSE mount as the model accessed them.

The approach is being wound down and remaining FUSE-enabled models are slated for migration to the standard path. The code still exists in getImageURI (monobase fallback) and the FUSE entrypoint script, but no new models use it.

pget: Parallel Chunk Downloader

pget (replicate/pget) downloads model weights in parallel chunks. Key configuration (download/options.go:24-43):

  • CacheableURIPrefixes: Allowlist of domains+path-prefixes eligible for pull-through caching.
  • CacheHosts: Ordered list of cache hostnames used with consistent hashing — the same URL always routes to the same cache host.
  • ForceCachePrefixRewrite: When enabled, rewrites all requests to the first cache host (used for Hermes routing).
  • CacheUsePathProxy: Prepends the original host to the cache request path instead of using host-based routing.
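The same-URL-same-host property can be sketched with a stable hash. Note that this simple modulo scheme is not true consistent hashing (which additionally minimizes remapping when the host list changes), and pget's actual algorithm may differ:

```python
import hashlib

def pick_cache_host(url: str, cache_hosts: list[str],
                    force_first: bool = False) -> str:
    """Route a URL to a cache host: the same URL always maps to the same
    host. force_first models ForceCachePrefixRewrite-style routing to the
    first host (used for Hermes). Illustrative, not pget's code."""
    if force_first:
        return cache_hosts[0]
    digest = hashlib.sha256(url.encode()).digest()
    return cache_hosts[int.from_bytes(digest[:8], "big") % len(cache_hosts)]
```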

Config is injected via K8s ConfigMaps: pget-config (standard) or pget-hermes-config (Hermes-enabled regions).

Hermes: Regional Edge Cache

Hermes (replicate/hermes) is an HTTP read-aside cache deployed in GPU serving regions (CKS, Nebius). It caches model weights in region-local S3-compatible object storage (CoreWeave CAIOS) to avoid repeated cross-region downloads.

Three components:

  • Cache server (server/cacherouter.go): Receives requests from pget. If the file is cached, returns a 307 redirect to a presigned S3 URL. If not cached, redirects to the origin and enqueues a background cache job.
  • Processor: Downloads from origin and uploads to regional S3.
  • Pruner: TTL-based cleanup of cached objects.

HuggingFace traffic is routed through Hermes by setting HF_ENDPOINT=http://hermes.../huggingface.co on all containers in CKS/Nebius regions. The weights.replicate.delivery domain is also rewritten through Hermes when ForceCachePrefixRewrite is enabled.

Torch and CUDA Caches

The director container manages two S3-backed caches (cache/cache.go):

  • Torch Compile Cache: Backs TORCHINDUCTOR_CACHE_DIR. Size range 10 MB–10 GB, 7-day TTL, refreshed daily.
  • CUDA Checkpoint Cache: Backs DIRECTOR_CUDA_CHECKPOINT_DIR.

On startup, Director restores these caches from S3 (after the model reports StatusReady). On shutdown, it persists any changes back. Cache files are stored as .tar.zst archives with timestamp-based naming for ordering.


Workers AI: R2, model-greeter, and disk-reaper

SoftwareConfig and Image Resolution

The ai-scheduler determines what container image and configuration to use for each model. It reads from two sources:

  1. Config API: Model properties including software_name, gpu_memory, and optionally cog_image (config_api/lib.rs:248-249).
  2. R2 model-catalog bucket: SoftwareConfig YAML files at {version}/{software_name}.yaml (catalog/r2.rs).

SoftwareConfig (config/software.rs) defines:

  • image — container image URI
  • api — inference backend type (e.g. pipe-http, tgi)
  • ports — network port configuration
  • mounts — volume mounts (cache dir, disk-reaper socket, etc.)
  • network — firewall allow rules (slirpnetstack)
  • entrypoint — optional override

For Cog models, SoftwareConfig::cog(image) constructs a config with the cog_image from Config API and hardcoded network allow rules for HuggingFace, R2, PyPI, and Replicate domains (software.rs:156-240).

Container Image Protocol

Container images use a cf:// protocol prefix that resolves to registry.cloudchamber.cfdata.org (image_registry_protocol.rs). This is a Cloudchamber concept — the protocol maps to one or more registry domains, providing fallback if one registry is unavailable. In practice, cf://image:tag resolves to registry.cloudchamber.cfdata.org/image:tag.
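The protocol expansion can be sketched as follows — the single-registry default matches the doc, while the fallback-ordering logic is an assumption:

```python
def resolve_image(uri: str,
                  registries=("registry.cloudchamber.cfdata.org",)) -> list[str]:
    """Expand a cf:// image URI to candidate registry URIs, in fallback
    order. Non-cf:// URIs pass through unmodified. Illustrative sketch."""
    prefix = "cf://"
    if not uri.startswith(prefix):
        return [uri]
    rest = uri[len(prefix):]
    return [f"{registry}/{rest}" for registry in registries]
```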

model-greeter: Weight Downloader

model-greeter downloads model files from R2 at container startup. On IKE (external) clusters it runs as a K8s init container; on Cloudchamber (internal) it runs as a sidecar.

The IKE operator configures model-greeter as an init container named install-model-greeter (deployment.rs:615-623) that copies its binary into a shared volume. The main container then uses that binary to download weights.

Environment variables injected by the scheduler (cloudchamber/lib.rs:88-192):

  • R2_ENDPOINT — R2 API endpoint
  • MODEL_CATALOG_R2_BUCKET — bucket name
  • MODEL_CATALOG_VERSION — current catalog version (from Release Manager)
  • SOFTWARE_TO_LOAD — software config name
  • MODEL_TO_LOAD — model identifier

Weights are downloaded to /cache (mounted from the disk-reaper managed volume).

disk-reaper: Local Cache Management

disk-reaper manages the local model cache on GPU nodes. Three modes (model-crd/crd.rs:80-86):

  • Shared (default): Cache backed by a PersistentVolumeClaim shared across pods on the same node. disk-reaper runs as a separate DaemonSet, communicating via Unix socket. Multiple models share the same cache volume.
  • Ephemeral: Cache backed by an emptyDir volume. disk-reaper runs as a sidecar container inside the model pod (deployment.rs:583-585). Cache is lost when the pod terminates.
  • Disabled: No cache management. Cache volume is still an emptyDir but no reaper process runs.

The IKE operator selects the mode based on operator config and the Model CRD’s disk_reaper.mode field. When the operator is configured for ephemeral-only mode, any model requesting Shared is downgraded to Ephemeral (deployment.rs:226-239).
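The downgrade rule is simple enough to state as a function (names illustrative, mirroring the behavior described for deployment.rs:226-239):

```python
def effective_mode(requested: str, operator_ephemeral_only: bool) -> str:
    """Resolve the disk-reaper mode: Shared is downgraded to Ephemeral when
    the operator is configured for ephemeral-only caches."""
    if operator_ephemeral_only and requested == "Shared":
        return "Ephemeral"
    return requested
```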

Inference Backends

Workers AI supports multiple inference server backends, determined by the ai-software= tag on the container and the api field in SoftwareConfig:

| Backend | SoftwareConfig api | Use Case |
|---|---|---|
| Triton | triton | NVIDIA Triton Inference Server |
| TGI | tgi | HuggingFace Text Generation Inference |
| TEI | tei | HuggingFace Text Embeddings Inference |
| PipeHttp | pipe-http | vLLM, Cog models |
| PipeHttpLlm | pipe-http-llm | LLM-specific pipe variant |
| PartnerPipeHttp | partner-pipe-http | Partner-hosted models |

This is a significant difference from Replicate, where Cog is the only inference server.


Key Differences

| Aspect | Replicate | Workers AI |
| --- | --- | --- |
| Weight sources | Mixed — some baked into Docker layers, many downloaded during setup() from HuggingFace, weights.replicate.delivery, and other origins | Single source — R2 model-catalog bucket |
| Weight download | Model’s setup() via pget (parallel chunks, consistent-hash caching) | model-greeter init container/sidecar downloads from R2 |
| Regional caching | Hermes (HTTP read-aside cache, 307 redirects to regional S3) | R2 serves from nearest edge node (region-less by design) |
| Local cache | No persistent local cache for weights | disk-reaper manages shared PVC or ephemeral emptyDir |
| Compilation caches | S3-backed torch compile + CUDA checkpoint caches (7-day TTL) | None |
| Container layout | Two containers: director sidecar + model | One main container + model-greeter init + optional disk-reaper sidecar |
| Inference servers | Cog only | Triton, TGI, TEI, PipeHttp, PipeHttpLlm, PartnerPipeHttp |
| Image protocol | r8.im/ rewritten to internal Artifact Registry | cf:// protocol with multi-registry fallback |

Replicate’s weight delivery is messy by nature. Model authors control what happens in setup() — some models have weights baked into Docker layers, others download from HuggingFace, others pull from weights.replicate.delivery, and some combine approaches. pget and Hermes are optimizations layered on top: pget parallelizes downloads from whatever origin the model uses, and Hermes caches results in regional S3 so subsequent cold starts in the same region avoid cross-region transfers. Neither is strictly required — models work without them, but cold starts are slower.

Workers AI’s approach is more uniform: weights always come from R2 via model-greeter, giving the platform full control over the download path. R2 operates region-lessly — it serves from the nearest edge node that has the data, so there’s no need for explicit regional copies the way Hermes populates region-local S3. The disk-reaper shared PVC adds another layer by keeping weights on-node across pod restarts.

The inference server flexibility is a notable Workers AI advantage. Supporting Triton, TGI, TEI, and vLLM alongside Cog means Workers AI can use purpose-built servers optimized for specific workload types, while Replicate routes everything through Cog.

State Reporting

Overview

Both platforms need to know what their model instances are doing — whether they’re booting, idle, processing requests, or failing.

Replicate’s Director reports instance state to the API every 15 seconds and sends a report when model setup finishes. Individual predictions are tracked via OTel spans. Workers AI logs a structured event for every inference request into Cloudflare’s internal analytics pipeline, and exposes Prometheus metrics for aggregate capacity signals.


Replicate: HTTP Reporting and OTel Spans

Instance State Monitor

Director runs an instance state monitor (instancestate/monitor.go) that tracks what the model instance is doing at any given moment. Three activity states (instancestate/types.go:10-14):

| State | Meaning |
| --- | --- |
| Boot | Instance is starting up (setup running) |
| Idle | Ready but no active predictions |
| Active | Processing ≥1 prediction |

The monitor maintains a concurrency counter (activityLevel). When a prediction starts, the level increments; when it finishes, it decrements. Any level > 0 means Active (types.go:18-23).

Every 15 seconds (DefaultUtilizationInterval), the monitor computes a Metrics payload (types.go:39-55) and POSTs it as JSON to the cluster-local API instance:

| Field | Description |
| --- | --- |
| active_time, boot_time, idle_time | Seconds in each state this interval |
| metric_duration | Total interval length |
| utilization | active_time / metric_duration |
| mean_concurrency | Weighted average concurrency level |
| deployment_key, docker_image_uri | Deployable identity |
| hardware, compute_unit, compute_unit_count | Resource info |
| instance_id, instance_metadata, version | Pod-level metadata |

Utilization is active_time / metric_duration — a number between 0 and 1 representing what fraction of the interval the instance was processing at least one prediction. Mean concurrency captures how many predictions were running simultaneously on average (monitor.go:80-91).
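The interval math above can be sketched directly. This is not Director’s code — it is a minimal illustration, assuming the monitor sees timestamped concurrency-level changes within one reporting interval:

```python
# Sketch: compute utilization (fraction of the interval with >=1 active
# prediction) and time-weighted mean concurrency from level-change events.
def interval_metrics(events, interval_start, interval_end):
    """events: sorted list of (timestamp, concurrency_level_after_change)."""
    active_time = 0.0
    weighted = 0.0
    level = 0
    t = interval_start
    for ts, new_level in events:
        span = ts - t
        if level > 0:
            active_time += span
        weighted += level * span
        level, t = new_level, ts
    span = interval_end - t
    if level > 0:
        active_time += span
    weighted += level * span
    duration = interval_end - interval_start
    return {
        "utilization": active_time / duration,
        "mean_concurrency": weighted / duration,
    }
```

For example, one prediction starting at t=5, a second joining at t=10, and both finishing at t=12 inside a 15-second interval gives utilization 7/15 and mean concurrency 9/15.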

Transport: HTTP POST to DIRECTOR_REPORT_INSTANCE_STATE_URL. Auth via WEBHOOK_AUTH_TOKEN Bearer header. Retries via httpclient.ApplyRetryPolicy (reporter.go:102-140).

Setup Run Reporting

Once when model setup completes (or times out), Director sends an HTTP POST to DIRECTOR_REPORT_SETUP_RUN_URL (reporter.go:142-212). The payload includes:

  • status — terminal state (“ready”, “failed”, etc.)
  • started_at, completed_at — RFC3339 timestamps
  • logs — setup output (truncated to last 20 KiB for OTel spans)
  • instance_metadata — pod name, namespace, etc.
  • runtime_metadata — cog version, cog version override, pod name

Director also records OTel span attributes for setup timing: setuprun.scheduled_to_setup_started, setuprun.scheduled_to_ready, setuprun.duration_seconds, and setuprun.status.

Scale State Snapshots

The cluster autoscaler captures periodic snapshots of each deployable’s queue depth and replica count, stored in Redis (scalestate/scalestate.go):

| Field | Description |
| --- | --- |
| queue_length | Total across blue+green Redis |
| queue_length_blue, queue_length_green | Per-Redis-instance queue depth |
| replica_count | Ready replicas from K8s |
| time | Unix timestamp |

Up to 120 snapshots are retained per deployable (MaxSnapshots). The autoscaler uses this history when making scaling decisions — the snapshot window provides context about recent queue pressure and capacity.
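The retention behavior can be sketched as a bounded history — a minimal illustration only, since the real implementation stores snapshots in Redis rather than in memory:

```python
from collections import deque

MAX_SNAPSHOTS = 120  # MaxSnapshots

# Sketch: a bounded per-deployable snapshot history. Oldest snapshots are
# evicted once the cap is reached, so the autoscaler always sees a recent
# window of queue pressure and capacity.
class ScaleStateHistory:
    def __init__(self):
        self.snapshots = deque(maxlen=MAX_SNAPSHOTS)

    def record(self, queue_blue, queue_green, replicas, ts):
        self.snapshots.append({
            "queue_length": queue_blue + queue_green,
            "queue_length_blue": queue_blue,
            "queue_length_green": queue_green,
            "replica_count": replicas,
            "time": ts,
        })

h = ScaleStateHistory()
for i in range(150):
    h.record(i, 0, 3, ts=i)  # only the last 120 are retained
```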

Two Prometheus gauges are emitted per snapshot (scalestate.go:69-76):

  • autoscaler_snapshot_queue_length — queue depth, labeled by deployable attributes
  • autoscaler_snapshot_replica_count — ready replicas

Scaling decisions are recorded as OTel spans (autoscaler.scaleVersion) with the full snapshot history JSON, old/new replica counts, and all deployable attributes.

Prediction Tracker

Each prediction’s lifecycle is managed by a Tracker (director/tracker.go) that wraps an OTel span and notifies subscribers on state changes. The tracker is not thread-safe — it’s owned by a single worker goroutine.

Prediction states: Processing, Succeeded, Failed, Canceled, Aborted. The tracker records span attributes for cancel reasons, error codes (e.g. E1234), compliance check results, moderation outcomes, and billing metrics (token counts, image counts).

Subscribers receive PredictionUpdate{Prediction, Events} notifications. Events are webhook event types (WebhookEventCompleted, WebhookEventOutput). Terminal updates are held until Stop() to avoid sending duplicate completion events.

pget Download Metrics

Director exposes a POST /metrics endpoint (server/metrics.go) that pget uses to report download metrics — URL, file size, throughput, and errors (pget/pkg/pget.go:148-180). Each metric is converted to an OTel span (director.metric) with attributes from the payload. Rate-limited to 100 req/s. HuggingFace and S3 presigned URLs are normalized to strip query parameters, avoiding high-cardinality span attributes.


Workers AI: Inference Events and Prometheus Metrics

Ready Analytics (Inference Events)

Every inference request produces an InferenceEvent that is serialized as a Cap’n Proto message and sent to logfwdr (Cloudflare’s internal log forwarding daemon) via a Unix domain socket (ready-analytics/src/inference_event.rs, ready-analytics/src/logfwdr.rs). This feeds into Cloudflare’s Ready Analytics pipeline.

Key fields per event (inference_event.rs:42-77):

| Field | Description |
| --- | --- |
| account_id | Cloudflare account ID |
| model_id | Model identifier (e.g. @cf/meta/llama-2-7b) |
| request_time_ms | Total request time |
| inference_time_ms | Actual inference duration |
| queuing_time_ms | Time spent in local queue |
| time_to_first_token | TTFT for streaming responses |
| neurons | Neuron cost metric (billing) |
| error_code | Error code if failed |
| colo_id / source_colo_id | Colo identifiers |
| local_queue_size | Queue depth at request time |
| model_container_id | Container identifier |

Bitfield flags track request properties: streamed, beta model, external endpoint, queued in colo, Unix socket connection (inference_event.rs:13-19).

RequestSource distinguishes Worker bindings, Pages bindings, and REST API requests (inference_event.rs:22-28).

Transport: LogfwdrClient connects to a Unix socket with a 64-slot mpsc channel buffer. 10-second watchdog reconnect on failure. Metrics: connect_err, write_err, ok (logfwdr.rs).

Endpoint Analytics (Lifecycle Events)

The AvailableEndpointTracker (endpoint-analytics/src/lib.rs) tracks endpoint membership changes — when endpoints appear or disappear from the world map. It generates lifecycle events sent to a Workers pipeline via LifecycleEventSender:

| Field | Description |
| --- | --- |
| begin_ts | When endpoint appeared |
| end_ts | When endpoint disappeared (0 if still present) |
| event_type | Endpoint |
| model_id | Model identifier |
| hostname | Tracker’s hostname |

Updates are tied to the world map refresh interval.

Prometheus Metrics

constellation-server exposes Prometheus metrics (metrics/src/lib.rs). Key gauges and counters:

Request-level:

  • requests — total requests by HTTP status
  • errors — errors by internal + HTTP code
  • infer_per_backend — inferences per backend type (Triton, TGI, etc.)
  • infer_per_health_state — inferences by endpoint health state

Queue and capacity:

  • turnstile_outcomes — permit acquisition outcomes (leased/queued/rejected)
  • full_fraction — fraction of time a model is fully saturated
  • total_permits / mean_used_permits — permit availability and usage per model
  • mean_queue_size — average queue depth per model

These queue/capacity metrics are computed from a ModelPermitBuffer — a rolling window of 600 samples at 500 ms intervals (5-minute window) tracking (total_permits, available_permits, queue_size) per model (metrics/src/lib.rs).
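A sketch of that rolling-window computation, using the field names from the text (the exact aggregation in constellation-server is an assumption here):

```python
from collections import deque

# Sketch: a per-model rolling window of (total_permits, available_permits,
# queue_size) samples — 600 samples at 500 ms intervals is ~5 minutes —
# from which summary gauges are derived.
class PermitWindow:
    def __init__(self, max_samples=600):
        self.samples = deque(maxlen=max_samples)

    def push(self, total, available, queue):
        self.samples.append((total, available, queue))

    def metrics(self):
        n = len(self.samples)
        saturated = sum(1 for t, a, _ in self.samples if a == 0)
        return {
            "full_fraction": saturated / n,
            "mean_used_permits": sum(t - a for t, a, _ in self.samples) / n,
            "mean_queue_size": sum(q for _, _, q in self.samples) / n,
        }

w = PermitWindow()
w.push(8, 0, 3)   # saturated, 3 requests queued
w.push(8, 2, 0)
w.push(8, 0, 1)   # saturated again
w.push(8, 8, 0)   # fully idle
```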

Infrastructure:

  • constellation_server_forward_failures — cross-colo forward failures
  • constellation_server_endpoints_removed — endpoints in backoff state
  • constellation_server_external_endpoints — external endpoint health by host/port

ai-scheduler Metrics

The scheduler emits its own Prometheus metrics (prefixed ai_scheduler_) every 60 seconds by polling the Cloudchamber API (crates/metrics/src/lib.rs):

  • model_deployments — deployment count by model, tier, region, status, GPU model
  • requested_instances / min_requested_instances / max_requested_instances — scaling bounds
  • autoscaler_measured_utilization / estimated_utilization / forecast_utilization — utilization signals
  • autoscaler_required_instances — minimum instances to meet target utilization
  • external_instances — external instance count by provider, region, model, phase
  • external_nodes — external GPU count by provider, region, state

Key Differences

| Aspect | Replicate | Workers AI |
| --- | --- | --- |
| Instance state | 3 states (Boot/Idle/Active) with concurrency level, reported every 15s | Per-request inference events; permit/queue metrics from 5-min rolling window |
| Transport | HTTP POST (JSON) to Replicate API | Unix socket (Cap’n Proto) to logfwdr pipeline |
| Granularity | Periodic snapshots (15s utilization, setup completion) | Per-request events + rolling metric windows |
| Setup reporting | Report sent once when setup completes, with logs, timing, metadata | No direct equivalent; endpoint lifecycle events track appearance/disappearance |
| Scaling telemetry | Redis-cached snapshot history (120 snapshots), OTel spans per decision | Prometheus gauges polled every 60s from Cloudchamber API |
| Prediction tracking | Per-prediction OTel span with subscriber notifications | Per-request inference event to analytics pipeline |
| Metrics system | OTel spans → tracing backend | Prometheus (scraped) + Ready Analytics (logfwdr → pipeline) |

Both platforms have multiple reporting streams at different granularities. Replicate has per-prediction OTel spans (Tracker), pget download metrics (POST /metrics), 15-second instance utilization snapshots, and a setup report sent once when setup completes. Workers AI has per-request inference events (Ready Analytics), endpoint lifecycle events, Prometheus metrics (backed by internally-sampled 5-minute rolling windows), and 60-second scheduler polls.

The main structural difference is where aggregation happens. Director’s instance state monitor pre-aggregates utilization into 15-second windows before sending to the API, which forwards to Web for customer-facing reporting. Workers AI’s inference events are sent raw and aggregated by downstream consumers in the analytics pipeline. The Prometheus metrics layer provides pre-computed summaries (like full_fraction and mean_queue_size) for operational use.

Billing Metrics

Overview

Both platforms need to measure what happened during inference so they can bill for it. The billing models are different: Replicate bills on predict time plus per-unit metrics (tokens, images, etc.), while Workers AI bills on “neurons” — an abstract unit representing GPU compute.

How those metrics are collected also differs. Director estimates billing metrics for most models. Only explicitly opted-in trusted models may report their own billing metrics. Workers AI has multiple paths depending on the model server software: Triton-backed models report cost metrics as output tensors that constellation-server converts to neurons, while non-Triton models (omni, infire, partner) calculate neurons themselves and report them via HTTP response headers. Both paths converge at constellation-entry.


Replicate: Model-Reported and Director-Estimated Metrics

Three Flags

Director has three flags that control billing metric behavior (director/config.go:46-48):

| Flag | Purpose |
| --- | --- |
| DIRECTOR_TRUST_BILLING_METRICS | When true, pass through the model’s billing metrics to downstream systems |
| DIRECTOR_CALCULATE_TOKEN_METRICS | When true, Director independently estimates token counts |
| DIRECTOR_CALCULATE_IMAGE_METRICS | When true, Director independently estimates image counts |

The calculate flags are the default path for untrusted models — Director estimates billing metrics from the prediction input and output rather than trusting the model to report them.

Both calculate flags can be enabled alongside trustBillingMetrics. When they are, Director compares its own estimates against the model’s reported values and records match/mismatch as OTel span attributes (token_input_metrics.match, image_count_metrics.match, etc.). This is useful for validating that trusted models report accurately.

Model-Reported Metrics (Trusted Path)

When trustBillingMetrics is true, Director passes through three sets of metrics from the model’s prediction response (tracker.go:524-585):

PublicMetrics — user-visible metrics stored on the prediction. Covers image count, batch size, input/output token counts, and predict time share.

BillingMetrics — internal metrics stored in prediction.InternalMetadata["billing_metrics"]. 35+ fields covering (cog/types.go:68-125):

  • Audio (input/output count, duration)
  • Characters (input/output count)
  • Images (input/output count, megapixels, pixel dimensions, step count)
  • Tokens (input/output count)
  • Video (input/output count, duration, frame counts, megapixel-seconds)
  • Documents (page input count)
  • Training (step count)

BillingCriteria — model variant and configuration that affects pricing, stored in prediction.InternalMetadata["billing_criteria"] (cog/types.go:58-66). Covers model variant, resolution target, motion mode, source/target FPS, and audio flag.

When trustBillingMetrics is false, all three are silently dropped.

Director-Estimated Token Counts

When calculateTokenMetrics is true, Director estimates token counts from the prediction input and output (tracker.go:656-704):

Input tokens: Scans prediction input keys for any containing the substring "prompt" (matches prompt, system_prompt, prompt_template, etc.). Each matching string value is tokenized and the results are summed.

The tokenizer (tracker.go:714-717) is a rough heuristic: split on whitespace, count words, multiply by 4/3.

Output tokens: If the output is a string, run countTokens on it. If it’s an array, count the array length (for streaming models, each element is typically one token).

Timing: When output tokens exist, Director also calculates TimeToFirstToken and TokensPerSecond.
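The estimation rules above translate almost directly into code. A sketch (function names are illustrative; the real logic is Go in tracker.go):

```python
# Sketch of Director's token heuristic: split on whitespace, count words,
# multiply by 4/3. Not a real tokenizer.
def count_tokens(text):
    return int(len(text.split()) * 4 / 3)

def estimate_input_tokens(prediction_input):
    # Sum estimates over every string input whose key contains "prompt"
    # (matches prompt, system_prompt, prompt_template, ...).
    return sum(
        count_tokens(v)
        for k, v in prediction_input.items()
        if "prompt" in k and isinstance(v, str)
    )

def estimate_output_tokens(output):
    if isinstance(output, str):
        return count_tokens(output)
    if isinstance(output, list):
        # Streaming models typically emit one token per array element.
        return len(output)
    return 0
```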

Director-Estimated Image Counts

When calculateImageMetrics is true, Director estimates image counts from the prediction output (tracker.go:629-653):

  • Array output → count = array length
  • Non-empty string output → count = 1
  • Empty or nil output → count = 0

There’s no content inspection — a TODO: Check we're counting images and not something else acknowledges this. If the model already reported via BillingMetrics, Director uses that value instead.
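The output-shape rules and the model-reported override can be sketched as follows (illustrative only — like Director, it inspects the shape of the output, never its content):

```python
# Sketch of Director's image-count estimation: prefer a model-reported
# value from BillingMetrics; otherwise fall back to output shape.
def estimate_image_count(output, reported=None):
    if reported is not None:
        return reported        # model-reported BillingMetrics value wins
    if isinstance(output, list):
        return len(output)     # array output -> count = array length
    if isinstance(output, str) and output:
        return 1               # non-empty string output -> count = 1
    return 0                   # empty or nil output -> count = 0
```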

Metric Flow

  1. Model reports PublicMetrics, BillingMetrics, and BillingCriteria in its prediction response.
  2. Director processes them based on the three flags — passing through trusted metrics, estimating where needed, comparing when both paths are active.
  3. Director sends the prediction to the API via internal webhook. BillingMetrics and BillingCriteria travel in InternalMetadata (opaque to users). PublicMetrics are on the prediction itself (visible to users).
  4. The API forwards to Web for billing aggregation.

Workers AI: Neurons and Multiple Reporting Paths

Neurons

Workers AI bills in neurons — an abstract unit representing GPU compute. Internally, 1 neuron ≈ 0.1 L4 GPU-seconds (baselined at Sept 2023 efficiency). Each model has per-metric neuron coefficients that convert raw usage into a neuron total. The formula (neuron/src/lib.rs:7-36):

total_neurons = cost_per_infer + Σ(neuron_cost × metric_value)

cost_per_infer is an optional flat cost per request. The per-metric multipliers (e.g., input_tokens, output_tokens, image_steps) are configured per model via Consul service config or Deus env vars like INPUT_TOKEN_NEURONS and OUTPUT_TOKEN_NEURONS.
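The formula is straightforward to sketch. Coefficients below are illustrative only — real values live in per-model config (Consul / Deus env vars), and the real implementation is Rust in neuron/src/lib.rs:

```python
# Sketch of the neuron formula:
#   total_neurons = cost_per_infer + sum(neuron_cost * metric_value)
def calculate_neurons(neuron_config, cost_metrics):
    total = neuron_config.get("cost_per_infer", 0.0)
    for metric, value in cost_metrics.items():
        total += neuron_config.get(metric, 0.0) * value
    return total

neurons = calculate_neurons(
    {"cost_per_infer": 1.0, "input_tokens": 0.02, "output_tokens": 0.07},
    {"input_tokens": 100, "output_tokens": 50},
)  # 1.0 + 2.0 + 3.5
```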

Triton Path: Cost Metric Tensors

Triton-backed models — including triton-vllm, which covers the majority of LLMs — report billing metrics as output tensors prefixed with COST_METRIC_. constellation-server extracts these in the Triton result handler (inference/triton/result.rs:196-249):

  1. Iterate over result tensors.
  2. For each tensor named COST_METRIC_*, strip the prefix to get the metric name (e.g., input_tokens, output_tokens).
  3. Sum the tensor values (handles scalar, 1D array, and [N×1] shapes).
  4. Call calculate_neurons(neuron_config, &cost_metric_values) to compute total neurons.
  5. Strip COST_METRIC_* tensors from the response before sending to the client.
  6. Put the neuron total and up to two named cost metrics in the response cf-ai-cserver-meta JSON header.

The client never sees the raw cost metric tensors.
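Steps 1–5 can be sketched in Python (the real handler is Rust in inference/triton/result.rs; tensors here are modeled as name → nested-list values):

```python
# Sketch: split Triton result tensors into client-visible outputs and
# summed cost metrics. The flatten/sum handles scalar, 1-D, and [N x 1]
# tensor shapes.
PREFIX = "COST_METRIC_"

def _total(values):
    if isinstance(values, list):
        return sum(_total(v) for v in values)
    return float(values)

def extract_cost_metrics(tensors):
    metrics, visible = {}, {}
    for name, values in tensors.items():
        if name.startswith(PREFIX):
            # Strip the prefix to get the metric name, sum the values.
            metrics[name[len(PREFIX):]] = _total(values)
        else:
            visible[name] = values  # cost tensors never reach the client
    return visible, metrics

visible, metrics = extract_cost_metrics({
    "text_output": ["Hello!"],
    "COST_METRIC_input_tokens": [[12]],   # [N x 1] shape
    "COST_METRIC_output_tokens": 5,       # scalar
})
```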

Non-Triton Path: Model-Reported Headers

Non-Triton backends report neurons via HTTP response headers that constellation-server passes through transparently:

Omni models (PipeHttp software type) — the omni framework (cloudflare/ai/omni) provides a Python API where model code calls context.cost.set_neurons(value) and context.cost.set_usage_metric(name, value). Omni converts these to cf-ai-neurons and cf-ai-cost-metric-{name,value}-N response headers (omni/shared/src/cost.rs:82-96). Models read their neuron coefficients from the cf-ai-model-config request header that constellation-server forwards from the model’s Consul config.

Infire models (PipeHttpLlm software type) — the newer vLLM deployment path. Neuron coefficients are defined in workers-ai.yaml (e.g., input_token: 0.02561, output_token: 0.07515). The model server calculates and reports neurons via the same cf-ai-neurons header convention.

Partner models (PartnerPipeHttp) — partner-bouncer calculates neurons and emits cf-ai-neurons headers (partner-bouncer/src/server/model.rs:178-211).

Convergence at constellation-entry

All paths converge at constellation-entry, the edge Worker that assembles the billing event (sdk/.../src/lib/headers.ts:20-82):

  1. Parse cf-ai-neurons from response headers (non-Triton path).
  2. Parse cf-ai-cserver-meta JSON and merge its fields — including neurons — into the metrics context (Triton path).
  3. Both sources write to the same raMetrics.neurons field.
  4. Send a Ready Analytics event with the neuron total to the SDK RA table.

Billing reads from the SDK RA table (aiinference_sdk_production_by_namespace_account_sampled), summing neurons × _sample_interval per account.
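The convergence logic can be sketched as follows. The real code is TypeScript in constellation-entry; this Python illustration assumes the merge order implied by steps 1–2 (meta fields, merged second, take precedence):

```python
import json

# Sketch: both paths write the same neurons value. Non-Triton backends send
# a cf-ai-neurons header; the Triton path embeds neurons in the
# cf-ai-cserver-meta JSON header.
def neurons_from_headers(headers):
    meta = json.loads(headers.get("cf-ai-cserver-meta", "{}"))
    if "neurons" in meta:                # Triton path (merged fields win)
        return float(meta["neurons"])
    if "cf-ai-neurons" in headers:       # omni / infire / partner path
        return float(headers["cf-ai-neurons"])
    return 0.0
```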

Streaming

For streaming responses, cost metrics are accumulated across chunks. In the Triton path, constellation-server’s InferenceEventBuilder adds metric values to running totals (inference_event.rs:153-156). In the non-Triton path, constellation-entry accumulates neurons and cost_metric_value_2 from each chunk’s meta field (sdk/.../src/lib/tools.ts:948-959).

No Server-Side Token Counting

No component in the Workers AI stack counts tokens independently. Token counts come from the model — either as COST_METRIC_input_tokens / COST_METRIC_output_tokens tensors (Triton) or as usage metrics reported via context.cost.set_usage_metric() (omni). There is no tokenizer in constellation-server, constellation-entry, or ai-scheduler.


Key Differences

| Aspect | Replicate | Workers AI |
| --- | --- | --- |
| Billing unit | Predict time + per-unit metrics (tokens, images, video, etc.) | Neurons (abstract GPU compute unit, ≈ 0.1 L4-seconds) |
| Who counts | Director estimates (untrusted) or model reports (trusted) | Model always reports — via tensors (Triton) or headers (omni/infire/partner) |
| Token counting | Director: word count × 4/3 heuristic (untrusted). Model: actual counts (trusted). | Model-reported only. No server-side tokenizer. |
| Cost formula | Raw metrics passed to billing system, pricing applied downstream | neurons = cost_per_infer + Σ(neuron_cost × metric_value), computed at inference time or by model code |
| Metric types | 35+ fields across audio, image, video, tokens, training, documents | Arbitrary named metrics (typically 1-3 per model) |
| Trust model | Explicit trust_billing_metrics flag gates model-reported metrics | Implicit — all models report their own metrics, no server-side estimation |
| Reporting paths | Single path through Director | Multiple: Triton tensors, omni SDK, infire headers, partner-bouncer — all converge at constellation-entry |
| Billing table | Web aggregates from prediction metadata | SDK RA table (written by constellation-entry) queried via BigQuery |

Both platforms have a trust asymmetry. Replicate runs untrusted third-party models and must estimate billing metrics for them — only explicitly opted-in trusted models may report their own. Workers AI controls all model deployments, so every model reports its own metrics through one of several backend-specific mechanisms.

Replicate’s BillingMetrics struct with 35+ fields reflects the diversity of model types it supports (image generators, video models, audio models, LLMs, training jobs). Workers AI’s neuron abstraction collapses all of this into a single number — per-model pricing changes only require updating neuron coefficients in config, not model code.

Workload Types Overview

Workload Types in Replicate

Replicate handles several distinct workload types, each with different characteristics and configurations. Understanding these distinctions is essential for comparing with Workers AI, which has a more uniform workload model.

Overview

Replicate defines four primary workload types via the DeployableKind enum (replicate/web/models/models/deployable_config.py:114):

class DeployableKind(models.TextChoices):
    DEPLOYMENT_PREDICTION = "deployment-prediction", "deployment-prediction"
    FUNCTION_PREDICTION = "function-prediction", "function-prediction"
    VERSION_PREDICTION = "version-prediction", "version-prediction"
    VERSION_TRAINING = "version-training", "version-training"

Each workload type has its own deployable_metadata_for_* function in replicate/web/models/logic.py that generates the appropriate configuration.

1. Deployment Predictions

What: Predictions running on a Deployment - a stable, long-lived identifier that routes to a backing model version. The backing version can be changed over time and doesn’t need to be owned by the same account as the deployment.

Characteristics:

  • Persistent deployment entity (configuration/routing), but infrastructure can scale to 0 replicas
  • Custom configuration per deployment that can override many version-level settings
  • Uses dedicated deployment key for consistent routing

Configuration: deployable_metadata_for_deployment_prediction()

Code references:

  • Deployment model: replicate/web/models/models/deployment.py
  • Kind validation: logic.py:1156 - asserts kind == DeployableKind.DEPLOYMENT_PREDICTION and not deployment.used_by_model

Queue behavior: Standard shuffle-sharded queues per deployment

2. Function Predictions (Pipelines/Procedures)

What: Predictions for multi-step workflows (Replicate Pipelines). Function predictions run on a shared container image with CPU-only hardware that “swaps” in procedure source code at prediction time. Similar to hotswaps but specifically for CPU-only Python code rather than GPU model weights.

Characteristics:

  • Run on CPU hardware (no GPU)
  • Share a base container image across procedures
  • Procedure source code is swapped in at runtime (analogous to weight swapping in hotswaps)
  • Part of a larger multi-step workflow
  • Uses AbstractProcedure model, not Version

Configuration: deployable_metadata_for_procedure_prediction()

Code references:

  • Procedure model: replicate/web/models/models/procedure.py
  • Kind validation: logic.py:1165 - when kind == DeployableKind.FUNCTION_PREDICTION, asserts deployment.used_by_model

Director configuration: Director runs procedures with DIRECTOR_JOB_KIND=procedure (director/config.go:37)

Queue behavior: Standard queues, no special routing

3. Version Predictions

What: Predictions running directly on a model version (not through a deployment).

Characteristics:

  • Ephemeral infrastructure (scaled up/down based on demand)
  • Configuration comes from version’s current_prediction_deployable_config
  • Two sub-types: normal and hotswap (see below)

Configuration: deployable_metadata_for_version_prediction()

3a. Normal Version Predictions

What: Standard version predictions without hotswapping.

Characteristics:

  • One container image per version
  • Standard queue routing
  • prefer_same_stream = False (default)

Code: logic.py:1214-1222

if not version.is_hotswappable:
    metadata = DeployableConfigSerializer(
        version.current_prediction_deployable_config
    ).data

3b. Hotswap Version Predictions

What: Versions that share a base container but load different weights at runtime. Multiple “hotswap versions” can run on the same pod by swapping weights instead of restarting containers.

Characteristics:

  • Multiple versions share the same base Docker image
  • Each version has additional_weights (weights loaded at runtime)
  • Base version must be public, non-virtual, and accept Replicate weights
  • prefer_same_stream = True - workers preferentially consume from the same stream to optimize weight locality
  • Uses base version’s deployment key for infrastructure sharing

When a version is hotswappable (version.py:586):

def is_hotswappable(self) -> bool:
    if not self.additional_weights:
        return False
    if not self.base_version:
        return False
    if self.base_docker_image_id != self.base_version.docker_image_relation_id:
        return False
    return self.base_version.is_valid_hotswap_base

Configuration: logic.py:1229-1244

deployable_config_fields = {
    ...
    "docker_image": version.base_version.docker_image_relation,
    "fuse_config": None,
    "key": version.base_version.key_for_hotswap_base_predictions,
    "prefer_same_stream": True,  # KEY DIFFERENCE
}

Queue behavior:

  • Shuffle-sharded queues (like all workloads)
  • DIRECTOR_PREFER_SAME_STREAM=true set for hotswap versions
  • Director workers repeatedly consume from the same stream before checking others
  • Optimizes for weight caching locality (reduce weight download/loading overhead)

Why prefer_same_stream matters:

  • Hotswap versions load different weights into the same container
  • If a worker has already loaded weights for version A, it’s faster to keep processing version A predictions
  • Without prefer_same_stream, workers round-robin across all streams, losing weight locality benefits
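The stream-selection policy above can be sketched as follows (illustrative only — Director's actual consumption loop in Go is more involved):

```python
# Sketch: with prefer_same_stream, a worker keeps draining its last stream
# while it has pending work (preserving weight locality), falling back to
# any other non-empty stream only when it runs dry.
def pick_stream(streams, last, prefer_same_stream):
    """streams: stream name -> pending message count."""
    if prefer_same_stream and last and streams.get(last, 0) > 0:
        return last  # weights for this version are already loaded
    for name, pending in streams.items():
        if pending > 0:
            return name
    return None  # nothing to do
```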

Code references:

4. Version Trainings

What: Training jobs for versions marked as trainable.

Characteristics:

  • Uses separate current_training_deployable_config (not prediction config)
  • Longer timeouts and different resource requirements
  • Different billing model
  • Director runs with DIRECTOR_JOB_KIND=training

Configuration: deployable_metadata_for_version_training()

Code references:

Summary Table

| Workload Type | Infrastructure | Config Source | prefer_same_stream | Director Job Kind |
| --- | --- | --- | --- | --- |
| Deployment Prediction | Persistent | Deployment config | False | prediction |
| Function Prediction | Ephemeral | Procedure config | False | procedure |
| Version Prediction (normal) | Ephemeral | Version prediction config | False | prediction |
| Version Prediction (hotswap) | Shared base | Base version config + weights | True | prediction |
| Version Training | Ephemeral | Version training config | False | training |

Key Takeaways

  1. Replicate has explicit workload type separation with different configurations and behaviors per type
  2. Hotswap predictions are unique - they’re the only workload using prefer_same_stream=True for weight locality optimization
  3. Function predictions use code swapping - similar to hotswaps but for CPU-only Python code instead of GPU model weights
  4. Each workload type has distinct characteristics - different infrastructure models, configuration sources, and queue behaviors

These workload distinctions will be referenced throughout the document when discussing queue behavior, timeouts, scaling, and other operational characteristics.