Introduction
This is a technical comparison of Replicate and Cloudflare Workers AI. It covers how each system handles request routing, autoscaling, resource management, model loading, and observability. When possible, links to source code are included.
Audience
Engineers and engineering leaders on the Workers AI and adjacent ai-platform teams. Sections are written to be useful both as reference material for people building these systems and as context for people making decisions about them.
How to read this
Each section covers both platforms, ending with a “Key Differences” summary. Sections are self-contained — read them in order or jump to what’s relevant.
Scope
This comparison covers the inference serving path and supporting infrastructure: how requests arrive, get routed to models, execute, and return results. In general, it does not cover ancillary tooling or browser level user experience topics.
Request Lifecycle
Overview
A prediction or inference request touches a different set of components on each platform. The diagrams below show the (simplified) full path from client to inference server and back.
Replicate
graph TD
Client([Client])
Web[web]
subgraph GPU cluster
API[api]
Redis[(redis)]
Director[director]
Cog[cog]
end
Client -->|HTTP| API
Web -.->|playground| API
API -->|LPUSH / XADD| Redis
Redis -->|XREADGROUP| Director
Director -->|HTTP| Cog
Cog -->|response| Director
Director -->|webhook| API
API -->|poll / webhook| Client
Key points: the request is asynchronous by default. The client submits a prediction, the API enqueues it, and Director picks it up later. The client polls, receives a webhook when the result is ready, or opts into “blocking” mode at the per-request level. Streaming predictions respond over SSE but still enter through the same queue.
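For orientation, a minimal Go client for the default asynchronous flow might look like the sketch below. Endpoint paths and headers follow Replicate's public API; the token, version hash, and error handling are placeholders, and adding a Prefer: wait header would opt into blocking mode instead of polling.
// Sketch of the default async flow: create a prediction, then poll until a
// terminal state. Token and version hash are placeholders.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	token := os.Getenv("REPLICATE_API_TOKEN")
	body := []byte(`{"version": "<version-hash>", "input": {"prompt": "hello"}}`)

	req, _ := http.NewRequest("POST", "https://api.replicate.com/v1/predictions", bytes.NewReader(body))
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	// Setting "Prefer: wait" here would hold the connection open instead of polling.

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var pred struct {
		ID     string `json:"id"`
		Status string `json:"status"`
	}
	json.NewDecoder(resp.Body).Decode(&pred)

	// Poll until the prediction reaches a terminal state.
	for pred.Status != "succeeded" && pred.Status != "failed" && pred.Status != "canceled" {
		time.Sleep(2 * time.Second)
		get, _ := http.NewRequest("GET", "https://api.replicate.com/v1/predictions/"+pred.ID, nil)
		get.Header.Set("Authorization", "Bearer "+token)
		r, err := http.DefaultClient.Do(get)
		if err != nil {
			panic(err)
		}
		json.NewDecoder(r.Body).Decode(&pred)
		r.Body.Close()
	}
	fmt.Println("terminal status:", pred.Status)
}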
Workers AI
graph TD
Client([Client])
WCE[worker-constellation-entry]
CE[constellation-entry]
CS[constellation-server]
subgraph GPU cluster
GPU[GPU container]
end
Client -->|HTTP| WCE
WCE -->|HTTP| CE
CE -->|pipefitter / HTTP| CS
CS -->|HTTP| GPU
GPU -->|response| CS
CS -->|response| CE
CE -->|response| WCE
WCE -->|response| Client
Key points: the request is synchronous. The client holds an open HTTP connection while the request flows through the stack to a GPU container and back. There is no persistent queue — backpressure is handled by permit-based admission control and cross-colo retry. If no capacity is available, the client gets a 429.
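Because the contract is synchronous, retries on 429 are left to the caller. A hedged Go sketch against the public REST endpoint (account ID, model slug, and token are placeholders):
// Sketch: synchronous Workers AI call with a simple retry on 429 (out of capacity).
// Account ID, model slug, and token are placeholders.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	url := "https://api.cloudflare.com/client/v4/accounts/<account-id>/ai/run/<model-slug>"
	payload := []byte(`{"prompt": "hello"}`)

	for attempt := 0; attempt < 3; attempt++ {
		req, _ := http.NewRequest("POST", url, bytes.NewReader(payload))
		req.Header.Set("Authorization", "Bearer "+os.Getenv("CF_API_TOKEN"))
		req.Header.Set("Content-Type", "application/json")

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			panic(err)
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		if resp.StatusCode == http.StatusTooManyRequests {
			// Capacity temporarily exceeded (internal code 3040): back off and retry.
			time.Sleep(time.Duration(attempt+1) * time.Second)
			continue
		}
		fmt.Println(resp.StatusCode, string(body))
		return
	}
	fmt.Println("gave up after repeated 429s")
}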
Replicate Request Processing
The Replicate request path has three major phases: API-side routing and enqueue, transient queuing in Redis, and Director-side dequeue and execution. Durable state lives in PostgreSQL (via web), not in the queue. The sections below cover each phase in detail.
Behavior varies by workload type (see Workload Types appendix). All workload types share the same core queue and worker infrastructure but differ in configuration.
Queue and Message Handling
Each GPU provider’s managed Kubernetes cluster runs its own Redis cluster (with
sentinel for HA). Director uses Redis Streams for request queuing within each cluster.
Each deployable has a Queue field that specifies the Redis stream name from which
Director pulls work
(cluster/pkg/kubernetes/deployable_config.go:109-112).
Queue names use prefixes that identify the workload type, defined in
api/internal/predictions/queues.go:
QueuePrefixDeploymentPrediction = "input:prediction:dp-"
QueuePrefixFunctionPrediction = "input:prediction:fp-"
QueuePrefixHotswapPrediction = "input:prediction:hp-"
QueuePrefixVersionPrediction = "input:prediction:vp-"
QueuePrefixVersionTraining = "input:training:vt-"
The full queue name combines the prefix with the deployable’s key. The key varies by
workload type, as shown in
web/models/logic.py:826-845:
- Deployment predictions: deployment.key
- Procedure predictions: procedure.key
- Version trainings: version.key_for_trainings
- Hotswap predictions: hotswap_base_version.key_for_hotswap_base_predictions
- Version predictions: version.key_for_predictions
Prediction requests are routed to the correct GPU provider’s replicate/api deployment,
which then pushes to the relevant Redis stream in that cluster. When writing to
the queue, the API uses the queue name from the deployable metadata
(api/internal/logic/prediction.go:239-301).
The queue client is created in
director/director.go:276:
messageQueue, err := redis.NewMessageQueue(queueClient, cfg.RedisInputQueue,
hostname+":"+instanceID, cfg.PreferSameStream, rdb)
Shuffle sharding: As of October 2024, all queues use shuffle sharding by default.
The queue is divided into N virtual queues (Redis streams), and each tenant is assigned M
of these as their “shard” based on a tenant key. When enqueueing, messages go to the
shortest stream in the tenant’s shard. This provides fairness: heavy users are avoided by
other tenants whose shards don’t fully overlap, preventing a single account’s large batch
from blocking others
(go/queue/queue.go:1-32).
The old BasicQueueClient was removed in commit 9da00ca.
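The enqueue side of shuffle sharding can be sketched as follows. This is a hedged Go illustration of the idea only (hypothetical names, not the actual go/queue implementation): hash the tenant key to derive a stable shard of M virtual streams out of N, then enqueue to the shortest stream in that shard.
// Sketch of shuffle-shard stream selection: each tenant gets M of N virtual
// streams; messages go to the shortest stream in the tenant's shard.
// Hypothetical names, not the actual go/queue implementation.
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// tenantShard derives a stable set of stream indices for a tenant.
func tenantShard(tenantKey string, numStreams, shardSize int) []int {
	h := fnv.New64a()
	h.Write([]byte(tenantKey))
	r := rand.New(rand.NewSource(int64(h.Sum64()))) // seeded per tenant, so the shard is stable
	return r.Perm(numStreams)[:shardSize]
}

// pickStream returns the least-loaded stream in the tenant's shard.
func pickStream(tenantKey string, streamLengths []int, shardSize int) int {
	shard := tenantShard(tenantKey, len(streamLengths), shardSize)
	best := shard[0]
	for _, s := range shard[1:] {
		if streamLengths[s] < streamLengths[best] {
			best = s
		}
	}
	return best
}

func main() {
	lengths := []int{0, 12, 3, 0, 7, 1, 0, 4} // per-stream backlog
	fmt.Println(pickStream("tenant-a", lengths, 3))
	fmt.Println(pickStream("tenant-b", lengths, 3))
}
Because two tenants' shards rarely overlap completely, one account's large batch only slows down the streams it shares with others, not the whole queue.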
Queue affinity: The PreferSameStream configuration enables queue affinity where
workers repeatedly consume from the same stream before checking others. This is only
enabled for hotswap version predictions (set to True in
replicate/web/models/logic.py:1243).
All other workload types use False.
Code references:
- Message queue implementation: director/redis/queue.go
- Configuration: director/config.go
- Queue affinity config: director/config.go:50
Worker Loop
Each Director instance runs worker goroutines that poll the queue. The number of workers
is controlled by DIRECTOR_CONCURRENT_PREDICTIONS (default: 1). This is set by cluster
when the model’s Cog config specifies concurrency.max > 1
(cluster/pkg/kubernetes/deployable.go:1149-1153).
Most models run with the default of 1 worker.
The worker loop
(director/worker.go:134)
continuously:
- Polls Redis Stream for messages using consumer groups
- Checks if prediction execution is allowed (spending validation, principal checks)
- Calculates effective deadline from prediction metadata, deployment lifetime, and timeout config
- Creates a Tracker instance for state management
- Executes prediction via Executor interface (which calls into Cog)
- Handles terminal states (success, failure, cancellation)
- Acknowledges message in Redis
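A hedged Go sketch of that loop shape (hypothetical interfaces and types, not the actual director code):
// Hedged sketch of the worker loop shape described above. Names and
// interfaces are hypothetical, not the actual director types.
package main

import (
	"context"
	"time"
)

type Prediction struct {
	ID       string
	Deadline time.Time
}

type Message struct {
	ID         string
	Prediction Prediction
}

type Queue interface {
	Read(ctx context.Context) (Message, error) // XREADGROUP under the hood
	Ack(ctx context.Context, msgID string) error
}

type Executor interface {
	Execute(ctx context.Context, p Prediction) error // HTTP call into Cog
}

func workerLoop(ctx context.Context, q Queue, ex Executor) {
	for ctx.Err() == nil {
		msg, err := q.Read(ctx)
		if err != nil {
			continue // transient read error: poll again
		}
		// Spending validation / principal checks would happen here.
		execCtx, cancel := context.WithDeadline(ctx, msg.Prediction.Deadline)
		_ = ex.Execute(execCtx, msg.Prediction) // tracker reports terminal state via webhooks
		cancel()
		q.Ack(ctx, msg.ID) // acknowledge in Redis once handled
	}
}

func main() {} // interfaces only; wiring omitted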
The worker loop behavior is the same across all workload types. What differs is the job kind passed to the executor:
- Predictions (deployment, version): DIRECTOR_JOB_KIND=prediction (default)
- Trainings: DIRECTOR_JOB_KIND=training
- Procedures (function predictions): DIRECTOR_JOB_KIND=procedure
Note: Internally, Director converts procedure to prediction when sending to Cog
because the API doesn’t yet support the procedure value
(director/worker.go:731-737).
Code references:
- Worker implementation: director/worker.go
- Execution validation: director/worker.go:34-48
- Deadline calculation: director/worker.go:63-107
- Job kind configuration: director/config.go:37
Tracker and State Transitions
The Tracker
(director/tracker.go)
manages state transitions and notifications. It defines four webhook event types:
- start: Prediction begins execution
- output: Intermediate outputs during execution
- logs: Log messages during execution
- completed: Prediction reaches terminal state
The Tracker maintains:
- Prediction state (struct at director/tracker.go:36-69)
- Subscriber functions that receive state updates
- Pending webhook events (if notifications are suppressed temporarily)
- Upload tracking for async file completions
- Moderation state (if content moderation is enabled)
Webhooks are sent to the Replicate API at state transitions. The webhook sender uses a
PreemptingExecutor
(director/executor.go)
that may drop intermediate webhooks if new updates arrive before the previous one is
sent - this is intentional since the latest state supersedes earlier ones. Terminal
webhooks are always delivered.
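The latest-wins behavior amounts to a one-slot mailbox: a new intermediate update replaces any unsent one, while terminal events are always delivered. A hedged Go sketch of the idea, not the PreemptingExecutor itself:
// Sketch of a "latest wins" webhook sender: a one-slot mailbox drops stale
// intermediate updates; terminal events are always delivered.
package main

import (
	"fmt"
	"time"
)

type event struct {
	kind     string // "start", "output", "logs", "completed"
	terminal bool
}

func deliver(e event) { fmt.Println("webhook:", e.kind) }

func sender(updates <-chan event, terminal <-chan event) {
	for {
		select {
		case e := <-terminal:
			deliver(e) // terminal webhooks are always sent
			return
		case e := <-updates:
			deliver(e)
		}
	}
}

func main() {
	updates := make(chan event, 1) // one-slot mailbox
	terminal := make(chan event)

	go sender(updates, terminal)

	// Producer: if the slot is full, drop the stale update and replace it.
	for i := 0; i < 3; i++ {
		e := event{kind: "output"}
		select {
		case updates <- e:
		default:
			select {
			case <-updates: // discard the unsent intermediate event
			default:
			}
			updates <- e
		}
	}
	terminal <- event{kind: "completed", terminal: true}
	time.Sleep(100 * time.Millisecond)
}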
Code references:
- Tracker struct: director/tracker.go:36-69
- Webhook event types: cog/types.go:45-55
Streaming
Director supports Server-Sent Events (SSE) for streaming outputs. Streaming is enabled
when the API includes a stream_publish_url in the prediction’s internal metadata
(cog/types.go:230).
The StreamClient
(director/streaming.go)
publishes output events to this URL as they occur.
Concurrency Control
Director supports concurrent predictions within a single container via
DIRECTOR_CONCURRENT_PREDICTIONS. See Concurrency Deep Dive
for details on configuration and usage patterns.
Workers AI Request Processing
Routing Architecture
Workers AI requests flow through multiple components:
- worker-constellation-entry: Pipeline Worker (TypeScript) that receives API requests from the SDK or REST API. Handles input validation, model lookup, and request formatting before forwarding to constellation-entry (apps/worker-constellation-entry/src/worker.ts).
- constellation-entry: Rust service running on every metal. Fetches model config from QS, selects target colos using routing algorithms, resolves constellation-server addresses, and forwards via pipefitter (gitlab.cfdata.org/cloudflare/ai/constellation-entry).
- constellation-server: Colo-level load balancer (Rust). Manages permits and request queues per model, selects GPU containers via service discovery, handles inference forwarding and cross-colo failover (gitlab.cfdata.org/cloudflare/ai/constellation-server).
- GPU container: Inference server (Triton, TGI, TEI) running the model.
Wiki references:
- System Overview - Component descriptions
- Smart(er) routing - Routing architecture
- Constellation Server Updates Q3 2025 - Detailed request flow
constellation-entry Routing
constellation-entry fetches model configuration from Quicksilver (distributed KV
store), which includes the list of colos where the model is deployed
(src/model_config.rs).
It then applies a routing algorithm to select target colos. Routing algorithms include:
- Random: Random selection among available colos (default)
- LatencySensitive: Prefer nearby colos, with step-based selection
- ForcedColo: Override to a specific colo (for debugging/testing)
- ForcedResourceFlow: Use resource-flow routing (src/resource_flow.rs)
Routing uses a step-based algorithm (StepBasedRouter) that generates a prioritized list
of colos. The algorithm can be customized per-model via routing_params in the model
config.
Once colos are selected, constellation-entry resolves constellation-server addresses via
QS or DNS
(src/cs_resolver.rs)
and forwards the request via pipefitter
(src/pipefitter_transport.rs).
Code references:
- Repository: gitlab.cfdata.org/cloudflare/ai/constellation-entry
- Main entry point: src/main.rs
- Colo resolver: src/cs_resolver.rs
constellation-server Load Balancing
constellation-server runs in each colo and handles:
Request handling (src/main.rs, server/src/lib.rs):
- Listens on Unix domain socket (for pipefitter) and TCP port (src/cli.rs:23,31)
- Extracts routing metadata: model ID, account ID, fallback routing info
- Tracks distributed tracing spans
- Emits metrics to Ready Analytics
Endpoint load balancing and concurrency
(server/src/endpoint_lb.rs):
- Manages an EndpointGroup per model with permit-based concurrency control
- Per-model concurrency limits configured via Config API
- Permit handling in server/src/endpoint_lb/permits.rs
- Explicit request queue (ModelRequestQueue) with FIFO or Fair algorithms (server/src/endpoint_lb/queue.rs)
Service discovery
(service-discovery/src/lib.rs):
- Finds available GPU container endpoints via Consul (service-discovery/src/consul.rs)
- Maintains world map of available endpoints
Inference forwarding
(server/src/inference/mod.rs):
- Regular HTTP requests: Buffers and forwards to GPU container with retry logic
- WebSocket requests: Upgrades connection and relays bidirectionally (no retries)
- Backend-specific handlers for Triton, TGI, TEI in
server/src/inference/
Failover and forwarding
(server/src/colo_forwarding.rs):
- If target returns error or is unavailable, may forward to different colo
- Tracks inflight requests via headers (cf-ai-requests-inflight) to coordinate across instances
- For pipefitter: Returns 500 with headers telling pipefitter to retry different colos
- Handles graceful draining: continues accepting requests while failing health checks
Code references:
- Repository: gitlab.cfdata.org/cloudflare/ai/constellation-server
- CLI args (timeouts, interfaces): src/cli.rs
Wiki references:
- Constellation Server Updates Q3 2025 - Detailed request flow
- Reliable constellation-server failover
- Constellation server graceful restarts
Queuing and Backpressure
When GPU containers are busy, backpressure propagates through the stack:
- constellation-server: find_upstream() attempts to acquire a permit for a GPU endpoint. If all permits are taken, the request enters a bounded local colo queue (ModelRequestQueue). The queue is a future that resolves when a permit frees up (via tokio::sync::Notify). If the queue is also full, the request gets NoSlotsAvailable → ServerError::OutOfCapacity. OOC may also trigger forwarding to another constellation-server in the same colo (server/src/endpoint_lb/endpoint_group.rs:200-277).
- constellation-entry: Receives OOC from constellation-server (error codes 3033, 3040, 3047, 3048). Has a retry loop with separate budgets for errors (up to 3) and OOC (varies by request size). For small requests, pipefitter handles cross-colo forwarding. For large requests (>256 KiB), constellation-entry retries across target colos itself. All OOC variants are mapped to generic 3040 before returning to the client (proxy_server.rs:1509-1600, errors.rs:62-67).
- worker-constellation-entry: Has an optional retry loop on 429/424 responses, but outOfCapacityRetries defaults to 0 (configurable per-model via sdkConfig). By default, OOC goes straight to the client (ai/session.ts:101, 137-141).
- Client: Receives HTTP 429 with "Capacity temporarily exceeded, please try again." (internal code 3040).
There is no persistent queue in the Workers AI stack. Backpressure is handled through permit-based admission control, bounded in-memory queues, cross-colo retry, and ultimately rejection to the client. This contrasts with Replicate’s Redis Streams-based queue, where requests persist until a Director instance dequeues them.
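The permit-plus-bounded-queue pattern can be sketched with a buffered channel as the permit pool and a bounded wait standing in for the per-model queue. A hedged Go illustration of the pattern only; the real implementation is Rust in constellation-server:
// Sketch of permit-based admission control: a buffered channel holds the
// permits; a bounded wait stands in for the per-model request queue.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errOutOfCapacity = errors.New("out of capacity") // surfaces as HTTP 429 / code 3040

type modelLB struct {
	permits chan struct{} // capacity == per-model concurrency limit
}

func newModelLB(maxConcurrency int) *modelLB {
	return &modelLB{permits: make(chan struct{}, maxConcurrency)}
}

// acquire waits up to maxQueueWait for a permit, then gives up.
func (lb *modelLB) acquire(maxQueueWait time.Duration) (release func(), err error) {
	select {
	case lb.permits <- struct{}{}:
	case <-time.After(maxQueueWait):
		return nil, errOutOfCapacity
	}
	return func() { <-lb.permits }, nil
}

func main() {
	lb := newModelLB(2)
	for i := 0; i < 3; i++ {
		release, err := lb.acquire(50 * time.Millisecond)
		if err != nil {
			fmt.Println("request", i, "->", err) // caller retries another colo or returns 429
			continue
		}
		fmt.Println("request", i, "admitted")
		defer release()
	}
}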
State and Observability
Workers AI observability has two main channels from constellation-server:
Ready Analytics
(ready-analytics/src/):
Per-inference events sent via logfwdr (Unix socket). Each InferenceEvent includes:
- Request timing: request_time_ms, inference_time_ms, queuing_time_ms, time_to_first_token
- Billing: neurons, cost metric values
- Routing: colo_id, source_colo_id, colo_path, upstream_address
- Queue state: local_queue_size
- Flags: streamed, beta, external endpoint, queued in colo
See
ready-analytics/src/inference_event.rs
for full field list.
Prometheus metrics
(metrics/src/lib.rs):
Local metrics scraped by monitoring. Includes:
- requests (by status), errors (by internal/HTTP code)
- total_permits, mean_used_permits, mean_queue_size, full_fraction (per model)
- infer_per_backend, infer_per_health_state
- forward_failures (per target colo)
- backoff_removed_endpoints (endpoints in backoff state)
Note: “Workers Analytics” in the public docs refers to analytics for Workers code execution, not GPU inference. GPU inference observability flows through Ready Analytics.
Streaming
constellation-server supports two streaming mechanisms:
HTTP streaming
(server/src/inference/pipe_http.rs):
For PipeHttp backends, the response body can be InferenceBody::Streaming (line
222-230), which passes the Incoming body through without buffering via chunked transfer
encoding. The permit is held until the connection completes (line 396-397 for HTTP/1,
line 472 for HTTP/2). If the backend sends SSE-formatted data (e.g., TGI’s streaming
responses), constellation-server passes it through transparently without parsing.
WebSocket
(server/src/inference/mod.rs:180-400):
Bidirectional WebSocket connections are handled by handle_ws_inference_request. After
the HTTP upgrade, handle_websocket_request relays messages between client and inference
server. A max connection duration is enforced (max_connection_duration_s, line 332-338).
Loopback SSE
(server/src/loopback/):
Separate from client-facing streaming. constellation-server maintains SSE connections to
inference servers that support async batch processing. These connections receive
completion events (InferenceSuccess, InferenceError, WebsocketFinished, GpuFree)
which trigger permit release and async queue acknowledgment. See
loopback/sse.rs
for the SSE client implementation.
Concurrency Control
constellation-server uses permit-based concurrency control per model via
EndpointLoadBalancer
(server/src/endpoint_lb.rs).
Concurrency limits are configured in Config API. When local queuing is enabled, requests
wait in an explicit ModelRequestQueue (FIFO or Fair algorithm) until a permit becomes
available. Different models can have different concurrency limits on the same server.
Timeouts & Deadlines
Replicate Timeout System
Timeout behavior spans multiple components:
- replicate/web - defines timeout values per workload type in deployable metadata
- replicate/api - can inject per-prediction deadlines into prediction metadata
- replicate/cluster - translates deployable config into DIRECTOR_* environment variables when creating K8s deployments; provides default values when the deployable config doesn't specify them (pkg/config/config.go)
- director - reads env vars and enforces timeouts during execution
Configuration
Three timeout parameters control prediction lifecycle. Each flows from replicate/web
through replicate/cluster to director as an environment variable. Director reads them as
flags
(director/config.go:52-58).
Model Setup Timeout (DIRECTOR_MODEL_SETUP_TIMEOUT)
Time allowed for model container initialization. Applied during health check polling
while waiting for the container to become ready. Measures time from StatusStarting to
completion. If exceeded, Director fails the prediction with setup logs and reports to the
setup run endpoint
(director/director.go:718-830).
Resolution chain:
- web: Model.setup_timeout property returns DEFAULT_MODEL_SETUP_TIMEOUT (10 minutes) when the underlying DB field is None (models/model.py:1038-1042). Stored on DeployableConfig.setup_timeout, then serialized as model_setup_timeout with a multiplier and bonus applied (api_serializers.py:1678-1681).
- cluster: Reads deployable.ModelSetupTimeout. If nonzero, uses it; otherwise falls back to config.ModelSetupTimeout (10 minutes) (pkg/config/config.go:74, pkg/kubernetes/deployable.go:1191-1201).
- director: Reads the DIRECTOR_MODEL_SETUP_TIMEOUT env var. Its own flag default is 0 (disabled), but cluster always sets it.
Prediction Timeout (DIRECTOR_PREDICT_TIMEOUT)
Time allowed for prediction execution, starting from when execution begins (not including
queue time). Acts as a duration-based fallback when no explicit deadline is set. Director
enforces a minimum of 30 minutes if configured to 0 or negative
(director/director.go:285-288).
Resolution chain:
- web: Model.prediction_timeout (nullable duration). When set, stored on DeployableConfig.run_timeout and serialized as prediction_timeout (models/model.py:1010-1017, api_serializers.py:1702-1703). When None, the serialized prediction_timeout is omitted.
- cluster: predictTimeout() resolves the value with this priority (pkg/kubernetes/deployable.go:1886-1906):
  1. deployable.PredictionTimeout (from web, if set)
  2. Hardcoded per-account override map userSpecificTimeouts (pkg/kubernetes/deployable.go:88-116)
  3. config.PredictTimeoutSeconds (30 minutes) (pkg/config/config.go:71)
  4. For training workloads, uses the higher of the above and TrainTimeoutSeconds (24 hours)

  A mirrored per-account map exists in web's USER_SPECIFIC_TIMEOUTS (models/prediction.py:60-107) to prevent terminate_stuck_predictions from killing predictions that have these extended timeouts.
- director: Reads the DIRECTOR_PREDICT_TIMEOUT env var. Its own flag default is 1800s (30 minutes), but cluster always sets it.
Max Run Lifetime (DIRECTOR_MAX_RUN_LIFETIME)
Maximum time for the entire prediction lifecycle including queue time and setup.
Calculated from prediction.CreatedAt timestamp.
There are two paths that produce a max run lifetime constraint — one via the deployable config (env var on the Director pod) and one via a per-request header that becomes a deadline on the prediction itself.
Deployable config path (env var):
- web: DeployableConfig.max_run_lifetime defaults to DEFAULT_MAX_RUN_LIFETIME (24 hours) at the DB level (models/deployable_config.py:259-261). For deployment predictions, deployment.max_run_lifetime can override this (logic.py:1175-1176). Serialized as max_run_lifetime (api_serializers.py:1675-1676).
- cluster: Reads deployable.MaxRunLifetime. If nonzero, uses it; otherwise falls back to config.MaxRunLifetime (24 hours) (pkg/config/config.go:75, pkg/kubernetes/deployable.go:1203-1213).
- director: Reads the DIRECTOR_MAX_RUN_LIFETIME env var. Its own flag default is 0 (disabled), but cluster always sets it.
Per-request path (Cancel-After header):
- api: The Cancel-After HTTP header on a prediction request is parsed as a duration (Go-style like 5m or bare seconds like 300). Minimum 5 seconds (server/v1_prediction_handler.go:233-276).
- api: calculateEffectiveDeadline() picks the shorter of the request's Cancel-After value and the deployable metadata's max_run_lifetime, computes an absolute deadline from prediction creation time, and sets it on prediction.InternalMetadata.Deadline (logic/prediction.go:1076-1109).
- director: This deadline is the "per-prediction deadline" checked first in the deadline priority (see Deadline Calculation below).
Deadline Calculation and Enforcement
Director calculates an effective deadline at dequeue time using this priority
(director/worker.go:63-107):
1. Per-prediction deadline (prediction.InternalMetadata.Deadline)
2. Deployment deadline (prediction.CreatedAt + MaxRunLifetime, if configured)
3. Prediction timeout (fallback)
The earliest applicable deadline wins. At execution time
(director/worker.go:516-528):
deadline, source, ok := getEffectiveDeadline(prediction)
var timeoutDuration time.Duration
if ok {
    timeoutDuration = time.Until(deadline)
} else {
    timeoutDuration = d.predictionTimeout
}
timeoutTimer := time.After(timeoutDuration)
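A hedged sketch of the priority selection behind getEffectiveDeadline, with hypothetical field names (the real logic lives in director/worker.go:63-107):
// Sketch of the deadline priority described above: per-prediction deadline,
// then deployment deadline (CreatedAt + MaxRunLifetime), then no deadline
// (the caller falls back to the prediction timeout). Field names are hypothetical.
package main

import (
	"fmt"
	"time"
)

type prediction struct {
	CreatedAt        time.Time
	InternalDeadline time.Time // set by the API from Cancel-After / max_run_lifetime
}

func effectiveDeadline(p prediction, maxRunLifetime time.Duration) (time.Time, string, bool) {
	if !p.InternalDeadline.IsZero() {
		return p.InternalDeadline, "per-prediction deadline", true
	}
	if maxRunLifetime > 0 {
		return p.CreatedAt.Add(maxRunLifetime), "deployment deadline", true
	}
	return time.Time{}, "", false // caller uses DIRECTOR_PREDICT_TIMEOUT instead
}

func main() {
	// Mirrors Scenario 1 below: created 60s ago, 300s max run lifetime -> ~240s remaining.
	p := prediction{CreatedAt: time.Now().Add(-60 * time.Second)}
	deadline, source, ok := effectiveDeadline(p, 300*time.Second)
	fmt.Println(ok, source, "remaining:", time.Until(deadline).Round(time.Second))
}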
Timeout outcomes vary based on when and why the deadline is exceeded:
Before execution (deadline already passed at dequeue):
- Per-prediction deadline → aborted
- Deployment deadline → failed
During execution (timer fires while running):
- Per-prediction deadline → canceled
- Deployment deadline → failed
- Prediction timeout (fallback) → failed
Timeout Behavior Examples
Scenario 1: Version prediction with deployment deadline
- DIRECTOR_MAX_RUN_LIFETIME=300 (5 minutes)
- DIRECTOR_PREDICT_TIMEOUT=1800 (30 minutes)
- Prediction created at T=0, dequeued at T=60s
- Effective deadline: T=0 + 300s = T=300s
- Execution timeout: 300s - 60s = 240s remaining
Scenario 2: Prediction with explicit deadline
- Per-prediction deadline: T=180s (3 minutes from creation)
- DIRECTOR_MAX_RUN_LIFETIME=600 (10 minutes)
- Prediction created at T=0, dequeued at T=30s
- Effective deadline: T=180s (per-prediction deadline wins)
- Execution timeout: 180s - 30s = 150s remaining
Scenario 3: Training with no deadlines
- DIRECTOR_MAX_RUN_LIFETIME=0 (disabled)
- DIRECTOR_PREDICT_TIMEOUT=1800 (30 minutes)
- No per-prediction deadline
- Effective deadline: None (uses fallback)
- Execution timeout: 1800s from execution start
Code references:
- Timeout configuration: director/config.go:52-58
- Deadline calculation: director/worker.go:63-107
- Execution timeout: director/worker.go:516-528
- Setup timeout: director/director.go:718-830
Workers AI Timeout System
Workers AI has a simpler timeout model than Replicate, but the request passes through multiple layers, each with its own timeout behavior.
Request Path Layers
A request from the internet traverses three services before reaching a GPU container:
- worker-constellation-entry (TypeScript Worker): The SDK-facing entry point. Calls constellation-entry via a Workers service binding (this.binding.fetch(...)) with no explicit timeout (ai/session.ts:131). See the note below on Workers runtime limits.
- constellation-entry (Rust, runs on metal): Wraps the entire request to constellation-server in a hardcoded 4-minute tokio::time::timeout (proxy_server.rs:99, proxy_server.rs:831-845). On timeout, returns EntryError::ForwardTimeout. This applies to both HTTP and WebSocket paths.
- constellation-server (Rust, runs on GPU nodes): Enforces the per-model inference timeout (see below).
The 4-minute constellation-entry timeout acts as a hard ceiling. If a model’s
inference_timeout in constellation-server exceeds 240 seconds, constellation-entry will
kill the connection before constellation-server’s own timeout fires. This means
constellation-entry is the effective upper bound for any single inference request, unless
constellation-entry’s constant is changed.
Workers runtime limits: worker-constellation-entry is a standard Worker deployed on
Cloudflare’s internal AI account (not a pipeline First Party Worker). It does not set
limits.cpu_ms in its
wrangler.toml,
so it gets the default 30-second CPU time limit. However, CPU time only counts active
processing — time spent awaiting the service binding fetch to constellation-entry is
I/O wait and does not count
(docs). Since the
Worker is almost entirely I/O-bound, the CPU limit is unlikely to be reached. Wall-clock
duration has no hard cap for HTTP requests, though Workers can be evicted during runtime
restarts (~once/week, 30-second drain). In practice, the Workers runtime layer is
transparent for timeout purposes — constellation-entry’s 4-minute timeout fires well
before any platform limit would.
Configuration Levels
constellation-server Default
(constellation-server/src/cli.rs:107-109):
- --infer-request-timeout-secs: CLI argument, default 30 seconds
- Acts as the minimum timeout floor for all inference requests
- Applied globally across all models served by the constellation-server instance
Per-Model Configuration
(model-repository/src/config.rs:143):
- inference_timeout: Model-specific timeout in seconds (default: 0)
- Configured via Config API (Deus) and fetched from Quicksilver (distributed KV store)
- Part of the WorkersAiModelConfig structure
- Zero value means "use constellation-server default"
Timeout Resolution
When processing an inference request, constellation-server resolves the timeout
(server/src/lib.rs):
let min_timeout = state.infer_request_timeout_secs as u64;
let mut inference_timeout = Duration::from_secs(state.infer_request_timeout_secs as u64);
if let Some(model_config) = state.endpoint_lb.get_model_config(&model_id) {
    inference_timeout = Duration::from_secs(
        model_config.ai_config.inference_timeout.max(min_timeout)
    );
}
The effective timeout is:
max(model_config.inference_timeout, infer_request_timeout_secs). This ensures:
- Models can extend their timeout beyond the 30-second default
- Models cannot reduce their timeout below the constellation-server minimum
- Unconfigured models (inference_timeout=0) use the 30-second default
Enforcement
constellation-server enforces the timeout by wrapping the entire inference call in
tokio::time::timeout(inference_timeout, handle_inference_request(...))
(server/src/lib.rs:588-599).
When the timeout elapses, the future is dropped, which closes the connection to the GPU
container. The error maps to ServerError::InferenceTimeout (line 603).
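The Go analogue of wrapping the whole inference future in a timeout is a context deadline around the forwarding call; a sketch of the pattern only, not the Rust code:
// Go analogue of wrapping the inference call in tokio::time::timeout:
// a context deadline around the forward-to-GPU call.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errInferenceTimeout = errors.New("inference timeout") // stands in for ServerError::InferenceTimeout

func forwardToGPU(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Second): // pretend inference takes 2s
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func handleInference(inferenceTimeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), inferenceTimeout)
	defer cancel() // cancelling abandons the upstream call, like dropping the future

	if err := forwardToGPU(ctx); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			return errInferenceTimeout
		}
		return err
	}
	return nil
}

func main() {
	fmt.Println(handleInference(1 * time.Second)) // inference timeout
	fmt.Println(handleInference(3 * time.Second)) // <nil>
}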
The other end of the dropped connection varies by model backend
(service-discovery/src/lib.rs:39-47):
- Triton: Upstream NVIDIA tritonserver binary (launched via model-greeter). Disconnect behavior depends on Triton's own implementation.
- TGI/TEI: Upstream HuggingFace binaries. Disconnect behavior depends on the upstream Rust server implementation.
- PipeHttp/PipeHttpLlm: Custom Python framework (WorkersAiApp in ai_catalog_common, FastAPI-based). The synchronous HTTP path (_generate) does not explicitly cancel the raw_generate coroutine on client disconnect — the inference runs to completion (catalog/common/.../workers_ai_app/app.py:880-896). WebSocket connections do cancel processing tasks in their finally block (catalog/common/.../workers_ai_app/app.py:536-553).
The constellation-server timeout covers:
- Time waiting in the constellation-server queue (permit acquisition)
- Forwarding to the GPU container
- Inference execution in the backend (Triton, TGI, TEI)
- Response transmission back through constellation-server
But constellation-entry’s 4-minute timeout wraps the entire round-trip to constellation-server, so it is the effective ceiling regardless of per-model config.
Unlike Director’s system, there is no separate setup timeout. Model containers are managed by the orchestrator (K8s) independently of request processing. Container initialization and readiness are handled by service discovery and health checks, not request-level timeouts.
Timeout Behavior Examples
Scenario 1: Text generation model with default timeout
- Model config: inference_timeout=0 (not configured)
- constellation-server: infer_request_timeout_secs=30
- Effective timeout: 30 seconds
Scenario 2: Image generation model with extended timeout
- Model config: inference_timeout=120 (2 minutes)
- constellation-server: infer_request_timeout_secs=30
- Effective timeout: 120 seconds (model config wins)
Scenario 3: Model timeout exceeds constellation-entry ceiling
- Model config: inference_timeout=300 (5 minutes)
- constellation-server: infer_request_timeout_secs=30
- constellation-entry: 240 seconds (hardcoded)
- constellation-server effective timeout: 300 seconds
- Actual outcome: constellation-entry kills the connection at 240 seconds
Scenario 4: Attempted timeout reduction (not allowed)
- Model config: inference_timeout=10 (10 seconds)
- constellation-server: infer_request_timeout_secs=30
- Effective timeout: 30 seconds (constellation-server minimum enforced)
Code references:
- worker-constellation-entry service binding call (no timeout): ai/session.ts:131
- constellation-entry 4-minute timeout: proxy_server.rs:99, proxy_server.rs:831-845
- constellation-server CLI configuration: cli.rs:107-109
- Model config structure: model-repository/src/config.rs:100-150
- Timeout resolution and enforcement: server/src/lib.rs
Key Differences
Complexity and Granularity:
- Director: Multi-phase timeout system with separate controls for setup (model initialization) and execution (inference), plus deployment-level lifetime limits
- Workers AI: Two-layer timeout — constellation-entry imposes a hard 4-minute ceiling, constellation-server applies per-model timeouts within that ceiling
Deadline Calculation:
- Director: Priority-based deadline system with per-prediction, deployment, and fallback timeouts. Calculates earliest applicable deadline at dequeue time, accounting for time already spent in queue
- Workers AI: Simple max() operation between model config and constellation-server default, applied at request receipt time
Queue Time Handling:
- Director:
MaxRunLifetimeincludes queue time; execution timeout accounts for time already elapsed since creation - Workers AI: Timeout starts when constellation-server receives the request, inherently includes queue time
Setup vs Runtime:
- Director: Explicit
ModelSetupTimeoutfor container initialization, separate from prediction execution timeout - Workers AI: No request-level setup timeout; container lifecycle managed independently by orchestrator
Configuration Source:
- Director: Environment variables set by cluster service during deployment creation
- Workers AI: CLI argument (constellation-server) + distributed config store (Quicksilver/Config API)
Default Values:
- Director: 30 minutes prediction timeout (generous for long-running inference), setup and lifetime timeouts disabled by default
- Workers AI: 30 seconds base timeout (optimized for fast inference), extended per-model via config, hard ceiling of 4 minutes at constellation-entry
Minimum Enforcement:
- Director: 30-minute minimum enforced for prediction timeout if misconfigured
- Workers AI: constellation-server minimum enforced as floor, models can only extend
Use Case Alignment:
- Director: Designed for variable-length workloads including multi-hour training runs; flexible per-deployment configuration
- Workers AI: Optimized for inference workloads with known latency profiles; centralized model-specific configuration
Job Types
Replicate Job Types
Director supports three job kinds, configured per-deployment via DIRECTOR_JOB_KIND
(director/config.go:37):
- prediction (default)
- training
- procedure
The job kind determines which cog HTTP endpoint Director calls, how webhooks are structured, and what timeout behavior applies.
How Job Kind Is Set
Cluster sets DIRECTOR_JOB_KIND when generating the Director pod spec
(pkg/kubernetes/deployable.go:1268-1280):
- isProcedureMode() → procedure
- isTraining() → training
- Neither → default prediction
Where:
- isProcedureMode() = UseProcedureMode || UseUnifiedPipelineNodePool (deployable_config.go:234)
- isTraining() = TrainingMode != nil && *TrainingMode (deployable_config.go:321)
DeployableKind (Web Side)
The web app distinguishes four deployable kinds
(deployable_config.py:114-120):
| DeployableKind | Director JobKind | Redis Queue Pattern | Key Pattern |
|---|---|---|---|
| DEPLOYMENT_PREDICTION | prediction | input:prediction:{key} | dp-{uuid4_hex} |
| FUNCTION_PREDICTION | procedure | input:prediction:{key} (or UNIFIED_PIPELINE_QUEUE) | fp-{hash} |
| VERSION_PREDICTION | prediction or procedure | input:prediction:{key} (or UNIFIED_PIPELINE_QUEUE) | vp-{docker_image_id[:32]} |
| VERSION_TRAINING | training | input:training:{key} | vt-{docker_image_id[:32]} |
Key prefixes: dp- = deployment, fp- = function/pipeline, vp- = version prediction,
vt- = version training. FUNCTION_PREDICTION is used for pipelines (procedures) — see
the Procedures section below.
Queue names are generated by DeployableConfig.queue
(deployable_config.py:565-589).
OfficialModel and ModelDeployment
Some models expose a “versionless API” — users run predictions against a model name
(POST /v1/models/{owner}/{name}/predictions) rather than a specific version hash.
OfficialModel (current production) — Ties a Model to a backing Deployment and a
BillingConfig
(official_model.py).
From the end user’s perspective, an OfficialModel behaves like a Deployment — it has
scaling, hardware selection, and version pinning — but the user interacts with it via the
model name, not a deployment name. The actual infrastructure (K8s deployment, Director
pods, Redis queues) runs through the backing Deployment.
ModelDeployment (stalled replacement) — Intended to replace OfficialModel by
associating a Model with a Deployment (and optionally a Procedure) via a labeled
relationship
(model_deployment.py).
The migration from OfficialModel to ModelDeployment was started but the work stalled and
was rolled back. Both entities exist in the codebase and the dispatch logic checks for a
default ModelDeployment first, falling back to OfficialModel
(api_internal_views.py:303-326).
TODO comments in the code (e.g. “Drop is_official from here once OfficialModels are
gone” at
model.py:1322)
reflect the intended direction.
A model uses_versionless_api when it has either an OfficialModel or a default
ModelDeployment
(model.py:1321-1323).
Predictions
The default job kind. Director sends PUT /predictions/{id} to cog
(cog/client.go:183-184).
Three prediction subtypes exist, distinguished by how the deployable is created:
Version predictions — Direct runs against a model version. Created when a user runs a
prediction against a specific version ID. Key: vp-{docker_image_id[:32]}
(version.py:618-619).
Deployment predictions — Runs through a named deployment. The deployment owns the
scaling config, hardware selection, and version pinning. Kind is DEPLOYMENT_PREDICTION
with key prefix dp-
(deployment.py:611,
deployment.py:1024-1027).
Hotswap predictions — A version that has additional_weights and a base_version.
The base container stays running and loads different weights (e.g. LoRA adapters) at
runtime instead of booting a new container per version.
Hotswap requirements
(version.py:586-594):
- Version has additional_weights
- Version has a base_version
- Version's base docker image matches the base version's docker image
- Base version’s model is public, image is non-virtual, and image accepts replicate weights
Hotswap key: hp-{base_version.docker_image_id[:32]}
(version.py:605-608).
Multiple hotswappable versions sharing the same base version share the same key, so they
share the same Director pool.
Hotswap uses prefer_same_stream=True
(logic.py:1243),
which tells Director’s Redis message queue to prefer consuming from the same stream shard
repeatedly
(redis/queue.go:248-251).
This increases the chance that a Director instance serves the same version’s weights
consecutively, avoiding unnecessary weight reloads. PreferSameStream is incompatible
with batching (MaxConcurrentPredictions > 1)
(config.go:90-92).
Training
DIRECTOR_JOB_KIND=training. Director sends PUT /trainings/{id} (if cog >= 0.11.6),
otherwise falls back to PUT /predictions/{id}
(cog/client.go:279-311).
Training-specific behaviors:
- Queue prefix is input:training: instead of input:prediction:
- Timeout uses the higher of the prediction timeout and TrainTimeoutSeconds (24 hours) (deployable.go:1901)
- Prediction.Destination field is present (specifies where trained weights go) (cog/types.go:182)
- Concurrency is always 1 (deployable_config.py:417-418)
- Webhook payload wraps the prediction in a Job with Kind: "training" and the prediction in the Training field instead of Prediction (worker.go:746-749)
Training is set via DeployableConfig.training_mode which maps to
DeployableKind.VERSION_TRAINING
(deployable_config.py:591-595).
Procedures
DIRECTOR_JOB_KIND=procedure. Director sends PUT /procedures/{id}
(cog/client.go:54).
Procedures are multi-step workflows (pipelines) that run user-provided Python code on a
shared runtime image (pipelines-runtime). Unlike predictions and trainings, the
container image is not user-built — it’s a standard runtime that loads user code at
execution time.
Procedure source loading:
Before executing, Director downloads the procedure source archive (tar.gz) from a signed
URL, extracts it to a local cache directory, and rewrites the
procedure_source_url in the prediction context to a file:// path
(worker.go:540-548).
The procedures.Manager handles download, caching (keyed by URL hash), and extraction
(internal/procedures/manager.go).
Webhook compatibility hack:
When sending webhooks back to the API, Director overwrites the procedure job kind to
prediction because the API doesn’t understand the procedure kind yet
(worker.go:734-741).
Procedure subtypes in web:
- Procedure — persistent, owned by a user/org, tied to a model. Uses DeployableKind.FUNCTION_PREDICTION (procedure.py:230).
- EphemeralProcedure — temporary/draft, used for iteration. Source archives expire after 12 hours. Uses DeployableKind.VERSION_PREDICTION (procedure.py:42).
Both can route to the UNIFIED_PIPELINE_QUEUE if configured
(deployable_config.py:573-584),
which is a shared warm compute pool.
Workers AI Job Types
Workers AI has a single job type: inference. There is no training, no procedure/pipeline execution, and no multi-step workflow at the request level.
Task Types (Schema-Level Only)
worker-constellation-entry defines 14 task types in taskMappings
(catalog.ts):
- text-generation, text-classification, text-embeddings, text-to-image, text-to-speech
- automatic-speech-recognition, image-classification, image-to-text, image-text-to-text
- object-detection, translation, summarization, multimodal-embeddings
- dumb-pipe (passthrough, no input/output transformation)
Each task type is a class that defines:
- Input schema validation and preprocessing
- Payload generation (transforming user input to backend format)
- Output postprocessing
The task type is determined by the model’s task_slug from the config API. The
infrastructure path (constellation-entry → constellation-server → GPU container) is
identical regardless of task type. Task types are purely a worker-constellation-entry
concern — constellation-entry and constellation-server are task-agnostic.
LoRA at Inference Time
constellation-server supports LoRA adapter loading for Triton-backed models
(inference/triton.rs:142-214).
When a request includes a lora parameter:
- Finetune config is fetched from the config API (Deus) via the model-repository crate
- Config is validated: the finetune must exist for the current model and all required assets must be present
- lora_config and lora_name are injected as additional Triton inference inputs
This is inference-time adapter loading, not training. The base model stays loaded and the
adapter is applied per-request. There is no LoRA-aware routing or stickiness —
constellation-server’s endpoint selection and permit system are unaware of the lora
field, so consecutive requests for the same adapter may hit different GPU containers.
Async Queue
Two conditions must be met for async execution:
- The model must have use_async_queue enabled in the config API (Deus) — a per-model property, default false (config.rs:134)
- The requester must opt in per-request via options.queueRequest: true in the JSON body (ai.ts:133-144)
When both are true, the request is POSTed to the async-queue-worker service binding
(queue.ts:23).
The response is either 200 (completed immediately) or 202 (queued). To retrieve results,
the requester sends a follow-up request with inputs.request_id set, which triggers a GET
to the queue service
(ai.ts:148-153).
Limitations:
- Polling only — no webhook/callback support
- No streaming — inputs.stream is explicitly deleted when queueRequest is true (ai.ts:135)
This contrasts with Replicate where every prediction is async by default (webhook-driven),
and Prefer: wait is the opt-in for synchronous behavior. Workers AI is synchronous by
default, with async as an opt-in that requires both model-level enablement and per-request
signaling.
Key Differences
Job type diversity:
- Replicate: Three distinct job kinds (prediction, training, procedure) with different cog endpoints, timeout behaviors, queue prefixes, and webhook structures
- Workers AI: Single job type (inference) with task-type differentiation only at the input/output schema level
Training support:
- Replicate: First-class training support with dedicated queue prefix, extended timeouts (24h), destination field for trained weights, and separate cog endpoint
- Workers AI: No training capability. LoRA adapters are loaded at inference time from pre-trained weights, not trained on the platform
Multi-step workflows:
- Replicate: Procedures allow user-provided Python code to run on a shared runtime, enabling multi-model orchestration and custom logic. Source code is downloaded and cached per-execution
- Workers AI: No equivalent. Multi-step orchestration is left to the caller
Weight loading model:
- Replicate: Hotswap mechanism allows a base container to serve multiple versions by loading different weights at runtime. Uses prefer_same_stream to optimize for cache locality across a shared Director pool
- Workers AI: LoRA adapters are injected per-request at the Triton level. No concept of a shared base container serving multiple "versions" — each model ID maps to a fixed deployment
Job type awareness:
- Replicate: The same infrastructure (cluster, Director, cog) handles all three job kinds. Job kind is a configuration axis that changes queue prefixes, cog endpoint paths, timeout calculations, and webhook payload structure — but the systems are the same
- Workers AI: Task type is a thin layer in worker-constellation-entry only. Everything downstream (constellation-entry, constellation-server, GPU containers) is task-agnostic
Deployment Management
Replicate: Autoscaler
The autoscaler runs in every GPU model-serving Kubernetes cluster. It manages K8s
Deployment objects in the models or serving namespaces — creating them when prediction
traffic appears, scaling replica counts based on queue depth, and pruning idle ones.
Source: cluster/pkg/autoscaler/autoscaler.go
(fa8042d)
Concurrent loops
| Loop | Interval | Purpose |
|---|---|---|
| Deployment Dispatcher | event-driven (BRPOP) | Creates/updates K8s Deployments when predictions arrive |
| Scaler | 1 second | Adjusts replica counts based on queue depth |
| Scale State Snapshotter | 1 second | Captures queue lengths + replica counts into Redis |
| Queue Pruner | 1 hour | Deletes stuck requests older than 24 hours |
| Deployment Pruner | 1 minute | Deletes K8s Deployments that have been idle too long |
Deployment Dispatcher
┌───────────────┐
│ replicate/api │
└──────┬────────┘
│ LPUSH
▼
┌──────────────────────┐
│ prediction-versions │
│ (Redis list) │
└──────┬───────────────┘
│ BRPOP
▼
┌──────────────────────┐
│ Deployment │
│ Dispatcher │
│ (event-driven) │
└──────┬───────────────┘
│ create/update
▼
┌──────────────────────┐
│ K8s Deployments │
│ (models / serving) │
└──────────────────────┘
startDeploymentDispatcher BRPOPs from the
prediction-versions Redis queue. Each prediction triggers
ensureDeployableDeployment, which creates or
updates the K8s Deployment if the config has changed. Change detection uses a
config hash comparison plus a template version serial.
Key behavior:
- Deployments are created on demand — the first prediction for a version/deployment triggers K8s Deployment creation
- Rate-limited to avoid overwhelming the K8s API
- Config hash comparison means no-op if nothing changed
Scaler
┌──────────────────────┐
│ K8s Deployments │
│ (models / serving) │
└──────┬───────────────┘
│ read replicas + queue lengths
▼
┌──────────────────────┐
│ Snapshotter (1s) │
└──────┬───────────────┘
│ write
▼
┌──────────────────────┐
│ Redis cache │
│ (scale state) │
└──────┬───────────────┘
│ read
▼
┌──────────────────────┐
│ Scaler (1s) │
│ computeNewReplica │
│ CountV2 │
└──────┬───────────────┘
│ patch replicas
▼
┌──────────────────────┐
│ K8s Deployments │
│ (models / serving) │
└──────────────────────┘
startScaler runs every 1 second. For each deployable found
in K8s, it loads the scale state from Redis cache and calls
scaleDeployable().
The scaling algorithm (computeNewReplicaCountV2)
is HPA-inspired:
desiredReplicas = currentReplicas × (metricValue / targetMetricValue)
Where the metric is backlog-per-instance:
backlogPerInstance = (queueLength + queueHeadroom) / instances
Three dampening mechanisms prevent oscillation:
- Scaling policies — rate limits on scale-out and scale-in. Default scale-out: allow +5 count or +100% per minute (whichever is larger). Default scale-in: unrestricted rate.
- Stabilization windows — the algorithm considers the min (for scale-in) or max (for scale-out) desired replica count over a time window. Default scale-out: 30 seconds. Default scale-in: 2 minutes.
- Hysteresis — ignore small oscillations below a threshold (default 0.02).
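Putting the formula and the hysteresis check together (scaling policies and stabilization windows omitted), a hedged Go sketch with illustrative names and constants, not the autoscaler's own code:
// Sketch of the HPA-style calculation: desired replicas scale with
// backlog-per-instance relative to the target, with a hysteresis band
// that ignores small ratios. Illustrative only.
package main

import (
	"fmt"
	"math"
)

func desiredReplicas(current, queueLength, queueHeadroom int, targetBacklogPerInstance, hysteresis float64) int {
	if current == 0 {
		if queueLength > 0 {
			return 1 // wake up from zero on any backlog
		}
		return 0
	}
	backlogPerInstance := float64(queueLength+queueHeadroom) / float64(current)
	ratio := backlogPerInstance / targetBacklogPerInstance
	if math.Abs(ratio-1.0) <= hysteresis {
		return current // within the hysteresis band: no change
	}
	return int(math.Ceil(float64(current) * ratio))
}

func main() {
	// 4 replicas, 40 queued items, target backlog of 5 per instance -> scale out to 8.
	fmt.Println(desiredReplicas(4, 40, 0, 5.0, 0.02))
	// Backlog roughly at target -> unchanged.
	fmt.Println(desiredReplicas(4, 20, 0, 5.0, 0.02))
}
In the real system, the result of this calculation is then rate-limited by the scaling policies and smoothed over the stabilization windows described above before replicas are patched.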
Additional scaling features:
- Slow start: cap at 5 replicas until the first pod reports ready
- Scale-to-zero: supported, with configurable idle timeout delay. Gated by the scale-to-zero-delay kill switch.
- Override min replicas: per-deployable minimum via DeployableConfig
- Emergency cap: cap-max-replicas feature flag
Scaling configuration
scaling.Config holds per-deployable scaling parameters:
| Field | Default | Source |
|---|---|---|
| MetricTarget | config.BacklogPerInstance flag | CLI flag |
| Hysteresis | 0.02 | hardcoded |
| MinReplicas | 0 | DeployableConfig.ScalingConfig |
| MaxReplicas | config.MaxReplicas flag | CLI flag |
| ScaleOut behavior | 30s stabilization, +5 or +100%/min | algorithm_v2_defaults.go |
| ScaleIn behavior | 2min stabilization, no rate limit | algorithm_v2_defaults.go |
Per-deployable overrides come from deployable.ScalingConfig,
which is set via the web’s DeployableConfig and serialized
into K8s annotations.
Deployment Pruner
┌──────────────────────┐
│ Redis cache │
│ (last-request-time) │
└──────┬───────────────┘
│ read
▼
┌──────────────────────┐
│ Deployment Pruner │
│ (1min) │
└──────┬───────────────┘
│ delete idle
▼
┌──────────────────────┐
│ K8s Deployments │
│ (models / serving) │
└──────────────────────┘
startDeploymentPruner runs every 1 minute. It
deletes K8s Deployments that haven’t received a prediction since
DeployableDeploymentDeleteInactiveAfter.
Max 100 deletions per cycle. Fails safe if the latest request time is missing
from Redis (skips the deployment rather than deleting it).
Queue Pruner
┌──────────────────────┐
│ Redis streams │
│ (dp-*, vp-*, …) │
└──────┬───────────────┘
│ scan for entries > 24h
▼
┌──────────────────────┐
│ Queue Pruner (1h) │
└──────┬───────────────┘
│ XACK + XDEL
▼
┌──────────────────────┐
│ Redis streams │
│ (dp-*, vp-*, …) │
└──────────────────────┘
startQueuePruner runs every 1 hour. Deletes stuck requests older
than 24 hours from the per-deployable prediction request streams (e.g. dp-*, vp-*) —
both pending (claimed but unprocessed) and unprocessed entries. Note: replicate/api’s
sweeper also removes past-deadline messages from these same streams at much shorter
intervals.
Workers AI: ai-scheduler
Workers AI deployment management is split across multiple systems.
ai-scheduler is a Rust binary deployed to a core datacenter K8s
cluster (pdx-c). It has multiple subcommands, each run as a
separate K8s deployment: AutoScaler, Scheduler, AdminAPI,
ReleaseManagerWatcher, ExternalNodesWatcher, RoutingUpdater.
They share the scheduling crate as a library.
Source: ai-scheduler/ (89f8e0d)
Architecture overview
┌──────────────────────────────────────────────────────────┐
│ ai-scheduler │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────┐ │
│ │ auto_scaling │ │ admin_api │ │ release_mgr │ │
│ │ (5min loop) │ │ (manual ops) │ │ _watcher │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬─────────┘ │
│ │ │ │ │
│ └────────┬────────┴──────────────────┘ │
│ ▼ │
│ ┌──────────────┐ │
│ │ scheduling │ (action execution) │
│ └──────┬───────┘ │
│ │ │
└────────────────┼─────────────────────────────────────────┘
│
┌────────┴────────┐
▼ ▼
Cloud Chamber External K8s
(internal GPU) (OKE, CoreWeave, Nebius,
Lambda, GCP, Crusoe)
│
▼
inference-kubernetes-engine
(Model CRD operator)
Three entry points produce scheduling actions:
- auto_scaling — utilization-based autoscaler loop
- admin_api — manual scaling endpoints for humans
- release_manager_watcher — watches for software version changes, triggers rolling updates
All actions flow through the scheduling app, which applies them to either
Cloud Chamber (internal capacity) or external K8s clusters via the
external_nodes module.
Autoscaler (auto_scaling)
The autoscaler runs every 5 minutes. Each cycle:
- Fetches model usage from ClickHouse (request counts, inference time per minute over 15-minute windows)
- Fetches usage “forecast” from ClickHouse (same-time-last-week data, not a forecasting model)
- Fetches utilization metrics from Prometheus (soft — errors produce empty data, not failures)
- Loads model configuration from Config API
- Gets current application state from Cloud Chamber
- Fetches external endpoint health from Quicksilver and counts healthy deployments from Cloud Chamber
- Handles external capacity scheduling (may emit ScheduleExternalModelApplications)
- Computes desired instance count per model
- Emits UpscaleModelApplications or DownscaleModelApplications actions
Scaling algorithms
Three algorithms coexist. Selection depends on per-model Config API properties:
1. Request-count-based (default fallback):
instances = avg_count_per_min × autoscaling_factor
Default autoscaling_factor is 0.8. This assumes each inference request takes
~1 minute. Crude, but serves as the baseline.
2. Utilization-based (model_utilization_autoscaler = true):
Computes utilization from cumulative inference time:
required_to_fit = inference_time_secs_per_min / 60 / max_concurrent_requests
utilization = required_to_fit / model_instances
Uses a deadband instead of dampening:
- If utilization < 1 / (factor + 0.15) → downscale
- If utilization > 1 / factor → upscale
- Otherwise → no change
Minimum factor is clamped to 1.2 (20% overprovisioning floor). Takes the max of current and forecast inference time.
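The deadband decision reduces to a couple of comparisons. A hedged Go sketch of the formulas above (illustrative names; the real autoscaler is Rust in ai-scheduler):
// Sketch of the utilization-based deadband: compute utilization from
// cumulative inference time, then compare against 1/factor bounds.
package main

import "fmt"

func scaleDecision(inferenceSecsPerMin, maxConcurrent float64, instances int, factor float64) string {
	if factor < 1.2 {
		factor = 1.2 // clamp: at least 20% overprovisioning
	}
	requiredToFit := inferenceSecsPerMin / 60.0 / maxConcurrent
	utilization := requiredToFit / float64(instances)

	switch {
	case utilization < 1.0/(factor+0.15):
		return "downscale"
	case utilization > 1.0/factor:
		return "upscale"
	default:
		return "no change"
	}
}

func main() {
	// 10 instances, 4 concurrent slots each, with varying inference-seconds per minute.
	fmt.Println(scaleDecision(2200, 4, 10, 1.2)) // utilization ~0.92 -> upscale
	fmt.Println(scaleDecision(1900, 4, 10, 1.2)) // utilization ~0.79 -> no change
	fmt.Println(scaleDecision(900, 4, 10, 1.2))  // utilization ~0.38 -> downscale
}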
3. EZ utilization-based (scaling_config.utilization_based.active = true):
The newest algorithm. Configurable per-model via ScalingConfig:
scaling_config:
disabled: false
utilization_based:
active: true
min_utilization: 0.3
max_utilization: 0.8
utilization_scale: 1.0
use_forecast: true
use_out_of_cap: true
prometheus: false
Features:
- Configurable min/max utilization bounds (replaces hardcoded deadband)
- Optional forecast-based scaling (use_forecast)
- Out-of-capacity adjustment (use_out_of_cap): inflates measured utilization by (successful + out_of_cap) / successful to account for requests turned away. successful_per_min is floored at 0.1 to avoid division by zero.
- Safety cap: when OOC adjustment is active, measured utilization is capped at 2× the sans-OOC value
- Prometheus metrics as an alternative utilization source
- Asymmetric instance base: upscale decisions use model_healthy_instances, downscale uses model_requested_instances
When active, this algorithm overrides the request-count-based
algorithm. However, if model_utilization_autoscaler is also true,
the older utilization-based algorithm takes final precedence — the
priority order is: request-count → EZ utilization → old utilization.
Instance bounds
All algorithms clamp results to [min_count, max_count] from Config API:
- Default min_count: 5 (set per autoscaler instance via the CLI flag default_min_count)
- Default max_count: 100
- Downscale batch size capped at 10 per cycle
There is no idle-based scale-to-zero. Models stay provisioned at
min_count (default 5, but can be set to 0 per-model).
Kill switches
- disable_scaling — global Config API property, disables all scaling
- scaling_config.disabled — per-model disable
- Tier change grace period — skips models that recently changed tiers
- Tier selector — autoscaler instances can be scoped to specific tier ranges
Scheduling actions
Actions are applied via Cloud Chamber API or external K8s:
| Action | Target | Effect |
|---|---|---|
| UpscaleModelApplications | Cloud Chamber | PATCH application instances count up |
| DownscaleModelApplications | Cloud Chamber | PATCH application instances count down |
| ScheduleExternalModelApplications | External K8s | Create/patch Model CRD replicas |
| CreateModelApplication | Cloud Chamber | Create new CC application + set instances |
| DeployModelApplicationToTiger | Cloud Chamber | Deploy to Tiger (canary) environment |
| DeleteDeployment | Cloud Chamber | Delete specific deployment |
| RemoveModelApplication | Cloud Chamber | Delete entire application |
| ModifyApplicationRegions | Cloud Chamber | Modify region constraints |
| ModifyApplicationSchedulingPriority | Cloud Chamber | Modify scheduling priority |
| ModifyApplicationAffinities | Cloud Chamber | Modify colocation/affinity constraints |
Instance count is clamped to 0–1400 per application. Up/downscale distributes changes round-robin across applications.
Actions are defined in the Action enum.
External capacity
External capacity (OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe) has a separate
management model controlled by the ExternalCapacity config
(Management enum):
pub enum Management {
    Disabled,  // no auto-management
    Manual,    // humans manage it entirely
    Upscaling, // autoscaler can only scale UP
    Full,      // autoscaler can scale both directions
}
The autoscaler code has this comment:
NOTE: currently auto-scaler can only scale up external instances but not scale down. In order to scale down, use admin api endpoint: /models/schedule_externally
This comment may be stale — Management::Full supports both
directions in the code. The real constraint is that the autoscaler’s
scaling algorithms don’t dynamically compute external replica targets;
external capacity is config-driven (expected_replicas), not
utilization-driven.
The external path works as follows:
- Infrastructure: K8s clusters provisioned via Terraform (oci-terraform/ repo — OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe)
- Model operator: inference-kubernetes-engine watches Model CRDs and reconciles K8s Deployments/Services to match
- Scaling up: autoscaler patches Model CRD spec.replicas via KubernetesProvider.schedule()
- Scaling down: manual via admin API or the scale-down CLI tool in inference-kubernetes-engine, which gradually decrements replicas with a configurable interval and batch size
Admin API
Manual scaling endpoints behind is_workers_ai_team()
access check:
- GET /models/scale — upscale a model by amount (creates an UpscaleModelApplications action; notably a GET for a mutating op)
- POST /models/schedule — run the full scheduler loop for a single model
- POST /models/schedule_externally — set external replica count
- POST /models/schedule_externally_on_specific_provider — target a specific provider
- POST /models/remove — delete a specific deployment by colo/metal
- POST /models/delete_applications — delete all applications for a model, including external CRDs
- GET /models/status — model status/debugging
- Tiger management — create, list, delete canary deployments
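For illustration, calling two of these endpoints might look like the following. The paths come from the list above; the host, parameter names, and auth details are assumptions.
# Hypothetical usage of the manual-scaling admin endpoints.
import requests

BASE = "https://ai-scheduler.internal.example"  # assumed host
HEADERS = {"Authorization": "Bearer <token>"}   # access gated by is_workers_ai_team()

# Upscale a model (note: a GET that mutates state, as called out above)
requests.get(f"{BASE}/models/scale",
             params={"model": "@cf/meta/llama-3.1-8b-instruct", "amount": 2},
             headers=HEADERS)

# Set the external replica count directly
requests.post(f"{BASE}/models/schedule_externally",
              json={"model": "@cf/meta/llama-3.1-8b-instruct", "replicas": 10},
              headers=HEADERS)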
Model provisioning
Unlike Replicate, Workers AI models are pre-provisioned rather than created on demand from inference traffic:
- Model is registered in Config API with properties (min_count, max_count, software, gpu_memory, etc.)
- release_manager_watcher detects the software version and creates Cloud Chamber applications
- Autoscaler maintains instance count within [min_count, max_count] based on utilization
- There is no equivalent of Replicate’s deployment pruner — models stay provisioned at min_count until manually removed
Config API properties (scaling-relevant)
| Property | Default | Description |
|---|---|---|
| min_count | 5 | Minimum instances |
| max_count | 100 | Maximum instances |
| scaling_factor | 0.8 | Autoscaling factor |
| model_utilization_autoscaler | false | Enable utilization-based algorithm |
| scaling_config | none | YAML blob with disabled, utilization_based sub-config |
| disable_scaling | false | Kill switch (also available as global property) |
| external_capacity | none | External provider config with management mode |
| gpu_memory | 22 | GPU memory request (GB) |
| gpu_model | none | Specific GPU model requirement |
| dual_gpu | false | Requires two GPUs |
| colo_tier | none | Restrict to colo tier |
| colo_region | none | Restrict to colo region |
| tier | “unknown-scheduler” | Model tier (Tier-0, Tier-1, Tier-2) |
Key Differences
| Aspect | Replicate | Workers AI |
|---|---|---|
| Scaling signal | Queue depth (backlog per instance) — real-time | Inference time utilization — 15-minute ClickHouse windows |
| Loop frequency | 1 second | 5 minutes |
| Deployment creation | On demand from prediction traffic | Pre-provisioned via Config API + release manager |
| Scale-to-zero | Yes, with idle timeout | No idle-based scale-to-zero; min_count defaults to 5 but can be 0 per-model |
| Deployment pruning | Automatic (idle deployments deleted after timeout) | None — models stay provisioned until manually removed |
| Dampening | HPA-style: scaling policies, stabilization windows, hysteresis | Deadband: upper/lower utilization bounds |
| Orchestrator | Direct K8s API (Deployments) | Cloud Chamber (internal) + K8s operator (external) |
| External capacity | N/A (single K8s cluster) | Multi-provider (OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe) with per-model management mode |
| Manual controls | Feature flags (cap-max-replicas, scale-to-zero-delay) | Admin API endpoints, disable_scaling kill switch, Management::Manual mode |
| Scale-down on external | N/A | Config-driven; Management::Full supports both directions but targets are set manually, not utilization-driven |
| Algorithm selection | Single algorithm (HPA-inspired) | Three algorithms with priority ordering: request-count → EZ utilization → old utilization |
| Config source | K8s annotations (from web’s DeployableConfig) | Config API properties (key-value store) |
Architectural contrast
Replicate’s autoscaler is reactive and fine-grained: predictions create deployments, queue depth drives scaling at 1-second resolution, idle deployments get pruned. The system assumes workloads are bursty and ephemeral.
Workers AI’s ai-scheduler is proactive and coarse-grained: models are pre-provisioned with minimum instance counts, scaling adjusts within configured bounds at 5-minute resolution, and external capacity management is heavily human-assisted. The system assumes a catalog of always-available models.
Workload Types Overview
Workload Types in Replicate
Replicate handles several distinct workload types, each with different characteristics and configurations. Understanding these distinctions is essential for comparing with Workers AI, which has a more uniform workload model.
Overview
Replicate defines four primary workload types via the DeployableKind enum
(replicate/web/models/models/deployable_config.py:114):
class DeployableKind(models.TextChoices):
    DEPLOYMENT_PREDICTION = "deployment-prediction", "deployment-prediction"
    FUNCTION_PREDICTION = "function-prediction", "function-prediction"
    VERSION_PREDICTION = "version-prediction", "version-prediction"
    VERSION_TRAINING = "version-training", "version-training"
Each workload type has its own deployable_metadata_for_* function in
replicate/web/models/logic.py
that generates the appropriate configuration.
1. Deployment Predictions
What: Predictions running on a Deployment - a stable, long-lived identifier that routes to a backing model version. The backing version can be changed over time and doesn’t need to be owned by the same account as the deployment.
Characteristics:
- Persistent deployment entity (configuration/routing), but infrastructure can scale to 0 replicas
- Custom configuration per deployment that can override many version-level settings
- Uses dedicated deployment key for consistent routing
Configuration:
deployable_metadata_for_deployment_prediction()
Code references:
- Deployment model: replicate/web/models/models/deployment.py
- Kind validation: logic.py:1156 - asserts kind == DeployableKind.DEPLOYMENT_PREDICTION and not deployment.used_by_model
Queue behavior: Standard shuffle-sharded queues per deployment
2. Function Predictions (Pipelines/Procedures)
What: Predictions for multi-step workflows (Replicate Pipelines). Function predictions run on a shared container image with CPU-only hardware that “swaps” in procedure source code at prediction time. Similar to hotswaps but specifically for CPU-only Python code rather than GPU model weights.
Characteristics:
- Run on CPU hardware (no GPU)
- Share a base container image across procedures
- Procedure source code is swapped in at runtime (analogous to weight swapping in hotswaps)
- Part of a larger multi-step workflow
- Uses AbstractProcedure model, not Version
Configuration:
deployable_metadata_for_procedure_prediction()
Code references:
- Procedure model: replicate/web/models/models/procedure.py
- Kind validation: logic.py:1165 - when kind == DeployableKind.FUNCTION_PREDICTION, asserts deployment.used_by_model
Director configuration: Director runs procedures with DIRECTOR_JOB_KIND=procedure
(director/config.go:37)
Queue behavior: Standard queues, no special routing
3. Version Predictions
What: Predictions running directly on a model version (not through a deployment).
Characteristics:
- Ephemeral infrastructure (scaled up/down based on demand)
- Configuration comes from version’s current_prediction_deployable_config
- Two sub-types: normal and hotswap (see below)
Configuration:
deployable_metadata_for_version_prediction()
3a. Normal Version Predictions
What: Standard version predictions without hotswapping.
Characteristics:
- One container image per version
- Standard queue routing
- prefer_same_stream = False (default)
Code:
logic.py:1214-1222
if not version.is_hotswappable:
    metadata = DeployableConfigSerializer(
        version.current_prediction_deployable_config
    ).data
3b. Hotswap Version Predictions
What: Versions that share a base container but load different weights at runtime. Multiple “hotswap versions” can run on the same pod by swapping weights instead of restarting containers.
Characteristics:
- Multiple versions share the same base Docker image
- Each version has additional_weights (weights loaded at runtime)
- Base version must be public, non-virtual, and accept Replicate weights
- prefer_same_stream = True - workers preferentially consume from the same stream to optimize weight locality
- Uses base version’s deployment key for infrastructure sharing
When a version is hotswappable
(version.py:586):
def is_hotswappable(self) -> bool:
    if not self.additional_weights:
        return False
    if not self.base_version:
        return False
    if self.base_docker_image_id != self.base_version.docker_image_relation_id:
        return False
    return self.base_version.is_valid_hotswap_base
Configuration:
logic.py:1229-1244
deployable_config_fields = {
    ...
    "docker_image": version.base_version.docker_image_relation,
    "fuse_config": None,
    "key": version.base_version.key_for_hotswap_base_predictions,
    "prefer_same_stream": True,  # KEY DIFFERENCE
}
Queue behavior:
- Shuffle-sharded queues (like all workloads)
- DIRECTOR_PREFER_SAME_STREAM=true set for hotswap versions
- Director workers repeatedly consume from the same stream before checking others
- Optimizes for weight-caching locality (reduces weight download/loading overhead)
Why prefer_same_stream matters:
- Hotswap versions load different weights into the same container
- If a worker has already loaded weights for version A, it’s faster to keep processing version A predictions
- Without prefer_same_stream, workers round-robin across all streams, losing weight-locality benefits (see the sketch below)
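A minimal sketch of that affinity, in Python for readability (the actual implementation is Go, in director/redis/queue.go); the stream, group, and consumer names are illustrative:
# Hypothetical sketch of stream affinity: try the last-used stream first so a
# worker keeps consuming predictions whose weights it already has loaded.
import redis

def next_prediction(r: redis.Redis, streams: list[str], last_stream: str | None,
                    group: str = "director", consumer: str = "worker-1"):
    ordered = ([last_stream] if last_stream in streams else []) \
        + [s for s in streams if s != last_stream]
    for stream in ordered:
        # XREADGROUP with ">" returns entries not yet delivered to this group
        resp = r.xreadgroup(group, consumer, {stream: ">"}, count=1, block=100)
        if resp:
            message_id, fields = resp[0][1][0]
            return stream, message_id, fields
    return None, None, None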
Code references:
- Version model: replicate/web/models/models/version.py
- Hotswap validation: version.py:586-595
- Queue affinity: director/config.go:50
- Redis implementation: director/redis/queue.go
4. Version Trainings
What: Training jobs for versions marked as trainable.
Characteristics:
- Uses separate current_training_deployable_config (not prediction config)
- Longer timeouts and different resource requirements
- Different billing model
- Director runs with DIRECTOR_JOB_KIND=training
Configuration:
deployable_metadata_for_version_training()
Code references:
- Training config: logic.py:1307-1308
- Director job kind: director/config.go:37
Summary Table
| Workload Type | Infrastructure | Config Source | prefer_same_stream | Director Job Kind |
|---|---|---|---|---|
| Deployment Prediction | Persistent | Deployment config | False | prediction |
| Function Prediction | Ephemeral | Procedure config | False | procedure |
| Version Prediction (normal) | Ephemeral | Version prediction config | False | prediction |
| Version Prediction (hotswap) | Shared base | Base version config + weights | True | prediction |
| Version Training | Ephemeral | Version training config | False | training |
Key Takeaways
- Replicate has explicit workload type separation with different configurations and behaviors per type
- Hotswap predictions are unique - they’re the only workload using prefer_same_stream=True for weight-locality optimization
- Function predictions use code swapping - similar to hotswaps, but for CPU-only Python code instead of GPU model weights
- Each workload type has distinct characteristics - different infrastructure models, configuration sources, and queue behaviors
These workload distinctions will be referenced throughout the document when discussing queue behavior, timeouts, scaling, and other operational characteristics.