Introduction
This is a technical comparison of Replicate and Cloudflare Workers AI. It covers how each system handles request routing, autoscaling, resource management, model loading, and observability. When possible, links to source code are included.
Audience
Engineers and engineering leaders on the Workers AI and adjacent ai-platform teams. Sections are written to be useful both as reference material for people building these systems and as context for people making decisions about them.
How to read this
Each section covers both platforms, ending with a “Key Differences” summary. Sections are self-contained — read them in order or jump to what’s relevant.
Scope
This comparison covers the inference serving path and supporting infrastructure: how requests arrive, get routed to models, execute, and return results. In general, it does not cover ancillary tooling or browser-level user experience topics.
Request Lifecycle Management
Overview
A prediction (Replicate) or inference request (Workers AI) touches a different set of components on each platform. The diagrams below show the (simplified) full path from client to inference server and back.
Replicate
graph TD
Client([Client])
Web[web]
subgraph GPU cluster
API[api]
Redis[(redis)]
Director[director]
Cog[cog]
end
Client -->|HTTP| API
Web -.->|playground| API
API -->|LPUSH / XADD| Redis
Redis -->|XREADGROUP| Director
Director -->|HTTP| Cog
Cog -->|response| Director
Director -->|webhook| API
API -->|poll / webhook| Client
Key points: the request is asynchronous by default. The client submits a prediction, the API enqueues it, and Director picks it up later. The client polls, receives a webhook when the result is ready, or opts into “blocking” mode at the per-request level. Streaming predictions respond over SSE but still enter through the same queue.
Workers AI
graph TD
Client([Client])
WCE[worker-constellation-entry]
CE[constellation-entry]
CS[constellation-server]
subgraph GPU cluster
GPU[GPU container]
end
Client -->|HTTP| WCE
WCE -->|HTTP| CE
CE -->|pipefitter / HTTP| CS
CS -->|HTTP| GPU
GPU -->|response| CS
CS -->|response| CE
CE -->|response| WCE
WCE -->|response| Client
Key points: the request is synchronous. The client holds an open HTTP connection while the request flows through the stack to a GPU container and back. There is no persistent queue — requests wait in bounded in-memory queues or get rejected with a 429 if no capacity is available.
Replicate Request Processing
The Replicate request path has three major phases: API-side routing and enqueue, transient queuing in Redis, and Director-side dequeue and execution. Durable state lives in PostgreSQL (via web), not in the queue. The sections below cover each phase in detail.
Behavior varies by workload type (see Workload Types appendix). All workload types share the same core queue and worker infrastructure but differ in configuration.
Queue and Message Handling
Each GPU provider’s managed Kubernetes cluster runs its own Redis
cluster (with sentinel for HA). Director uses Redis Streams with
consumer groups for request queuing. Each deployable has its own
stream — the API pushes prediction requests to the stream, and
Director pods consume from it. Queue names combine a workload-type
prefix (input:prediction:dp-, input:prediction:fp-, etc.) with
the deployable’s key.
Queues use shuffle sharding by default (since October 2024) — each
tenant is assigned a subset of stream shards, providing fairness by
preventing one account’s batch from blocking others. Hotswap
predictions enable PreferSameStream for weight cache locality.
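As a rough illustration of the naming and sharding scheme above, the sketch below builds a stream name from a workload-type prefix and a deployable key, and assigns an account a stable subset of shards. The helper names (streamName, tenantShards) and the hash-seeded selection are illustrative assumptions, not the actual web or Director code.

package main

// Rough illustration of the naming and sharding scheme above. streamName and
// tenantShards are hypothetical helpers; the real queue naming and
// shuffle-shard assignment live in web and Director, not here.

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// streamName combines a workload-type prefix with the deployable's key,
// e.g. "input:prediction:" + "dp-<uuid4_hex>".
func streamName(prefix, deployableKey string) string {
	return prefix + deployableKey
}

// tenantShards sketches shuffle sharding: each account hashes to a stable,
// small subset of the stream's shards, so one account's burst cannot occupy
// every shard at once.
func tenantShards(accountID string, totalShards, shardsPerTenant int) []int {
	h := fnv.New64a()
	h.Write([]byte(accountID))
	r := rand.New(rand.NewSource(int64(h.Sum64())))
	return r.Perm(totalShards)[:shardsPerTenant]
}

func main() {
	fmt.Println(streamName("input:prediction:", "dp-0123456789abcdef"))
	fmt.Println(tenantShards("acct_42", 16, 4)) // a stable 4-shard subset for this account
}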
See Queue Management for full details on queue internals: MessageQueue fetcher/claimer goroutines, sharded queues, blue/green migration via MultiMessageQueue, and stuck message cleanup.
Worker Loop
Each Director instance runs worker goroutines that poll the queue. The number of workers
is controlled by DIRECTOR_CONCURRENT_PREDICTIONS (default: 1). This is set by cluster
when the model’s Cog config specifies concurrency.max > 1
(cluster/pkg/kubernetes/deployable.go:1149-1153).
Most models run with the default of 1 worker.
The worker loop
(director/worker.go:134)
continuously:
- Polls Redis Stream for messages using consumer groups
- Checks if prediction execution is allowed (spending validation, principal checks)
- Calculates effective deadline from prediction metadata, deployment lifetime, and timeout config
- Creates a Tracker instance for state management
- Executes prediction via Executor interface (which calls into Cog)
- Handles terminal states (success, failure, cancellation)
- Acknowledges message in Redis
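A heavily compressed sketch of that loop is shown below. It collapses spending validation, the Tracker, and webhook handling; every type and method name is an illustrative stand-in for the real ones in director/worker.go.

package main

// Hypothetical, heavily simplified sketch of the Director worker loop.

import (
	"context"
	"fmt"
	"time"
)

type Prediction struct {
	ID       string
	Deadline *time.Time // per-prediction deadline, if the API set one
}

type Message struct{ Prediction Prediction }

type Queue interface {
	Read(ctx context.Context) (*Message, error) // XREADGROUP on the deployable's stream
	Ack(ctx context.Context, m *Message) error  // XACK so the message is not redelivered
}

type Executor interface {
	Execute(ctx context.Context, p Prediction) error // calls into Cog over HTTP
}

type Worker struct {
	queue             Queue
	executor          Executor
	predictionTimeout time.Duration // fallback when no explicit deadline applies
}

func (w *Worker) run(ctx context.Context) {
	for ctx.Err() == nil {
		msg, err := w.queue.Read(ctx)
		if err != nil || msg == nil {
			continue
		}
		pred := msg.Prediction

		// Effective deadline: a per-prediction deadline wins; otherwise fall
		// back to the configured prediction timeout from execution start.
		timeout := w.predictionTimeout
		if pred.Deadline != nil {
			timeout = time.Until(*pred.Deadline)
		}
		execCtx, cancel := context.WithTimeout(ctx, timeout)
		err = w.executor.Execute(execCtx, pred)
		cancel()

		if err != nil {
			// The real worker records a terminal state and sends a webhook here.
			fmt.Printf("prediction %s failed: %v\n", pred.ID, err)
		}
		_ = w.queue.Ack(ctx, msg)
	}
}

func main() {} // wiring to Redis and Cog omitted; this sketch only compiles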
The worker loop behavior is the same across all workload types. What differs is the job kind passed to the executor:
- Predictions (deployment, version): DIRECTOR_JOB_KIND=prediction (default)
- Trainings: DIRECTOR_JOB_KIND=training
- Procedures (function predictions): DIRECTOR_JOB_KIND=procedure
Note: Internally, Director converts procedure to prediction when sending to Cog
because the API doesn’t yet support the procedure value
(director/worker.go:731-737).
Code references:
- Worker implementation: director/worker.go
- Execution validation: director/worker.go:34-48
- Deadline calculation: director/worker.go:63-107
- Job kind configuration: director/config.go:37
Tracker and State Transitions
The Tracker
(director/tracker.go)
manages state transitions and notifications. It defines four webhook event types:
- start: Prediction begins execution
- output: Intermediate outputs during execution
- logs: Log messages during execution
- completed: Prediction reaches terminal state
The Tracker maintains:
- Prediction state (struct at director/tracker.go:36-69)
- Subscriber functions that receive state updates
- Pending webhook events (if notifications are suppressed temporarily)
- Upload tracking for async file completions
- Moderation state (if content moderation is enabled)
Webhooks are sent to the Replicate API at state transitions. The webhook sender uses a
PreemptingExecutor
(director/executor.go)
that may drop intermediate webhooks if new updates arrive before the previous one is
sent - this is intentional since the latest state supersedes earlier ones. Terminal
webhooks are always delivered.
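The sketch below illustrates the "latest state wins" behavior in isolation: an intermediate payload overwrites any payload still waiting to be sent, while terminal events bypass the pending slot. Names are hypothetical; the real PreemptingExecutor in director/executor.go is more involved.

package main

// Minimal sketch of "latest state wins" webhook delivery.

import (
	"fmt"
	"sync"
)

type webhookSender struct {
	mu      sync.Mutex
	pending *string // at most one non-terminal payload waits to be sent
}

// enqueue replaces any pending intermediate payload with the newer one.
func (s *webhookSender) enqueue(payload string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pending = &payload // an older pending payload, if any, is dropped
}

// flush sends whatever intermediate payload is still pending, if any.
func (s *webhookSender) flush(send func(string)) {
	s.mu.Lock()
	p := s.pending
	s.pending = nil
	s.mu.Unlock()
	if p != nil {
		send(*p)
	}
}

func main() {
	s := &webhookSender{}
	s.enqueue(`{"event":"output","n":1}`)
	s.enqueue(`{"event":"output","n":2}`) // supersedes n:1 before it was sent
	s.flush(func(p string) { fmt.Println("sent:", p) })
	// Terminal events are always delivered, regardless of the pending slot.
	fmt.Println("sent:", `{"event":"completed"}`)
}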
Code references:
- Tracker struct: director/tracker.go:36-69
- Webhook event types: cog/types.go:45-55
Streaming
Director supports Server-Sent Events (SSE) for streaming outputs. Streaming is enabled
when the API includes a stream_publish_url in the prediction’s internal metadata
(cog/types.go:230).
The StreamClient
(director/streaming.go)
publishes output events to this URL as they occur.
Concurrency Control
Director supports concurrent predictions within a single container via
DIRECTOR_CONCURRENT_PREDICTIONS. See Concurrency Deep Dive
for details on configuration and usage patterns.
Workers AI Request Processing
Routing Architecture
Workers AI requests flow through multiple components:
- worker-constellation-entry: Pipeline Worker (TypeScript) that receives API requests from SDK or REST API. Handles input validation, model lookup, and request formatting before forwarding to constellation-entry (apps/worker-constellation-entry/src/worker.ts).
- constellation-entry: Rust service running on every metal. Fetches model config from QS, selects target colos using routing algorithms, resolves constellation-server addresses, and forwards via pipefitter or HTTP (gitlab.cfdata.org/cloudflare/ai/constellation-entry).
- constellation-server: Colo-level load balancer (Rust). Manages permits and request queues per model, selects GPU container via service discovery, handles inference forwarding and cross-colo failover (gitlab.cfdata.org/cloudflare/ai/constellation-server).
- GPU container: Inference server running the model.
Wiki references:
- System Overview - Component descriptions
- Smart(er) routing - Routing architecture
- Constellation Server Updates Q3 2025 - Detailed request flow
constellation-entry Routing
constellation-entry fetches model configuration from Quicksilver (distributed KV
store), which includes the list of colos where the model is deployed
(src/model_config.rs).
It then applies a routing algorithm to select target colos. Routing algorithms include:
- Random: Random selection among available colos (default)
- LatencySensitive: Prefer nearby colos, with step-based selection
- ForcedColo: Override to a specific colo (for debugging/testing)
- ForcedResourceFlow: Use resource-flow routing (src/resource_flow.rs)
Routing uses a step-based algorithm (StepBasedRouter) that generates a prioritized list
of colos. The algorithm can be customized per-model via routing_params in the model
config.
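A hypothetical sketch of step-based selection is shown below: candidates are sorted by estimated latency, grouped into successive steps, and shuffled within each step to produce a prioritized list. The step sizes, latency field, and ordering are assumptions, not the actual StepBasedRouter or routing_params behavior.

package main

// Hypothetical sketch of step-based colo selection.

import (
	"fmt"
	"math/rand"
	"sort"
)

type colo struct {
	Name      string
	LatencyMs float64 // estimated latency from the requesting metal
}

func stepBasedOrder(colos []colo, stepSizes []int) []string {
	sort.Slice(colos, func(i, j int) bool { return colos[i].LatencyMs < colos[j].LatencyMs })
	var ordered []string
	i := 0
	for _, size := range stepSizes {
		end := i + size
		if end > len(colos) {
			end = len(colos)
		}
		step := colos[i:end]
		// Shuffle within a step so load spreads across equally-preferred colos.
		rand.Shuffle(len(step), func(a, b int) { step[a], step[b] = step[b], step[a] })
		for _, c := range step {
			ordered = append(ordered, c.Name)
		}
		i = end
		if i >= len(colos) {
			break
		}
	}
	return ordered
}

func main() {
	candidates := []colo{{"SJC", 5}, {"LAX", 12}, {"DFW", 35}, {"ORD", 48}, {"AMS", 140}}
	// First step: the 2 nearest colos (shuffled between themselves), then the next 3.
	fmt.Println(stepBasedOrder(candidates, []int{2, 3}))
}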
Once colos are selected, constellation-entry resolves constellation-server addresses via
QS or DNS
(src/cs_resolver.rs)
and forwards the request via pipefitter or HTTP
(src/pipefitter_transport.rs).
Code references:
- Repository: gitlab.cfdata.org/cloudflare/ai/constellation-entry
- Main entry point: src/main.rs
- Colo resolver: src/cs_resolver.rs
constellation-server Load Balancing
constellation-server runs in each colo and handles:
Request handling
(src/main.rs,
server/src/lib.rs):
- Listens on Unix domain socket (for pipefitter) and TCP port (src/cli.rs:23,31)
- Extracts routing metadata: model ID, account ID, fallback routing info
- Tracks distributed tracing spans
- Emits metrics to Ready Analytics
Endpoint load balancing and concurrency
(server/src/endpoint_lb.rs):
- Manages an EndpointGroup per model with permit-based concurrency control
- Per-model concurrency limits configured via Config API
- Permit handling in server/src/endpoint_lb/permits.rs
- Explicit request queue (ModelRequestQueue) with FIFO or Fair algorithms (server/src/endpoint_lb/queue.rs)
Service discovery
(service-discovery/src/lib.rs):
- Finds available GPU container endpoints via Consul (service-discovery/src/consul.rs)
- Maintains a world map of available endpoints
Inference forwarding
(server/src/inference/mod.rs):
- Regular HTTP requests: Buffers and forwards to GPU container with retry logic
- WebSocket requests: Upgrades connection and relays bidirectionally (no retries)
- Backend-specific handlers for Triton, TGI, TEI, and generic HTTP (PipeHttp — used for vLLM and others) in server/src/inference/
Failover and forwarding
(server/src/colo_forwarding.rs):
- If target returns error or is unavailable, may forward to different colo
- Tracks inflight requests via headers (cf-ai-requests-inflight) to coordinate across instances
- For pipefitter: Returns 500 with headers telling pipefitter to retry different colos
- Handles graceful draining: continues accepting requests while failing health checks
Code references:
- Repository: gitlab.cfdata.org/cloudflare/ai/constellation-server
- CLI args (timeouts, interfaces): src/cli.rs
Wiki references:
- Constellation Server Updates Q3 2025 - Detailed request flow
- Reliable constellation-server failover
- Constellation server graceful restarts
Queuing and Backpressure
When GPU containers are busy, backpressure propagates through the stack:
- constellation-server: find_upstream() attempts to acquire a permit for a GPU endpoint. If all permits are taken, the request enters a bounded local colo queue (ModelRequestQueue). The queue is a future that resolves when a permit frees up (via tokio::sync::Notify). If the queue is also full, the request gets NoSlotsAvailable → ServerError::OutOfCapacity. OOC may also trigger forwarding to another constellation-server in the same colo (server/src/endpoint_lb/endpoint_group.rs:200-277).
- constellation-entry: Receives OOC from constellation-server (error codes 3033, 3040, 3047, 3048). Has a retry loop with separate budgets for errors (up to 3) and OOC (varies by request size). For small requests, pipefitter handles cross-colo forwarding. For large requests (>256 KiB), constellation-entry retries across target colos itself. All OOC variants are mapped to generic 3040 before returning to the client (proxy_server.rs:1509-1600, errors.rs:62-67).
- worker-constellation-entry: Has an optional retry loop on 429/424 responses, but outOfCapacityRetries defaults to 0 (configurable per-model via sdkConfig). By default, OOC goes straight to the client (ai/session.ts:101, 137-141).
- Client: Receives HTTP 429 with "Capacity temporarily exceeded, please try again." (internal code 3040).
There is no persistent queue in the Workers AI stack. Backpressure is handled through permit-based admission control, bounded in-memory queues, cross-colo retry, and ultimately rejection to the client. This contrasts with Replicate’s Redis Streams-based queue, where requests persist until a Director instance dequeues them.
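The sketch below compresses the admission path at the constellation-server layer into a few lines of Go: try a permit, fall back to a bounded queue, reject as out-of-capacity when both are exhausted. The channel mechanics and names (acquire, ErrOutOfCapacity) are illustrative; the real implementation uses tokio permits and a ModelRequestQueue.

package main

// Sketch of permit-plus-bounded-queue admission control.

import (
	"context"
	"errors"
	"fmt"
)

var ErrOutOfCapacity = errors.New("no slots available") // surfaces as 429 / code 3040 upstream

type modelSlots struct {
	permits chan struct{} // buffered: one token per allowed concurrent request
	queue   chan struct{} // bounded: waiting requests beyond the permit count
}

func newModelSlots(concurrency, queueDepth int) *modelSlots {
	s := &modelSlots{
		permits: make(chan struct{}, concurrency),
		queue:   make(chan struct{}, queueDepth),
	}
	for i := 0; i < concurrency; i++ {
		s.permits <- struct{}{}
	}
	return s
}

// acquire returns a release function, or ErrOutOfCapacity when both the
// permits and the bounded queue are exhausted.
func (s *modelSlots) acquire(ctx context.Context) (func(), error) {
	select {
	case <-s.permits: // fast path: a permit is free
	default:
		select {
		case s.queue <- struct{}{}: // take a queue slot, then wait for a permit
			defer func() { <-s.queue }()
			select {
			case <-s.permits:
			case <-ctx.Done():
				return nil, ctx.Err()
			}
		default:
			return nil, ErrOutOfCapacity // queue full: reject (or forward to another colo)
		}
	}
	return func() { s.permits <- struct{}{} }, nil
}

func main() {
	s := newModelSlots(1, 0) // one concurrent request per model, no local queue
	release, _ := s.acquire(context.Background())
	defer release()
	if _, err := s.acquire(context.Background()); err != nil {
		fmt.Println("second request:", err) // no slots available
	}
}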
State and Observability
Workers AI observability has two main channels from constellation-server:
Ready Analytics
(ready-analytics/src/):
Per-inference events sent via logfwdr (Unix socket). Each InferenceEvent includes:
- Request timing: request_time_ms, inference_time_ms, queuing_time_ms, time_to_first_token
- Billing: neurons, cost metric values
- Routing: colo_id, source_colo_id, colo_path, upstream_address
- Queue state: local_queue_size
- Flags: streamed, beta, external endpoint, queued in colo
See
ready-analytics/src/inference_event.rs
for full field list.
Prometheus metrics
(metrics/src/lib.rs):
Local metrics scraped by monitoring. Includes:
- requests (by status), errors (by internal/HTTP code)
- total_permits, mean_used_permits, mean_queue_size, full_fraction (per model)
- infer_per_backend, infer_per_health_state
- forward_failures (per target colo)
- backoff_removed_endpoints (endpoints in backoff state)
Note: “Workers Analytics” in the public docs refers to analytics for Workers code execution, not GPU inference. GPU inference observability flows through Ready Analytics.
Streaming
constellation-server supports two streaming mechanisms:
HTTP streaming
(server/src/inference/pipe_http.rs):
For PipeHttp backends, the response body can be InferenceBody::Streaming (line
222-230), which passes the Incoming body through without buffering via chunked transfer
encoding. The permit is held until the connection completes (line 396-397 for HTTP/1,
line 472 for HTTP/2). If the backend sends SSE-formatted data (e.g., TGI’s streaming
responses), constellation-server passes it through transparently without parsing.
WebSocket
(server/src/inference/mod.rs:180-400):
Bidirectional WebSocket connections are handled by handle_ws_inference_request. After
the HTTP upgrade, handle_websocket_request relays messages between client and inference
server. A max connection duration is enforced (max_connection_duration_s, line 332-338).
Loopback SSE
(server/src/loopback/):
Separate from client-facing streaming. constellation-server maintains SSE connections to
inference servers that support async batch processing. These connections receive
completion events (InferenceSuccess, InferenceError, WebsocketFinished, GpuFree)
which trigger permit release and async queue acknowledgment. See
loopback/sse.rs
for the SSE client implementation.
Concurrency Control
constellation-server uses permit-based concurrency control per model via
EndpointLoadBalancer
(server/src/endpoint_lb.rs).
Concurrency limits are configured in Config API. When local queuing is enabled, requests
wait in an explicit ModelRequestQueue (FIFO or Fair algorithm) until a permit becomes
available. Different models can have different concurrency limits on the same server.
Key Differences
| Aspect | Replicate | Workers AI |
|---|---|---|
| Request model | Asynchronous by default (poll/webhook); sync opt-in via Prefer: wait | Synchronous by default; async opt-in via queueRequest |
| Queue persistence | Redis Streams — requests survive component restarts | No persistent queue — bounded in-memory queues, rejection on overflow |
| Backpressure | Queue absorbs bursts; Director dequeues when ready | Permit-based admission → local queue → cross-colo retry → 429 to client |
| State management | Director Tracker with webhook notifications (start, output, logs, completed) | Stateless request-response; observability via Ready Analytics events |
| Streaming | SSE via stream_publish_url; Director publishes output events | HTTP chunked transfer passthrough; WebSocket relay; loopback SSE for async batch |
| Routing | API routes to correct GPU cluster’s Redis; Director pulls from local queue | constellation-entry selects colos via routing algorithms; constellation-server selects GPU endpoint |
| Concurrency | DIRECTOR_CONCURRENT_PREDICTIONS per container (default 1) | Permit-based per model in constellation-server; FIFO or Fair queue algorithm |
| Fairness | Per-tenant shuffle sharding at queue level | Per-account round-robin in fair queue; per-account rate limiting |
| Transport | HTTP between Director and Cog | Pipefitter (small requests) or HTTP (>256 KiB) between constellation-entry and constellation-server |
| Failure handling | Webhook terminal states (failed/canceled); message ACK in Redis | Error codes propagated through stack; OOC triggers cross-colo retry |
Timeouts & Deadlines
Overview
Both platforms enforce timeouts on inference requests, but the designs reflect different assumptions about workload duration. Replicate has a multi-phase timeout system with separate controls for model setup, prediction execution, and overall lifetime — designed for workloads ranging from seconds to 24-hour training runs. Workers AI has a two-layer system: a per-model inference timeout in constellation-server (default 30 seconds) under a hard 4-minute ceiling in constellation-entry — optimized for fast, predictable inference.
Replicate Timeout System
Timeout behavior spans multiple components:
- replicate/web - defines timeout values per workload type in deployable metadata
- replicate/api - can inject per-prediction deadlines into prediction metadata
- replicate/cluster - translates deployable config into DIRECTOR_* environment variables when creating K8s deployments; provides default values when the deployable config doesn't specify them (pkg/config/config.go)
- director - reads env vars and enforces timeouts during execution
Configuration
Three timeout parameters control prediction lifecycle. Each flows from replicate/web
through replicate/cluster to director as an environment variable. Director reads them as
flags
(director/config.go:52-58).
Model Setup Timeout (DIRECTOR_MODEL_SETUP_TIMEOUT)
Time allowed for model container initialization. Applied during health check polling
while waiting for the container to become ready. Measures time from StatusStarting to
completion. If exceeded, Director fails the prediction with setup logs and reports to the
setup run endpoint
(director/director.go:718-830).
Resolution chain:
- web: Model.setup_timeout property — returns DEFAULT_MODEL_SETUP_TIMEOUT (10 minutes) when the underlying DB field is None (models/model.py:1038-1042). Stored on DeployableConfig.setup_timeout, then serialized as model_setup_timeout with a multiplier and bonus applied (api_serializers.py:1678-1681).
- cluster: Reads deployable.ModelSetupTimeout. If nonzero, uses it; otherwise falls back to config.ModelSetupTimeout (10 minutes) (pkg/config/config.go:74, pkg/kubernetes/deployable.go:1191-1201).
- director: Reads DIRECTOR_MODEL_SETUP_TIMEOUT env var. Own flag default is 0 (disabled), but cluster always sets it.
Prediction Timeout (DIRECTOR_PREDICT_TIMEOUT)
Time allowed for prediction execution, starting from when execution begins (not including
queue time). Acts as a duration-based fallback when no explicit deadline is set. Director
enforces a minimum of 30 minutes if configured to 0 or negative
(director/director.go:285-288).
Resolution chain:
- web: Model.prediction_timeout (nullable duration). When set, stored on DeployableConfig.run_timeout and serialized as prediction_timeout (models/model.py:1010-1017, api_serializers.py:1702-1703). When None, the serialized prediction_timeout is omitted.
- cluster: predictTimeout() resolves the value with this priority (pkg/kubernetes/deployable.go:1886-1906):
  1. deployable.PredictionTimeout (from web, if set)
  2. Hardcoded per-account override map userSpecificTimeouts (pkg/kubernetes/deployable.go:88-116)
  3. config.PredictTimeoutSeconds (30 minutes) (pkg/config/config.go:71)
  4. For training workloads, uses the higher of the above and TrainTimeoutSeconds (24 hours)
  A mirrored per-account map exists in web's USER_SPECIFIC_TIMEOUTS (models/prediction.py:60-107) to prevent terminate_stuck_predictions from killing predictions that have these extended timeouts.
- director: Reads DIRECTOR_PREDICT_TIMEOUT env var. Own flag default is 1800s (30 minutes), but cluster always sets it.
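A small sketch of that resolution order, with hypothetical parameter names standing in for the fields read by cluster's predictTimeout():

package main

// Illustrative sketch of the prediction-timeout resolution order.

import (
	"fmt"
	"time"
)

func predictTimeout(deployableTimeout, accountOverride, defaultTimeout, trainTimeout time.Duration, isTraining bool) time.Duration {
	t := defaultTimeout // config.PredictTimeoutSeconds (30 minutes)
	if accountOverride > 0 {
		t = accountOverride // hardcoded per-account override map (userSpecificTimeouts)
	}
	if deployableTimeout > 0 {
		t = deployableTimeout // a value set in web's DeployableConfig wins
	}
	if isTraining && trainTimeout > t {
		t = trainTimeout // trainings get at least TrainTimeoutSeconds (24 hours)
	}
	return t
}

func main() {
	fmt.Println(predictTimeout(0, 0, 30*time.Minute, 24*time.Hour, false))           // 30m0s
	fmt.Println(predictTimeout(0, 0, 30*time.Minute, 24*time.Hour, true))            // 24h0m0s
	fmt.Println(predictTimeout(2*time.Hour, 0, 30*time.Minute, 24*time.Hour, false)) // 2h0m0s
}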
Max Run Lifetime (DIRECTOR_MAX_RUN_LIFETIME)
Maximum time for the entire prediction lifecycle including queue time and setup.
Calculated from prediction.CreatedAt timestamp.
There are two paths that produce a max run lifetime constraint — one via the deployable config (env var on the Director pod) and one via a per-request header that becomes a deadline on the prediction itself.
Deployable config path (env var):
- web: DeployableConfig.max_run_lifetime — defaults to DEFAULT_MAX_RUN_LIFETIME (24 hours) at the DB level (models/deployable_config.py:259-261). For deployment predictions, deployment.max_run_lifetime can override this (logic.py:1175-1176). Serialized as max_run_lifetime (api_serializers.py:1675-1676).
- cluster: Reads deployable.MaxRunLifetime. If nonzero, uses it; otherwise falls back to config.MaxRunLifetime (24 hours) (pkg/config/config.go:75, pkg/kubernetes/deployable.go:1203-1213).
- director: Reads DIRECTOR_MAX_RUN_LIFETIME env var. Own flag default is 0 (disabled), but cluster always sets it.
Per-request path (Cancel-After header):
- api: The Cancel-After HTTP header on a prediction request is parsed as a duration (Go-style like 5m or bare seconds like 300). Minimum 5 seconds (server/v1_prediction_handler.go:233-276).
- api: calculateEffectiveDeadline() picks the shorter of the request's Cancel-After value and the deployable metadata's max_run_lifetime, computes an absolute deadline from prediction creation time, and sets it on prediction.InternalMetadata.Deadline (logic/prediction.go:1076-1109).
- director: This deadline is the "per-prediction deadline" checked first in the deadline priority (see Deadline Calculation below).
Deadline Calculation and Enforcement
Director calculates an effective deadline at dequeue time using this priority
(director/worker.go:63-107):
1. Per-prediction deadline (prediction.InternalMetadata.Deadline)
2. Deployment deadline (prediction.CreatedAt + MaxRunLifetime, if configured)
3. Prediction timeout (fallback)
The earliest applicable deadline wins. At execution time
(director/worker.go:516-528):
deadline, source, ok := getEffectiveDeadline(prediction)
var timeoutDuration time.Duration
if ok {
timeoutDuration = time.Until(deadline)
} else {
timeoutDuration = d.predictionTimeout
}
timeoutTimer := time.After(timeoutDuration)
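A sketch of how getEffectiveDeadline could apply the priority above; the struct and field names are illustrative, not the actual director types:

package main

// Illustrative sketch of the deadline priority.

import (
	"fmt"
	"time"
)

type prediction struct {
	CreatedAt        time.Time
	InternalDeadline *time.Time // set by the API from Cancel-After / max_run_lifetime
}

func getEffectiveDeadline(p prediction, maxRunLifetime time.Duration) (time.Time, string, bool) {
	if p.InternalDeadline != nil {
		return *p.InternalDeadline, "prediction", true // per-prediction deadline wins
	}
	if maxRunLifetime > 0 {
		return p.CreatedAt.Add(maxRunLifetime), "deployment", true // includes time spent queued
	}
	return time.Time{}, "", false // caller falls back to the duration-based prediction timeout
}

func main() {
	created := time.Now().Add(-60 * time.Second) // dequeued 60s after creation
	deadline, source, ok := getEffectiveDeadline(prediction{CreatedAt: created}, 300*time.Second)
	if ok {
		// ~240s remaining, matching Scenario 1 below
		fmt.Printf("source=%s remaining=%s\n", source, time.Until(deadline).Round(time.Second))
	}
}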
Timeout outcomes vary based on when and why the deadline is exceeded:
Before execution (deadline already passed at dequeue):
- Per-prediction deadline → aborted
- Deployment deadline → failed
During execution (timer fires while running):
- Per-prediction deadline → canceled
- Deployment deadline → failed
- Prediction timeout (fallback) → failed
Timeout Behavior Examples
Scenario 1: Version prediction with deployment deadline
- DIRECTOR_MAX_RUN_LIFETIME=300 (5 minutes)
- DIRECTOR_PREDICT_TIMEOUT=1800 (30 minutes)
- Prediction created at T=0, dequeued at T=60s
- Effective deadline: T=0 + 300s = T=300s
- Execution timeout: 300s - 60s = 240s remaining
Scenario 2: Prediction with explicit deadline
- Per-prediction deadline: T=180s (3 minutes from creation)
- DIRECTOR_MAX_RUN_LIFETIME=600 (10 minutes)
- Prediction created at T=0, dequeued at T=30s
- Effective deadline: T=180s (per-prediction deadline wins)
- Execution timeout: 180s - 30s = 150s remaining
Scenario 3: Training with no deadlines
- DIRECTOR_MAX_RUN_LIFETIME=0 (disabled)
- DIRECTOR_PREDICT_TIMEOUT=1800 (30 minutes)
- No per-prediction deadline
- Effective deadline: None (uses fallback)
- Execution timeout: 1800s from execution start
Code references:
- Timeout configuration: director/config.go:52-58
- Deadline calculation: director/worker.go:63-107
- Execution timeout: director/worker.go:516-528
- Setup timeout: director/director.go:718-830
Workers AI Timeout System
Workers AI has a simpler timeout model than Replicate, but the request passes through multiple layers, each with its own timeout behavior.
Request Path Layers
A request from the internet traverses three services before reaching a GPU container:
- worker-constellation-entry (TypeScript Worker): The SDK-facing entry point. Calls constellation-entry via a Workers service binding (this.binding.fetch(...)) with no explicit timeout (ai/session.ts:131). See note below on Workers runtime limits.
- constellation-entry (Rust, runs on metal): Wraps the entire request to constellation-server in a hardcoded 4-minute tokio::time::timeout (proxy_server.rs:99, proxy_server.rs:831-845). On timeout, returns EntryError::ForwardTimeout. This applies to both HTTP and WebSocket paths.
- constellation-server (Rust, runs on GPU nodes): Enforces the per-model inference timeout (see below).
The 4-minute constellation-entry timeout acts as a hard ceiling. If a model’s
inference_timeout in constellation-server exceeds 240 seconds, constellation-entry will
kill the connection before constellation-server’s own timeout fires. This means
constellation-entry is the effective upper bound for any single inference request, unless
constellation-entry’s constant is changed.
Workers runtime limits: worker-constellation-entry is a standard Worker deployed on
Cloudflare’s internal AI account (not a pipeline First Party Worker). It does not set
limits.cpu_ms in its
wrangler.toml,
so it gets the default 30-second CPU time limit. However, CPU time only counts active
processing — time spent awaiting the service binding fetch to constellation-entry is
I/O wait and does not count
(docs). Since the
Worker is almost entirely I/O-bound, the CPU limit is unlikely to be reached. Wall-clock
duration has no hard cap for HTTP requests, though Workers can be evicted during runtime
restarts (~once/week, 30-second drain). In practice, the Workers runtime layer is
transparent for timeout purposes — constellation-entry’s 4-minute timeout fires well
before any platform limit would.
Configuration Levels
constellation-server Default
(constellation-server/src/cli.rs:107-109):
- --infer-request-timeout-secs: CLI argument, default 30 seconds
- Acts as the minimum timeout floor for all inference requests
- Applied globally across all models served by the constellation-server instance
Per-Model Configuration
(model-repository/src/config.rs:143):
- inference_timeout: Model-specific timeout in seconds (default: 0)
- Configured via Config API (Deus) and fetched from Quicksilver (distributed KV store)
- Part of the WorkersAiModelConfig structure
- Zero value means "use constellation-server default"
Timeout Resolution
When processing an inference request, constellation-server resolves the timeout
(server/src/lib.rs):
let min_timeout = state.infer_request_timeout_secs as u64;
let mut inference_timeout = Duration::from_secs(state.infer_request_timeout_secs as u64);
if let Some(model_config) = state.endpoint_lb.get_model_config(&model_id) {
    inference_timeout = Duration::from_secs(
        model_config.ai_config.inference_timeout.max(min_timeout),
    );
}
The effective timeout is:
max(model_config.inference_timeout, infer_request_timeout_secs). This ensures:
- Models can extend their timeout beyond the 30-second default
- Models cannot reduce their timeout below the constellation-server minimum
- Unconfigured models (inference_timeout=0) use the 30-second default
Enforcement
constellation-server enforces the timeout by wrapping the entire inference call in
tokio::time::timeout(inference_timeout, handle_inference_request(...))
(server/src/lib.rs:588-599).
When the timeout elapses, the future is dropped, which closes the connection to the GPU
container. The error maps to ServerError::InferenceTimeout (line 603).
The other end of the dropped connection varies by model backend
(service-discovery/src/lib.rs:39-47):
- Triton: Upstream NVIDIA tritonserver binary (launched via model-greeter). Disconnect behavior depends on Triton's own implementation.
- TGI/TEI: Upstream HuggingFace binaries. Disconnect behavior depends on the upstream Rust server implementation.
- PipeHttp/PipeHttpLlm: Custom Python framework (WorkersAiApp in ai_catalog_common, FastAPI-based). The synchronous HTTP path (_generate) does not explicitly cancel the raw_generate coroutine on client disconnect — the inference runs to completion (catalog/common/.../workers_ai_app/app.py:880-896). WebSocket connections do cancel processing tasks in their finally block (catalog/common/.../workers_ai_app/app.py:536-553).
The constellation-server timeout covers:
- Time waiting in the constellation-server queue (permit acquisition)
- Forwarding to the GPU container
- Inference execution in the backend (Triton, TGI, TEI)
- Response transmission back through constellation-server
But constellation-entry’s 4-minute timeout wraps the entire round-trip to constellation-server, so it is the effective ceiling regardless of per-model config.
Unlike Director’s system, there is no separate setup timeout. Model containers are managed by the orchestrator (K8s) independently of request processing. Container initialization and readiness are handled by service discovery and health checks, not request-level timeouts.
Timeout Behavior Examples
Scenario 1: Text generation model with default timeout
- Model config: inference_timeout=0 (not configured)
- constellation-server: infer_request_timeout_secs=30
- Effective timeout: 30 seconds
Scenario 2: Image generation model with extended timeout
- Model config: inference_timeout=120 (2 minutes)
- constellation-server: infer_request_timeout_secs=30
- Effective timeout: 120 seconds (model config wins)
Scenario 3: Model timeout exceeds constellation-entry ceiling
- Model config: inference_timeout=300 (5 minutes)
- constellation-server: infer_request_timeout_secs=30
- constellation-entry: 240 seconds (hardcoded)
- constellation-server effective timeout: 300 seconds
- Actual outcome: constellation-entry kills the connection at 240 seconds
Scenario 4: Attempted timeout reduction (not allowed)
- Model config: inference_timeout=10 (10 seconds)
- constellation-server: infer_request_timeout_secs=30
- Effective timeout: 30 seconds (constellation-server minimum enforced)
Code references:
- worker-constellation-entry service binding call (no timeout): ai/session.ts:131
- constellation-entry 4-minute timeout: proxy_server.rs:99, proxy_server.rs:831-845
- constellation-server CLI configuration: cli.rs:107-109
- Model config structure: model-repository/src/config.rs:100-150
- Timeout resolution and enforcement: server/src/lib.rs
Key Differences
Complexity and Granularity:
- Director: Multi-phase timeout system with separate controls for setup (model initialization) and execution (inference), plus deployment-level lifetime limits
- Workers AI: Two-layer timeout — constellation-entry imposes a hard 4-minute ceiling, constellation-server applies per-model timeouts within that ceiling
Deadline Calculation:
- Director: Priority-based deadline system with per-prediction, deployment, and fallback timeouts. Calculates earliest applicable deadline at dequeue time, accounting for time already spent in queue
- Workers AI: Simple max() operation between model config and constellation-server default, applied at request receipt time
Queue Time Handling:
- Director: MaxRunLifetime includes queue time; execution timeout accounts for time already elapsed since creation
- Workers AI: Timeout starts when constellation-server receives the request, inherently includes queue time
Setup vs Runtime:
- Director: Explicit ModelSetupTimeout for container initialization, separate from prediction execution timeout
- Workers AI: No request-level setup timeout; container lifecycle managed independently by orchestrator
Configuration Source:
- Director: Environment variables set by cluster service during deployment creation
- Workers AI: CLI argument (constellation-server) + distributed config store (Quicksilver/Config API)
Default Values:
- Director: 30 minutes prediction timeout (generous for long-running inference), setup and lifetime timeouts disabled by default
- Workers AI: 30 seconds base timeout (optimized for fast inference), extended per-model via config, hard ceiling of 4 minutes at constellation-entry
Minimum Enforcement:
- Director: 30-minute minimum enforced for prediction timeout if misconfigured
- Workers AI: constellation-server minimum enforced as floor, models can only extend
Use Case Alignment:
- Director: Designed for variable-length workloads including multi-hour training runs; flexible per-deployment configuration
- Workers AI: Optimized for inference workloads with known latency profiles; centralized model-specific configuration
Job Types
Overview
Replicate supports three distinct job kinds — predictions, training, and procedures (pipelines) — all running through the same Director/cog infrastructure but with different queue prefixes, cog endpoints, timeout behaviors, and webhook structures. Workers AI has a single job type: inference. Task-type differentiation (text-generation, image-to-text, etc.) exists only as an input/output schema concern in worker-constellation-entry; everything downstream is task-agnostic.
Replicate Job Types
Director supports three job kinds, configured per-deployment via DIRECTOR_JOB_KIND
(director/config.go:37):
- prediction (default)
- training
- procedure
The job kind determines which cog HTTP endpoint Director calls, how webhooks are structured, and what timeout behavior applies.
How Job Kind Is Set
Cluster sets DIRECTOR_JOB_KIND when generating the Director pod spec
(pkg/kubernetes/deployable.go:1268-1280):
- isProcedureMode() → procedure
- isTraining() → training
- Neither → default prediction
Where:
- isProcedureMode() = UseProcedureMode || UseUnifiedPipelineNodePool (deployable_config.go:234)
- isTraining() = TrainingMode != nil && *TrainingMode (deployable_config.go:321)
DeployableKind (Web Side)
The web app distinguishes four deployable kinds
(deployable_config.py:114-120):
| DeployableKind | Director JobKind | Redis Queue Pattern | Key Pattern |
|---|---|---|---|
| DEPLOYMENT_PREDICTION | prediction | input:prediction:{key} | dp-{uuid4_hex} |
| FUNCTION_PREDICTION | procedure | input:prediction:{key} (or UNIFIED_PIPELINE_QUEUE) | fp-{hash} |
| VERSION_PREDICTION | prediction or procedure | input:prediction:{key} (or UNIFIED_PIPELINE_QUEUE) | vp-{docker_image_id[:32]} |
| VERSION_TRAINING | training | input:training:{key} | vt-{docker_image_id[:32]} |
Key prefixes: dp- = deployment, fp- = function/pipeline, vp- = version prediction,
vt- = version training. FUNCTION_PREDICTION is used for pipelines (procedures) — see
the Procedures section below.
Queue names are generated by DeployableConfig.queue
(deployable_config.py:565-589).
OfficialModel and ModelDeployment
Some models expose a “versionless API” — users run predictions against a model name
(POST /v1/models/{owner}/{name}/predictions) rather than a specific version hash.
OfficialModel (current production) — Ties a Model to a backing Deployment and a
BillingConfig
(official_model.py).
From the end user’s perspective, an OfficialModel behaves like a Deployment — it has
scaling, hardware selection, and version pinning — but the user interacts with it via the
model name, not a deployment name. The actual infrastructure (K8s deployment, Director
pods, Redis queues) runs through the backing Deployment.
ModelDeployment (stalled replacement) — Intended to replace OfficialModel by
associating a Model with a Deployment (and optionally a Procedure) via a labeled
relationship
(model_deployment.py).
The migration from OfficialModel to ModelDeployment was started but the work stalled and
was rolled back. Both entities exist in the codebase and the dispatch logic checks for a
default ModelDeployment first, falling back to OfficialModel
(api_internal_views.py:303-326).
TODO comments in the code (e.g. “Drop is_official from here once OfficialModels are
gone” at
model.py:1322)
reflect the intended direction.
A model uses_versionless_api when it has either an OfficialModel or a default
ModelDeployment
(model.py:1321-1323).
Predictions
The default job kind. Director sends PUT /predictions/{id} to cog
(cog/client.go:183-184).
Three prediction subtypes exist, distinguished by how the deployable is created:
Version predictions — Direct runs against a model version. Created when a user runs a
prediction against a specific version ID. Key: vp-{docker_image_id[:32]}
(version.py:618-619).
Deployment predictions — Runs through a named deployment. The deployment owns the
scaling config, hardware selection, and version pinning. Kind is DEPLOYMENT_PREDICTION
with key prefix dp-
(deployment.py:611,
deployment.py:1024-1027).
Hotswap predictions — A version that has additional_weights and a base_version.
The base container stays running and loads different weights (e.g. LoRA adapters) at
runtime instead of booting a new container per version.
Hotswap requirements
(version.py:586-594):
- Version has additional_weights
- Version has a base_version
- Version's base docker image matches the base version's docker image
- Base version’s model is public, image is non-virtual, and image accepts replicate weights
Hotswap key: hp-{base_version.docker_image_id[:32]}
(version.py:605-608).
Multiple hotswappable versions sharing the same base version share the same key, so they
share the same Director pool.
Hotswap uses prefer_same_stream=True
(logic.py:1243),
which tells Director’s Redis message queue to prefer consuming from the same stream shard
repeatedly
(redis/queue.go:248-251).
This increases the chance that a Director instance serves the same version’s weights
consecutively, avoiding unnecessary weight reloads. PreferSameStream is incompatible
with batching (MaxConcurrentPredictions > 1)
(config.go:90-92).
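A toy sketch of the same-stream preference: the consumer remembers the last shard it read from and keeps using it while it still has pending messages. This is an illustrative approximation of the behavior referenced in redis/queue.go, not the actual implementation.

package main

// Toy sketch of "prefer same stream" shard selection.

import "fmt"

type fetcher struct {
	lastShard int
}

// nextShard picks the shard to read next from the set of shards that
// currently have pending messages.
func (f *fetcher) nextShard(pending map[int]bool, preferSame bool) int {
	if preferSame && pending[f.lastShard] {
		return f.lastShard // stick to the previous shard while it has work
	}
	for shard := range pending {
		f.lastShard = shard
		return shard
	}
	return -1 // nothing pending
}

func main() {
	f := &fetcher{lastShard: 3}
	fmt.Println(f.nextShard(map[int]bool{3: true, 7: true}, true)) // 3: sticks to the last shard
	fmt.Println(f.nextShard(map[int]bool{7: true}, true))          // 7: last shard drained, move on
}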
Training
DIRECTOR_JOB_KIND=training. Director sends PUT /trainings/{id} (if cog >= 0.11.6),
otherwise falls back to PUT /predictions/{id}
(cog/client.go:279-311).
Training-specific behaviors:
- Queue prefix is input:training: instead of input:prediction:
- Timeout uses the higher of the prediction timeout and TrainTimeoutSeconds (24 hours) (deployable.go:1901)
- Prediction.Destination field is present (specifies where trained weights go) (cog/types.go:182)
- Concurrency is always 1 (deployable_config.py:417-418)
- Webhook payload wraps the prediction in a Job with Kind: "training" and the prediction in the Training field instead of Prediction (worker.go:746-749)
Training is set via DeployableConfig.training_mode which maps to
DeployableKind.VERSION_TRAINING
(deployable_config.py:591-595).
Procedures
DIRECTOR_JOB_KIND=procedure. Director sends PUT /procedures/{id}
(cog/client.go:54).
Procedures are multi-step workflows (pipelines) that run user-provided Python code on a
shared runtime image (pipelines-runtime). Unlike predictions and trainings, the
container image is not user-built — it’s a standard runtime that loads user code at
execution time.
Procedure source loading:
Before executing, Director downloads the procedure source archive (tar.gz) from a signed
URL, extracts it to a local cache directory, and rewrites the
procedure_source_url in the prediction context to a file:// path
(worker.go:540-548).
The procedures.Manager handles download, caching (keyed by URL hash), and extraction
(internal/procedures/manager.go).
Webhook compatibility hack:
When sending webhooks back to the API, Director overwrites the procedure job kind to
prediction because the API doesn’t understand the procedure kind yet
(worker.go:734-741).
Procedure subtypes in web:
- Procedure — persistent, owned by a user/org, tied to a model. Uses DeployableKind.FUNCTION_PREDICTION (procedure.py:230).
- EphemeralProcedure — temporary/draft, used for iteration. Source archives expire after 12 hours. Uses DeployableKind.VERSION_PREDICTION (procedure.py:42).
Both can route to the UNIFIED_PIPELINE_QUEUE if configured
(deployable_config.py:573-584),
which is a shared warm compute pool.
Workers AI Job Types
Workers AI has a single job type: inference. There is no training, no procedure/pipeline execution, and no multi-step workflow at the request level.
Task Types (Schema-Level Only)
worker-constellation-entry defines 14 task types in taskMappings
(catalog.ts):
- text-generation, text-classification, text-embeddings, text-to-image, text-to-speech
- automatic-speech-recognition, image-classification, image-to-text, image-text-to-text
- object-detection, translation, summarization, multimodal-embeddings
- dumb-pipe (passthrough, no input/output transformation)
Each task type is a class that defines:
- Input schema validation and preprocessing
- Payload generation (transforming user input to backend format)
- Output postprocessing
The task type is determined by the model’s task_slug from the config API. The
infrastructure path (constellation-entry → constellation-server → GPU container) is
identical regardless of task type. Task types are purely a worker-constellation-entry
concern — constellation-entry and constellation-server are task-agnostic.
LoRA at Inference Time
constellation-server supports LoRA adapter loading for Triton-backed models
(inference/triton.rs:142-214).
When a request includes a lora parameter:
- Finetune config is fetched from the config API (Deus) via the model-repository crate
- Config is validated: finetune must exist for the current model, all required assets must be present
- lora_config and lora_name are injected as additional Triton inference inputs
This is inference-time adapter loading, not training. The base model stays loaded and the
adapter is applied per-request. There is no LoRA-aware routing or stickiness —
constellation-server’s endpoint selection and permit system are unaware of the lora
field, so consecutive requests for the same adapter may hit different GPU containers.
Async Queue
Two conditions must be met for async execution:
- The model must have use_async_queue enabled in the config API (Deus) — a per-model property, default false (config.rs:134)
- The requester must opt in per-request via options.queueRequest: true in the JSON body (ai.ts:133-144)
When both are true, the request is POSTed to the async-queue-worker service binding
(queue.ts:23).
The response is either 200 (completed immediately) or 202 (queued). To retrieve results,
the requester sends a follow-up request with inputs.request_id set, which triggers a GET
to the queue service
(ai.ts:148-153).
Limitations:
- Polling only — no webhook/callback support
- No streaming — inputs.stream is explicitly deleted when queueRequest is true (ai.ts:135)
This contrasts with Replicate where every prediction is async by default (webhook-driven),
and Prefer: wait is the opt-in for synchronous behavior. Workers AI is synchronous by
default, with async as an opt-in that requires both model-level enablement and per-request
signaling.
Key Differences
Job type diversity:
- Replicate: Three distinct job kinds (prediction, training, procedure) with different cog endpoints, timeout behaviors, queue prefixes, and webhook structures
- Workers AI: Single job type (inference) with task-type differentiation only at the input/output schema level
Training support:
- Replicate: First-class training support with dedicated queue prefix, extended timeouts (24h), destination field for trained weights, and separate cog endpoint
- Workers AI: No training capability. LoRA adapters are loaded at inference time from pre-trained weights, not trained on the platform
Multi-step workflows:
- Replicate: Procedures allow user-provided Python code to run on a shared runtime, enabling multi-model orchestration and custom logic. Source code is downloaded and cached per-execution
- Workers AI: No equivalent. Multi-step orchestration is left to the caller
Weight loading model:
- Replicate: Hotswap mechanism allows a base container to serve multiple versions by loading different weights at runtime. Uses prefer_same_stream to optimize for cache locality across a shared Director pool
- Workers AI: LoRA adapters are injected per-request at the Triton level. No concept of a shared base container serving multiple "versions" — each model ID maps to a fixed deployment
Job type awareness:
- Replicate: The same infrastructure (cluster, Director, cog) handles all three job kinds. Job kind is a configuration axis that changes queue prefixes, cog endpoint paths, timeout calculations, and webhook payload structure — but the systems are the same
- Workers AI: Task type is a thin layer in worker-constellation-entry only. Everything downstream (constellation-entry, constellation-server, GPU containers) is task-agnostic
Deployment Management
Overview
Replicate’s autoscaler is reactive: predictions create K8s Deployments on demand, queue depth drives scaling at 1-second resolution, and idle deployments get pruned automatically. Workers AI’s ai-scheduler is proactive: models are pre-provisioned with minimum instance counts, scaling adjusts within configured bounds at 5-minute resolution, and there is no idle-based pruning. The systems reflect fundamentally different assumptions — bursty ephemeral workloads vs. an always-available model catalog.
Replicate: Autoscaler
The autoscaler runs in every GPU model-serving Kubernetes cluster. It manages K8s
Deployment objects in the models or serving namespaces — creating them when prediction
traffic appears, scaling replica counts based on queue depth, and pruning idle ones.
Source: cluster/pkg/autoscaler/autoscaler.go
(fa8042d)
Concurrent loops
| Loop | Interval | Purpose |
|---|---|---|
| Deployment Dispatcher | event-driven (BRPOP) | Creates/updates K8s Deployments when predictions arrive |
| Scaler | 1 second | Adjusts replica counts based on queue depth |
| Scale State Snapshotter | 1 second | Captures queue lengths + replica counts into Redis |
| Queue Pruner | 1 hour | Deletes stuck requests older than 24 hours |
| Deployment Pruner | 1 minute | Deletes K8s Deployments that have been idle too long |
Deployment Dispatcher
┌───────────────┐
│ replicate/api │
└──────┬────────┘
│ LPUSH
▼
┌──────────────────────┐
│ prediction-versions │
│ (Redis list) │
└──────┬───────────────┘
│ BRPOP
▼
┌──────────────────────┐
│ Deployment │
│ Dispatcher │
│ (event-driven) │
└──────┬───────────────┘
│ create/update
▼
┌──────────────────────┐
│ K8s Deployments │
│ (models / serving) │
└──────────────────────┘
startDeploymentDispatcher BRPOPs from the
prediction-versions Redis queue. Each prediction triggers
ensureDeployableDeployment, which creates or
updates the K8s Deployment if the config has changed. Change detection uses a
config hash comparison plus a template version serial.
Key behavior:
- Deployments are created on demand — the first prediction for a version/deployment triggers K8s Deployment creation
- Rate-limited to avoid overwhelming the K8s API
- Config hash comparison means no-op if nothing changed
Scaler
┌──────────────────────┐
│ K8s Deployments │
│ (models / serving) │
└──────┬───────────────┘
│ read replicas + queue lengths
▼
┌──────────────────────┐
│ Snapshotter (1s) │
└──────┬───────────────┘
│ write
▼
┌──────────────────────┐
│ Redis cache │
│ (scale state) │
└──────┬───────────────┘
│ read
▼
┌──────────────────────┐
│ Scaler (1s) │
│ computeNewReplica │
│ CountV2 │
└──────┬───────────────┘
│ patch replicas
▼
┌──────────────────────┐
│ K8s Deployments │
│ (models / serving) │
└──────────────────────┘
startScaler runs every 1 second. For each deployable found
in K8s, it loads the scale state from Redis cache and calls
scaleDeployable().
The scaling algorithm (computeNewReplicaCountV2)
is HPA-inspired:
desiredReplicas = currentReplicas × (metricValue / targetMetricValue)
Where the metric is backlog-per-instance:
backlogPerInstance = (queueLength + queueHeadroom) / instances
Three dampening mechanisms prevent oscillation:
- Scaling policies — rate limits on scale-out and scale-in. Default scale-out: allow +5 count or +100% per minute (whichever is larger). Default scale-in: unrestricted rate.
- Stabilization windows — the algorithm considers the min (for scale-in) or max (for scale-out) desired replica count over a time window. Default scale-out: 30 seconds. Default scale-in: 2 minutes.
- Hysteresis — ignore small oscillations below a threshold (default 0.02).
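A simplified sketch of this calculation is below: desired replicas scale with backlog per instance relative to the target, and hysteresis suppresses small moves. Stabilization windows and the rate-limit policies are omitted, and the names are illustrative rather than the actual computeNewReplicaCountV2 code.

package main

// Simplified sketch of the backlog-per-instance calculation with hysteresis.

import (
	"fmt"
	"math"
)

func desiredReplicas(current int, queueLength, queueHeadroom, targetBacklog, hysteresis float64) int {
	if current == 0 {
		current = 1 // sketch-only guard; the real scale-from-zero path differs
	}
	backlogPerInstance := (queueLength + queueHeadroom) / float64(current)
	ratio := backlogPerInstance / targetBacklog
	if math.Abs(ratio-1.0) <= hysteresis {
		return current // ignore small oscillations
	}
	return int(math.Ceil(float64(current) * ratio))
}

func main() {
	// 3 replicas, 12 queued + 3 headroom, target backlog of 2 per instance: scale to 8.
	fmt.Println(desiredReplicas(3, 12, 3, 2, 0.02))
	// Exactly on target (10 queued / 5 replicas = 2): inside the hysteresis band, stay at 5.
	fmt.Println(desiredReplicas(5, 10, 0, 2, 0.02))
}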
Additional scaling features:
- Slow start: cap at 5 replicas until the first pod reports ready
- Scale-to-zero: supported, with configurable idle timeout delay. Gated by the scale-to-zero-delay kill switch.
- Override min replicas: per-deployable minimum via DeployableConfig
- Emergency cap: cap-max-replicas feature flag
Scaling configuration
scaling.Config holds per-deployable scaling parameters:
| Field | Default | Source |
|---|---|---|
| MetricTarget | config.BacklogPerInstance flag | CLI flag |
| Hysteresis | 0.02 | hardcoded |
| MinReplicas | 0 | DeployableConfig.ScalingConfig |
| MaxReplicas | config.MaxReplicas flag | CLI flag |
| ScaleOut behavior | 30s stabilization, +5 or +100%/min | algorithm_v2_defaults.go |
| ScaleIn behavior | 2min stabilization, no rate limit | algorithm_v2_defaults.go |
Per-deployable overrides come from deployable.ScalingConfig,
which is set via the web’s DeployableConfig and serialized
into K8s annotations.
Deployment Pruner
┌──────────────────────┐
│ Redis cache │
│ (last-request-time) │
└──────┬───────────────┘
│ read
▼
┌──────────────────────┐
│ Deployment Pruner │
│ (1min) │
└──────┬───────────────┘
│ delete idle
▼
┌──────────────────────┐
│ K8s Deployments │
│ (models / serving) │
└──────────────────────┘
startDeploymentPruner runs every 1 minute. It
deletes K8s Deployments that haven’t received a prediction since
DeployableDeploymentDeleteInactiveAfter.
Max 100 deletions per cycle. Fails safe if the latest request time is missing
from Redis (skips the deployment rather than deleting it).
Queue Pruner
startQueuePruner runs every 1 hour.
Deletes stuck requests older than 24 hours from per-deployable
Redis streams. The API’s sweeper also cleans these streams at much
shorter intervals (30s). See
Queue Management for details on both
cleanup mechanisms and how they overlap.
Workers AI: ai-scheduler
Workers AI deployment management is split across multiple systems.
ai-scheduler is a Rust binary deployed to a core datacenter K8s
cluster (pdx-c). It has multiple subcommands, each run as a
separate K8s deployment: AutoScaler, Scheduler, AdminAPI,
ReleaseManagerWatcher, ExternalNodesWatcher, RoutingUpdater.
They share the scheduling crate as a library.
Source: ai-scheduler/ (89f8e0d)
Architecture overview
┌──────────────────────────────────────────────────────────┐
│ ai-scheduler │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────┐ │
│ │ auto_scaling │ │ admin_api │ │ release_mgr │ │
│ │ (5min loop) │ │ (manual ops) │ │ _watcher │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬─────────┘ │
│ │ │ │ │
│ └────────┬────────┴──────────────────┘ │
│ ▼ │
│ ┌──────────────┐ │
│ │ scheduling │ (action execution) │
│ └──────┬───────┘ │
│ │ │
└────────────────┼─────────────────────────────────────────┘
│
┌────────┴────────┐
▼ ▼
Cloud Chamber External K8s
(internal GPU) (OKE, CoreWeave, Nebius,
Lambda, GCP, Crusoe)
│
▼
inference-kubernetes-engine
(Model CRD operator)
Three entry points produce scheduling actions:
- auto_scaling — utilization-based autoscaler loop
- admin_api — manual scaling endpoints for humans
- release_manager_watcher — watches for software version changes, triggers rolling updates
All actions flow through the scheduling app, which applies them to either
Cloud Chamber (internal capacity) or external K8s clusters via the
external_nodes module.
Autoscaler (auto_scaling)
The autoscaler runs every 5 minutes. Each cycle:
- Fetches model usage from ClickHouse (request counts, inference time per minute over 15-minute windows)
- Fetches usage “forecast” from ClickHouse (same-time-last-week data, not a forecasting model)
- Fetches utilization metrics from Prometheus (soft — errors produce empty data, not failures)
- Loads model configuration from Config API
- Gets current application state from Cloud Chamber
- Fetches external endpoint health from Quicksilver and counts healthy deployments from Cloud Chamber
- Handles external capacity scheduling (may emit ScheduleExternalModelApplications)
- Computes desired instance count per model
- Emits UpscaleModelApplications or DownscaleModelApplications actions
Scaling algorithms
Three algorithms coexist. Selection depends on per-model Config API properties:
1. Request-count-based (default fallback):
instances = avg_count_per_min × autoscaling_factor
Default autoscaling_factor is 0.8. This assumes each inference request takes
~1 minute. Crude, but serves as the baseline.
2. Utilization-based (model_utilization_autoscaler = true):
Computes utilization from cumulative inference time:
required_to_fit = inference_time_secs_per_min / 60 / max_concurrent_requests
utilization = required_to_fit / model_instances
Uses a deadband instead of dampening:
- If utilization < 1 / (factor + 0.15) → downscale
- If utilization > 1 / factor → upscale
- Otherwise → no change
Minimum factor is clamped to 1.2 (20% overprovisioning floor). Takes the max of current and forecast inference time.
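A sketch of that decision logic, with illustrative names; the real algorithm also folds in the forecast data as described above:

package main

// Sketch of the utilization deadband decision.

import "fmt"

func utilizationDecision(inferenceSecsPerMin float64, maxConcurrent, instances int, factor float64) (float64, string) {
	if factor < 1.2 {
		factor = 1.2 // 20% overprovisioning floor
	}
	requiredToFit := inferenceSecsPerMin / 60.0 / float64(maxConcurrent)
	utilization := requiredToFit / float64(instances)
	switch {
	case utilization > 1.0/factor:
		return utilization, "upscale"
	case utilization < 1.0/(factor+0.15):
		return utilization, "downscale"
	default:
		return utilization, "no change" // inside the deadband
	}
}

func main() {
	// 660s of inference per minute, 2 concurrent requests per instance, 6 instances:
	fmt.Println(utilizationDecision(660, 2, 6, 1.2)) // ~0.92 > 1/1.2, so upscale
	fmt.Println(utilizationDecision(300, 2, 6, 1.2)) // ~0.42 < 1/1.35, so downscale
}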
3. EZ utilization-based (scaling_config.utilization_based.active = true):
The newest algorithm. Configurable per-model via ScalingConfig:
scaling_config:
disabled: false
utilization_based:
active: true
min_utilization: 0.3
max_utilization: 0.8
utilization_scale: 1.0
use_forecast: true
use_out_of_cap: true
prometheus: false
Features:
- Configurable min/max utilization bounds (replaces hardcoded deadband)
- Optional forecast-based scaling (use_forecast)
- Out-of-capacity adjustment (use_out_of_cap): inflates measured utilization by (successful + out_of_cap) / successful to account for requests turned away. successful_per_min is floored at 0.1 to avoid division by zero.
- Safety cap: when OOC adjustment is active, measured utilization is capped at 2× the sans-OOC value
- Prometheus metrics as an alternative utilization source
- Asymmetric instance base: upscale decisions use model_healthy_instances, downscale uses model_requested_instances
- Unhealthy instance handling: adds +1 buffer when upscaling with unhealthy instances; gradually decrements requested count when >1 unhealthy instances exist
When active, this algorithm overrides the request-count-based
algorithm. However, if model_utilization_autoscaler is also true,
the older utilization-based algorithm takes final precedence — the
priority order is: request-count → EZ utilization → old utilization.
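The out-of-capacity adjustment and its safety cap can be sketched as follows; the function name and inputs are illustrative approximations of the EZ algorithm's behavior:

package main

// Sketch of the out-of-capacity utilization adjustment.

import "fmt"

func oocAdjustedUtilization(measured, successfulPerMin, outOfCapPerMin float64) float64 {
	if successfulPerMin < 0.1 {
		successfulPerMin = 0.1 // avoid division by zero when nearly everything is rejected
	}
	adjusted := measured * (successfulPerMin + outOfCapPerMin) / successfulPerMin
	if limit := measured * 2; adjusted > limit {
		adjusted = limit // safety cap: never more than 2x the unadjusted value
	}
	return adjusted
}

func main() {
	fmt.Println(oocAdjustedUtilization(0.6, 100, 25)) // 0.75: 25% of requests were rejected
	fmt.Println(oocAdjustedUtilization(0.6, 10, 90))  // 1.2: capped at 2x despite 10x inflation
}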
Instance bounds
All algorithms clamp results to [min_count, max_count] from Config API:
- Default min_count: 5 (set per autoscaler instance via CLI flag default_min_count)
- Default max_count: 100
- Downscale batch size capped at 10 per cycle
There is no idle-based scale-to-zero. Models stay provisioned at
min_count (default 5, but can be set to 0 per-model).
Kill switches
- disable_scaling — global Config API property, disables all scaling
- scaling_config.disabled — per-model disable
- Tier change grace period — skips models that recently changed tiers
- Tier selector — autoscaler instances can be scoped to specific tier ranges
Scheduling actions
Actions are applied via Cloud Chamber API or external K8s:
| Action | Target | Effect |
|---|---|---|
| UpscaleModelApplications | Cloud Chamber | PATCH application instances count up |
| DownscaleModelApplications | Cloud Chamber | PATCH application instances count down |
| ScheduleExternalModelApplications | External K8s | Create/patch Model CRD replicas |
| CreateModelApplication | Cloud Chamber | Create new CC application + set instances |
| DeployModelApplicationToTiger | Cloud Chamber | Deploy to Tiger (canary) environment |
| DeleteDeployment | Cloud Chamber | Delete specific deployment |
| RemoveModelApplication | Cloud Chamber | Delete entire application |
| ModifyApplicationRegions | Cloud Chamber | Modify region constraints |
| ModifyApplicationSchedulingPriority | Cloud Chamber | Modify scheduling priority |
| ModifyApplicationAffinities | Cloud Chamber | Modify colocation/affinity constraints |
Instance count is clamped to 0–1400 per application. Up/downscale distributes changes round-robin across applications.
Actions are defined in the Action enum.
External capacity
External capacity (OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe) has a separate
management model controlled by the ExternalCapacity config
(Management enum):
pub enum Management {
    Disabled,  // no auto-management
    Manual,    // humans manage it entirely
    Upscaling, // autoscaler can only scale UP
    Full,      // autoscaler can scale both directions
}
The autoscaler code has this comment:
NOTE: currently auto-scaler can only scale up external instances but not scale down. In order to scale down, use admin api endpoint: /models/schedule_externally
This comment may be stale — Management::Full supports both
directions in the code. The real constraint is that the autoscaler’s
scaling algorithms don’t dynamically compute external replica targets;
external capacity is config-driven (expected_replicas), not
utilization-driven.
The external path works as follows:
- Infrastructure: K8s clusters provisioned via Terraform (oci-terraform/ repo — OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe)
- Model operator: inference-kubernetes-engine watches Model CRDs and reconciles K8s Deployments/Services to match
- Scaling up: autoscaler patches the Model CRD spec.replicas via KubernetesProvider.schedule()
- Scaling down: manual via admin API or the scale-down CLI tool in inference-kubernetes-engine, which gradually decrements replicas with a configurable interval and batch size
Admin API
Manual scaling endpoints sit behind an is_workers_ai_team() access check:
- GET /models/scale — upscale a model by a given amount (creates an UpscaleModelApplications action; notably a GET for a mutating operation)
- POST /models/schedule — run the full scheduler loop for a single model
- POST /models/schedule_externally — set the external replica count
- POST /models/schedule_externally_on_specific_provider — target a specific provider
- POST /models/remove — delete a specific deployment by colo/metal
- POST /models/delete_applications — delete all applications for a model, including external CRDs
- GET /models/status — model status/debugging
- Tiger management — create, list, delete canary deployments
Model provisioning
Unlike Replicate, Workers AI models are pre-provisioned rather than created on demand from inference traffic:
- Model is registered in Config API with properties (min_count, max_count, software, gpu_memory, etc.)
- release_manager_watcher detects the software version and creates Cloud Chamber applications
- Autoscaler maintains instance count within [min_count, max_count] based on utilization
- There is no equivalent of Replicate’s deployment pruner — models stay provisioned at min_count until manually removed
Config API properties (scaling-relevant)
| Property | Default | Description |
|---|---|---|
min_count | 5 | Minimum instances |
max_count | 100 | Maximum instances |
scaling_factor | 0.8 | Autoscaling factor |
model_utilization_autoscaler | false | Enable utilization-based algorithm |
scaling_config | none | YAML blob with disabled, utilization_based sub-config |
disable_scaling | false | Kill switch (also available as global property) |
external_capacity | none | External provider config with management mode |
gpu_memory | 22 | GPU memory request (GB) |
gpu_model | none | Specific GPU model requirement |
dual_gpu | false | Requires two GPUs |
colo_tier | none | Restrict to colo tier |
colo_region | none | Restrict to colo region |
tier | “unknown-scheduler” | Model tier (Tier-0, Tier-1, Tier-2) |
Key Differences
| Aspect | Replicate | Workers AI |
|---|---|---|
| Scaling signal | Queue depth (backlog per instance) — real-time | Inference time utilization — 15-minute ClickHouse windows |
| Loop frequency | 1 second | 5 minutes |
| Deployment creation | On demand from prediction traffic | Pre-provisioned via Config API + release manager |
| Scale-to-zero | Yes, with idle timeout | No idle-based scale-to-zero; min_count defaults to 5 but can be 0 per-model |
| Deployment pruning | Automatic (idle deployments deleted after timeout) | None — models stay provisioned until manually removed |
| Dampening | HPA-style: scaling policies, stabilization windows, hysteresis | Deadband: upper/lower utilization bounds |
| Orchestrator | Direct K8s API (Deployments) | Cloud Chamber (internal) + K8s operator (external) |
| External capacity | N/A (single K8s cluster) | Multi-provider (OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe) with per-model management mode |
| Manual controls | Feature flags (cap-max-replicas, scale-to-zero-delay) | Admin API endpoints, disable_scaling kill switch, Management::Manual mode |
| Scale-down on external | N/A | Config-driven; Management::Full supports both directions but targets are set manually, not utilization-driven |
| Algorithm selection | Single algorithm (HPA-inspired) | Three algorithms with priority ordering: request-count → EZ utilization → old utilization |
| Config source | K8s annotations (from web’s DeployableConfig) | Config API properties (key-value store) |
Architectural contrast
Replicate’s autoscaler is reactive and fine-grained: predictions create deployments, queue depth drives scaling at 1-second resolution, idle deployments get pruned. The system assumes workloads are bursty and ephemeral.
Workers AI’s ai-scheduler is proactive and coarse-grained: models are pre-provisioned with minimum instance counts, scaling adjusts within configured bounds at 5-minute resolution, and external capacity management is heavily human-assisted. The system assumes a catalog of always-available models.
Resource Allocation
Resource Allocation
Overview
Both platforms need to map model requirements to GPU hardware, but they approach it from opposite directions. Replicate exposes hardware selection to users as a first-class concept — users pick a SKU, and the platform provisions dedicated resources. Workers AI abstracts hardware entirely — operators configure GPU memory requirements per model in Config API, and the platform handles placement across a shared fleet.
Replicate Resource Model
Hardware SKUs
The Hardware Django model
(web/models/hardware.py) defines every available
hardware option as a SKU. Each SKU combines a compute unit (GPU type)
with a compute_units_per_instance count (how many GPUs per pod).
Current compute units
(cluster/pkg/kubernetes/compute_unit.go):
| Compute Unit | GPU | Status |
|---|---|---|
cpu | None | Active |
gpu-t4 | Nvidia T4 (16 GB) | Active |
gpu-a100 | Nvidia A100 (40 GB) | Active |
gpu-a100-80g | Nvidia A100 (80 GB) | Active |
gpu-h100 | Nvidia H100 (80 GB) | Active |
gpu-h200 | Nvidia H200 (141 GB) | Active |
gpu-l40s | Nvidia L40S (48 GB) | Active |
gpu-a40, gpu-a40-small, gpu-a40-large | Nvidia A40 (48 GB) | Legacy |
gpu-t4-highmem, gpu-t4-lowmem | Nvidia T4 variants | Legacy |
gpu-rtx-a4000, gpu-rtx-a5000, gpu-rtx-a6000 | Nvidia RTX Axxxx | Legacy |
gpu-flex-ampere-min-40g | Any Ampere ≥40 GB | Legacy |
Multi-GPU SKUs use the same compute unit with a higher
compute_units_per_instance. For example, gpu-2x-a100 is compute unit
gpu-a100 with compute_units_per_instance=2. The Hardware model tracks
this as a separate SKU with its own pricing.
Hardware availability is gated by flags on the model:
allow_for_models, allow_for_deployments, is_legacy, is_preview.
The HardwareQuerySet methods (available_for_models(),
available_for_deployments()) filter to billable, non-legacy, public
hardware
(web/models/hardware.py:162-166).
K8s Resource Limits
The computeUnitLimits map
(cluster/pkg/kubernetes/resources.go) defines CPU,
memory, and GPU limits per compute unit:
| Compute Unit | CPU | Memory | GPUs |
|---|---|---|---|
cpu | 1 | 2 Gi | 0 |
gpu-t4 | 4 | 16 Gi | 1 |
gpu-a100 | 10 | 72 Gi | 1 |
gpu-a100-80g | 10 | 144 Gi | 1 |
gpu-h100 | 13 | 144 Gi | 1 |
gpu-h200 | 13 | 144 Gi | 1 |
gpu-l40s | 10 | 72 Gi | 1 |
For multi-GPU pods, limits are multiplied by compute_units_per_instance
with special cases for 8x configurations:
- 8x A40/A40-Large: 6 CPU, 85 Gi per unit (reduced from 10/72)
- 8x A100-80G: 10 CPU, 120 Gi per unit (reduced from 10/144)
- T4 requests: memory request fudged to 13 Gi (limit stays 16 Gi) to fit 4x T4 on a node
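A rough sketch of how the per-pod limits compose, assuming a hypothetical Limits struct and hard-coding only the values quoted above (the real computeUnitLimits map in Go covers every compute unit):

```rust
/// Illustrative sketch of the per-pod limit computation; values mirror the
/// prose above, not the actual cluster code.
#[derive(Clone, Copy, Debug)]
struct Limits {
    cpu: u32,
    memory_gi: u32,
    gpus: u32,
}

fn pod_limits(compute_unit: &str, units_per_instance: u32) -> Limits {
    // Base per-unit limits (subset of the table above).
    let mut per_unit = match compute_unit {
        "gpu-a100-80g" => Limits { cpu: 10, memory_gi: 144, gpus: 1 },
        "gpu-a40" => Limits { cpu: 10, memory_gi: 72, gpus: 1 },
        "gpu-t4" => Limits { cpu: 4, memory_gi: 16, gpus: 1 },
        _ => Limits { cpu: 10, memory_gi: 72, gpus: 1 },
    };
    // 8x special cases reduce per-unit CPU/memory so the pod still fits a node.
    if units_per_instance == 8 {
        match compute_unit {
            "gpu-a40" => per_unit = Limits { cpu: 6, memory_gi: 85, gpus: 1 },
            "gpu-a100-80g" => per_unit = Limits { cpu: 10, memory_gi: 120, gpus: 1 },
            _ => {}
        }
    }
    Limits {
        cpu: per_unit.cpu * units_per_instance,
        memory_gi: per_unit.memory_gi * units_per_instance,
        gpus: per_unit.gpus * units_per_instance,
    }
}

fn main() {
    // An 8x A100-80G pod: 80 CPU, 960 Gi, 8 GPUs (vs 1152 Gi without the override).
    println!("{:?}", pod_limits("gpu-a100-80g", 8));
}
```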
Director’s sidecar container adds its own overhead: 256 Mi for most
GPUs, 512 Mi for L40S/H100/A100/A100-80G, or 1 Gi when
DirectorExtraMemory is set
(deployable.go:1450-1465).
The model container also gets /dev/shm sized to 50% of its memory
limit
(deployable.go:681-686).
Node Placement and Bin-Packing
Replicate runs on three cloud providers
(config/config.go):
- CoreWeave CKS — primary, most models
- Nebius MK8S — H200 capacity
- GKE — T4 capacity
On CoreWeave CKS, placement uses a combination of node affinity,
bin-packing, priority classes, and topology spread
(deployable_cks.go):
Node affinity (required): pods are pinned to nodes matching the GPU
class label (gpu.nvidia.com/class). CPU models go on GPU nodes but
are excluded from H100/H200 nodes to preserve expensive capacity.
Procedure (pipeline) workloads go to a dedicated
customer-cpu-nodepool.
Bin-packing: GPU models use CoreWeave’s binpack-scheduler to pack
pods tightly, leaving room for 4x and 8x pods. Pod affinity preferences
(weight 10) group pods of the same compute unit on the same node. A
second pod affinity groups pods by grace period (standard vs extended
predict timeout) so that preemption doesn’t kill long-running
predictions.
Priority node pools: 8x GPU pods get a node preference (weight 100) for dedicated priority node pools. Non-8x pods prefer to avoid these pools. On-demand node pools are also deprioritized (weight 100 against).
Priority classes:
- r8-high — 4x+ GPU pods that are allowed to preempt others
- r8-high-no-preempt — 4x+ GPU pods that won’t preempt
- r8-cpu-model — CPU models, lowest priority, preemptible by GPU
Topology spread: CPU models on CKS use TopologySpreadConstraints
with maxSkew=2 to prevent bunching on a single node and starving GPU
models of CPU resources.
On Nebius MK8S, placement is similar to CKS but narrower in scope
(deployable_mk8s.go):
GPU support: H200 only. The switch statement handles ComputeUnitCPU
and ComputeUnitH200 — everything else gets a NO_SUCH_GPU sentinel
that prevents scheduling. Node affinity uses the standard K8s label
node.kubernetes.io/instance-type (value gpu-h200-sxm).
Bin-packing: Same pod affinity logic as CKS — non-8x GPU pods get compute-unit bin-packing (weight 10) and grace-period bin-packing (weight 10). No custom scheduler though — uses the default K8s scheduler.
CPU models: Excluded from H200 nodes via NodeSelectorOpNotIn on
the GPU class label, but no dedicated CPU node pool — they land on
whatever non-H200 nodes are available.
On GKE, placement is minimal
(deployable_gke.go):
Node affinity (required): pods are pinned to nodes matching a
replicate/role label with value model-{compute_unit} (e.g.
model-gpu-t4). This is a Replicate-defined label on the node pool,
not a vendor GPU class label.
Tolerations: Three tolerations per pod — the default GKE
nvidia.com/gpu taint, a legacy model-hardware taint, and a newer
model-compute-unit taint. The code comments indicate the
compute-unit taint is the intended replacement for the hardware taint.
Tenancy Model
The isolation boundary is the deployable (model version or deployment), not the account. Each deployable gets its own K8s Deployment with dedicated pods and exclusive GPU access. But multiple accounts can send prediction requests to the same deployable — the GPU pods themselves are shared across all authorized requesters.
Public models and deployments are multi-tenant: any account can run predictions against them, and all requests land on the same pool of Director pods. This is the common case for official models.
Private models and deployments are single-tenant by default: only the owning account can run predictions. Access can be granted to other accounts via an allow-list, similar to GitHub’s private repository permissions model. The access control is enforced at the API layer (web/api check permissions before enqueuing) — there is no infrastructure-level isolation (no separate namespaces, no network policies between tenants).
The “shared” vs “dedicated” billing distinction (instance_tenancy
in Metronome pricing config) is orthogonal to tenancy. “Shared”
means serverless pay-per-run pricing; “dedicated” means the customer
pays for reserved uptime on a deployment. Both billing models serve
requests from potentially multiple accounts on the same GPU pods.
Workers AI Resource Model
GPU Types and Memory Sizing
Workers AI models specify GPU requirements in Config API
(config_api/src/lib.rs):
| Field | Default | Description |
|---|---|---|
gpu_memory | 22 (GB) | GPU memory requested |
cpu_memory | same as gpu_memory | CPU memory override |
vcpu | (none) | CPU core request |
dual_gpu | false | Needs 2 GPUs |
gpu_model | (none) | GPU model override, e.g. "NVIDIA H100" |
The platform supports three GPU types
(cloudchamber/src/application.rs:259-263):
- L4 — Nvidia L4 (internal colos)
- H100 — Nvidia H100 80 GB (internal colos + external capacity)
- H200 — Nvidia H200 141 GB (external capacity)
GPU type is inferred from the gpu_model string in the application
config. If no model is specified, H100 is assumed
(application.rs:286-291).
Multi-GPU allocation is calculated from total gpu_memory divided by
per-GPU maximum: 80 GB for H100, 141 GB for H200. The result is clamped
to 1–8 GPUs
(external_nodes/.../model.rs:27-34).
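A small sketch of that calculation with illustrative names; whether the real code rounds the division up is an assumption here — only the per-GPU maxima and the 1–8 clamp come from the source:

```rust
/// Illustrative sketch of the multi-GPU allocation described above.
fn gpu_count(gpu_memory_gb: u32, gpu_model: &str) -> u32 {
    let per_gpu_max = match gpu_model {
        "NVIDIA H200" => 141,
        _ => 80, // H100 is the default assumption
    };
    // Round up, then clamp to the 1–8 range.
    gpu_memory_gb.div_ceil(per_gpu_max).clamp(1, 8)
}

fn main() {
    assert_eq!(gpu_count(22, "NVIDIA H100"), 1);   // default single-GPU model
    assert_eq!(gpu_count(160, "NVIDIA H100"), 2);  // needs two 80 GB H100s
    assert_eq!(gpu_count(2000, "NVIDIA H200"), 8); // clamped to 8
}
```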
Unlike Replicate, users never select hardware. The model’s GPU requirements are set by Workers AI operators via Config API, and the platform handles placement.
Placement and Scheduling
Workers AI uses two scheduling paths:
Internal capacity (Cloudchamber): Models are scheduled via
Cloudchamber, Cloudflare’s internal container orchestrator. The
create_application_request function
(application.rs:247-255) maps model config to a
Cloudchamber application with SchedulingPolicy::Gpu, placement
constraints (colo tier, region, PoPs), and optional scheduling priority.
Cloudchamber’s placement is controlled by constraints on the
application object. These map from Config API properties to
ApplicationConstraints fields
(application.rs:216-233):
Colo tier (colo_tier): Cloudflare datacenters are grouped into
tiers by GPU capacity. Setting colo_tier restricts a model’s
instances to datacenters at that tier. For example, when deploying in
“tiger mode” (canary), default single-GPU models are pinned to tier 3
(application.rs:231-232).
Colo region (colo_region): Restricts instances to datacenters in
specific geographic regions. Accepts a comma-separated list.
Colo PoPs (colo_pops): The most specific constraint — restricts
instances to named Points of Presence (individual datacenters).
Scheduling priority (scheduling_priority): Sets the Cloudchamber
scheduling priority for the application. Default is 50. Currently the
only non-default value is Leonardo = 75, used for Leonardo
partnership models to give them preferential placement
(scheduling_priority.rs).
Separately, blacklisted_colos removes specific colos from the
model’s routing table (not scheduling). This is consumed by the
routing app when building the colo list that constellation-entry
uses for request forwarding — it doesn’t affect where Cloudchamber
places instances
(routing/src/lib.rs:201-208).
External capacity (Kubernetes): For external GPU providers (OCI,
etc.), ai-scheduler creates a Model custom resource
(external_nodes/.../model.rs:73-150) that the IKE
operator reconciles into K8s Deployments. GPU limits are set as
nvidia.com/gpu resource requests when >1 GPU is needed.
Tenancy Model
Workers AI is multi-tenant at the model level. The default is
that all accounts share the same model instances — a single GPU
container serving @cf/meta/llama-3.1-8b-instruct handles requests
from every account. The isolation boundary is the model ID, not the
caller.
However, models can be restricted to specific accounts. Two
mechanisms exist in worker-constellation-entry
(ai.ts:49-64):
- allowed_accounts — a comma-separated list of account IDs in Config API. When set, only listed accounts can use the model. Requests from other accounts get a 403. This is how partnership models (e.g. @cf/leonardo/phoenix-1.0) are restricted to the partner’s account.
- is_private — a boolean flag marking a model as private.
Both are enforced at the API layer in worker-constellation-entry, not at the infrastructure level. A “private” Leonardo model still runs on the same GPU fleet as public models — the access gate is just earlier in the request path. This parallels Replicate’s approach where private model access control is also API-layer only.
Since all accounts share the same GPU instances, fairness is enforced at the request level — per-account fair queuing and rate limiting in constellation-server. See Queue Management for details.
Key Differences
| Aspect | Replicate | Workers AI |
|---|---|---|
| Hardware selection | User-facing SKU picker | Operator-configured, abstracted from users |
| GPU types | T4, A100 (40/80), H100, H200, L40S + legacy | L4, H100, H200 |
| Multi-GPU | compute_units_per_instance (1–8) | gpu_memory / per-GPU max (1–8) |
| Resource limits | Explicit per-unit CPU/mem/GPU in Go map | GPU memory-based, CPU/mem optional |
| Isolation boundary | Deployable (model version or deployment) — dedicated GPU per deployable, multiple accounts share it | Model ID — all accounts share the same instances, no per-account isolation |
| Access control | API-layer: public models open to all, private models gated by allow-list | API-layer: public models open to all, allowed_accounts / is_private restrict partnership and private models |
| Fairness | No fairness controls — all requests to a deployable are equal | Per-account fair queuing + rate limiting (see Queue Management) |
| Scheduling | K8s with bin-pack scheduler + affinity | Cloudchamber (internal) + K8s via IKE (external) |
| Cloud providers | CoreWeave CKS, Nebius, GKE (legacy) | Cloudflare internal colos + external (OCI, etc.) |
| Preemption | Priority classes, configurable per-deployable | Scheduling priority in Cloudchamber |
| Placement constraints | GPU class node affinity, priority node pools | Colo tier/region/PoP (scheduling); blacklisted colos (routing only) |
The fundamental architectural difference is where the isolation boundary sits. Replicate isolates at the deployable level — each model version or deployment gets dedicated GPU pods, but multiple accounts can share those pods (public models) or access is allow-listed (private models). Workers AI isolates at the model level — all accounts share the same instances for a given model, with fairness enforced at the request level via queuing and rate limiting rather than resource partitioning.
Queue Management
Queue Management
Overview
Both platforms queue inference requests when backend capacity is saturated, but the queue architectures are fundamentally different. Replicate uses Redis Streams as a durable, distributed queue between the API layer and Director workers — requests persist across process restarts and can be claimed by any Director instance. Workers AI uses bounded in-memory queues local to each constellation-server instance — requests exist only in the process that received them and are lost if it restarts.
Replicate: Redis Streams with Consumer Groups
Queue Naming
Every deployable gets its own Redis Stream. The queue name is derived
from the deployable’s key prefix and kind
(deployable_config.py:565-589):
| DeployableKind | Key Prefix | Queue Name |
|---|---|---|
deployment-prediction | dp- | input:prediction:dp-{hex} |
function-prediction | fp- | input:prediction:fp-{hex} |
version-prediction | vp- | input:prediction:vp-{hex} |
version-training | vt- | input:training:vt-{hex} |
Function and version predictions that use the unified pipeline
cluster config share a single queue: input:prediction:fp-{0*32}
(deployable_config.py:44-45).
MessageQueue
Director’s MessageQueue
(director/redis/queue.go) wraps a Redis
Streams consumer group. Each Director pod is a consumer in the
"director" consumer group. The queue runs two goroutines via
errgroup:
- Fetcher: Waits on a request channel, calls XREAD (via the QueueClient.Read interface) with the configured block duration, and sends the result back on a response channel. The fetcher is the only goroutine that reads from Redis — GetMessage sends a request struct and waits for the response (queue.go:131-148).
- Claimer: Runs every 2 seconds (claimInterval), calling XCLAIM on all unacked messages to maintain ownership. This prevents Redis from reassigning messages to other consumers in the group while the current Director is still processing them. After the fetcher stops (shutdown requested), the claimer continues until all messages are acked or the context is canceled (queue.go:150-191).
Messages are tracked in an unackedMessages map keyed by unique ID
(stream name + message ID). When a Director worker finishes
processing a prediction, it calls Ack on the message, which
removes it from the map. If a Director pod crashes, its unacked
messages remain in the consumer group’s pending entries list (PEL).
The API sweeper reclaims these orphaned messages via XPENDING +
XCLAIM after 30 seconds of idle time, marks the predictions as
failed, and deletes the messages. See
Queue Sweeping below.
Sharded Queues
Each deployable’s queue is sharded across multiple Redis streams
(:s0, :s1, etc.). The ShardedQueueClient
(queue.go:317-356) manages these shards,
using "director" as the consumer group name across all of them.
Every ack pipelines XACK + XDEL — the message is both
acknowledged in the consumer group and deleted from the stream in
a single round-trip (queue.go:329-341).
preferSameStream
When preferSameStream is enabled, the MessageQueue remembers
which stream shard it last read from and passes it as
PreferStream to the next Read call
(queue.go:246-252). This is a hotswap
optimization — by preferring the same shard, the Director pod
improves weight cache locality (the model weights are already
loaded for predictions from that shard’s deployable).
The streamChanged flag on the message indicates when the Director
switched to a different stream, which signals the worker loop that
it may need to handle a model swap.
MultiMessageQueue (Blue/Green)
MultiMessageQueue
(queue.go:73-77) wraps multiple Queue
backends and rotates between them on each GetMessage call. When
multiple backends are present, the block duration is hard-coded to
100ms to stay responsive regardless of which backend has messages
(queue.go:300-308).
This supports blue/green Redis migration — during a migration, both
the old and new Redis instances are active, and the
MultiMessageQueue polls both. Messages carry an explicit
reference to the queue they came from so that acks and claims go to
the correct backend.
Queue Pruning (Autoscaler)
The autoscaler’s startQueuePruner
(autoscaler.go:299-311) runs hourly, scanning
every deployable’s queue for stuck messages. A message is
“stuck” if it’s older than 24 hours (stuckTimeout)
(prune.go:25).
The pruner checks two sources per queue
(prune.go:118-148):
- XPENDING: Messages that were delivered to a consumer but never acknowledged — these were likely picked up by a Director that crashed or got stuck during processing.
- XRANGE: Messages still in the stream that were never delivered to any consumer — these were likely enqueued when no Director pods were running (e.g. setup failed and the deployment scaled to zero).
Stuck messages are deleted in chunks of 50 via pipelined XACK +
XDEL (prune.go:151-161).
Queue Sweeping (API)
The API service runs its own sweeper
(api/internal/sweeper/sweeper.go) that
overlaps with the autoscaler’s pruner but operates differently:
- Frequency: Every ~30 seconds (with ±50% jitter), vs the pruner’s hourly cycle.
- Scope: Scans all streams matching each deployable kind prefix (deployment-prediction, function-prediction, hotswap-prediction, version-prediction, version-training).
- Mechanism: Uses ClaimStaleMessages (XPENDING filtered by a 30-second idle threshold, then XCLAIM) to take ownership of messages that a Director claimed but stopped processing — e.g. because the pod crashed. Then sends a failure update to the web service and deletes the message.
- Purpose: Catches predictions where the Director that claimed them is gone. The sweeper marks them as failed so the user gets a response rather than waiting indefinitely. Messages that were never claimed by any Director (sitting unclaimed in the stream) are not visible to the sweeper — those are handled by the autoscaler’s pruner after 24 hours.
The sweeper also runs two deadline-related loops:
- Cordon loop (~1 second): Queries the meta:deadlines sorted set for predictions whose deadline has passed within the last 10 seconds. Deadlines are stored as scores at enqueue time. Matched predictions are moved to a presorted garbage set and terminated with status aborted via TerminatePrediction. This handles predictions still sitting in the queue (or still running) when their time expires.
- GC loop (~30 seconds): Processes the presorted garbage set, deleting the actual stream messages for predictions that were cordoned or otherwise terminated.
These three loops — sweep, cordon, GC — cover different failure modes. The sweep loop handles crashed Directors (idle PEL entries). The cordon loop enforces prediction deadlines regardless of Director state. The GC loop cleans up stream messages after either of the other two has terminated the prediction. The autoscaler’s hourly pruner is a slower safety net for messages that somehow survive all three.
Consumer Pruning (API)
A separate ConsumerPruner
(api/internal/sweeper/consumer_pruner.go)
removes stale consumers from consumer groups. Consumers that
haven’t claimed a message in 24 hours (staleConsumerTimeout) are
pruned. This cleans up after Director pods that were terminated
without gracefully leaving the consumer group.
Workers AI: Bounded In-Memory Queues
ModelRequestQueue
ModelRequestQueue
(queue.rs) is an in-memory queue per model per
constellation-server instance. It’s created when an EndpointGroup
is initialized and holds requests that couldn’t immediately acquire
a permit on any endpoint.
The queue supports two algorithm variants, selectable per model via Config API:
- Fifo (fifo_queue.rs): A simple VecDeque<(account_id, Sender)>. Requests are processed in arrival order. All accounts share a single queue.
- Fair (fair_queue.rs): Per-account VecDeque<Sender> queues stored in a DashMap, with a round-robin VecDeque<account_id> that rotates through accounts. An account is added to the rotation on its first enqueue and removed when its queue empties (sketched below).
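A simplified sketch of the Fair variant's round-robin behavior, using plain standard-library collections instead of DashMap and a String payload instead of the oneshot senders:

```rust
use std::collections::{HashMap, VecDeque};

/// Simplified sketch of the Fair queue: per-account queues plus a
/// round-robin rotation of account IDs.
struct FairQueue {
    queues: HashMap<String, VecDeque<String>>,
    rotation: VecDeque<String>,
}

impl FairQueue {
    fn new() -> Self {
        Self { queues: HashMap::new(), rotation: VecDeque::new() }
    }

    fn enqueue(&mut self, account: &str, item: String) {
        let q = self.queues.entry(account.to_string()).or_default();
        if q.is_empty() && !self.rotation.contains(&account.to_string()) {
            // First pending item for this account: join the rotation.
            self.rotation.push_back(account.to_string());
        }
        q.push_back(item);
    }

    fn dequeue(&mut self) -> Option<String> {
        let account = self.rotation.pop_front()?;
        let item = self.queues.get_mut(&account).and_then(|q| q.pop_front());
        // An account with more pending items goes to the back of the rotation;
        // an account whose queue emptied drops out until its next enqueue.
        if self.queues.get(&account).map_or(false, |q| !q.is_empty()) {
            self.rotation.push_back(account);
        }
        item
    }
}

fn main() {
    let mut q = FairQueue::new();
    q.enqueue("acct-a", "a1".into());
    q.enqueue("acct-a", "a2".into());
    q.enqueue("acct-b", "b1".into());
    // acct-b's single request is not stuck behind acct-a's batch.
    assert_eq!(q.dequeue().as_deref(), Some("a1"));
    assert_eq!(q.dequeue().as_deref(), Some("b1"));
    assert_eq!(q.dequeue().as_deref(), Some("a2"));
}
```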
Both variants retain only live senders — abandoned slots (where the
requester timed out or disconnected) are cleaned up on each
enqueue call via retain(|q| !q.is_closed()).
Capacity
Queue capacity is calculated from model config
(service-discovery/lib.rs:79-94):
capacity = max(queue_config.capacity, colo_queue_size)
× sum(endpoint.max_concurrent_requests)
The result is clamped to [0, 1000]. In practice, only one of queue_config.capacity or colo_queue_size is set, so the max simply picks whichever is non-zero.
For the Fair queue, capacity is enforced per account — each
account can have up to capacity items in its individual queue.
For the Fifo queue, capacity is a global limit on total queue size.
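A minimal sketch of the capacity calculation, with illustrative parameter names:

```rust
/// Illustrative sketch of the queue capacity formula described above.
fn queue_capacity(
    queue_config_capacity: u64,
    colo_queue_size: u64,
    endpoint_max_concurrent: &[u64],
) -> u64 {
    let per_slot = queue_config_capacity.max(colo_queue_size);
    let total_concurrency: u64 = endpoint_max_concurrent.iter().sum();
    (per_slot * total_concurrency).min(1000)
}

fn main() {
    // Three endpoints with concurrency 4 each and a colo queue size of 10:
    // 10 × 12 = 120 queued requests before NoSlotsAvailable.
    assert_eq!(queue_capacity(0, 10, &[4, 4, 4]), 120);
    // Large configs are clamped to 1000.
    assert_eq!(queue_capacity(50, 0, &[64, 64]), 1000);
}
```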
Permit System
The PermitManager
(permits.rs) tracks concurrency per endpoint. Each
endpoint has a concurrency_limit (from
EndpointConfig.max_concurrent_requests) and a map of active
permits keyed by request ID.
try_acquire_permit() is non-blocking — it checks
active_permits() < concurrency_limit and either returns a
Permit or None. Permits have release-on-drop semantics: when a
Permit is dropped, it calls release() which removes the entry
from the map and calls notify.notify_one() to wake the queue
processor (permits.rs:376-383).
There are three permit types:
- Permit: Assigned to a specific request ID. Released on drop (unless detached).
- UnassignedPermit: Holds a concurrency slot without a request ID. Used by the queue processor to reserve capacity before matching to a queued request. Decrements unassigned_count on drop (permits.rs:308-315).
- DetachedPermit: A permit that has been persisted elsewhere (e.g. Consul KV) and should NOT be released on drop. Created via permit.detach().
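A minimal sketch of the release-on-drop idea behind Permit, using a plain counter and mutex instead of the real per-request permit map and tokio Notify:

```rust
use std::sync::{Arc, Mutex};

/// Illustrative sketch of release-on-drop permits; not the real PermitManager.
struct PermitManager {
    active: Arc<Mutex<usize>>,
    concurrency_limit: usize,
}

/// Dropping the permit returns the slot to the manager.
struct Permit {
    active: Arc<Mutex<usize>>,
}

impl Drop for Permit {
    fn drop(&mut self) {
        // In the real code this also notifies the queue processor to wake up.
        *self.active.lock().unwrap() -= 1;
    }
}

impl PermitManager {
    fn new(concurrency_limit: usize) -> Self {
        Self { active: Arc::new(Mutex::new(0)), concurrency_limit }
    }

    /// Non-blocking: returns None when the endpoint is at capacity.
    fn try_acquire_permit(&self) -> Option<Permit> {
        let mut active = self.active.lock().unwrap();
        if *active < self.concurrency_limit {
            *active += 1;
            Some(Permit { active: Arc::clone(&self.active) })
        } else {
            None
        }
    }
}

fn main() {
    let manager = PermitManager::new(1);
    let p1 = manager.try_acquire_permit();
    assert!(p1.is_some());
    assert!(manager.try_acquire_permit().is_none()); // at capacity
    drop(p1);                                        // release-on-drop
    assert!(manager.try_acquire_permit().is_some()); // slot is free again
}
```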
The Three Outcomes
EndpointGroup.find_best_available_endpoint
(endpoint_group.rs:149-288) is the core
dispatch function. It tries endpoints in priority order and
produces one of three outcomes:
- Leased: A permit was acquired immediately on a healthy endpoint. The request proceeds to inference with no queuing.
- Queued: All endpoints are at capacity, but the queue has room. A oneshot::Receiver is returned — the caller awaits it and gets an endpoint+permit when the queue processor fulfills the request.
- NoSlotsAvailable: All endpoints are at capacity AND the queue is full. The request is rejected with an out-of-capacity error.
Background Processor
Each ModelRequestQueue spawns a long-running tokio task
(queue.rs:252-264) that:
- Waits on a Notify signal (fired when a permit is released or config changes).
- Calls fulfill_next_queued() in a loop until it returns false.
- fulfill_next_queued() gets the next queued request (respecting the Fair/Fifo algorithm), tries to acquire an unassigned permit on a healthy endpoint, and if successful, sends the (Endpoint, UnassignedPermit) pair through the oneshot channel to the waiting request.
If no healthy endpoints exist when the processor runs, the entire
queue is cleared — all waiting requests receive a
NoLocalResourcesAvailable error
(queue.rs:203-209).
Hot-Swappable Algorithm
When the queue algorithm changes (Fair↔Fifo) via config update,
set_config (queue.rs:88-158):
- Creates a new queue implementation.
- Drains all pending requests from the old queue.
- Re-enqueues them into the new queue (skipping closed senders).
- Spawns a new processor task and aborts the old one.
Requests that can’t be re-enqueued (e.g. new queue is smaller)
receive a NoLocalResourcesAvailable error.
Key Differences
| Aspect | Replicate | Workers AI |
|---|---|---|
| Queue backing | Redis Streams (durable, distributed) | In-memory VecDeque (per-process, volatile) |
| Scope | One stream per deployable, shared across all Director pods | One queue per model per constellation-server instance |
| Durability | Messages survive process restarts and Redis failovers | Messages lost on process restart |
| Consumer model | Consumer groups — any Director can claim any message | Direct dispatch — queue processor matches to local endpoints only |
| Fairness | None — all requests to a deployable are equal | Per-account fair queuing (round-robin) or FIFO, configurable per model |
| Capacity control | Unbounded stream (pruned after 24h) | Bounded: colo_queue_size × total_concurrency, clamped to 1000 |
| Backpressure | None at queue level — requests accumulate until pruned | Immediate rejection (NoSlotsAvailable) when queue is full |
| Stuck message handling | Autoscaler pruner (hourly, 24h threshold) + API sweeper (30s, claims and fails) | Not needed — in-memory queue with sender liveness checks |
| Concurrency control | Director processes one prediction at a time per worker goroutine | Permit-based: PermitManager per endpoint with configurable concurrency_limit |
| Blue/green migration | MultiMessageQueue rotates across Redis backends | Not applicable — no external queue to migrate |
The most significant architectural difference is durability vs bounded backpressure. Replicate’s Redis Streams are durable — a prediction enqueued during a Director outage will be picked up when a new pod starts. But this durability comes with unbounded queue growth and requires two separate cleanup mechanisms (pruner + sweeper) to handle stuck messages. Workers AI’s in-memory queues are volatile but provide immediate backpressure — when capacity is exhausted, the caller gets an error right away rather than waiting for a timeout or pruner cycle.
Model Storage & Images
Model Storage & Images
Overview
Both platforms need to get model weights and inference code onto GPU nodes before serving
requests. The approaches diverge sharply. Replicate builds Docker images via cog, then
optionally layers on a multi-tier caching system (pget → Hermes → object store) to get
weights close to GPU nodes. Workers AI stores model weights in R2 and fetches them at
container startup via a dedicated model-greeter init container, with a disk-reaper
sidecar managing local cache on each GPU node.
Replicate: Images, Weights, and Caches
Container Image Resolution
Every deployable has a docker_image URI set during cog push (e.g.
r8.im/user/model@sha256:...). At pod creation time, the cluster autoscaler resolves this
to an internal registry URI (deployable.go:1999-2040):
The r8.im/ prefix is stripped and replaced with the internal
MODEL_ARTIFACT_REGISTRY_BASE (an unauthenticated Artifact Registry mirror). Non-r8.im
URIs are passed through unmodified.
Two-Container Pods
Every model pod has two containers
(deployable.go:460-476):
- director — orchestrates the model lifecycle (queue consumption, health polling, state reporting, cache restore/persist). Image comes from the services registry.
- model — runs the cog server. Image is the resolved model image.
They interact primarily over HTTP, but also share a supervisor-ipc volume (a 50 MB tmpfs) for certain cases and a run-cache volume for runtime state.
Standard Startup Path
The model container’s entrypoint (ModelEntrypointScript.sh) runs:
- Updates the pget binary from PGET_DOWNLOAD_URL (or uses the monobase-bundled version if available).
- Optionally upgrades cog (via ENTRYPOINT_COG_OVERRIDE_PACKAGE) or installs hf_transfer.
- Starts the cog HTTP server (python -m cog.server.http).
The model’s setup() method runs inside the cog server and is often the step during which
weights are downloaded via the pget tool proxied through Hermes as a pull-through cache.
FUSE (deprecated)
Replicate also built a FUSE-based path (fuse/fuse.go) that separated
weights from code entirely: a host-level FUSE daemon served weights on demand, and the
model container ran a lightweight monobase image instead of a Docker image with weights
baked in. Director acted as a gRPC client to the FUSE mounter, managing mount lifecycle
via Start/Heartbeat/Stop RPCs. This eliminated the weight download step from cold
starts — weights were read lazily from the FUSE mount as the model accessed them.
The approach is being wound down and remaining FUSE-enabled models are slated for
migration to the standard path. The code still exists in getImageURI (monobase fallback)
and the FUSE entrypoint script, but no new models use it.
pget: Parallel Chunk Downloader
pget (replicate/pget) downloads model weights in parallel chunks. Key
configuration (download/options.go:24-43):
- CacheableURIPrefixes: Allowlist of domains+path-prefixes eligible for pull-through caching.
- CacheHosts: Ordered list of cache hostnames used with consistent hashing — the same URL always routes to the same cache host.
- ForceCachePrefixRewrite: When enabled, rewrites all requests to the first cache host (used for Hermes routing).
- CacheUsePathProxy: Prepends the original host to the cache request path instead of using host-based routing.
Config is injected via K8s ConfigMaps: pget-config (standard) or pget-hermes-config
(Hermes-enabled regions).
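One way to get the "same URL always routes to the same cache host" property is rendezvous hashing; the sketch below is illustrative and not necessarily pget's actual scheme:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative URL→cache-host routing via rendezvous (highest-random-weight)
/// hashing: the same URL always maps to the same host for a given host list.
fn pick_cache_host<'a>(url: &str, cache_hosts: &'a [&'a str]) -> Option<&'a str> {
    cache_hosts
        .iter()
        .copied()
        .max_by_key(|host| {
            let mut h = DefaultHasher::new();
            (url, *host).hash(&mut h);
            h.finish()
        })
}

fn main() {
    let hosts = ["cache-0.internal", "cache-1.internal", "cache-2.internal"];
    let url = "https://weights.replicate.delivery/some/model.safetensors";
    // Repeated lookups for the same URL land on the same cache host.
    assert_eq!(pick_cache_host(url, &hosts), pick_cache_host(url, &hosts));
}
```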
Hermes: Regional Edge Cache
Hermes (replicate/hermes) is an HTTP read-aside cache deployed in GPU
serving regions (CKS, Nebius). It caches model weights in region-local S3-compatible
object storage (CoreWeave CAIOS) to avoid repeated cross-region downloads.
Three components:
- Cache server (server/cacherouter.go): Receives requests from pget. If the file is cached, returns a 307 redirect to a presigned S3 URL. If not cached, redirects to the origin and enqueues a background cache job.
- Pruner: TTL-based cleanup of cached objects.
HuggingFace traffic is routed through Hermes by setting
HF_ENDPOINT=http://hermes.../huggingface.co on all containers in CKS/Nebius regions. The
weights.replicate.delivery domain is also rewritten through Hermes when
ForceCachePrefixRewrite is enabled.
Torch and CUDA Caches
The director container manages two S3-backed caches (cache/cache.go):
- Torch Compile Cache: Backs TORCHINDUCTOR_CACHE_DIR. Size range 10 MB–10 GB, 7-day TTL, refreshed daily.
- CUDA Checkpoint Cache: Backs DIRECTOR_CUDA_CHECKPOINT_DIR.
On startup, Director restores these caches from S3 (after the model reports
StatusReady). On shutdown, it persists any changes back. Cache files are stored as
.tar.zst archives with timestamp-based naming for ordering.
Workers AI: R2, model-greeter, and disk-reaper
SoftwareConfig and Image Resolution
The ai-scheduler determines what container image and configuration to use for each model. It reads from two sources:
- Config API: Model properties including software_name, gpu_memory, and optionally cog_image (config_api/lib.rs:248-249).
- R2 model-catalog bucket: SoftwareConfig YAML files at {version}/{software_name}.yaml (catalog/r2.rs).
SoftwareConfig (config/software.rs) defines:
- image — container image URI
- api — inference backend type (e.g. pipe-http, tgi)
- ports — network port configuration
- mounts — volume mounts (cache dir, disk-reaper socket, etc.)
- network — firewall allow rules (slirpnetstack)
- entrypoint — optional override
For Cog models, SoftwareConfig::cog(image) constructs a config with the cog_image from
Config API and hardcoded network allow rules for HuggingFace, R2, PyPI, and Replicate
domains (software.rs:156-240).
Container Image Protocol
Container images use a cf:// protocol prefix that resolves to
registry.cloudchamber.cfdata.org (image_registry_protocol.rs).
This is a Cloudchamber concept — the protocol maps to one or more registry domains,
providing fallback if one registry is unavailable. In practice, cf://image:tag resolves
to registry.cloudchamber.cfdata.org/image:tag.
model-greeter: Weight Downloader
model-greeter downloads model files from R2 at container startup. On IKE (external) clusters it runs as a K8s init container; on Cloudchamber (internal) it runs as a sidecar.
The IKE operator configures model-greeter as an init container named
install-model-greeter (deployment.rs:615-623) that copies its
binary into a shared volume. The main container then uses that binary to download weights.
Environment variables injected by the scheduler
(cloudchamber/lib.rs:88-192):
- R2_ENDPOINT — R2 API endpoint
- MODEL_CATALOG_R2_BUCKET — bucket name
- MODEL_CATALOG_VERSION — current catalog version (from Release Manager)
- SOFTWARE_TO_LOAD — software config name
- MODEL_TO_LOAD — model identifier
Weights are downloaded to /cache (mounted from the disk-reaper managed volume).
disk-reaper: Local Cache Management
disk-reaper manages the local model cache on GPU nodes. Three modes
(model-crd/crd.rs:80-86):
- Shared (default): Cache backed by a PersistentVolumeClaim shared across pods on the same node. disk-reaper runs as a separate DaemonSet, communicating via Unix socket. Multiple models share the same cache volume.
- Ephemeral: Cache backed by an emptyDir volume. disk-reaper runs as a sidecar container inside the model pod (deployment.rs:583-585). Cache is lost when the pod terminates.
- Disabled: No cache management. The cache volume is still an emptyDir, but no reaper process runs.
The IKE operator selects the mode based on operator config and the Model CRD’s
disk_reaper.mode field. When the operator is configured for ephemeral-only mode, any
model requesting Shared is downgraded to Ephemeral
(deployment.rs:226-239).
Inference Backends
Workers AI supports multiple inference server backends, determined by the ai-software=
tag on the container and the api field in SoftwareConfig:
| Backend | SoftwareConfig api | Use Case |
|---|---|---|
| Triton | triton | NVIDIA Triton Inference Server |
| TGI | tgi | HuggingFace Text Generation Inference |
| TEI | tei | HuggingFace Text Embeddings Inference |
| PipeHttp | pipe-http | vLLM, Cog models |
| PipeHttpLlm | pipe-http-llm | LLM-specific pipe variant |
| PartnerPipeHttp | partner-pipe-http | Partner-hosted models |
This is a significant difference from Replicate, where Cog is the only inference server.
Key Differences
| Aspect | Replicate | Workers AI |
|---|---|---|
| Weight sources | Mixed — some baked into Docker layers, many downloaded during setup() from HuggingFace, weights.replicate.delivery, and other origins | Single source — R2 model-catalog bucket |
| Weight download | Model’s setup() via pget (parallel chunks, consistent-hash caching) | model-greeter init container/sidecar downloads from R2 |
| Regional caching | Hermes (HTTP read-aside cache, 307 redirects to regional S3) | R2 serves from nearest edge node (region-less by design) |
| Local cache | No persistent local cache for weights | disk-reaper manages shared PVC or ephemeral emptyDir |
| Compilation caches | S3-backed torch compile + CUDA checkpoint caches (7-day TTL) | None |
| Container layout | Two containers: director sidecar + model | One main container + model-greeter init + optional disk-reaper sidecar |
| Inference servers | Cog only | Triton, TGI, TEI, PipeHttp, PipeHttpLlm, PartnerPipeHttp |
| Image protocol | r8.im/ rewritten to internal Artifact Registry | cf:// protocol with multi-registry fallback |
Replicate’s weight delivery is messy by nature. Model authors control what happens in
setup() — some models have weights baked into Docker layers, others download from
HuggingFace, others pull from weights.replicate.delivery, and some combine approaches.
pget and Hermes are optimizations layered on top: pget parallelizes downloads from
whatever origin the model uses, and Hermes caches results in regional S3 so subsequent
cold starts in the same region avoid cross-region transfers. Neither is strictly
required — models work without them, but cold starts are slower.
Workers AI’s approach is more uniform: weights always come from R2 via model-greeter, giving the platform full control over the download path. R2 operates region-lessly — it serves from the nearest edge node that has the data, so there’s no need for explicit regional copies the way Hermes populates region-local S3. The disk-reaper shared PVC adds another layer by keeping weights on-node across pod restarts.
The inference server flexibility is a notable Workers AI advantage. Supporting Triton, TGI, TEI, and vLLM alongside Cog means Workers AI can use purpose-built servers optimized for specific workload types, while Replicate routes everything through Cog.
State Reporting
State Reporting
Overview
Both platforms need to know what their model instances are doing — whether they’re booting, idle, processing requests, or failing.
Replicate’s Director reports instance state to the API every 15 seconds and sends a report when model setup finishes. Individual predictions are tracked via OTel spans. Workers AI logs a structured event for every inference request into Cloudflare’s internal analytics pipeline, and exposes Prometheus metrics for aggregate capacity signals.
Replicate: HTTP Reporting and OTel Spans
Instance State Monitor
Director runs an instance state monitor (instancestate/monitor.go) that
tracks what the model instance is doing at any given moment. Three activity states
(instancestate/types.go:10-14):
| State | Meaning |
|---|---|
Boot | Instance is starting up (setup running) |
Idle | Ready but no active predictions |
Active | Processing ≥1 prediction |
The monitor maintains a concurrency counter (activityLevel). When a prediction starts,
the level increments; when it finishes, it decrements. Any level > 0 means Active
(types.go:18-23).
Every 15 seconds (DefaultUtilizationInterval), the monitor computes a Metrics
payload (types.go:39-55) and POSTs it as JSON to the cluster-local
API instance:
| Field | Description |
|---|---|
active_time, boot_time, idle_time | Seconds in each state this interval |
metric_duration | Total interval length |
utilization | active_time / metric_duration |
mean_concurrency | Weighted average concurrency level |
deployment_key, docker_image_uri | Deployable identity |
hardware, compute_unit, compute_unit_count | Resource info |
instance_id, instance_metadata, version | Pod-level metadata |
Utilization is active_time / metric_duration — a number between 0 and 1 representing
what fraction of the interval the instance was processing at least one prediction. Mean
concurrency captures how many predictions were running simultaneously on average
(monitor.go:80-91).
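A small sketch of that math, assuming the interval is summarized as (seconds, concurrency level) samples; the real monitor tracks state transitions incrementally and may weight mean concurrency differently:

```rust
/// Illustrative sketch of per-interval utilization and mean concurrency.
fn interval_metrics(samples: &[(f64, u32)], metric_duration: f64) -> (f64, f64) {
    // Time spent with at least one prediction running.
    let active_time: f64 = samples
        .iter()
        .filter(|(_, level)| *level > 0)
        .map(|(secs, _)| secs)
        .sum();
    // Concurrency-seconds: time spent at each level, weighted by the level.
    let concurrency_seconds: f64 = samples
        .iter()
        .map(|(secs, level)| secs * f64::from(*level))
        .sum();
    let utilization = active_time / metric_duration;
    let mean_concurrency = concurrency_seconds / metric_duration;
    (utilization, mean_concurrency)
}

fn main() {
    // A 15-second interval: 5s idle, 6s at concurrency 1, 4s at concurrency 3.
    let (util, mean) = interval_metrics(&[(5.0, 0), (6.0, 1), (4.0, 3)], 15.0);
    println!("utilization = {util:.2}, mean concurrency = {mean:.2}");
    // utilization ≈ 0.67, mean concurrency = 1.20
}
```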
Transport: HTTP POST to DIRECTOR_REPORT_INSTANCE_STATE_URL. Auth via
WEBHOOK_AUTH_TOKEN Bearer header. Retries via httpclient.ApplyRetryPolicy
(reporter.go:102-140).
Setup Run Reporting
Once when model setup completes (or times out), Director sends an HTTP POST to
DIRECTOR_REPORT_SETUP_RUN_URL (reporter.go:142-212). The payload
includes:
- status — terminal state (“ready”, “failed”, etc.)
- started_at, completed_at — RFC3339 timestamps
- logs — setup output (truncated to last 20 KiB for OTel spans)
- instance_metadata — pod name, namespace, etc.
- runtime_metadata — cog version, cog version override, pod name
Director also records OTel span attributes for setup timing:
setuprun.scheduled_to_setup_started, setuprun.scheduled_to_ready,
setuprun.duration_seconds, and setuprun.status.
Scale State Snapshots
The cluster autoscaler captures periodic snapshots of each deployable’s queue depth and
replica count, stored in Redis (scalestate/scalestate.go):
| Field | Description |
|---|---|
queue_length | Total across blue+green Redis |
queue_length_blue, queue_length_green | Per-Redis-instance queue depth |
replica_count | Ready replicas from K8s |
time | Unix timestamp |
Up to 120 snapshots are retained per deployable (MaxSnapshots). The autoscaler uses
this history when making scaling decisions — the snapshot window provides context about
recent queue pressure and capacity.
Two Prometheus gauges are emitted per snapshot (scalestate.go:69-76):
- autoscaler_snapshot_queue_length — queue depth, labeled by deployable attributes
- autoscaler_snapshot_replica_count — ready replicas
Scaling decisions are recorded as OTel spans (autoscaler.scaleVersion) with the full
snapshot history JSON, old/new replica counts, and all deployable attributes.
Prediction Tracker
Each prediction’s lifecycle is managed by a Tracker
(director/tracker.go) that wraps an OTel span and notifies subscribers on
state changes. The tracker is not thread-safe — it’s owned by a single worker goroutine.
Prediction states: Processing, Succeeded, Failed, Canceled, Aborted. The tracker
records span attributes for cancel reasons, error codes (e.g. E1234), compliance check
results, moderation outcomes, and billing metrics (token counts, image counts).
Subscribers receive PredictionUpdate{Prediction, Events} notifications. Events are
webhook event types (WebhookEventCompleted, WebhookEventOutput). Terminal updates are
held until Stop() to avoid sending duplicate completion events.
pget Download Metrics
Director exposes a POST /metrics endpoint (server/metrics.go) that
pget uses to report download metrics — URL, file size, throughput, and errors
(pget/pkg/pget.go:148-180). Each metric is converted to an OTel span
(director.metric) with attributes from the payload. Rate-limited to 100 req/s.
HuggingFace and S3 presigned URLs are normalized to strip query parameters, avoiding
high-cardinality span attributes.
Workers AI: Inference Events and Prometheus Metrics
Ready Analytics (Inference Events)
Every inference request produces an InferenceEvent that is serialized as a Cap’n Proto
message and sent to logfwdr (Cloudflare’s internal log forwarding daemon) via a Unix
domain socket (ready-analytics/src/inference_event.rs,
ready-analytics/src/logfwdr.rs). This feeds into Cloudflare’s Ready
Analytics pipeline.
Key fields per event (inference_event.rs:42-77):
| Field | Description |
|---|---|
account_id | Cloudflare account ID |
model_id | Model identifier (e.g. @cf/meta/llama-2-7b) |
request_time_ms | Total request time |
inference_time_ms | Actual inference duration |
queuing_time_ms | Time spent in local queue |
time_to_first_token | TTFT for streaming responses |
neurons | Neuron cost metric (billing) |
error_code | Error code if failed |
colo_id / source_colo_id | Colo identifiers |
local_queue_size | Queue depth at request time |
model_container_id | Container identifier |
Bitfield flags track request properties: streamed, beta model, external endpoint, queued
in colo, Unix socket connection (inference_event.rs:13-19).
RequestSource distinguishes Worker bindings, Pages bindings, and REST API requests
(inference_event.rs:22-28).
Transport: LogfwdrClient connects to a Unix socket with a 64-slot mpsc channel
buffer. 10-second watchdog reconnect on failure. Metrics: connect_err, write_err, ok
(logfwdr.rs).
Endpoint Analytics (Lifecycle Events)
The AvailableEndpointTracker
(endpoint-analytics/src/lib.rs) tracks endpoint membership
changes — when endpoints appear or disappear from the world map. It generates lifecycle
events sent to a Workers pipeline via LifecycleEventSender:
| Field | Description |
|---|---|
begin_ts | When endpoint appeared |
end_ts | When endpoint disappeared (0 if still present) |
event_type | Endpoint |
model_id | Model identifier |
hostname | Tracker’s hostname |
Updates are tied to the world map refresh interval.
Prometheus Metrics
constellation-server exposes Prometheus metrics
(metrics/src/lib.rs). Key gauges and counters:
Request-level:
- requests — total requests by HTTP status
- errors — errors by internal + HTTP code
- infer_per_backend — inferences per backend type (Triton, TGI, etc.)
- infer_per_health_state — inferences by endpoint health state
Queue and capacity:
- turnstile_outcomes — permit acquisition outcomes (leased/queued/rejected)
- full_fraction — fraction of time a model is fully saturated
- total_permits / mean_used_permits — permit availability and usage per model
- mean_queue_size — average queue depth per model
These queue/capacity metrics are computed from a ModelPermitBuffer — a rolling window of
600 samples at 500 ms intervals (5-minute window) tracking (total_permits, available_permits, queue_size) per model (metrics/src/lib.rs).
Infrastructure:
- constellation_server_forward_failures — cross-colo forward failures
- constellation_server_endpoints_removed — endpoints in backoff state
- constellation_server_external_endpoints — external endpoint health by host/port
ai-scheduler Metrics
The scheduler emits its own Prometheus metrics (prefixed ai_scheduler_) every 60
seconds by polling the Cloudchamber API (crates/metrics/src/lib.rs):
- model_deployments — deployment count by model, tier, region, status, GPU model
- requested_instances / min_requested_instances / max_requested_instances — scaling bounds
- autoscaler_measured_utilization / estimated_utilization / forecast_utilization — utilization signals
- autoscaler_required_instances — minimum instances to meet target utilization
- external_instances — external instance count by provider, region, model, phase
- external_nodes — external GPU count by provider, region, state
Key Differences
| Aspect | Replicate | Workers AI |
|---|---|---|
| Instance state | 3 states (Boot/Idle/Active) with concurrency level, reported every 15s | Per-request inference events; permit/queue metrics from 5-min rolling window |
| Transport | HTTP POST (JSON) to Replicate API | Unix socket (Cap’n Proto) to logfwdr pipeline |
| Granularity | Periodic snapshots (15s utilization, setup completion) | Per-request events + rolling metric windows |
| Setup reporting | Report sent once when setup completes, with logs, timing, metadata | No direct equivalent; endpoint lifecycle events track appearance/disappearance |
| Scaling telemetry | Redis-cached snapshot history (120 snapshots), OTel spans per decision | Prometheus gauges polled every 60s from Cloudchamber API |
| Prediction tracking | Per-prediction OTel span with subscriber notifications | Per-request inference event to analytics pipeline |
| Metrics system | OTel spans → tracing backend | Prometheus (scraped) + Ready Analytics (logfwdr → pipeline) |
Both platforms have multiple reporting streams at different granularities. Replicate has
per-prediction OTel spans (Tracker), pget download metrics (POST /metrics), 15-second
instance utilization snapshots, and a setup report sent once when setup completes. Workers
AI has per-request inference events (Ready Analytics), endpoint lifecycle events,
Prometheus metrics (backed by internally-sampled 5-minute rolling windows), and 60-second
scheduler polls.
The main structural difference is where aggregation happens. Director’s instance state
monitor pre-aggregates utilization into 15-second windows before sending to the API, which
forwards to Web for customer-facing reporting. Workers AI’s inference events are sent raw
and aggregated by downstream consumers in the analytics pipeline. The Prometheus metrics
layer provides pre-computed summaries (like full_fraction and mean_queue_size) for
operational use.
Billing Metrics
Billing Metrics
Overview
Both platforms need to measure what happened during inference so they can bill for it. The billing models are different: Replicate bills on predict time plus per-unit metrics (tokens, images, etc.), while Workers AI bills on “neurons” — an abstract unit representing GPU compute.
How those metrics are collected also differs. Director estimates billing metrics for most models. Only explicitly opted-in trusted models may report their own billing metrics. Workers AI has multiple paths depending on the model server software: Triton-backed models report cost metrics as output tensors that constellation-server converts to neurons, while non-Triton models (omni, infire, partner) calculate neurons themselves and report them via HTTP response headers. Both paths converge at constellation-entry.
Replicate: Model-Reported and Director-Estimated Metrics
Three Flags
Director has three flags that control billing metric behavior
(director/config.go:46-48):
| Flag | Purpose |
|---|---|
DIRECTOR_TRUST_BILLING_METRICS | When true, pass through the model’s billing metrics to downstream systems |
DIRECTOR_CALCULATE_TOKEN_METRICS | When true, Director independently estimates token counts |
DIRECTOR_CALCULATE_IMAGE_METRICS | When true, Director independently estimates image counts |
The calculate flags are the default path for untrusted models — Director estimates billing metrics from the prediction input and output rather than trusting the model to report them.
Both calculate flags can be enabled alongside trustBillingMetrics. When they are,
Director compares its own estimates against the model’s reported values and records
match/mismatch as OTel span attributes (token_input_metrics.match,
image_count_metrics.match, etc.). This is useful for validating that trusted models
report accurately.
Model-Reported Metrics (Trusted Path)
When trustBillingMetrics is true, Director passes through three sets of metrics from
the model’s prediction response (tracker.go:524-585):
PublicMetrics — user-visible metrics stored on the prediction. Covers image count, batch size, input/output token counts, and predict time share.
BillingMetrics — internal metrics stored in prediction.InternalMetadata["billing_metrics"].
35+ fields covering (cog/types.go:68-125):
- Audio (input/output count, duration)
- Characters (input/output count)
- Images (input/output count, megapixels, pixel dimensions, step count)
- Tokens (input/output count)
- Video (input/output count, duration, frame counts, megapixel-seconds)
- Documents (page input count)
- Training (step count)
BillingCriteria — model variant and configuration that affects pricing, stored in
prediction.InternalMetadata["billing_criteria"]
(cog/types.go:58-66). Covers model variant, resolution
target, motion mode, source/target FPS, and audio flag.
When trustBillingMetrics is false, all three are silently dropped.
Director-Estimated Token Counts
When calculateTokenMetrics is true, Director estimates token counts from the prediction
input and output (tracker.go:656-704):
Input tokens: Scans prediction input keys for any containing the substring "prompt"
(matches prompt, system_prompt, prompt_template, etc.). Each matching string value
is tokenized and the results are summed.
The tokenizer (tracker.go:714-717) is a rough heuristic: split on whitespace, count words, multiply by 4/3.
Output tokens: If the output is a string, run countTokens on it. If it’s an array,
count the array length (for streaming models, each element is typically one token).
Timing: When output tokens exist, Director also calculates TimeToFirstToken and
TokensPerSecond.
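A sketch of the heuristic with the input simplified to a flat string map (the real input is arbitrary prediction JSON, and the 4/3 scaling here uses integer math):

```rust
/// Illustrative sketch of the token-estimation heuristic described above.
fn count_tokens(text: &str) -> u64 {
    let words = text.split_whitespace().count() as u64;
    // Rough words-to-tokens ratio used by the heuristic.
    words * 4 / 3
}

fn estimate_input_tokens(input: &[(&str, &str)]) -> u64 {
    input
        .iter()
        .filter(|(key, _)| key.contains("prompt"))
        .map(|(_, value)| count_tokens(value))
        .sum()
}

fn main() {
    let input = [
        ("prompt", "Write a haiku about autoscalers"),
        ("system_prompt", "You are a helpful assistant"),
        ("temperature", "0.7"), // ignored: key does not contain "prompt"
    ];
    println!("{}", estimate_input_tokens(&input)); // 5 words + 5 words → 6 + 6 = 12
}
```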
Director-Estimated Image Counts
When calculateImageMetrics is true, Director estimates image counts from the prediction
output (tracker.go:629-653):
- Array output → count = array length
- Non-empty string output → count = 1
- Empty or nil output → count = 0
There’s no content inspection — a TODO comment (“Check we're counting images and not something else”) acknowledges this. If the model already reported via BillingMetrics, Director uses that value instead.
Metric Flow
- Model reports PublicMetrics, BillingMetrics, and BillingCriteria in its prediction response.
- Director processes them based on the three flags — passing through trusted metrics, estimating where needed, comparing when both paths are active.
- Director sends the prediction to the API via internal webhook. BillingMetrics and BillingCriteria travel in InternalMetadata (opaque to users); PublicMetrics are on the prediction itself (visible to users).
- The API forwards to Web for billing aggregation.
Workers AI: Neurons and Multiple Reporting Paths
Neurons
Workers AI bills in neurons — an abstract unit representing GPU compute.
Internally, 1 neuron ≈ 0.1 L4 GPU-seconds (baselined at Sept 2023 efficiency).
Each model has per-metric neuron coefficients that convert raw usage into a
neuron total. The formula
(neuron/src/lib.rs:7-36):
total_neurons = cost_per_infer + Σ(neuron_cost × metric_value)
cost_per_infer is an optional flat cost per request. The per-metric
multipliers (e.g., input_tokens, output_tokens, image_steps) are
configured per model via Consul service config or Deus env vars like
INPUT_TOKEN_NEURONS and OUTPUT_TOKEN_NEURONS.
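A worked Python sketch of the formula; the coefficient values below are invented for illustration, and the real calculate_neurons lives in Rust with a different signature:

```python
def calculate_neurons(cost_per_infer: float, coefficients: dict, metrics: dict) -> float:
    # total_neurons = cost_per_infer + sum(neuron_cost * metric_value)
    return cost_per_infer + sum(
        coefficients.get(name, 0.0) * value for name, value in metrics.items()
    )

# Hypothetical LLM request: 120 input tokens, 512 output tokens.
coeffs = {"input_tokens": 0.0002, "output_tokens": 0.0006}   # invented coefficients
print(calculate_neurons(0.0, coeffs, {"input_tokens": 120, "output_tokens": 512}))
# -> about 0.3312 neurons
```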
Triton Path: Cost Metric Tensors
Triton-backed models — including triton-vllm, which covers the majority of
LLMs — report billing metrics as output tensors prefixed with COST_METRIC_.
constellation-server extracts these in the Triton result handler
(inference/triton/result.rs:196-249):
- Iterate over result tensors.
- For each tensor named COST_METRIC_*, strip the prefix to get the metric name (e.g., input_tokens, output_tokens).
- Sum the tensor values (handles scalar, 1D array, and [N×1] shapes).
- Call calculate_neurons(neuron_config, &cost_metric_values) to compute total neurons.
- Strip COST_METRIC_* tensors from the response before sending to the client.
- Put the neuron total and up to two named cost metrics in the response cf-ai-cserver-meta JSON header.
The client never sees the raw cost metric tensors.
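A simplified Python sketch of these steps (the real handler is Rust in inference/triton/result.rs; the helper names and dict-shaped inputs are assumptions):

```python
def split_cost_metrics(result_tensors: dict, neuron_config: dict):
    def tensor_sum(values):
        # Scalar, flat list, or [N x 1] nested list all reduce to one number.
        if isinstance(values, (int, float)):
            return values
        return sum(v[0] if isinstance(v, list) else v for v in values)

    cost_metrics, client_tensors = {}, {}
    for name, values in result_tensors.items():
        if name.startswith("COST_METRIC_"):
            # e.g. COST_METRIC_output_tokens -> output_tokens
            cost_metrics[name[len("COST_METRIC_"):]] = tensor_sum(values)
        else:
            client_tensors[name] = values   # only these reach the client
    neurons = neuron_config.get("cost_per_infer", 0.0) + sum(
        neuron_config.get("coefficients", {}).get(k, 0.0) * v
        for k, v in cost_metrics.items()
    )
    # The neuron total (plus up to two named metrics) travels in the
    # cf-ai-cserver-meta JSON response header, not in the response body.
    meta = {"neurons": neurons, **dict(list(cost_metrics.items())[:2])}
    return client_tensors, meta
```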
Non-Triton Path: Model-Reported Headers
Non-Triton backends report neurons via HTTP response headers that constellation-server passes through transparently:
Omni models (PipeHttp software type) — the omni framework
(cloudflare/ai/omni) provides a Python API where model
code calls context.cost.set_neurons(value) and
context.cost.set_usage_metric(name, value). Omni converts these to
cf-ai-neurons and cf-ai-cost-metric-{name,value}-N response headers
(omni/shared/src/cost.rs:82-96). Models read their neuron
coefficients from the cf-ai-model-config request header that
constellation-server forwards from the model’s Consul config.
Infire models (PipeHttpLlm software type) — the newer vLLM deployment
path. Neuron coefficients are defined in workers-ai.yaml (e.g.,
input_token: 0.02561, output_token: 0.07515). The model server
calculates and reports neurons via the same cf-ai-neurons header
convention.
Partner models (PartnerPipeHttp) — partner-bouncer calculates neurons
and emits cf-ai-neurons headers
(partner-bouncer/src/server/model.rs:178-211).
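The header convention these non-Triton paths share can be sketched as follows. The cf-ai-neurons and cf-ai-cost-metric-{name,value}-N header names come from the descriptions above; the function itself is illustrative, not the omni API:

```python
def build_cost_headers(neurons: float, usage_metrics: dict) -> dict:
    # The model server (omni, infire, or partner-bouncer) computes neurons
    # itself and reports them as response headers.
    headers = {"cf-ai-neurons": f"{neurons:.6f}"}
    for i, (name, value) in enumerate(usage_metrics.items(), start=1):
        headers[f"cf-ai-cost-metric-name-{i}"] = name
        headers[f"cf-ai-cost-metric-value-{i}"] = str(value)
    return headers

# e.g. build_cost_headers(0.3312, {"input_tokens": 120, "output_tokens": 512})
```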
Convergence at constellation-entry
All paths converge at constellation-entry, the edge Worker that assembles
the billing event
(sdk/.../src/lib/headers.ts:20-82):
- Parse cf-ai-neurons from response headers (non-Triton path).
- Parse cf-ai-cserver-meta JSON and merge its fields — including neurons — into the metrics context (Triton path).
- Both sources write to the same raMetrics.neurons field.
- Send a Ready Analytics event with the neuron total to the SDK RA table.
Billing reads from the SDK RA table
(aiinference_sdk_production_by_namespace_account_sampled), summing
neurons × _sample_interval per account.
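A Python sketch of that convergence (illustrative only; the real logic is TypeScript in constellation-entry’s headers.ts, and how the two sources are merged is simplified here):

```python
import json

def fill_neurons(response_headers: dict, ra_metrics: dict) -> None:
    # Non-Triton path: the model server reported neurons directly in a header.
    if "cf-ai-neurons" in response_headers:
        ra_metrics["neurons"] = float(response_headers["cf-ai-neurons"])
    # Triton path: constellation-server put neurons in the cserver meta JSON.
    meta = json.loads(response_headers.get("cf-ai-cserver-meta", "{}"))
    if "neurons" in meta:
        ra_metrics["neurons"] = float(meta["neurons"])
```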
Streaming
For streaming responses, cost metrics are accumulated across chunks. In the
Triton path, constellation-server’s InferenceEventBuilder adds metric
values to running totals
(inference_event.rs:153-156). In the non-Triton path,
constellation-entry accumulates neurons and cost_metric_value_2 from
each chunk’s meta field
(sdk/.../src/lib/tools.ts:948-959).
No Server-Side Token Counting
No component in the Workers AI stack counts tokens independently. Token
counts come from the model — either as COST_METRIC_input_tokens /
COST_METRIC_output_tokens tensors (Triton) or as usage metrics reported
via context.cost.set_usage_metric() (omni). There is no tokenizer in
constellation-server, constellation-entry, or ai-scheduler.
Key Differences
| Aspect | Replicate | Workers AI |
|---|---|---|
| Billing unit | Predict time + per-unit metrics (tokens, images, video, etc.) | Neurons (abstract GPU compute unit, ≈ 0.1 L4-seconds) |
| Who counts | Director estimates (untrusted) or model reports (trusted) | Model always reports — via tensors (Triton) or headers (omni/infire/partner) |
| Token counting | Director: word count × 4/3 heuristic (untrusted). Model: actual counts (trusted). | Model-reported only. No server-side tokenizer. |
| Cost formula | Raw metrics passed to billing system, pricing applied downstream | neurons = cost_per_infer + Σ(neuron_cost × metric_value), computed at inference time or by model code |
| Metric types | 35+ fields across audio, image, video, tokens, training, documents | Arbitrary named metrics (typically 1-3 per model) |
| Trust model | Explicit trust_billing_metrics flag gates model-reported metrics | Implicit — all models report their own metrics, no server-side estimation |
| Reporting paths | Single path through Director | Multiple: Triton tensors, omni SDK, infire headers, partner-bouncer — all converge at constellation-entry |
| Billing table | Web aggregates from prediction metadata | SDK RA table (written by constellation-entry) queried via BigQuery |
The two trust models are essentially opposites. Replicate runs untrusted third-party models and must estimate billing metrics for them — only explicitly opted-in trusted models may report their own. Workers AI controls all model deployments, so every model reports its own metrics through one of several backend-specific mechanisms.
Replicate’s BillingMetrics struct with 35+ fields reflects the diversity
of model types it supports (image generators, video models, audio models,
LLMs, training jobs). Workers AI’s neuron abstraction collapses all of this
into a single number — per-model pricing changes only require updating
neuron coefficients in config, not model code.
Workload Types Overview
Workload Types in Replicate
Replicate handles several distinct workload types, each with different characteristics and configurations. Understanding these distinctions is essential for comparing with Workers AI, which has a more uniform workload model.
Overview
Replicate defines four primary workload types via the DeployableKind enum
(replicate/web/models/models/deployable_config.py:114):
class DeployableKind(models.TextChoices):
    DEPLOYMENT_PREDICTION = "deployment-prediction", "deployment-prediction"
    FUNCTION_PREDICTION = "function-prediction", "function-prediction"
    VERSION_PREDICTION = "version-prediction", "version-prediction"
    VERSION_TRAINING = "version-training", "version-training"
Each workload type has its own deployable_metadata_for_* function in
replicate/web/models/logic.py
that generates the appropriate configuration.
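As a quick reference, the mapping from workload kind to builder function can be summarized as a hypothetical dispatch table; the builder function names are real (each is covered in the sections below), but logic.py is not necessarily structured this way:

```python
# Hypothetical mapping of DeployableKind values to their metadata builders.
METADATA_BUILDERS = {
    "deployment-prediction": "deployable_metadata_for_deployment_prediction",
    "function-prediction": "deployable_metadata_for_procedure_prediction",
    "version-prediction": "deployable_metadata_for_version_prediction",
    "version-training": "deployable_metadata_for_version_training",
}
```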
1. Deployment Predictions
What: Predictions running on a Deployment - a stable, long-lived identifier that routes to a backing model version. The backing version can be changed over time and doesn’t need to be owned by the same account as the deployment.
Characteristics:
- Persistent deployment entity (configuration/routing), but infrastructure can scale to 0 replicas
- Custom configuration per deployment that can override many version-level settings
- Uses dedicated deployment key for consistent routing
Configuration:
deployable_metadata_for_deployment_prediction()
Code references:
- Deployment model: replicate/web/models/models/deployment.py
- Kind validation: logic.py:1156 - asserts kind == DeployableKind.DEPLOYMENT_PREDICTION and not deployment.used_by_model
Queue behavior: Standard shuffle-sharded queues per deployment
2. Function Predictions (Pipelines/Procedures)
What: Predictions for multi-step workflows (Replicate Pipelines). Function predictions run on a shared container image with CPU-only hardware that “swaps” in procedure source code at prediction time. Similar to hotswaps but specifically for CPU-only Python code rather than GPU model weights.
Characteristics:
- Run on CPU hardware (no GPU)
- Share a base container image across procedures
- Procedure source code is swapped in at runtime (analogous to weight swapping in hotswaps)
- Part of a larger multi-step workflow
- Uses AbstractProcedure model, not Version
Configuration:
deployable_metadata_for_procedure_prediction()
Code references:
- Procedure model: replicate/web/models/models/procedure.py
- Kind validation: logic.py:1165 - when kind == DeployableKind.FUNCTION_PREDICTION, asserts deployment.used_by_model
Director configuration: Director runs procedures with DIRECTOR_JOB_KIND=procedure
(director/config.go:37)
Queue behavior: Standard queues, no special routing
3. Version Predictions
What: Predictions running directly on a model version (not through a deployment).
Characteristics:
- Ephemeral infrastructure (scaled up/down based on demand)
- Configuration comes from version’s current_prediction_deployable_config
- Two sub-types: normal and hotswap (see below)
Configuration:
deployable_metadata_for_version_prediction()
3a. Normal Version Predictions
What: Standard version predictions without hotswapping.
Characteristics:
- One container image per version
- Standard queue routing
- prefer_same_stream = False (default)
Code:
logic.py:1214-1222
if not version.is_hotswappable:
    metadata = DeployableConfigSerializer(
        version.current_prediction_deployable_config
    ).data
3b. Hotswap Version Predictions
What: Versions that share a base container but load different weights at runtime. Multiple “hotswap versions” can run on the same pod by swapping weights instead of restarting containers.
Characteristics:
- Multiple versions share the same base Docker image
- Each version has additional_weights (weights loaded at runtime)
- Base version must be public, non-virtual, and accept Replicate weights
- prefer_same_stream = True - workers preferentially consume from the same stream to optimize weight locality
- Uses base version’s deployment key for infrastructure sharing
When a version is hotswappable
(version.py:586):
def is_hotswappable(self) -> bool:
    if not self.additional_weights:
        return False
    if not self.base_version:
        return False
    if self.base_docker_image_id != self.base_version.docker_image_relation_id:
        return False
    return self.base_version.is_valid_hotswap_base
Configuration:
logic.py:1229-1244
deployable_config_fields = {
    ...
    "docker_image": version.base_version.docker_image_relation,
    "fuse_config": None,
    "key": version.base_version.key_for_hotswap_base_predictions,
    "prefer_same_stream": True,  # KEY DIFFERENCE
}
Queue behavior:
- Shuffle-sharded queues (like all workloads)
- DIRECTOR_PREFER_SAME_STREAM=true set for hotswap versions
- Director workers repeatedly consume from the same stream before checking others
- Optimizes for weight caching locality (reduce weight download/loading overhead)
Why prefer_same_stream matters:
- Hotswap versions load different weights into the same container
- If a worker has already loaded weights for version A, it’s faster to keep processing version A predictions
- Without prefer_same_stream, workers round-robin across all streams, losing weight locality benefits
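A Python sketch of that consumption pattern (Director’s actual implementation is Go in director/redis/queue.go; the function and its has_pending helper are illustrative):

```python
def pick_stream(streams, last_stream, prefer_same_stream, has_pending):
    # has_pending(stream) -> bool stands in for an XREADGROUP check.
    if prefer_same_stream and last_stream and has_pending(last_stream):
        # Weights for this deployable are already loaded; keep draining it.
        return last_stream
    # Otherwise fall back to the worker's other assigned shards.
    for stream in streams:
        if has_pending(stream):
            return stream
    return None
```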
Code references:
- Version model: replicate/web/models/models/version.py
- Hotswap validation: version.py:586-595
- Queue affinity: director/config.go:50
- Redis implementation: director/redis/queue.go
4. Version Trainings
What: Training jobs for versions marked as trainable.
Characteristics:
- Uses separate current_training_deployable_config (not prediction config)
- Longer timeouts and different resource requirements
- Different billing model
- Director runs with DIRECTOR_JOB_KIND=training
Configuration:
deployable_metadata_for_version_training()
Code references:
- Training config: logic.py:1307-1308
- Director job kind: director/config.go:37
Summary Table
| Workload Type | Infrastructure | Config Source | prefer_same_stream | Director Job Kind |
|---|---|---|---|---|
| Deployment Prediction | Persistent | Deployment config | False | prediction |
| Function Prediction | Ephemeral | Procedure config | False | procedure |
| Version Prediction (normal) | Ephemeral | Version prediction config | False | prediction |
| Version Prediction (hotswap) | Shared base | Base version config + weights | True | prediction |
| Version Training | Ephemeral | Version training config | False | training |
Key Takeaways
- Replicate has explicit workload type separation with different configurations and behaviors per type
- Hotswap predictions are unique - they’re the only workload using prefer_same_stream=True for weight locality optimization
- Function predictions use code swapping - similar to hotswaps but for CPU-only Python code instead of GPU model weights
- Each workload type has distinct characteristics - different infrastructure models, configuration sources, and queue behaviors
These workload distinctions will be referenced throughout the document when discussing queue behavior, timeouts, scaling, and other operational characteristics.