Introduction

This is a technical comparison of Replicate and Cloudflare Workers AI. It covers how each system handles request routing, autoscaling, resource management, model loading, and observability. When possible, links to source code are included.

Audience

Engineers and engineering leaders on the Workers AI and adjacent ai-platform teams. Sections are written to be useful both as reference material for people building these systems and as context for people making decisions about them.

How to read this

Each section covers both platforms, ending with a “Key Differences” summary. Sections are self-contained — read them in order or jump to what’s relevant.

Scope

This comparison covers the inference serving path and supporting infrastructure: how requests arrive, get routed to models, execute, and return results. In general, it does not cover ancillary tooling or browser-level user experience topics.

Request Lifecycle

Request Lifecycle Management

Overview

A given prediction or inference request touches different components on each platform. The diagrams below show the (simplified) full path from client to inference server and back.

Replicate

graph TD
    Client([Client])
    Web[web]

    subgraph GPU cluster
        API[api]
        Redis[(redis)]
        Director[director]
        Cog[cog]
    end

    Client -->|HTTP| API
    Web -.->|playground| API
    API -->|LPUSH / XADD| Redis
    Redis -->|XREADGROUP| Director
    Director -->|HTTP| Cog
    Cog -->|response| Director
    Director -->|webhook| API
    API -->|poll / webhook| Client

Key points: the request is asynchronous by default. The client submits a prediction, the API enqueues it, and Director picks it up later. The client polls, receives a webhook when the result is ready, or opts into “blocking” mode at the per-request level. Streaming predictions respond over SSE but still enter through the same queue.
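The asynchronous flow can be sketched as a client-side polling loop. This is illustrative only — `get_status` stands in for a GET against the prediction's URL, and the real Replicate SDKs wrap this logic:

```python
import time

def poll_prediction(get_status, interval_s=1.0, max_polls=100):
    """Poll a prediction until it reaches a terminal state.

    `get_status` is any callable returning the prediction's current
    state string; here it stands in for an HTTP GET on the prediction.
    """
    terminal = {"succeeded", "failed", "canceled"}
    for _ in range(max_polls):
        state = get_status()
        if state in terminal:
            return state
        time.sleep(interval_s)
    raise TimeoutError("prediction did not reach a terminal state")
```

A webhook subscription or per-request blocking mode replaces this loop entirely.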

Workers AI

graph TD
    Client([Client])
    WCE[worker-constellation-entry]
    CE[constellation-entry]
    CS[constellation-server]

    subgraph GPU cluster
        GPU[GPU container]
    end

    Client -->|HTTP| WCE
    WCE -->|HTTP| CE
    CE -->|pipefitter / HTTP| CS
    CS -->|HTTP| GPU
    GPU -->|response| CS
    CS -->|response| CE
    CE -->|response| WCE
    WCE -->|response| Client

Key points: the request is synchronous. The client holds an open HTTP connection while the request flows through the stack to a GPU container and back. There is no persistent queue — requests wait in bounded in-memory queues or get rejected with a 429 if no capacity is available.


Replicate Request Processing

The Replicate request path has three major phases: API-side routing and enqueue, transient queuing in Redis, and Director-side dequeue and execution. Durable state lives in PostgreSQL (via web), not in the queue. The sections below cover each phase in detail.

Behavior varies by workload type (see Workload Types appendix). All workload types share the same core queue and worker infrastructure but differ in configuration.

Queue and Message Handling

Each GPU provider’s managed Kubernetes cluster runs its own Redis cluster (with Redis Sentinel for HA). Director uses Redis Streams with consumer groups for request queuing. Each deployable has its own stream — the API pushes prediction requests to the stream, and Director pods consume from it. Queue names combine a workload-type prefix (input:prediction:dp-, input:prediction:fp-, etc.) with the deployable’s key.

Queues use shuffle sharding by default (since October 2024) — each tenant is assigned a subset of stream shards, providing fairness by preventing one account’s batch from blocking others. Hotswap predictions enable PreferSameStream for weight cache locality.
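The fairness property of shuffle sharding can be illustrated with a deterministic hash-based assignment. This is a sketch of the general technique, not Replicate's actual shard-selection code; the shard counts are assumed:

```python
import hashlib
import random

def shards_for_tenant(tenant_id: str, total_shards: int = 8, subset_size: int = 2):
    """Assign a tenant a fixed, deterministic subset of stream shards.

    Two tenants rarely share their entire subset, so one tenant's burst
    can saturate at most its own shards while others keep making progress.
    """
    seed = int.from_bytes(hashlib.sha256(tenant_id.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_shards), subset_size))
```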

See Queue Management for full details on queue internals: MessageQueue fetcher/claimer goroutines, sharded queues, blue/green migration via MultiMessageQueue, and stuck message cleanup.

Worker Loop

Each Director instance runs worker goroutines that poll the queue. The number of workers is controlled by DIRECTOR_CONCURRENT_PREDICTIONS (default: 1). This is set by cluster when the model’s Cog config specifies concurrency.max > 1 (cluster/pkg/kubernetes/deployable.go:1149-1153). Most models run with the default of 1 worker.

The worker loop (director/worker.go:134) continuously:

  1. Polls Redis Stream for messages using consumer groups
  2. Checks if prediction execution is allowed (spending validation, principal checks)
  3. Calculates effective deadline from prediction metadata, deployment lifetime, and timeout config
  4. Creates a Tracker instance for state management
  5. Executes prediction via Executor interface (which calls into Cog)
  6. Handles terminal states (success, failure, cancellation)
  7. Acknowledges message in Redis
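The steps above can be sketched as a single loop iteration. Structure only — `queue.poll`, `executor.run`, and the message fields are stand-ins for the real Redis and Cog calls:

```python
def worker_iteration(queue, executor, tracker_factory, now):
    """One pass of the Director worker loop (simplified sketch)."""
    msg = queue.poll()                       # 1. XREADGROUP on the stream
    if msg is None:
        return None
    if not msg.allowed:                      # 2. spending / principal checks
        queue.ack(msg)
        return "rejected"
    # 3. effective deadline: earliest of per-prediction deadline,
    #    deployment lifetime, and the timeout fallback
    candidates = [d for d in (msg.deadline, msg.lifetime_deadline, now + msg.timeout)
                  if d is not None]
    deadline = min(candidates)
    tracker = tracker_factory(msg)           # 4. state management
    result = executor.run(msg, deadline=deadline, tracker=tracker)  # 5. into Cog
    tracker.complete(result)                 # 6. terminal state handling
    queue.ack(msg)                           # 7. XACK in Redis
    return result
```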

The worker loop behavior is the same across all workload types. What differs is the job kind passed to the executor:

  • Predictions (deployment, version): DIRECTOR_JOB_KIND=prediction (default)
  • Trainings: DIRECTOR_JOB_KIND=training
  • Procedures (function predictions): DIRECTOR_JOB_KIND=procedure

Note: Internally, Director converts procedure to prediction when sending to Cog because the API doesn’t yet support the procedure value (director/worker.go:731-737).

Tracker and State Transitions

The Tracker (director/tracker.go) manages state transitions and notifications. It defines four webhook event types:

  • start - Prediction begins execution
  • output - Intermediate outputs during execution
  • logs - Log messages during execution
  • completed - Prediction reaches terminal state

The Tracker maintains:

  • Prediction state (struct at director/tracker.go:36-69)
  • Subscriber functions that receive state updates
  • Pending webhook events (if notifications are suppressed temporarily)
  • Upload tracking for async file completions
  • Moderation state (if content moderation is enabled)

Webhooks are sent to the Replicate API at state transitions. The webhook sender uses a PreemptingExecutor (director/executor.go) that may drop intermediate webhooks if new updates arrive before the previous one is sent - this is intentional since the latest state supersedes earlier ones. Terminal webhooks are always delivered.
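The drop-stale-updates behavior can be modeled with a tiny coalescing sender. This is a sketch of the idea, not the Go implementation; `deliver` stands in for the HTTP POST to the API:

```python
class PreemptingSender:
    """Coalesce intermediate webhook events: only the newest unsent
    update survives, while terminal events are always delivered."""

    def __init__(self, deliver):
        self.deliver = deliver      # callable standing in for the HTTP POST
        self.pending = None

    def submit(self, event, terminal=False):
        if terminal:
            self.flush()
            self.deliver(event)     # terminal webhooks are never dropped
        else:
            self.pending = event    # newer update preempts an unsent older one

    def flush(self):
        """Called when the sender is free to make another request."""
        if self.pending is not None:
            self.deliver(self.pending)
            self.pending = None
```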

Streaming

Director supports Server-Sent Events (SSE) for streaming outputs. Streaming is enabled when the API includes a stream_publish_url in the prediction’s internal metadata (cog/types.go:230). The StreamClient (director/streaming.go) publishes output events to this URL as they occur.

Concurrency Control

Director supports concurrent predictions within a single container via DIRECTOR_CONCURRENT_PREDICTIONS. See Concurrency Deep Dive for details on configuration and usage patterns.

Workers AI Request Processing

Routing Architecture

Workers AI requests flow through multiple components:

  1. worker-constellation-entry: Pipeline Worker (TypeScript) that receives API requests from SDK or REST API. Handles input validation, model lookup, and request formatting before forwarding to constellation-entry (apps/worker-constellation-entry/src/worker.ts).

  2. constellation-entry: Rust service running on every metal. Fetches model config from QS, selects target colos using routing algorithms, resolves constellation-server addresses, and forwards via pipefitter or HTTP (gitlab.cfdata.org/cloudflare/ai/constellation-entry).

  3. constellation-server: Colo-level load balancer (Rust). Manages permits and request queues per model, selects GPU container via service discovery, handles inference forwarding and cross-colo failover (gitlab.cfdata.org/cloudflare/ai/constellation-server).

  4. GPU container: Inference server running the model.

constellation-entry Routing

constellation-entry fetches model configuration from Quicksilver (distributed KV store), which includes the list of colos where the model is deployed (src/model_config.rs). It then applies a routing algorithm to select target colos. Routing algorithms include:

  • Random: Random selection among available colos (default)
  • LatencySensitive: Prefer nearby colos, with step-based selection
  • ForcedColo: Override to a specific colo (for debugging/testing)
  • ForcedResourceFlow: Use resource-flow routing (src/resource_flow.rs)

Routing uses a step-based algorithm (StepBasedRouter) that generates a prioritized list of colos. The algorithm can be customized per-model via routing_params in the model config.
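One plausible reading of step-based selection is latency banding: bucket colos into latency "steps", then randomize within each band. Illustrative only — the actual StepBasedRouter logic and its parameters live in constellation-entry:

```python
import random

def prioritized_colos(colo_latencies, step_ms=20, rng=None):
    """Order colos into latency bands of width `step_ms`, shuffling
    within each band so load spreads across similarly-close colos."""
    rng = rng or random.Random()
    buckets = {}
    for colo, latency_ms in colo_latencies.items():
        buckets.setdefault(latency_ms // step_ms, []).append(colo)
    ordered = []
    for step in sorted(buckets):
        band = buckets[step]
        rng.shuffle(band)
        ordered.extend(band)
    return ordered
```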

Once colos are selected, constellation-entry resolves constellation-server addresses via QS or DNS (src/cs_resolver.rs) and forwards the request via pipefitter or HTTP (src/pipefitter_transport.rs).

constellation-server Load Balancing

constellation-server runs in each colo and handles:

Request handling (src/main.rs, server/src/lib.rs):

  • Listens on Unix domain socket (for pipefitter) and TCP port (src/cli.rs:23,31)
  • Extracts routing metadata: model ID, account ID, fallback routing info
  • Tracks distributed tracing spans
  • Emits metrics to Ready Analytics

Endpoint load balancing and concurrency (server/src/endpoint_lb.rs):

Service discovery (service-discovery/src/lib.rs):

Inference forwarding (server/src/inference/mod.rs):

  • Regular HTTP requests: Buffers and forwards to GPU container with retry logic
  • WebSocket requests: Upgrades connection and relays bidirectionally (no retries)
  • Backend-specific handlers for Triton, TGI, TEI, and generic HTTP (PipeHttp — used for vLLM and others) in server/src/inference/

Failover and forwarding (server/src/colo_forwarding.rs):

  • If target returns error or is unavailable, may forward to different colo
  • Tracks inflight requests via headers (cf-ai-requests-inflight) to coordinate across instances
  • For pipefitter: Returns 500 with headers telling pipefitter to retry different colos
  • Handles graceful draining: continues accepting requests while failing health checks

Queuing and Backpressure

When GPU containers are busy, backpressure propagates through the stack:

  1. constellation-server: find_upstream() attempts to acquire a permit for a GPU endpoint. If all permits are taken, the request enters a bounded local colo queue (ModelRequestQueue). The queue is a future that resolves when a permit frees up (via tokio::sync::Notify). If the queue is also full, the request gets NoSlotsAvailableServerError::OutOfCapacity. OOC may also trigger forwarding to another constellation-server in the same colo (server/src/endpoint_lb/endpoint_group.rs:200-277).

  2. constellation-entry: Receives OOC from constellation-server (error codes 3033, 3040, 3047, 3048). Has a retry loop with separate budgets for errors (up to 3) and OOC (varies by request size). For small requests, pipefitter handles cross-colo forwarding. For large requests (>256 KiB), constellation-entry retries across target colos itself. All OOC variants are mapped to generic 3040 before returning to the client (proxy_server.rs:1509-1600, errors.rs:62-67).

  3. worker-constellation-entry: Has an optional retry loop on 429/424 responses, but outOfCapacityRetries defaults to 0 (configurable per-model via sdkConfig). By default, OOC goes straight to the client (ai/session.ts:101,137-141).

  4. Client: Receives HTTP 429 with "Capacity temporarily exceeded, please try again." (internal code 3040).

There is no persistent queue in the Workers AI stack. Backpressure is handled through permit-based admission control, bounded in-memory queues, cross-colo retry, and ultimately rejection to the client. This contrasts with Replicate’s Redis Streams-based queue, where requests persist until a Director instance dequeues them.
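The admission sequence at constellation-server can be sketched as a permit pool with a bounded waiter queue. Illustrative only — the real implementation is async Rust built on tokio::sync::Notify:

```python
from collections import deque

class ModelAdmission:
    """Permit-based admission with a bounded local queue, as in
    constellation-server's find_upstream path (simplified)."""

    def __init__(self, permits: int, queue_limit: int):
        self.permits = permits
        self.queue = deque()
        self.queue_limit = queue_limit

    def admit(self, request):
        if self.permits > 0:
            self.permits -= 1
            return "run"
        if len(self.queue) < self.queue_limit:
            self.queue.append(request)
            return "queued"
        return "out_of_capacity"    # surfaces to the client as a 429 (code 3040)

    def release(self):
        """A finished request frees its permit; a queued waiter takes it."""
        if self.queue:
            return ("run", self.queue.popleft())
        self.permits += 1
        return None
```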

State and Observability

Workers AI observability has two main channels from constellation-server:

Ready Analytics (ready-analytics/src/): Per-inference events sent via logfwdr (Unix socket). Each InferenceEvent includes:

  • Request timing: request_time_ms, inference_time_ms, queuing_time_ms, time_to_first_token
  • Billing: neurons, cost metric values
  • Routing: colo_id, source_colo_id, colo_path, upstream_address
  • Queue state: local_queue_size
  • Flags: streamed, beta, external endpoint, queued in colo

See ready-analytics/src/inference_event.rs for full field list.

Prometheus metrics (metrics/src/lib.rs): Local metrics scraped by monitoring. Includes:

  • requests (by status), errors (by internal/HTTP code)
  • total_permits, mean_used_permits, mean_queue_size, full_fraction (per model)
  • infer_per_backend, infer_per_health_state
  • forward_failures (per target colo)
  • backoff_removed_endpoints (endpoints in backoff state)

Note: “Workers Analytics” in the public docs refers to analytics for Workers code execution, not GPU inference. GPU inference observability flows through Ready Analytics.

Streaming

constellation-server supports two streaming mechanisms:

HTTP streaming (server/src/inference/pipe_http.rs): For PipeHttp backends, the response body can be InferenceBody::Streaming (line 222-230), which passes the Incoming body through without buffering via chunked transfer encoding. The permit is held until the connection completes (line 396-397 for HTTP/1, line 472 for HTTP/2). If the backend sends SSE-formatted data (e.g., TGI’s streaming responses), constellation-server passes it through transparently without parsing.

WebSocket (server/src/inference/mod.rs:180-400): Bidirectional WebSocket connections are handled by handle_ws_inference_request. After the HTTP upgrade, handle_websocket_request relays messages between client and inference server. A max connection duration is enforced (max_connection_duration_s, line 332-338).

Loopback SSE (server/src/loopback/): Separate from client-facing streaming. constellation-server maintains SSE connections to inference servers that support async batch processing. These connections receive completion events (InferenceSuccess, InferenceError, WebsocketFinished, GpuFree) which trigger permit release and async queue acknowledgment. See loopback/sse.rs for the SSE client implementation.

Concurrency Control

constellation-server uses permit-based concurrency control per model via EndpointLoadBalancer (server/src/endpoint_lb.rs). Concurrency limits are configured in Config API. When local queuing is enabled, requests wait in an explicit ModelRequestQueue (FIFO or Fair algorithm) until a permit becomes available. Different models can have different concurrency limits on the same server.
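The Fair algorithm can be sketched as per-account round-robin over waiting requests. A sketch of the idea only; the actual ModelRequestQueue lives in server/src/endpoint_lb.rs:

```python
from collections import OrderedDict, deque

class FairQueue:
    """Round-robin across accounts so one account's backlog cannot
    starve others waiting on the same model's permits."""

    def __init__(self):
        self.by_account = OrderedDict()

    def push(self, account, request):
        self.by_account.setdefault(account, deque()).append(request)

    def pop(self):
        if not self.by_account:
            return None
        account, q = next(iter(self.by_account.items()))
        request = q.popleft()
        del self.by_account[account]
        if q:
            self.by_account[account] = q    # rotate to the back
        return request
```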


Key Differences

| Aspect | Replicate | Workers AI |
| --- | --- | --- |
| Request model | Asynchronous by default (poll/webhook); sync opt-in via Prefer: wait | Synchronous by default; async opt-in via queueRequest |
| Queue persistence | Redis Streams — requests survive component restarts | No persistent queue — bounded in-memory queues, rejection on overflow |
| Backpressure | Queue absorbs bursts; Director dequeues when ready | Permit-based admission → local queue → cross-colo retry → 429 to client |
| State management | Director Tracker with webhook notifications (start, output, logs, completed) | Stateless request-response; observability via Ready Analytics events |
| Streaming | SSE via stream_publish_url; Director publishes output events | HTTP chunked transfer passthrough; WebSocket relay; loopback SSE for async batch |
| Routing | API routes to correct GPU cluster’s Redis; Director pulls from local queue | constellation-entry selects colos via routing algorithms; constellation-server selects GPU endpoint |
| Concurrency | DIRECTOR_CONCURRENT_PREDICTIONS per container (default 1) | Permit-based per model in constellation-server; FIFO or Fair queue algorithm |
| Fairness | Per-tenant shuffle sharding at queue level | Per-account round-robin in fair queue; per-account rate limiting |
| Transport | HTTP between Director and Cog | Pipefitter (small requests) or HTTP (>256 KiB) between constellation-entry and constellation-server |
| Failure handling | Webhook terminal states (failed/canceled); message ACK in Redis | Error codes propagated through stack; OOC triggers cross-colo retry |

Timeouts & Deadlines


Overview

Both platforms enforce timeouts on inference requests, but the designs reflect different assumptions about workload duration. Replicate has a multi-phase timeout system with separate controls for model setup, prediction execution, and overall lifetime — designed for workloads ranging from seconds to 24-hour training runs. Workers AI has a two-layer system: a per-model inference timeout in constellation-server (default 30 seconds) under a hard 4-minute ceiling in constellation-entry — optimized for fast, predictable inference.


Replicate Timeout System

Timeout behavior spans multiple components:

  • replicate/web - defines timeout values per workload type in deployable metadata
  • replicate/api - can inject per-prediction deadlines into prediction metadata
  • replicate/cluster - translates deployable config into DIRECTOR_* environment variables when creating K8s deployments; provides default values when the deployable config doesn’t specify them (pkg/config/config.go)
  • director - reads env vars and enforces timeouts during execution

Configuration

Three timeout parameters control prediction lifecycle. Each flows from replicate/web through replicate/cluster to director as an environment variable. Director reads them as flags (director/config.go:52-58).

Model Setup Timeout (DIRECTOR_MODEL_SETUP_TIMEOUT)

Time allowed for model container initialization. Applied during health check polling while waiting for the container to become ready. Measures time from StatusStarting to completion. If exceeded, Director fails the prediction with setup logs and reports to the setup run endpoint (director/director.go:718-830).

Resolution chain:

  1. web: Model.setup_timeout property — returns DEFAULT_MODEL_SETUP_TIMEOUT (10 minutes) when the underlying DB field is None (models/model.py:1038-1042). Stored on DeployableConfig.setup_timeout, then serialized as model_setup_timeout with a multiplier and bonus applied (api_serializers.py:1678-1681).
  2. cluster: Reads deployable.ModelSetupTimeout. If nonzero, uses it; otherwise falls back to config.ModelSetupTimeout (10 minutes) (pkg/config/config.go:74, pkg/kubernetes/deployable.go:1191-1201).
  3. director: Reads DIRECTOR_MODEL_SETUP_TIMEOUT env var. Own flag default is 0 (disabled), but cluster always sets it.

Prediction Timeout (DIRECTOR_PREDICT_TIMEOUT)

Time allowed for prediction execution, starting from when execution begins (not including queue time). Acts as a duration-based fallback when no explicit deadline is set. Director enforces a minimum of 30 minutes if configured to 0 or negative (director/director.go:285-288).

Resolution chain:

  1. web: Model.prediction_timeout (nullable duration). When set, stored on DeployableConfig.run_timeout and serialized as prediction_timeout (models/model.py:1010-1017, api_serializers.py:1702-1703). When None, the serialized prediction_timeout is omitted.

  2. cluster: predictTimeout() resolves the value with this priority (pkg/kubernetes/deployable.go:1886-1906):

    1. deployable.PredictionTimeout (from web, if set)
    2. Hardcoded per-account override map userSpecificTimeouts (pkg/kubernetes/deployable.go:88-116)
    3. config.PredictTimeoutSeconds (30 minutes) (pkg/config/config.go:71)
    4. For training workloads, uses the higher of the above and TrainTimeoutSeconds (24 hours)

    A mirrored per-account map exists in web’s USER_SPECIFIC_TIMEOUTS (models/prediction.py:60-107) to prevent terminate_stuck_predictions from killing predictions that have these extended timeouts.

  3. director: Reads DIRECTOR_PREDICT_TIMEOUT env var. Own flag default is 1800s (30 minutes), but cluster always sets it.

Max Run Lifetime (DIRECTOR_MAX_RUN_LIFETIME)

Maximum time for the entire prediction lifecycle including queue time and setup. Calculated from prediction.CreatedAt timestamp.

There are two paths that produce a max run lifetime constraint — one via the deployable config (env var on the Director pod) and one via a per-request header that becomes a deadline on the prediction itself.

Deployable config path (env var):

  1. web: DeployableConfig.max_run_lifetime — defaults to DEFAULT_MAX_RUN_LIFETIME (24 hours) at the DB level (models/deployable_config.py:259-261). For deployment predictions, deployment.max_run_lifetime can override this (logic.py:1175-1176). Serialized as max_run_lifetime (api_serializers.py:1675-1676).
  2. cluster: Reads deployable.MaxRunLifetime. If nonzero, uses it; otherwise falls back to config.MaxRunLifetime (24 hours) (pkg/config/config.go:75, pkg/kubernetes/deployable.go:1203-1213).
  3. director: Reads DIRECTOR_MAX_RUN_LIFETIME env var. Own flag default is 0 (disabled), but cluster always sets it.

Per-request path (Cancel-After header):

  1. api: The Cancel-After HTTP header on a prediction request is parsed as a duration (Go-style like 5m or bare seconds like 300). Minimum 5 seconds (server/v1_prediction_handler.go:233-276).
  2. api: calculateEffectiveDeadline() picks the shorter of the request’s Cancel-After value and the deployable metadata’s max_run_lifetime, computes an absolute deadline from prediction creation time, and sets it on prediction.InternalMetadata.Deadline (logic/prediction.go:1076-1109).
  3. director: This deadline is the “per-prediction deadline” checked first in the deadline priority (see Deadline Calculation below).
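The two API-side steps above can be sketched as follows. Helper names are hypothetical, and clamping to the 5-second minimum is one reading of "Minimum 5 seconds" — the API may instead reject smaller values:

```python
import re

def parse_cancel_after(value: str) -> float:
    """Parse bare seconds ("300") or a Go-style duration ("1h30m")."""
    if re.fullmatch(r"\d+(\.\d+)?", value):
        seconds = float(value)
    else:
        units = {"h": 3600, "m": 60, "s": 1}
        parts = re.findall(r"(\d+(?:\.\d+)?)([hms])", value)
        seconds = sum(float(n) * units[u] for n, u in parts)
    return max(seconds, 5.0)    # enforce the 5-second minimum (clamping here)

def effective_deadline(created_at: float, cancel_after_s: float,
                       max_run_lifetime_s: float) -> float:
    """Shorter of Cancel-After and max_run_lifetime, as an absolute
    deadline anchored at prediction creation time."""
    return created_at + min(cancel_after_s, max_run_lifetime_s)
```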

Deadline Calculation and Enforcement

Director calculates an effective deadline at dequeue time using this priority (director/worker.go:63-107):

  1. Per-prediction deadline (prediction.InternalMetadata.Deadline)
  2. Deployment deadline (prediction.CreatedAt + MaxRunLifetime, if configured)
  3. Prediction timeout (fallback)

The earliest applicable deadline wins. At execution time (director/worker.go:516-528):

deadline, source, ok := getEffectiveDeadline(prediction)
var timeoutDuration time.Duration

if ok {
    timeoutDuration = time.Until(deadline)
} else {
    timeoutDuration = d.predictionTimeout
}

timeoutTimer := time.After(timeoutDuration)

Timeout outcomes vary based on when and why the deadline is exceeded:

Before execution (deadline already passed at dequeue):

  • Per-prediction deadline → aborted
  • Deployment deadline → failed

During execution (timer fires while running):

  • Per-prediction deadline → canceled
  • Deployment deadline → failed
  • Prediction timeout (fallback) → failed

Timeout Behavior Examples

Scenario 1: Version prediction with deployment deadline

  • DIRECTOR_MAX_RUN_LIFETIME=300 (5 minutes)
  • DIRECTOR_PREDICT_TIMEOUT=1800 (30 minutes)
  • Prediction created at T=0, dequeued at T=60s
  • Effective deadline: T=0 + 300s = T=300s
  • Execution timeout: 300s - 60s = 240s remaining

Scenario 2: Prediction with explicit deadline

  • Per-prediction deadline: T=180s (3 minutes from creation)
  • DIRECTOR_MAX_RUN_LIFETIME=600 (10 minutes)
  • Prediction created at T=0, dequeued at T=30s
  • Effective deadline: T=180s (per-prediction deadline wins)
  • Execution timeout: 180s - 30s = 150s remaining

Scenario 3: Training with no deadlines

  • DIRECTOR_MAX_RUN_LIFETIME=0 (disabled)
  • DIRECTOR_PREDICT_TIMEOUT=1800 (30 minutes)
  • No per-prediction deadline
  • Effective deadline: None (uses fallback)
  • Execution timeout: 1800s from execution start

Workers AI Timeout System

Workers AI has a simpler timeout model than Replicate, but the request passes through multiple layers, each with its own timeout behavior.

Request Path Layers

A request from the internet traverses three services before reaching a GPU container:

  1. worker-constellation-entry (TypeScript Worker): The SDK-facing entry point. Calls constellation-entry via a Workers service binding (this.binding.fetch(...)) with no explicit timeout (ai/session.ts:131). See note below on Workers runtime limits.
  2. constellation-entry (Rust, runs on metal): Wraps the entire request to constellation-server in a hardcoded 4-minute tokio::time::timeout (proxy_server.rs:99, proxy_server.rs:831-845). On timeout, returns EntryError::ForwardTimeout. This applies to both HTTP and WebSocket paths.
  3. constellation-server (Rust, runs on GPU nodes): Enforces per-model inference timeout (see below).

The 4-minute constellation-entry timeout acts as a hard ceiling. If a model’s inference_timeout in constellation-server exceeds 240 seconds, constellation-entry will kill the connection before constellation-server’s own timeout fires. This means constellation-entry is the effective upper bound for any single inference request, unless constellation-entry’s constant is changed.
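The interaction reduces to two operations; the constants come from the text above, and the function name is purely illustrative:

```python
ENTRY_CEILING_S = 240   # hardcoded 4-minute timeout in constellation-entry

def observed_timeout(model_inference_timeout_s: int, server_default_s: int = 30) -> int:
    """What the client actually experiences: constellation-server takes
    the max of model config and its default, but constellation-entry
    cuts the connection at its own ceiling regardless."""
    cs_timeout = max(model_inference_timeout_s, server_default_s)
    return min(cs_timeout, ENTRY_CEILING_S)
```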

Workers runtime limits: worker-constellation-entry is a standard Worker deployed on Cloudflare’s internal AI account (not a pipeline First Party Worker). It does not set limits.cpu_ms in its wrangler.toml, so it gets the default 30-second CPU time limit. However, CPU time only counts active processing — time spent awaiting the service binding fetch to constellation-entry is I/O wait and does not count (docs). Since the Worker is almost entirely I/O-bound, the CPU limit is unlikely to be reached. Wall-clock duration has no hard cap for HTTP requests, though Workers can be evicted during runtime restarts (~once/week, 30-second drain). In practice, the Workers runtime layer is transparent for timeout purposes — constellation-entry’s 4-minute timeout fires well before any platform limit would.

Configuration Levels

constellation-server Default (constellation-server/src/cli.rs:107-109):

  • --infer-request-timeout-secs: CLI argument, default 30 seconds
  • Acts as the minimum timeout floor for all inference requests
  • Applied globally across all models served by the constellation-server instance

Per-Model Configuration (model-repository/src/config.rs:143):

  • inference_timeout: Model-specific timeout in seconds (default: 0)
  • Configured via Config API (Deus) and fetched from Quicksilver (distributed KV store)
  • Part of WorkersAiModelConfig structure
  • Zero value means “use constellation-server default”

Timeout Resolution

When processing an inference request, constellation-server resolves the timeout (server/src/lib.rs):

let min_timeout = state.infer_request_timeout_secs as u64;
let mut inference_timeout = Duration::from_secs(state.infer_request_timeout_secs as u64);

if let Some(model_config) = state.endpoint_lb.get_model_config(&model_id) {
    inference_timeout = Duration::from_secs(
        model_config.ai_config.inference_timeout.max(min_timeout)
    );
}

The effective timeout is: max(model_config.inference_timeout, infer_request_timeout_secs). This ensures:

  • Models can extend their timeout beyond the 30-second default
  • Models cannot reduce their timeout below the constellation-server minimum
  • Unconfigured models (inference_timeout=0) use the 30-second default

Enforcement

constellation-server enforces the timeout by wrapping the entire inference call in tokio::time::timeout(inference_timeout, handle_inference_request(...)) (server/src/lib.rs:588-599). When the timeout elapses, the future is dropped, which closes the connection to the GPU container. The error maps to ServerError::InferenceTimeout (line 603).

The other end of the dropped connection varies by model backend (service-discovery/src/lib.rs:39-47):

  • Triton: Upstream NVIDIA tritonserver binary (launched via model-greeter). Disconnect behavior depends on Triton’s own implementation.
  • TGI/TEI: Upstream HuggingFace binaries. Disconnect behavior depends on the upstream Rust server implementation.
  • PipeHttp/PipeHttpLlm: Custom Python framework (WorkersAiApp in ai_catalog_common, FastAPI-based). The synchronous HTTP path (_generate) does not explicitly cancel the raw_generate coroutine on client disconnect — the inference runs to completion (catalog/common/.../workers_ai_app/app.py:880-896). WebSocket connections do cancel processing tasks in their finally block (catalog/common/.../workers_ai_app/app.py:536-553).

The constellation-server timeout covers:

  • Time waiting in the constellation-server queue (permit acquisition)
  • Forwarding to the GPU container
  • Inference execution in the backend (Triton, TGI, TEI)
  • Response transmission back through constellation-server

But constellation-entry’s 4-minute timeout wraps the entire round-trip to constellation-server, so it is the effective ceiling regardless of per-model config.

Unlike Director’s system, there is no separate setup timeout. Model containers are managed by the orchestrator (K8s) independently of request processing. Container initialization and readiness are handled by service discovery and health checks, not request-level timeouts.

Timeout Behavior Examples

Scenario 1: Text generation model with default timeout

  • Model config: inference_timeout=0 (not configured)
  • constellation-server: infer_request_timeout_secs=30
  • Effective timeout: 30 seconds

Scenario 2: Image generation model with extended timeout

  • Model config: inference_timeout=120 (2 minutes)
  • constellation-server: infer_request_timeout_secs=30
  • Effective timeout: 120 seconds (model config wins)

Scenario 3: Model timeout exceeds constellation-entry ceiling

  • Model config: inference_timeout=300 (5 minutes)
  • constellation-server: infer_request_timeout_secs=30
  • constellation-entry: 240 seconds (hardcoded)
  • constellation-server effective timeout: 300 seconds
  • Actual outcome: constellation-entry kills the connection at 240 seconds

Scenario 4: Attempted timeout reduction (not allowed)

  • Model config: inference_timeout=10 (10 seconds)
  • constellation-server: infer_request_timeout_secs=30
  • Effective timeout: 30 seconds (constellation-server minimum enforced)

Key Differences

Complexity and Granularity:

  • Director: Multi-phase timeout system with separate controls for setup (model initialization) and execution (inference), plus deployment-level lifetime limits
  • Workers AI: Two-layer timeout — constellation-entry imposes a hard 4-minute ceiling, constellation-server applies per-model timeouts within that ceiling

Deadline Calculation:

  • Director: Priority-based deadline system with per-prediction, deployment, and fallback timeouts. Calculates earliest applicable deadline at dequeue time, accounting for time already spent in queue
  • Workers AI: Simple max() operation between model config and constellation-server default, applied at request receipt time

Queue Time Handling:

  • Director: MaxRunLifetime includes queue time; execution timeout accounts for time already elapsed since creation
  • Workers AI: Timeout starts when constellation-server receives the request, inherently includes queue time

Setup vs Runtime:

  • Director: Explicit ModelSetupTimeout for container initialization, separate from prediction execution timeout
  • Workers AI: No request-level setup timeout; container lifecycle managed independently by orchestrator

Configuration Source:

  • Director: Environment variables set by cluster service during deployment creation
  • Workers AI: CLI argument (constellation-server) + distributed config store (Quicksilver/Config API)

Default Values:

  • Director: 30 minutes prediction timeout (generous for long-running inference), setup and lifetime timeouts disabled by default
  • Workers AI: 30 seconds base timeout (optimized for fast inference), extended per-model via config, hard ceiling of 4 minutes at constellation-entry

Minimum Enforcement:

  • Director: 30-minute minimum enforced for prediction timeout if misconfigured
  • Workers AI: constellation-server minimum enforced as floor, models can only extend

Use Case Alignment:

  • Director: Designed for variable-length workloads including multi-hour training runs; flexible per-deployment configuration
  • Workers AI: Optimized for inference workloads with known latency profiles; centralized model-specific configuration

Job Types


Overview

Replicate supports three distinct job kinds — predictions, training, and procedures (pipelines) — all running through the same Director/cog infrastructure but with different queue prefixes, cog endpoints, timeout behaviors, and webhook structures. Workers AI has a single job type: inference. Task-type differentiation (text-generation, image-to-text, etc.) exists only as an input/output schema concern in worker-constellation-entry; everything downstream is task-agnostic.


Replicate Job Types

Director supports three job kinds, configured per-deployment via DIRECTOR_JOB_KIND (director/config.go:37):

  • prediction (default)
  • training
  • procedure

The job kind determines which cog HTTP endpoint Director calls, how webhooks are structured, and what timeout behavior applies.
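The endpoint-per-kind dispatch described in this chapter can be restated as a small sketch (Director itself is Go; this Python version is purely illustrative). The version gate for trainings follows the cog >= 0.11.6 fallback covered in the Training section.

```python
def cog_endpoint(job_kind, job_id, cog_version=(0, 11, 6)):
    """Map a Director job kind to the cog HTTP endpoint it PUTs to."""
    if job_kind == "procedure":
        return f"/procedures/{job_id}"
    if job_kind == "training" and cog_version >= (0, 11, 6):
        return f"/trainings/{job_id}"
    # Default kind, and the fallback for trainings on older cog versions.
    return f"/predictions/{job_id}"
```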

How Job Kind Is Set

Cluster sets DIRECTOR_JOB_KIND when generating the Director pod spec (pkg/kubernetes/deployable.go:1268-1280):

  • isProcedureMode() → procedure
  • isTraining() → training
  • Neither → default prediction


DeployableKind (Web Side)

The web app distinguishes four deployable kinds (deployable_config.py:114-120):

| DeployableKind | Director JobKind | Redis Queue Pattern | Key Pattern |
|---|---|---|---|
| DEPLOYMENT_PREDICTION | prediction | input:prediction:{key} | dp-{uuid4_hex} |
| FUNCTION_PREDICTION | procedure | input:prediction:{key} (or UNIFIED_PIPELINE_QUEUE) | fp-{hash} |
| VERSION_PREDICTION | prediction or procedure | input:prediction:{key} (or UNIFIED_PIPELINE_QUEUE) | vp-{docker_image_id[:32]} |
| VERSION_TRAINING | training | input:training:{key} | vt-{docker_image_id[:32]} |

Key prefixes: dp- = deployment, fp- = function/pipeline, vp- = version prediction, vt- = version training. FUNCTION_PREDICTION is used for pipelines (procedures) — see the Procedures section below.

Queue names are generated by DeployableConfig.queue (deployable_config.py:565-589).

OfficialModel and ModelDeployment

Some models expose a “versionless API” — users run predictions against a model name (POST /v1/models/{owner}/{name}/predictions) rather than a specific version hash.

OfficialModel (current production) — Ties a Model to a backing Deployment and a BillingConfig (official_model.py). From the end user’s perspective, an OfficialModel behaves like a Deployment — it has scaling, hardware selection, and version pinning — but the user interacts with it via the model name, not a deployment name. The actual infrastructure (K8s deployment, Director pods, Redis queues) runs through the backing Deployment.

ModelDeployment (stalled replacement) — Intended to replace OfficialModel by associating a Model with a Deployment (and optionally a Procedure) via a labeled relationship (model_deployment.py). The migration from OfficialModel to ModelDeployment was started but the work stalled and was rolled back. Both entities exist in the codebase and the dispatch logic checks for a default ModelDeployment first, falling back to OfficialModel (api_internal_views.py:303-326). TODO comments in the code (e.g. “Drop is_official from here once OfficialModels are gone” at model.py:1322) reflect the intended direction.

A model uses_versionless_api when it has either an OfficialModel or a default ModelDeployment (model.py:1321-1323).

Predictions

The default job kind. Director sends PUT /predictions/{id} to cog (cog/client.go:183-184).

Three prediction subtypes exist, distinguished by how the deployable is created:

Version predictions — Direct runs against a model version. Created when a user runs a prediction against a specific version ID. Key: vp-{docker_image_id[:32]} (version.py:618-619).

Deployment predictions — Runs through a named deployment. The deployment owns the scaling config, hardware selection, and version pinning. Kind is DEPLOYMENT_PREDICTION with key prefix dp- (deployment.py:611, deployment.py:1024-1027).

Hotswap predictions — A version that has additional_weights and a base_version. The base container stays running and loads different weights (e.g. LoRA adapters) at runtime instead of booting a new container per version.

Hotswap requirements (version.py:586-594):

  • Version has additional_weights
  • Version has a base_version
  • Version’s base docker image matches the base version’s docker image
  • Base version’s model is public, image is non-virtual, and image accepts replicate weights

Hotswap key: hp-{base_version.docker_image_id[:32]} (version.py:605-608). Multiple hotswappable versions sharing the same base version share the same key, so they share the same Director pool.

Hotswap uses prefer_same_stream=True (logic.py:1243), which tells Director’s Redis message queue to prefer consuming from the same stream shard repeatedly (redis/queue.go:248-251). This increases the chance that a Director instance serves the same version’s weights consecutively, avoiding unnecessary weight reloads. PreferSameStream is incompatible with batching (MaxConcurrentPredictions > 1) (config.go:90-92).
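The eligibility rules and key derivation above can be sketched as a single predicate. The dict fields here are simplified stand-ins for the real Django model attributes, not the actual schema.

```python
def hotswap_key(version):
    """Return the shared hp- key if the version is hotswappable, else None."""
    base = version.get("base_version")
    eligible = (
        bool(version.get("additional_weights"))
        and base is not None
        and version["docker_image"] == base["docker_image"]
        and base["model_is_public"]
        and not base["image_is_virtual"]
        and base["accepts_replicate_weights"]
    )
    if not eligible:
        return None  # falls back to an ordinary version prediction
    # All versions sharing a base share the same key -> same Director pool.
    return "hp-" + base["docker_image_id"][:32]
```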

Training

DIRECTOR_JOB_KIND=training. Director sends PUT /trainings/{id} (if cog >= 0.11.6), otherwise falls back to PUT /predictions/{id} (cog/client.go:279-311).

Training-specific behaviors:

  • Queue prefix is input:training: instead of input:prediction:
  • Timeout uses the higher of the prediction timeout and TrainTimeoutSeconds (24 hours) (deployable.go:1901)
  • Prediction.Destination field is present (specifies where trained weights go) (cog/types.go:182)
  • Concurrency is always 1 (deployable_config.py:417-418)
  • Webhook payload wraps the prediction in a Job with Kind: "training" and the prediction in the Training field instead of Prediction (worker.go:746-749)

Training is set via DeployableConfig.training_mode which maps to DeployableKind.VERSION_TRAINING (deployable_config.py:591-595).

Procedures

DIRECTOR_JOB_KIND=procedure. Director sends PUT /procedures/{id} (cog/client.go:54).

Procedures are multi-step workflows (pipelines) that run user-provided Python code on a shared runtime image (pipelines-runtime). Unlike predictions and trainings, the container image is not user-built — it’s a standard runtime that loads user code at execution time.

Procedure source loading:

Before executing, Director downloads the procedure source archive (tar.gz) from a signed URL, extracts it to a local cache directory, and rewrites the procedure_source_url in the prediction context to a file:// path (worker.go:540-548). The procedures.Manager handles download, caching (keyed by URL hash), and extraction (internal/procedures/manager.go).

Webhook compatibility hack:

When sending webhooks back to the API, Director overwrites the procedure job kind to prediction because the API doesn’t understand the procedure kind yet (worker.go:734-741).

Procedure subtypes in web:

  • Procedure — persistent, owned by a user/org, tied to a model. Uses DeployableKind.FUNCTION_PREDICTION (procedure.py:230).
  • EphemeralProcedure — temporary/draft, used for iteration. Source archives expire after 12 hours. Uses DeployableKind.VERSION_PREDICTION (procedure.py:42).

Both can route to the UNIFIED_PIPELINE_QUEUE if configured (deployable_config.py:573-584), which is a shared warm compute pool.

Workers AI Job Types

Workers AI has a single job type: inference. There is no training, no procedure/pipeline execution, and no multi-step workflow at the request level.

Task Types (Schema-Level Only)

worker-constellation-entry defines 14 task types in taskMappings (catalog.ts):

  • text-generation, text-classification, text-embeddings, text-to-image, text-to-speech
  • automatic-speech-recognition, image-classification, image-to-text, image-text-to-text
  • object-detection, translation, summarization, multimodal-embeddings
  • dumb-pipe (passthrough, no input/output transformation)

Each task type is a class that defines:

  • Input schema validation and preprocessing
  • Payload generation (transforming user input to backend format)
  • Output postprocessing

The task type is determined by the model’s task_slug from the config API. The infrastructure path (constellation-entry → constellation-server → GPU container) is identical regardless of task type. Task types are purely a worker-constellation-entry concern — constellation-entry and constellation-server are task-agnostic.

LoRA at Inference Time

constellation-server supports LoRA adapter loading for Triton-backed models (inference/triton.rs:142-214). When a request includes a lora parameter:

  1. Finetune config is fetched from the config API (Deus) via the model-repository crate
  2. Config is validated: finetune must exist for the current model, all required assets must be present
  3. lora_config and lora_name are injected as additional Triton inference inputs

This is inference-time adapter loading, not training. The base model stays loaded and the adapter is applied per-request. There is no LoRA-aware routing or stickiness — constellation-server’s endpoint selection and permit system are unaware of the lora field, so consecutive requests for the same adapter may hit different GPU containers.
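A hedged sketch of the per-request adapter injection described above: the injected input names (lora_config, lora_name) follow the text, while the finetune lookup shape is an assumption.

```python
def apply_lora(inputs, lora, finetunes, model_id):
    """Validate the finetune for this model and inject adapter inputs."""
    if lora is None:
        return inputs  # base model only
    config = finetunes.get((model_id, lora))
    if config is None or not config.get("assets_present"):
        raise ValueError(f"finetune {lora!r} unavailable for {model_id}")
    # The adapter is applied per-request as extra Triton inference inputs.
    return {**inputs, "lora_config": config["config"], "lora_name": lora}
```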

Async Queue

Two conditions must be met for async execution:

  1. The model must have use_async_queue enabled in the config API (Deus) — a per-model property, default false (config.rs:134)
  2. The requester must opt in per-request via options.queueRequest: true in the JSON body (ai.ts:133-144)

When both are true, the request is POSTed to the async-queue-worker service binding (queue.ts:23). The response is either 200 (completed immediately) or 202 (queued). To retrieve results, the requester sends a follow-up request with inputs.request_id set, which triggers a GET to the queue service (ai.ts:148-153).
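The double opt-in can be sketched as a routing decision: the model property and the request option must both be set. Key names mirror the text (use_async_queue, options.queueRequest); the function itself is illustrative.

```python
def route_request(model_config, body):
    """Decide whether a request goes to the async queue or runs sync."""
    wants_queue = body.get("options", {}).get("queueRequest") is True
    if model_config.get("use_async_queue") and wants_queue:
        # Streaming is deleted when queueing (no SSE through the queue).
        body.get("inputs", {}).pop("stream", None)
        return "async-queue"
    return "sync"
```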

Limitations:

  • Polling only — no webhook/callback support
  • No streaming — inputs.stream is explicitly deleted when queueRequest is true (ai.ts:135)

This contrasts with Replicate where every prediction is async by default (webhook-driven), and Prefer: wait is the opt-in for synchronous behavior. Workers AI is synchronous by default, with async as an opt-in that requires both model-level enablement and per-request signaling.

Key Differences

Job type diversity:

  • Replicate: Three distinct job kinds (prediction, training, procedure) with different cog endpoints, timeout behaviors, queue prefixes, and webhook structures
  • Workers AI: Single job type (inference) with task-type differentiation only at the input/output schema level

Training support:

  • Replicate: First-class training support with dedicated queue prefix, extended timeouts (24h), destination field for trained weights, and separate cog endpoint
  • Workers AI: No training capability. LoRA adapters are loaded at inference time from pre-trained weights, not trained on the platform

Multi-step workflows:

  • Replicate: Procedures allow user-provided Python code to run on a shared runtime, enabling multi-model orchestration and custom logic. Source code is downloaded and cached per-execution
  • Workers AI: No equivalent. Multi-step orchestration is left to the caller

Weight loading model:

  • Replicate: Hotswap mechanism allows a base container to serve multiple versions by loading different weights at runtime. Uses prefer_same_stream to optimize for cache locality across a shared Director pool
  • Workers AI: LoRA adapters are injected per-request at the Triton level. No concept of a shared base container serving multiple “versions” — each model ID maps to a fixed deployment

Job type awareness:

  • Replicate: The same infrastructure (cluster, Director, cog) handles all three job kinds. Job kind is a configuration axis that changes queue prefixes, cog endpoint paths, timeout calculations, and webhook payload structure — but the systems are the same
  • Workers AI: Task type is a thin layer in worker-constellation-entry only. Everything downstream (constellation-entry, constellation-server, GPU containers) is task-agnostic

Deployment Management


Overview

Replicate’s autoscaler is reactive: predictions create K8s Deployments on demand, queue depth drives scaling at 1-second resolution, and idle deployments get pruned automatically. Workers AI’s ai-scheduler is proactive: models are pre-provisioned with minimum instance counts, scaling adjusts within configured bounds at 5-minute resolution, and there is no idle-based pruning. The systems reflect fundamentally different assumptions — bursty ephemeral workloads vs. an always-available model catalog.


Replicate: Autoscaler

The autoscaler runs in every GPU model serving kubernetes cluster. It manages k8s Deployment objects in the models or serving namespaces — creating them when prediction traffic appears, scaling replica counts based on queue depth, and pruning idle ones.

Source: cluster/pkg/autoscaler/autoscaler.go (fa8042d)

Concurrent loops

| Loop | Interval | Purpose |
|---|---|---|
| Deployment Dispatcher | event-driven (BRPOP) | Creates/updates K8s Deployments when predictions arrive |
| Scaler | 1 second | Adjusts replica counts based on queue depth |
| Scale State Snapshotter | 1 second | Captures queue lengths + replica counts into Redis |
| Queue Pruner | 1 hour | Deletes stuck requests older than 24 hours |
| Deployment Pruner | 1 minute | Deletes K8s Deployments that have been idle too long |

Deployment Dispatcher

┌───────────────┐
│ replicate/api │
└──────┬────────┘
       │ LPUSH
       ▼
┌──────────────────────┐
│ prediction-versions  │
│ (Redis list)         │
└──────┬───────────────┘
       │ BRPOP
       ▼
┌──────────────────────┐
│ Deployment           │
│ Dispatcher           │
│ (event-driven)       │
└──────┬───────────────┘
       │ create/update
       ▼
┌──────────────────────┐
│ K8s Deployments      │
│ (models / serving)   │
└──────────────────────┘

startDeploymentDispatcher BRPOPs from the prediction-versions Redis queue. Each prediction triggers ensureDeployableDeployment, which creates or updates the K8s Deployment if the config has changed. Change detection uses a config hash comparison plus a template version serial.

Key behavior:

  • Deployments are created on demand — the first prediction for a version/deployment triggers K8s Deployment creation
  • Rate-limited to avoid overwhelming the K8s API
  • Config hash comparison means no-op if nothing changed
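The no-op detection can be sketched as follows: hash the rendered config deterministically and compare against what is recorded on the existing Deployment. The annotation keys here are assumptions for illustration, not the real ones.

```python
import hashlib
import json

def needs_update(config, template_serial, annotations):
    """True if the config hash or template serial differs from the
    values recorded on the existing K8s Deployment."""
    digest = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()
    return (annotations.get("config-hash") != digest
            or annotations.get("template-serial") != str(template_serial))
```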

Scaler

┌──────────────────────┐
│ K8s Deployments      │
│ (models / serving)   │
└──────┬───────────────┘
       │ read replicas + queue lengths
       ▼
┌──────────────────────┐
│ Snapshotter (1s)     │
└──────┬───────────────┘
       │ write
       ▼
┌──────────────────────┐
│ Redis cache          │
│ (scale state)        │
└──────┬───────────────┘
       │ read
       ▼
┌──────────────────────┐
│ Scaler (1s)          │
│ computeNewReplica    │
│ CountV2              │
└──────┬───────────────┘
       │ patch replicas
       ▼
┌──────────────────────┐
│ K8s Deployments      │
│ (models / serving)   │
└──────────────────────┘

startScaler runs every 1 second. For each deployable found in K8s, it loads the scale state from Redis cache and calls scaleDeployable().

The scaling algorithm (computeNewReplicaCountV2) is HPA-inspired:

desiredReplicas = currentReplicas × (metricValue / targetMetricValue)

Where the metric is backlog-per-instance:

backlogPerInstance = (queueLength + queueHeadroom) / instances

Three dampening mechanisms prevent oscillation:

  1. Scaling policies — rate limits on scale-out and scale-in. Default scale-out: allow +5 count or +100% per minute (whichever is larger). Default scale-in: unrestricted rate.
  2. Stabilization windows — the algorithm considers the min (for scale-in) or max (for scale-out) desired replica count over a time window. Default scale-out: 30 seconds. Default scale-in: 2 minutes.
  3. Hysteresis — ignore small oscillations below a threshold (default 0.02).
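A minimal sketch of the backlog-per-instance calculation with the hysteresis deadband; the stabilization windows and rate-limit policies listed above are omitted for brevity, and the scale-to-zero wake-up behavior is an assumption.

```python
import math

def desired_replicas(current, queue_length, queue_headroom,
                     target_backlog, hysteresis=0.02):
    """HPA-style: desired = current * (metric / target), with a deadband."""
    if current == 0:
        return 1 if queue_length > 0 else 0  # wake from scale-to-zero
    backlog_per_instance = (queue_length + queue_headroom) / current
    ratio = backlog_per_instance / target_backlog
    if abs(ratio - 1.0) <= hysteresis:
        return current  # ignore small oscillations
    return math.ceil(current * ratio)
```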

Additional scaling features:

  • Slow start: cap at 5 replicas until the first pod reports ready
  • Scale-to-zero: supported, with configurable idle timeout delay. Gated by scale-to-zero-delay kill switch.
  • Override min replicas: per-deployable minimum via DeployableConfig
  • Emergency cap: cap-max-replicas feature flag

Scaling configuration

scaling.Config holds per-deployable scaling parameters:

| Field | Default | Source |
|---|---|---|
| MetricTarget | config.BacklogPerInstance flag | CLI flag |
| Hysteresis | 0.02 | hardcoded |
| MinReplicas | 0 | DeployableConfig.ScalingConfig |
| MaxReplicas | config.MaxReplicas flag | CLI flag |
| ScaleOut behavior | 30s stabilization, +5 or +100%/min | algorithm_v2_defaults.go |
| ScaleIn behavior | 2min stabilization, no rate limit | algorithm_v2_defaults.go |

Per-deployable overrides come from deployable.ScalingConfig, which is set via the web’s DeployableConfig and serialized into K8s annotations.

Deployment Pruner

┌──────────────────────┐
│ Redis cache          │
│ (last-request-time)  │
└──────┬───────────────┘
       │ read
       ▼
┌──────────────────────┐
│ Deployment Pruner    │
│ (1min)               │
└──────┬───────────────┘
       │ delete idle
       ▼
┌──────────────────────┐
│ K8s Deployments      │
│ (models / serving)   │
└──────────────────────┘

startDeploymentPruner runs every 1 minute. It deletes K8s Deployments that haven’t received a prediction since DeployableDeploymentDeleteInactiveAfter. Max 100 deletions per cycle. Fails safe if the latest request time is missing from Redis (skips the deployment rather than deleting it).
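The fail-safe rule can be sketched as a single predicate: a deployment is deleted only when its last-request timestamp is known AND older than the idle threshold. The one-hour default here is illustrative, not the real DeployableDeploymentDeleteInactiveAfter value.

```python
from datetime import datetime, timedelta, timezone

def should_prune(last_request_at, now, idle_after=timedelta(hours=1)):
    """Prune only with positive evidence of idleness."""
    if last_request_at is None:
        return False  # missing data in Redis -> skip, never delete
    return now - last_request_at > idle_after
```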

Queue Pruner

startQueuePruner runs every 1 hour. Deletes stuck requests older than 24 hours from per-deployable Redis streams. The API’s sweeper also cleans these streams at much shorter intervals (30s). See Queue Management for details on both cleanup mechanisms and how they overlap.


Workers AI: ai-scheduler

Workers AI deployment management is split across multiple systems. ai-scheduler is a Rust binary deployed to a core datacenter K8s cluster (pdx-c). It has multiple subcommands, each run as a separate K8s deployment: AutoScaler, Scheduler, AdminAPI, ReleaseManagerWatcher, ExternalNodesWatcher, RoutingUpdater. They share the scheduling crate as a library.

Source: ai-scheduler/ (89f8e0d)

Architecture overview

┌─────────────────────────────────────────────────────────┐
│                      ai-scheduler                       │
│                                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │ auto_scaling │  │  admin_api   │  │ release_mgr   │  │
│  │ (5min loop)  │  │ (manual ops) │  │ _watcher      │  │
│  └──────┬───────┘  └──────┬───────┘  └──────┬────────┘  │
│         │                 │                 │           │
│         └────────┬────────┴─────────────────┘           │
│                  ▼                                      │
│         ┌──────────────┐                                │
│         │  scheduling  │  (action execution)            │
│         └──────┬───────┘                                │
│                │                                        │
└────────────────┼────────────────────────────────────────┘
                 │
        ┌────────┴────────┐
        ▼                 ▼
  Cloud Chamber      External K8s
  (internal GPU)     (OKE, CoreWeave, Nebius,
                      Lambda, GCP, Crusoe)
                          │
                          ▼
                  inference-kubernetes-engine
                  (Model CRD operator)

Three entry points produce scheduling actions:

  • auto_scaling — utilization-based autoscaler loop
  • admin_api — manual scaling endpoints for humans
  • release_manager_watcher — watches for software version changes, triggers rolling updates

All actions flow through the scheduling app, which applies them to either Cloud Chamber (internal capacity) or external K8s clusters via the external_nodes module.

Autoscaler (auto_scaling)

The autoscaler runs every 5 minutes. Each cycle:

  1. Fetches model usage from ClickHouse (request counts, inference time per minute over 15-minute windows)
  2. Fetches usage “forecast” from ClickHouse (same-time-last-week data, not a forecasting model)
  3. Fetches utilization metrics from Prometheus (soft — errors produce empty data, not failures)
  4. Loads model configuration from Config API
  5. Gets current application state from Cloud Chamber
  6. Fetches external endpoint health from Quicksilver and counts healthy deployments from Cloud Chamber
  7. Handles external capacity scheduling (may emit ScheduleExternalModelApplications)
  8. Computes desired instance count per model
  9. Emits UpscaleModelApplications or DownscaleModelApplications actions

Scaling algorithms

Three algorithms coexist. Selection depends on per-model Config API properties:

1. Request-count-based (default fallback):

instances = avg_count_per_min × autoscaling_factor

Default autoscaling_factor is 0.8. This assumes each inference request takes ~1 minute. Crude, but serves as the baseline.

2. Utilization-based (model_utilization_autoscaler = true):

Computes utilization from cumulative inference time:

required_to_fit = inference_time_secs_per_min / 60 / max_concurrent_requests
utilization = required_to_fit / model_instances

Uses a deadband instead of dampening:

  • If utilization < 1 / (factor + 0.15) → downscale
  • If utilization > 1 / factor → upscale
  • Otherwise → no change

Minimum factor is clamped to 1.2 (20% overprovisioning floor). Takes the max of current and forecast inference time.
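The two algorithms above can be sketched together. The 0.8 factor, the 1.2 floor, and the +0.15 deadband offset follow the text; everything else (names, return conventions) is illustrative.

```python
import math

def request_count_instances(avg_count_per_min, factor=0.8):
    """Algorithm 1: assumes each request occupies an instance ~1 minute."""
    return math.ceil(avg_count_per_min * factor)

def utilization_decision(inference_secs_per_min, max_concurrent,
                         instances, factor):
    """Algorithm 2: utilization deadband instead of dampening."""
    factor = max(factor, 1.2)  # 20% overprovisioning floor
    required_to_fit = inference_secs_per_min / 60 / max_concurrent
    utilization = required_to_fit / instances
    if utilization < 1 / (factor + 0.15):
        return "downscale"
    if utilization > 1 / factor:
        return "upscale"
    return "hold"  # inside the deadband
```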

3. EZ utilization-based (scaling_config.utilization_based.active = true):

The newest algorithm. Configurable per-model via ScalingConfig:

scaling_config:
  disabled: false
  utilization_based:
    active: true
    min_utilization: 0.3
    max_utilization: 0.8
    utilization_scale: 1.0
    use_forecast: true
    use_out_of_cap: true
    prometheus: false

Features:

  • Configurable min/max utilization bounds (replaces hardcoded deadband)
  • Optional forecast-based scaling (use_forecast)
  • Out-of-capacity adjustment (use_out_of_cap): inflates measured utilization by (successful + out_of_cap) / successful to account for requests turned away. successful_per_min floored at 0.1 to avoid division by zero.
  • Safety cap: when OOC adjustment is active, measured utilization capped at 2× the sans-OOC value
  • Prometheus metrics as an alternative utilization source
  • Asymmetric instance base: upscale decisions use model_healthy_instances, downscale uses model_requested_instances
  • Unhealthy instance handling: adds +1 buffer when upscaling with unhealthy instances; gradually decrements requested count when >1 unhealthy instances exist

When active, this algorithm overrides the request-count-based algorithm. However, if model_utilization_autoscaler is also true, the older utilization-based algorithm takes final precedence — the priority order is: request-count → EZ utilization → old utilization.
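The precedence can be restated as: the request-count result is the base, EZ utilization overrides it, and the old utilization algorithm overrides both when its flag is set. The config key names mirror the Config API properties named in the text; the function is a sketch.

```python
def select_algorithm(props):
    """Pick the scaling algorithm for a model's Config API properties."""
    algo = "request-count"
    ez = props.get("scaling_config", {}).get("utilization_based", {})
    if ez.get("active"):
        algo = "ez-utilization"
    if props.get("model_utilization_autoscaler"):
        algo = "old-utilization"  # takes final precedence
    return algo
```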

Instance bounds

All algorithms clamp results to [min_count, max_count] from Config API:

  • Default min_count: 5 (set per autoscaler instance via CLI flag default_min_count)
  • Default max_count: 100
  • Downscale batch size capped at 10 per cycle

There is no idle-based scale-to-zero. Models stay provisioned at min_count (default 5, but can be set to 0 per-model).
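The clamping can be sketched as follows; the min/max defaults and the downscale batch cap of 10 per cycle follow the text, while the exact order of operations is an assumption.

```python
def apply_bounds(current, desired, min_count=5, max_count=100,
                 downscale_batch=10):
    """Clamp a desired instance count to [min, max] and cap downscales."""
    desired = max(min_count, min(desired, max_count))
    if desired < current:
        # Never remove more than the batch cap in one 5-minute cycle.
        desired = max(desired, current - downscale_batch)
    return desired
```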

Kill switches

  • disable_scaling — global Config API property, disables all scaling
  • scaling_config.disabled — per-model disable
  • Tier change grace period — skips models that recently changed tiers
  • Tier selector — autoscaler instances can be scoped to specific tier ranges

Scheduling actions

Actions are applied via Cloud Chamber API or external K8s:

| Action | Target | Effect |
|---|---|---|
| UpscaleModelApplications | Cloud Chamber | PATCH application instances count up |
| DownscaleModelApplications | Cloud Chamber | PATCH application instances count down |
| ScheduleExternalModelApplications | External K8s | Create/patch Model CRD replicas |
| CreateModelApplication | Cloud Chamber | Create new CC application + set instances |
| DeployModelApplicationToTiger | Cloud Chamber | Deploy to Tiger (canary) environment |
| DeleteDeployment | Cloud Chamber | Delete specific deployment |
| RemoveModelApplication | Cloud Chamber | Delete entire application |
| ModifyApplicationRegions | Cloud Chamber | Modify region constraints |
| ModifyApplicationSchedulingPriority | Cloud Chamber | Modify scheduling priority |
| ModifyApplicationAffinities | Cloud Chamber | Modify colocation/affinity constraints |

Instance count is clamped to 0–1400 per application. Up/downscale distributes changes round-robin across applications.

Actions are defined in the Action enum.

External capacity

External capacity (OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe) has a separate management model controlled by the ExternalCapacity config (Management enum):

pub enum Management {
    Disabled,   // no auto-management
    Manual,     // humans manage it entirely
    Upscaling,  // autoscaler can only scale UP
    Full,       // autoscaler can scale both directions
}

The autoscaler code has this comment:

NOTE: currently auto-scaler can only scale up external instances but not scale down. In order to scale down, use admin api endpoint: /models/schedule_externally

This comment may be stale — Management::Full supports both directions in the code. The real constraint is that the autoscaler’s scaling algorithms don’t dynamically compute external replica targets; external capacity is config-driven (expected_replicas), not utilization-driven.

The external path works as follows:

  1. Infrastructure: K8s clusters provisioned via Terraform (oci-terraform/ repo — OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe)
  2. Model operator: inference-kubernetes-engine watches Model CRDs and reconciles K8s Deployments/Services to match
  3. Scaling up: autoscaler patches Model CRD spec.replicas via KubernetesProvider.schedule()
  4. Scaling down: manual via admin API or the scale-down CLI tool in inference-kubernetes-engine, which gradually decrements replicas with a configurable interval and batch size

Admin API

Manual scaling endpoints behind is_workers_ai_team() access check:

  • GET /models/scale — upscale a model by amount (creates UpscaleModelApplications action; notably a GET for a mutating op)
  • POST /models/schedule — run the full scheduler loop for a single model
  • POST /models/schedule_externally — set external replica count
  • POST /models/schedule_externally_on_specific_provider — target specific provider
  • POST /models/remove — delete specific deployment by colo/metal
  • POST /models/delete_applications — delete all applications for a model, including external CRDs
  • GET /models/status — model status/debugging
  • Tiger management — create, list, delete canary deployments

Model provisioning

Unlike Replicate, Workers AI models are pre-provisioned rather than created on demand from inference traffic:

  1. Model is registered in Config API with properties (min_count, max_count, software, gpu_memory, etc.)
  2. release_manager_watcher detects the software version and creates Cloud Chamber applications
  3. Autoscaler maintains instance count within [min_count, max_count] based on utilization
  4. There is no equivalent of Replicate’s deployment pruner — models stay provisioned at min_count until manually removed

Config API properties (scaling-relevant)

| Property | Default | Description |
|---|---|---|
| min_count | 5 | Minimum instances |
| max_count | 100 | Maximum instances |
| scaling_factor | 0.8 | Autoscaling factor |
| model_utilization_autoscaler | false | Enable utilization-based algorithm |
| scaling_config | none | YAML blob with disabled, utilization_based sub-config |
| disable_scaling | false | Kill switch (also available as global property) |
| external_capacity | none | External provider config with management mode |
| gpu_memory | 22 | GPU memory request (GB) |
| gpu_model | none | Specific GPU model requirement |
| dual_gpu | false | Requires two GPUs |
| colo_tier | none | Restrict to colo tier |
| colo_region | none | Restrict to colo region |
| tier | "unknown-scheduler" | Model tier (Tier-0, Tier-1, Tier-2) |

Key Differences

| Aspect | Replicate | Workers AI |
|---|---|---|
| Scaling signal | Queue depth (backlog per instance) — real-time | Inference time utilization — 15-minute ClickHouse windows |
| Loop frequency | 1 second | 5 minutes |
| Deployment creation | On demand from prediction traffic | Pre-provisioned via Config API + release manager |
| Scale-to-zero | Yes, with idle timeout | No idle-based scale-to-zero; min_count defaults to 5 but can be 0 per-model |
| Deployment pruning | Automatic (idle deployments deleted after timeout) | None — models stay provisioned until manually removed |
| Dampening | HPA-style: scaling policies, stabilization windows, hysteresis | Deadband: upper/lower utilization bounds |
| Orchestrator | Direct K8s API (Deployments) | Cloud Chamber (internal) + K8s operator (external) |
| External capacity | N/A (single K8s cluster) | Multi-provider (OKE, CoreWeave, Nebius, Lambda, GCP, Crusoe) with per-model management mode |
| Manual controls | Feature flags (cap-max-replicas, scale-to-zero-delay) | Admin API endpoints, disable_scaling kill switch, Management::Manual mode |
| Scale-down on external | N/A | Config-driven; Management::Full supports both directions but targets are set manually, not utilization-driven |
| Algorithm selection | Single algorithm (HPA-inspired) | Three algorithms with priority ordering: request-count → EZ utilization → old utilization |
| Config source | K8s annotations (from web's DeployableConfig) | Config API properties (key-value store) |

Architectural contrast

Replicate’s autoscaler is reactive and fine-grained: predictions create deployments, queue depth drives scaling at 1-second resolution, idle deployments get pruned. The system assumes workloads are bursty and ephemeral.

Workers AI’s ai-scheduler is proactive and coarse-grained: models are pre-provisioned with minimum instance counts, scaling adjusts within configured bounds at 5-minute resolution, and external capacity management is heavily human-assisted. The system assumes a catalog of always-available models.

Resource Allocation


Overview

Both platforms need to map model requirements to GPU hardware, but they approach it from opposite directions. Replicate exposes hardware selection to users as a first-class concept — users pick a SKU, and the platform provisions dedicated resources. Workers AI abstracts hardware entirely — operators configure GPU memory requirements per model in Config API, and the platform handles placement across a shared fleet.


Replicate Resource Model

Hardware SKUs

The Hardware Django model (web/models/hardware.py) defines every available hardware option as a SKU. Each SKU combines a compute unit (GPU type) with a compute units per instance count (how many GPUs per pod).

Current compute units (cluster/pkg/kubernetes/compute_unit.go):

| Compute Unit | GPU | Status |
|---|---|---|
| cpu | None | Active |
| gpu-t4 | Nvidia T4 (16 GB) | Active |
| gpu-a100 | Nvidia A100 (40 GB) | Active |
| gpu-a100-80g | Nvidia A100 (80 GB) | Active |
| gpu-h100 | Nvidia H100 (80 GB) | Active |
| gpu-h200 | Nvidia H200 (141 GB) | Active |
| gpu-l40s | Nvidia L40S (48 GB) | Active |
| gpu-a40, gpu-a40-small, gpu-a40-large | Nvidia A40 (48 GB) | Legacy |
| gpu-t4-highmem, gpu-t4-lowmem | Nvidia T4 variants | Legacy |
| gpu-rtx-a4000, gpu-rtx-a5000, gpu-rtx-a6000 | Nvidia RTX Axxxx | Legacy |
| gpu-flex-ampere-min-40g | Any Ampere ≥40 GB | Legacy |

Multi-GPU SKUs use the same compute unit with a higher compute_units_per_instance. For example, gpu-2x-a100 is compute unit gpu-a100 with compute_units_per_instance=2. The Hardware model tracks this as a separate SKU with its own pricing.

Hardware availability is gated by flags on the model: allow_for_models, allow_for_deployments, is_legacy, is_preview. The HardwareQuerySet methods (available_for_models(), available_for_deployments()) filter to billable, non-legacy, public hardware (web/models/hardware.py:162-166).

K8s Resource Limits

The computeUnitLimits map (cluster/pkg/kubernetes/resources.go) defines CPU, memory, and GPU limits per compute unit:

| Compute Unit | CPU | Memory | GPUs |
|---|---|---|---|
| cpu | 1 | 2 Gi | 0 |
| gpu-t4 | 4 | 16 Gi | 1 |
| gpu-a100 | 10 | 72 Gi | 1 |
| gpu-a100-80g | 10 | 144 Gi | 1 |
| gpu-h100 | 13 | 144 Gi | 1 |
| gpu-h200 | 13 | 144 Gi | 1 |
| gpu-l40s | 10 | 72 Gi | 1 |

For multi-GPU pods, limits are multiplied by compute_units_per_instance with special cases for 8x configurations:

  • 8x A40/A40-Large: 6 CPU, 85 Gi per unit (reduced from 10/72)
  • 8x A100-80G: 10 CPU, 120 Gi per unit (reduced from 10/144)
  • T4 requests: memory request fudged to 13 Gi (limit stays 16 Gi) to fit 4x T4 on a node
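The multiplication and its 8x special cases can be sketched as follows. The per-unit values come from the limits table and bullets above; the function and dictionary names are illustrative (the real logic lives in the Go computeUnitLimits map), and the T4 memory-request fudge is not modeled.

```python
# Sketch of per-pod resource limit computation. Values per compute unit are
# (cpu, memory_gi, gpus); names here are hypothetical, not Replicate's code.
BASE_LIMITS = {
    "cpu":          (1, 2, 0),
    "gpu-t4":       (4, 16, 1),
    "gpu-a100":     (10, 72, 1),
    "gpu-a100-80g": (10, 144, 1),
    "gpu-h100":     (13, 144, 1),
    "gpu-h200":     (13, 144, 1),
    "gpu-l40s":     (10, 72, 1),
    "gpu-a40":      (10, 72, 1),   # legacy; base inferred from "reduced from 10/72"
}

# Reduced per-unit (cpu, memory_gi) for specific 8x configurations.
EIGHT_X_OVERRIDES = {
    "gpu-a40":      (6, 85),
    "gpu-a100-80g": (10, 120),
}

def pod_limits(unit: str, count: int) -> tuple[int, int, int]:
    """Total (cpu, memory_gi, gpus) limits for a pod of `count` compute units."""
    cpu, mem, gpus = BASE_LIMITS[unit]
    if count == 8 and unit in EIGHT_X_OVERRIDES:
        cpu, mem = EIGHT_X_OVERRIDES[unit]
    return (cpu * count, mem * count, gpus * count)
```

For example, an 8x A100-80G pod lands at 80 CPU / 960 Gi rather than the naive 80 CPU / 1152 Gi, which is what lets it fit on a node.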

Director’s sidecar container adds its own overhead: 256 Mi for most GPUs, 512 Mi for L40S/H100/A100/A100-80G, or 1 Gi when DirectorExtraMemory is set (deployable.go:1450-1465).

The model container also gets /dev/shm sized to 50% of its memory limit (deployable.go:681-686).

Node Placement and Bin-Packing

Replicate runs on three cloud providers (config/config.go):

  • CoreWeave CKS — primary, most models
  • Nebius MK8S — H200 capacity
  • GKE — T4 capacity

On CoreWeave CKS, placement uses a combination of node affinity, bin-packing, priority classes, and topology spread (deployable_cks.go):

Node affinity (required): pods are pinned to nodes matching the GPU class label (gpu.nvidia.com/class). CPU models go on GPU nodes but are excluded from H100/H200 nodes to preserve expensive capacity. Procedure (pipeline) workloads go to a dedicated customer-cpu-nodepool.

Bin-packing: GPU models use CoreWeave’s binpack-scheduler to pack pods tightly, leaving room for 4x and 8x pods. Pod affinity preferences (weight 10) group pods of the same compute unit on the same node. A second pod affinity groups pods by grace period (standard vs extended predict timeout) so that preemption doesn’t kill long-running predictions.

Priority node pools: 8x GPU pods get a node preference (weight 100) for dedicated priority node pools. Non-8x pods prefer to avoid these pools. On-demand node pools are also deprioritized (weight 100 against).

Priority classes:

  • r8-high — 4x+ GPU pods that are allowed to preempt others
  • r8-high-no-preempt — 4x+ GPU pods that won’t preempt
  • r8-cpu-model — CPU models, lowest priority, preemptible by GPU

Topology spread: CPU models on CKS use TopologySpreadConstraints with maxSkew=2 to prevent bunching on a single node and starving GPU models of CPU resources.

On Nebius MK8S, placement is similar to CKS but narrower in scope (deployable_mk8s.go):

GPU support: H200 only. The switch statement handles ComputeUnitCPU and ComputeUnitH200 — everything else gets a NO_SUCH_GPU sentinel that prevents scheduling. Node affinity uses the standard K8s label node.kubernetes.io/instance-type (value gpu-h200-sxm).

Bin-packing: Same pod affinity logic as CKS — non-8x GPU pods get compute-unit bin-packing (weight 10) and grace-period bin-packing (weight 10). No custom scheduler though — uses the default K8s scheduler.

CPU models: Excluded from H200 nodes via NodeSelectorOpNotIn on the GPU class label, but no dedicated CPU node pool — they land on whatever non-H200 nodes are available.

On GKE, placement is minimal (deployable_gke.go):

Node affinity (required): pods are pinned to nodes matching a replicate/role label with value model-{compute_unit} (e.g. model-gpu-t4). This is a Replicate-defined label on the node pool, not a vendor GPU class label.

Tolerations: Three tolerations per pod — the default GKE nvidia.com/gpu taint, a legacy model-hardware taint, and a newer model-compute-unit taint. The code comments indicate the compute-unit taint is the intended replacement for the hardware taint.

Tenancy Model

The isolation boundary is the deployable (model version or deployment), not the account. Each deployable gets its own K8s Deployment with dedicated pods and exclusive GPU access. But multiple accounts can send prediction requests to the same deployable — the GPU pods themselves are shared across all authorized requesters.

Public models and deployments are multi-tenant: any account can run predictions against them, and all requests land on the same pool of Director pods. This is the common case for official models.

Private models and deployments are single-tenant by default: only the owning account can run predictions. Access can be granted to other accounts via an allow-list, similar to GitHub’s private repository permissions model. The access control is enforced at the API layer (web/api check permissions before enqueuing) — there is no infrastructure-level isolation (no separate namespaces, no network policies between tenants).

The “shared” vs “dedicated” billing distinction (instance_tenancy in Metronome pricing config) is orthogonal to tenancy. “Shared” means serverless pay-per-run pricing; “dedicated” means the customer pays for reserved uptime on a deployment. Both billing models serve requests from potentially multiple accounts on the same GPU pods.


Workers AI Resource Model

GPU Types and Memory Sizing

Workers AI models specify GPU requirements in Config API (config_api/src/lib.rs):

| Field | Default | Description |
|---|---|---|
| gpu_memory | 22 (GB) | GPU memory requested |
| cpu_memory | same as gpu_memory | CPU memory override |
| vcpu | (none) | CPU core request |
| dual_gpu | false | Needs 2 GPUs |
| gpu_model | (none) | GPU model override, e.g. "NVIDIA H100" |

The platform supports three GPU types (cloudchamber/src/application.rs:259-263):

  • L4 — Nvidia L4 (internal colos)
  • H100 — Nvidia H100 80 GB (internal colos + external capacity)
  • H200 — Nvidia H200 141 GB (external capacity)

GPU type is inferred from the gpu_model string in the application config. If no model is specified, H100 is assumed (application.rs:286-291).

Multi-GPU allocation is calculated from total gpu_memory divided by per-GPU maximum: 80 GB for H100, 141 GB for H200. The result is clamped to 1–8 GPUs (external_nodes/.../model.rs:27-34).
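As a rough sketch of that calculation — assuming ceiling division, since the exact rounding in model.rs is not shown here:

```python
import math

# Per-GPU memory maxima from the section above; the helper name is illustrative.
PER_GPU_MEMORY_GB = {"H100": 80, "H200": 141}

def gpu_count(gpu_memory_gb: int, gpu_type: str = "H100") -> int:
    """GPUs needed for a model: total gpu_memory / per-GPU max, clamped to 1-8."""
    needed = math.ceil(gpu_memory_gb / PER_GPU_MEMORY_GB[gpu_type])
    return max(1, min(8, needed))
```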

Unlike Replicate, users never select hardware. The model’s GPU requirements are set by Workers AI operators via Config API, and the platform handles placement.

Placement and Scheduling

Workers AI uses two scheduling paths:

Internal capacity (Cloudchamber): Models are scheduled via Cloudchamber, Cloudflare’s internal container orchestrator. The create_application_request function (application.rs:247-255) maps model config to a Cloudchamber application with SchedulingPolicy::Gpu, placement constraints (colo tier, region, PoPs), and optional scheduling priority.

Cloudchamber’s placement is controlled by constraints on the application object. These map from Config API properties to ApplicationConstraints fields (application.rs:216-233):

Colo tier (colo_tier): Cloudflare datacenters are grouped into tiers by GPU capacity. Setting colo_tier restricts a model’s instances to datacenters at that tier. For example, when deploying in “tiger mode” (canary), default single-GPU models are pinned to tier 3 (application.rs:231-232).

Colo region (colo_region): Restricts instances to datacenters in specific geographic regions. Accepts a comma-separated list.

Colo PoPs (colo_pops): The most specific constraint — restricts instances to named Points of Presence (individual datacenters).

Scheduling priority (scheduling_priority): Sets the Cloudchamber scheduling priority for the application. Default is 50. Currently the only non-default value is Leonardo = 75, used for Leonardo partnership models to give them preferential placement (scheduling_priority.rs).

Separately, blacklisted_colos removes specific colos from the model’s routing table (not scheduling). This is consumed by the routing app when building the colo list that constellation-entry uses for request forwarding — it doesn’t affect where Cloudchamber places instances (routing/src/lib.rs:201-208).

External capacity (Kubernetes): For external GPU providers (OCI, etc.), ai-scheduler creates a Model custom resource (external_nodes/.../model.rs:73-150) that the IKE operator reconciles into K8s Deployments. GPU limits are set as nvidia.com/gpu resource requests when >1 GPU is needed.

Tenancy Model

Workers AI is multi-tenant at the model level. The default is that all accounts share the same model instances — a single GPU container serving @cf/meta/llama-3.1-8b-instruct handles requests from every account. The isolation boundary is the model ID, not the caller.

However, models can be restricted to specific accounts. Two mechanisms exist in worker-constellation-entry (ai.ts:49-64):

  • allowed_accounts — a comma-separated list of account IDs in Config API. When set, only listed accounts can use the model. Requests from other accounts get a 403. This is how partnership models (e.g. @cf/leonardo/phoenix-1.0) are restricted to the partner’s account.
  • is_private — a boolean flag marking a model as private.

Both are enforced at the API layer in worker-constellation-entry, not at the infrastructure level. A “private” Leonardo model still runs on the same GPU fleet as public models — the access gate is just earlier in the request path. This parallels Replicate’s approach where private model access control is also API-layer only.

Since all accounts share the same GPU instances, fairness is enforced at the request level — per-account fair queuing and rate limiting in constellation-server. See Queue Management for details.


Key Differences

| Aspect | Replicate | Workers AI |
|---|---|---|
| Hardware selection | User-facing SKU picker | Operator-configured, abstracted from users |
| GPU types | T4, A100 (40/80), H100, H200, L40S + legacy | L4, H100, H200 |
| Multi-GPU | compute_units_per_instance (1–8) | gpu_memory / per-GPU max (1–8) |
| Resource limits | Explicit per-unit CPU/mem/GPU in Go map | GPU memory-based, CPU/mem optional |
| Isolation boundary | Deployable (model version or deployment) — dedicated GPU per deployable, multiple accounts share it | Model ID — all accounts share the same instances, no per-account isolation |
| Access control | API-layer: public models open to all, private models gated by allow-list | API-layer: public models open to all, allowed_accounts / is_private restrict partnership and private models |
| Fairness | No fairness controls — all requests to a deployable are equal | Per-account fair queuing + rate limiting (see Queue Management) |
| Scheduling | K8s with bin-pack scheduler + affinity | Cloudchamber (internal) + K8s via IKE (external) |
| Cloud providers | CoreWeave CKS, Nebius, GKE (legacy) | Cloudflare internal colos + external (OCI, etc.) |
| Preemption | Priority classes, configurable per-deployable | Scheduling priority in Cloudchamber |
| Placement constraints | GPU class node affinity, priority node pools | Colo tier/region/PoP (scheduling); blacklisted colos (routing only) |

The fundamental architectural difference is where the isolation boundary sits. Replicate isolates at the deployable level — each model version or deployment gets dedicated GPU pods, but multiple accounts can share those pods (public models) or access is allow-listed (private models). Workers AI isolates at the model level — all accounts share the same instances for a given model, with fairness enforced at the request level via queuing and rate limiting rather than resource partitioning.

Queue Management

Queue Management

Overview

Both platforms queue inference requests when backend capacity is saturated, but the queue architectures are fundamentally different. Replicate uses Redis Streams as a durable, distributed queue between the API layer and Director workers — requests persist across process restarts and can be claimed by any Director instance. Workers AI uses bounded in-memory queues local to each constellation-server instance — requests exist only in the process that received them and are lost if it restarts.


Replicate: Redis Streams with Consumer Groups

Queue Naming

Every deployable gets its own Redis Stream. The queue name is derived from the deployable’s key prefix and kind (deployable_config.py:565-589):

| DeployableKind | Key Prefix | Queue Name |
|---|---|---|
| deployment-prediction | dp- | input:prediction:dp-{hex} |
| function-prediction | fp- | input:prediction:fp-{hex} |
| version-prediction | vp- | input:prediction:vp-{hex} |
| version-training | vt- | input:training:vt-{hex} |

Function and version predictions that use the unified pipeline cluster config share a single queue: input:prediction:fp-{0*32} (deployable_config.py:44-45).
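The mapping can be sketched as a small lookup — a hypothetical helper mirroring deployable_config.py, not its actual code:

```python
# Kind -> (key prefix, workload segment), per the table above.
KEY_PREFIXES = {
    "deployment-prediction": ("dp", "prediction"),
    "function-prediction":   ("fp", "prediction"),
    "version-prediction":    ("vp", "prediction"),
    "version-training":      ("vt", "training"),
}

def queue_name(kind: str, deployable_hex: str) -> str:
    """Derive the Redis Stream name for a deployable's queue."""
    prefix, workload = KEY_PREFIXES[kind]
    return f"input:{workload}:{prefix}-{deployable_hex}"
```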

MessageQueue

Director’s MessageQueue (director/redis/queue.go) wraps a Redis Streams consumer group. Each Director pod is a consumer in the "director" consumer group. The queue runs two goroutines via errgroup:

  • Fetcher: Waits on a request channel, calls XREAD (via the QueueClient.Read interface) with the configured block duration, and sends the result back on a response channel. The fetcher is the only goroutine that reads from Redis — GetMessage sends a request struct and waits for the response (queue.go:131-148).

  • Claimer: Runs every 2 seconds (claimInterval), calling XCLAIM on all unacked messages to maintain ownership. This prevents Redis from reassigning messages to other consumers in the group while the current Director is still processing them. After the fetcher stops (shutdown requested), the claimer continues until all messages are acked or the context is canceled (queue.go:150-191).

Messages are tracked in an unackedMessages map keyed by unique ID (stream name + message ID). When a Director worker finishes processing a prediction, it calls Ack on the message, which removes it from the map. If a Director pod crashes, its unacked messages remain in the consumer group’s pending entries list (PEL). The API sweeper reclaims these orphaned messages via XPENDING + XCLAIM after 30 seconds of idle time, marks the predictions as failed, and deletes the messages. See Queue Sweeping below.

Sharded Queues

Each deployable’s queue is sharded across multiple Redis streams (:s0, :s1, etc.). The ShardedQueueClient (queue.go:317-356) manages these shards, using "director" as the consumer group name across all of them.

Every ack pipelines XACK + XDEL — the message is both acknowledged in the consumer group and deleted from the stream in a single round-trip (queue.go:329-341).

preferSameStream

When preferSameStream is enabled, the MessageQueue remembers which stream shard it last read from and passes it as PreferStream to the next Read call (queue.go:246-252). This is a hotswap optimization — by preferring the same shard, the Director pod improves weight cache locality (the model weights are already loaded for predictions from that shard’s deployable).

The streamChanged flag on the message indicates when the Director switched to a different stream, which signals the worker loop that it may need to handle a model swap.

MultiMessageQueue (Blue/Green)

MultiMessageQueue (queue.go:73-77) wraps multiple Queue backends and rotates between them on each GetMessage call. When multiple backends are present, the block duration is hard-coded to 100ms to stay responsive regardless of which backend has messages (queue.go:300-308).

This supports blue/green Redis migration — during a migration, both the old and new Redis instances are active, and the MultiMessageQueue polls both. Messages carry an explicit reference to the queue they came from so that acks and claims go to the correct backend.

Queue Pruning (Autoscaler)

The autoscaler’s startQueuePruner (autoscaler.go:299-311) runs hourly, scanning every deployable’s queue for stuck messages. A message is “stuck” if it’s older than 24 hours (stuckTimeout) (prune.go:25).

The pruner checks two sources per queue (prune.go:118-148):

  • XPENDING: Messages that were delivered to a consumer but never acknowledged — these were likely picked up by a Director that crashed or got stuck during processing.
  • XRANGE: Messages still in the stream that were never delivered to any consumer — these were likely enqueued when no Director pods were running (e.g. setup failed and the deployment scaled to zero).

Stuck messages are deleted in chunks of 50 via pipelined XACK + XDEL (prune.go:151-161).

Queue Sweeping (API)

The API service runs its own sweeper (api/internal/sweeper/sweeper.go) that overlaps with the autoscaler’s pruner but operates differently:

  • Frequency: Every ~30 seconds (with ±50% jitter), vs the pruner’s hourly cycle.
  • Scope: Scans all streams matching each deployable kind prefix (deployment-prediction, function-prediction, hotswap-prediction, version-prediction, version-training).
  • Mechanism: Uses ClaimStaleMessages (XPENDING filtered by 30-second idle threshold, then XCLAIM) to take ownership of messages that a Director claimed but stopped processing — e.g. because the pod crashed. Then sends a failure update to the web service and deletes the message.
  • Purpose: Catches predictions where the Director that claimed them is gone. The sweeper marks them as failed so the user gets a response rather than waiting indefinitely. Messages that were never claimed by any Director (sitting unclaimed in the stream) are not visible to the sweeper — those are handled by the autoscaler’s pruner after 24 hours.

The sweeper also runs two deadline-related loops:

  • Cordon loop (~1 second): Queries the meta:deadlines sorted set for predictions whose deadline has passed within the last 10 seconds. Deadlines are stored as scores at enqueue time. Matched predictions are moved to a presorted garbage set and terminated with status aborted via TerminatePrediction. This handles predictions still sitting in the queue (or still running) when their time expires.
  • GC loop (~30 seconds): Processes the presorted garbage set, deleting the actual stream messages for predictions that were cordoned or otherwise terminated.

These three loops — sweep, cordon, GC — cover different failure modes. The sweep loop handles crashed Directors (idle PEL entries). The cordon loop enforces prediction deadlines regardless of Director state. The GC loop cleans up stream messages after either of the other two has terminated the prediction. The autoscaler’s hourly pruner is a slower safety net for messages that somehow survive all three.
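A minimal sketch of one cordon-loop iteration, modeling the meta:deadlines sorted set as a dict of prediction ID to deadline timestamp (Redis stores the deadline as the member's score; the function name and structure here are illustrative):

```python
def cordon_pass(deadlines: dict, garbage: set, now: float, window: float = 10.0):
    """One cordon iteration: move predictions whose deadline passed within the
    last `window` seconds into the garbage set for the GC loop to clean up.
    Entries older than the window are assumed handled by earlier passes."""
    expired = [pid for pid, ts in deadlines.items() if now - window <= ts <= now]
    for pid in expired:
        del deadlines[pid]
        garbage.add(pid)   # real impl also terminates the prediction as aborted
    return expired
```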

Consumer Pruning (API)

A separate ConsumerPruner (api/internal/sweeper/consumer_pruner.go) removes stale consumers from consumer groups. Consumers that haven’t claimed a message in 24 hours (staleConsumerTimeout) are pruned. This cleans up after Director pods that were terminated without gracefully leaving the consumer group.


Workers AI: Bounded In-Memory Queues

ModelRequestQueue

ModelRequestQueue (queue.rs) is an in-memory queue per model per constellation-server instance. It’s created when an EndpointGroup is initialized and holds requests that couldn’t immediately acquire a permit on any endpoint.

The queue supports two algorithm variants, selectable per model via Config API:

  • Fifo (fifo_queue.rs): Simple VecDeque<(account_id, Sender)>. Requests are processed in arrival order. All accounts share a single queue.
  • Fair (fair_queue.rs): Per-account VecDeque<Sender> queues stored in a DashMap, with a round-robin VecDeque<account_id> that rotates through accounts. An account is added to the rotation on its first enqueue and removed when its queue empties.

Both variants retain only live senders — abandoned slots (where the requester timed out or disconnected) are cleaned up on each enqueue call via retain(|q| !q.is_closed()).
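The Fair variant's round-robin behavior can be sketched like this — a simplified Python model of the Rust per-account-queues-plus-rotation structure, ignoring sender liveness and capacity:

```python
from collections import deque

class FairQueue:
    """Sketch of the Fair algorithm: per-account queues plus a round-robin
    rotation of account IDs. Names are illustrative, not the Rust API."""
    def __init__(self):
        self.queues = {}          # account_id -> deque of requests
        self.rotation = deque()   # account_ids in round-robin order

    def enqueue(self, account_id, request):
        if account_id not in self.queues:
            self.queues[account_id] = deque()
            self.rotation.append(account_id)   # first enqueue joins the rotation
        self.queues[account_id].append(request)

    def dequeue(self):
        if not self.rotation:
            return None
        account_id = self.rotation.popleft()
        request = self.queues[account_id].popleft()
        if self.queues[account_id]:
            self.rotation.append(account_id)   # still has work: rotate to the back
        else:
            del self.queues[account_id]        # queue emptied: leave the rotation
        return request
```

With two requests from account a1 queued ahead of one from a2, dequeue order interleaves the accounts (r1, r3, r2) rather than draining a1 first as FIFO would.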

Capacity

Queue capacity is calculated from model config (service-discovery/lib.rs:79-94):

capacity = max(queue_config.capacity, colo_queue_size)
         × sum(endpoint.max_concurrent_requests)

The result is clamped to [0, 1000]. In practice, only one of queue_config.capacity or colo_queue_size is set — the max simply picks whichever is non-zero.

For the Fair queue, capacity is enforced per account — each account can have up to capacity items in its individual queue. For the Fifo queue, capacity is a global limit on total queue size.
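The formula above, as a small sketch (function name illustrative):

```python
def queue_capacity(queue_config_capacity: int, colo_queue_size: int,
                   endpoint_concurrencies: list[int]) -> int:
    """Sketch of the capacity formula from service-discovery/lib.rs:
    max of the two config values, times total endpoint concurrency,
    clamped to [0, 1000]."""
    per_slot = max(queue_config_capacity, colo_queue_size)
    return max(0, min(1000, per_slot * sum(endpoint_concurrencies)))
```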

Permit System

The PermitManager (permits.rs) tracks concurrency per endpoint. Each endpoint has a concurrency_limit (from EndpointConfig.max_concurrent_requests) and a map of active permits keyed by request ID.

try_acquire_permit() is non-blocking — it checks active_permits() < concurrency_limit and either returns a Permit or None. Permits have release-on-drop semantics: when a Permit is dropped, it calls release() which removes the entry from the map and calls notify.notify_one() to wake the queue processor (permits.rs:376-383).

There are three permit types:

  • Permit: Assigned to a specific request ID. Released on drop (unless detached).
  • UnassignedPermit: Holds a concurrency slot without a request ID. Used by the queue processor to reserve capacity before matching to a queued request. Decrements unassigned_count on drop (permits.rs:308-315).
  • DetachedPermit: A permit that has been persisted elsewhere (e.g. Consul KV) and should NOT be released on drop. Created via permit.detach().
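Release-on-drop maps naturally onto a Python context manager. A rough model of the permit lifecycle — class and method names are illustrative, and the Notify wakeup is reduced to a comment:

```python
class PermitManager:
    """Sketch: non-blocking permit acquisition against a concurrency limit."""
    def __init__(self, concurrency_limit: int):
        self.concurrency_limit = concurrency_limit
        self.active = {}   # request_id -> Permit

    def try_acquire(self, request_id):
        if len(self.active) >= self.concurrency_limit:
            return None                      # caller must queue or reject
        permit = Permit(self, request_id)
        self.active[request_id] = permit
        return permit

class Permit:
    def __init__(self, manager, request_id):
        self.manager, self.request_id = manager, request_id
        self.detached = False

    def detach(self):
        self.detached = True                 # persisted elsewhere; skip release

    def release(self):
        self.manager.active.pop(self.request_id, None)
        # real impl: notify.notify_one() wakes the queue processor here

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        if not self.detached:                # drop semantics: release unless detached
            self.release()
```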

The Three Outcomes

EndpointGroup.find_best_available_endpoint (endpoint_group.rs:149-288) is the core dispatch function. It tries endpoints in priority order and produces one of three outcomes:

  1. Leased: A permit was acquired immediately on a healthy endpoint. The request proceeds to inference with no queuing.
  2. Queued: All endpoints are at capacity, but the queue has room. A oneshot::Receiver is returned — the caller awaits it and gets an endpoint+permit when the queue processor fulfills the request.
  3. NoSlotsAvailable: All endpoints are at capacity AND the queue is full. The request is rejected with an out-of-capacity error.

Background Processor

Each ModelRequestQueue spawns a long-running tokio task (queue.rs:252-264) that:

  1. Waits on a Notify signal (fired when a permit is released or config changes).
  2. Calls fulfill_next_queued() in a loop until it returns false.
  3. fulfill_next_queued() gets the next queued request (respecting the Fair/Fifo algorithm), tries to acquire an unassigned permit on a healthy endpoint, and if successful, sends the (Endpoint, UnassignedPermit) pair through the oneshot channel to the waiting request.

If no healthy endpoints exist when the processor runs, the entire queue is cleared — all waiting requests receive a NoLocalResourcesAvailable error (queue.rs:203-209).

Hot-Swappable Algorithm

When the queue algorithm changes (Fair↔Fifo) via a config update, set_config (queue.rs:88-158) performs a hot swap:

  1. Creates a new queue implementation.
  2. Drains all pending requests from the old queue.
  3. Re-enqueues them into the new queue (skipping closed senders).
  4. Spawns a new processor task and aborts the old one.

Requests that can’t be re-enqueued (e.g. new queue is smaller) receive a NoLocalResourcesAvailable error.


Key Differences

| Aspect | Replicate | Workers AI |
|---|---|---|
| Queue backing | Redis Streams (durable, distributed) | In-memory VecDeque (per-process, volatile) |
| Scope | One stream per deployable, shared across all Director pods | One queue per model per constellation-server instance |
| Durability | Messages survive process restarts and Redis failovers | Messages lost on process restart |
| Consumer model | Consumer groups — any Director can claim any message | Direct dispatch — queue processor matches to local endpoints only |
| Fairness | None — all requests to a deployable are equal | Per-account fair queuing (round-robin) or FIFO, configurable per model |
| Capacity control | Unbounded stream (pruned after 24h) | Bounded: colo_queue_size × total_concurrency, clamped to 1000 |
| Backpressure | None at queue level — requests accumulate until pruned | Immediate rejection (NoSlotsAvailable) when queue is full |
| Stuck message handling | Autoscaler pruner (hourly, 24h threshold) + API sweeper (30s, claims and fails) | Not needed — in-memory queue with sender liveness checks |
| Concurrency control | Director processes one prediction at a time per worker goroutine | Permit-based: PermitManager per endpoint with configurable concurrency_limit |
| Blue/green migration | MultiMessageQueue rotates across Redis backends | Not applicable — no external queue to migrate |

The most significant architectural difference is durability vs bounded backpressure. Replicate’s Redis Streams are durable — a prediction enqueued during a Director outage will be picked up when a new pod starts. But this durability comes with unbounded queue growth and requires two separate cleanup mechanisms (pruner + sweeper) to handle stuck messages. Workers AI’s in-memory queues are volatile but provide immediate backpressure — when capacity is exhausted, the caller gets an error right away rather than waiting for a timeout or pruner cycle.

Model Storage & Images

Model Storage & Images

Overview

Both platforms need to get model weights and inference code onto GPU nodes before serving requests. The approaches diverge sharply. Replicate builds Docker images via cog, then optionally layers on a multi-tier caching system (pget → Hermes → object store) to get weights close to GPU nodes. Workers AI stores model weights in R2 and fetches them at container startup via a dedicated model-greeter init container, with a disk-reaper sidecar managing local cache on each GPU node.


Replicate: Images, Weights, and Caches

Container Image Resolution

Every deployable has a docker_image URI set during cog push (e.g. r8.im/user/model@sha256:...). At pod creation time, the cluster autoscaler resolves this to an internal registry URI (deployable.go:1999-2040):

The r8.im/ prefix is stripped and replaced with the internal MODEL_ARTIFACT_REGISTRY_BASE (an unauthenticated Artifact Registry mirror). Non-r8.im URIs are passed through unmodified.

Two-Container Pods

Every model pod has two containers (deployable.go:460-476):

  1. director — orchestrates the model lifecycle (queue consumption, health polling, state reporting, cache restore/persist). Image comes from the services registry.
  2. model — runs the cog server. Image is the resolved model image.

They interact primarily over HTTP, but also share a supervisor-ipc volume (a 50 MB tmpfs) in some cases, plus a run-cache volume for runtime state.

Standard Startup Path

The model container’s entrypoint (ModelEntrypointScript.sh) runs:

  1. Updates pget binary from PGET_DOWNLOAD_URL (or uses the monobase-bundled version if available).
  2. Optionally upgrades cog (via ENTRYPOINT_COG_OVERRIDE_PACKAGE) or installs hf_transfer.
  3. Starts the cog HTTP server (python -m cog.server.http).

The model’s setup() method runs inside the cog server; this is often the step where weights are downloaded, via pget proxied through Hermes as a pull-through cache.

FUSE (deprecated)

Replicate also built a FUSE-based path (fuse/fuse.go) that separated weights from code entirely: a host-level FUSE daemon served weights on demand, and the model container ran a lightweight monobase image instead of a Docker image with weights baked in. Director acted as a gRPC client to the FUSE mounter, managing mount lifecycle via Start/Heartbeat/Stop RPCs. This eliminated the weight download step from cold starts — weights were read lazily from the FUSE mount as the model accessed them.

The approach is being wound down and remaining FUSE-enabled models are slated for migration to the standard path. The code still exists in getImageURI (monobase fallback) and the FUSE entrypoint script, but no new models use it.

pget: Parallel Chunk Downloader

pget (replicate/pget) downloads model weights in parallel chunks. Key configuration (download/options.go:24-43):

  • CacheableURIPrefixes: Allowlist of domains+path-prefixes eligible for pull-through caching.
  • CacheHosts: Ordered list of cache hostnames used with consistent hashing — the same URL always routes to the same cache host.
  • ForceCachePrefixRewrite: When enabled, rewrites all requests to the first cache host (used for Hermes routing).
  • CacheUsePathProxy: Prepends the original host to the cache request path instead of using host-based routing.
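The same-URL-same-host property can be sketched with a stable hash. Note that this simple modulo scheme is not true consistent hashing (which additionally minimizes remapping when the host list changes), and pget's actual algorithm may differ:

```python
import hashlib

def pick_cache_host(url: str, cache_hosts: list[str],
                    force_first: bool = False) -> str:
    """Route a URL to a cache host: the same URL always maps to the same
    host. force_first models ForceCachePrefixRewrite-style routing to the
    first host (used for Hermes). Illustrative, not pget's code."""
    if force_first:
        return cache_hosts[0]
    digest = hashlib.sha256(url.encode()).digest()
    return cache_hosts[int.from_bytes(digest[:8], "big") % len(cache_hosts)]
```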

Config is injected via K8s ConfigMaps: pget-config (standard) or pget-hermes-config (Hermes-enabled regions).

Hermes: Regional Edge Cache

Hermes (replicate/hermes) is an HTTP read-aside cache deployed in GPU serving regions (CKS, Nebius). It caches model weights in region-local S3-compatible object storage (CoreWeave CAIOS) to avoid repeated cross-region downloads.

Three components:

  • Cache server (server/cacherouter.go): Receives requests from pget. If the file is cached, returns a 307 redirect to a presigned S3 URL. If not cached, redirects to the origin and enqueues a background cache job.
  • Processor: Downloads from origin and uploads to regional S3.
  • Pruner: TTL-based cleanup of cached objects.

HuggingFace traffic is routed through Hermes by setting HF_ENDPOINT=http://hermes.../huggingface.co on all containers in CKS/Nebius regions. The weights.replicate.delivery domain is also rewritten through Hermes when ForceCachePrefixRewrite is enabled.

Torch and CUDA Caches

The director container manages two S3-backed caches (cache/cache.go):

  • Torch Compile Cache: Backs TORCHINDUCTOR_CACHE_DIR. Size range 10 MB–10 GB, 7-day TTL, refreshed daily.
  • CUDA Checkpoint Cache: Backs DIRECTOR_CUDA_CHECKPOINT_DIR.

On startup, Director restores these caches from S3 (after the model reports StatusReady). On shutdown, it persists any changes back. Cache files are stored as .tar.zst archives with timestamp-based naming for ordering.


Workers AI: R2, model-greeter, and disk-reaper

SoftwareConfig and Image Resolution

The ai-scheduler determines what container image and configuration to use for each model. It reads from two sources:

  1. Config API: Model properties including software_name, gpu_memory, and optionally cog_image (config_api/lib.rs:248-249).
  2. R2 model-catalog bucket: SoftwareConfig YAML files at {version}/{software_name}.yaml (catalog/r2.rs).

SoftwareConfig (config/software.rs) defines:

  • image — container image URI
  • api — inference backend type (e.g. pipe-http, tgi)
  • ports — network port configuration
  • mounts — volume mounts (cache dir, disk-reaper socket, etc.)
  • network — firewall allow rules (slirpnetstack)
  • entrypoint — optional override

For Cog models, SoftwareConfig::cog(image) constructs a config with the cog_image from Config API and hardcoded network allow rules for HuggingFace, R2, PyPI, and Replicate domains (software.rs:156-240).

Container Image Protocol

Container images use a cf:// protocol prefix that resolves to registry.cloudchamber.cfdata.org (image_registry_protocol.rs). This is a Cloudchamber concept — the protocol maps to one or more registry domains, providing fallback if one registry is unavailable. In practice, cf://image:tag resolves to registry.cloudchamber.cfdata.org/image:tag.
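The protocol expansion can be sketched as follows — the single-registry default matches the doc, while the fallback-ordering logic is an assumption:

```python
def resolve_image(uri: str,
                  registries=("registry.cloudchamber.cfdata.org",)) -> list[str]:
    """Expand a cf:// image URI to candidate registry URIs, in fallback
    order. Non-cf:// URIs pass through unmodified. Illustrative sketch."""
    prefix = "cf://"
    if not uri.startswith(prefix):
        return [uri]
    rest = uri[len(prefix):]
    return [f"{registry}/{rest}" for registry in registries]
```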

model-greeter: Weight Downloader

model-greeter downloads model files from R2 at container startup. On IKE (external) clusters it runs as a K8s init container; on Cloudchamber (internal) it runs as a sidecar.

The IKE operator configures model-greeter as an init container named install-model-greeter (deployment.rs:615-623) that copies its binary into a shared volume. The main container then uses that binary to download weights.

Environment variables injected by the scheduler (cloudchamber/lib.rs:88-192):

  • R2_ENDPOINT — R2 API endpoint
  • MODEL_CATALOG_R2_BUCKET — bucket name
  • MODEL_CATALOG_VERSION — current catalog version (from Release Manager)
  • SOFTWARE_TO_LOAD — software config name
  • MODEL_TO_LOAD — model identifier

Weights are downloaded to /cache (mounted from the disk-reaper managed volume).

disk-reaper: Local Cache Management

disk-reaper manages the local model cache on GPU nodes. Three modes (model-crd/crd.rs:80-86):

  • Shared (default): Cache backed by a PersistentVolumeClaim shared across pods on the same node. disk-reaper runs as a separate DaemonSet, communicating via Unix socket. Multiple models share the same cache volume.
  • Ephemeral: Cache backed by an emptyDir volume. disk-reaper runs as a sidecar container inside the model pod (deployment.rs:583-585). Cache is lost when the pod terminates.
  • Disabled: No cache management. Cache volume is still an emptyDir but no reaper process runs.

The IKE operator selects the mode based on operator config and the Model CRD’s disk_reaper.mode field. When the operator is configured for ephemeral-only mode, any model requesting Shared is downgraded to Ephemeral (deployment.rs:226-239).
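The downgrade rule is simple enough to state as a function (names illustrative, mirroring the behavior described for deployment.rs:226-239):

```python
def effective_mode(requested: str, operator_ephemeral_only: bool) -> str:
    """Resolve the disk-reaper mode: Shared is downgraded to Ephemeral when
    the operator is configured for ephemeral-only caches."""
    if operator_ephemeral_only and requested == "Shared":
        return "Ephemeral"
    return requested
```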

Inference Backends

Workers AI supports multiple inference server backends, determined by the ai-software= tag on the container and the api field in SoftwareConfig:

| Backend | SoftwareConfig api | Use Case |
|---|---|---|
| Triton | triton | NVIDIA Triton Inference Server |
| TGI | tgi | HuggingFace Text Generation Inference |
| TEI | tei | HuggingFace Text Embeddings Inference |
| PipeHttp | pipe-http | vLLM, Cog models |
| PipeHttpLlm | pipe-http-llm | LLM-specific pipe variant |
| PartnerPipeHttp | partner-pipe-http | Partner-hosted models |

This is a significant difference from Replicate, where Cog is the only inference server.


Key Differences

| Aspect | Replicate | Workers AI |
| --- | --- | --- |
| Weight sources | Mixed — some baked into Docker layers, many downloaded during setup() from HuggingFace, weights.replicate.delivery, and other origins | Single source — R2 model-catalog bucket |
| Weight download | Model’s setup() via pget (parallel chunks, consistent-hash caching) | model-greeter init container/sidecar downloads from R2 |
| Regional caching | Hermes (HTTP read-aside cache, 307 redirects to regional S3) | R2 serves from nearest edge node (region-less by design) |
| Local cache | No persistent local cache for weights | disk-reaper manages shared PVC or ephemeral emptyDir |
| Compilation caches | S3-backed torch compile + CUDA checkpoint caches (7-day TTL) | None |
| Container layout | Two containers: director sidecar + model | One main container + model-greeter init + optional disk-reaper sidecar |
| Inference servers | Cog only | Triton, TGI, TEI, PipeHttp, PipeHttpLlm, PartnerPipeHttp |
| Image protocol | r8.im/ rewritten to internal Artifact Registry | cf:// protocol with multi-registry fallback |

Replicate’s weight delivery is messy by nature. Model authors control what happens in setup() — some models have weights baked into Docker layers, others download from HuggingFace, others pull from weights.replicate.delivery, and some combine approaches. pget and Hermes are optimizations layered on top: pget parallelizes downloads from whatever origin the model uses, and Hermes caches results in regional S3 so subsequent cold starts in the same region avoid cross-region transfers. Neither is strictly required — models work without them, but cold starts are slower.

Workers AI’s approach is more uniform: weights always come from R2 via model-greeter, giving the platform full control over the download path. R2 operates region-lessly — it serves from the nearest edge node that has the data, so there’s no need for explicit regional copies the way Hermes populates region-local S3. The disk-reaper shared PVC adds another layer by keeping weights on-node across pod restarts.

The inference server flexibility is a notable Workers AI advantage. Supporting Triton, TGI, TEI, and vLLM alongside Cog means Workers AI can use purpose-built servers optimized for specific workload types, while Replicate routes everything through Cog.

State Reporting

Overview

Both platforms need to know what their model instances are doing — whether they’re booting, idle, processing requests, or failing.

Replicate’s Director reports instance state to the API every 15 seconds and sends a report when model setup finishes. Individual predictions are tracked via OTel spans. Workers AI logs a structured event for every inference request into Cloudflare’s internal analytics pipeline, and exposes Prometheus metrics for aggregate capacity signals.


Replicate: HTTP Reporting and OTel Spans

Instance State Monitor

Director runs an instance state monitor (instancestate/monitor.go) that tracks what the model instance is doing at any given moment. Three activity states (instancestate/types.go:10-14):

| State | Meaning |
| --- | --- |
| Boot | Instance is starting up (setup running) |
| Idle | Ready but no active predictions |
| Active | Processing ≥1 prediction |

The monitor maintains a concurrency counter (activityLevel). When a prediction starts, the level increments; when it finishes, it decrements. Any level > 0 means Active (types.go:18-23).

Every 15 seconds (DefaultUtilizationInterval), the monitor computes a Metrics payload (types.go:39-55) and POSTs it as JSON to the cluster-local API instance:

| Field | Description |
| --- | --- |
| active_time, boot_time, idle_time | Seconds in each state this interval |
| metric_duration | Total interval length |
| utilization | active_time / metric_duration |
| mean_concurrency | Weighted average concurrency level |
| deployment_key, docker_image_uri | Deployable identity |
| hardware, compute_unit, compute_unit_count | Resource info |
| instance_id, instance_metadata, version | Pod-level metadata |

Utilization is active_time / metric_duration — a number between 0 and 1 representing what fraction of the interval the instance was processing at least one prediction. Mean concurrency captures how many predictions were running simultaneously on average (monitor.go:80-91).
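The interval math above can be sketched directly. This is not Director’s code — it is a minimal illustration, assuming the monitor sees timestamped concurrency-level changes within one reporting interval:

```python
# Sketch: compute utilization (fraction of the interval with >=1 active
# prediction) and time-weighted mean concurrency from level-change events.
def interval_metrics(events, interval_start, interval_end):
    """events: sorted list of (timestamp, concurrency_level_after_change)."""
    active_time = 0.0
    weighted = 0.0
    level = 0
    t = interval_start
    for ts, new_level in events:
        span = ts - t
        if level > 0:
            active_time += span
        weighted += level * span
        level, t = new_level, ts
    span = interval_end - t
    if level > 0:
        active_time += span
    weighted += level * span
    duration = interval_end - interval_start
    return {
        "utilization": active_time / duration,
        "mean_concurrency": weighted / duration,
    }
```

For example, one prediction starting at t=5, a second joining at t=10, and both finishing at t=12 inside a 15-second interval gives utilization 7/15 and mean concurrency 9/15.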

Transport: HTTP POST to DIRECTOR_REPORT_INSTANCE_STATE_URL. Auth via WEBHOOK_AUTH_TOKEN Bearer header. Retries via httpclient.ApplyRetryPolicy (reporter.go:102-140).

Setup Run Reporting

Once when model setup completes (or times out), Director sends an HTTP POST to DIRECTOR_REPORT_SETUP_RUN_URL (reporter.go:142-212). The payload includes:

  • status — terminal state (“ready”, “failed”, etc.)
  • started_at, completed_at — RFC3339 timestamps
  • logs — setup output (truncated to last 20 KiB for OTel spans)
  • instance_metadata — pod name, namespace, etc.
  • runtime_metadata — cog version, cog version override, pod name

Director also records OTel span attributes for setup timing: setuprun.scheduled_to_setup_started, setuprun.scheduled_to_ready, setuprun.duration_seconds, and setuprun.status.

Scale State Snapshots

The cluster autoscaler captures periodic snapshots of each deployable’s queue depth and replica count, stored in Redis (scalestate/scalestate.go):

| Field | Description |
| --- | --- |
| queue_length | Total across blue+green Redis |
| queue_length_blue, queue_length_green | Per-Redis-instance queue depth |
| replica_count | Ready replicas from K8s |
| time | Unix timestamp |

Up to 120 snapshots are retained per deployable (MaxSnapshots). The autoscaler uses this history when making scaling decisions — the snapshot window provides context about recent queue pressure and capacity.
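The retention behavior can be sketched as a bounded history — a minimal illustration only, since the real implementation stores snapshots in Redis rather than in memory:

```python
from collections import deque

MAX_SNAPSHOTS = 120  # MaxSnapshots

# Sketch: a bounded per-deployable snapshot history. Oldest snapshots are
# evicted once the cap is reached, so the autoscaler always sees a recent
# window of queue pressure and capacity.
class ScaleStateHistory:
    def __init__(self):
        self.snapshots = deque(maxlen=MAX_SNAPSHOTS)

    def record(self, queue_blue, queue_green, replicas, ts):
        self.snapshots.append({
            "queue_length": queue_blue + queue_green,
            "queue_length_blue": queue_blue,
            "queue_length_green": queue_green,
            "replica_count": replicas,
            "time": ts,
        })

h = ScaleStateHistory()
for i in range(150):
    h.record(i, 0, 3, ts=i)  # only the last 120 are retained
```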

Two Prometheus gauges are emitted per snapshot (scalestate.go:69-76):

  • autoscaler_snapshot_queue_length — queue depth, labeled by deployable attributes
  • autoscaler_snapshot_replica_count — ready replicas

Scaling decisions are recorded as OTel spans (autoscaler.scaleVersion) with the full snapshot history JSON, old/new replica counts, and all deployable attributes.

Prediction Tracker

Each prediction’s lifecycle is managed by a Tracker (director/tracker.go) that wraps an OTel span and notifies subscribers on state changes. The tracker is not thread-safe — it’s owned by a single worker goroutine.

Prediction states: Processing, Succeeded, Failed, Canceled, Aborted. The tracker records span attributes for cancel reasons, error codes (e.g. E1234), compliance check results, moderation outcomes, and billing metrics (token counts, image counts).

Subscribers receive PredictionUpdate{Prediction, Events} notifications. Events are webhook event types (WebhookEventCompleted, WebhookEventOutput). Terminal updates are held until Stop() to avoid sending duplicate completion events.

pget Download Metrics

Director exposes a POST /metrics endpoint (server/metrics.go) that pget uses to report download metrics — URL, file size, throughput, and errors (pget/pkg/pget.go:148-180). Each metric is converted to an OTel span (director.metric) with attributes from the payload. Rate-limited to 100 req/s. HuggingFace and S3 presigned URLs are normalized to strip query parameters, avoiding high-cardinality span attributes.


Workers AI: Inference Events and Prometheus Metrics

Ready Analytics (Inference Events)

Every inference request produces an InferenceEvent that is serialized as a Cap’n Proto message and sent to logfwdr (Cloudflare’s internal log forwarding daemon) via a Unix domain socket (ready-analytics/src/inference_event.rs, ready-analytics/src/logfwdr.rs). This feeds into Cloudflare’s Ready Analytics pipeline.

Key fields per event (inference_event.rs:42-77):

| Field | Description |
| --- | --- |
| account_id | Cloudflare account ID |
| model_id | Model identifier (e.g. @cf/meta/llama-2-7b) |
| request_time_ms | Total request time |
| inference_time_ms | Actual inference duration |
| queuing_time_ms | Time spent in local queue |
| time_to_first_token | TTFT for streaming responses |
| neurons | Neuron cost metric (billing) |
| error_code | Error code if failed |
| colo_id / source_colo_id | Colo identifiers |
| local_queue_size | Queue depth at request time |
| model_container_id | Container identifier |

Bitfield flags track request properties: streamed, beta model, external endpoint, queued in colo, Unix socket connection (inference_event.rs:13-19).

RequestSource distinguishes Worker bindings, Pages bindings, and REST API requests (inference_event.rs:22-28).

Transport: LogfwdrClient connects to a Unix socket with a 64-slot mpsc channel buffer. 10-second watchdog reconnect on failure. Metrics: connect_err, write_err, ok (logfwdr.rs).

Endpoint Analytics (Lifecycle Events)

The AvailableEndpointTracker (endpoint-analytics/src/lib.rs) tracks endpoint membership changes — when endpoints appear or disappear from the world map. It generates lifecycle events sent to a Workers pipeline via LifecycleEventSender:

| Field | Description |
| --- | --- |
| begin_ts | When endpoint appeared |
| end_ts | When endpoint disappeared (0 if still present) |
| event_type | Endpoint |
| model_id | Model identifier |
| hostname | Tracker’s hostname |

Updates are tied to the world map refresh interval.

Prometheus Metrics

constellation-server exposes Prometheus metrics (metrics/src/lib.rs). Key gauges and counters:

Request-level:

  • requests — total requests by HTTP status
  • errors — errors by internal + HTTP code
  • infer_per_backend — inferences per backend type (Triton, TGI, etc.)
  • infer_per_health_state — inferences by endpoint health state

Queue and capacity:

  • turnstile_outcomes — permit acquisition outcomes (leased/queued/rejected)
  • full_fraction — fraction of time a model is fully saturated
  • total_permits / mean_used_permits — permit availability and usage per model
  • mean_queue_size — average queue depth per model

These queue/capacity metrics are computed from a ModelPermitBuffer — a rolling window of 600 samples at 500 ms intervals (5-minute window) tracking (total_permits, available_permits, queue_size) per model (metrics/src/lib.rs).
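A sketch of that rolling-window computation, using the field names from the text (the exact aggregation in constellation-server is an assumption here):

```python
from collections import deque

# Sketch: a per-model rolling window of (total_permits, available_permits,
# queue_size) samples — 600 samples at 500 ms intervals is ~5 minutes —
# from which summary gauges are derived.
class PermitWindow:
    def __init__(self, max_samples=600):
        self.samples = deque(maxlen=max_samples)

    def push(self, total, available, queue):
        self.samples.append((total, available, queue))

    def metrics(self):
        n = len(self.samples)
        saturated = sum(1 for t, a, _ in self.samples if a == 0)
        return {
            "full_fraction": saturated / n,
            "mean_used_permits": sum(t - a for t, a, _ in self.samples) / n,
            "mean_queue_size": sum(q for _, _, q in self.samples) / n,
        }

w = PermitWindow()
w.push(8, 0, 3)   # saturated, 3 requests queued
w.push(8, 2, 0)
w.push(8, 0, 1)   # saturated again
w.push(8, 8, 0)   # fully idle
```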

Infrastructure:

  • constellation_server_forward_failures — cross-colo forward failures
  • constellation_server_endpoints_removed — endpoints in backoff state
  • constellation_server_external_endpoints — external endpoint health by host/port

ai-scheduler Metrics

The scheduler emits its own Prometheus metrics (prefixed ai_scheduler_) every 60 seconds by polling the Cloudchamber API (crates/metrics/src/lib.rs):

  • model_deployments — deployment count by model, tier, region, status, GPU model
  • requested_instances / min_requested_instances / max_requested_instances — scaling bounds
  • autoscaler_measured_utilization / estimated_utilization / forecast_utilization — utilization signals
  • autoscaler_required_instances — minimum instances to meet target utilization
  • external_instances — external instance count by provider, region, model, phase
  • external_nodes — external GPU count by provider, region, state

Key Differences

| Aspect | Replicate | Workers AI |
| --- | --- | --- |
| Instance state | 3 states (Boot/Idle/Active) with concurrency level, reported every 15s | Per-request inference events; permit/queue metrics from 5-min rolling window |
| Transport | HTTP POST (JSON) to Replicate API | Unix socket (Cap’n Proto) to logfwdr pipeline |
| Granularity | Periodic snapshots (15s utilization, setup completion) | Per-request events + rolling metric windows |
| Setup reporting | Report sent once when setup completes, with logs, timing, metadata | No direct equivalent; endpoint lifecycle events track appearance/disappearance |
| Scaling telemetry | Redis-cached snapshot history (120 snapshots), OTel spans per decision | Prometheus gauges polled every 60s from Cloudchamber API |
| Prediction tracking | Per-prediction OTel span with subscriber notifications | Per-request inference event to analytics pipeline |
| Metrics system | OTel spans → tracing backend | Prometheus (scraped) + Ready Analytics (logfwdr → pipeline) |

Both platforms have multiple reporting streams at different granularities. Replicate has per-prediction OTel spans (Tracker), pget download metrics (POST /metrics), 15-second instance utilization snapshots, and a setup report sent once when setup completes. Workers AI has per-request inference events (Ready Analytics), endpoint lifecycle events, Prometheus metrics (backed by internally-sampled 5-minute rolling windows), and 60-second scheduler polls.

The main structural difference is where aggregation happens. Director’s instance state monitor pre-aggregates utilization into 15-second windows before sending to the API, which forwards to Web for customer-facing reporting. Workers AI’s inference events are sent raw and aggregated by downstream consumers in the analytics pipeline. The Prometheus metrics layer provides pre-computed summaries (like full_fraction and mean_queue_size) for operational use.

Billing Metrics

Overview

Both platforms need to measure what happened during inference so they can bill for it. The billing models are different: Replicate bills on predict time plus per-unit metrics (tokens, images, etc.), while Workers AI bills on “neurons” — an abstract unit representing GPU compute.

How those metrics are collected also differs. Director estimates billing metrics for most models. Only explicitly opted-in trusted models may report their own billing metrics. Workers AI has multiple paths depending on the model server software: Triton-backed models report cost metrics as output tensors that constellation-server converts to neurons, while non-Triton models (omni, infire, partner) calculate neurons themselves and report them via HTTP response headers. Both paths converge at constellation-entry.


Replicate: Model-Reported and Director-Estimated Metrics

Three Flags

Director has three flags that control billing metric behavior (director/config.go:46-48):

| Flag | Purpose |
| --- | --- |
| DIRECTOR_TRUST_BILLING_METRICS | When true, pass through the model’s billing metrics to downstream systems |
| DIRECTOR_CALCULATE_TOKEN_METRICS | When true, Director independently estimates token counts |
| DIRECTOR_CALCULATE_IMAGE_METRICS | When true, Director independently estimates image counts |

The calculate flags are the default path for untrusted models — Director estimates billing metrics from the prediction input and output rather than trusting the model to report them.

Both calculate flags can be enabled alongside trustBillingMetrics. When they are, Director compares its own estimates against the model’s reported values and records match/mismatch as OTel span attributes (token_input_metrics.match, image_count_metrics.match, etc.). This is useful for validating that trusted models report accurately.

Model-Reported Metrics (Trusted Path)

When trustBillingMetrics is true, Director passes through three sets of metrics from the model’s prediction response (tracker.go:524-585):

PublicMetrics — user-visible metrics stored on the prediction. Covers image count, batch size, input/output token counts, and predict time share.

BillingMetrics — internal metrics stored in prediction.InternalMetadata["billing_metrics"]. 35+ fields covering (cog/types.go:68-125):

  • Audio (input/output count, duration)
  • Characters (input/output count)
  • Images (input/output count, megapixels, pixel dimensions, step count)
  • Tokens (input/output count)
  • Video (input/output count, duration, frame counts, megapixel-seconds)
  • Documents (page input count)
  • Training (step count)

BillingCriteria — model variant and configuration that affects pricing, stored in prediction.InternalMetadata["billing_criteria"] (cog/types.go:58-66). Covers model variant, resolution target, motion mode, source/target FPS, and audio flag.

When trustBillingMetrics is false, all three are silently dropped.

Director-Estimated Token Counts

When calculateTokenMetrics is true, Director estimates token counts from the prediction input and output (tracker.go:656-704):

Input tokens: Scans prediction input keys for any containing the substring "prompt" (matches prompt, system_prompt, prompt_template, etc.). Each matching string value is tokenized and the results are summed.

The tokenizer (tracker.go:714-717) is a rough heuristic: split on whitespace, count words, multiply by 4/3.

Output tokens: If the output is a string, run countTokens on it. If it’s an array, count the array length (for streaming models, each element is typically one token).

Timing: When output tokens exist, Director also calculates TimeToFirstToken and TokensPerSecond.
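The estimation rules above translate almost directly into code. A sketch (function names are illustrative; the real logic is Go in tracker.go):

```python
# Sketch of Director's token heuristic: split on whitespace, count words,
# multiply by 4/3. Not a real tokenizer.
def count_tokens(text):
    return int(len(text.split()) * 4 / 3)

def estimate_input_tokens(prediction_input):
    # Sum estimates over every string input whose key contains "prompt"
    # (matches prompt, system_prompt, prompt_template, ...).
    return sum(
        count_tokens(v)
        for k, v in prediction_input.items()
        if "prompt" in k and isinstance(v, str)
    )

def estimate_output_tokens(output):
    if isinstance(output, str):
        return count_tokens(output)
    if isinstance(output, list):
        # Streaming models typically emit one token per array element.
        return len(output)
    return 0
```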

Director-Estimated Image Counts

When calculateImageMetrics is true, Director estimates image counts from the prediction output (tracker.go:629-653):

  • Array output → count = array length
  • Non-empty string output → count = 1
  • Empty or nil output → count = 0

There’s no content inspection — a TODO: Check we're counting images and not something else acknowledges this. If the model already reported via BillingMetrics, Director uses that value instead.
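The output-shape rules and the model-reported override can be sketched as follows (illustrative only — like Director, it inspects the shape of the output, never its content):

```python
# Sketch of Director's image-count estimation: prefer a model-reported
# value from BillingMetrics; otherwise fall back to output shape.
def estimate_image_count(output, reported=None):
    if reported is not None:
        return reported        # model-reported BillingMetrics value wins
    if isinstance(output, list):
        return len(output)     # array output -> count = array length
    if isinstance(output, str) and output:
        return 1               # non-empty string output -> count = 1
    return 0                   # empty or nil output -> count = 0
```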

Metric Flow

  1. Model reports PublicMetrics, BillingMetrics, and BillingCriteria in its prediction response.
  2. Director processes them based on the three flags — passing through trusted metrics, estimating where needed, comparing when both paths are active.
  3. Director sends the prediction to the API via internal webhook. BillingMetrics and BillingCriteria travel in InternalMetadata (opaque to users). PublicMetrics are on the prediction itself (visible to users).
  4. The API forwards to Web for billing aggregation.

Workers AI: Neurons and Multiple Reporting Paths

Neurons

Workers AI bills in neurons — an abstract unit representing GPU compute. Internally, 1 neuron ≈ 0.1 L4 GPU-seconds (baselined at Sept 2023 efficiency). Each model has per-metric neuron coefficients that convert raw usage into a neuron total. The formula (neuron/src/lib.rs:7-36):

total_neurons = cost_per_infer + Σ(neuron_cost × metric_value)

cost_per_infer is an optional flat cost per request. The per-metric multipliers (e.g., input_tokens, output_tokens, image_steps) are configured per model via Consul service config or Deus env vars like INPUT_TOKEN_NEURONS and OUTPUT_TOKEN_NEURONS.
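The formula is straightforward to sketch. Coefficients below are illustrative only — real values live in per-model config (Consul / Deus env vars), and the real implementation is Rust in neuron/src/lib.rs:

```python
# Sketch of the neuron formula:
#   total_neurons = cost_per_infer + sum(neuron_cost * metric_value)
def calculate_neurons(neuron_config, cost_metrics):
    total = neuron_config.get("cost_per_infer", 0.0)
    for metric, value in cost_metrics.items():
        total += neuron_config.get(metric, 0.0) * value
    return total

neurons = calculate_neurons(
    {"cost_per_infer": 1.0, "input_tokens": 0.02, "output_tokens": 0.07},
    {"input_tokens": 100, "output_tokens": 50},
)  # 1.0 + 2.0 + 3.5
```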

Triton Path: Cost Metric Tensors

Triton-backed models — including triton-vllm, which covers the majority of LLMs — report billing metrics as output tensors prefixed with COST_METRIC_. constellation-server extracts these in the Triton result handler (inference/triton/result.rs:196-249):

  1. Iterate over result tensors.
  2. For each tensor named COST_METRIC_*, strip the prefix to get the metric name (e.g., input_tokens, output_tokens).
  3. Sum the tensor values (handles scalar, 1D array, and [N×1] shapes).
  4. Call calculate_neurons(neuron_config, &cost_metric_values) to compute total neurons.
  5. Strip COST_METRIC_* tensors from the response before sending to the client.
  6. Put the neuron total and up to two named cost metrics in the response cf-ai-cserver-meta JSON header.

The client never sees the raw cost metric tensors.
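Steps 1–5 can be sketched in Python (the real handler is Rust in inference/triton/result.rs; tensors here are modeled as name → nested-list values):

```python
# Sketch: split Triton result tensors into client-visible outputs and
# summed cost metrics. The flatten/sum handles scalar, 1-D, and [N x 1]
# tensor shapes.
PREFIX = "COST_METRIC_"

def _total(values):
    if isinstance(values, list):
        return sum(_total(v) for v in values)
    return float(values)

def extract_cost_metrics(tensors):
    metrics, visible = {}, {}
    for name, values in tensors.items():
        if name.startswith(PREFIX):
            # Strip the prefix to get the metric name, sum the values.
            metrics[name[len(PREFIX):]] = _total(values)
        else:
            visible[name] = values  # cost tensors never reach the client
    return visible, metrics

visible, metrics = extract_cost_metrics({
    "text_output": ["Hello!"],
    "COST_METRIC_input_tokens": [[12]],   # [N x 1] shape
    "COST_METRIC_output_tokens": 5,       # scalar
})
```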

Non-Triton Path: Model-Reported Headers

Non-Triton backends report neurons via HTTP response headers that constellation-server passes through transparently:

Omni models (PipeHttp software type) — the omni framework (cloudflare/ai/omni) provides a Python API where model code calls context.cost.set_neurons(value) and context.cost.set_usage_metric(name, value). Omni converts these to cf-ai-neurons and cf-ai-cost-metric-{name,value}-N response headers (omni/shared/src/cost.rs:82-96). Models read their neuron coefficients from the cf-ai-model-config request header that constellation-server forwards from the model’s Consul config.

Infire models (PipeHttpLlm software type) — the newer vLLM deployment path. Neuron coefficients are defined in workers-ai.yaml (e.g., input_token: 0.02561, output_token: 0.07515). The model server calculates and reports neurons via the same cf-ai-neurons header convention.

Partner models (PartnerPipeHttp) — partner-bouncer calculates neurons and emits cf-ai-neurons headers (partner-bouncer/src/server/model.rs:178-211).

Convergence at constellation-entry

All paths converge at constellation-entry, the edge Worker that assembles the billing event (sdk/.../src/lib/headers.ts:20-82):

  1. Parse cf-ai-neurons from response headers (non-Triton path).
  2. Parse cf-ai-cserver-meta JSON and merge its fields — including neurons — into the metrics context (Triton path).
  3. Both sources write to the same raMetrics.neurons field.
  4. Send a Ready Analytics event with the neuron total to the SDK RA table.

Billing reads from the SDK RA table (aiinference_sdk_production_by_namespace_account_sampled), summing neurons × _sample_interval per account.
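The convergence logic can be sketched as follows. The real code is TypeScript in constellation-entry; this Python illustration assumes the merge order implied by steps 1–2 (meta fields, merged second, take precedence):

```python
import json

# Sketch: both paths write the same neurons value. Non-Triton backends send
# a cf-ai-neurons header; the Triton path embeds neurons in the
# cf-ai-cserver-meta JSON header.
def neurons_from_headers(headers):
    meta = json.loads(headers.get("cf-ai-cserver-meta", "{}"))
    if "neurons" in meta:                # Triton path (merged fields win)
        return float(meta["neurons"])
    if "cf-ai-neurons" in headers:       # omni / infire / partner path
        return float(headers["cf-ai-neurons"])
    return 0.0
```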

Streaming

For streaming responses, cost metrics are accumulated across chunks. In the Triton path, constellation-server’s InferenceEventBuilder adds metric values to running totals (inference_event.rs:153-156). In the non-Triton path, constellation-entry accumulates neurons and cost_metric_value_2 from each chunk’s meta field (sdk/.../src/lib/tools.ts:948-959).

No Server-Side Token Counting

No component in the Workers AI stack counts tokens independently. Token counts come from the model — either as COST_METRIC_input_tokens / COST_METRIC_output_tokens tensors (Triton) or as usage metrics reported via context.cost.set_usage_metric() (omni). There is no tokenizer in constellation-server, constellation-entry, or ai-scheduler.


Key Differences

| Aspect | Replicate | Workers AI |
| --- | --- | --- |
| Billing unit | Predict time + per-unit metrics (tokens, images, video, etc.) | Neurons (abstract GPU compute unit, ≈ 0.1 L4-seconds) |
| Who counts | Director estimates (untrusted) or model reports (trusted) | Model always reports — via tensors (Triton) or headers (omni/infire/partner) |
| Token counting | Director: word count × 4/3 heuristic (untrusted). Model: actual counts (trusted). | Model-reported only. No server-side tokenizer. |
| Cost formula | Raw metrics passed to billing system, pricing applied downstream | neurons = cost_per_infer + Σ(neuron_cost × metric_value), computed at inference time or by model code |
| Metric types | 35+ fields across audio, image, video, tokens, training, documents | Arbitrary named metrics (typically 1-3 per model) |
| Trust model | Explicit trust_billing_metrics flag gates model-reported metrics | Implicit — all models report their own metrics, no server-side estimation |
| Reporting paths | Single path through Director | Multiple: Triton tensors, omni SDK, infire headers, partner-bouncer — all converge at constellation-entry |
| Billing table | Web aggregates from prediction metadata | SDK RA table (written by constellation-entry) queried via BigQuery |

Both platforms have a trust asymmetry. Replicate runs untrusted third-party models and must estimate billing metrics for them — only explicitly opted-in trusted models may report their own. Workers AI controls all model deployments, so every model reports its own metrics through one of several backend-specific mechanisms.

Replicate’s BillingMetrics struct with 35+ fields reflects the diversity of model types it supports (image generators, video models, audio models, LLMs, training jobs). Workers AI’s neuron abstraction collapses all of this into a single number — per-model pricing changes only require updating neuron coefficients in config, not model code.

Workload Types Overview

Workload Types in Replicate

Replicate handles several distinct workload types, each with different characteristics and configurations. Understanding these distinctions is essential for comparing with Workers AI, which has a more uniform workload model.

Overview

Replicate defines four primary workload types via the DeployableKind enum (replicate/web/models/models/deployable_config.py:114):

class DeployableKind(models.TextChoices):
    DEPLOYMENT_PREDICTION = "deployment-prediction", "deployment-prediction"
    FUNCTION_PREDICTION = "function-prediction", "function-prediction"
    VERSION_PREDICTION = "version-prediction", "version-prediction"
    VERSION_TRAINING = "version-training", "version-training"

Each workload type has its own deployable_metadata_for_* function in replicate/web/models/logic.py that generates the appropriate configuration.

1. Deployment Predictions

What: Predictions running on a Deployment - a stable, long-lived identifier that routes to a backing model version. The backing version can be changed over time and doesn’t need to be owned by the same account as the deployment.

Characteristics:

  • Persistent deployment entity (configuration/routing), but infrastructure can scale to 0 replicas
  • Custom configuration per deployment that can override many version-level settings
  • Uses dedicated deployment key for consistent routing

Configuration: deployable_metadata_for_deployment_prediction()

Code references:

  • Deployment model: replicate/web/models/models/deployment.py
  • Kind validation: logic.py:1156 - asserts kind == DeployableKind.DEPLOYMENT_PREDICTION and not deployment.used_by_model

Queue behavior: Standard shuffle-sharded queues per deployment

2. Function Predictions (Pipelines/Procedures)

What: Predictions for multi-step workflows (Replicate Pipelines). Function predictions run on a shared container image with CPU-only hardware that “swaps” in procedure source code at prediction time. Similar to hotswaps but specifically for CPU-only Python code rather than GPU model weights.

Characteristics:

  • Run on CPU hardware (no GPU)
  • Share a base container image across procedures
  • Procedure source code is swapped in at runtime (analogous to weight swapping in hotswaps)
  • Part of a larger multi-step workflow
  • Uses AbstractProcedure model, not Version

Configuration: deployable_metadata_for_procedure_prediction()

Code references:

  • Procedure model: replicate/web/models/models/procedure.py
  • Kind validation: logic.py:1165 - when kind == DeployableKind.FUNCTION_PREDICTION, asserts deployment.used_by_model

Director configuration: Director runs procedures with DIRECTOR_JOB_KIND=procedure (director/config.go:37)

Queue behavior: Standard queues, no special routing

3. Version Predictions

What: Predictions running directly on a model version (not through a deployment).

Characteristics:

  • Ephemeral infrastructure (scaled up/down based on demand)
  • Configuration comes from version’s current_prediction_deployable_config
  • Two sub-types: normal and hotswap (see below)

Configuration: deployable_metadata_for_version_prediction()

3a. Normal Version Predictions

What: Standard version predictions without hotswapping.

Characteristics:

  • One container image per version
  • Standard queue routing
  • prefer_same_stream = False (default)

Code: logic.py:1214-1222

if not version.is_hotswappable:
    metadata = DeployableConfigSerializer(
        version.current_prediction_deployable_config
    ).data

3b. Hotswap Version Predictions

What: Versions that share a base container but load different weights at runtime. Multiple “hotswap versions” can run on the same pod by swapping weights instead of restarting containers.

Characteristics:

  • Multiple versions share the same base Docker image
  • Each version has additional_weights (weights loaded at runtime)
  • Base version must be public, non-virtual, and accept Replicate weights
  • prefer_same_stream = True - workers preferentially consume from the same stream to optimize weight locality
  • Uses base version’s deployment key for infrastructure sharing

When a version is hotswappable (version.py:586):

def is_hotswappable(self) -> bool:
    if not self.additional_weights:
        return False
    if not self.base_version:
        return False
    if self.base_docker_image_id != self.base_version.docker_image_relation_id:
        return False
    return self.base_version.is_valid_hotswap_base

Configuration: logic.py:1229-1244

deployable_config_fields = {
    ...
    "docker_image": version.base_version.docker_image_relation,
    "fuse_config": None,
    "key": version.base_version.key_for_hotswap_base_predictions,
    "prefer_same_stream": True,  # KEY DIFFERENCE
}

Queue behavior:

  • Shuffle-sharded queues (like all workloads)
  • DIRECTOR_PREFER_SAME_STREAM=true set for hotswap versions
  • Director workers repeatedly consume from the same stream before checking others
  • Optimizes for weight caching locality (reduce weight download/loading overhead)

Why prefer_same_stream matters:

  • Hotswap versions load different weights into the same container
  • If a worker has already loaded weights for version A, it’s faster to keep processing version A predictions
  • Without prefer_same_stream, workers round-robin across all streams, losing weight locality benefits
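The stream-selection policy above can be sketched as follows (illustrative only — Director's actual consumption loop in Go is more involved):

```python
# Sketch: with prefer_same_stream, a worker keeps draining its last stream
# while it has pending work (preserving weight locality), falling back to
# any other non-empty stream only when it runs dry.
def pick_stream(streams, last, prefer_same_stream):
    """streams: stream name -> pending message count."""
    if prefer_same_stream and last and streams.get(last, 0) > 0:
        return last  # weights for this version are already loaded
    for name, pending in streams.items():
        if pending > 0:
            return name
    return None  # nothing to do
```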

Code references:

4. Version Trainings

What: Training jobs for versions marked as trainable.

Characteristics:

  • Uses separate current_training_deployable_config (not prediction config)
  • Longer timeouts and different resource requirements
  • Different billing model
  • Director runs with DIRECTOR_JOB_KIND=training

Configuration: deployable_metadata_for_version_training()

Code references:

Summary Table

| Workload Type | Infrastructure | Config Source | prefer_same_stream | Director Job Kind |
| --- | --- | --- | --- | --- |
| Deployment Prediction | Persistent | Deployment config | False | prediction |
| Function Prediction | Ephemeral | Procedure config | False | procedure |
| Version Prediction (normal) | Ephemeral | Version prediction config | False | prediction |
| Version Prediction (hotswap) | Shared base | Base version config + weights | True | prediction |
| Version Training | Ephemeral | Version training config | False | training |

Key Takeaways

  1. Replicate has explicit workload type separation with different configurations and behaviors per type
  2. Hotswap predictions are unique - they’re the only workload using prefer_same_stream=True for weight locality optimization
  3. Function predictions use code swapping - similar to hotswaps but for CPU-only Python code instead of GPU model weights
  4. Each workload type has distinct characteristics - different infrastructure models, configuration sources, and queue behaviors

These workload distinctions will be referenced throughout the document when discussing queue behavior, timeouts, scaling, and other operational characteristics.