Request Lifecycle Management
Overview
A prediction or inference request touches different components on each platform. The diagrams below show the simplified full path from client to inference server and back.
Replicate
```mermaid
graph TD
    Client([Client])
    Web[web]
    subgraph GPU cluster
        API[api]
        Redis[(redis)]
        Director[director]
        Cog[cog]
    end
    Client -->|HTTP| API
    Web -.->|playground| API
    API -->|LPUSH / XADD| Redis
    Redis -->|XREADGROUP| Director
    Director -->|HTTP| Cog
    Cog -->|response| Director
    Director -->|webhook| API
    API -->|poll / webhook| Client
```
Key points: the request is asynchronous by default. The client submits a prediction, the API enqueues it, and Director picks it up later. The client polls, receives a webhook when the result is ready, or opts into “blocking” mode at the per-request level. Streaming predictions respond over SSE but still enter through the same queue.
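The submit-then-watch pattern this implies can be sketched as a small polling loop. The fetch callable is injected so no particular HTTP client is assumed; the terminal status names match Replicate's documented prediction states, but the helper itself is illustrative, not part of any SDK.

```python
import time

# Replicate's documented terminal prediction statuses.
TERMINAL_STATES = {"succeeded", "failed", "canceled"}

def poll_until_done(get_prediction, interval_s=1.0, timeout_s=300.0, sleep=time.sleep):
    """Poll a prediction until it reaches a terminal state.

    `get_prediction` is any callable returning the latest prediction dict
    (e.g. a GET against the prediction's URL); it is injected here so the
    loop can be exercised without a network.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        prediction = get_prediction()
        if prediction.get("status") in TERMINAL_STATES:
            return prediction
        if time.monotonic() > deadline:
            raise TimeoutError("prediction did not reach a terminal state in time")
        sleep(interval_s)
```

A webhook subscriber replaces this loop entirely: the API calls back when the prediction completes, so the client holds no connection and runs no timer.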
Workers AI
```mermaid
graph TD
    Client([Client])
    WCE[worker-constellation-entry]
    CE[constellation-entry]
    CS[constellation-server]
    subgraph GPU cluster
        GPU[GPU container]
    end
    Client -->|HTTP| WCE
    WCE -->|HTTP| CE
    CE -->|pipefitter / HTTP| CS
    CS -->|HTTP| GPU
    GPU -->|response| CS
    CS -->|response| CE
    CE -->|response| WCE
    WCE -->|response| Client
```
Key points: the request is synchronous. The client holds an open HTTP connection while the request flows through the stack to a GPU container and back. There is no persistent queue — requests wait in bounded in-memory queues or get rejected with a 429 if no capacity is available.
Replicate Request Processing
The Replicate request path has three major phases: API-side routing and enqueue, transient queuing in Redis, and Director-side dequeue and execution. Durable state lives in PostgreSQL (via web), not in the queue. The sections below cover each phase in detail.
Behavior varies by workload type (see Workload Types appendix). All workload types share the same core queue and worker infrastructure but differ in configuration.
Queue and Message Handling
Each GPU provider’s managed Kubernetes cluster runs its own Redis
cluster (with sentinel for HA). Director uses Redis Streams with
consumer groups for request queuing. Each deployable has its own
stream — the API pushes prediction requests to the stream, and
Director pods consume from it. Queue names combine a workload-type
prefix (input:prediction:dp-, input:prediction:fp-, etc.) with
the deployable’s key.
Queues use shuffle sharding by default (since October 2024) — each
tenant is assigned a subset of stream shards, providing fairness by
preventing one account’s batch from blocking others. Hotswap
predictions enable PreferSameStream for weight cache locality.
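A minimal sketch of how shuffle sharding can assign each tenant a stable subset of shards. The hashing scheme and parameters below are illustrative, not the production assignment; the point is that each tenant deterministically lands on a small, mostly-disjoint subset, so one tenant's burst cannot saturate every shard another tenant uses.

```python
import hashlib
import random

def tenant_shards(tenant_id: str, total_shards: int, shards_per_tenant: int) -> list[int]:
    """Deterministically assign a tenant a stable subset of shards.

    A keyed hash of the tenant ID seeds a per-tenant shuffle of all
    shards; the first k entries become that tenant's shard set.
    """
    seed = int.from_bytes(hashlib.sha256(tenant_id.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    shards = list(range(total_shards))
    rng.shuffle(shards)
    return sorted(shards[:shards_per_tenant])
```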
See Queue Management for full details on queue internals: MessageQueue fetcher/claimer goroutines, sharded queues, blue/green migration via MultiMessageQueue, and stuck message cleanup.
Worker Loop
Each Director instance runs worker goroutines that poll the queue. The number of workers
is controlled by DIRECTOR_CONCURRENT_PREDICTIONS (default: 1). This is set by cluster
when the model’s Cog config specifies concurrency.max > 1
(cluster/pkg/kubernetes/deployable.go:1149-1153).
Most models run with the default of 1 worker.
The worker loop (director/worker.go:134) continuously:
- Polls Redis Stream for messages using consumer groups
- Checks if prediction execution is allowed (spending validation, principal checks)
- Calculates effective deadline from prediction metadata, deployment lifetime, and timeout config
- Creates a Tracker instance for state management
- Executes prediction via Executor interface (which calls into Cog)
- Handles terminal states (success, failure, cancellation)
- Acknowledges message in Redis
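The steps above can be condensed into a single loop iteration. The queue, validator, and executor below are hypothetical stand-ins for Director's real Go interfaces; the sketch's only claim is the ordering: acknowledge the message only after a terminal outcome is handled, so a crash mid-execution leaves it claimable by another worker.

```python
def run_worker_once(queue, is_allowed, compute_deadline, execute, now):
    """One iteration of a Director-style worker loop (simplified sketch).

    queue.poll() returns a message (with a prediction dict as .body) or
    None when the stream is empty.
    """
    msg = queue.poll()
    if msg is None:
        return None
    prediction = msg.body
    if not is_allowed(prediction):           # spending / principal checks
        queue.ack(msg)
        return "rejected"
    deadline = compute_deadline(prediction)  # min of metadata, lifetime, timeout
    if deadline is not None and now() > deadline:
        queue.ack(msg)
        return "expired"
    state = execute(prediction, deadline)    # calls into Cog via the Executor
    queue.ack(msg)                           # terminal state reached: ACK
    return state
```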
The worker loop behavior is the same across all workload types. What differs is the job kind passed to the executor:
- Predictions (deployment, version): DIRECTOR_JOB_KIND=prediction (default)
- Trainings: DIRECTOR_JOB_KIND=training
- Procedures (function predictions): DIRECTOR_JOB_KIND=procedure
Note: Internally, Director converts procedure to prediction when sending to Cog
because the API doesn’t yet support the procedure value
(director/worker.go:731-737).
Code references:
- Worker implementation: director/worker.go
- Execution validation: director/worker.go:34-48
- Deadline calculation: director/worker.go:63-107
- Job kind configuration: director/config.go:37
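The deadline step can be illustrated as taking the minimum of whichever bounds are present. The parameter names here are hypothetical; the real calculation lives in director/worker.go:63-107.

```python
def effective_deadline(now, started_at=None, timeout_s=None,
                       deployment_expires_at=None, max_runtime_s=None):
    """Effective deadline as the minimum of the configured bounds (sketch).

    Any input may be absent; returns None when no bound applies.
    Times are plain epoch-seconds floats for simplicity.
    """
    candidates = []
    if timeout_s is not None:
        candidates.append((started_at or now) + timeout_s)   # per-prediction timeout
    if deployment_expires_at is not None:
        candidates.append(deployment_expires_at)             # deployment lifetime
    if max_runtime_s is not None:
        candidates.append(now + max_runtime_s)               # global runtime cap
    return min(candidates) if candidates else None
```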
Tracker and State Transitions
The Tracker
(director/tracker.go)
manages state transitions and notifications. It defines four webhook event types:
- start - Prediction begins execution
- output - Intermediate outputs during execution
- logs - Log messages during execution
- completed - Prediction reaches terminal state
The Tracker maintains:
- Prediction state (struct at director/tracker.go:36-69)
- Subscriber functions that receive state updates
- Pending webhook events (if notifications are suppressed temporarily)
- Upload tracking for async file completions
- Moderation state (if content moderation is enabled)
Webhooks are sent to the Replicate API at state transitions. The webhook sender uses a
PreemptingExecutor
(director/executor.go)
that may drop intermediate webhooks if new updates arrive before the previous one is
sent - this is intentional since the latest state supersedes earlier ones. Terminal
webhooks are always delivered.
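The latest-wins behavior can be sketched as a sender with a single pending slot for intermediate events and an always-delivered list for terminal ones. This is a simplification of the Go implementation; class and method names are illustrative.

```python
class PreemptingSender:
    """Latest-wins webhook slot (sketch of the preempting-executor idea).

    Intermediate updates overwrite any unsent predecessor, so only the
    newest state is delivered; terminal events are never dropped.
    """
    def __init__(self, deliver):
        self._deliver = deliver
        self._pending = None     # latest unsent intermediate event, if any
        self._terminal = []      # terminal events, always delivered

    def submit(self, event, terminal=False):
        if terminal:
            self._terminal.append(event)
        else:
            self._pending = event   # drops any older unsent intermediate

    def flush(self):
        """Deliver the surviving intermediate event, then all terminal ones."""
        if self._pending is not None:
            self._deliver(self._pending)
            self._pending = None
        for event in self._terminal:
            self._deliver(event)
        self._terminal.clear()
```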
Code references:
- Tracker struct: director/tracker.go:36-69
- Webhook event types: cog/types.go:45-55
Streaming
Director supports Server-Sent Events (SSE) for streaming outputs. Streaming is enabled
when the API includes a stream_publish_url in the prediction’s internal metadata
(cog/types.go:230).
The StreamClient
(director/streaming.go)
publishes output events to this URL as they occur.
Concurrency Control
Director supports concurrent predictions within a single container via
DIRECTOR_CONCURRENT_PREDICTIONS. See Concurrency Deep Dive
for details on configuration and usage patterns.
Workers AI Request Processing
Routing Architecture
Workers AI requests flow through multiple components:
- worker-constellation-entry: Pipeline Worker (TypeScript) that receives API requests from the SDK or REST API. Handles input validation, model lookup, and request formatting before forwarding to constellation-entry (apps/worker-constellation-entry/src/worker.ts).
- constellation-entry: Rust service running on every metal. Fetches model config from QS, selects target colos using routing algorithms, resolves constellation-server addresses, and forwards via pipefitter or HTTP (gitlab.cfdata.org/cloudflare/ai/constellation-entry).
- constellation-server: Colo-level load balancer (Rust). Manages permits and request queues per model, selects a GPU container via service discovery, handles inference forwarding and cross-colo failover (gitlab.cfdata.org/cloudflare/ai/constellation-server).
- GPU container: Inference server running the model.
Wiki references:
- System Overview - Component descriptions
- Smart(er) routing - Routing architecture
- Constellation Server Updates Q3 2025 - Detailed request flow
constellation-entry Routing
constellation-entry fetches model configuration from Quicksilver (distributed KV
store), which includes the list of colos where the model is deployed
(src/model_config.rs).
It then applies a routing algorithm to select target colos. Routing algorithms include:
- Random: Random selection among available colos (default)
- LatencySensitive: Prefer nearby colos, with step-based selection
- ForcedColo: Override to a specific colo (for debugging/testing)
- ForcedResourceFlow: Use resource-flow routing (src/resource_flow.rs)
Routing uses a step-based algorithm (StepBasedRouter) that generates a prioritized list
of colos. The algorithm can be customized per-model via routing_params in the model
config.
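A step-based router can be sketched as bucketing colos into latency "steps" and shuffling within each bucket, so nearby colos win but load still spreads within a step. The step_ms parameter below stands in for per-model routing_params and is not the real parameter name; with no latency data the sketch degrades to random selection, matching the default algorithm.

```python
import random

def prioritized_colos(colos, latencies_ms=None, step_ms=20, rng=None):
    """Build a prioritized colo list (sketch of a step-based router).

    Colos with unknown latency land in a final catch-all bucket.
    """
    rng = rng or random.Random()
    if not latencies_ms:
        order = list(colos)
        rng.shuffle(order)          # Random algorithm: no latency data
        return order
    buckets = {}
    for colo in colos:
        lat = latencies_ms.get(colo)
        step = 10**9 if lat is None else int(lat // step_ms)
        buckets.setdefault(step, []).append(colo)
    result = []
    for step in sorted(buckets):    # nearest step first
        group = buckets[step]
        rng.shuffle(group)          # spread load within a step
        result.extend(group)
    return result
```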
Once colos are selected, constellation-entry resolves constellation-server addresses via
QS or DNS
(src/cs_resolver.rs)
and forwards the request via pipefitter or HTTP
(src/pipefitter_transport.rs).
Code references:
- Repository: gitlab.cfdata.org/cloudflare/ai/constellation-entry
- Main entry point: src/main.rs
- Colo resolver: src/cs_resolver.rs
constellation-server Load Balancing
constellation-server runs in each colo and handles:
Request handling
(src/main.rs,
server/src/lib.rs):
- Listens on Unix domain socket (for pipefitter) and TCP port (src/cli.rs:23,31)
- Extracts routing metadata: model ID, account ID, fallback routing info
- Tracks distributed tracing spans
- Emits metrics to Ready Analytics
Endpoint load balancing and concurrency
(server/src/endpoint_lb.rs):
- Manages EndpointGroup per model with permit-based concurrency control
- Per-model concurrency limits configured via Config API
- Permit handling in server/src/endpoint_lb/permits.rs
- Explicit request queue (ModelRequestQueue) with FIFO or Fair algorithms (server/src/endpoint_lb/queue.rs)
Service discovery
(service-discovery/src/lib.rs):
- Finds available GPU container endpoints via Consul (service-discovery/src/consul.rs)
- Maintains a world map of available endpoints
Inference forwarding
(server/src/inference/mod.rs):
- Regular HTTP requests: Buffers and forwards to GPU container with retry logic
- WebSocket requests: Upgrades connection and relays bidirectionally (no retries)
- Backend-specific handlers for Triton, TGI, TEI, and generic HTTP (PipeHttp — used for vLLM and others) in server/src/inference/
Failover and forwarding
(server/src/colo_forwarding.rs):
- If the target returns an error or is unavailable, may forward to a different colo
- Tracks inflight requests via headers (cf-ai-requests-inflight) to coordinate across instances
- For pipefitter: Returns 500 with headers telling pipefitter to retry different colos
- Handles graceful draining: continues accepting requests while failing health checks
Code references:
- Repository: gitlab.cfdata.org/cloudflare/ai/constellation-server
- CLI args (timeouts, interfaces): src/cli.rs
Wiki references:
- Constellation Server Updates Q3 2025 - Detailed request flow
- Reliable constellation-server failover
- Constellation server graceful restarts
Queuing and Backpressure
When GPU containers are busy, backpressure propagates through the stack:
- constellation-server: find_upstream() attempts to acquire a permit for a GPU endpoint. If all permits are taken, the request enters a bounded local colo queue (ModelRequestQueue). The queue is a future that resolves when a permit frees up (via tokio::sync::Notify). If the queue is also full, the request gets NoSlotsAvailable → ServerError::OutOfCapacity. OOC may also trigger forwarding to another constellation-server in the same colo (server/src/endpoint_lb/endpoint_group.rs:200-277).
- constellation-entry: Receives OOC from constellation-server (error codes 3033, 3040, 3047, 3048). Has a retry loop with separate budgets for errors (up to 3) and OOC (varies by request size). For small requests, pipefitter handles cross-colo forwarding. For large requests (>256 KiB), constellation-entry retries across target colos itself. All OOC variants are mapped to generic 3040 before returning to the client (proxy_server.rs:1509-1600, errors.rs:62-67).
- worker-constellation-entry: Has an optional retry loop on 429/424 responses, but outOfCapacityRetries defaults to 0 (configurable per-model via sdkConfig). By default, OOC goes straight to the client (ai/session.ts:101, 137-141).
- Client: Receives HTTP 429 with "Capacity temporarily exceeded, please try again." (internal code 3040).
There is no persistent queue in the Workers AI stack. Backpressure is handled through permit-based admission control, bounded in-memory queues, cross-colo retry, and ultimately rejection to the client. This contrasts with Replicate’s Redis Streams-based queue, where requests persist until a Director instance dequeues them.
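The admission ladder (permit, else bounded queue, else reject) can be condensed into a small synchronous model. The real implementation is async Rust built on tokio permits and Notify; the class and field names below are illustrative.

```python
from collections import deque

class OutOfCapacity(Exception):
    """Maps to HTTP 429 / internal code 3040 at the edge."""

class ModelAdmission:
    """Permit-based admission with a bounded overflow queue (sketch)."""

    def __init__(self, max_permits, max_queue):
        self.max_permits = max_permits
        self.max_queue = max_queue
        self.used = 0               # permits currently held
        self.queue = deque()        # bounded local queue

    def admit(self, request_id):
        if self.used < self.max_permits:
            self.used += 1
            return "running"
        if len(self.queue) < self.max_queue:
            self.queue.append(request_id)
            return "queued"
        raise OutOfCapacity(request_id)   # no permit, queue full: reject

    def release(self):
        """Free a permit; hand it to the oldest queued request, if any."""
        if self.queue:
            self.queue.popleft()          # queued request starts running
            return "promoted"
        self.used -= 1
        return "freed"
```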
State and Observability
Workers AI observability has two main channels from constellation-server:
Ready Analytics
(ready-analytics/src/):
Per-inference events sent via logfwdr (Unix socket). Each InferenceEvent includes:
- Request timing: request_time_ms, inference_time_ms, queuing_time_ms, time_to_first_token
- Billing: neurons, cost metric values
- Routing: colo_id, source_colo_id, colo_path, upstream_address
- Queue state: local_queue_size
- Flags: streamed, beta, external endpoint, queued in colo
See
ready-analytics/src/inference_event.rs
for full field list.
Prometheus metrics
(metrics/src/lib.rs):
Local metrics scraped by monitoring. Includes:
- requests (by status), errors (by internal/HTTP code)
- total_permits, mean_used_permits, mean_queue_size, full_fraction (per model)
- infer_per_backend, infer_per_health_state
- forward_failures (per target colo)
- backoff_removed_endpoints (endpoints in backoff state)
Note: “Workers Analytics” in the public docs refers to analytics for Workers code execution, not GPU inference. GPU inference observability flows through Ready Analytics.
Streaming
constellation-server supports two streaming mechanisms:
HTTP streaming
(server/src/inference/pipe_http.rs):
For PipeHttp backends, the response body can be InferenceBody::Streaming (lines
222-230), which passes the Incoming body through without buffering via chunked transfer
encoding. The permit is held until the connection completes (lines 396-397 for HTTP/1,
line 472 for HTTP/2). If the backend sends SSE-formatted data (e.g., TGI’s streaming
responses), constellation-server passes it through transparently without parsing.
WebSocket
(server/src/inference/mod.rs:180-400):
Bidirectional WebSocket connections are handled by handle_ws_inference_request. After
the HTTP upgrade, handle_websocket_request relays messages between client and inference
server. A max connection duration is enforced (max_connection_duration_s, line 332-338).
Loopback SSE
(server/src/loopback/):
Separate from client-facing streaming. constellation-server maintains SSE connections to
inference servers that support async batch processing. These connections receive
completion events (InferenceSuccess, InferenceError, WebsocketFinished, GpuFree)
which trigger permit release and async queue acknowledgment. See
loopback/sse.rs
for the SSE client implementation.
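The loopback path can be sketched as a dispatcher that frees the GPU permit on any of the listed terminal events and additionally acknowledges the async queue entry when one exists. The event names come from the list above; the field names (endpoint, request_id) and the permits/queue interfaces are hypothetical stand-ins.

```python
TERMINAL_EVENTS = {"InferenceSuccess", "InferenceError", "WebsocketFinished", "GpuFree"}

def handle_loopback_event(event, permits, async_queue):
    """Dispatch one loopback completion event (sketch).

    Returns True if the event was recognized and handled.
    """
    kind = event["type"]
    if kind not in TERMINAL_EVENTS:
        return False
    permits.release(event["endpoint"])        # GPU slot is free again
    if kind in ("InferenceSuccess", "InferenceError") and event.get("request_id"):
        async_queue.ack(event["request_id"])  # async batch entry is done
    return True
```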
Concurrency Control
constellation-server uses permit-based concurrency control per model via
EndpointLoadBalancer
(server/src/endpoint_lb.rs).
Concurrency limits are configured in Config API. When local queuing is enabled, requests
wait in an explicit ModelRequestQueue (FIFO or Fair algorithm) until a permit becomes
available. Different models can have different concurrency limits on the same server.
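A Fair dequeue can be sketched as per-account FIFOs visited round-robin, so one account's backlog cannot starve the others. This illustrates the algorithm family, not the queue.rs implementation; a FIFO algorithm would instead be a single shared deque.

```python
from collections import OrderedDict, deque

class FairQueue:
    """Per-account round-robin dequeue (sketch of a Fair algorithm)."""

    def __init__(self):
        self._per_account = OrderedDict()   # account -> FIFO of requests

    def enqueue(self, account, request):
        self._per_account.setdefault(account, deque()).append(request)

    def dequeue(self):
        """Pop from the front account, then rotate it to the back."""
        if not self._per_account:
            return None
        account, fifo = next(iter(self._per_account.items()))
        request = fifo.popleft()
        del self._per_account[account]
        if fifo:
            self._per_account[account] = fifo   # re-queue behind other accounts
        return request

    def __len__(self):
        return sum(len(q) for q in self._per_account.values())
```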
Key Differences
| Aspect | Replicate | Workers AI |
|---|---|---|
| Request model | Asynchronous by default (poll/webhook); sync opt-in via Prefer: wait | Synchronous by default; async opt-in via queueRequest |
| Queue persistence | Redis Streams — requests survive component restarts | No persistent queue — bounded in-memory queues, rejection on overflow |
| Backpressure | Queue absorbs bursts; Director dequeues when ready | Permit-based admission → local queue → cross-colo retry → 429 to client |
| State management | Director Tracker with webhook notifications (start, output, logs, completed) | Stateless request-response; observability via Ready Analytics events |
| Streaming | SSE via stream_publish_url; Director publishes output events | HTTP chunked transfer passthrough; WebSocket relay; loopback SSE for async batch |
| Routing | API routes to correct GPU cluster’s Redis; Director pulls from local queue | constellation-entry selects colos via routing algorithms; constellation-server selects GPU endpoint |
| Concurrency | DIRECTOR_CONCURRENT_PREDICTIONS per container (default 1) | Permit-based per model in constellation-server; FIFO or Fair queue algorithm |
| Fairness | Per-tenant shuffle sharding at queue level | Per-account round-robin in fair queue; per-account rate limiting |
| Transport | HTTP between Director and Cog | Pipefitter (small requests) or HTTP (>256 KiB) between constellation-entry and constellation-server |
| Failure handling | Webhook terminal states (failed/canceled); message ACK in Redis | Error codes propagated through stack; OOC triggers cross-colo retry |