Timeouts & Deadlines
Overview
Both platforms enforce timeouts on inference requests, but the designs reflect different assumptions about workload duration. Replicate has a multi-phase timeout system with separate controls for model setup, prediction execution, and overall lifetime — designed for workloads ranging from seconds to 24-hour training runs. Workers AI has a two-layer system: a per-model inference timeout in constellation-server (default 30 seconds) under a hard 4-minute ceiling in constellation-entry — optimized for fast, predictable inference.
Replicate Timeout System
Timeout behavior spans multiple components:
- replicate/web: defines timeout values per workload type in deployable metadata
- replicate/api: can inject per-prediction deadlines into prediction metadata
- replicate/cluster: translates deployable config into `DIRECTOR_*` environment variables when creating K8s deployments; provides default values when the deployable config doesn't specify them (pkg/config/config.go)
- director: reads env vars and enforces timeouts during execution
Configuration
Three timeout parameters control the prediction lifecycle. Each flows from replicate/web through replicate/cluster to director as an environment variable, and director reads them as flags (director/config.go:52-58).
Model Setup Timeout (DIRECTOR_MODEL_SETUP_TIMEOUT)
Time allowed for model container initialization. Applied during health check polling while waiting for the container to become ready, it measures the time from `StatusStarting` to completion. If exceeded, Director fails the prediction with setup logs and reports to the setup run endpoint (director/director.go:718-830).
Resolution chain:
- web: The `Model.setup_timeout` property returns `DEFAULT_MODEL_SETUP_TIMEOUT` (10 minutes) when the underlying DB field is None (models/model.py:1038-1042). Stored on `DeployableConfig.setup_timeout`, then serialized as `model_setup_timeout` with a multiplier and bonus applied (api_serializers.py:1678-1681).
- cluster: Reads `deployable.ModelSetupTimeout`. If nonzero, uses it; otherwise falls back to `config.ModelSetupTimeout` (10 minutes) (pkg/config/config.go:74, pkg/kubernetes/deployable.go:1191-1201).
- director: Reads the `DIRECTOR_MODEL_SETUP_TIMEOUT` env var. Its own flag default is 0 (disabled), but cluster always sets it.
Prediction Timeout (DIRECTOR_PREDICT_TIMEOUT)
Time allowed for prediction execution, starting from when execution begins (not including queue time). Acts as a duration-based fallback when no explicit deadline is set. Director enforces a minimum of 30 minutes if the value is configured to 0 or negative (director/director.go:285-288).
Resolution chain:
- web: `Model.prediction_timeout` (nullable duration). When set, stored on `DeployableConfig.run_timeout` and serialized as `prediction_timeout` (models/model.py:1010-1017, api_serializers.py:1702-1703). When None, the serialized `prediction_timeout` is omitted.
- cluster: `predictTimeout()` resolves the value with this priority (pkg/kubernetes/deployable.go:1886-1906):
  1. `deployable.PredictionTimeout` (from web, if set)
  2. Hardcoded per-account override map `userSpecificTimeouts` (pkg/kubernetes/deployable.go:88-116)
  3. `config.PredictTimeoutSeconds` (30 minutes) (pkg/config/config.go:71)
  4. For training workloads, the higher of the above and `TrainTimeoutSeconds` (24 hours)

  A mirrored per-account map exists in web's `USER_SPECIFIC_TIMEOUTS` (models/prediction.py:60-107) to prevent `terminate_stuck_predictions` from killing predictions that have these extended timeouts.
- director: Reads the `DIRECTOR_PREDICT_TIMEOUT` env var. Its own flag default is 1800s (30 minutes), but cluster always sets it.
Max Run Lifetime (DIRECTOR_MAX_RUN_LIFETIME)
Maximum time for the entire prediction lifecycle, including queue time and setup. Calculated from the `prediction.CreatedAt` timestamp.
There are two paths that produce a max run lifetime constraint — one via the deployable config (env var on the Director pod) and one via a per-request header that becomes a deadline on the prediction itself.
Deployable config path (env var):
- web: `DeployableConfig.max_run_lifetime` defaults to `DEFAULT_MAX_RUN_LIFETIME` (24 hours) at the DB level (models/deployable_config.py:259-261). For deployment predictions, `deployment.max_run_lifetime` can override this (logic.py:1175-1176). Serialized as `max_run_lifetime` (api_serializers.py:1675-1676).
- cluster: Reads `deployable.MaxRunLifetime`. If nonzero, uses it; otherwise falls back to `config.MaxRunLifetime` (24 hours) (pkg/config/config.go:75, pkg/kubernetes/deployable.go:1203-1213).
- director: Reads the `DIRECTOR_MAX_RUN_LIFETIME` env var. Its own flag default is 0 (disabled), but cluster always sets it.
Per-request path (Cancel-After header):
- api: The `Cancel-After` HTTP header on a prediction request is parsed as a duration (Go-style like `5m`, or bare seconds like `300`). Minimum 5 seconds (server/v1_prediction_handler.go:233-276).
- api: `calculateEffectiveDeadline()` picks the shorter of the request's `Cancel-After` value and the deployable metadata's `max_run_lifetime`, computes an absolute deadline from the prediction creation time, and sets it on `prediction.InternalMetadata.Deadline` (logic/prediction.go:1076-1109).
- director: This deadline is the "per-prediction deadline" checked first in the deadline priority (see Deadline Calculation below).
Deadline Calculation and Enforcement
Director calculates an effective deadline at dequeue time using this priority (director/worker.go:63-107):
1. Per-prediction deadline (`prediction.InternalMetadata.Deadline`)
2. Deployment deadline (`prediction.CreatedAt + MaxRunLifetime`, if configured)
3. Prediction timeout (fallback)
The earliest applicable deadline wins. At execution time (director/worker.go:516-528):
```go
deadline, source, ok := getEffectiveDeadline(prediction)
var timeoutDuration time.Duration
if ok {
    timeoutDuration = time.Until(deadline)
} else {
    timeoutDuration = d.predictionTimeout
}
timeoutTimer := time.After(timeoutDuration)
```
Timeout outcomes vary based on when and why the deadline is exceeded:
Before execution (deadline already passed at dequeue):
- Per-prediction deadline → aborted
- Deployment deadline → failed
During execution (timer fires while running):
- Per-prediction deadline → canceled
- Deployment deadline → failed
- Prediction timeout (fallback) → failed
Timeout Behavior Examples
Scenario 1: Version prediction with deployment deadline
- `DIRECTOR_MAX_RUN_LIFETIME=300` (5 minutes)
- `DIRECTOR_PREDICT_TIMEOUT=1800` (30 minutes)
- Prediction created at T=0, dequeued at T=60s
- Effective deadline: T=0 + 300s = T=300s
- Execution timeout: 300s - 60s = 240s remaining
Scenario 2: Prediction with explicit deadline
- Per-prediction deadline: T=180s (3 minutes from creation)
- `DIRECTOR_MAX_RUN_LIFETIME=600` (10 minutes)
- Prediction created at T=0, dequeued at T=30s
- Effective deadline: T=180s (per-prediction deadline wins)
- Execution timeout: 180s - 30s = 150s remaining
Scenario 3: Training with no deadlines
- `DIRECTOR_MAX_RUN_LIFETIME=0` (disabled)
- `DIRECTOR_PREDICT_TIMEOUT=1800` (30 minutes)
- No per-prediction deadline
- Effective deadline: None (uses fallback)
- Execution timeout: 1800s from execution start
Code references:
- Timeout configuration: director/config.go:52-58
- Deadline calculation: director/worker.go:63-107
- Execution timeout: director/worker.go:516-528
- Setup timeout: director/director.go:718-830
Workers AI Timeout System
Workers AI has a simpler timeout model than Replicate, but the request passes through multiple layers, each with its own timeout behavior.
Request Path Layers
A request from the internet traverses three services before reaching a GPU container:
- worker-constellation-entry (TypeScript Worker): The SDK-facing entry point. Calls constellation-entry via a Workers service binding (`this.binding.fetch(...)`) with no explicit timeout (ai/session.ts:131). See the note below on Workers runtime limits.
- constellation-entry (Rust, runs on metal): Wraps the entire request to constellation-server in a hardcoded 4-minute `tokio::time::timeout` (proxy_server.rs:99, proxy_server.rs:831-845). On timeout, returns `EntryError::ForwardTimeout`. This applies to both the HTTP and WebSocket paths.
- constellation-server (Rust, runs on GPU nodes): Enforces the per-model inference timeout (see below).
The 4-minute constellation-entry timeout acts as a hard ceiling. If a model's `inference_timeout` in constellation-server exceeds 240 seconds, constellation-entry kills the connection before constellation-server's own timeout fires. This makes constellation-entry the effective upper bound for any single inference request, unless its hardcoded constant is changed.
Workers runtime limits: worker-constellation-entry is a standard Worker deployed on Cloudflare's internal AI account (not a pipeline First Party Worker). It does not set `limits.cpu_ms` in its `wrangler.toml`, so it gets the default 30-second CPU time limit. However, CPU time only counts active processing; time spent awaiting the service binding fetch to constellation-entry is I/O wait and does not count (docs). Since the Worker is almost entirely I/O-bound, the CPU limit is unlikely to be reached. Wall-clock duration has no hard cap for HTTP requests, though Workers can be evicted during runtime restarts (~once/week, with a 30-second drain). In practice, the Workers runtime layer is transparent for timeout purposes: constellation-entry's 4-minute timeout fires well before any platform limit would.
Configuration Levels
constellation-server Default
(constellation-server/src/cli.rs:107-109):
- `--infer-request-timeout-secs`: CLI argument, default 30 seconds
- Acts as the minimum timeout floor for all inference requests
- Applied globally across all models served by the constellation-server instance
Per-Model Configuration
(model-repository/src/config.rs:143):
- `inference_timeout`: Model-specific timeout in seconds (default: 0)
- Configured via the Config API (Deus) and fetched from Quicksilver (distributed KV store)
- Part of the `WorkersAiModelConfig` structure
- Zero value means "use the constellation-server default"
Timeout Resolution
When processing an inference request, constellation-server resolves the timeout (server/src/lib.rs):

```rust
let min_timeout = state.infer_request_timeout_secs as u64;
let mut inference_timeout = Duration::from_secs(state.infer_request_timeout_secs as u64);
if let Some(model_config) = state.endpoint_lb.get_model_config(&model_id) {
    inference_timeout = Duration::from_secs(
        model_config.ai_config.inference_timeout.max(min_timeout),
    );
}
```
The effective timeout is `max(model_config.inference_timeout, infer_request_timeout_secs)`. This ensures:
- Models can extend their timeout beyond the 30-second default
- Models cannot reduce their timeout below the constellation-server minimum
- Unconfigured models (`inference_timeout=0`) use the 30-second default
Enforcement
constellation-server enforces the timeout by wrapping the entire inference call in `tokio::time::timeout(inference_timeout, handle_inference_request(...))` (server/src/lib.rs:588-599).
When the timeout elapses, the future is dropped, which closes the connection to the GPU container. The error maps to `ServerError::InferenceTimeout` (line 603).
The other end of the dropped connection varies by model backend
(service-discovery/src/lib.rs:39-47):
- Triton: Upstream NVIDIA `tritonserver` binary (launched via `model-greeter`). Disconnect behavior depends on Triton's own implementation.
- TGI/TEI: Upstream HuggingFace binaries. Disconnect behavior depends on the upstream Rust server implementation.
- PipeHttp/PipeHttpLlm: Custom Python framework (`WorkersAiApp` in `ai_catalog_common`, FastAPI-based). The synchronous HTTP path (`_generate`) does not explicitly cancel the `raw_generate` coroutine on client disconnect; the inference runs to completion (catalog/common/.../workers_ai_app/app.py:880-896). WebSocket connections do cancel processing tasks in their `finally` block (catalog/common/.../workers_ai_app/app.py:536-553).
The constellation-server timeout covers:
- Time waiting in the constellation-server queue (permit acquisition)
- Forwarding to the GPU container
- Inference execution in the backend (Triton, TGI, TEI)
- Response transmission back through constellation-server
But constellation-entry’s 4-minute timeout wraps the entire round-trip to constellation-server, so it is the effective ceiling regardless of per-model config.
Unlike Director’s system, there is no separate setup timeout. Model containers are managed by the orchestrator (K8s) independently of request processing. Container initialization and readiness are handled by service discovery and health checks, not request-level timeouts.
Timeout Behavior Examples
Scenario 1: Text generation model with default timeout
- Model config: `inference_timeout=0` (not configured)
- constellation-server: `infer_request_timeout_secs=30`
- Effective timeout: 30 seconds
Scenario 2: Image generation model with extended timeout
- Model config: `inference_timeout=120` (2 minutes)
- constellation-server: `infer_request_timeout_secs=30`
- Effective timeout: 120 seconds (model config wins)
Scenario 3: Model timeout exceeds constellation-entry ceiling
- Model config: `inference_timeout=300` (5 minutes)
- constellation-server: `infer_request_timeout_secs=30`
- constellation-entry: 240 seconds (hardcoded)
- constellation-server effective timeout: 300 seconds
- Actual outcome: constellation-entry kills the connection at 240 seconds
Scenario 4: Attempted timeout reduction (not allowed)
- Model config: `inference_timeout=10` (10 seconds)
- constellation-server: `infer_request_timeout_secs=30`
- Effective timeout: 30 seconds (constellation-server minimum enforced)
Code references:
- worker-constellation-entry service binding call (no timeout): ai/session.ts:131
- constellation-entry 4-minute timeout: proxy_server.rs:99, proxy_server.rs:831-845
- constellation-server CLI configuration: cli.rs:107-109
- Model config structure: model-repository/src/config.rs:100-150
- Timeout resolution and enforcement: server/src/lib.rs
Key Differences
Complexity and Granularity:
- Director: Multi-phase timeout system with separate controls for setup (model initialization) and execution (inference), plus deployment-level lifetime limits
- Workers AI: Two-layer timeout — constellation-entry imposes a hard 4-minute ceiling, constellation-server applies per-model timeouts within that ceiling
Deadline Calculation:
- Director: Priority-based deadline system with per-prediction, deployment, and fallback timeouts. Calculates earliest applicable deadline at dequeue time, accounting for time already spent in queue
- Workers AI: Simple max() operation between model config and constellation-server default, applied at request receipt time
Queue Time Handling:
- Director: `MaxRunLifetime` includes queue time; the execution timeout accounts for time already elapsed since creation
- Workers AI: The timeout starts when constellation-server receives the request, so it inherently includes queue time
Setup vs Runtime:
- Director: Explicit `ModelSetupTimeout` for container initialization, separate from the prediction execution timeout
- Workers AI: No request-level setup timeout; container lifecycle is managed independently by the orchestrator
Configuration Source:
- Director: Environment variables set by cluster service during deployment creation
- Workers AI: CLI argument (constellation-server) + distributed config store (Quicksilver/Config API)
Default Values:
- Director: 30 minutes prediction timeout (generous for long-running inference), setup and lifetime timeouts disabled by default
- Workers AI: 30 seconds base timeout (optimized for fast inference), extended per-model via config, hard ceiling of 4 minutes at constellation-entry
Minimum Enforcement:
- Director: 30-minute minimum enforced for prediction timeout if misconfigured
- Workers AI: constellation-server minimum enforced as floor, models can only extend
Use Case Alignment:
- Director: Designed for variable-length workloads including multi-hour training runs; flexible per-deployment configuration
- Workers AI: Optimized for inference workloads with known latency profiles; centralized model-specific configuration