
Timeouts & Deadlines

Overview

Both platforms enforce timeouts on inference requests, but the designs reflect different assumptions about workload duration. Replicate has a multi-phase timeout system with separate controls for model setup, prediction execution, and overall lifetime — designed for workloads ranging from seconds to 24-hour training runs. Workers AI has a two-layer system: a per-model inference timeout in constellation-server (default 30 seconds) under a hard 4-minute ceiling in constellation-entry — optimized for fast, predictable inference.


Replicate Timeout System

Timeout behavior spans multiple components:

  • replicate/web - defines timeout values per workload type in deployable metadata
  • replicate/api - can inject per-prediction deadlines into prediction metadata
  • replicate/cluster - translates deployable config into DIRECTOR_* environment variables when creating K8s deployments; provides default values when the deployable config doesn’t specify them (pkg/config/config.go)
  • director - reads env vars and enforces timeouts during execution

Configuration

Three timeout parameters control the prediction lifecycle. Each flows from replicate/web through replicate/cluster to director as an environment variable, and director reads them as flags (director/config.go:52-58).

Model Setup Timeout (DIRECTOR_MODEL_SETUP_TIMEOUT)

Time allowed for model container initialization. Applied during health check polling while waiting for the container to become ready. Measures time from StatusStarting to completion. If exceeded, Director fails the prediction with setup logs and reports to the setup run endpoint (director/director.go:718-830).

Resolution chain:

  1. web: Model.setup_timeout property — returns DEFAULT_MODEL_SETUP_TIMEOUT (10 minutes) when the underlying DB field is None (models/model.py:1038-1042). Stored on DeployableConfig.setup_timeout, then serialized as model_setup_timeout with a multiplier and bonus applied (api_serializers.py:1678-1681).
  2. cluster: Reads deployable.ModelSetupTimeout. If nonzero, uses it; otherwise falls back to config.ModelSetupTimeout (10 minutes) (pkg/config/config.go:74, pkg/kubernetes/deployable.go:1191-1201).
  3. director: Reads DIRECTOR_MODEL_SETUP_TIMEOUT env var. Own flag default is 0 (disabled), but cluster always sets it.

Prediction Timeout (DIRECTOR_PREDICT_TIMEOUT)

Time allowed for prediction execution, starting from when execution begins (not including queue time). Acts as a duration-based fallback when no explicit deadline is set. Director enforces a minimum of 30 minutes if configured to 0 or negative (director/director.go:285-288).

Resolution chain:

  1. web: Model.prediction_timeout (nullable duration). When set, stored on DeployableConfig.run_timeout and serialized as prediction_timeout (models/model.py:1010-1017, api_serializers.py:1702-1703). When None, the serialized prediction_timeout is omitted.

  2. cluster: predictTimeout() resolves the value with this priority (pkg/kubernetes/deployable.go:1886-1906):

    1. deployable.PredictionTimeout (from web, if set)
    2. Hardcoded per-account override map userSpecificTimeouts (pkg/kubernetes/deployable.go:88-116)
    3. config.PredictTimeoutSeconds (30 minutes) (pkg/config/config.go:71)
    4. For training workloads, uses the higher of the above and TrainTimeoutSeconds (24 hours)

    A mirrored per-account map exists in web’s USER_SPECIFIC_TIMEOUTS (models/prediction.py:60-107) to prevent terminate_stuck_predictions from killing predictions that have these extended timeouts.

  3. director: Reads DIRECTOR_PREDICT_TIMEOUT env var. Own flag default is 1800s (30 minutes), but cluster always sets it.

Max Run Lifetime (DIRECTOR_MAX_RUN_LIFETIME)

Maximum time for the entire prediction lifecycle, including queue time and setup. Calculated from the prediction.CreatedAt timestamp.

There are two paths that produce a max run lifetime constraint — one via the deployable config (env var on the Director pod) and one via a per-request header that becomes a deadline on the prediction itself.

Deployable config path (env var):

  1. web: DeployableConfig.max_run_lifetime — defaults to DEFAULT_MAX_RUN_LIFETIME (24 hours) at the DB level (models/deployable_config.py:259-261). For deployment predictions, deployment.max_run_lifetime can override this (logic.py:1175-1176). Serialized as max_run_lifetime (api_serializers.py:1675-1676).
  2. cluster: Reads deployable.MaxRunLifetime. If nonzero, uses it; otherwise falls back to config.MaxRunLifetime (24 hours) (pkg/config/config.go:75, pkg/kubernetes/deployable.go:1203-1213).
  3. director: Reads DIRECTOR_MAX_RUN_LIFETIME env var. Own flag default is 0 (disabled), but cluster always sets it.

Per-request path (Cancel-After header):

  1. api: The Cancel-After HTTP header on a prediction request is parsed as a duration (Go-style like 5m or bare seconds like 300). Minimum 5 seconds (server/v1_prediction_handler.go:233-276).
  2. api: calculateEffectiveDeadline() picks the shorter of the request’s Cancel-After value and the deployable metadata’s max_run_lifetime, computes an absolute deadline from prediction creation time, and sets it on prediction.InternalMetadata.Deadline (logic/prediction.go:1076-1109).
  3. director: This deadline is the “per-prediction deadline” checked first in the deadline priority (see Deadline Calculation below).

Deadline Calculation and Enforcement

Director calculates an effective deadline at dequeue time using this priority (director/worker.go:63-107):

  1. Per-prediction deadline (prediction.InternalMetadata.Deadline)
  2. Deployment deadline (prediction.CreatedAt + MaxRunLifetime, if configured)
  3. Prediction timeout (fallback)

The earliest applicable deadline wins. At execution time (director/worker.go:516-528):

```go
deadline, source, ok := getEffectiveDeadline(prediction)
var timeoutDuration time.Duration

if ok {
    // A deadline applies: convert the absolute deadline to remaining time.
    timeoutDuration = time.Until(deadline)
} else {
    // No deadline: fall back to the duration-based prediction timeout.
    timeoutDuration = d.predictionTimeout
}

timeoutTimer := time.After(timeoutDuration)
```

Timeout outcomes vary based on when and why the deadline is exceeded:

Before execution (deadline already passed at dequeue):

  • Per-prediction deadline → aborted
  • Deployment deadline → failed

During execution (timer fires while running):

  • Per-prediction deadline → canceled
  • Deployment deadline → failed
  • Prediction timeout (fallback) → failed
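The outcome table collapses to a small mapping. The phase and source strings here are illustrative labels, not values from the code:

```go
package main

import "fmt"

// timeoutOutcome maps (phase, deadline source) to the terminal status
// listed above. Deployment deadlines and the fallback prediction
// timeout always produce "failed"; only per-prediction deadlines
// distinguish aborted (never started) from canceled (interrupted).
func timeoutOutcome(phase, source string) string {
	if source == "prediction" {
		if phase == "before" {
			return "aborted"
		}
		return "canceled"
	}
	return "failed"
}

func main() {
	fmt.Println(timeoutOutcome("before", "prediction")) // aborted
	fmt.Println(timeoutOutcome("during", "prediction")) // canceled
	fmt.Println(timeoutOutcome("during", "deployment")) // failed
}
```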

Timeout Behavior Examples

Scenario 1: Version prediction with deployment deadline

  • DIRECTOR_MAX_RUN_LIFETIME=300 (5 minutes)
  • DIRECTOR_PREDICT_TIMEOUT=1800 (30 minutes)
  • Prediction created at T=0, dequeued at T=60s
  • Effective deadline: T=0 + 300s = T=300s
  • Execution timeout: 300s - 60s = 240s remaining

Scenario 2: Prediction with explicit deadline

  • Per-prediction deadline: T=180s (3 minutes from creation)
  • DIRECTOR_MAX_RUN_LIFETIME=600 (10 minutes)
  • Prediction created at T=0, dequeued at T=30s
  • Effective deadline: T=180s (per-prediction deadline wins)
  • Execution timeout: 180s - 30s = 150s remaining

Scenario 3: Training with no deadlines

  • DIRECTOR_MAX_RUN_LIFETIME=0 (disabled)
  • DIRECTOR_PREDICT_TIMEOUT=1800 (30 minutes)
  • No per-prediction deadline
  • Effective deadline: None (uses fallback)
  • Execution timeout: 1800s from execution start


Workers AI Timeout System

Workers AI has a simpler timeout model than Replicate, but the request passes through multiple layers, each with its own timeout behavior.

Request Path Layers

A request from the internet traverses three services before reaching a GPU container:

  1. worker-constellation-entry (TypeScript Worker): The SDK-facing entry point. Calls constellation-entry via a Workers service binding (this.binding.fetch(...)) with no explicit timeout (ai/session.ts:131). See note below on Workers runtime limits.
  2. constellation-entry (Rust, runs on metal): Wraps the entire request to constellation-server in a hardcoded 4-minute tokio::time::timeout (proxy_server.rs:99, proxy_server.rs:831-845). On timeout, returns EntryError::ForwardTimeout. This applies to both HTTP and WebSocket paths.
  3. constellation-server (Rust, runs on GPU nodes): Enforces per-model inference timeout (see below).

The 4-minute constellation-entry timeout acts as a hard ceiling. If a model’s inference_timeout in constellation-server exceeds 240 seconds, constellation-entry will kill the connection before constellation-server’s own timeout fires. This means constellation-entry is the effective upper bound for any single inference request, unless constellation-entry’s constant is changed.

Workers runtime limits: worker-constellation-entry is a standard Worker deployed on Cloudflare’s internal AI account (not a pipeline First Party Worker). It does not set limits.cpu_ms in its wrangler.toml, so it gets the default 30-second CPU time limit. However, CPU time only counts active processing — time spent awaiting the service binding fetch to constellation-entry is I/O wait and does not count (docs). Since the Worker is almost entirely I/O-bound, the CPU limit is unlikely to be reached. Wall-clock duration has no hard cap for HTTP requests, though Workers can be evicted during runtime restarts (~once/week, 30-second drain). In practice, the Workers runtime layer is transparent for timeout purposes — constellation-entry’s 4-minute timeout fires well before any platform limit would.

Configuration Levels

constellation-server Default (constellation-server/src/cli.rs:107-109):

  • --infer-request-timeout-secs: CLI argument, default 30 seconds
  • Acts as the floor for all inference request timeouts
  • Applied globally across all models served by the constellation-server instance

Per-Model Configuration (model-repository/src/config.rs:143):

  • inference_timeout: Model-specific timeout in seconds (default: 0)
  • Configured via Config API (Deus) and fetched from Quicksilver (distributed KV store)
  • Part of WorkersAiModelConfig structure
  • Zero value means “use constellation-server default”

Timeout Resolution

When processing an inference request, constellation-server resolves the timeout (server/src/lib.rs):

```rust
let min_timeout = state.infer_request_timeout_secs as u64;
let mut inference_timeout = Duration::from_secs(state.infer_request_timeout_secs as u64);

if let Some(model_config) = state.endpoint_lb.get_model_config(&model_id) {
    inference_timeout = Duration::from_secs(
        model_config.ai_config.inference_timeout.max(min_timeout)
    );
}
```

The effective timeout is: max(model_config.inference_timeout, infer_request_timeout_secs). This ensures:

  • Models can extend their timeout beyond the 30-second default
  • Models cannot reduce their timeout below the constellation-server minimum
  • Unconfigured models (inference_timeout=0) use the 30-second default

Enforcement

constellation-server enforces the timeout by wrapping the entire inference call in tokio::time::timeout(inference_timeout, handle_inference_request(...)) (server/src/lib.rs:588-599). When the timeout elapses, the future is dropped, which closes the connection to the GPU container. The error maps to ServerError::InferenceTimeout (line 603).

The other end of the dropped connection varies by model backend (service-discovery/src/lib.rs:39-47):

  • Triton: Upstream NVIDIA tritonserver binary (launched via model-greeter). Disconnect behavior depends on Triton’s own implementation.
  • TGI/TEI: Upstream HuggingFace binaries. Disconnect behavior depends on the upstream Rust server implementation.
  • PipeHttp/PipeHttpLlm: Custom Python framework (WorkersAiApp in ai_catalog_common, FastAPI-based). The synchronous HTTP path (_generate) does not explicitly cancel the raw_generate coroutine on client disconnect — the inference runs to completion (catalog/common/.../workers_ai_app/app.py:880-896). WebSocket connections do cancel processing tasks in their finally block (catalog/common/.../workers_ai_app/app.py:536-553).

The constellation-server timeout covers:

  • Time waiting in the constellation-server queue (permit acquisition)
  • Forwarding to the GPU container
  • Inference execution in the backend (Triton, TGI, TEI)
  • Response transmission back through constellation-server

But constellation-entry’s 4-minute timeout wraps the entire round-trip to constellation-server, so it is the effective ceiling regardless of per-model config.

Unlike Director’s system, there is no separate setup timeout. Model containers are managed by the orchestrator (K8s) independently of request processing. Container initialization and readiness are handled by service discovery and health checks, not request-level timeouts.

Timeout Behavior Examples

Scenario 1: Text generation model with default timeout

  • Model config: inference_timeout=0 (not configured)
  • constellation-server: infer_request_timeout_secs=30
  • Effective timeout: 30 seconds

Scenario 2: Image generation model with extended timeout

  • Model config: inference_timeout=120 (2 minutes)
  • constellation-server: infer_request_timeout_secs=30
  • Effective timeout: 120 seconds (model config wins)

Scenario 3: Model timeout exceeds constellation-entry ceiling

  • Model config: inference_timeout=300 (5 minutes)
  • constellation-server: infer_request_timeout_secs=30
  • constellation-entry: 240 seconds (hardcoded)
  • constellation-server effective timeout: 300 seconds
  • Actual outcome: constellation-entry kills the connection at 240 seconds

Scenario 4: Attempted timeout reduction (not allowed)

  • Model config: inference_timeout=10 (10 seconds)
  • constellation-server: infer_request_timeout_secs=30
  • Effective timeout: 30 seconds (constellation-server minimum enforced)


Key Differences

Complexity and Granularity:

  • Director: Multi-phase timeout system with separate controls for setup (model initialization) and execution (inference), plus deployment-level lifetime limits
  • Workers AI: Two-layer timeout — constellation-entry imposes a hard 4-minute ceiling, constellation-server applies per-model timeouts within that ceiling

Deadline Calculation:

  • Director: Priority-based deadline system with per-prediction, deployment, and fallback timeouts. Calculates earliest applicable deadline at dequeue time, accounting for time already spent in queue
  • Workers AI: Simple max() operation between model config and constellation-server default, applied at request receipt time

Queue Time Handling:

  • Director: MaxRunLifetime includes queue time; execution timeout accounts for time already elapsed since creation
  • Workers AI: Timeout starts when constellation-server receives the request, inherently includes queue time

Setup vs Runtime:

  • Director: Explicit ModelSetupTimeout for container initialization, separate from prediction execution timeout
  • Workers AI: No request-level setup timeout; container lifecycle managed independently by orchestrator

Configuration Source:

  • Director: Environment variables set by cluster service during deployment creation
  • Workers AI: CLI argument (constellation-server) + distributed config store (Quicksilver/Config API)

Default Values:

  • Director: 30 minutes prediction timeout (generous for long-running inference), setup and lifetime timeouts disabled by default
  • Workers AI: 30 seconds base timeout (optimized for fast inference), extended per-model via config, hard ceiling of 4 minutes at constellation-entry

Minimum Enforcement:

  • Director: 30-minute minimum enforced for prediction timeout if misconfigured
  • Workers AI: constellation-server minimum enforced as floor, models can only extend

Use Case Alignment:

  • Director: Designed for variable-length workloads including multi-hour training runs; flexible per-deployment configuration
  • Workers AI: Optimized for inference workloads with known latency profiles; centralized model-specific configuration