State Reporting

Overview

Both platforms need to know what their model instances are doing — whether they’re booting, idle, processing requests, or failing.

Replicate’s Director reports instance state to the API every 15 seconds and sends a report when model setup finishes. Individual predictions are tracked via OTel spans. Workers AI logs a structured event for every inference request into Cloudflare’s internal analytics pipeline, and exposes Prometheus metrics for aggregate capacity signals.


Replicate: HTTP Reporting and OTel Spans

Instance State Monitor

Director runs an instance state monitor (instancestate/monitor.go) that tracks what the model instance is doing at any given moment. It distinguishes three activity states (instancestate/types.go:10-14):

State    Meaning
Boot     Instance is starting up (setup running)
Idle     Ready but no active predictions
Active   Processing ≥1 prediction

The monitor maintains a concurrency counter (activityLevel). When a prediction starts, the level increments; when it finishes, it decrements. Any level > 0 means Active (types.go:18-23).
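The counter logic can be sketched as follows. This is a minimal illustration, not Director's actual API; `Monitor` and its method names are hypothetical stand-ins:

```go
package main

import "fmt"

type State string

const (
	StateBoot   State = "boot"
	StateIdle   State = "idle"
	StateActive State = "active"
)

// Monitor tracks an activity level (concurrency counter).
// Any level > 0 means Active; before setup completes, the state is Boot.
type Monitor struct {
	booted        bool
	activityLevel int
}

func (m *Monitor) SetupComplete()    { m.booted = true }
func (m *Monitor) PredictionStart()  { m.activityLevel++ }
func (m *Monitor) PredictionFinish() { m.activityLevel-- }

func (m *Monitor) State() State {
	switch {
	case !m.booted:
		return StateBoot
	case m.activityLevel > 0:
		return StateActive
	default:
		return StateIdle
	}
}

func main() {
	m := &Monitor{}
	fmt.Println(m.State()) // boot
	m.SetupComplete()
	m.PredictionStart()
	m.PredictionStart() // two concurrent predictions
	fmt.Println(m.State()) // active
	m.PredictionFinish()
	m.PredictionFinish()
	fmt.Println(m.State()) // idle
}
```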

Every 15 seconds (DefaultUtilizationInterval), the monitor computes a Metrics payload (types.go:39-55) and POSTs it as JSON to the cluster-local API instance:

Field                                        Description
active_time, boot_time, idle_time            Seconds spent in each state this interval
metric_duration                              Total interval length
utilization                                  active_time / metric_duration
mean_concurrency                             Weighted average concurrency level
deployment_key, docker_image_uri             Deployable identity
hardware, compute_unit, compute_unit_count   Resource info
instance_id, instance_metadata, version      Pod-level metadata

Utilization is active_time / metric_duration — a number between 0 and 1 representing what fraction of the interval the instance was processing at least one prediction. Mean concurrency captures how many predictions were running simultaneously on average (monitor.go:80-91).

Transport: HTTP POST to DIRECTOR_REPORT_INSTANCE_STATE_URL. Auth via WEBHOOK_AUTH_TOKEN Bearer header. Retries via httpclient.ApplyRetryPolicy (reporter.go:102-140).

Setup Run Reporting

Once when model setup completes (or times out), Director sends an HTTP POST to DIRECTOR_REPORT_SETUP_RUN_URL (reporter.go:142-212). The payload includes:

  • status — terminal state (“ready”, “failed”, etc.)
  • started_at, completed_at — RFC3339 timestamps
  • logs — setup output (truncated to last 20 KiB for OTel spans)
  • instance_metadata — pod name, namespace, etc.
  • runtime_metadata — cog version, cog version override, pod name

Director also records OTel span attributes for setup timing: setuprun.scheduled_to_setup_started, setuprun.scheduled_to_ready, setuprun.duration_seconds, and setuprun.status.

Scale State Snapshots

The cluster autoscaler captures periodic snapshots of each deployable’s queue depth and replica count, stored in Redis (scalestate/scalestate.go):

Field                                  Description
queue_length                           Total across blue+green Redis
queue_length_blue, queue_length_green  Per-Redis-instance queue depth
replica_count                          Ready replicas from K8s
time                                   Unix timestamp

Up to 120 snapshots are retained per deployable (MaxSnapshots). The autoscaler uses this history when making scaling decisions — the snapshot window provides context about recent queue pressure and capacity.
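The retention cap amounts to a bounded append. A minimal sketch (the Redis storage details are elided; `Snapshot` field names mirror the table above but the helper is hypothetical):

```go
package main

import "fmt"

const MaxSnapshots = 120

// Snapshot mirrors the fields in the table above.
type Snapshot struct {
	QueueLength      int   `json:"queue_length"`
	QueueLengthBlue  int   `json:"queue_length_blue"`
	QueueLengthGreen int   `json:"queue_length_green"`
	ReplicaCount     int   `json:"replica_count"`
	Time             int64 `json:"time"`
}

// appendSnapshot adds a snapshot and drops the oldest entries so that
// at most MaxSnapshots are retained per deployable.
func appendSnapshot(history []Snapshot, s Snapshot) []Snapshot {
	history = append(history, s)
	if len(history) > MaxSnapshots {
		history = history[len(history)-MaxSnapshots:]
	}
	return history
}

func main() {
	var history []Snapshot
	for i := 0; i < 150; i++ {
		history = appendSnapshot(history, Snapshot{QueueLength: i, Time: int64(i)})
	}
	// 150 appends, but only the most recent 120 survive (times 30..149).
	fmt.Println(len(history), history[0].Time) // 120 30
}
```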

Two Prometheus gauges are emitted per snapshot (scalestate.go:69-76):

  • autoscaler_snapshot_queue_length — queue depth, labeled by deployable attributes
  • autoscaler_snapshot_replica_count — ready replicas

Scaling decisions are recorded as OTel spans (autoscaler.scaleVersion) with the full snapshot history JSON, old/new replica counts, and all deployable attributes.

Prediction Tracker

Each prediction’s lifecycle is managed by a Tracker (director/tracker.go) that wraps an OTel span and notifies subscribers on state changes. The tracker is not thread-safe — it’s owned by a single worker goroutine.

Prediction states: Processing, Succeeded, Failed, Canceled, Aborted. The tracker records span attributes for cancel reasons, error codes (e.g. E1234), compliance check results, moderation outcomes, and billing metrics (token counts, image counts).

Subscribers receive PredictionUpdate{Prediction, Events} notifications. Events are webhook event types (WebhookEventCompleted, WebhookEventOutput). Terminal updates are held until Stop() to avoid sending duplicate completion events.

pget Download Metrics

Director exposes a POST /metrics endpoint (server/metrics.go) that pget uses to report download metrics — URL, file size, throughput, and errors (pget/pkg/pget.go:148-180). Each metric is converted to an OTel span (director.metric) with attributes from the payload. Rate-limited to 100 req/s. HuggingFace and S3 presigned URLs are normalized to strip query parameters, avoiding high-cardinality span attributes.
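The query-parameter stripping described above can be done with the standard library; a minimal sketch (the `normalizeURL` name is hypothetical):

```go
package main

import (
	"fmt"
	"net/url"
)

// normalizeURL drops query parameters and fragments (e.g. S3 presigned
// signatures) so the value is safe as a low-cardinality span attribute.
func normalizeURL(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return "invalid-url"
	}
	u.RawQuery = ""
	u.Fragment = ""
	return u.String()
}

func main() {
	presigned := "https://bucket.s3.amazonaws.com/weights.tar?X-Amz-Signature=abc123&X-Amz-Expires=3600"
	fmt.Println(normalizeURL(presigned))
	// → https://bucket.s3.amazonaws.com/weights.tar
}
```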


Workers AI: Inference Events and Prometheus Metrics

Ready Analytics (Inference Events)

Every inference request produces an InferenceEvent that is serialized as a Cap’n Proto message and sent to logfwdr (Cloudflare’s internal log forwarding daemon) via a Unix domain socket (ready-analytics/src/inference_event.rs, ready-analytics/src/logfwdr.rs). This feeds into Cloudflare’s Ready Analytics pipeline.

Key fields per event (inference_event.rs:42-77):

Field                     Description
account_id                Cloudflare account ID
model_id                  Model identifier (e.g. @cf/meta/llama-2-7b)
request_time_ms           Total request time
inference_time_ms         Actual inference duration
queuing_time_ms           Time spent in local queue
time_to_first_token       TTFT for streaming responses
neurons                   Neuron cost metric (billing)
error_code                Error code if failed
colo_id / source_colo_id  Colo identifiers
local_queue_size          Queue depth at request time
model_container_id        Container identifier

Bitfield flags track request properties: streamed, beta model, external endpoint, queued in colo, Unix socket connection (inference_event.rs:13-19).

RequestSource distinguishes Worker bindings, Pages bindings, and REST API requests (inference_event.rs:22-28).

Transport: LogfwdrClient connects to a Unix socket with a 64-slot mpsc channel buffer. 10-second watchdog reconnect on failure. Metrics: connect_err, write_err, ok (logfwdr.rs).

Endpoint Analytics (Lifecycle Events)

The AvailableEndpointTracker (endpoint-analytics/src/lib.rs) tracks endpoint membership changes — when endpoints appear or disappear from the world map. It generates lifecycle events sent to a Workers pipeline via LifecycleEventSender:

Field       Description
begin_ts    When the endpoint appeared
end_ts      When the endpoint disappeared (0 if still present)
event_type  Endpoint
model_id    Model identifier
hostname    Tracker's hostname

Updates are tied to the world map refresh interval.

Prometheus Metrics

constellation-server exposes Prometheus metrics (metrics/src/lib.rs). Key gauges and counters:

Request-level:

  • requests — total requests by HTTP status
  • errors — errors by internal + HTTP code
  • infer_per_backend — inferences per backend type (Triton, TGI, etc.)
  • infer_per_health_state — inferences by endpoint health state

Queue and capacity:

  • turnstile_outcomes — permit acquisition outcomes (leased/queued/rejected)
  • full_fraction — fraction of time a model is fully saturated
  • total_permits / mean_used_permits — permit availability and usage per model
  • mean_queue_size — average queue depth per model

These queue/capacity metrics are computed from a ModelPermitBuffer — a rolling window of 600 samples at 500 ms intervals (5-minute window) tracking (total_permits, available_permits, queue_size) per model (metrics/src/lib.rs).
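The described rolling window is Rust; a Go sketch of the same idea follows, with hypothetical names, showing how `full_fraction` and `mean_queue_size` fall out of the sampled (total, available, queue) triples:

```go
package main

import "fmt"

// permitSample is one 500ms observation of a model's permit state.
type permitSample struct {
	totalPermits, availablePermits, queueSize int
}

// permitBuffer is a fixed-capacity ring of recent samples
// (600 samples * 500ms = a 5-minute window).
type permitBuffer struct {
	samples []permitSample
	next    int
	full    bool
}

func newPermitBuffer(capacity int) *permitBuffer {
	return &permitBuffer{samples: make([]permitSample, capacity)}
}

func (b *permitBuffer) push(s permitSample) {
	b.samples[b.next] = s
	b.next = (b.next + 1) % len(b.samples)
	if b.next == 0 {
		b.full = true
	}
}

func (b *permitBuffer) count() int {
	if b.full {
		return len(b.samples)
	}
	return b.next
}

// fullFraction is the share of samples with no permits available,
// i.e. the fraction of time the model was fully saturated.
func (b *permitBuffer) fullFraction() float64 {
	n := b.count()
	if n == 0 {
		return 0
	}
	saturated := 0
	for i := 0; i < n; i++ {
		if b.samples[i].availablePermits == 0 {
			saturated++
		}
	}
	return float64(saturated) / float64(n)
}

// meanQueueSize is the average queue depth over the window.
func (b *permitBuffer) meanQueueSize() float64 {
	n := b.count()
	if n == 0 {
		return 0
	}
	sum := 0
	for i := 0; i < n; i++ {
		sum += b.samples[i].queueSize
	}
	return float64(sum) / float64(n)
}

func main() {
	b := newPermitBuffer(600)
	b.push(permitSample{8, 0, 3}) // saturated, 3 queued
	b.push(permitSample{8, 2, 0})
	b.push(permitSample{8, 0, 5}) // saturated again
	fmt.Printf("full_fraction=%.2f mean_queue_size=%.2f\n",
		b.fullFraction(), b.meanQueueSize()) // 2/3 and 8/3
}
```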

Infrastructure:

  • constellation_server_forward_failures — cross-colo forward failures
  • constellation_server_endpoints_removed — endpoints in backoff state
  • constellation_server_external_endpoints — external endpoint health by host/port

ai-scheduler Metrics

The scheduler emits its own Prometheus metrics (prefixed ai_scheduler_) every 60 seconds by polling the Cloudchamber API (crates/metrics/src/lib.rs):

  • model_deployments — deployment count by model, tier, region, status, GPU model
  • requested_instances / min_requested_instances / max_requested_instances — scaling bounds
  • autoscaler_measured_utilization / estimated_utilization / forecast_utilization — utilization signals
  • autoscaler_required_instances — minimum instances to meet target utilization
  • external_instances — external instance count by provider, region, model, phase
  • external_nodes — external GPU count by provider, region, state

Key Differences

Instance state
  • Replicate: 3 states (Boot/Idle/Active) with concurrency level, reported every 15s
  • Workers AI: Per-request inference events; permit/queue metrics from a 5-minute rolling window

Transport
  • Replicate: HTTP POST (JSON) to the Replicate API
  • Workers AI: Unix socket (Cap’n Proto) to the logfwdr pipeline

Granularity
  • Replicate: Periodic snapshots (15s utilization, setup completion)
  • Workers AI: Per-request events + rolling metric windows

Setup reporting
  • Replicate: Report sent once when setup completes, with logs, timing, metadata
  • Workers AI: No direct equivalent; endpoint lifecycle events track appearance/disappearance

Scaling telemetry
  • Replicate: Redis-cached snapshot history (120 snapshots), OTel spans per decision
  • Workers AI: Prometheus gauges polled every 60s from the Cloudchamber API

Prediction tracking
  • Replicate: Per-prediction OTel span with subscriber notifications
  • Workers AI: Per-request inference event to the analytics pipeline

Metrics system
  • Replicate: OTel spans → tracing backend
  • Workers AI: Prometheus (scraped) + Ready Analytics (logfwdr → pipeline)

Both platforms have multiple reporting streams at different granularities. Replicate has per-prediction OTel spans (Tracker), pget download metrics (POST /metrics), 15-second instance utilization snapshots, and a setup report sent once when setup completes. Workers AI has per-request inference events (Ready Analytics), endpoint lifecycle events, Prometheus metrics (backed by internally-sampled 5-minute rolling windows), and 60-second scheduler polls.

The main structural difference is where aggregation happens. Director’s instance state monitor pre-aggregates utilization into 15-second windows before sending to the API, which forwards to Web for customer-facing reporting. Workers AI’s inference events are sent raw and aggregated by downstream consumers in the analytics pipeline. The Prometheus metrics layer provides pre-computed summaries (like full_fraction and mean_queue_size) for operational use.