State Reporting
Overview
Both platforms need to know what their model instances are doing — whether they’re booting, idle, processing requests, or failing.
Replicate’s Director reports instance state to the API every 15 seconds and sends a report when model setup finishes. Individual predictions are tracked via OTel spans. Workers AI logs a structured event for every inference request into Cloudflare’s internal analytics pipeline, and exposes Prometheus metrics for aggregate capacity signals.
Replicate: HTTP Reporting and OTel Spans
Instance State Monitor
Director runs an instance state monitor (instancestate/monitor.go) that
tracks what the model instance is doing at any given moment. Three activity states
(instancestate/types.go:10-14):
| State | Meaning |
|---|---|
Boot | Instance is starting up (setup running) |
Idle | Ready but no active predictions |
Active | Processing ≥1 prediction |
The monitor maintains a concurrency counter (activityLevel). When a prediction starts,
the level increments; when it finishes, it decrements. Any level > 0 means Active
(types.go:18-23).
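A minimal sketch of this state machine (identifiers here are illustrative, not the exact ones from instancestate/monitor.go):

```go
package main

import "fmt"

// ActivityState mirrors the three states in instancestate/types.go.
// Names here are illustrative, not the actual Go identifiers.
type ActivityState string

const (
	Boot   ActivityState = "boot"
	Idle   ActivityState = "idle"
	Active ActivityState = "active"
)

// monitor tracks a concurrency counter; any level > 0 means Active.
type monitor struct {
	setupDone     bool
	activityLevel int
}

func (m *monitor) PredictionStarted() { m.activityLevel++ }
func (m *monitor) PredictionDone()    { m.activityLevel-- }

func (m *monitor) State() ActivityState {
	switch {
	case !m.setupDone:
		return Boot
	case m.activityLevel > 0:
		return Active
	default:
		return Idle
	}
}

func main() {
	m := &monitor{}
	fmt.Println(m.State()) // boot: setup still running
	m.setupDone = true
	m.PredictionStarted()
	m.PredictionStarted()
	fmt.Println(m.State(), m.activityLevel) // active 2
	m.PredictionDone()
	m.PredictionDone()
	fmt.Println(m.State()) // idle
}
```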
Every 15 seconds (DefaultUtilizationInterval), the monitor computes a Metrics
payload (types.go:39-55) and POSTs it as JSON to the cluster-local
API instance:
| Field | Description |
|---|---|
active_time, boot_time, idle_time | Seconds in each state this interval |
metric_duration | Total interval length |
utilization | active_time / metric_duration |
mean_concurrency | Weighted average concurrency level |
deployment_key, docker_image_uri | Deployable identity |
hardware, compute_unit, compute_unit_count | Resource info |
instance_id, instance_metadata, version | Pod-level metadata |
Utilization is active_time / metric_duration — a number between 0 and 1 representing
what fraction of the interval the instance was processing at least one prediction. Mean
concurrency captures how many predictions were running simultaneously on average
(monitor.go:80-91).
Transport: HTTP POST to DIRECTOR_REPORT_INSTANCE_STATE_URL. Auth via
WEBHOOK_AUTH_TOKEN Bearer header. Retries via httpclient.ApplyRetryPolicy
(reporter.go:102-140).
Setup Run Reporting
When model setup completes (or times out), Director sends a single HTTP POST to
DIRECTOR_REPORT_SETUP_RUN_URL (reporter.go:142-212). The payload
includes:
- status — terminal state ("ready", "failed", etc.)
- started_at, completed_at — RFC3339 timestamps
- logs — setup output (truncated to last 20 KiB for OTel spans)
- instance_metadata — pod name, namespace, etc.
- runtime_metadata — cog version, cog version override, pod name
Director also records OTel span attributes for setup timing:
setuprun.scheduled_to_setup_started, setuprun.scheduled_to_ready,
setuprun.duration_seconds, and setuprun.status.
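Putting the fields together, a representative setup run payload might look like this (all values illustrative; the exact shape is defined in reporter.go:142-212):

```json
{
  "status": "ready",
  "started_at": "2024-05-01T12:00:00Z",
  "completed_at": "2024-05-01T12:01:30Z",
  "logs": "...setup output, last 20 KiB...",
  "instance_metadata": {
    "pod_name": "model-abc-0",
    "namespace": "models"
  },
  "runtime_metadata": {
    "cog_version": "0.9.0",
    "cog_version_override": "",
    "pod_name": "model-abc-0"
  }
}
```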
Scale State Snapshots
The cluster autoscaler captures periodic snapshots of each deployable’s queue depth and
replica count, stored in Redis (scalestate/scalestate.go):
| Field | Description |
|---|---|
queue_length | Total queue depth across the blue and green Redis instances |
queue_length_blue, queue_length_green | Per-Redis-instance queue depth |
replica_count | Ready replicas from K8s |
time | Unix timestamp |
Up to 120 snapshots are retained per deployable (MaxSnapshots). The autoscaler uses
this history when making scaling decisions — the snapshot window provides context about
recent queue pressure and capacity.
Two Prometheus gauges are emitted per snapshot (scalestate.go:69-76):
- autoscaler_snapshot_queue_length — queue depth, labeled by deployable attributes
- autoscaler_snapshot_replica_count — ready replicas
Scaling decisions are recorded as OTel spans (autoscaler.scaleVersion) with the full
snapshot history JSON, old/new replica counts, and all deployable attributes.
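The snapshot shape and MaxSnapshots trimming can be sketched as follows (Redis persistence and Prometheus emission omitted; field names follow the table above):

```go
package main

import (
	"fmt"
	"time"
)

const MaxSnapshots = 120

// Snapshot mirrors the fields stored per deployable in scalestate.go.
type Snapshot struct {
	QueueLength      int   `json:"queue_length"` // blue + green
	QueueLengthBlue  int   `json:"queue_length_blue"`
	QueueLengthGreen int   `json:"queue_length_green"`
	ReplicaCount     int   `json:"replica_count"`
	Time             int64 `json:"time"`
}

// appendSnapshot keeps only the most recent MaxSnapshots entries,
// mirroring the capped Redis-backed history the autoscaler reads.
func appendSnapshot(history []Snapshot, blue, green, replicas int) []Snapshot {
	history = append(history, Snapshot{
		QueueLength:      blue + green,
		QueueLengthBlue:  blue,
		QueueLengthGreen: green,
		ReplicaCount:     replicas,
		Time:             time.Now().Unix(),
	})
	if len(history) > MaxSnapshots {
		history = history[len(history)-MaxSnapshots:]
	}
	return history
}

func main() {
	var h []Snapshot
	for i := 0; i < 150; i++ {
		h = appendSnapshot(h, i, i/2, 3)
	}
	fmt.Println(len(h), h[len(h)-1].QueueLength) // 120 223
}
```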
Prediction Tracker
Each prediction’s lifecycle is managed by a Tracker
(director/tracker.go) that wraps an OTel span and notifies subscribers on
state changes. The tracker is not thread-safe — it’s owned by a single worker goroutine.
Prediction states: Processing, Succeeded, Failed, Canceled, Aborted. The tracker
records span attributes for cancel reasons, error codes (e.g. E1234), compliance check
results, moderation outcomes, and billing metrics (token counts, image counts).
Subscribers receive PredictionUpdate{Prediction, Events} notifications. Events are
webhook event types (WebhookEventCompleted, WebhookEventOutput). Terminal updates are
held until Stop() to avoid sending duplicate completion events.
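The hold-until-Stop behavior is the interesting part; a sketch (names approximate the real Tracker in director/tracker.go, and the OTel span handling is omitted):

```go
package main

import "fmt"

// PredictionUpdate approximates the notification sent to subscribers.
type PredictionUpdate struct {
	State  string
	Events []string
}

// Tracker is owned by a single worker goroutine, like the real one,
// so no locking is needed.
type Tracker struct {
	subscribers []func(PredictionUpdate)
	terminal    *PredictionUpdate // held back until Stop()
}

func (t *Tracker) Subscribe(fn func(PredictionUpdate)) {
	t.subscribers = append(t.subscribers, fn)
}

func (t *Tracker) notify(u PredictionUpdate) {
	for _, fn := range t.subscribers {
		fn(u)
	}
}

// Update forwards non-terminal updates immediately but buffers the
// terminal one so completion events fire exactly once, at Stop().
func (t *Tracker) Update(u PredictionUpdate, terminal bool) {
	if terminal {
		t.terminal = &u
		return
	}
	t.notify(u)
}

func (t *Tracker) Stop() {
	if t.terminal != nil {
		t.notify(*t.terminal)
		t.terminal = nil
	}
}

func main() {
	var seen []string
	tr := &Tracker{}
	tr.Subscribe(func(u PredictionUpdate) { seen = append(seen, u.State) })

	tr.Update(PredictionUpdate{State: "processing", Events: []string{"output"}}, false)
	tr.Update(PredictionUpdate{State: "succeeded", Events: []string{"completed"}}, true)
	fmt.Println(seen) // [processing] — terminal update held back
	tr.Stop()
	fmt.Println(seen) // [processing succeeded]
}
```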
pget Download Metrics
Director exposes a POST /metrics endpoint (server/metrics.go) that
pget uses to report download metrics — URL, file size, throughput, and errors
(pget/pkg/pget.go:148-180). Each metric is converted to an OTel span
(director.metric) with attributes from the payload. Rate-limited to 100 req/s.
HuggingFace and S3 presigned URLs are normalized to strip query parameters, avoiding
high-cardinality span attributes.
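The normalization is a standard strip-the-query-string transform; a sketch using Go's net/url:

```go
package main

import (
	"fmt"
	"net/url"
)

// normalizeURL strips query parameters and fragments so presigned
// HuggingFace/S3 URLs don't explode span-attribute cardinality.
func normalizeURL(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return raw // keep unparseable URLs as-is
	}
	u.RawQuery = ""
	u.Fragment = ""
	return u.String()
}

func main() {
	fmt.Println(normalizeURL(
		"https://bucket.s3.amazonaws.com/weights.tar?X-Amz-Signature=abc123&X-Amz-Expires=3600"))
	// https://bucket.s3.amazonaws.com/weights.tar
}
```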
Workers AI: Inference Events and Prometheus Metrics
Ready Analytics (Inference Events)
Every inference request produces an InferenceEvent that is serialized as a Cap’n Proto
message and sent to logfwdr (Cloudflare’s internal log forwarding daemon) via a Unix
domain socket (ready-analytics/src/inference_event.rs,
ready-analytics/src/logfwdr.rs). This feeds into Cloudflare’s Ready
Analytics pipeline.
Key fields per event (inference_event.rs:42-77):
| Field | Description |
|---|---|
account_id | Cloudflare account ID |
model_id | Model identifier (e.g. @cf/meta/llama-2-7b) |
request_time_ms | Total request time |
inference_time_ms | Actual inference duration |
queuing_time_ms | Time spent in local queue |
time_to_first_token | TTFT for streaming responses |
neurons | Neuron cost metric (billing) |
error_code | Error code if failed |
colo_id / source_colo_id | Colo identifiers |
local_queue_size | Queue depth at request time |
model_container_id | Container identifier |
Bitfield flags track request properties: streamed, beta model, external endpoint, queued
in colo, Unix socket connection (inference_event.rs:13-19).
RequestSource distinguishes Worker bindings, Pages bindings, and REST API requests
(inference_event.rs:22-28).
Transport: LogfwdrClient connects to a Unix socket with a 64-slot mpsc channel
buffer. 10-second watchdog reconnect on failure. Metrics: connect_err, write_err, ok
(logfwdr.rs).
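In Go terms (the actual implementation is Rust), the bitfield amounts to the following; the bit positions here are assumptions, not the ones in inference_event.rs:

```go
package main

import (
	"fmt"
	"strings"
)

// Request-property flags packed into one integer, mirroring the bitfield
// in inference_event.rs:13-19. Bit positions are illustrative.
const (
	FlagStreamed uint8 = 1 << iota
	FlagBetaModel
	FlagExternalEndpoint
	FlagQueuedInColo
	FlagUnixSocket
)

var flagNames = []struct {
	bit  uint8
	name string
}{
	{FlagStreamed, "streamed"},
	{FlagBetaModel, "beta_model"},
	{FlagExternalEndpoint, "external_endpoint"},
	{FlagQueuedInColo, "queued_in_colo"},
	{FlagUnixSocket, "unix_socket"},
}

// describe decodes the set bits into human-readable flag names.
func describe(flags uint8) string {
	var set []string
	for _, f := range flagNames {
		if flags&f.bit != 0 {
			set = append(set, f.name)
		}
	}
	return strings.Join(set, "|")
}

func main() {
	flags := FlagStreamed | FlagQueuedInColo
	fmt.Println(describe(flags)) // streamed|queued_in_colo
}
```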
Endpoint Analytics (Lifecycle Events)
The AvailableEndpointTracker
(endpoint-analytics/src/lib.rs) tracks endpoint membership
changes — when endpoints appear or disappear from the world map. It generates lifecycle
events sent to a Workers pipeline via LifecycleEventSender:
| Field | Description |
|---|---|
begin_ts | When endpoint appeared |
end_ts | When endpoint disappeared (0 if still present) |
event_type | Endpoint |
model_id | Model identifier |
hostname | Tracker’s hostname |
Updates are tied to the world map refresh interval.
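The tracker is essentially a set diff between consecutive world-map refreshes. A Go sketch (the real implementation is Rust; names approximate the fields above):

```go
package main

import (
	"fmt"
	"sort"
)

// LifecycleEvent approximates the fields sent by LifecycleEventSender.
type LifecycleEvent struct {
	ModelID string
	BeginTs int64
	EndTs   int64 // 0 while the endpoint is still present
}

// diff compares two world-map refreshes (endpoint ID -> begin timestamp)
// and emits appear/disappear events, one pass per refresh interval.
func diff(prev, next map[string]int64, now int64) []LifecycleEvent {
	var events []LifecycleEvent
	for id := range next {
		if _, ok := prev[id]; !ok {
			events = append(events, LifecycleEvent{ModelID: id, BeginTs: now})
		}
	}
	for id, begin := range prev {
		if _, ok := next[id]; !ok {
			events = append(events, LifecycleEvent{ModelID: id, BeginTs: begin, EndTs: now})
		}
	}
	sort.Slice(events, func(i, j int) bool { return events[i].ModelID < events[j].ModelID })
	return events
}

func main() {
	prev := map[string]int64{"@cf/meta/llama-2-7b": 100}
	next := map[string]int64{"@cf/mistral/mistral-7b": 0}
	for _, e := range diff(prev, next, 200) {
		fmt.Printf("%s begin=%d end=%d\n", e.ModelID, e.BeginTs, e.EndTs)
	}
	// @cf/meta/llama-2-7b begin=100 end=200
	// @cf/mistral/mistral-7b begin=200 end=0
}
```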
Prometheus Metrics
constellation-server exposes Prometheus metrics
(metrics/src/lib.rs). Key gauges and counters:
Request-level:
- requests — total requests by HTTP status
- errors — errors by internal + HTTP code
- infer_per_backend — inferences per backend type (Triton, TGI, etc.)
- infer_per_health_state — inferences by endpoint health state
Queue and capacity:
- turnstile_outcomes — permit acquisition outcomes (leased/queued/rejected)
- full_fraction — fraction of time a model is fully saturated
- total_permits / mean_used_permits — permit availability and usage per model
- mean_queue_size — average queue depth per model
These queue/capacity metrics are computed from a ModelPermitBuffer — a rolling window of
600 samples at 500 ms intervals (5-minute window) tracking (total_permits, available_permits, queue_size) per model (metrics/src/lib.rs).
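How the gauges fall out of that window can be sketched as follows (Go used for consistency with the other sketches; the derivation of each gauge is inferred from its description, not taken from metrics/src/lib.rs):

```go
package main

import "fmt"

// sample mirrors one 500 ms ModelPermitBuffer entry.
type sample struct {
	totalPermits     int
	availablePermits int
	queueSize        int
}

// windowStats derives two of the gauges from the rolling window:
// full_fraction is the share of samples with no free permits, and
// mean_queue_size averages queue depth. The real buffer holds 600
// samples (5 minutes at 500 ms intervals).
func windowStats(window []sample) (fullFraction, meanQueueSize float64) {
	if len(window) == 0 {
		return 0, 0
	}
	full, queued := 0, 0
	for _, s := range window {
		if s.availablePermits == 0 {
			full++
		}
		queued += s.queueSize
	}
	n := float64(len(window))
	return float64(full) / n, float64(queued) / n
}

func main() {
	window := []sample{
		{totalPermits: 4, availablePermits: 0, queueSize: 3},
		{totalPermits: 4, availablePermits: 0, queueSize: 1},
		{totalPermits: 4, availablePermits: 2, queueSize: 0},
		{totalPermits: 4, availablePermits: 4, queueSize: 0},
	}
	ff, mq := windowStats(window)
	fmt.Printf("full_fraction=%.2f mean_queue_size=%.2f\n", ff, mq)
	// full_fraction=0.50 mean_queue_size=1.00
}
```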
Infrastructure:
- constellation_server_forward_failures — cross-colo forward failures
- constellation_server_endpoints_removed — endpoints in backoff state
- constellation_server_external_endpoints — external endpoint health by host/port
ai-scheduler Metrics
The scheduler emits its own Prometheus metrics (prefixed ai_scheduler_) every 60
seconds by polling the Cloudchamber API (crates/metrics/src/lib.rs):
- model_deployments — deployment count by model, tier, region, status, GPU model
- requested_instances / min_requested_instances / max_requested_instances — scaling bounds
- autoscaler_measured_utilization / estimated_utilization / forecast_utilization — utilization signals
- autoscaler_required_instances — minimum instances to meet target utilization
- external_instances — external instance count by provider, region, model, phase
- external_nodes — external GPU count by provider, region, state
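The relationship between measured utilization and autoscaler_required_instances can be illustrated as follows; the actual formula in ai-scheduler is not documented here, so this is a plausible sketch only:

```go
package main

import (
	"fmt"
	"math"
)

// requiredInstances illustrates the idea behind
// autoscaler_required_instances: the smallest instance count that brings
// utilization at or below the target, clamped to the configured bounds
// (min/max_requested_instances). This is a sketch, not the real formula.
func requiredInstances(current int, measuredUtil, targetUtil float64, lo, hi int) int {
	required := int(math.Ceil(float64(current) * measuredUtil / targetUtil))
	if required < lo {
		required = lo
	}
	if required > hi {
		required = hi
	}
	return required
}

func main() {
	// 4 instances at 90% utilization, targeting 60%: scale to 6.
	fmt.Println(requiredInstances(4, 0.9, 0.6, 1, 10)) // 6
}
```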
Key Differences
| Aspect | Replicate | Workers AI |
|---|---|---|
| Instance state | 3 states (Boot/Idle/Active) with concurrency level, reported every 15s | Per-request inference events; permit/queue metrics from 5-min rolling window |
| Transport | HTTP POST (JSON) to Replicate API | Unix socket (Cap’n Proto) to logfwdr pipeline |
| Granularity | Periodic snapshots (15s utilization, setup completion) | Per-request events + rolling metric windows |
| Setup reporting | Report sent once when setup completes, with logs, timing, metadata | No direct equivalent; endpoint lifecycle events track appearance/disappearance |
| Scaling telemetry | Redis-cached snapshot history (120 snapshots), OTel spans per decision | Prometheus gauges polled every 60s from Cloudchamber API |
| Prediction tracking | Per-prediction OTel span with subscriber notifications | Per-request inference event to analytics pipeline |
| Metrics system | OTel spans → tracing backend | Prometheus (scraped) + Ready Analytics (logfwdr → pipeline) |
Both platforms have multiple reporting streams at different granularities. Replicate has
per-prediction OTel spans (Tracker), pget download metrics (POST /metrics), 15-second
instance utilization snapshots, and a setup report sent once when setup completes. Workers
AI has per-request inference events (Ready Analytics), endpoint lifecycle events,
Prometheus metrics (backed by internally-sampled 5-minute rolling windows), and 60-second
scheduler polls.
The main structural difference is where aggregation happens. Director’s instance state
monitor pre-aggregates utilization into 15-second windows before sending to the API, which
forwards to Web for customer-facing reporting. Workers AI’s inference events are sent raw
and aggregated by downstream consumers in the analytics pipeline. The Prometheus metrics
layer provides pre-computed summaries (like full_fraction and mean_queue_size) for
operational use.