State Reporting

Overview

Both platforms need to know what their model instances are doing — whether they’re booting, idle, processing requests, or failing.

Replicate’s Director reports instance state to the API every 15 seconds and sends a report when model setup finishes. Individual predictions are tracked via OTel spans. Workers AI logs a structured event for every inference request into Cloudflare’s internal analytics pipeline, and exposes Prometheus metrics for aggregate capacity signals.


Replicate: HTTP Reporting and OTel Spans

Instance State Monitor

Director runs an instance state monitor (instancestate/monitor.go) that tracks what the model instance is doing at any given moment. It distinguishes three activity states (instancestate/types.go:10-14):

State    Meaning
Boot     Instance is starting up (setup running)
Idle     Ready but no active predictions
Active   Processing ≥1 prediction

The monitor maintains a concurrency counter (activityLevel). When a prediction starts, the level increments; when it finishes, it decrements. Any level > 0 means Active (types.go:18-23).
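The counter logic can be sketched as follows. This is a minimal illustration, not Director's actual API; `Monitor` and its method names are hypothetical stand-ins:

```go
package main

import "fmt"

type State string

const (
	StateBoot   State = "boot"
	StateIdle   State = "idle"
	StateActive State = "active"
)

// Monitor tracks an activity level (concurrency counter).
// Any level > 0 means Active; before setup completes, the state is Boot.
type Monitor struct {
	booted        bool
	activityLevel int
}

func (m *Monitor) SetupComplete()    { m.booted = true }
func (m *Monitor) PredictionStart()  { m.activityLevel++ }
func (m *Monitor) PredictionFinish() { m.activityLevel-- }

func (m *Monitor) State() State {
	switch {
	case !m.booted:
		return StateBoot
	case m.activityLevel > 0:
		return StateActive
	default:
		return StateIdle
	}
}

func main() {
	m := &Monitor{}
	fmt.Println(m.State()) // boot
	m.SetupComplete()
	m.PredictionStart()
	m.PredictionStart() // two concurrent predictions
	fmt.Println(m.State()) // active
	m.PredictionFinish()
	m.PredictionFinish()
	fmt.Println(m.State()) // idle
}
```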

Every 15 seconds (DefaultUtilizationInterval), the monitor computes a Metrics payload (types.go:39-55) and POSTs it as JSON to the cluster-local API instance:

Field                                        Description
active_time, boot_time, idle_time            Seconds spent in each state this interval
metric_duration                              Total interval length
utilization                                  active_time / metric_duration
mean_concurrency                             Weighted average concurrency level
deployment_key, docker_image_uri             Deployable identity
hardware, compute_unit, compute_unit_count   Resource info
instance_id, instance_metadata, version      Pod-level metadata

Utilization is active_time / metric_duration — a number between 0 and 1 representing what fraction of the interval the instance was processing at least one prediction. Mean concurrency captures how many predictions were running simultaneously on average (monitor.go:80-91).

Transport: HTTP POST to DIRECTOR_REPORT_INSTANCE_STATE_URL. Auth via WEBHOOK_AUTH_TOKEN Bearer header. Retries via httpclient.ApplyRetryPolicy (reporter.go:102-140).

Setup Run Reporting

Once when model setup completes (or times out), Director sends an HTTP POST to DIRECTOR_REPORT_SETUP_RUN_URL (reporter.go:142-212). The payload includes:

  • status — terminal state (“ready”, “failed”, etc.)
  • started_at, completed_at — RFC3339 timestamps
  • logs — setup output (truncated to last 20 KiB for OTel spans)
  • instance_metadata — pod name, namespace, etc.
  • runtime_metadata — cog version, cog version override, pod name

Director also records OTel span attributes for setup timing: setuprun.scheduled_to_setup_started, setuprun.scheduled_to_ready, setuprun.duration_seconds, and setuprun.status.

Scale State Snapshots

The cluster autoscaler captures periodic snapshots of each deployable’s queue depth and replica count, stored in Redis (scalestate/scalestate.go):

Field                                  Description
queue_length                           Total across blue+green Redis
queue_length_blue, queue_length_green  Per-Redis-instance queue depth
replica_count                          Ready replicas from K8s
time                                   Unix timestamp

Up to 120 snapshots are retained per deployable (MaxSnapshots). The autoscaler uses this history when making scaling decisions — the snapshot window provides context about recent queue pressure and capacity.
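The retention cap amounts to a bounded append. A minimal sketch (the Redis storage details are elided; `Snapshot` field names mirror the table above but the helper is hypothetical):

```go
package main

import "fmt"

const MaxSnapshots = 120

// Snapshot mirrors the fields in the table above.
type Snapshot struct {
	QueueLength      int   `json:"queue_length"`
	QueueLengthBlue  int   `json:"queue_length_blue"`
	QueueLengthGreen int   `json:"queue_length_green"`
	ReplicaCount     int   `json:"replica_count"`
	Time             int64 `json:"time"`
}

// appendSnapshot adds a snapshot and drops the oldest entries so that
// at most MaxSnapshots are retained per deployable.
func appendSnapshot(history []Snapshot, s Snapshot) []Snapshot {
	history = append(history, s)
	if len(history) > MaxSnapshots {
		history = history[len(history)-MaxSnapshots:]
	}
	return history
}

func main() {
	var history []Snapshot
	for i := 0; i < 150; i++ {
		history = appendSnapshot(history, Snapshot{QueueLength: i, Time: int64(i)})
	}
	// 150 appends, but only the most recent 120 survive (times 30..149).
	fmt.Println(len(history), history[0].Time) // 120 30
}
```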

Two Prometheus gauges are emitted per snapshot (scalestate.go:69-76):

  • autoscaler_snapshot_queue_length — queue depth, labeled by deployable attributes
  • autoscaler_snapshot_replica_count — ready replicas

Scaling decisions are recorded as OTel spans (autoscaler.scaleVersion) with the full snapshot history JSON, old/new replica counts, and all deployable attributes.

Prediction Tracker

Each prediction’s lifecycle is managed by a Tracker (director/tracker.go) that wraps an OTel span and notifies subscribers on state changes. The tracker is not thread-safe — it’s owned by a single worker goroutine.

Prediction states: Processing, Succeeded, Failed, Canceled, Aborted. The tracker records span attributes for cancel reasons, error codes (e.g. E1234), compliance check results, moderation outcomes, and billing metrics (token counts, image counts).

Subscribers receive PredictionUpdate{Prediction, Events} notifications. Events are webhook event types (WebhookEventCompleted, WebhookEventOutput). Terminal updates are held until Stop() to avoid sending duplicate completion events.

pget Download Metrics

Director exposes a POST /metrics endpoint (server/metrics.go) that pget uses to report download metrics — URL, file size, throughput, and errors (pget/pkg/pget.go:148-180). Each metric is converted to an OTel span (director.metric) with attributes from the payload. Rate-limited to 100 req/s. HuggingFace and S3 presigned URLs are normalized to strip query parameters, avoiding high-cardinality span attributes.
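The query-parameter stripping described above can be done with the standard library; a minimal sketch (the `normalizeURL` name is hypothetical):

```go
package main

import (
	"fmt"
	"net/url"
)

// normalizeURL drops query parameters and fragments (e.g. S3 presigned
// signatures) so the value is safe as a low-cardinality span attribute.
func normalizeURL(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return "invalid-url"
	}
	u.RawQuery = ""
	u.Fragment = ""
	return u.String()
}

func main() {
	presigned := "https://bucket.s3.amazonaws.com/weights.tar?X-Amz-Signature=abc123&X-Amz-Expires=3600"
	fmt.Println(normalizeURL(presigned))
	// → https://bucket.s3.amazonaws.com/weights.tar
}
```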


Workers AI: Inference Events and Prometheus Metrics

Ready Analytics (Inference Events)

Every inference request produces an InferenceEvent that is serialized as a Cap’n Proto message and sent to logfwdr (Cloudflare’s internal log forwarding daemon) via a Unix domain socket (ready-analytics/src/inference_event.rs, ready-analytics/src/logfwdr.rs). This feeds into Cloudflare’s Ready Analytics pipeline.

Key fields per event (inference_event.rs:42-77):

Field                     Description
account_id                Cloudflare account ID
model_id                  Model identifier (e.g. @cf/meta/llama-2-7b)
request_time_ms           Total request time
inference_time_ms         Actual inference duration
queuing_time_ms           Time spent in local queue
time_to_first_token       TTFT for streaming responses
neurons                   Neuron cost metric (billing)
error_code                Error code if failed
colo_id / source_colo_id  Colo identifiers
local_queue_size          Queue depth at request time
model_container_id        Container identifier

Bitfield flags track request properties: streamed, beta model, external endpoint, queued in colo, Unix socket connection (inference_event.rs:13-19).

RequestSource distinguishes Worker bindings, Pages bindings, and REST API requests (inference_event.rs:22-28).

Transport: LogfwdrClient connects to a Unix socket with a 64-slot mpsc channel buffer. 10-second watchdog reconnect on failure. Metrics: connect_err, write_err, ok (logfwdr.rs).

Endpoint Analytics (Lifecycle Events)

The AvailableEndpointTracker (endpoint-analytics/src/lib.rs) tracks endpoint membership changes — when endpoints appear or disappear from the world map. It generates lifecycle events sent to a Workers pipeline via LifecycleEventSender:

Field       Description
begin_ts    When the endpoint appeared
end_ts      When the endpoint disappeared (0 if still present)
event_type  Endpoint
model_id    Model identifier
hostname    Tracker's hostname

Updates are tied to the world map refresh interval.

Prometheus Metrics

constellation-server exposes Prometheus metrics (metrics/src/lib.rs). Key gauges and counters:

Request-level:

  • requests — total requests by HTTP status
  • errors — errors by internal + HTTP code
  • infer_per_backend — inferences per backend type (Triton, TGI, etc.)
  • infer_per_health_state — inferences by endpoint health state

Queue and capacity:

  • turnstile_outcomes — permit acquisition outcomes (leased/queued/rejected)
  • full_fraction — fraction of time a model is fully saturated
  • total_permits / mean_used_permits — permit availability and usage per model
  • mean_queue_size — average queue depth per model

These queue/capacity metrics are computed from a ModelPermitBuffer — a rolling window of 600 samples at 500 ms intervals (5-minute window) tracking (total_permits, available_permits, queue_size) per model (metrics/src/lib.rs).
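The described rolling window is Rust; a Go sketch of the same idea follows, with hypothetical names, showing how `full_fraction` and `mean_queue_size` fall out of the sampled (total, available, queue) triples:

```go
package main

import "fmt"

// permitSample is one 500ms observation of a model's permit state.
type permitSample struct {
	totalPermits, availablePermits, queueSize int
}

// permitBuffer is a fixed-capacity ring of recent samples
// (600 samples * 500ms = a 5-minute window).
type permitBuffer struct {
	samples []permitSample
	next    int
	full    bool
}

func newPermitBuffer(capacity int) *permitBuffer {
	return &permitBuffer{samples: make([]permitSample, capacity)}
}

func (b *permitBuffer) push(s permitSample) {
	b.samples[b.next] = s
	b.next = (b.next + 1) % len(b.samples)
	if b.next == 0 {
		b.full = true
	}
}

func (b *permitBuffer) count() int {
	if b.full {
		return len(b.samples)
	}
	return b.next
}

// fullFraction is the share of samples with no permits available,
// i.e. the fraction of time the model was fully saturated.
func (b *permitBuffer) fullFraction() float64 {
	n := b.count()
	if n == 0 {
		return 0
	}
	saturated := 0
	for i := 0; i < n; i++ {
		if b.samples[i].availablePermits == 0 {
			saturated++
		}
	}
	return float64(saturated) / float64(n)
}

// meanQueueSize is the average queue depth over the window.
func (b *permitBuffer) meanQueueSize() float64 {
	n := b.count()
	if n == 0 {
		return 0
	}
	sum := 0
	for i := 0; i < n; i++ {
		sum += b.samples[i].queueSize
	}
	return float64(sum) / float64(n)
}

func main() {
	b := newPermitBuffer(600)
	b.push(permitSample{8, 0, 3}) // saturated, 3 queued
	b.push(permitSample{8, 2, 0})
	b.push(permitSample{8, 0, 5}) // saturated again
	fmt.Printf("full_fraction=%.2f mean_queue_size=%.2f\n",
		b.fullFraction(), b.meanQueueSize()) // 2/3 and 8/3
}
```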

Infrastructure:

  • constellation_server_forward_failures — cross-colo forward failures
  • constellation_server_endpoints_removed — endpoints in backoff state
  • constellation_server_external_endpoints — external endpoint health by host/port

ai-scheduler Metrics

The scheduler emits its own Prometheus metrics (prefixed ai_scheduler_) every 60 seconds by polling the Cloudchamber API (crates/metrics/src/lib.rs):

  • model_deployments — deployment count by model, tier, region, status, GPU model
  • requested_instances / min_requested_instances / max_requested_instances — scaling bounds
  • autoscaler_measured_utilization / estimated_utilization / forecast_utilization — utilization signals
  • autoscaler_required_instances — minimum instances to meet target utilization
  • external_instances — external instance count by provider, region, model, phase
  • external_nodes — external GPU count by provider, region, state

Key Differences

Instance state
  • Replicate: 3 states (Boot/Idle/Active) with concurrency level, reported every 15s
  • Workers AI: Per-request inference events; permit/queue metrics from a 5-minute rolling window

Transport
  • Replicate: HTTP POST (JSON) to the Replicate API
  • Workers AI: Unix socket (Cap’n Proto) to the logfwdr pipeline

Granularity
  • Replicate: Periodic snapshots (15s utilization, setup completion)
  • Workers AI: Per-request events + rolling metric windows

Setup reporting
  • Replicate: Report sent once when setup completes, with logs, timing, metadata
  • Workers AI: No direct equivalent; endpoint lifecycle events track appearance/disappearance

Scaling telemetry
  • Replicate: Redis-cached snapshot history (120 snapshots), OTel spans per decision
  • Workers AI: Prometheus gauges polled every 60s from the Cloudchamber API

Prediction tracking
  • Replicate: Per-prediction OTel span with subscriber notifications
  • Workers AI: Per-request inference event to the analytics pipeline

Metrics system
  • Replicate: OTel spans → tracing backend
  • Workers AI: Prometheus (scraped) + Ready Analytics (logfwdr → pipeline)

Both platforms have multiple reporting streams at different granularities. Replicate has per-prediction OTel spans (Tracker), pget download metrics (POST /metrics), 15-second instance utilization snapshots, and a setup report sent once when setup completes. Workers AI has per-request inference events (Ready Analytics), endpoint lifecycle events, Prometheus metrics (backed by internally-sampled 5-minute rolling windows), and 60-second scheduler polls.

The main structural difference is where aggregation happens. Director’s instance state monitor pre-aggregates utilization into 15-second windows before sending to the API, which forwards to Web for customer-facing reporting. Workers AI’s inference events are sent raw and aggregated by downstream consumers in the analytics pipeline. The Prometheus metrics layer provides pre-computed summaries (like full_fraction and mean_queue_size) for operational use.