Billing Metrics
Overview
Both platforms need to measure what happened during inference so they can bill for it. The billing models are different: Replicate bills on predict time plus per-unit metrics (tokens, images, etc.), while Workers AI bills on “neurons” — an abstract unit representing GPU compute.
How those metrics are collected also differs. Director estimates billing metrics for most models. Only explicitly opted-in trusted models may report their own billing metrics. Workers AI has multiple paths depending on the model server software: Triton-backed models report cost metrics as output tensors that constellation-server converts to neurons, while non-Triton models (omni, infire, partner) calculate neurons themselves and report them via HTTP response headers. Both paths converge at constellation-entry.
Replicate: Model-Reported and Director-Estimated Metrics
Three Flags
Director has three flags that control billing metric behavior
(director/config.go:46-48):
| Flag | Purpose |
|---|---|
| `DIRECTOR_TRUST_BILLING_METRICS` | When true, pass through the model's billing metrics to downstream systems |
| `DIRECTOR_CALCULATE_TOKEN_METRICS` | When true, Director independently estimates token counts |
| `DIRECTOR_CALCULATE_IMAGE_METRICS` | When true, Director independently estimates image counts |
The calculate flags are the default path for untrusted models — Director estimates billing metrics from the prediction input and output rather than trusting the model to report them.
Both calculate flags can be enabled alongside trustBillingMetrics. When they are,
Director compares its own estimates against the model’s reported values and records
match/mismatch as OTel span attributes (token_input_metrics.match,
image_count_metrics.match, etc.). This is useful for validating that trusted models
report accurately.
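The comparison can be sketched in Python. This is an illustrative shape, not Director's Go code; the attribute names follow the ones quoted above, but the function and its signature are hypothetical:

```python
def compare_metrics(estimated: dict, reported: dict) -> dict:
    """Compare Director's estimates against model-reported values.

    Returns span-attribute-style key/value pairs such as
    "token_input_metrics.match" -> True/False.
    """
    attrs = {}
    for key, estimate in estimated.items():
        if key in reported:
            attrs[f"{key}.match"] = (estimate == reported[key])
    return attrs

attrs = compare_metrics(
    estimated={"token_input_metrics": 42, "image_count_metrics": 1},
    reported={"token_input_metrics": 42, "image_count_metrics": 2},
)
# token_input_metrics.match is True; image_count_metrics.match is False
```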
Model-Reported Metrics (Trusted Path)
When trustBillingMetrics is true, Director passes through three sets of metrics from
the model’s prediction response (tracker.go:524-585):
PublicMetrics — user-visible metrics stored on the prediction. Covers image count, batch size, input/output token counts, and predict time share.
BillingMetrics — internal metrics stored in prediction.InternalMetadata["billing_metrics"].
35+ fields covering (cog/types.go:68-125):
- Audio (input/output count, duration)
- Characters (input/output count)
- Images (input/output count, megapixels, pixel dimensions, step count)
- Tokens (input/output count)
- Video (input/output count, duration, frame counts, megapixel-seconds)
- Documents (page input count)
- Training (step count)
BillingCriteria — model variant and configuration that affects pricing, stored in
prediction.InternalMetadata["billing_criteria"]
(cog/types.go:58-66). Covers model variant, resolution
target, motion mode, source/target FPS, and audio flag.
When trustBillingMetrics is false, all three are silently dropped.
Director-Estimated Token Counts
When calculateTokenMetrics is true, Director estimates token counts from the prediction
input and output (tracker.go:656-704):
Input tokens: Scans prediction input keys for any containing the substring "prompt"
(matches prompt, system_prompt, prompt_template, etc.). Each matching string value
is tokenized and the results are summed.
The tokenizer (tracker.go:714-717) is a rough heuristic: split
on whitespace, count the words, and multiply by 4/3.
Output tokens: If the output is a string, run countTokens on it. If it’s an array,
count the array length (for streaming models, each element is typically one token).
Timing: When output tokens exist, Director also calculates TimeToFirstToken and
TokensPerSecond.
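The estimation logic above can be sketched in Python. This is a hedged approximation of tracker.go's behavior, with illustrative function names and integer rounding as an assumption:

```python
def count_tokens(text: str) -> int:
    # Rough heuristic from the text: word count × 4/3 (rounding is a guess).
    return len(text.split()) * 4 // 3

def estimate_input_tokens(prediction_input: dict) -> int:
    # Sum tokens over every string input whose key contains "prompt"
    # (matches prompt, system_prompt, prompt_template, ...).
    total = 0
    for key, value in prediction_input.items():
        if "prompt" in key and isinstance(value, str):
            total += count_tokens(value)
    return total

def estimate_output_tokens(output) -> int:
    if isinstance(output, str):
        return count_tokens(output)
    if isinstance(output, list):
        # Streaming models emit roughly one token per array element.
        return len(output)
    return 0
```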
Director-Estimated Image Counts
When calculateImageMetrics is true, Director estimates image counts from the prediction
output (tracker.go:629-653):
- Array output → count = array length
- Non-empty string output → count = 1
- Empty or nil output → count = 0
There’s no content inspection — a `TODO: Check we're counting images and not something else` comment acknowledges this. If the model already reported an image count via BillingMetrics, Director uses
that value instead.
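The three rules above reduce to a few lines; this is an illustrative Python sketch, not the Go implementation:

```python
def estimate_image_count(output) -> int:
    # No content inspection: just the shape of the output.
    if isinstance(output, list):
        return len(output)          # array output → array length
    if isinstance(output, str) and output:
        return 1                    # non-empty string → one image
    return 0                        # empty or nil → zero
```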
Metric Flow
- Model reports `PublicMetrics`, `BillingMetrics`, and `BillingCriteria` in its prediction response.
- Director processes them based on the three flags — passing through trusted metrics, estimating where needed, comparing when both paths are active.
- Director sends the prediction to the API via internal webhook. `BillingMetrics` and `BillingCriteria` travel in `InternalMetadata` (opaque to users). `PublicMetrics` are on the prediction itself (visible to users).
- The API forwards to Web for billing aggregation.
Workers AI: Neurons and Multiple Reporting Paths
Neurons
Workers AI bills in neurons — an abstract unit representing GPU compute.
Internally, 1 neuron ≈ 0.1 L4 GPU-seconds (baselined at Sept 2023 efficiency).
Each model has per-metric neuron coefficients that convert raw usage into a
neuron total. The formula
(neuron/src/lib.rs:7-36):
total_neurons = cost_per_infer + Σ(neuron_cost × metric_value)
cost_per_infer is an optional flat cost per request. The per-metric
multipliers (e.g., input_tokens, output_tokens, image_steps) are
configured per model via Consul service config or Deus env vars like
INPUT_TOKEN_NEURONS and OUTPUT_TOKEN_NEURONS.
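The formula translates directly into Python. This is a sketch of the Rust logic in neuron/src/lib.rs, not a port of it; the coefficient values reuse the infire numbers quoted later in this document:

```python
def calculate_neurons(cost_per_infer, neuron_costs: dict, metric_values: dict) -> float:
    # total_neurons = cost_per_infer + Σ(neuron_cost × metric_value)
    total = cost_per_infer or 0.0
    for metric, value in metric_values.items():
        total += neuron_costs.get(metric, 0.0) * value
    return total

# A request with 100 input and 200 output tokens:
total = calculate_neurons(
    None,
    {"input_tokens": 0.02561, "output_tokens": 0.07515},
    {"input_tokens": 100, "output_tokens": 200},
)
# ≈ 17.591 neurons
```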
Triton Path: Cost Metric Tensors
Triton-backed models — including triton-vllm, which covers the majority of
LLMs — report billing metrics as output tensors prefixed with COST_METRIC_.
constellation-server extracts these in the Triton result handler
(inference/triton/result.rs:196-249):
- Iterate over result tensors.
- For each tensor named `COST_METRIC_*`, strip the prefix to get the metric name (e.g., `input_tokens`, `output_tokens`).
- Sum the tensor values (handles scalar, 1D array, and [N×1] shapes).
- Call `calculate_neurons(neuron_config, &cost_metric_values)` to compute total neurons.
- Strip `COST_METRIC_*` tensors from the response before sending to the client.
- Put the neuron total and up to two named cost metrics in the response `cf-ai-cserver-meta` JSON header.
The client never sees the raw cost metric tensors.
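The extraction and stripping steps can be sketched in Python (the real code is Rust in result.rs; the function and data shapes here are illustrative):

```python
COST_METRIC_PREFIX = "COST_METRIC_"

def extract_cost_metrics(tensors: dict) -> tuple:
    """Split result tensors into cost metrics and client-visible output."""
    cost_metrics, client_tensors = {}, {}
    for name, values in tensors.items():
        if name.startswith(COST_METRIC_PREFIX):
            metric = name[len(COST_METRIC_PREFIX):]
            # Sum scalar, 1-D, and [N×1] shapes alike.
            flat = values if isinstance(values, list) else [values]
            cost_metrics[metric] = sum(
                sum(v) if isinstance(v, list) else v for v in flat)
        else:
            client_tensors[name] = values
    return cost_metrics, client_tensors

cost, client = extract_cost_metrics({
    "COST_METRIC_input_tokens": [12],
    "COST_METRIC_output_tokens": [[3], [4]],   # [N×1] shape
    "TEXT": ["hello"],
})
# cost = {"input_tokens": 12, "output_tokens": 7}; client keeps only TEXT
```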
Non-Triton Path: Model-Reported Headers
Non-Triton backends report neurons via HTTP response headers that constellation-server passes through transparently:
Omni models (PipeHttp software type) — the omni framework
(cloudflare/ai/omni) provides a Python API where model
code calls context.cost.set_neurons(value) and
context.cost.set_usage_metric(name, value). Omni converts these to
cf-ai-neurons and cf-ai-cost-metric-{name,value}-N response headers
(omni/shared/src/cost.rs:82-96). Models read their neuron
coefficients from the cf-ai-model-config request header that
constellation-server forwards from the model’s Consul config.
Infire models (PipeHttpLlm software type) — the newer vLLM deployment
path. Neuron coefficients are defined in workers-ai.yaml (e.g.,
input_token: 0.02561, output_token: 0.07515). The model server
calculates and reports neurons via the same cf-ai-neurons header
convention.
Partner models (PartnerPipeHttp) — partner-bouncer calculates neurons
and emits cf-ai-neurons headers
(partner-bouncer/src/server/model.rs:178-211).
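The header convention shared by these backends can be sketched as follows. The header names come from the text above; the exact value formatting and numbering scheme are assumptions:

```python
def cost_headers(neurons: float, usage_metrics: dict) -> dict:
    """Build cf-ai-* cost headers for a model response (illustrative)."""
    headers = {"cf-ai-neurons": str(neurons)}
    # Each usage metric becomes a cf-ai-cost-metric-{name,value}-N pair.
    for i, (name, value) in enumerate(usage_metrics.items(), start=1):
        headers[f"cf-ai-cost-metric-name-{i}"] = name
        headers[f"cf-ai-cost-metric-value-{i}"] = str(value)
    return headers
```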
Convergence at constellation-entry
All paths converge at constellation-entry, the edge Worker that assembles
the billing event
(sdk/.../src/lib/headers.ts:20-82):
- Parse `cf-ai-neurons` from response headers (non-Triton path).
- Parse `cf-ai-cserver-meta` JSON and merge its fields — including `neurons` — into the metrics context (Triton path).
- Both sources write to the same `raMetrics.neurons` field.
- Send a Ready Analytics event with the neuron total to the SDK RA table.
Billing reads from the SDK RA table
(`aiinference_sdk_production_by_namespace_account_sampled`), summing
`neurons × _sample_interval` per account.
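The merge of the two reporting paths can be sketched in Python. Header names come from this section; the precedence when both sources appear is an assumption, and the real logic lives in headers.ts:

```python
import json

def resolve_neurons(headers: dict):
    """Resolve a single neurons value from either reporting path."""
    neurons = None
    if "cf-ai-neurons" in headers:          # non-Triton path
        neurons = float(headers["cf-ai-neurons"])
    if "cf-ai-cserver-meta" in headers:     # Triton path
        meta = json.loads(headers["cf-ai-cserver-meta"])
        if "neurons" in meta:
            neurons = float(meta["neurons"])
    return neurons
```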
Streaming
For streaming responses, cost metrics are accumulated across chunks. In the
Triton path, constellation-server’s InferenceEventBuilder adds metric
values to running totals
(inference_event.rs:153-156). In the non-Triton path,
constellation-entry accumulates neurons and cost_metric_value_2 from
each chunk’s meta field
(sdk/.../src/lib/tools.ts:948-959).
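Both accumulation paths reduce to the same pattern: add each chunk's metric values into running totals. An illustrative sketch (class and method names are hypothetical):

```python
class NeuronAccumulator:
    """Accumulate per-chunk cost metrics into running totals."""
    def __init__(self):
        self.totals = {}

    def add_chunk(self, chunk_metrics: dict):
        for name, value in chunk_metrics.items():
            self.totals[name] = self.totals.get(name, 0) + value

acc = NeuronAccumulator()
for chunk in [{"neurons": 0.1}, {"neurons": 0.2}, {"neurons": 0.05}]:
    acc.add_chunk(chunk)
# acc.totals["neurons"] ≈ 0.35
```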
No Server-Side Token Counting
No component in the Workers AI stack counts tokens independently. Token
counts come from the model — either as COST_METRIC_input_tokens /
COST_METRIC_output_tokens tensors (Triton) or as usage metrics reported
via context.cost.set_usage_metric() (omni). There is no tokenizer in
constellation-server, constellation-entry, or ai-scheduler.
Key Differences
| Aspect | Replicate | Workers AI |
|---|---|---|
| Billing unit | Predict time + per-unit metrics (tokens, images, video, etc.) | Neurons (abstract GPU compute unit, ≈ 0.1 L4-seconds) |
| Who counts | Director estimates (untrusted) or model reports (trusted) | Model always reports — via tensors (Triton) or headers (omni/infire/partner) |
| Token counting | Director: word count × 4/3 heuristic (untrusted). Model: actual counts (trusted). | Model-reported only. No server-side tokenizer. |
| Cost formula | Raw metrics passed to billing system, pricing applied downstream | neurons = cost_per_infer + Σ(neuron_cost × metric_value), computed at inference time or by model code |
| Metric types | 35+ fields across audio, image, video, tokens, training, documents | Arbitrary named metrics (typically 1-3 per model) |
| Trust model | Explicit trust_billing_metrics flag gates model-reported metrics | Implicit — all models report their own metrics, no server-side estimation |
| Reporting paths | Single path through Director | Multiple: Triton tensors, omni SDK, infire headers, partner-bouncer — all converge at constellation-entry |
| Billing table | Web aggregates from prediction metadata | SDK RA table (written by constellation-entry) queried via BigQuery |
The deeper difference between the platforms is trust. Replicate runs untrusted third-party models and must estimate billing metrics for them — only explicitly opted-in trusted models may report their own. Workers AI controls all model deployments, so every model reports its own metrics through one of several backend-specific mechanisms.
Replicate’s BillingMetrics struct with 35+ fields reflects the diversity
of model types it supports (image generators, video models, audio models,
LLMs, training jobs). Workers AI’s neuron abstraction collapses all of this
into a single number — per-model pricing changes only require updating
neuron coefficients in config, not model code.