Billing Metrics

Overview

Both platforms need to measure what happened during inference so they can bill for it. The billing models are different: Replicate bills on predict time plus per-unit metrics (tokens, images, etc.), while Workers AI bills on “neurons” — an abstract unit representing GPU compute.

How those metrics are collected also differs. Director estimates billing metrics for most models. Only explicitly opted-in trusted models may report their own billing metrics. Workers AI has multiple paths depending on the model server software: Triton-backed models report cost metrics as output tensors that constellation-server converts to neurons, while non-Triton models (omni, infire, partner) calculate neurons themselves and report them via HTTP response headers. Both paths converge at constellation-entry.


Replicate: Model-Reported and Director-Estimated Metrics

Three Flags

Director has three flags that control billing metric behavior (director/config.go:46-48):

| Flag | Purpose |
| --- | --- |
| `DIRECTOR_TRUST_BILLING_METRICS` | When true, pass through the model’s billing metrics to downstream systems |
| `DIRECTOR_CALCULATE_TOKEN_METRICS` | When true, Director independently estimates token counts |
| `DIRECTOR_CALCULATE_IMAGE_METRICS` | When true, Director independently estimates image counts |

The calculate flags are the default path for untrusted models — Director estimates billing metrics from the prediction input and output rather than trusting the model to report them.

Both calculate flags can be enabled alongside trustBillingMetrics. When they are, Director compares its own estimates against the model’s reported values and records match/mismatch as OTel span attributes (token_input_metrics.match, image_count_metrics.match, etc.). This is useful for validating that trusted models report accurately.

Model-Reported Metrics (Trusted Path)

When trustBillingMetrics is true, Director passes through three sets of metrics from the model’s prediction response (tracker.go:524-585):

PublicMetrics — user-visible metrics stored on the prediction. Covers image count, batch size, input/output token counts, and predict time share.

BillingMetrics — internal metrics stored in prediction.InternalMetadata["billing_metrics"], covering 35+ fields (cog/types.go:68-125):

  • Audio (input/output count, duration)
  • Characters (input/output count)
  • Images (input/output count, megapixels, pixel dimensions, step count)
  • Tokens (input/output count)
  • Video (input/output count, duration, frame counts, megapixel-seconds)
  • Documents (page input count)
  • Training (step count)

BillingCriteria — model variant and configuration that affects pricing, stored in prediction.InternalMetadata["billing_criteria"] (cog/types.go:58-66). Covers model variant, resolution target, motion mode, source/target FPS, and audio flag.

When trustBillingMetrics is false, all three are silently dropped.

Director-Estimated Token Counts

When calculateTokenMetrics is true, Director estimates token counts from the prediction input and output (tracker.go:656-704):

Input tokens: Scans prediction input keys for any containing the substring "prompt" (matches prompt, system_prompt, prompt_template, etc.). Each matching string value is tokenized and the results are summed.

The tokenizer (tracker.go:714-717) is a rough heuristic: split on whitespace, count the words, and multiply by 4/3.

Output tokens: If the output is a string, run countTokens on it. If it’s an array, count the array length (for streaming models, each element is typically one token).

Timing: When output tokens exist, Director also calculates TimeToFirstToken and TokensPerSecond.
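The estimation logic above can be sketched as follows. This is a Python approximation of the Go code in tracker.go, not a transcription; the function names are hypothetical, and rounding up with `ceil` is an assumption the source does not specify.

```python
import math

def count_tokens(text: str) -> int:
    # Director's rough heuristic: whitespace-split word count * 4/3.
    # Rounding mode is an assumption here.
    return math.ceil(len(text.split()) * 4 / 3)

def estimate_input_tokens(inputs: dict) -> int:
    # Sum tokens over every string input whose key contains "prompt"
    # (matches prompt, system_prompt, prompt_template, ...).
    return sum(count_tokens(v) for k, v in inputs.items()
               if "prompt" in k and isinstance(v, str))

def estimate_output_tokens(output) -> int:
    if isinstance(output, str):
        return count_tokens(output)
    if isinstance(output, list):
        return len(output)  # streaming models: each element is ~one token
    return 0
```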

Director-Estimated Image Counts

When calculateImageMetrics is true, Director estimates image counts from the prediction output (tracker.go:629-653):

  • Array output → count = array length
  • Non-empty string output → count = 1
  • Empty or nil output → count = 0

There’s no content inspection — a TODO comment in the code (TODO: Check we're counting images and not something else) acknowledges this. If the model already reported a count via BillingMetrics, Director uses that value instead.

Metric Flow

  1. Model reports PublicMetrics, BillingMetrics, and BillingCriteria in its prediction response.
  2. Director processes them based on the three flags — passing through trusted metrics, estimating where needed, comparing when both paths are active.
  3. Director sends the prediction to the API via internal webhook. BillingMetrics and BillingCriteria travel in InternalMetadata (opaque to users). PublicMetrics are on the prediction itself (visible to users).
  4. The API forwards to Web for billing aggregation.

Workers AI: Neurons and Multiple Reporting Paths

Neurons

Workers AI bills in neurons — an abstract unit representing GPU compute. Internally, 1 neuron ≈ 0.1 L4 GPU-seconds (baselined at Sept 2023 efficiency). Each model has per-metric neuron coefficients that convert raw usage into a neuron total. The formula (neuron/src/lib.rs:7-36):

total_neurons = cost_per_infer + Σ(neuron_cost × metric_value)

cost_per_infer is an optional flat cost per request. The per-metric multipliers (e.g., input_tokens, output_tokens, image_steps) are configured per model via Consul service config or Deus env vars like INPUT_TOKEN_NEURONS and OUTPUT_TOKEN_NEURONS.
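The formula is simple enough to sketch directly. This is a Python rendering of the Rust function in neuron/src/lib.rs as described above; the dict-based config shape is an assumption for illustration.

```python
def calculate_neurons(neuron_config: dict, cost_metrics: dict) -> float:
    """total_neurons = cost_per_infer + sum(neuron_cost * metric_value)."""
    total = neuron_config.get("cost_per_infer", 0.0)
    for name, value in cost_metrics.items():
        # Metrics without a configured coefficient contribute nothing.
        total += neuron_config.get(name, 0.0) * value
    return total
```

Using the infire coefficients quoted below for illustration, a request with 100 input and 50 output tokens would cost 100 × 0.02561 + 50 × 0.07515 ≈ 6.32 neurons.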

Triton Path: Cost Metric Tensors

Triton-backed models — including triton-vllm, which covers the majority of LLMs — report billing metrics as output tensors prefixed with COST_METRIC_. constellation-server extracts these in the Triton result handler (inference/triton/result.rs:196-249):

  1. Iterate over result tensors.
  2. For each tensor named COST_METRIC_*, strip the prefix to get the metric name (e.g., input_tokens, output_tokens).
  3. Sum the tensor values (handles scalar, 1D array, and [N×1] shapes).
  4. Call calculate_neurons(neuron_config, &cost_metric_values) to compute total neurons.
  5. Strip COST_METRIC_* tensors from the response before sending to the client.
  6. Put the neuron total and up to two named cost metrics in the response cf-ai-cserver-meta JSON header.

The client never sees the raw cost metric tensors.
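The extraction and stripping steps can be sketched as a single pass over the result tensors. A Python sketch of the Rust handler's behavior, with hypothetical names and a plain-dict stand-in for Triton tensors:

```python
COST_METRIC_PREFIX = "COST_METRIC_"

def extract_cost_metrics(tensors: dict) -> tuple[dict, dict]:
    """Split result tensors into cost metrics and client-visible output."""
    metrics, visible = {}, {}
    for name, values in tensors.items():
        if name.startswith(COST_METRIC_PREFIX):
            metric = name[len(COST_METRIC_PREFIX):]  # e.g. "input_tokens"
            # Sum the tensor values; handles scalar, 1-D, and [N x 1] shapes.
            flat = values if isinstance(values, list) else [values]
            metrics[metric] = sum(
                sum(v) if isinstance(v, list) else v for v in flat)
        else:
            visible[name] = values  # COST_METRIC_* tensors never reach the client
    return metrics, visible
```

The returned metrics dict is what would feed calculate_neurons and the cf-ai-cserver-meta header.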

Non-Triton Path: Model-Reported Headers

Non-Triton backends report neurons via HTTP response headers that constellation-server passes through transparently:

Omni models (PipeHttp software type) — the omni framework (cloudflare/ai/omni) provides a Python API where model code calls context.cost.set_neurons(value) and context.cost.set_usage_metric(name, value). Omni converts these to cf-ai-neurons and cf-ai-cost-metric-{name,value}-N response headers (omni/shared/src/cost.rs:82-96). Models read their neuron coefficients from the cf-ai-model-config request header that constellation-server forwards from the model’s Consul config.

Infire models (PipeHttpLlm software type) — the newer vLLM deployment path. Neuron coefficients are defined in workers-ai.yaml (e.g., input_token: 0.02561, output_token: 0.07515). The model server calculates and reports neurons via the same cf-ai-neurons header convention.

Partner models (PartnerPipeHttp) — partner-bouncer calculates neurons and emits cf-ai-neurons headers (partner-bouncer/src/server/model.rs:178-211).

Convergence at constellation-entry

All paths converge at constellation-entry, the edge Worker that assembles the billing event (sdk/.../src/lib/headers.ts:20-82):

  1. Parse cf-ai-neurons from response headers (non-Triton path).
  2. Parse cf-ai-cserver-meta JSON and merge its fields — including neurons — into the metrics context (Triton path).
  3. Both sources write to the same raMetrics.neurons field.
  4. Send a Ready Analytics event with the neuron total to the SDK RA table.

Billing reads from the SDK RA table (aiinference_sdk_production_by_namespace_account_sampled), summing neurons × _sample_interval per account.

Streaming

For streaming responses, cost metrics are accumulated across chunks. In the Triton path, constellation-server’s InferenceEventBuilder adds metric values to running totals (inference_event.rs:153-156). In the non-Triton path, constellation-entry accumulates neurons and cost_metric_value_2 from each chunk’s meta field (sdk/.../src/lib/tools.ts:948-959).
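The per-chunk accumulation amounts to summing each chunk's meta fields into running totals. A Python sketch of the idea under assumed chunk shapes, not the actual TypeScript or Rust code:

```python
def accumulate_stream(chunks: list) -> dict:
    """Sum cost metrics across streamed chunks into running totals."""
    totals = {}
    for chunk in chunks:
        meta = chunk.get("meta", {})
        for key, value in meta.items():
            # e.g. "neurons" or a named cost metric value.
            totals[key] = totals.get(key, 0.0) + value
    return totals
```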

No Server-Side Token Counting

No component in the Workers AI stack counts tokens independently. Token counts come from the model — either as COST_METRIC_input_tokens / COST_METRIC_output_tokens tensors (Triton) or as usage metrics reported via context.cost.set_usage_metric() (omni). There is no tokenizer in constellation-server, constellation-entry, or ai-scheduler.


Key Differences

| Aspect | Replicate | Workers AI |
| --- | --- | --- |
| Billing unit | Predict time + per-unit metrics (tokens, images, video, etc.) | Neurons (abstract GPU compute unit, ≈ 0.1 L4-seconds) |
| Who counts | Director estimates (untrusted) or model reports (trusted) | Model always reports — via tensors (Triton) or headers (omni/infire/partner) |
| Token counting | Director: word count × 4/3 heuristic (untrusted). Model: actual counts (trusted). | Model-reported only. No server-side tokenizer. |
| Cost formula | Raw metrics passed to billing system, pricing applied downstream | neurons = cost_per_infer + Σ(neuron_cost × metric_value), computed at inference time or by model code |
| Metric types | 35+ fields across audio, image, video, tokens, training, documents | Arbitrary named metrics (typically 1-3 per model) |
| Trust model | Explicit trust_billing_metrics flag gates model-reported metrics | Implicit — all models report their own metrics, no server-side estimation |
| Reporting paths | Single path through Director | Multiple: Triton tensors, omni SDK, infire headers, partner-bouncer — all converge at constellation-entry |
| Billing table | Web aggregates from prediction metadata | SDK RA table (written by constellation-entry) queried via BigQuery |

Both platforms have a trust asymmetry. Replicate runs untrusted third-party models and must estimate billing metrics for them — only explicitly opted-in trusted models may report their own. Workers AI controls all model deployments, so every model reports its own metrics through one of several backend-specific mechanisms.

Replicate’s BillingMetrics struct with 35+ fields reflects the diversity of model types it supports (image generators, video models, audio models, LLMs, training jobs). Workers AI’s neuron abstraction collapses all of this into a single number — per-model pricing changes only require updating neuron coefficients in config, not model code.