
Resource Allocation

Overview

Both platforms need to map model requirements to GPU hardware, but they approach it from opposite directions. Replicate exposes hardware selection to users as a first-class concept — users pick a SKU, and the platform provisions dedicated resources. Workers AI abstracts hardware entirely — operators configure GPU memory requirements per model in Config API, and the platform handles placement across a shared fleet.


Replicate Resource Model

Hardware SKUs

The Hardware Django model (web/models/hardware.py) defines every available hardware option as a SKU. Each SKU combines a compute unit (GPU type) with a compute_units_per_instance count (how many GPUs per pod).

Current compute units (cluster/pkg/kubernetes/compute_unit.go):

| Compute Unit | GPU | Status |
|---|---|---|
| cpu | None | Active |
| gpu-t4 | Nvidia T4 (16 GB) | Active |
| gpu-a100 | Nvidia A100 (40 GB) | Active |
| gpu-a100-80g | Nvidia A100 (80 GB) | Active |
| gpu-h100 | Nvidia H100 (80 GB) | Active |
| gpu-h200 | Nvidia H200 (141 GB) | Active |
| gpu-l40s | Nvidia L40S (48 GB) | Active |
| gpu-a40, gpu-a40-small, gpu-a40-large | Nvidia A40 (48 GB) | Legacy |
| gpu-t4-highmem, gpu-t4-lowmem | Nvidia T4 variants | Legacy |
| gpu-rtx-a4000, gpu-rtx-a5000, gpu-rtx-a6000 | Nvidia RTX Axxxx | Legacy |
| gpu-flex-ampere-min-40g | Any Ampere ≥40 GB | Legacy |

Multi-GPU SKUs use the same compute unit with a higher compute_units_per_instance. For example, gpu-2x-a100 is compute unit gpu-a100 with compute_units_per_instance=2. The Hardware model tracks this as a separate SKU with its own pricing.

Hardware availability is gated by flags on the model: allow_for_models, allow_for_deployments, is_legacy, is_preview. The HardwareQuerySet methods (available_for_models(), available_for_deployments()) filter to billable, non-legacy, public hardware (web/models/hardware.py:162-166).

K8s Resource Limits

The computeUnitLimits map (cluster/pkg/kubernetes/resources.go) defines CPU, memory, and GPU limits per compute unit:

| Compute Unit | CPU | Memory | GPUs |
|---|---|---|---|
| cpu | 1 | 2 Gi | 0 |
| gpu-t4 | 4 | 16 Gi | 1 |
| gpu-a100 | 10 | 72 Gi | 1 |
| gpu-a100-80g | 10 | 144 Gi | 1 |
| gpu-h100 | 13 | 144 Gi | 1 |
| gpu-h200 | 13 | 144 Gi | 1 |
| gpu-l40s | 10 | 72 Gi | 1 |

For multi-GPU pods, limits are multiplied by compute_units_per_instance with special cases for 8x configurations:

  • 8x A40/A40-Large: 6 CPU, 85 Gi per unit (reduced from 10/72)
  • 8x A100-80G: 10 CPU, 120 Gi per unit (reduced from 10/144)
  • T4 requests: memory request fudged to 13 Gi (limit stays 16 Gi) to fit 4x T4 on a node
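
The table and special cases above can be sketched as follows. This is an illustrative reconstruction, not the actual computeUnitLimits map from resources.go; the A40 base limits are inferred from the "reduced from 10/72" note.

```go
package main

import "fmt"

// limits holds per-unit resource limits; shapes are illustrative.
type limits struct {
	CPU   int // cores
	MemGi int // memory in Gi
	GPUs  int
}

// unitLimits mirrors the per-compute-unit limits table above.
var unitLimits = map[string]limits{
	"cpu":          {1, 2, 0},
	"gpu-t4":       {4, 16, 1},
	"gpu-a100":     {10, 72, 1},
	"gpu-a100-80g": {10, 144, 1},
	"gpu-h100":     {13, 144, 1},
	"gpu-h200":     {13, 144, 1},
	"gpu-l40s":     {10, 72, 1},
	"gpu-a40":      {10, 72, 1}, // legacy; base inferred from the 8x note
	"gpu-a40-large": {10, 72, 1},
}

// podLimits multiplies per-unit limits by compute_units_per_instance,
// applying the documented 8x special cases.
func podLimits(unit string, n int) limits {
	per := unitLimits[unit]
	if n == 8 {
		switch unit {
		case "gpu-a40", "gpu-a40-large":
			per.CPU, per.MemGi = 6, 85 // reduced from 10/72
		case "gpu-a100-80g":
			per.MemGi = 120 // reduced from 144
		}
	}
	return limits{per.CPU * n, per.MemGi * n, per.GPUs * n}
}

func main() {
	// e.g. gpu-2x-a100 is gpu-a100 with compute_units_per_instance=2
	fmt.Println(podLimits("gpu-a100", 2))
}
```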

Director’s sidecar container adds its own overhead: 256 Mi for most GPUs, 512 Mi for L40S/H100/A100/A100-80G, or 1 Gi when DirectorExtraMemory is set (deployable.go:1450-1465).

The model container also gets /dev/shm sized to 50% of its memory limit (deployable.go:681-686).
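
A minimal sketch of this memory bookkeeping, assuming hypothetical function names (the real logic lives in deployable.go):

```go
package main

import "fmt"

// directorSidecarMemMi returns the Director sidecar's memory overhead in Mi,
// per the rules described above.
func directorSidecarMemMi(unit string, extraMemory bool) int {
	if extraMemory { // DirectorExtraMemory set
		return 1024
	}
	switch unit {
	case "gpu-l40s", "gpu-h100", "gpu-a100", "gpu-a100-80g":
		return 512
	default:
		return 256
	}
}

// shmSizeGi returns the /dev/shm size: half the model container's memory limit.
func shmSizeGi(memLimitGi int) int {
	return memLimitGi / 2
}

func main() {
	fmt.Println(directorSidecarMemMi("gpu-h100", false), shmSizeGi(144))
}
```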

Node Placement and Bin-Packing

Replicate runs on three cloud providers (config/config.go):

  • CoreWeave CKS — primary, most models
  • Nebius MK8S — H200 capacity
  • GKE — T4 capacity

On CoreWeave CKS, placement uses a combination of node affinity, bin-packing, priority classes, and topology spread (deployable_cks.go):

Node affinity (required): pods are pinned to nodes matching the GPU class label (gpu.nvidia.com/class). CPU models go on GPU nodes but are excluded from H100/H200 nodes to preserve expensive capacity. Procedure (pipeline) workloads go to a dedicated customer-cpu-nodepool.

Bin-packing: GPU models use CoreWeave’s binpack-scheduler to pack pods tightly, leaving room for 4x and 8x pods. Pod affinity preferences (weight 10) group pods of the same compute unit on the same node. A second pod affinity groups pods by grace period (standard vs extended predict timeout) so that preemption doesn’t kill long-running predictions.

Priority node pools: 8x GPU pods get a node preference (weight 100) for dedicated priority node pools. Non-8x pods prefer to avoid these pools. On-demand node pools are also deprioritized (weight 100 against).

Priority classes:

  • r8-high — 4x+ GPU pods that are allowed to preempt others
  • r8-high-no-preempt — 4x+ GPU pods that won’t preempt
  • r8-cpu-model — CPU models, lowest priority, preemptible by GPU

Topology spread: CPU models on CKS use TopologySpreadConstraints with maxSkew=2 to prevent bunching on a single node and starving GPU models of CPU resources.
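
Put together, a single-GPU model pod on CKS might carry a spec fragment like the following sketch. The scheduler name, label key, and weights come from the description above; the GPU class value, topology key, and pod label are assumptions.

```yaml
# Illustrative CKS pod spec fragment for a single-GPU H100 model.
spec:
  schedulerName: binpack-scheduler        # CoreWeave bin-pack scheduler
  priorityClassName: r8-high-no-preempt
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu.nvidia.com/class
                operator: In
                values: ["H100"]          # assumed label value
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 10                      # bin-pack pods of the same compute unit
          podAffinityTerm:
            topologyKey: kubernetes.io/hostname   # assumed topology key
            labelSelector:
              matchLabels:
                compute-unit: gpu-h100    # assumed pod label
```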

On Nebius MK8S, placement is similar to CKS but narrower in scope (deployable_mk8s.go):

GPU support: H200 only. The switch statement handles ComputeUnitCPU and ComputeUnitH200 — everything else gets a NO_SUCH_GPU sentinel that prevents scheduling. Node affinity uses the standard K8s label node.kubernetes.io/instance-type (value gpu-h200-sxm).

Bin-packing: Same pod affinity logic as CKS — non-8x GPU pods get compute-unit bin-packing (weight 10) and grace-period bin-packing (weight 10). No custom scheduler though — uses the default K8s scheduler.

CPU models: Excluded from H200 nodes via NodeSelectorOpNotIn on the GPU class label, but no dedicated CPU node pool — they land on whatever non-H200 nodes are available.
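
The MK8S compute-unit mapping can be sketched as a switch, assuming a hypothetical function name (the NO_SUCH_GPU sentinel and instance-type value are from the description above):

```go
package main

import "fmt"

// mk8sInstanceType maps a compute unit to the node.kubernetes.io/instance-type
// value used for node affinity on Nebius MK8S.
func mk8sInstanceType(unit string) string {
	switch unit {
	case "cpu":
		return "" // CPU models: no instance-type pin, only a NotIn on the GPU class label
	case "gpu-h200":
		return "gpu-h200-sxm"
	default:
		return "NO_SUCH_GPU" // matches no node, so the pod never schedules
	}
}

func main() {
	fmt.Println(mk8sInstanceType("gpu-h200"))
}
```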

On GKE, placement is minimal (deployable_gke.go):

Node affinity (required): pods are pinned to nodes matching a replicate/role label with value model-{compute_unit} (e.g. model-gpu-t4). This is a Replicate-defined label on the node pool, not a vendor GPU class label.

Tolerations: Three tolerations per pod — the default GKE nvidia.com/gpu taint, a legacy model-hardware taint, and a newer model-compute-unit taint. The code comments indicate the compute-unit taint is the intended replacement for the hardware taint.
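
A GKE pod's scheduling fragment might look like the following sketch. The replicate/role label and value scheme come from the text; the exact taint keys are assumptions (nodeSelector is used here as shorthand for the required node affinity).

```yaml
# Illustrative GKE pod spec fragment for a T4 model.
spec:
  nodeSelector:
    replicate/role: model-gpu-t4
  tolerations:
    - key: nvidia.com/gpu         # default GKE GPU taint
      operator: Exists
    - key: model-hardware         # legacy taint (assumed key)
      operator: Exists
    - key: model-compute-unit     # intended replacement (assumed key)
      operator: Exists
```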

Tenancy Model

The isolation boundary is the deployable (model version or deployment), not the account. Each deployable gets its own K8s Deployment with dedicated pods and exclusive GPU access. But multiple accounts can send prediction requests to the same deployable — the GPU pods themselves are shared across all authorized requesters.

Public models and deployments are multi-tenant: any account can run predictions against them, and all requests land on the same pool of Director pods. This is the common case for official models.

Private models and deployments are single-tenant by default: only the owning account can run predictions. Access can be granted to other accounts via an allow-list, similar to GitHub’s private repository permissions model. The access control is enforced at the API layer (web/api check permissions before enqueuing) — there is no infrastructure-level isolation (no separate namespaces, no network policies between tenants).

The “shared” vs “dedicated” billing distinction (instance_tenancy in Metronome pricing config) is orthogonal to tenancy. “Shared” means serverless pay-per-run pricing; “dedicated” means the customer pays for reserved uptime on a deployment. Both billing models serve requests from potentially multiple accounts on the same GPU pods.


Workers AI Resource Model

GPU Types and Memory Sizing

Workers AI models specify GPU requirements in Config API (config_api/src/lib.rs):

| Field | Default | Description |
|---|---|---|
| gpu_memory | 22 (GB) | GPU memory requested |
| cpu_memory | same as gpu_memory | CPU memory override |
| vcpu | (none) | CPU core request |
| dual_gpu | false | Needs 2 GPUs |
| gpu_model | (none) | GPU model override, e.g. "NVIDIA H100" |

The platform supports three GPU types (cloudchamber/src/application.rs:259-263):

  • L4 — Nvidia L4 (internal colos)
  • H100 — Nvidia H100 80 GB (internal colos + external capacity)
  • H200 — Nvidia H200 141 GB (external capacity)

GPU type is inferred from the gpu_model string in the application config. If no model is specified, H100 is assumed (application.rs:286-291).

Multi-GPU allocation is calculated from total gpu_memory divided by per-GPU maximum: 80 GB for H100, 141 GB for H200. The result is clamped to 1–8 GPUs (external_nodes/.../model.rs:27-34).
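
A sketch of that calculation, assuming ceiling division (the function name is hypothetical):

```go
package main

import "fmt"

// gpuCount divides total gpu_memory by the per-GPU maximum (80 GB for H100,
// 141 GB for H200), rounding up and clamping to 1–8 GPUs.
func gpuCount(gpuMemoryGB, perGPUMaxGB int) int {
	n := (gpuMemoryGB + perGPUMaxGB - 1) / perGPUMaxGB // ceiling division
	if n < 1 {
		n = 1
	}
	if n > 8 {
		n = 8
	}
	return n
}

func main() {
	fmt.Println(gpuCount(160, 80)) // a 160 GB model on H100s needs 2 GPUs
}
```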

Unlike Replicate, users never select hardware. The model’s GPU requirements are set by Workers AI operators via Config API, and the platform handles placement.

Placement and Scheduling

Workers AI uses two scheduling paths:

Internal capacity (Cloudchamber): Models are scheduled via Cloudchamber, Cloudflare’s internal container orchestrator. The create_application_request function (application.rs:247-255) maps model config to a Cloudchamber application with SchedulingPolicy::Gpu, placement constraints (colo tier, region, PoPs), and optional scheduling priority.

Cloudchamber’s placement is controlled by constraints on the application object. These map from Config API properties to ApplicationConstraints fields (application.rs:216-233):

Colo tier (colo_tier): Cloudflare datacenters are grouped into tiers by GPU capacity. Setting colo_tier restricts a model’s instances to datacenters at that tier. For example, when deploying in “tiger mode” (canary), default single-GPU models are pinned to tier 3 (application.rs:231-232).

Colo region (colo_region): Restricts instances to datacenters in specific geographic regions. Accepts a comma-separated list.

Colo PoPs (colo_pops): The most specific constraint — restricts instances to named Points of Presence (individual datacenters).

Scheduling priority (scheduling_priority): Sets the Cloudchamber scheduling priority for the application. Default is 50. Currently the only non-default value is Leonardo = 75, used for Leonardo partnership models to give them preferential placement (scheduling_priority.rs).

Separately, blacklisted_colos removes specific colos from the model’s routing table (not scheduling). This is consumed by the routing app when building the colo list that constellation-entry uses for request forwarding — it doesn’t affect where Cloudchamber places instances (routing/src/lib.rs:201-208).
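
The Config API → ApplicationConstraints mapping can be sketched as below. The struct shapes and field names are illustrative, not the real Rust types in application.rs:

```go
package main

import (
	"fmt"
	"strings"
)

// ModelConfig holds the placement-related Config API properties (illustrative).
type ModelConfig struct {
	ColoTier   int
	ColoRegion string // comma-separated, e.g. "weur,eeur"
	ColoPops   string // comma-separated PoP codes
}

// ApplicationConstraints is a hypothetical stand-in for Cloudchamber's
// application constraints object.
type ApplicationConstraints struct {
	Tier    int
	Regions []string
	Pops    []string
}

// toConstraints maps Config API properties to application constraints,
// splitting the comma-separated lists.
func toConstraints(c ModelConfig) ApplicationConstraints {
	split := func(s string) []string {
		if s == "" {
			return nil
		}
		return strings.Split(s, ",")
	}
	return ApplicationConstraints{
		Tier:    c.ColoTier,
		Regions: split(c.ColoRegion),
		Pops:    split(c.ColoPops),
	}
}

func main() {
	// e.g. a tier-3 pin as used for canary ("tiger mode") deploys
	fmt.Println(toConstraints(ModelConfig{ColoTier: 3, ColoRegion: "weur,eeur"}))
}
```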

External capacity (Kubernetes): For external GPU providers (OCI, etc.), ai-scheduler creates a Model custom resource (external_nodes/.../model.rs:73-150) that the IKE operator reconciles into K8s Deployments. GPU limits are set as nvidia.com/gpu resource requests when >1 GPU is needed.

Tenancy Model

Workers AI is multi-tenant at the model level. The default is that all accounts share the same model instances — a single GPU container serving @cf/meta/llama-3.1-8b-instruct handles requests from every account. The isolation boundary is the model ID, not the caller.

However, models can be restricted to specific accounts. Two mechanisms exist in worker-constellation-entry (ai.ts:49-64):

  • allowed_accounts — a comma-separated list of account IDs in Config API. When set, only listed accounts can use the model. Requests from other accounts get a 403. This is how partnership models (e.g. @cf/leonardo/phoenix-1.0) are restricted to the partner’s account.
  • is_private — a boolean flag marking a model as private.

Both are enforced at the API layer in worker-constellation-entry, not at the infrastructure level. A “private” Leonardo model still runs on the same GPU fleet as public models — the access gate is just earlier in the request path. This parallels Replicate’s approach where private model access control is also API-layer only.
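
The allowed_accounts gate amounts to a list-membership check before the request is served. A minimal sketch (the real check is TypeScript in worker-constellation-entry; this Go version is illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// accountAllowed returns whether accountID may use a model given its
// comma-separated allowed_accounts list from Config API.
func accountAllowed(allowedAccounts, accountID string) bool {
	if allowedAccounts == "" {
		return true // no allow-list: model is open to all accounts
	}
	for _, id := range strings.Split(allowedAccounts, ",") {
		if strings.TrimSpace(id) == accountID {
			return true
		}
	}
	return false // caller would get a 403
}

func main() {
	fmt.Println(accountAllowed("acct1,acct2", "acct3")) // unlisted account
}
```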

Since all accounts share the same GPU instances, fairness is enforced at the request level — per-account fair queuing and rate limiting in constellation-server. See Queue Management for details.


Key Differences

| Aspect | Replicate | Workers AI |
|---|---|---|
| Hardware selection | User-facing SKU picker | Operator-configured, abstracted from users |
| GPU types | T4, A100 (40/80), H100, H200, L40S + legacy | L4, H100, H200 |
| Multi-GPU | compute_units_per_instance (1–8) | gpu_memory / per-GPU max (1–8) |
| Resource limits | Explicit per-unit CPU/mem/GPU in Go map | GPU memory-based; CPU/mem optional |
| Isolation boundary | Deployable (model version or deployment): dedicated GPU pods per deployable, shared by multiple accounts | Model ID: all accounts share the same instances, no per-account isolation |
| Access control | API-layer: public models open to all; private models gated by allow-list | API-layer: public models open to all; allowed_accounts / is_private restrict partnership and private models |
| Fairness | No fairness controls: all requests to a deployable are equal | Per-account fair queuing + rate limiting (see Queue Management) |
| Scheduling | K8s with bin-pack scheduler + affinity | Cloudchamber (internal) + K8s via IKE (external) |
| Cloud providers | CoreWeave CKS, Nebius, GKE (legacy) | Cloudflare internal colos + external (OCI, etc.) |
| Preemption | Priority classes, configurable per-deployable | Scheduling priority in Cloudchamber |
| Placement constraints | GPU class node affinity, priority node pools | Colo tier/region/PoP (scheduling); blacklisted colos (routing only) |

The fundamental architectural difference is where the isolation boundary sits. Replicate isolates at the deployable level — each model version or deployment gets dedicated GPU pods, but multiple accounts can share those pods (public models) or access is allow-listed (private models). Workers AI isolates at the model level — all accounts share the same instances for a given model, with fairness enforced at the request level via queuing and rate limiting rather than resource partitioning.