
Resource Allocation

Overview

Both platforms need to map model requirements to GPU hardware, but they approach it from opposite directions. Replicate exposes hardware selection to users as a first-class concept — users pick a SKU, and the platform provisions dedicated resources. Workers AI abstracts hardware entirely — operators configure GPU memory requirements per model in Config API, and the platform handles placement across a shared fleet.


Replicate Resource Model

Hardware SKUs

The Hardware Django model (web/models/hardware.py) defines every available hardware option as a SKU. Each SKU combines a compute unit (GPU type) with a compute_units_per_instance count (how many GPUs per pod).

Current compute units (cluster/pkg/kubernetes/compute_unit.go):

| Compute Unit | GPU | Status |
|---|---|---|
| cpu | None | Active |
| gpu-t4 | Nvidia T4 (16 GB) | Active |
| gpu-a100 | Nvidia A100 (40 GB) | Active |
| gpu-a100-80g | Nvidia A100 (80 GB) | Active |
| gpu-h100 | Nvidia H100 (80 GB) | Active |
| gpu-h200 | Nvidia H200 (141 GB) | Active |
| gpu-l40s | Nvidia L40S (48 GB) | Active |
| gpu-a40, gpu-a40-small, gpu-a40-large | Nvidia A40 (48 GB) | Legacy |
| gpu-t4-highmem, gpu-t4-lowmem | Nvidia T4 variants | Legacy |
| gpu-rtx-a4000, gpu-rtx-a5000, gpu-rtx-a6000 | Nvidia RTX Axxxx | Legacy |
| gpu-flex-ampere-min-40g | Any Ampere ≥40 GB | Legacy |

Multi-GPU SKUs use the same compute unit with a higher compute_units_per_instance. For example, gpu-2x-a100 is compute unit gpu-a100 with compute_units_per_instance=2. The Hardware model tracks this as a separate SKU with its own pricing.

Hardware availability is gated by flags on the model: allow_for_models, allow_for_deployments, is_legacy, is_preview. The HardwareQuerySet methods (available_for_models(), available_for_deployments()) filter to billable, non-legacy, public hardware (web/models/hardware.py:162-166).

K8s Resource Limits

The computeUnitLimits map (cluster/pkg/kubernetes/resources.go) defines CPU, memory, and GPU limits per compute unit:

| Compute Unit | CPU | Memory | GPUs |
|---|---|---|---|
| cpu | 1 | 2 Gi | 0 |
| gpu-t4 | 4 | 16 Gi | 1 |
| gpu-a100 | 10 | 72 Gi | 1 |
| gpu-a100-80g | 10 | 144 Gi | 1 |
| gpu-h100 | 13 | 144 Gi | 1 |
| gpu-h200 | 13 | 144 Gi | 1 |
| gpu-l40s | 10 | 72 Gi | 1 |

For multi-GPU pods, limits are multiplied by compute_units_per_instance with special cases for 8x configurations:

  • 8x A40/A40-Large: 6 CPU, 85 Gi per unit (reduced from 10/72)
  • 8x A100-80G: 10 CPU, 120 Gi per unit (reduced from 10/144)
  • T4 requests: memory request fudged to 13 Gi (limit stays 16 Gi) to fit 4x T4 on a node
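
The table and special cases above can be sketched as follows. This is an illustrative reconstruction, not the actual computeUnitLimits map from resources.go; the A40 base limits are inferred from the "reduced from 10/72" note.

```go
package main

import "fmt"

// limits holds per-unit resource limits; shapes are illustrative.
type limits struct {
	CPU   int // cores
	MemGi int // memory in Gi
	GPUs  int
}

// unitLimits mirrors the per-compute-unit limits table above.
var unitLimits = map[string]limits{
	"cpu":          {1, 2, 0},
	"gpu-t4":       {4, 16, 1},
	"gpu-a100":     {10, 72, 1},
	"gpu-a100-80g": {10, 144, 1},
	"gpu-h100":     {13, 144, 1},
	"gpu-h200":     {13, 144, 1},
	"gpu-l40s":     {10, 72, 1},
	"gpu-a40":      {10, 72, 1}, // legacy; base inferred from the 8x note
	"gpu-a40-large": {10, 72, 1},
}

// podLimits multiplies per-unit limits by compute_units_per_instance,
// applying the documented 8x special cases.
func podLimits(unit string, n int) limits {
	per := unitLimits[unit]
	if n == 8 {
		switch unit {
		case "gpu-a40", "gpu-a40-large":
			per.CPU, per.MemGi = 6, 85 // reduced from 10/72
		case "gpu-a100-80g":
			per.MemGi = 120 // reduced from 144
		}
	}
	return limits{per.CPU * n, per.MemGi * n, per.GPUs * n}
}

func main() {
	// e.g. gpu-2x-a100 is gpu-a100 with compute_units_per_instance=2
	fmt.Println(podLimits("gpu-a100", 2))
}
```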

Director’s sidecar container adds its own overhead: 256 Mi for most GPUs, 512 Mi for L40S/H100/A100/A100-80G, or 1 Gi when DirectorExtraMemory is set (deployable.go:1450-1465).

The model container also gets /dev/shm sized to 50% of its memory limit (deployable.go:681-686).
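
A minimal sketch of this memory bookkeeping, assuming hypothetical function names (the real logic lives in deployable.go):

```go
package main

import "fmt"

// directorSidecarMemMi returns the Director sidecar's memory overhead in Mi,
// per the rules described above.
func directorSidecarMemMi(unit string, extraMemory bool) int {
	if extraMemory { // DirectorExtraMemory set
		return 1024
	}
	switch unit {
	case "gpu-l40s", "gpu-h100", "gpu-a100", "gpu-a100-80g":
		return 512
	default:
		return 256
	}
}

// shmSizeGi returns the /dev/shm size: half the model container's memory limit.
func shmSizeGi(memLimitGi int) int {
	return memLimitGi / 2
}

func main() {
	fmt.Println(directorSidecarMemMi("gpu-h100", false), shmSizeGi(144))
}
```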

Node Placement and Bin-Packing

Replicate runs on three cloud providers (config/config.go):

  • CoreWeave CKS — primary, most models
  • Nebius MK8S — H200 capacity
  • GKE — T4 capacity

On CoreWeave CKS, placement uses a combination of node affinity, bin-packing, priority classes, and topology spread (deployable_cks.go):

Node affinity (required): pods are pinned to nodes matching the GPU class label (gpu.nvidia.com/class). CPU models go on GPU nodes but are excluded from H100/H200 nodes to preserve expensive capacity. Procedure (pipeline) workloads go to a dedicated customer-cpu-nodepool.

Bin-packing: GPU models use CoreWeave’s binpack-scheduler to pack pods tightly, leaving room for 4x and 8x pods. Pod affinity preferences (weight 10) group pods of the same compute unit on the same node. A second pod affinity groups pods by grace period (standard vs extended predict timeout) so that preemption doesn’t kill long-running predictions.

Priority node pools: 8x GPU pods get a node preference (weight 100) for dedicated priority node pools. Non-8x pods prefer to avoid these pools. On-demand node pools are also deprioritized (weight 100 against).

Priority classes:

  • r8-high — 4x+ GPU pods that are allowed to preempt others
  • r8-high-no-preempt — 4x+ GPU pods that won’t preempt
  • r8-cpu-model — CPU models, lowest priority, preemptible by GPU

Topology spread: CPU models on CKS use TopologySpreadConstraints with maxSkew=2 to prevent bunching on a single node and starving GPU models of CPU resources.
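
Put together, a single-GPU model pod on CKS might carry a spec fragment like the following sketch. The scheduler name, label key, and weights come from the description above; the GPU class value, topology key, and pod label are assumptions.

```yaml
# Illustrative CKS pod spec fragment for a single-GPU H100 model.
spec:
  schedulerName: binpack-scheduler        # CoreWeave bin-pack scheduler
  priorityClassName: r8-high-no-preempt
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu.nvidia.com/class
                operator: In
                values: ["H100"]          # assumed label value
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 10                      # bin-pack pods of the same compute unit
          podAffinityTerm:
            topologyKey: kubernetes.io/hostname   # assumed topology key
            labelSelector:
              matchLabels:
                compute-unit: gpu-h100    # assumed pod label
```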

On Nebius MK8S, placement is similar to CKS but narrower in scope (deployable_mk8s.go):

GPU support: H200 only. The switch statement handles ComputeUnitCPU and ComputeUnitH200 — everything else gets a NO_SUCH_GPU sentinel that prevents scheduling. Node affinity uses the standard K8s label node.kubernetes.io/instance-type (value gpu-h200-sxm).

Bin-packing: Same pod affinity logic as CKS — non-8x GPU pods get compute-unit bin-packing (weight 10) and grace-period bin-packing (weight 10). No custom scheduler though — uses the default K8s scheduler.

CPU models: Excluded from H200 nodes via NodeSelectorOpNotIn on the GPU class label, but no dedicated CPU node pool — they land on whatever non-H200 nodes are available.
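
The MK8S compute-unit mapping can be sketched as a switch, assuming a hypothetical function name (the NO_SUCH_GPU sentinel and instance-type value are from the description above):

```go
package main

import "fmt"

// mk8sInstanceType maps a compute unit to the node.kubernetes.io/instance-type
// value used for node affinity on Nebius MK8S.
func mk8sInstanceType(unit string) string {
	switch unit {
	case "cpu":
		return "" // CPU models: no instance-type pin, only a NotIn on the GPU class label
	case "gpu-h200":
		return "gpu-h200-sxm"
	default:
		return "NO_SUCH_GPU" // matches no node, so the pod never schedules
	}
}

func main() {
	fmt.Println(mk8sInstanceType("gpu-h200"))
}
```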

On GKE, placement is minimal (deployable_gke.go):

Node affinity (required): pods are pinned to nodes matching a replicate/role label with value model-{compute_unit} (e.g. model-gpu-t4). This is a Replicate-defined label on the node pool, not a vendor GPU class label.

Tolerations: Three tolerations per pod — the default GKE nvidia.com/gpu taint, a legacy model-hardware taint, and a newer model-compute-unit taint. The code comments indicate the compute-unit taint is the intended replacement for the hardware taint.
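
A GKE pod's scheduling fragment might look like the following sketch. The replicate/role label and value scheme come from the text; the exact taint keys are assumptions (nodeSelector is used here as shorthand for the required node affinity).

```yaml
# Illustrative GKE pod spec fragment for a T4 model.
spec:
  nodeSelector:
    replicate/role: model-gpu-t4
  tolerations:
    - key: nvidia.com/gpu         # default GKE GPU taint
      operator: Exists
    - key: model-hardware         # legacy taint (assumed key)
      operator: Exists
    - key: model-compute-unit     # intended replacement (assumed key)
      operator: Exists
```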

Tenancy Model

The isolation boundary is the deployable (model version or deployment), not the account. Each deployable gets its own K8s Deployment with dedicated pods and exclusive GPU access. But multiple accounts can send prediction requests to the same deployable — the GPU pods themselves are shared across all authorized requesters.

Public models and deployments are multi-tenant: any account can run predictions against them, and all requests land on the same pool of Director pods. This is the common case for official models.

Private models and deployments are single-tenant by default: only the owning account can run predictions. Access can be granted to other accounts via an allow-list, similar to GitHub’s private repository permissions model. The access control is enforced at the API layer (web/api check permissions before enqueuing) — there is no infrastructure-level isolation (no separate namespaces, no network policies between tenants).

The “shared” vs “dedicated” billing distinction (instance_tenancy in Metronome pricing config) is orthogonal to tenancy. “Shared” means serverless pay-per-run pricing; “dedicated” means the customer pays for reserved uptime on a deployment. Both billing models serve requests from potentially multiple accounts on the same GPU pods.


Workers AI Resource Model

GPU Types and Memory Sizing

Workers AI models specify GPU requirements in Config API (config_api/src/lib.rs):

| Field | Default | Description |
|---|---|---|
| gpu_memory | 22 (GB) | GPU memory requested |
| cpu_memory | same as gpu_memory | CPU memory override |
| vcpu | (none) | CPU core request |
| dual_gpu | false | Needs 2 GPUs |
| gpu_model | (none) | GPU model override, e.g. "NVIDIA H100" |

The platform supports three GPU types (cloudchamber/src/application.rs:259-263):

  • L4 — Nvidia L4 (internal colos)
  • H100 — Nvidia H100 80 GB (internal colos + external capacity)
  • H200 — Nvidia H200 141 GB (external capacity)

GPU type is inferred from the gpu_model string in the application config. If no model is specified, H100 is assumed (application.rs:286-291).

Multi-GPU allocation is calculated from total gpu_memory divided by per-GPU maximum: 80 GB for H100, 141 GB for H200. The result is clamped to 1–8 GPUs (external_nodes/.../model.rs:27-34).
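
A sketch of that calculation, assuming ceiling division (the function name is hypothetical):

```go
package main

import "fmt"

// gpuCount divides total gpu_memory by the per-GPU maximum (80 GB for H100,
// 141 GB for H200), rounding up and clamping to 1–8 GPUs.
func gpuCount(gpuMemoryGB, perGPUMaxGB int) int {
	n := (gpuMemoryGB + perGPUMaxGB - 1) / perGPUMaxGB // ceiling division
	if n < 1 {
		n = 1
	}
	if n > 8 {
		n = 8
	}
	return n
}

func main() {
	fmt.Println(gpuCount(160, 80)) // a 160 GB model on H100s needs 2 GPUs
}
```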

Unlike Replicate, users never select hardware. The model’s GPU requirements are set by Workers AI operators via Config API, and the platform handles placement.

Placement and Scheduling

Workers AI uses two scheduling paths:

Internal capacity (Cloudchamber): Models are scheduled via Cloudchamber, Cloudflare’s internal container orchestrator. The create_application_request function (application.rs:247-255) maps model config to a Cloudchamber application with SchedulingPolicy::Gpu, placement constraints (colo tier, region, PoPs), and optional scheduling priority.

Cloudchamber’s placement is controlled by constraints on the application object. These map from Config API properties to ApplicationConstraints fields (application.rs:216-233):

Colo tier (colo_tier): Cloudflare datacenters are grouped into tiers by GPU capacity. Setting colo_tier restricts a model’s instances to datacenters at that tier. For example, when deploying in “tiger mode” (canary), default single-GPU models are pinned to tier 3 (application.rs:231-232).

Colo region (colo_region): Restricts instances to datacenters in specific geographic regions. Accepts a comma-separated list.

Colo PoPs (colo_pops): The most specific constraint — restricts instances to named Points of Presence (individual datacenters).

Scheduling priority (scheduling_priority): Sets the Cloudchamber scheduling priority for the application. Default is 50. Currently the only non-default value is Leonardo = 75, used for Leonardo partnership models to give them preferential placement (scheduling_priority.rs).

Separately, blacklisted_colos removes specific colos from the model’s routing table (not scheduling). This is consumed by the routing app when building the colo list that constellation-entry uses for request forwarding — it doesn’t affect where Cloudchamber places instances (routing/src/lib.rs:201-208).
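
The Config API → ApplicationConstraints mapping can be sketched as below. The struct shapes and field names are illustrative, not the real Rust types in application.rs:

```go
package main

import (
	"fmt"
	"strings"
)

// ModelConfig holds the placement-related Config API properties (illustrative).
type ModelConfig struct {
	ColoTier   int
	ColoRegion string // comma-separated, e.g. "weur,eeur"
	ColoPops   string // comma-separated PoP codes
}

// ApplicationConstraints is a hypothetical stand-in for Cloudchamber's
// application constraints object.
type ApplicationConstraints struct {
	Tier    int
	Regions []string
	Pops    []string
}

// toConstraints maps Config API properties to application constraints,
// splitting the comma-separated lists.
func toConstraints(c ModelConfig) ApplicationConstraints {
	split := func(s string) []string {
		if s == "" {
			return nil
		}
		return strings.Split(s, ",")
	}
	return ApplicationConstraints{
		Tier:    c.ColoTier,
		Regions: split(c.ColoRegion),
		Pops:    split(c.ColoPops),
	}
}

func main() {
	// e.g. a tier-3 pin as used for canary ("tiger mode") deploys
	fmt.Println(toConstraints(ModelConfig{ColoTier: 3, ColoRegion: "weur,eeur"}))
}
```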

External capacity (Kubernetes): For external GPU providers (OCI, etc.), ai-scheduler creates a Model custom resource (external_nodes/.../model.rs:73-150) that the IKE operator reconciles into K8s Deployments. GPU limits are set as nvidia.com/gpu resource requests when >1 GPU is needed.

Tenancy Model

Workers AI is multi-tenant at the model level. The default is that all accounts share the same model instances — a single GPU container serving @cf/meta/llama-3.1-8b-instruct handles requests from every account. The isolation boundary is the model ID, not the caller.

However, models can be restricted to specific accounts. Two mechanisms exist in worker-constellation-entry (ai.ts:49-64):

  • allowed_accounts — a comma-separated list of account IDs in Config API. When set, only listed accounts can use the model. Requests from other accounts get a 403. This is how partnership models (e.g. @cf/leonardo/phoenix-1.0) are restricted to the partner’s account.
  • is_private — a boolean flag marking a model as private.

Both are enforced at the API layer in worker-constellation-entry, not at the infrastructure level. A “private” Leonardo model still runs on the same GPU fleet as public models — the access gate is just earlier in the request path. This parallels Replicate’s approach where private model access control is also API-layer only.
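
The allowed_accounts gate amounts to a list-membership check before the request is served. A minimal sketch (the real check is TypeScript in worker-constellation-entry; this Go version is illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// accountAllowed returns whether accountID may use a model given its
// comma-separated allowed_accounts list from Config API.
func accountAllowed(allowedAccounts, accountID string) bool {
	if allowedAccounts == "" {
		return true // no allow-list: model is open to all accounts
	}
	for _, id := range strings.Split(allowedAccounts, ",") {
		if strings.TrimSpace(id) == accountID {
			return true
		}
	}
	return false // caller would get a 403
}

func main() {
	fmt.Println(accountAllowed("acct1,acct2", "acct3")) // unlisted account
}
```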

Since all accounts share the same GPU instances, fairness is enforced at the request level — per-account fair queuing and rate limiting in constellation-server. See Queue Management for details.


Key Differences

| Aspect | Replicate | Workers AI |
|---|---|---|
| Hardware selection | User-facing SKU picker | Operator-configured, abstracted from users |
| GPU types | T4, A100 (40/80), H100, H200, L40S + legacy | L4, H100, H200 |
| Multi-GPU | compute_units_per_instance (1–8) | gpu_memory / per-GPU max (1–8) |
| Resource limits | Explicit per-unit CPU/mem/GPU in Go map | GPU memory-based; CPU/mem optional |
| Isolation boundary | Deployable (model version or deployment): dedicated GPU pods per deployable, shared by multiple accounts | Model ID: all accounts share the same instances, no per-account isolation |
| Access control | API-layer: public models open to all; private models gated by allow-list | API-layer: public models open to all; allowed_accounts / is_private restrict partnership and private models |
| Fairness | No fairness controls: all requests to a deployable are equal | Per-account fair queuing + rate limiting (see Queue Management) |
| Scheduling | K8s with bin-pack scheduler + affinity | Cloudchamber (internal) + K8s via IKE (external) |
| Cloud providers | CoreWeave CKS, Nebius, GKE (legacy) | Cloudflare internal colos + external (OCI, etc.) |
| Preemption | Priority classes, configurable per-deployable | Scheduling priority in Cloudchamber |
| Placement constraints | GPU class node affinity, priority node pools | Colo tier/region/PoP (scheduling); blacklisted colos (routing only) |

The fundamental architectural difference is where the isolation boundary sits. Replicate isolates at the deployable level — each model version or deployment gets dedicated GPU pods, but multiple accounts can share those pods (public models) or access is allow-listed (private models). Workers AI isolates at the model level — all accounts share the same instances for a given model, with fairness enforced at the request level via queuing and rate limiting rather than resource partitioning.