Resource Allocation
Overview
Both platforms need to map model requirements to GPU hardware, but they approach it from opposite directions. Replicate exposes hardware selection to users as a first-class concept — users pick a SKU, and the platform provisions dedicated resources. Workers AI abstracts hardware entirely — operators configure GPU memory requirements per model in Config API, and the platform handles placement across a shared fleet.
Replicate Resource Model
Hardware SKUs
The Hardware Django model
(web/models/hardware.py) defines every available
hardware option as a SKU. Each SKU combines a compute unit (GPU type)
with a compute_units_per_instance count (how many GPUs per pod).
Current compute units
(cluster/pkg/kubernetes/compute_unit.go):
| Compute Unit | GPU | Status |
|---|---|---|
| cpu | None | Active |
| gpu-t4 | Nvidia T4 (16 GB) | Active |
| gpu-a100 | Nvidia A100 (40 GB) | Active |
| gpu-a100-80g | Nvidia A100 (80 GB) | Active |
| gpu-h100 | Nvidia H100 (80 GB) | Active |
| gpu-h200 | Nvidia H200 (141 GB) | Active |
| gpu-l40s | Nvidia L40S (48 GB) | Active |
| gpu-a40, gpu-a40-small, gpu-a40-large | Nvidia A40 (48 GB) | Legacy |
| gpu-t4-highmem, gpu-t4-lowmem | Nvidia T4 variants | Legacy |
| gpu-rtx-a4000, gpu-rtx-a5000, gpu-rtx-a6000 | Nvidia RTX Axxxx | Legacy |
| gpu-flex-ampere-min-40g | Any Ampere ≥40 GB | Legacy |
Multi-GPU SKUs use the same compute unit with a higher
compute_units_per_instance. For example, gpu-2x-a100 is compute unit
gpu-a100 with compute_units_per_instance=2. The Hardware model tracks
this as a separate SKU with its own pricing.
Hardware availability is gated by flags on the model:
allow_for_models, allow_for_deployments, is_legacy, is_preview.
The HardwareQuerySet methods (available_for_models(),
available_for_deployments()) filter to billable, non-legacy, public
hardware
(web/models/hardware.py:162-166).
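The flag gating can be sketched as a simple filter (a Python sketch of the QuerySet logic; the Hardware shape below is illustrative and covers only the flags named above):

```python
from dataclasses import dataclass

@dataclass
class Hardware:
    # Illustrative subset of the Django Hardware model's fields.
    sku: str
    allow_for_models: bool = False
    allow_for_deployments: bool = False
    is_legacy: bool = False
    is_preview: bool = False

def available_for_models(fleet):
    # Mirrors HardwareQuerySet.available_for_models():
    # non-legacy hardware that is enabled for models.
    return [h for h in fleet if h.allow_for_models and not h.is_legacy]

def available_for_deployments(fleet):
    return [h for h in fleet if h.allow_for_deployments and not h.is_legacy]
```

The real QuerySet also restricts to billable, public SKUs; those fields are omitted here for brevity.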
K8s Resource Limits
The computeUnitLimits map
(cluster/pkg/kubernetes/resources.go) defines CPU,
memory, and GPU limits per compute unit:
| Compute Unit | CPU | Memory | GPUs |
|---|---|---|---|
| cpu | 1 | 2 Gi | 0 |
| gpu-t4 | 4 | 16 Gi | 1 |
| gpu-a100 | 10 | 72 Gi | 1 |
| gpu-a100-80g | 10 | 144 Gi | 1 |
| gpu-h100 | 13 | 144 Gi | 1 |
| gpu-h200 | 13 | 144 Gi | 1 |
| gpu-l40s | 10 | 72 Gi | 1 |
For multi-GPU pods, limits are multiplied by compute_units_per_instance
with special cases for 8x configurations:
- 8x A40/A40-Large: 6 CPU, 85 Gi per unit (reduced from 10/72)
- 8x A100-80G: 10 CPU, 120 Gi per unit (reduced from 10/144)
- T4 requests: memory request fudged to 13 Gi (limit stays 16 Gi) to fit 4x T4 on a node
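The multiplication and its 8x special cases can be sketched as (a Python sketch; base numbers come from the table above, and the function name is illustrative):

```python
# Base per-unit limits from the computeUnitLimits table: (CPU cores, memory GiB).
BASE_LIMITS = {
    "cpu": (1, 2),
    "gpu-t4": (4, 16),
    "gpu-a100": (10, 72),
    "gpu-a100-80g": (10, 144),
    "gpu-h100": (13, 144),
    "gpu-h200": (13, 144),
    "gpu-l40s": (10, 72),
    "gpu-a40": (10, 72),  # legacy; base implied by the 8x reduction note
}

def pod_limits(unit, count):
    """Per-pod limits: per-unit limits times GPU count, with 8x overrides."""
    cpu, mem = BASE_LIMITS[unit]
    if count == 8:
        if unit in ("gpu-a40", "gpu-a40-large"):
            cpu, mem = 6, 85    # reduced from 10/72 per unit
        elif unit == "gpu-a100-80g":
            cpu, mem = 10, 120  # reduced from 10/144 per unit
    return cpu * count, mem * count
```

The 8x reductions keep an eight-GPU pod within what a single node can actually offer after system overhead.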
Director’s sidecar container adds its own overhead: 256 Mi for most
GPUs, 512 Mi for L40S/H100/A100/A100-80G, or 1 Gi when
DirectorExtraMemory is set
(deployable.go:1450-1465).
The model container also gets /dev/shm sized to 50% of its memory
limit
(deployable.go:681-686).
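The per-pod memory overheads above can be sketched as (a Python sketch; function names are illustrative, values follow the text):

```python
def director_overhead_mib(unit, extra_memory=False):
    """Director sidecar memory overhead, per the rules above."""
    if extra_memory:  # DirectorExtraMemory is set
        return 1024
    if unit in ("gpu-l40s", "gpu-h100", "gpu-a100", "gpu-a100-80g"):
        return 512
    return 256        # most other GPUs

def shm_size_mib(model_memory_limit_mib):
    """/dev/shm is sized to 50% of the model container's memory limit."""
    return model_memory_limit_mib // 2
```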
Node Placement and Bin-Packing
Replicate runs on three cloud providers
(config/config.go):
- CoreWeave CKS — primary, most models
- Nebius MK8S — H200 capacity
- GKE — T4 capacity
On CoreWeave CKS, placement uses a combination of node affinity,
bin-packing, priority classes, and topology spread
(deployable_cks.go):
Node affinity (required): pods are pinned to nodes matching the GPU
class label (gpu.nvidia.com/class). CPU models go on GPU nodes but
are excluded from H100/H200 nodes to preserve expensive capacity.
Procedure (pipeline) workloads go to a dedicated
customer-cpu-nodepool.
Bin-packing: GPU models use CoreWeave’s binpack-scheduler to pack
pods tightly, leaving room for 4x and 8x pods. Pod affinity preferences
(weight 10) group pods of the same compute unit on the same node. A
second pod affinity groups pods by grace period (standard vs extended
predict timeout) so that preemption doesn’t kill long-running
predictions.
Priority node pools: 8x GPU pods get a node preference (weight 100) for dedicated priority node pools. Non-8x pods prefer to avoid these pools. On-demand node pools are also deprioritized (weight 100 against).
Priority classes:
- r8-high — 4x+ GPU pods that are allowed to preempt others
- r8-high-no-preempt — 4x+ GPU pods that won’t preempt
- r8-cpu-model — CPU models, lowest priority, preemptible by GPU
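The class assignment can be sketched as (a Python sketch; the 4x+ threshold and class names follow the list above, while the per-deployable preempt flag and the default for smaller GPU pods are assumptions):

```python
def priority_class(unit, gpu_count, allow_preempt=True):
    """Pick the CKS priority class for a pod, per the rules above."""
    if unit == "cpu":
        return "r8-cpu-model"  # lowest priority, preemptible by GPU pods
    if gpu_count >= 4:
        return "r8-high" if allow_preempt else "r8-high-no-preempt"
    return None  # assumed: smaller GPU pods get the default class
```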
Topology spread: CPU models on CKS use TopologySpreadConstraints
with maxSkew=2 to prevent bunching on a single node and starving GPU
models of CPU resources.
On Nebius MK8S, placement is similar to CKS but narrower in scope
(deployable_mk8s.go):
GPU support: H200 only. The switch statement handles ComputeUnitCPU
and ComputeUnitH200 — everything else gets a NO_SUCH_GPU sentinel
that prevents scheduling. Node affinity uses the standard K8s label
node.kubernetes.io/instance-type (value gpu-h200-sxm).
Bin-packing: Same pod affinity logic as CKS — non-8x GPU pods get compute-unit bin-packing (weight 10) and grace-period bin-packing (weight 10). No custom scheduler though — uses the default K8s scheduler.
CPU models: Excluded from H200 nodes via NodeSelectorOpNotIn on
the GPU class label, but no dedicated CPU node pool — they land on
whatever non-H200 nodes are available.
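The MK8S switch can be sketched as (a Python rendering of the Go switch; the sentinel behavior is as described, the function name is illustrative):

```python
NO_SUCH_GPU = "NO_SUCH_GPU"  # sentinel that prevents scheduling

def mk8s_instance_type(unit):
    """Map a compute unit to a Nebius MK8S node instance type."""
    if unit == "cpu":
        return None              # CPU models: no GPU instance-type pin
    if unit == "gpu-h200":
        return "gpu-h200-sxm"    # node.kubernetes.io/instance-type value
    return NO_SUCH_GPU           # anything else cannot schedule on MK8S
```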
On GKE, placement is minimal
(deployable_gke.go):
Node affinity (required): pods are pinned to nodes matching a
replicate/role label with value model-{compute_unit} (e.g.
model-gpu-t4). This is a Replicate-defined label on the node pool,
not a vendor GPU class label.
Tolerations: Three tolerations per pod — the default GKE
nvidia.com/gpu taint, a legacy model-hardware taint, and a newer
model-compute-unit taint. The code comments indicate the
compute-unit taint is the intended replacement for the hardware taint.
Tenancy Model
The isolation boundary is the deployable (model version or deployment), not the account. Each deployable gets its own K8s Deployment with dedicated pods and exclusive GPU access. But multiple accounts can send prediction requests to the same deployable — the GPU pods themselves are shared across all authorized requesters.
Public models and deployments are multi-tenant: any account can run predictions against them, and all requests land on the same pool of Director pods. This is the common case for official models.
Private models and deployments are single-tenant by default: only the owning account can run predictions. Access can be granted to other accounts via an allow-list, similar to GitHub’s private repository permissions model. The access control is enforced at the API layer (web/api check permissions before enqueuing) — there is no infrastructure-level isolation (no separate namespaces, no network policies between tenants).
The “shared” vs “dedicated” billing distinction (instance_tenancy
in Metronome pricing config) is orthogonal to tenancy. “Shared”
means serverless pay-per-run pricing; “dedicated” means the customer
pays for reserved uptime on a deployment. Both billing models serve
requests from potentially multiple accounts on the same GPU pods.
Workers AI Resource Model
GPU Types and Memory Sizing
Workers AI models specify GPU requirements in Config API
(config_api/src/lib.rs):
| Field | Default | Description |
|---|---|---|
| gpu_memory | 22 GB | GPU memory requested |
| cpu_memory | same as gpu_memory | CPU memory override |
| vcpu | (none) | CPU core request |
| dual_gpu | false | Needs 2 GPUs |
| gpu_model | (none) | GPU model override, e.g. "NVIDIA H100" |
The platform supports three GPU types
(cloudchamber/src/application.rs:259-263):
- L4 — Nvidia L4 (internal colos)
- H100 — Nvidia H100 80 GB (internal colos + external capacity)
- H200 — Nvidia H200 141 GB (external capacity)
GPU type is inferred from the gpu_model string in the application
config. If no model is specified, H100 is assumed
(application.rs:286-291).
Multi-GPU allocation is calculated from total gpu_memory divided by
per-GPU maximum: 80 GB for H100, 141 GB for H200. The result is clamped
to 1–8 GPUs
(external_nodes/.../model.rs:27-34).
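Both rules can be sketched together (a Python sketch; the per-GPU maxima, the H100 default, and the 1–8 clamp follow the text, while the substring matching is an assumption):

```python
import math

PER_GPU_MAX_GB = {"H100": 80, "H200": 141}

def gpu_type(gpu_model=None):
    """Infer the GPU type from the gpu_model string; default is H100."""
    if gpu_model is None:
        return "H100"
    for t in ("H200", "H100", "L4"):  # assumed: simple substring match
        if t in gpu_model:
            return t
    return "H100"

def gpu_count(gpu_memory_gb, gpu_type):
    """Total gpu_memory divided by the per-GPU maximum, clamped to 1-8."""
    n = math.ceil(gpu_memory_gb / PER_GPU_MAX_GB[gpu_type])
    return max(1, min(8, n))
```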
Unlike Replicate, users never select hardware. The model’s GPU requirements are set by Workers AI operators via Config API, and the platform handles placement.
Placement and Scheduling
Workers AI uses two scheduling paths:
Internal capacity (Cloudchamber): Models are scheduled via
Cloudchamber, Cloudflare’s internal container orchestrator. The
create_application_request function
(application.rs:247-255) maps model config to a
Cloudchamber application with SchedulingPolicy::Gpu, placement
constraints (colo tier, region, PoPs), and optional scheduling priority.
Cloudchamber’s placement is controlled by constraints on the
application object. These map from Config API properties to
ApplicationConstraints fields
(application.rs:216-233):
Colo tier (colo_tier): Cloudflare datacenters are grouped into
tiers by GPU capacity. Setting colo_tier restricts a model’s
instances to datacenters at that tier. For example, when deploying in
“tiger mode” (canary), default single-GPU models are pinned to tier 3
(application.rs:231-232).
Colo region (colo_region): Restricts instances to datacenters in
specific geographic regions. Accepts a comma-separated list.
Colo PoPs (colo_pops): The most specific constraint — restricts
instances to named Points of Presence (individual datacenters).
Scheduling priority (scheduling_priority): Sets the Cloudchamber
scheduling priority for the application. Default is 50. Currently the
only non-default value is Leonardo = 75, used for Leonardo
partnership models to give them preferential placement
(scheduling_priority.rs).
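The mapping from Config API fields to constraints and priority can be sketched as (a Python sketch; the output dict keys are hypothetical stand-ins for ApplicationConstraints fields, and the priority values follow the text):

```python
DEFAULT_PRIORITY = 50
LEONARDO_PRIORITY = 75  # currently the only non-default value

def scheduling_priority(partner=None):
    """Cloudchamber scheduling priority; Leonardo models get 75."""
    return LEONARDO_PRIORITY if partner == "leonardo" else DEFAULT_PRIORITY

def application_constraints(config):
    """Map Config API placement fields to a constraints-style dict."""
    c = {}
    if "colo_tier" in config:
        c["tier"] = config["colo_tier"]       # restrict to a datacenter tier
    if "colo_region" in config:
        c["regions"] = [r.strip() for r in config["colo_region"].split(",")]
    if "colo_pops" in config:
        c["pops"] = [p.strip() for p in config["colo_pops"].split(",")]
    return c
```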
Separately, blacklisted_colos removes specific colos from the
model’s routing table (not scheduling). This is consumed by the
routing app when building the colo list that constellation-entry
uses for request forwarding — it doesn’t affect where Cloudchamber
places instances
(routing/src/lib.rs:201-208).
External capacity (Kubernetes): For external GPU providers (OCI,
etc.), ai-scheduler creates a Model custom resource
(external_nodes/.../model.rs:73-150) that the IKE
operator reconciles into K8s Deployments. GPU limits are set as
nvidia.com/gpu resource requests when >1 GPU is needed.
Tenancy Model
Workers AI is multi-tenant at the model level. The default is
that all accounts share the same model instances — a single GPU
container serving @cf/meta/llama-3.1-8b-instruct handles requests
from every account. The isolation boundary is the model ID, not the
caller.
However, models can be restricted to specific accounts. Two
mechanisms exist in worker-constellation-entry
(ai.ts:49-64):
- allowed_accounts — a comma-separated list of account IDs in Config API. When set, only listed accounts can use the model; requests from other accounts get a 403. This is how partnership models (e.g. @cf/leonardo/phoenix-1.0) are restricted to the partner’s account.
- is_private — a boolean flag marking a model as private.
Both are enforced at the API layer in worker-constellation-entry, not at the infrastructure level. A “private” Leonardo model still runs on the same GPU fleet as public models — the access gate is just earlier in the request path. This parallels Replicate’s approach where private model access control is also API-layer only.
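The allow-list gate can be sketched as (a Python sketch of the logic; the real code in worker-constellation-entry is TypeScript, and the model dict shape here is illustrative):

```python
def can_access(model, account_id):
    """API-layer access check for a Workers AI model."""
    allowed = model.get("allowed_accounts")
    if allowed:
        # Comma-separated allow-list: only listed accounts may call;
        # everyone else gets a 403 before the request reaches a GPU.
        if account_id not in [a.strip() for a in allowed.split(",")]:
            return False
    return True
```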
Since all accounts share the same GPU instances, fairness is enforced at the request level — per-account fair queuing and rate limiting in constellation-server. See Queue Management for details.
Key Differences
| Aspect | Replicate | Workers AI |
|---|---|---|
| Hardware selection | User-facing SKU picker | Operator-configured, abstracted from users |
| GPU types | T4, A100 (40/80), H100, H200, L40S + legacy | L4, H100, H200 |
| Multi-GPU | compute_units_per_instance (1–8) | gpu_memory / per-GPU max (1–8) |
| Resource limits | Explicit per-unit CPU/mem/GPU in Go map | GPU memory-based, CPU/mem optional |
| Isolation boundary | Deployable (model version or deployment) — dedicated GPU per deployable, multiple accounts share it | Model ID — all accounts share the same instances, no per-account isolation |
| Access control | API-layer: public models open to all, private models gated by allow-list | API-layer: public models open to all, allowed_accounts / is_private restrict partnership and private models |
| Fairness | No fairness controls — all requests to a deployable are equal | Per-account fair queuing + rate limiting (see Queue Management) |
| Scheduling | K8s with bin-pack scheduler + affinity | Cloudchamber (internal) + K8s via IKE (external) |
| Cloud providers | CoreWeave CKS, Nebius, GKE (legacy) | Cloudflare internal colos + external (OCI, etc.) |
| Preemption | Priority classes, configurable per-deployable | Scheduling priority in Cloudchamber |
| Placement constraints | GPU class node affinity, priority node pools | Colo tier/region/PoP (scheduling); blacklisted colos (routing only) |
The fundamental architectural difference is where the isolation boundary sits. Replicate isolates at the deployable level — each model version or deployment gets dedicated GPU pods, but multiple accounts can share those pods (public models) or access is allow-listed (private models). Workers AI isolates at the model level — all accounts share the same instances for a given model, with fairness enforced at the request level via queuing and rate limiting rather than resource partitioning.