
Model Storage & Images

Overview

Both platforms need to get model weights and inference code onto GPU nodes before serving requests. The approaches diverge sharply. Replicate builds Docker images via cog, then optionally layers on a multi-tier caching system (pget → Hermes → object store) to get weights close to GPU nodes. Workers AI stores model weights in R2 and fetches them at container startup via a dedicated model-greeter init container, with a disk-reaper sidecar managing local cache on each GPU node.


Replicate: Images, Weights, and Caches

Container Image Resolution

Every deployable has a docker_image URI set during cog push (e.g. r8.im/user/model@sha256:...). At pod creation time, the cluster autoscaler resolves this to an internal registry URI (deployable.go:1999-2040).

The r8.im/ prefix is stripped and replaced with the internal MODEL_ARTIFACT_REGISTRY_BASE (an unauthenticated Artifact Registry mirror). Non-r8.im URIs are passed through unmodified.
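The rewrite can be sketched roughly as follows. The function name and the registry base value are illustrative; in the real code the base comes from MODEL_ARTIFACT_REGISTRY_BASE:

```go
package main

import (
	"fmt"
	"strings"
)

// resolveImageURI mimics the rewrite described above: r8.im/ URIs are
// redirected to the internal unauthenticated mirror; everything else
// passes through unmodified. The mirror base here is a stand-in.
func resolveImageURI(uri, registryBase string) string {
	if rest, ok := strings.CutPrefix(uri, "r8.im/"); ok {
		return registryBase + "/" + rest
	}
	return uri
}

func main() {
	base := "internal-registry.example/models" // stand-in for MODEL_ARTIFACT_REGISTRY_BASE
	fmt.Println(resolveImageURI("r8.im/user/model@sha256:abc", base))
	fmt.Println(resolveImageURI("docker.io/library/nginx:latest", base)) // untouched
}
```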

Two-Container Pods

Every model pod has two containers (deployable.go:460-476):

  1. director — orchestrates the model lifecycle (queue consumption, health polling, state reporting, cache restore/persist). Image comes from the services registry.
  2. model — runs the cog server. Image is the resolved model image.

They interact primarily over HTTP, but also share a supervisor-ipc volume (a 50 MB tmpfs) for certain cases, plus a run-cache volume for runtime state.

Standard Startup Path

The model container’s entrypoint (ModelEntrypointScript.sh) runs:

  1. Updates pget binary from PGET_DOWNLOAD_URL (or uses the monobase-bundled version if available).
  2. Optionally upgrades cog (via ENTRYPOINT_COG_OVERRIDE_PACKAGE) or installs hf_transfer.
  3. Starts the cog HTTP server (python -m cog.server.http).

The model’s setup() method runs inside the cog server; this is often the step during which weights are downloaded via pget, proxied through Hermes as a pull-through cache.

FUSE (deprecated)

Replicate also built a FUSE-based path (fuse/fuse.go) that separated weights from code entirely: a host-level FUSE daemon served weights on demand, and the model container ran a lightweight monobase image instead of a Docker image with weights baked in. Director acted as a gRPC client to the FUSE mounter, managing mount lifecycle via Start/Heartbeat/Stop RPCs. This eliminated the weight download step from cold starts — weights were read lazily from the FUSE mount as the model accessed them.

The approach is being wound down and remaining FUSE-enabled models are slated for migration to the standard path. The code still exists in getImageURI (monobase fallback) and the FUSE entrypoint script, but no new models use it.

pget: Parallel Chunk Downloader

pget (replicate/pget) downloads model weights in parallel chunks. Key configuration (download/options.go:24-43):

  • CacheableURIPrefixes: Allowlist of domains+path-prefixes eligible for pull-through caching.
  • CacheHosts: Ordered list of cache hostnames used with consistent hashing — the same URL always routes to the same cache host.
  • ForceCachePrefixRewrite: When enabled, rewrites all requests to the first cache host (used for Hermes routing).
  • CacheUsePathProxy: Prepends the original host to the cache request path instead of using host-based routing.

Config is injected via K8s ConfigMaps: pget-config (standard) or pget-hermes-config (Hermes-enabled regions).
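The routing behavior of CacheHosts and ForceCachePrefixRewrite can be illustrated with a minimal sketch. The hash choice and helper name are assumptions, not pget’s actual implementation; the point is the property described above — the same URL always maps to the same cache host:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickCacheHost sketches consistent-hash routing over an ordered list
// of cache hosts. FNV-1a is just a simple stable hash for illustration.
func pickCacheHost(url string, cacheHosts []string, forceFirst bool) string {
	if len(cacheHosts) == 0 {
		return "" // no caching configured: download straight from origin
	}
	if forceFirst {
		// ForceCachePrefixRewrite: all requests go to the first host.
		return cacheHosts[0]
	}
	h := fnv.New32a()
	h.Write([]byte(url))
	return cacheHosts[int(h.Sum32())%len(cacheHosts)]
}

func main() {
	hosts := []string{"cache-0.internal", "cache-1.internal"}
	url := "https://weights.replicate.delivery/some/model.safetensors"
	fmt.Println(pickCacheHost(url, hosts, false)) // stable across calls
	fmt.Println(pickCacheHost(url, hosts, true))  // always cache-0.internal
}
```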

Hermes: Regional Edge Cache

Hermes (replicate/hermes) is an HTTP read-aside cache deployed in GPU serving regions (CKS, Nebius). It caches model weights in region-local S3-compatible object storage (CoreWeave CAIOS) to avoid repeated cross-region downloads.

Three components:

  • Cache server (server/cacherouter.go): Receives requests from pget. If the file is cached, returns a 307 redirect to a presigned S3 URL. If not cached, redirects to the origin and enqueues a background cache job.
  • Processor: Downloads from origin and uploads to regional S3.
  • Pruner: TTL-based cleanup of cached objects.

HuggingFace traffic is routed through Hermes by setting HF_ENDPOINT=http://hermes.../huggingface.co on all containers in CKS/Nebius regions. The weights.replicate.delivery domain is also rewritten through Hermes when ForceCachePrefixRewrite is enabled.

Torch and CUDA Caches

The director container manages two S3-backed caches (cache/cache.go):

  • Torch Compile Cache: Backs TORCHINDUCTOR_CACHE_DIR. Size range 10 MB–10 GB, 7-day TTL, refreshed daily.
  • CUDA Checkpoint Cache: Backs DIRECTOR_CUDA_CHECKPOINT_DIR.

On startup, Director restores these caches from S3 (after the model reports StatusReady). On shutdown, it persists any changes back. Cache files are stored as .tar.zst archives with timestamp-based naming for ordering.


Workers AI: R2, model-greeter, and disk-reaper

SoftwareConfig and Image Resolution

The ai-scheduler determines what container image and configuration to use for each model. It reads from two sources:

  1. Config API: Model properties including software_name, gpu_memory, and optionally cog_image (config_api/lib.rs:248-249).
  2. R2 model-catalog bucket: SoftwareConfig YAML files at {version}/{software_name}.yaml (catalog/r2.rs).

SoftwareConfig (config/software.rs) defines:

  • image — container image URI
  • api — inference backend type (e.g. pipe-http, tgi)
  • ports — network port configuration
  • mounts — volume mounts (cache dir, disk-reaper socket, etc.)
  • network — firewall allow rules (slirpnetstack)
  • entrypoint — optional override

For Cog models, SoftwareConfig::cog(image) constructs a config with the cog_image from Config API and hardcoded network allow rules for HuggingFace, R2, PyPI, and Replicate domains (software.rs:156-240).
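The shape of SoftwareConfig and the cog constructor can be sketched as follows. The real definition is Rust (config/software.rs); this Go rendering is only for consistency with the other sketches, and the abbreviated network allowlist is a stand-in for the hardcoded rules:

```go
package main

import "fmt"

// SoftwareConfig mirrors the fields listed above; the real struct is Rust.
type SoftwareConfig struct {
	Image      string   // container image URI
	API        string   // inference backend: pipe-http, tgi, ...
	Ports      []int    // network port configuration
	Mounts     []string // cache dir, disk-reaper socket, etc.
	Network    []string // firewall allow rules (slirpnetstack)
	Entrypoint []string // optional override
}

// cogConfig sketches SoftwareConfig::cog(image): the image comes from
// Config API, and the allowlist is hardcoded for Cog models. The exact
// domain list is abbreviated here.
func cogConfig(cogImage string) SoftwareConfig {
	return SoftwareConfig{
		Image:   cogImage,
		API:     "pipe-http", // Cog models run behind the PipeHttp backend
		Network: []string{"huggingface.co", "pypi.org", "replicate.com"},
	}
}

func main() {
	cfg := cogConfig("cf://some-cog-model:latest")
	fmt.Println(cfg.API, cfg.Image)
}
```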

Container Image Protocol

Container images use a cf:// protocol prefix that resolves to registry.cloudchamber.cfdata.org (image_registry_protocol.rs). This is a Cloudchamber concept — the protocol maps to one or more registry domains, providing fallback if one registry is unavailable. In practice, cf://image:tag resolves to registry.cloudchamber.cfdata.org/image:tag.
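The protocol expansion with registry fallback can be sketched like this. The function name and the secondary registry are hypothetical; only the cf:// prefix and the primary domain come from the text:

```go
package main

import (
	"fmt"
	"strings"
)

// resolveCFImage expands a cf:// reference into concrete registry URIs,
// one per registry domain, tried in order until a pull succeeds.
func resolveCFImage(ref string, registries []string) []string {
	rest, ok := strings.CutPrefix(ref, "cf://")
	if !ok {
		return []string{ref} // not a cf:// ref: use as-is
	}
	out := make([]string, 0, len(registries))
	for _, reg := range registries {
		out = append(out, reg+"/"+rest)
	}
	return out
}

func main() {
	candidates := resolveCFImage("cf://my-model:v3", []string{
		"registry.cloudchamber.cfdata.org",
		"registry-fallback.example", // hypothetical secondary
	})
	for _, c := range candidates {
		fmt.Println(c)
	}
}
```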

model-greeter: Weight Downloader

model-greeter downloads model files from R2 at container startup. On IKE (external) clusters it runs as a K8s init container; on Cloudchamber (internal) it runs as a sidecar.

The IKE operator configures model-greeter as an init container named install-model-greeter (deployment.rs:615-623) that copies its binary into a shared volume. The main container then uses that binary to download weights.

Environment variables injected by the scheduler (cloudchamber/lib.rs:88-192):

  • R2_ENDPOINT — R2 API endpoint
  • MODEL_CATALOG_R2_BUCKET — bucket name
  • MODEL_CATALOG_VERSION — current catalog version (from Release Manager)
  • SOFTWARE_TO_LOAD — software config name
  • MODEL_TO_LOAD — model identifier

Weights are downloaded to /cache (mounted from the disk-reaper managed volume).

disk-reaper: Local Cache Management

disk-reaper manages the local model cache on GPU nodes. Three modes (model-crd/crd.rs:80-86):

  • Shared (default): Cache backed by a PersistentVolumeClaim shared across pods on the same node. disk-reaper runs as a separate DaemonSet, communicating via Unix socket. Multiple models share the same cache volume.
  • Ephemeral: Cache backed by an emptyDir volume. disk-reaper runs as a sidecar container inside the model pod (deployment.rs:583-585). Cache is lost when the pod terminates.
  • Disabled: No cache management. Cache volume is still an emptyDir but no reaper process runs.

The IKE operator selects the mode based on operator config and the Model CRD’s disk_reaper.mode field. When the operator is configured for ephemeral-only mode, any model requesting Shared is downgraded to Ephemeral (deployment.rs:226-239).
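The downgrade rule is small enough to state as code. Enum and function names below are illustrative (the real modes live in Rust, model-crd/crd.rs):

```go
package main

import "fmt"

// ReaperMode mirrors the three modes described above.
type ReaperMode int

const (
	Shared    ReaperMode = iota // node-wide PVC, DaemonSet reaper
	Ephemeral                   // emptyDir, sidecar reaper
	Disabled                    // emptyDir, no reaper process
)

// effectiveMode sketches the selection rule: the Model CRD requests a
// mode, and an operator configured for ephemeral-only mode downgrades
// Shared to Ephemeral. Other modes pass through unchanged.
func effectiveMode(requested ReaperMode, ephemeralOnly bool) ReaperMode {
	if ephemeralOnly && requested == Shared {
		return Ephemeral
	}
	return requested
}

func main() {
	fmt.Println(effectiveMode(Shared, true) == Ephemeral)  // downgraded
	fmt.Println(effectiveMode(Shared, false) == Shared)    // honored
	fmt.Println(effectiveMode(Disabled, true) == Disabled) // unaffected
}
```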

Inference Backends

Workers AI supports multiple inference server backends, determined by the ai-software= tag on the container and the api field in SoftwareConfig:

| Backend | SoftwareConfig api | Use Case |
| --- | --- | --- |
| Triton | triton | NVIDIA Triton Inference Server |
| TGI | tgi | HuggingFace Text Generation Inference |
| TEI | tei | HuggingFace Text Embeddings Inference |
| PipeHttp | pipe-http | vLLM, Cog models |
| PipeHttpLlm | pipe-http-llm | LLM-specific pipe variant |
| PartnerPipeHttp | partner-pipe-http | Partner-hosted models |

This is a significant difference from Replicate, where Cog is the only inference server.


Key Differences

| Aspect | Replicate | Workers AI |
| --- | --- | --- |
| Weight sources | Mixed: some baked into Docker layers, many downloaded during setup() from HuggingFace, weights.replicate.delivery, and other origins | Single source: R2 model-catalog bucket |
| Weight download | Model’s setup() via pget (parallel chunks, consistent-hash caching) | model-greeter init container/sidecar downloads from R2 |
| Regional caching | Hermes (HTTP read-aside cache, 307 redirects to regional S3) | R2 serves from nearest edge node (region-less by design) |
| Local cache | No persistent local cache for weights | disk-reaper manages shared PVC or ephemeral emptyDir |
| Compilation caches | S3-backed torch compile + CUDA checkpoint caches (7-day TTL) | None |
| Container layout | Two containers: director sidecar + model | One main container + model-greeter init + optional disk-reaper sidecar |
| Inference servers | Cog only | Triton, TGI, TEI, PipeHttp, PipeHttpLlm, PartnerPipeHttp |
| Image protocol | r8.im/ rewritten to internal Artifact Registry | cf:// protocol with multi-registry fallback |

Replicate’s weight delivery is messy by nature. Model authors control what happens in setup() — some models have weights baked into Docker layers, others download from HuggingFace, others pull from weights.replicate.delivery, and some combine approaches. pget and Hermes are optimizations layered on top: pget parallelizes downloads from whatever origin the model uses, and Hermes caches results in regional S3 so subsequent cold starts in the same region avoid cross-region transfers. Neither is strictly required — models work without them, but cold starts are slower.

Workers AI’s approach is more uniform: weights always come from R2 via model-greeter, giving the platform full control over the download path. R2 operates region-lessly — it serves from the nearest edge node that has the data, so there’s no need for explicit regional copies the way Hermes populates region-local S3. The disk-reaper shared PVC adds another layer by keeping weights on-node across pod restarts.

The inference server flexibility is a notable Workers AI advantage. Supporting Triton, TGI, TEI, and vLLM alongside Cog means Workers AI can use purpose-built servers optimized for specific workload types, while Replicate routes everything through Cog.