Model Storage & Images
Overview
Both platforms need to get model weights and inference code onto GPU nodes before serving
requests. The approaches diverge sharply. Replicate builds Docker images via cog, then
optionally layers on a multi-tier caching system (pget → Hermes → object store) to get
weights close to GPU nodes. Workers AI stores model weights in R2 and fetches them at
container startup via a dedicated model-greeter init container, with a disk-reaper
sidecar managing local cache on each GPU node.
Replicate: Images, Weights, and Caches
Container Image Resolution
Every deployable has a docker_image URI set during cog push (e.g.
r8.im/user/model@sha256:...). At pod creation time, the cluster autoscaler resolves this
to an internal registry URI (deployable.go:1999-2040):
The r8.im/ prefix is stripped and replaced with the internal
MODEL_ARTIFACT_REGISTRY_BASE (an unauthenticated Artifact Registry mirror). Non-r8.im
URIs are passed through unmodified.
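The rewrite can be sketched in a few lines of Go. The names modelArtifactRegistryBase and resolveImageURI are illustrative, not the actual identifiers in deployable.go, and the registry value is invented:

```go
package main

import (
	"fmt"
	"strings"
)

// modelArtifactRegistryBase stands in for MODEL_ARTIFACT_REGISTRY_BASE;
// the real value points at an unauthenticated Artifact Registry mirror.
const modelArtifactRegistryBase = "us-docker.pkg.dev/example/models" // hypothetical

// resolveImageURI mirrors the described behavior: r8.im/ URIs are rewritten
// to the internal registry, anything else passes through unmodified.
func resolveImageURI(uri string) string {
	if rest, ok := strings.CutPrefix(uri, "r8.im/"); ok {
		return modelArtifactRegistryBase + "/" + rest
	}
	return uri
}

func main() {
	fmt.Println(resolveImageURI("r8.im/user/model@sha256:abc"))
	// -> us-docker.pkg.dev/example/models/user/model@sha256:abc
	fmt.Println(resolveImageURI("docker.io/library/nginx:latest"))
	// -> docker.io/library/nginx:latest (pass-through)
}
```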
Two-Container Pods
Every model pod has two containers
(deployable.go:460-476):
- director — orchestrates the model lifecycle (queue consumption, health polling, state reporting, cache restore/persist). Image comes from the services registry.
- model — runs the cog server. Image is the resolved model image.
The two containers interact primarily over HTTP; they also share a supervisor-ipc volume (a 50 MB tmpfs) for certain signalling cases and a run-cache volume for runtime state.
Standard Startup Path
The model container’s entrypoint (ModelEntrypointScript.sh) runs:
- Updates the pget binary from PGET_DOWNLOAD_URL (or uses the monobase-bundled version if available).
- Optionally upgrades cog (via ENTRYPOINT_COG_OVERRIDE_PACKAGE) or installs hf_transfer.
- Starts the cog HTTP server (python -m cog.server.http).
The model’s setup() method runs inside the cog server, and is often the step during which weights are downloaded via pget, with requests proxied through Hermes as a pull-through cache.
FUSE (deprecated)
Replicate also built a FUSE-based path (fuse/fuse.go) that separated
weights from code entirely: a host-level FUSE daemon served weights on demand, and the
model container ran a lightweight monobase image instead of a Docker image with weights
baked in. Director acted as a gRPC client to the FUSE mounter, managing mount lifecycle
via Start/Heartbeat/Stop RPCs. This eliminated the weight download step from cold
starts — weights were read lazily from the FUSE mount as the model accessed them.
The approach is being wound down and remaining FUSE-enabled models are slated for
migration to the standard path. The code still exists in getImageURI (monobase fallback)
and the FUSE entrypoint script, but no new models use it.
pget: Parallel Chunk Downloader
pget (replicate/pget) downloads model weights in parallel chunks. Key
configuration (download/options.go:24-43):
- CacheableURIPrefixes: Allowlist of domains+path-prefixes eligible for pull-through caching.
- CacheHosts: Ordered list of cache hostnames used with consistent hashing — the same URL always routes to the same cache host.
- ForceCachePrefixRewrite: When enabled, rewrites all requests to the first cache host (used for Hermes routing).
- CacheUsePathProxy: Prepends the original host to the cache request path instead of using host-based routing.
Config is injected via K8s ConfigMaps: pget-config (standard) or pget-hermes-config
(Hermes-enabled regions).
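The interaction of CacheHosts, ForceCachePrefixRewrite, and CacheUsePathProxy can be sketched as below. The Options struct and cacheURL helper are illustrative, and the hash-mod host selection is a simple stand-in for pget's real consistent hashing:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"net/url"
)

// Options holds the subset of pget settings described above (the field
// shapes are assumptions, not pget's actual option struct).
type Options struct {
	CacheHosts              []string
	ForceCachePrefixRewrite bool
	CacheUsePathProxy       bool
}

// cacheURL routes a weights URL to a cache host. With ForceCachePrefixRewrite
// every request goes to the first host; otherwise a hash of the URL picks a
// host, so the same URL always lands on the same cache.
func cacheURL(opts Options, raw string) string {
	u, err := url.Parse(raw)
	if err != nil || len(opts.CacheHosts) == 0 {
		return raw
	}
	var host string
	if opts.ForceCachePrefixRewrite {
		host = opts.CacheHosts[0]
	} else {
		h := fnv.New32a()
		h.Write([]byte(raw))
		host = opts.CacheHosts[int(h.Sum32())%len(opts.CacheHosts)]
	}
	path := u.Path
	if opts.CacheUsePathProxy {
		// Prepend the original host to the path instead of host-based routing.
		path = "/" + u.Host + u.Path
	}
	return "http://" + host + path
}

func main() {
	opts := Options{CacheHosts: []string{"cache-0", "cache-1"}, CacheUsePathProxy: true}
	fmt.Println(cacheURL(opts, "https://weights.replicate.delivery/m/weights.bin"))
}
```

Note that real consistent hashing (unlike the mod-N stand-in here) also minimizes how many URLs move when a cache host is added or removed.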
Hermes: Regional Edge Cache
Hermes (replicate/hermes) is an HTTP read-aside cache deployed in GPU
serving regions (CKS, Nebius). It caches model weights in region-local S3-compatible
object storage (CoreWeave CAIOS) to avoid repeated cross-region downloads.
Three components:
- Cache server (server/cacherouter.go): Receives requests from pget. If the file is cached, returns a 307 redirect to a presigned S3 URL. If not cached, redirects to the origin and enqueues a background cache job.
- Processor: Downloads from origin and uploads to regional S3.
- Pruner: TTL-based cleanup of cached objects.
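The cache server's hit/miss flow can be sketched as a small HTTP handler. All names and shapes here (the cached set, the presign hook, the job channel) are assumptions standing in for server/cacherouter.go:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// cacheRouter is an illustrative sketch of the Hermes cache server.
type cacheRouter struct {
	cached  map[string]bool     // object key -> already in regional S3
	presign func(string) string // stands in for S3 URL presigning
	jobs    chan string         // consumed by the background processor
}

// route picks the 307 target: a presigned S3 URL on a hit, the origin on a
// miss (while enqueueing a background cache-fill job).
func (c *cacheRouter) route(key string) string {
	if c.cached[key] {
		return c.presign(key)
	}
	select {
	case c.jobs <- key:
	default: // queue full; still serve from origin
	}
	return "https://origin.example" + key
}

func (c *cacheRouter) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	http.Redirect(w, r, c.route(r.URL.Path), http.StatusTemporaryRedirect)
}

func main() {
	cr := &cacheRouter{
		cached:  map[string]bool{"/weights.bin": true},
		presign: func(k string) string { return "https://s3.example" + k + "?sig=abc" },
		jobs:    make(chan string, 16),
	}
	srv := httptest.NewServer(cr)
	defer srv.Close()

	// Don't follow the redirect; just inspect status and Location.
	client := &http.Client{CheckRedirect: func(*http.Request, []*http.Request) error {
		return http.ErrUseLastResponse
	}}
	resp, err := client.Get(srv.URL + "/weights.bin")
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.StatusCode, resp.Header.Get("Location"))
	// prints: 307 https://s3.example/weights.bin?sig=abc
}
```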
HuggingFace traffic is routed through Hermes by setting
HF_ENDPOINT=http://hermes.../huggingface.co on all containers in CKS/Nebius regions. The
weights.replicate.delivery domain is also rewritten through Hermes when
ForceCachePrefixRewrite is enabled.
Torch and CUDA Caches
The director container manages two S3-backed caches (cache/cache.go):
- Torch Compile Cache: Backs TORCHINDUCTOR_CACHE_DIR. Size range 10 MB–10 GB, 7-day TTL, refreshed daily.
- CUDA Checkpoint Cache: Backs DIRECTOR_CUDA_CHECKPOINT_DIR.
On startup, Director restores these caches from S3 (after the model reports
StatusReady). On shutdown, it persists any changes back. Cache files are stored as
.tar.zst archives with timestamp-based naming for ordering.
Workers AI: R2, model-greeter, and disk-reaper
SoftwareConfig and Image Resolution
The ai-scheduler determines what container image and configuration to use for each model. It reads from two sources:
- Config API: Model properties including software_name, gpu_memory, and optionally cog_image (config_api/lib.rs:248-249).
- R2 model-catalog bucket: SoftwareConfig YAML files at {version}/{software_name}.yaml (catalog/r2.rs).
SoftwareConfig (config/software.rs) defines:
- image — container image URI
- api — inference backend type (e.g. pipe-http, tgi)
- ports — network port configuration
- mounts — volume mounts (cache dir, disk-reaper socket, etc.)
- network — firewall allow rules (slirpnetstack)
- entrypoint — optional override
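Given those fields, a SoftwareConfig file plausibly looks like the sketch below; every value is invented for illustration, and only the field names come from the list above:

```yaml
# {version}/example-vllm.yaml in the model-catalog bucket (illustrative values)
image: cf://ai-vllm:2024-05-01
api: pipe-http
ports:
  - name: http
    port: 8080
mounts:
  - /cache
  - /var/run/disk-reaper.sock
network:
  allow:
    - huggingface.co
    - r2.cloudflarestorage.com
entrypoint: null
```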
For Cog models, SoftwareConfig::cog(image) constructs a config with the cog_image from
Config API and hardcoded network allow rules for HuggingFace, R2, PyPI, and Replicate
domains (software.rs:156-240).
Container Image Protocol
Container images use a cf:// protocol prefix that resolves to
registry.cloudchamber.cfdata.org (image_registry_protocol.rs).
This is a Cloudchamber concept — the protocol maps to one or more registry domains,
providing fallback if one registry is unavailable. In practice, cf://image:tag resolves
to registry.cloudchamber.cfdata.org/image:tag.
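The protocol expansion can be sketched as below. The fallback domain is hypothetical, and the real resolution lives in image_registry_protocol.rs:

```go
package main

import (
	"fmt"
	"strings"
)

// registryDomains stands in for the Cloudchamber registry list; in practice
// it includes registry.cloudchamber.cfdata.org, with additional domains for
// fallback (the second entry here is invented).
var registryDomains = []string{
	"registry.cloudchamber.cfdata.org",
	"registry-fallback.example", // hypothetical secondary
}

// resolveCF expands a cf://image:tag reference into one candidate URI per
// registry domain; a puller would try them in order until one succeeds.
func resolveCF(ref string) []string {
	rest, ok := strings.CutPrefix(ref, "cf://")
	if !ok {
		return []string{ref} // not a cf:// reference; use as-is
	}
	out := make([]string, 0, len(registryDomains))
	for _, d := range registryDomains {
		out = append(out, d+"/"+rest)
	}
	return out
}

func main() {
	fmt.Println(resolveCF("cf://ai-vllm:2024-05-01"))
}
```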
model-greeter: Weight Downloader
model-greeter downloads model files from R2 at container startup. On IKE (external) clusters it runs as a K8s init container; on Cloudchamber (internal) it runs as a sidecar.
The IKE operator configures model-greeter as an init container named
install-model-greeter (deployment.rs:615-623) that copies its
binary into a shared volume. The main container then uses that binary to download weights.
Environment variables injected by the scheduler
(cloudchamber/lib.rs:88-192):
- R2_ENDPOINT — R2 API endpoint
- MODEL_CATALOG_R2_BUCKET — bucket name
- MODEL_CATALOG_VERSION — current catalog version (from Release Manager)
- SOFTWARE_TO_LOAD — software config name
- MODEL_TO_LOAD — model identifier
Weights are downloaded to /cache (mounted from the disk-reaper managed volume).
disk-reaper: Local Cache Management
disk-reaper manages the local model cache on GPU nodes. Three modes
(model-crd/crd.rs:80-86):
- Shared (default): Cache backed by a PersistentVolumeClaim shared across pods on the same node. disk-reaper runs as a separate DaemonSet, communicating via Unix socket. Multiple models share the same cache volume.
- Ephemeral: Cache backed by an emptyDir volume. disk-reaper runs as a sidecar container inside the model pod (deployment.rs:583-585). Cache is lost when the pod terminates.
- Disabled: No cache management. The cache volume is still an emptyDir, but no reaper process runs.
The IKE operator selects the mode based on operator config and the Model CRD’s
disk_reaper.mode field. When the operator is configured for ephemeral-only mode, any
model requesting Shared is downgraded to Ephemeral
(deployment.rs:226-239).
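The downgrade rule can be expressed as a tiny decision function; the names and signature are illustrative stand-ins for the logic in deployment.rs:226-239:

```go
package main

import "fmt"

// Mode mirrors the three disk-reaper modes from model-crd/crd.rs.
type Mode int

const (
	Shared    Mode = iota // node-wide PVC, DaemonSet reaper
	Ephemeral             // per-pod emptyDir, sidecar reaper
	Disabled              // emptyDir, no reaper
)

// effectiveMode applies the downgrade rule: when the operator only supports
// ephemeral caches, a model requesting Shared gets Ephemeral instead.
func effectiveMode(requested Mode, ephemeralOnly bool) Mode {
	if ephemeralOnly && requested == Shared {
		return Ephemeral
	}
	return requested
}

func main() {
	fmt.Println(effectiveMode(Shared, true) == Ephemeral)  // downgraded
	fmt.Println(effectiveMode(Disabled, true) == Disabled) // unaffected
}
```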
Inference Backends
Workers AI supports multiple inference server backends, determined by the ai-software=
tag on the container and the api field in SoftwareConfig:
| Backend | SoftwareConfig api | Use Case |
|---|---|---|
| Triton | triton | NVIDIA Triton Inference Server |
| TGI | tgi | HuggingFace Text Generation Inference |
| TEI | tei | HuggingFace Text Embeddings Inference |
| PipeHttp | pipe-http | vLLM, Cog models |
| PipeHttpLlm | pipe-http-llm | LLM-specific pipe variant |
| PartnerPipeHttp | partner-pipe-http | Partner-hosted models |
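The api-to-backend mapping in the table can be captured in a small lookup. The Rust side presumably models this as an enum; this Go stand-in is illustrative only:

```go
package main

import "fmt"

// Backend identifies the inference server selected for a model.
type Backend string

// backendForAPI maps the SoftwareConfig api field to a backend, following
// the table above; unknown values are rejected.
func backendForAPI(api string) (Backend, error) {
	m := map[string]Backend{
		"triton":            "Triton",
		"tgi":               "TGI",
		"tei":               "TEI",
		"pipe-http":         "PipeHttp",
		"pipe-http-llm":     "PipeHttpLlm",
		"partner-pipe-http": "PartnerPipeHttp",
	}
	b, ok := m[api]
	if !ok {
		return "", fmt.Errorf("unknown api %q", api)
	}
	return b, nil
}

func main() {
	b, _ := backendForAPI("pipe-http")
	fmt.Println(b) // prints: PipeHttp
}
```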
This is a significant difference from Replicate, where Cog is the only inference server.
Key Differences
| Aspect | Replicate | Workers AI |
|---|---|---|
| Weight sources | Mixed — some baked into Docker layers, many downloaded during setup() from HuggingFace, weights.replicate.delivery, and other origins | Single source — R2 model-catalog bucket |
| Weight download | Model’s setup() via pget (parallel chunks, consistent-hash caching) | model-greeter init container/sidecar downloads from R2 |
| Regional caching | Hermes (HTTP read-aside cache, 307 redirects to regional S3) | R2 serves from nearest edge node (region-less by design) |
| Local cache | No persistent local cache for weights | disk-reaper manages shared PVC or ephemeral emptyDir |
| Compilation caches | S3-backed torch compile + CUDA checkpoint caches (7-day TTL) | None |
| Container layout | Two containers: director sidecar + model | One main container + model-greeter init + optional disk-reaper sidecar |
| Inference servers | Cog only | Triton, TGI, TEI, PipeHttp, PipeHttpLlm, PartnerPipeHttp |
| Image protocol | r8.im/ rewritten to internal Artifact Registry | cf:// protocol with multi-registry fallback |
Replicate’s weight delivery is messy by nature. Model authors control what happens in
setup() — some models have weights baked into Docker layers, others download from
HuggingFace, others pull from weights.replicate.delivery, and some combine approaches.
pget and Hermes are optimizations layered on top: pget parallelizes downloads from
whatever origin the model uses, and Hermes caches results in regional S3 so subsequent
cold starts in the same region avoid cross-region transfers. Neither is strictly
required — models work without them, but cold starts are slower.
Workers AI’s approach is more uniform: weights always come from R2 via model-greeter, giving the platform full control over the download path. R2 operates region-lessly — it serves from the nearest edge node that has the data, so there’s no need for explicit regional copies the way Hermes populates region-local S3. The disk-reaper shared PVC adds another layer by keeping weights on-node across pod restarts.
The inference server flexibility is a notable Workers AI advantage. Supporting Triton, TGI, TEI, and vLLM alongside Cog means Workers AI can use purpose-built servers optimized for specific workload types, while Replicate routes everything through Cog.