# Cloudflare Containers for Makima Daemon: Research Document

**Date:** 2026-02-22
**Status:** Research Complete
**Author:** makima research task

---

## Executive Summary

**Verdict: Partially Feasible — with significant caveats.**

Cloudflare Containers (currently in open beta) can technically run the makima daemon binary inside a full Linux container with process spawning, git, and outbound networking. However, **ephemeral storage** is the critical blocker: all disk is wiped on container restart, meaning git worktrees, cloned repositories, and the SQLite database would be lost whenever the container sleeps or restarts. This makes Cloudflare Containers unsuitable as a drop-in replacement for the current Kubernetes-based daemon deployment without architectural changes to externalize state. A hybrid approach — using Cloudflare Containers for stateless task execution while persisting state to R2 or Durable Objects — is possible but requires substantial re-architecture of the daemon.

---

## Table of Contents

1. [What Are Cloudflare Containers?](#1-what-are-cloudflare-containers)
2. [Technical Capabilities](#2-technical-capabilities)
3. [Makima Daemon Requirements Analysis](#3-makima-daemon-requirements-analysis)
4. [Feasibility Assessment](#4-feasibility-assessment)
5. [Comparison: Current Architecture vs. Cloudflare Containers](#5-comparison-current-architecture-vs-cloudflare-containers)
6. [Cost Analysis](#6-cost-analysis)
7. [Implementation Options](#7-implementation-options)
8. [Recommended Next Steps](#8-recommended-next-steps)
9. [Limitations and Risks](#9-limitations-and-risks)
10. [Sources](#10-sources)

---

## 1. What Are Cloudflare Containers?

Cloudflare Containers is a beta platform that lets developers run standard Docker containers on Cloudflare's global network. Containers are managed through Workers and Durable Objects — you write JavaScript/TypeScript Worker code that orchestrates container lifecycle, routing, and scaling.

**Key characteristics:**
- Containers run in isolated VMs on `linux/amd64`
- Controlled via Worker code (no kubectl, no K8s operators)
- Built-in integration with Workers, Durable Objects, R2, and other Cloudflare services
- Images built from Dockerfiles and pushed to Cloudflare's managed registry
- Global distribution — images are pre-fetched to edge locations worldwide
- Scale-to-zero billing model (10ms granularity)

**Current status:** Open beta (launched June 2025). Available to all users on Workers Paid plans ($5/month base). Not yet GA: Cloudflare is still adding autoscaling, load balancing, persistent storage, and co-location features.

---

## 2. Technical Capabilities

### Instance Types

| Type | vCPU | Memory | Disk | Use Case |
|------|------|--------|------|----------|
| lite | 1/16 | 256 MiB | 2 GB | Minimal tasks |
| basic | 1/4 | 1 GiB | 4 GB | Light workloads |
| standard-1 | 1/2 | 4 GiB | 8 GB | General purpose |
| standard-2 | 1 | 6 GiB | 12 GB | Compute-intensive |
| standard-3 | 2 | 8 GiB | 16 GB | Heavy workloads |
| standard-4 | 4 | 12 GiB | 20 GB | Maximum resources |

Custom configurations allowed within bounds: 1–4 vCPU, max 12 GiB RAM, max 20 GB disk.
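
The instance table can be encoded as data to select the smallest predefined type that covers a workload. This is a minimal sketch: `pickInstanceType` and the `Requirements` shape are hypothetical helpers for illustration, not part of any Cloudflare API, and the figures are copied from the table above (beta values that may change).

```typescript
// Predefined instance types, copied from the table above (beta values).
interface InstanceType {
  name: string;
  vcpu: number; // fraction of a vCPU
  memoryMiB: number;
  diskGB: number;
}

const INSTANCE_TYPES: InstanceType[] = [
  { name: "lite", vcpu: 1 / 16, memoryMiB: 256, diskGB: 2 },
  { name: "basic", vcpu: 1 / 4, memoryMiB: 1024, diskGB: 4 },
  { name: "standard-1", vcpu: 1 / 2, memoryMiB: 4096, diskGB: 8 },
  { name: "standard-2", vcpu: 1, memoryMiB: 6144, diskGB: 12 },
  { name: "standard-3", vcpu: 2, memoryMiB: 8192, diskGB: 16 },
  { name: "standard-4", vcpu: 4, memoryMiB: 12288, diskGB: 20 },
];

// Hypothetical description of what a workload needs.
interface Requirements {
  vcpu: number;
  memoryMiB: number;
  diskGB: number;
}

// Returns the smallest listed type that satisfies every requirement,
// or undefined when even standard-4 is too small.
function pickInstanceType(req: Requirements): InstanceType | undefined {
  for (const t of INSTANCE_TYPES) {
    if (t.vcpu >= req.vcpu && t.memoryMiB >= req.memoryMiB && t.diskGB >= req.diskGB) {
      return t;
    }
  }
  return undefined;
}
```

The list is ordered from smallest to largest, so the first match is also the cheapest predefined option under these assumptions.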

### Account Limits (Beta)

| Resource | Limit |
|----------|-------|
| Concurrent memory | 400 GiB |
| Concurrent vCPU | 100 |
| Concurrent disk | 2 TB |
| Total image storage | 50 GB |

### Runtime Environment

- **OS:** Full Linux environment (amd64)
- **Process model:** Standard Linux process spawning (fork/exec) — child processes supported
- **Filesystem:** Full read/write filesystem — **but ephemeral** (wiped on restart)
- **Networking (outbound):** Enabled via `enableInternet = true` property. Supports outbound HTTPS, TCP, and general internet access
- **Networking (inbound):** HTTP/WebSocket only, routed through Workers. No direct TCP/UDP from clients
- **Cold start:** 2–3 seconds typically (varies by image size)
- **Graceful shutdown:** SIGTERM, then SIGKILL after 15 minutes
- **Max runtime:** Indefinite (no hard timeout), but Cloudflare does not guarantee any minimum uptime
- **Docker-in-Docker:** Supported (`docker:dind-rootless`)

### Storage

- **Ephemeral disk:** Included per instance type (2–20 GB)
- **Persistent storage:** **NOT available** — Cloudflare is "exploring" persistent disk but it is "not slated for the near term"
- **Workarounds:** Use R2 (S3-compatible object storage) or Durable Objects SQLite (10 GB limit per DO)

### Image Management

- Images built from Dockerfiles via `wrangler deploy`
- Pushed to Cloudflare's managed registry (`registry.cloudflare.com`)
- Also supports pulling from Amazon ECR
- Image size limited by instance disk allocation
- CI/CD: `wrangler containers build --push` for automation

---

## 3. Makima Daemon Requirements Analysis

Based on the existing `k8s/daemon/Dockerfile`, the makima daemon requires:

| Requirement | Cloudflare Containers Support | Notes |
|-------------|-------------------------------|-------|
| **Rust binary execution** | ✅ Full support | linux/amd64, standard binary execution |
| **git CLI** | ✅ Can be installed | Include in Dockerfile, requires internet access |
| **gh CLI (GitHub)** | ✅ Can be installed | Include in Dockerfile |
| **curl** | ✅ Can be installed | Standard Linux tool |
| **openssh-client** | ✅ Can be installed | For SSH-based git auth |
| **jq** | ✅ Can be installed | Standard Linux tool |
| **Process spawning** | ✅ Full support | Fork/exec, child process management |
| **Writable filesystem** | ⚠️ Ephemeral only | Available during runtime, lost on restart |
| **Git worktrees** | ⚠️ Ephemeral only | Must re-clone after every restart |
| **SQLite database** | ⚠️ Ephemeral only | Lost on restart; must externalize to DO SQLite or R2 |
| **WebSocket to wss://api.makima.jp** | ⚠️ Indirect | Outbound WebSocket from inside container (via `enableInternet`). Inbound WebSocket routed through Worker |
| **Outbound HTTPS** | ✅ Via enableInternet | Full internet access when enabled |
| **Persistent state** | ❌ Not natively supported | Critical gap — repos, DB, worktrees all need persistence |

### Critical Gaps

1. **Ephemeral storage is the primary blocker.** The daemon maintains:
   - Git repository clones in `/app/workdir/repos/`
   - Git worktrees in `/app/workdir/`
   - SQLite database at `/app/data/daemon.db`
   - SSH keys and configuration

   All of this is lost when the container goes to sleep or restarts. Re-cloning large repositories on every cold start would add minutes of latency.

2. **WebSocket connectivity model differs.** The daemon currently establishes an outbound WebSocket to `wss://api.makima.jp`. In Cloudflare Containers, outbound connections only work when `enableInternet` is set, and all inbound traffic is routed through Workers. The daemon would need to either:
   - Keep its outbound WebSocket connection from inside the container (requires `enableInternet`)
   - Or be refactored to accept connections routed through the Worker

3. **No guaranteed uptime.** Cloudflare explicitly states: "Cloudflare does not guarantee that any instance will run for any set period of time." The container can be stopped/migrated at any time.

---

## 4. Feasibility Assessment

### Option A: Direct Port (Drop-in Replacement)
**Feasibility: ❌ Not viable without modifications**

The existing Dockerfile and daemon binary cannot be deployed as-is because:
- The daemon assumes persistent filesystem for git repos and worktrees
- SQLite database must survive restarts
- Cold starts after sleep would require full re-cloning of all repositories

### Option B: Modified Daemon with Externalized State
**Feasibility: ⚠️ Partially viable with significant refactoring**

The daemon could be modified to:
- Store SQLite database in Durable Objects (10 GB limit, accessible via DO's built-in SQLite API)
- Sync git repos to/from R2 on start/stop (tar + upload/download)
- Accept WebSocket connections routed through the Worker instead of making outbound connections
- Handle graceful shutdown (SIGTERM) to persist state before container stops

**Estimated effort:** 2–4 weeks of engineering work to refactor state management.

### Option C: Hybrid Architecture
**Feasibility: ✅ Most practical approach**

Keep the existing Cloudflare Workers edge relay (`cloudflare-agent/`) as-is, and optionally use Cloudflare Containers for:
- Short-lived task execution (clone → execute → report → destroy)
- Tasks that don't require persistent worktrees
- Specific workloads where geographic distribution matters

The existing K8s daemon handles long-running, stateful workloads.

---

## 5. Comparison: Current Architecture vs. Cloudflare Containers

| Factor | K8s Daemon + CF Workers Relay | Full Daemon in CF Container |
|--------|-------------------------------|----------------------------|
| **Architecture** | Server → CF Worker relay → Native daemon (K8s) | Server → CF Worker → CF Container |
| **Process spawning** | ✅ Full Linux | ✅ Full Linux |
| **Git operations** | ✅ Persistent repos | ⚠️ Must re-clone on cold start |
| **SQLite persistence** | ✅ Persistent volume | ❌ Lost on restart (must externalize) |
| **WebSocket to server** | ✅ Direct outbound | ⚠️ Outbound works with `enableInternet`; inbound only via Worker |
| **Cold start latency** | ~0s (always running) | 2–3s + repo clone time |
| **Geographic distribution** | Single region (K8s cluster) | ✅ Global (300+ cities) |
| **Scaling** | Manual or via K8s HPA | Per-task containers possible |
| **Cost model** | Fixed (K8s nodes) | Pay-per-use (10ms billing) |
| **Management overhead** | K8s cluster ops | Minimal (Wrangler + Workers) |
| **Persistent storage** | ✅ K8s PV/PVC | ❌ No persistent disk |
| **Reliability** | ✅ K8s pod restart guarantees | ⚠️ No uptime guarantees |
| **Maturity** | Production-grade | Beta (pre-GA) |
| **Max resources** | Cluster-dependent | 4 vCPU, 12 GiB RAM, 20 GB disk |
| **Secrets management** | K8s Secrets | Worker Secrets / Secrets Store |

---

## 6. Cost Analysis

### Current: Kubernetes
- Fixed cost for K8s nodes (cloud provider dependent)
- Always running, paying for idle time
- Predictable monthly cost

### Cloudflare Containers (estimated for daemon workload)

Assuming a `standard-1` instance (1/2 vCPU, 4 GiB RAM, 8 GB disk) running 8 hours/day:

| Component | Monthly Cost |
|-----------|-------------|
| Base Workers plan | $5.00 |
| vCPU (0.5 vCPU × 864,000 s = 432,000 vCPU-s) | ~$8.64 (at $0.00002/vCPU-s, before any included allowance) |
| Memory (4 GiB × 864,000 s = 3,456,000 GiB-s) | ~$8.42 (3,366,000 GiB-s billable after the included 25 GiB-hours ≈ 90,000 GiB-s, at $0.0000025/GiB-s) |
| Disk (8 GB × 8h × 30d) | ~$0.00 (within included) |
| Egress (estimated 10 GB/month) | ~$0.00 (within 1 TB included) |
| **Estimated total** | **~$20–25/month** |

For always-on (24/7) with `standard-2` (1 vCPU, 6 GiB RAM):
- vCPU: 1 vCPU × 2,592,000 s = 2,592,000 vCPU-s ≈ $51.84/month
- Memory: 6 GiB × 2,592,000 s = 15,552,000 GiB-s ≈ $38.88/month
- **Estimated total: ~$90–100/month**

**Verdict:** Cloudflare Containers are cost-effective for bursty, scale-to-zero workloads. For always-on daemons, they can be 5–10× more expensive than a small K8s node or VPS.
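
The arithmetic in this section can be reproduced with a small estimator. This is a rough sketch, not a billing tool: the per-second rates are assumptions inferred from the figures in this section, `estimateMonthlyCost` and its types are hypothetical names, and disk, egress, and the included vCPU allowance are ignored. Check the current pricing page before relying on any output.

```typescript
// Assumed USD rates, inferred from the figures in this section.
const VCPU_RATE_PER_SECOND = 0.00002; // per vCPU-second
const MEMORY_RATE_PER_GIB_SECOND = 0.0000025; // per GiB-second
const INCLUDED_MEMORY_GIB_SECONDS = 25 * 3600; // 25 GiB-hours included
const BASE_PLAN_USD = 5; // Workers Paid base fee

interface Workload {
  vcpu: number;
  memoryGiB: number;
  hoursPerDay: number;
  days: number;
}

interface CostEstimate {
  vcpuUsd: number;
  memoryUsd: number;
  totalUsd: number;
}

// Estimates vCPU and memory charges for a container that runs
// hoursPerDay hours on each of `days` days.
function estimateMonthlyCost(w: Workload): CostEstimate {
  const seconds = w.hoursPerDay * 3600 * w.days;
  const vcpuUsd = w.vcpu * seconds * VCPU_RATE_PER_SECOND;
  // Only memory usage beyond the included allowance is billed.
  const billableMemory = Math.max(0, w.memoryGiB * seconds - INCLUDED_MEMORY_GIB_SECONDS);
  const memoryUsd = billableMemory * MEMORY_RATE_PER_GIB_SECOND;
  return { vcpuUsd, memoryUsd, totalUsd: BASE_PLAN_USD + vcpuUsd + memoryUsd };
}
```

For the `standard-1`, 8 hours/day scenario, the call would be `estimateMonthlyCost({ vcpu: 0.5, memoryGiB: 4, hoursPerDay: 8, days: 30 })`. Keeping the rates as named constants makes it easy to update them if pricing changes before GA.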

---

## 7. Implementation Options

### Option 1: Stateless Task Runner (Recommended if pursuing CF Containers)

Instead of running the full daemon, use Cloudflare Containers as ephemeral task executors:

```
Makima Server → CF Worker (orchestrator) → CF Container (per-task)
                                           ├── git clone
                                           ├── execute task
                                           ├── push results
                                           └── destroy
```

**Pros:**
- No persistence needed (clone fresh per task)
- True scale-to-zero
- Global distribution for latency optimization
- Each task gets isolated container

**Cons:**
- Cold start + clone adds 10–60s per task
- No shared repo cache between tasks
- Requires new orchestration layer in Workers

**Implementation:**
1. Create a new Container class in `cloudflare-agent/`
2. Worker receives task → spawns container → passes task config as env vars
3. Container clones repo, executes task, reports back via HTTP to Worker
4. Worker relays results to Makima server
5. Container stops automatically after the `sleepAfter` inactivity timeout
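
Steps 2 and 3 above could look roughly like the following on the Worker side. This is a sketch under stated assumptions: `TaskConfig`, the `MAKIMA_*` variable names, and all three helpers are hypothetical and do not reflect the daemon's actual schema or any Cloudflare API. Only the `git clone` flags (`--depth`, `--single-branch`, `--branch`) are real.

```typescript
// Hypothetical task description sent by the Makima server.
interface TaskConfig {
  taskId: string;
  repoUrl: string;
  branch: string;
  command: string;
}

// Step 2: flatten the task into environment variables for the container.
// The MAKIMA_* names are assumptions made for this sketch.
function buildTaskEnv(task: TaskConfig): Record<string, string> {
  if (task.repoUrl.indexOf("https://") !== 0) {
    throw new Error(`unsupported repo URL: ${task.repoUrl}`);
  }
  return {
    MAKIMA_TASK_ID: task.taskId,
    MAKIMA_REPO_URL: task.repoUrl,
    MAKIMA_BRANCH: task.branch,
    MAKIMA_COMMAND: task.command,
  };
}

// One container instance per task keeps tasks isolated from each other.
function containerInstanceName(task: TaskConfig): string {
  return `task-${task.taskId}`;
}

// Step 3: arguments for a shallow, single-branch clone, which limits
// both clone time and ephemeral disk usage.
function gitCloneArgs(task: TaskConfig, targetDir: string): string[] {
  return ["clone", "--depth", "1", "--single-branch", "--branch", task.branch, task.repoUrl, targetDir];
}
```

Passing configuration through environment variables keeps the container image generic, so the same image can serve every task.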

### Option 2: Long-Running Daemon with State Sync

Run the full daemon binary but add state persistence hooks:

**Implementation:**
1. Modify daemon to sync SQLite to Durable Objects on writes
2. On startup: download repos from R2, restore SQLite from DO
3. On SIGTERM: upload repos to R2, sync SQLite to DO
4. Keep `sleepAfter` long (e.g., 30m) to avoid frequent restarts
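
The restore-on-startup and persist-on-SIGTERM steps can be sketched against an abstract store. The daemon itself is a Rust binary, so this TypeScript only illustrates the control flow. `StateStore`, `InMemoryStore`, and the key names are hypothetical; a real implementation might back the interface with R2 for the repository archive and a Durable Object for the SQLite snapshot.

```typescript
// Hypothetical storage interface; nothing here is a Cloudflare API.
interface StateStore {
  get(key: string): Promise<Uint8Array | undefined>;
  put(key: string, data: Uint8Array): Promise<void>;
}

// In-memory stand-in so the sketch runs without external services.
class InMemoryStore implements StateStore {
  private items = new Map<string, Uint8Array>();
  async get(key: string): Promise<Uint8Array | undefined> {
    return this.items.get(key);
  }
  async put(key: string, data: Uint8Array): Promise<void> {
    this.items.set(key, data);
  }
}

// Local state the daemon would otherwise lose on restart.
interface LocalState {
  reposTar: Uint8Array; // tar archive of the repository directory
  sqliteSnapshot: Uint8Array; // copy of daemon.db
}

interface RestoredState {
  reposTar: Uint8Array | undefined;
  sqliteSnapshot: Uint8Array | undefined;
}

// On startup: fetch whatever was persisted. An undefined field means
// nothing was saved, so the daemon must re-clone or start with an empty DB.
async function restoreState(store: StateStore): Promise<RestoredState> {
  const [reposTar, sqliteSnapshot] = await Promise.all([
    store.get("repos.tar"),
    store.get("daemon.db"),
  ]);
  return { reposTar, sqliteSnapshot };
}

// On SIGTERM: upload both artifacts before the container stops.
async function persistState(store: StateStore, state: LocalState): Promise<void> {
  await Promise.all([
    store.put("repos.tar", state.reposTar),
    store.put("daemon.db", state.sqliteSnapshot),
  ]);
}
```

This flow only protects state when SIGTERM is actually delivered. A container killed without it loses everything written since the last successful `persistState` call.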

**Pros:**
- Closer to current architecture
- Maintains WebSocket connection while running

**Cons:**
- Significant daemon refactoring needed
- Cold starts still expensive (download all repos from R2)
- Risk of data loss if container killed without SIGTERM
- Beta platform with no uptime guarantees

### Option 3: Keep Current Architecture (Recommended)

Continue with K8s daemon + Cloudflare Workers relay. The relay architecture is already working and handles the serverless limitations elegantly.

**When to revisit:** When Cloudflare Containers reaches GA and adds:
- Persistent storage
- Autoscaling
- Better uptime guarantees
- Co-location of DOs with containers

---

## 8. Recommended Next Steps

### Short Term (Now)
1. **Do not migrate the daemon to Cloudflare Containers.** The ephemeral storage limitation and beta status make it unsuitable for production workloads that require state persistence.
2. **Continue investing in the current architecture:** K8s daemon + Cloudflare Workers relay (`cloudflare-agent/`).

### Medium Term (3–6 months)
3. **Monitor Cloudflare Containers roadmap** for:
   - Persistent storage announcement
   - GA release
   - Autoscaling and load balancing improvements
   - Co-location features
4. **Prototype a stateless task runner** (Option 1) for specific use cases where geographic distribution or scale-to-zero matters. This could complement (not replace) the K8s daemon.

### Long Term (6–12 months)
5. **Re-evaluate when persistent storage ships.** If Cloudflare adds persistent volumes, the full daemon deployment becomes much more viable.
6. **Consider hybrid:** K8s for primary daemon, CF Containers for burst capacity or specific regions.

---

## 9. Limitations and Risks

### Critical Limitations
| Limitation | Impact | Mitigation |
|------------|--------|------------|
| **No persistent storage** | Git repos, SQLite, worktrees lost on restart | Must externalize to R2/DO or re-clone |
| **Beta status** | Breaking changes possible, no SLA | Not suitable for production-critical workloads |
| **No uptime guarantee** | Container can be stopped anytime | Must handle graceful degradation |
| **2–3s cold starts** | Latency for first request after sleep | Keep `sleepAfter` high, or accept latency |
| **Max 20 GB disk** | Large repos may not fit | Shallow clones, selective checkouts |
| **Max 4 vCPU, 12 GiB RAM** | May limit heavy compilation tasks | Sufficient for most makima tasks |

### Risks
1. **Data loss:** Without persistent storage, any crash or unplanned restart loses all in-progress work. The daemon must be redesigned for crash resilience.
2. **Vendor lock-in:** Container orchestration via Workers JS is Cloudflare-specific. The Dockerfile itself is portable, but the orchestration layer is not.
3. **Cost unpredictability:** Pay-per-use pricing can spike with unexpected load. Always-on workloads are more expensive than dedicated infrastructure.
4. **Beta instability:** Feature availability, pricing, and limits may change before GA.
5. **No autoscaling (yet):** Manual scaling through code means the orchestration Worker must handle load balancing itself.
6. **Image size constraints:** The existing daemon Dockerfile produces a moderately sized image (~500 MB+), which must fit within the instance disk allocation alongside runtime data such as cloned repositories.
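
The disk and image-size constraints above reduce to a simple budget check. This is a minimal sketch: `DiskBudget`, `remainingDiskGB`, and `fitsOnDisk` are hypothetical helpers, and the reserve for logs and build artifacts is a number the operator has to pick.

```typescript
// Hypothetical disk budget for one container instance.
interface DiskBudget {
  diskGB: number; // ephemeral disk of the chosen instance type
  imageGB: number; // container image size
  repoGB: number[]; // estimated on-disk size of each cloned repository
  reserveGB: number; // headroom for logs, build output, temp files
}

// Space left after the image, the reserve, and all repositories.
function remainingDiskGB(b: DiskBudget): number {
  const repos = b.repoGB.reduce((sum, gb) => sum + gb, 0);
  return b.diskGB - (b.imageGB + b.reserveGB + repos);
}

// True when everything fits on the instance's ephemeral disk.
function fitsOnDisk(b: DiskBudget): boolean {
  return remainingDiskGB(b) >= 0;
}
```

When the check fails, the options already listed apply: shallow clones, selective checkouts, or a larger instance type up to the 20 GB maximum.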

---

## 10. Sources

- [Cloudflare Containers Overview](https://developers.cloudflare.com/containers/)
- [Getting Started Guide](https://developers.cloudflare.com/containers/get-started/)
- [Pricing](https://developers.cloudflare.com/containers/pricing/)
- [Beta Info & Roadmap](https://developers.cloudflare.com/containers/beta-info/)
- [Limits and Instance Types](https://developers.cloudflare.com/containers/platform-details/limits/)
- [Lifecycle / Architecture](https://developers.cloudflare.com/containers/platform-details/architecture/)
- [FAQ](https://developers.cloudflare.com/containers/faq/)
- [Image Management](https://developers.cloudflare.com/containers/platform-details/image-management/)
- [WebSocket Example](https://developers.cloudflare.com/containers/examples/websocket/)
- [Announcement Blog Post (April 2025)](https://blog.cloudflare.com/cloudflare-containers-coming-2025/)
- [Public Beta Blog Post (June 2025)](https://blog.cloudflare.com/containers-are-available-in-public-beta-for-simple-global-and-programmable/)
- [Container Platform Preview (GPUs)](https://blog.cloudflare.com/container-platform-preview/)
- [New CPU Pricing (Nov 2025)](https://developers.cloudflare.com/changelog/2025-11-21-new-cpu-pricing/)
- [Cloudflare Containers Pricing Comparison (hamy.xyz)](https://hamy.xyz/blog/2025-04_cloudflare-containers-comparison)
- [Cloudflare Containers Alternatives (Northflank)](https://northflank.com/blog/top-cloudflare-containers-alternatives)
- [Sliplane Analysis](https://sliplane.io/blog/cloudflare-released-containers-everything-you-need-to-know)