Introducing Pilum: A Recipe-Driven Deployment Orchestrator

Today I’m open-sourcing Pilum, a multi-service deployment orchestrator written in Go. This post covers the problem it solves, the architecture decisions, and how to use and extend it.

The Problem

At SID Technologies, we run services across multiple cloud providers and distribution channels. A typical release involves:

  • API services deployed to GCP Cloud Run
  • CLI tools distributed via Homebrew
  • Background workers on Cloud Run with different scaling configs
  • Static assets synced to S3

Each platform has its own deployment CLI, authentication model, and configuration format. The cognitive overhead compounds: remembering gcloud run deploy flags versus brew tap semantics versus aws s3 sync options.

Deployments became a copy-paste ritual from shell scripts scattered across our monorepo. Each service had its own deploy.sh, and the scripts inevitably drifted. Authentication got hardcoded. Flags changed between services. When we needed to add a new environment variable to all services, it was 19 manual edits.

I wanted:

  1. One command to deploy any service to any target
  2. Declarative configuration instead of imperative scripts
  3. Parallel execution across services
  4. Provider-agnostic core with pluggable providers

A Complete Example

Before diving into architecture, let’s see what using Pilum looks like.

Step 1: Create service.yaml in your service directory

name: api-gateway
provider: gcp
project: sid-production
region: us-central1

build:
  language: go
  version: "1.23"
  binary_name: api-gateway

Step 2: Validate your configuration

$ pilum check

✓ Found service: api-gateway
✓ Recipe: gcp-cloud-run
✓ Required fields present: project, region
✓ Build config valid

Step 3: Preview what would happen

$ pilum deploy --tag=v1.2.0 --dry-run

[api-gateway] Step 1/4: build binary
  Command: go build -ldflags "-X main.version=v1.2.0" -o dist/api-gateway .
  Working dir: services/api-gateway
  Timeout: 300s
  
[api-gateway] Step 2/4: build docker image
  Command: docker build -t gcr.io/sid-production/api-gateway:v1.2.0 .
  Working dir: services/api-gateway
  Timeout: 300s
  
[api-gateway] Step 3/4: publish to registry
  Command: docker push gcr.io/sid-production/api-gateway:v1.2.0
  Working dir: /
  Timeout: 120s
  
[api-gateway] Step 4/4: deploy to cloud run
  Command: gcloud run deploy api-gateway \
      --image=gcr.io/sid-production/api-gateway:v1.2.0 \
      --region=us-central1 \
      --project=sid-production
  Working dir: /
  Timeout: 180s
  Retries: 2

Dry run complete. No commands executed.

Step 4: Deploy

$ pilum deploy --tag=v1.2.0

[api-gateway] ⏳ build binary
[api-gateway] ✓ build binary (2.3s)
[api-gateway] ⏳ build docker image
[api-gateway] ✓ build docker image (45.2s)
[api-gateway] ⏳ publish to registry
[api-gateway] ✓ publish to registry (12.1s)
[api-gateway] ⏳ deploy to cloud run
[api-gateway] ✓ deploy to cloud run (18.4s)

Deployment complete: 1 service deployed in 78.0s

That’s it. One command. One service deployed.

With multiple services:

$ pilum deploy --tag=v1.2.0 --services=api-gateway,auth-service,billing-service

Step 1/4: build binary
  [api-gateway] ✓ (2.1s)
  [auth-service] ✓ (1.9s)
  [billing-service] ✓ (2.4s)

Step 2/4: build docker image
  [api-gateway] ✓ (43.2s)
  [auth-service] ✓ (41.8s)
  [billing-service] ✓ (44.1s)

Step 3/4: publish to registry
  [api-gateway] ✓ (11.2s)
  [auth-service] ✓ (10.9s)
  [billing-service] ✓ (11.8s)

Step 4/4: deploy to cloud run
  [api-gateway] ✓ (17.2s)
  [auth-service] ✓ (18.1s)
  [billing-service] ✓ (17.8s)

Deployment complete: 3 services deployed in 82.3s

All services build in parallel. All images push in parallel. All deploys happen in parallel. But each step completes before the next begins.

Inspiration: The Roman Pilum

The pilum was the javelin of the Roman legions. Its design was elegant in its specificity: a long iron shank connected to a wooden shaft, with a weighted pyramidal tip. It was engineered for a single purpose—to be thrown once and penetrate the target.

The soft iron shank would bend on impact, preventing the enemy from throwing it back and, when lodged in a shield, rendering that shield useless. One weapon. One throw. Mission accomplished.

This resonated with what I wanted from a deployment tool: define the target once, execute once, hit precisely.

How Pilum Relates to Other Tools

Terraform and Pulumi

Terraform/Pulumi excel at provisioning infrastructure: creating VPCs, databases, load balancers. They’re declarative about what resources should exist. But they’re not optimized for the deployment workflow: building code, pushing images, rolling out new versions.

You can make Terraform deploy a Cloud Run service, but you’re fighting the tool’s abstractions. Terraform wants to manage resource state. When you “deploy” via Terraform, you’re really updating a resource’s image tag. The build and push happen outside Terraform in custom scripts or CI.

They’re complementary: Use Terraform to provision your Cloud Run service (define the resource, set IAM policies, configure scaling). Use Pilum to deploy new versions to it (build, push, update).

Tilt and Skaffold

Tilt is excellent for development workflows—live reloading, local Kubernetes clusters, fast iteration. But it’s development-focused. Pilum is production-focused: tag-based deployments, multi-provider support, CI/CD integration.

If Tilt is your hot-reload development server, Pilum is your production deployment pipeline.

Skaffold is Google’s deployment tool for Kubernetes. If you’re all-in on K8s, it’s great. But we deploy to:

  • Cloud Run (managed containers, not K8s)
  • Homebrew (binaries, not containers)
  • S3 (static assets, not workloads)

Skaffold doesn’t model these workflows. Pilum does.

ko and Earthly

ko is fantastic for deploying Go containers to Kubernetes. Simple, fast, purpose-built. But it’s single-language (Go only) and single-platform (Kubernetes only). We needed multi-language (Go + TypeScript) and multi-platform (Cloud Run + Homebrew + S3).

If you’re deploying one Go service to K8s, use ko. It’s simpler.

Earthly is a build tool that can handle deployment. But Earthfiles are complex—you’re writing imperative scripts in a DSL. And it doesn’t have Pilum’s recipe system. Every service needs its own Earthfile with duplicated logic.

Pilum’s niche: Multi-service, multi-provider deployments with declarative recipes and parallel execution. If you’re deploying one service to one platform, use the platform’s native tool. If you’re orchestrating 20 services across 3 platforms, Pilum might fit.

Shell Scripts

Shell scripts work until they don’t. They’re imperative, hard to test, and the “deployment logic” gets scattered across Makefiles, CI configs, and random bash files. Adding a new provider means duplicating logic across all services.

Pilum inverts this: the deployment logic (the recipe) is centralized and reusable. Services declare what they need (service.yaml), recipes declare how to deploy (recipe.yaml), and the orchestrator coordinates execution.

Architecture

Pilum’s architecture separates three concerns:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Service Config │     │     Recipe      │     │    Handlers     │
│  (service.yaml) │────▶│  (recipe.yaml)  │────▶│  (Go functions) │
└─────────────────┘     └─────────────────┘     └─────────────────┘
      WHAT                    HOW                   IMPLEMENTATION

Service configs declare what you’re deploying: name, provider, region, build settings. These live in your repo alongside your code.

Recipes define how to deploy to a provider: the ordered sequence of steps, required fields, timeouts. These are YAML files that ship with Pilum.

Handlers implement the actual commands: building Docker images, pushing to registries, calling cloud CLIs. These are Go functions registered at startup.

Service Configuration

A minimal service.yaml:

name: api-gateway
provider: gcp
project: my-project
region: us-central1

build:
  language: go
  version: "1.23"

The provider field determines which recipe is used. All other fields are validated against that recipe’s requirements.
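
To make that lookup concrete, here is a hypothetical sketch of provider-to-recipe resolution. The `providerRecipes` table, file names, and `resolveRecipePath` helper are illustrative assumptions, not Pilum's actual API:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// providerRecipes maps a service's provider field to a bundled recipe file.
// The mapping and file names here are illustrative.
var providerRecipes = map[string]string{
	"gcp":      "gcp-cloud-run.yaml",
	"homebrew": "homebrew.yaml",
}

// resolveRecipePath returns the recipe file for a provider, or an error if
// none is registered, so misconfigured services fail before any execution.
func resolveRecipePath(recipesDir, provider string) (string, error) {
	file, ok := providerRecipes[provider]
	if !ok {
		return "", fmt.Errorf("no recipe registered for provider %q", provider)
	}
	return filepath.Join(recipesDir, file), nil
}

func main() {
	path, err := resolveRecipePath("recipes", "gcp")
	fmt.Println(path, err)
}
```

Failing the lookup up front is what lets pilum check report configuration errors before a single command runs.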

Recipe System

Recipes are the core abstraction. Here’s the GCP Cloud Run recipe:

name: gcp-cloud-run
description: Deploy to Google Cloud Run
provider: gcp
service: cloud_run

required_fields:
  - name: project
    description: GCP project ID
    type: string
  - name: region
    description: GCP region to deploy to
    type: string

steps:
  - name: build binary
    execution_mode: service_dir
    timeout: 300

  - name: build docker image
    execution_mode: service_dir
    timeout: 300

  - name: publish to registry
    execution_mode: root
    timeout: 120

  - name: deploy to cloud run
    execution_mode: root
    timeout: 180
    default_retries: 2

The recipe declares:

  • Required fields: Validated before any execution. If your service.yaml is missing project, you get an error immediately, not 10 minutes into a build.
  • Steps: Ordered sequence of operations. Each step has a name, execution mode, and timeout.
  • Execution mode: service_dir runs in the service’s directory, root runs from the project root.
  • Retries: Some steps (like deploy) can retry on failure.
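
In Go, the recipe schema above might map onto structs like the following. The exact names and yaml tags are an assumption about Pilum's internals, shown only to make the schema concrete:

```go
package main

import "fmt"

// RequiredField mirrors one entry under required_fields in the recipe YAML.
type RequiredField struct {
	Name        string `yaml:"name"`
	Description string `yaml:"description"`
	Type        string `yaml:"type"`
	Default     string `yaml:"default,omitempty"`
}

// Step mirrors one entry under steps.
type Step struct {
	Name           string `yaml:"name"`
	ExecutionMode  string `yaml:"execution_mode"`
	Timeout        int    `yaml:"timeout"`
	DefaultRetries int    `yaml:"default_retries,omitempty"`
}

// Recipe ties the pieces together.
type Recipe struct {
	Name           string          `yaml:"name"`
	Description    string          `yaml:"description"`
	Provider       string          `yaml:"provider"`
	RequiredFields []RequiredField `yaml:"required_fields"`
	Steps          []Step          `yaml:"steps"`
}

func main() {
	r := Recipe{
		Name:     "gcp-cloud-run",
		Provider: "gcp",
		Steps: []Step{
			{Name: "build binary", ExecutionMode: "service_dir", Timeout: 300},
			{Name: "deploy to cloud run", ExecutionMode: "root", Timeout: 180, DefaultRetries: 2},
		},
	}
	fmt.Printf("%s: %d steps, last step retries=%d\n", r.Name, len(r.Steps), r.Steps[1].DefaultRetries)
}
```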

Recipe Validation

The validation logic uses reflection to check service configs against recipe requirements:

func (r *Recipe) ValidateService(svc *serviceinfo.ServiceInfo) error {
    for _, field := range r.RequiredFields {
        value := getServiceField(svc, field.Name)
        if value == "" && field.Default == "" {
            return fmt.Errorf("recipe '%s' requires field '%s': %s",
                r.Name, field.Name, field.Description)
        }
    }
    return nil
}

The getServiceField function first checks a hardcoded map of common fields, then falls back to the raw config map, and finally uses reflection as a last resort. This gives us type safety for known fields while remaining flexible for custom recipe requirements.
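
A minimal sketch of that lookup order, with the reflection fallback omitted for brevity (this ServiceInfo is a simplified stand-in for Pilum's serviceinfo.ServiceInfo, and RawConfig is an assumed name for the raw config map):

```go
package main

import "fmt"

// ServiceInfo is a simplified stand-in for serviceinfo.ServiceInfo.
// RawConfig holds fields that aren't modeled as struct members.
type ServiceInfo struct {
	Name      string
	Provider  string
	Project   string
	Region    string
	RawConfig map[string]string
}

// getServiceField checks known fields first, then falls back to the raw
// config map. (The real implementation adds reflection as a last resort.)
func getServiceField(svc *ServiceInfo, name string) string {
	known := map[string]string{
		"name":     svc.Name,
		"provider": svc.Provider,
		"project":  svc.Project,
		"region":   svc.Region,
	}
	if v, ok := known[name]; ok && v != "" {
		return v
	}
	return svc.RawConfig[name]
}

func main() {
	svc := &ServiceInfo{
		Name:      "api-gateway",
		Region:    "us-central1",
		RawConfig: map[string]string{"app_name": "my-api"},
	}
	fmt.Println(getServiceField(svc, "region"), getServiceField(svc, "app_name"))
}
```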

Command Registry

Steps map to handlers via a pattern-matching registry:

type StepHandler func(ctx StepContext) any

type CommandRegistry struct {
    handlers map[string]StepHandler
}

func (cr *CommandRegistry) Register(pattern string, provider string, handler StepHandler) {
    key := cr.buildKey(pattern, provider)
    cr.handlers[key] = handler
}

func (cr *CommandRegistry) GetHandler(stepName string, provider string) (StepHandler, bool) {
    // Try provider-specific first, fall back to generic
    // Pattern matching is case-insensitive with partial match support
}

This design allows:

  • Generic handlers (build works for any provider)
  • Provider-specific overrides (deploy:gcp vs deploy:aws)
  • Partial matching (build binary matches the build handler)
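
A sketch of how that resolution order might work. This is an illustration of the behavior described above, not Pilum's exact implementation; provider-specific keys like "deploy:gcp" win over generic ones, and patterns match step names case-insensitively as prefixes:

```go
package main

import (
	"fmt"
	"strings"
)

type StepHandler func(step string) string

// getHandler resolves a step to a handler in two passes: provider-specific
// keys ("deploy:gcp") first, then generic keys ("build"). A pattern matches
// if it is a case-insensitive prefix of the step name.
func getHandler(handlers map[string]StepHandler, stepName, provider string) (StepHandler, bool) {
	step := strings.ToLower(stepName)
	for _, wantProvider := range []bool{true, false} {
		for key, h := range handlers {
			pattern, prov, hasProvider := strings.Cut(strings.ToLower(key), ":")
			if hasProvider != wantProvider {
				continue
			}
			if hasProvider && prov != strings.ToLower(provider) {
				continue
			}
			// Partial match: the pattern "build" matches "build binary".
			if strings.HasPrefix(step, pattern) {
				return h, true
			}
		}
	}
	return nil, false
}

func main() {
	handlers := map[string]StepHandler{
		"build":      func(string) string { return "generic build" },
		"deploy:gcp": func(string) string { return "gcloud run deploy" },
	}
	for _, step := range []string{"build binary", "deploy to cloud run"} {
		if h, ok := getHandler(handlers, step, "gcp"); ok {
			fmt.Printf("%s -> %s\n", step, h(step))
		}
	}
}
```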

Handlers return any because commands can be strings or string slices:

func buildHandler(ctx StepContext) any {
    return []string{
        "go", "build",
        "-ldflags", fmt.Sprintf("-X main.version=%s", ctx.Tag),
        "-o", fmt.Sprintf("dist/%s", ctx.Service.Name),
        ".",
    }
}
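
Before execution, the runner has to normalize that any value into argv form. A hedged sketch of what that could look like (normalizeCommand is an assumed helper, not part of Pilum's public API; splitting on whitespace is a simplification, since real shell-style quoting would need a proper tokenizer):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeCommand converts a handler's return value into an argv slice.
// Plain strings are split on whitespace; string slices pass through as-is;
// anything else is rejected.
func normalizeCommand(v any) ([]string, error) {
	switch cmd := v.(type) {
	case string:
		return strings.Fields(cmd), nil
	case []string:
		return cmd, nil
	default:
		return nil, fmt.Errorf("unsupported command type %T", v)
	}
}

func main() {
	argv, _ := normalizeCommand("docker push gcr.io/sid-production/api-gateway:v1.2.0")
	fmt.Println(argv)
}
```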

Parallel Execution with Step Barriers

The orchestrator executes services in parallel within each step, but steps execute sequentially:

Time →

     │  Step 1: Build Binary
     │  ┌────────────────────────────────────┐
     │  │ service-a │ service-b │ service-c  │  ← Parallel
     │  └────────────────────────────────────┘
     │         ↓            ↓            ↓
     │  ════════════════════════════════════  ← Barrier
     │         ↓            ↓            ↓
     │  Step 2: Build Docker Image
     │  ┌────────────────────────────────────┐
     │  │ service-a │ service-b │ service-c  │  ← Parallel
     │  └────────────────────────────────────┘
     │         ↓            ↓            ↓
     │  ════════════════════════════════════  ← Barrier
     │         ↓            ↓            ↓
     │  Step 3: Deploy
     │  ┌────────────────────────────────────┐
     │  │ service-a │ service-b │ service-c  │  ← Parallel
     │  └────────────────────────────────────┘

This ensures dependencies are satisfied: you can’t push an image that hasn’t been built yet.

The implementation:

func (r *Runner) Run() error {
    maxSteps := r.findMaxSteps()

    for stepIdx := 0; stepIdx < maxSteps; stepIdx++ {
        err := r.executeStep(stepIdx)
        if err != nil {
            return err // Fail fast
        }
    }
    return nil
}

Within each step, a worker pool processes services concurrently:

func (r *Runner) executeTasksParallel(tasks []stepTask) error {
    var wg sync.WaitGroup
    semaphore := make(chan struct{}, r.getWorkerCount())

    for _, t := range tasks {
        wg.Add(1)
        go func(task stepTask) {
            defer wg.Done()
            semaphore <- struct{}{}        // acquire a worker slot
            defer func() { <-semaphore }() // release it

            result := r.executeTask(task.service, task.step)
            // ... record result
        }(t)
    }

    wg.Wait()
    return nil
}

This means:

  • All services build in parallel (step 1)
  • Once all builds complete, all pushes happen in parallel (step 2)
  • Once all pushes complete, all deploys happen in parallel (step 3)


Variable Substitution

Recipe commands support variable substitution:

func (r *Runner) substituteVars(cmd any, svc serviceinfo.ServiceInfo) any {
    replacer := strings.NewReplacer(
        "${name}", svc.Name,
        "${service.name}", svc.Name,
        "${provider}", svc.Provider,
        "${region}", svc.Region,
        "${project}", svc.Project,
        "${tag}", r.options.Tag,
    )
    switch c := cmd.(type) {
    case string:
        return replacer.Replace(c)
    case []string:
        out := make([]string, len(c))
        for i, s := range c {
            out[i] = replacer.Replace(s)
        }
        return out
    default: // []any elements are substituted recursively the same way
        return cmd
    }
}

This allows recipes to use service-specific values without hardcoding:

steps:
  - name: deploy
    command: gcloud run deploy ${name} --region=${region} --project=${project}
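
Applied to the command above with illustrative values, the substitution produces a fully expanded command line (the expand helper below is just a demonstration wrapper around strings.NewReplacer, not Pilum code):

```go
package main

import (
	"fmt"
	"strings"
)

// expand replaces each ${key} in cmd with its value from vars.
func expand(cmd string, vars map[string]string) string {
	pairs := make([]string, 0, len(vars)*2)
	for k, v := range vars {
		pairs = append(pairs, "${"+k+"}", v)
	}
	return strings.NewReplacer(pairs...).Replace(cmd)
}

func main() {
	cmd := expand(
		"gcloud run deploy ${name} --region=${region} --project=${project}",
		map[string]string{"name": "api-gateway", "region": "us-central1", "project": "sid-production"},
	)
	fmt.Println(cmd)
	// gcloud run deploy api-gateway --region=us-central1 --project=sid-production
}
```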

Error Handling and Retries

Steps can specify retry behavior:

steps:
  - name: deploy to cloud run
    timeout: 180
    default_retries: 2

If a deploy fails (network timeout, rate limit, transient cloud error), Pilum retries automatically with exponential backoff:

func (r *Runner) executeWithRetry(task stepTask) error {
    maxRetries := task.step.DefaultRetries
    backoff := 1 * time.Second

    for attempt := 0; attempt <= maxRetries; attempt++ {
        err := r.execute(task)
        if err == nil {
            return nil
        }

        if attempt < maxRetries {
            time.Sleep(backoff)
            backoff *= 2
        }
    }
    return fmt.Errorf("failed after %d attempts", maxRetries+1)
}

Fail-fast by default: If a step fails for any service (after retries), the entire deployment stops. No partial deployments. This is configurable via --continue-on-error, but we recommend fail-fast for production.

Rollback: Not yet implemented. Currently, rollback is manual (redeploy the previous tag). This is on the roadmap.

Testing and Validation

Recipe validation happens at load time:

func LoadRecipe(path string) (*Recipe, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }

    var recipe Recipe
    if err := yaml.Unmarshal(data, &recipe); err != nil {
        return nil, fmt.Errorf("invalid recipe YAML: %w", err)
    }

    // Validate required fields are present
    if recipe.Name == "" || recipe.Provider == "" {
        return nil, errors.New("recipe missing required fields")
    }

    // Validate steps
    for _, step := range recipe.Steps {
        if step.Name == "" {
            return nil, errors.New("step missing name")
        }
        if step.ExecutionMode != "service_dir" && step.ExecutionMode != "root" {
            return nil, fmt.Errorf("invalid execution mode: %s", step.ExecutionMode)
        }
    }

    return &recipe, nil
}

Invalid recipes fail immediately, not during deployment.

Service validation happens before execution:

$ pilum check

✓ Recipe found: gcp-cloud-run
✓ All required fields present: project, region
✓ Build config valid
✗ Error: field 'region' is required but missing

Fix your service.yaml and try again.

Dry-run mode lets you preview without executing:

$ pilum deploy --dry-run --tag=v1.0.0

This shows exactly what commands would run, with variable substitution applied, without actually running them.

Handler testing uses Go’s standard testing:

func TestBuildHandler(t *testing.T) {
    ctx := StepContext{
        Service: serviceinfo.ServiceInfo{Name: "test-service"},
        Tag:     "v1.0.0",
    }
    
    result := buildHandler(ctx)
    
    cmd, ok := result.([]string)
    if !ok {
        t.Fatal("expected []string")
    }
    
    if cmd[0] != "go" || cmd[1] != "build" {
        t.Errorf("unexpected command: %v", cmd)
    }
}

We test each handler in isolation, then integration test full recipes against staging environments.

Creating Custom Recipes

Want to deploy to a platform Pilum doesn’t support yet? Create a recipe in your fork or publish a PR.

Example: Deploy to Fly.io

Create recipes/fly-io.yaml

name: fly-io
description: Deploy to Fly.io
provider: fly
service: fly_io

required_fields:
  - name: app_name
    description: Fly.io app name
    type: string
  - name: region
    description: Fly.io region
    type: string
    default: "sea"

steps:
  - name: build docker image
    execution_mode: service_dir
    timeout: 300

  - name: deploy to fly
    execution_mode: service_dir
    command: flyctl deploy --app ${app_name} --region ${region}
    timeout: 180
    default_retries: 1

Now create a service that uses it:

name: my-api
provider: fly
app_name: my-api-production
region: sea

build:
  language: go
  version: "1.23"

If the build docker image step matches an existing handler (Pilum has a generic Docker build handler), it just works. If not, you can register a custom handler:

package handlers

import "github.com/SID-Technologies/Pilum/pkg/runner"

func init() {
    runner.DefaultRegistry.Register("deploy to fly", "fly", flyDeployHandler)
}

func flyDeployHandler(ctx runner.StepContext) any {
    return []string{
        "flyctl", "deploy",
        "--app", ctx.Service.Get("app_name"),
        "--region", ctx.Service.Region,
    }
}

Compile your fork:

go build -o pilum ./cmd/pilum

Now Pilum supports Fly.io.

We’d love PRs for new providers. See CONTRIBUTING.md for guidelines.

Why Open Source

Deployment tools have network effects. The more providers supported, the more useful the tool. But I can’t personally add recipes for every platform—I don’t use AWS ECS, Azure Container Apps, or Render.

Open source lets the community extend Pilum to their platforms. If you deploy to Render, you can contribute a recipe. If you use Earthly for builds, you can add a handler. The recipe system is designed for this.

Trust and transparency matter for deployment tools. These tools run in CI/CD, have access to credentials, and can push to production. Closed source deployment tools ask for a lot of trust. Open source lets you:

  • Audit the code and verify it’s not doing anything sketchy
  • Fork it if we make decisions you disagree with
  • Contribute fixes when you find bugs

Dogfooding as validation: Pilum deploys itself to Homebrew. The recipes/homebrew.yaml recipe is how we release new versions:

name: homebrew
description: Build and release to Homebrew tap
provider: homebrew
service: package

required_fields:
  - name: name
    description: Binary name and Homebrew formula name
    type: string

  - name: description
    description: Short description for the Homebrew formula
    type: string

  - name: license
    description: SPDX license identifier (e.g., MIT, Apache-2.0, BSL-1.1)
    type: string

  - name: homebrew.project_url
    description: Full repository URL where releases are hosted (e.g., https://github.com/org/project)
    type: string

  - name: homebrew.tap_url
    description: Full repository URL for the Homebrew tap (e.g., https://github.com/org/Homebrew-tap)
    type: string

  - name: homebrew.token_env
    description: Environment variable name containing the auth token (e.g., GH_TOKEN, HOMEBREW_TAP_TOKEN)
    type: string

steps:
  # Step 1: Build binaries for all platforms (darwin/linux, amd64/arm64)
  - name: build binaries
    execution_mode: root
    timeout: 300
    tags:
      - build

  # Step 2: Create tar.gz archives for each binary
  - name: create archives
    execution_mode: root
    timeout: 60
    tags:
      - build

  # Step 3: Generate SHA256 checksums
  - name: generate checksums
    execution_mode: root
    timeout: 30
    tags:
      - build

  # Step 4: Update Homebrew formula with new version and checksums
  - name: update formula
    execution_mode: root
    timeout: 30
    tags:
      - deploy

  # Step 5: Push updated formula to Homebrew tap repository
  - name: push to tap
    execution_mode: root
    timeout: 60
    tags:
      - deploy

If this breaks, we can’t ship. That’s a powerful incentive to keep it working.

Getting Started

Install via Homebrew:

brew tap sid-technologies/pilum
brew install pilum

Initialize in your project:

cd my-project
pilum init

This creates a sample service.yaml:

name: my-service
provider: gcp  # or aws, homebrew, etc.
project: my-gcp-project
region: us-central1

build:
  language: go
  version: "1.23"

Validate your configuration:

pilum check

Deploy:

pilum deploy --tag=v1.0.0

Deploy specific services:

pilum deploy --tag=v1.0.0 --services=api,worker

Preview without executing:

pilum deploy --dry-run --tag=v1.0.0

Real-World Usage at SID

At SID, we use Pilum to deploy 19 services:

$ pilum deploy --tag=v2.3.0

Step 1/4: build binary
  [authentication] ✓ (2.1s)
  [billing] ✓ (2.3s)
  [calendar] ✓ (1.9s)
  [kanban] ✓ (2.2s)
  [notifications] ✓ (2.0s)
  ... (14 more services)

Step 2/4: build docker image
  [authentication] ✓ (43s)
  [billing] ✓ (41s)
  ... (17 more in parallel)

Step 3/4: publish to registry
  [authentication] ✓ (11s)
  [billing] ✓ (12s)
  ... (17 more in parallel)

Step 4/4: deploy to cloud run
  [authentication] ✓ (18s)
  [billing] ✓ (17s)
  ... (17 more in parallel)

Deployment complete: 19 services deployed in 2m41s

Before Pilum, this took 30+ minutes (deploying services serially). Now it takes under 3 minutes (parallel execution).

Our metrics after 3 months of using Pilum:

  • Deployment time: 30 min → under 3 min (-90%)
  • Failed deployments: 12% → 2% (validation catches issues early)
  • Time to add new service: 30 min → 5 min (copy service.yaml template)

Known Limitations

Pilum is young. Here’s what it doesn’t do well yet:

No built-in rollback: If a deployment succeeds but causes issues, you need to manually deploy the previous tag. We’re working on pilum rollback.

Limited to sequential steps: All services must complete step N before any service can start step N+1. For some workflows (independent services), you’d want fully parallel execution. This is a design trade-off for simplicity.

Recipe changes require Pilum updates: If you want to modify a built-in recipe, you need to fork Pilum or wait for a new release. We’re considering a way to override recipes locally.

No secret management: Pilum assumes your cloud CLI is already authenticated. It doesn’t handle secrets, credentials, or environment variable management. Use your existing secret management solution.

Limited observability: No built-in dashboard, no metrics export, no Slack/email notifications on completion. It’s a CLI tool that outputs to stdout. For production monitoring, you’ll need to wrap it.

These aren’t deal-breakers for our use case (small team, fast iteration). They might be for yours. Feedback welcome.

When NOT to Use Pilum

Single service, single platform: If you’re deploying one Go service to Kubernetes, use ko. If you’re deploying one container to Cloud Run, use gcloud run deploy. Pilum’s value is orchestration across multiple services and platforms.

Kubernetes-native workflows: If your entire stack is Kubernetes and you want GitOps, use ArgoCD or Flux. Pilum doesn’t manage Kubernetes manifests or do continuous reconciliation.

Complex build pipelines: If you need conditional builds, matrix builds, or artifact caching beyond Docker layer caching, use Earthly or Bazel. Pilum’s build step is intentionally simple.

You need rollback automation: Pilum doesn’t yet support automatic rollback. If this is critical, you’ll need to wrap Pilum or wait for the feature.

Try It, Break It, Contribute

Pilum is young (v0.2.0) but deployed in production at SID. We’re using it to deploy 19 services across GCP Cloud Run and Homebrew.

If you’re deploying multi-service systems across multiple platforms, give it a try:

brew tap sid-technologies/pilum
brew install pilum
pilum init  # Creates sample service.yaml
pilum check # Validates configuration
pilum deploy --dry-run --tag=v1.0.0

If you hit rough edges, we want to know:

  • GitHub Issues: github.com/SID-Technologies/Pilum/issues
  • Discussions: github.com/SID-Technologies/Pilum/discussions

If you want a provider we don’t support yet:

The recipe system is designed for extensibility. We’re betting that the right abstraction—declarative recipes, pluggable handlers, parallel execution—can generalize across deployment targets.

If we’re right, Pilum becomes a shared deployment layer for the ecosystem. If we’re wrong, we’ll learn something and iterate.

What’s Next

Current providers: GCP Cloud Run, Homebrew, AWS Lambda (in progress).

On the roadmap:

  • AWS ECS
  • GitHub Releases integration
  • Parallel recipe discovery across monorepos


The code is open. Fork it, extend it, tell us what breaks.

Links:

  • GitHub: github.com/SID-Technologies/Pilum
  • Landing page: pilum.dev