
Production Patterns

Patterns for running neops workers reliably at scale.


Container Deployment

A minimal Dockerfile for a worker:

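FROM python:3.12-slim

WORKDIR /app

# Install dependencies first so this layer stays cached until the lockfile changes.
# The project source is not copied yet, so skip installing the project itself.
COPY pyproject.toml uv.lock ./
RUN pip install uv && uv sync --frozen --no-dev --no-install-project

# Copy the source, then install the project into the existing environment.
COPY . .
RUN uv sync --frozen --no-dev

# --no-sync keeps uv from re-syncing (and pulling dev dependencies) at startup.
CMD ["uv", "run", "--no-sync", "neops_worker"]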

Key considerations:

  • Pin the Python version to match your development environment.
  • The worker imports configuration from the project's config module, so make sure that module is included in the container image.
  • Use .env for configuration or inject environment variables via your orchestrator (Docker Compose, Kubernetes).
  • Exclude test code via .dockerignore rather than selective COPY.

Docker Compose

services:
  worker:
    build: .
    environment:
      URL_BLACKBOARD: http://workflow-engine:3030
      DIR_FUNCTION_BLOCKS: ./my_function_blocks
      WORKER_NAME: docker-worker-01
    restart: unless-stopped
    depends_on:
      - workflow-engine
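
A container that starts with incomplete configuration tends to fail late and unclearly, so a pre-flight check can help. The following is a minimal sketch, not part of neops: `check_worker_env` is a hypothetical helper, and treating the three variables from the Compose file above as required is an assumption.

```python
import os
from typing import Mapping, Optional

# Assumption: the variables set in the Compose example are the required ones.
REQUIRED_VARS = ("URL_BLACKBOARD", "DIR_FUNCTION_BLOCKS", "WORKER_NAME")


def check_worker_env(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the names of required variables that are missing or blank."""
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real process environment.
example_env = {
    "URL_BLACKBOARD": "http://workflow-engine:3030",
    "WORKER_NAME": "docker-worker-01",
}
print(check_worker_env(example_env))  # DIR_FUNCTION_BLOCKS is reported as missing
```

Calling `check_worker_env()` with no argument inspects the real environment, which makes it usable as an entrypoint guard before the worker process is launched.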

Scaling Workers

Horizontal scaling

Run multiple worker instances. Each worker:

  • Registers independently with the workflow engine.
  • Polls for jobs it can execute (based on registered function blocks).
  • Processes one job at a time (max_workers=1).

The workflow engine distributes jobs across available workers. To handle more concurrent work, add more worker instances.
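
Because each worker handles one job at a time, the instance count caps concurrency. A rough starting point can be derived from Little's law (average busy workers = arrival rate × average job duration). The sketch below is only a sizing aid: `estimate_worker_count` and the 70% target utilization are assumptions, not neops settings.

```python
import math


def estimate_worker_count(
    jobs_per_minute: float,
    avg_job_seconds: float,
    target_utilization: float = 0.7,  # assumed headroom; tune for your workload
) -> int:
    """Estimate how many single-job workers keep up with the given load."""
    if jobs_per_minute < 0 or avg_job_seconds < 0:
        raise ValueError("load figures must be non-negative")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Little's law: average number of jobs in service at any moment.
    busy_workers = (jobs_per_minute / 60.0) * avg_job_seconds
    # Always keep at least one worker, even for an idle system.
    return max(1, math.ceil(busy_workers / target_utilization))


print(estimate_worker_count(jobs_per_minute=30, avg_job_seconds=20))
```

Treat the result as a lower bound for steady load; bursty arrivals or long-tailed job durations usually call for more headroom.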

Specialization

Different workers can register different function block packages:

| Worker            | DIR_FUNCTION_BLOCKS | Handles                          |
|-------------------|---------------------|----------------------------------|
| config-worker     | ./config_blocks     | Config backup, push, compliance  |
| inventory-worker  | ./inventory_blocks  | Discovery, inventory collection  |
| monitoring-worker | ./monitoring_blocks | Health checks, SNMP polling      |

This lets you scale each concern independently and isolate failures.
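
One way to keep specialized workers consistent is to generate their service definitions from a single mapping. This is an illustrative sketch: `build_compose_services` is a hypothetical helper, the worker names and directories come from the table above, and the engine URL is reused from the Compose example.

```python
import json

# Worker name -> function block directory, as listed in the table above.
WORKER_POOLS = {
    "config-worker": "./config_blocks",
    "inventory-worker": "./inventory_blocks",
    "monitoring-worker": "./monitoring_blocks",
}


def build_compose_services(pools: dict[str, str], blackboard_url: str) -> dict:
    """Build a Compose-style document with one service per specialized worker."""
    services = {}
    for name, blocks_dir in pools.items():
        services[name] = {
            "build": ".",
            "environment": {
                "URL_BLACKBOARD": blackboard_url,
                "DIR_FUNCTION_BLOCKS": blocks_dir,
                "WORKER_NAME": name,
            },
            "restart": "unless-stopped",
        }
    return {"services": services}


compose = build_compose_services(WORKER_POOLS, "http://workflow-engine:3030")
print(json.dumps(compose, indent=2))
```

Since JSON is a subset of YAML, the printed document should be usable as a Compose file, though it is worth validating it with your Compose version before relying on it.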


Health Monitoring

Heartbeat

The worker sends heartbeats every HEARTBEAT_INTERVAL seconds. If the workflow engine stops receiving heartbeats, it marks the worker as expired and re-queues its in-flight job.
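
To make the expiry rule concrete, here is a sketch of how heartbeat-based expiry is commonly evaluated. It does not describe the workflow engine's actual implementation: `WorkerState`, `is_expired`, and the tolerance of three missed intervals are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    name: str
    last_heartbeat: float  # timestamp in seconds of the most recent heartbeat


def is_expired(
    worker: WorkerState,
    now: float,
    heartbeat_interval: float,
    missed_allowed: int = 3,  # assumption: the real tolerance is not documented here
) -> bool:
    """A worker expires once it has been silent for more than the allowed intervals."""
    return (now - worker.last_heartbeat) > heartbeat_interval * missed_allowed


def expired_workers(
    workers: list[WorkerState], now: float, heartbeat_interval: float
) -> list[str]:
    """Names of workers whose in-flight jobs would need to be re-queued."""
    return [w.name for w in workers if is_expired(w, now, heartbeat_interval)]


fleet = [WorkerState("docker-worker-01", 100.0), WorkerState("docker-worker-02", 128.0)]
print(expired_workers(fleet, now=131.0, heartbeat_interval=10.0))
```

With a 10-second interval, the first worker has been silent for 31 seconds and is reported, while the second is still within tolerance.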

Log-based monitoring

The worker logs structured events at key lifecycle points:

| Event                                            | Level   | Indicates                      |
|--------------------------------------------------|---------|--------------------------------|
| NEOPS Worker starting...                         | INFO    | Worker initializing            |
| Found N function block(s)                        | INFO    | Discovery completed            |
| Registering worker with backend...               | INFO    | Registration in progress       |
| Processing N job(s)                              | INFO    | Jobs received and processing   |
| Received SIGTERM. Shutting down...               | INFO    | Graceful shutdown initiated    |
| Shutdown requested, skipping N remaining job(s)  | WARNING | Shutdown during job batch      |
| Worker registration failed                       | ERROR   | Cannot reach workflow engine   |
| Worker expired! Backend rejected ping with 404.  | ERROR   | Worker invalidated by backend  |

Forward these logs to your monitoring stack (ELK, Grafana Loki, Datadog) for alerting.
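
If your log pipeline supports custom matchers, the WARNING and ERROR events above can be turned into alert rules. The sketch below matches on message text only; the exact log line format (timestamps, logger names) is an assumption, which is why it searches for substrings instead of anchoring on the full line. `classify` is a hypothetical helper.

```python
import re
from typing import Optional

# (pattern, level, meaning) for the events worth alerting on, per the table above.
ALERT_PATTERNS = [
    (re.compile(r"Worker registration failed"),
     "ERROR", "Cannot reach workflow engine"),
    (re.compile(r"Worker expired! Backend rejected ping with 404"),
     "ERROR", "Worker invalidated by backend"),
    (re.compile(r"Shutdown requested, skipping \d+ remaining job\(s\)"),
     "WARNING", "Shutdown during job batch"),
]


def classify(line: str) -> Optional[tuple[str, str]]:
    """Return (level, meaning) for a log line that should alert, else None."""
    for pattern, level, meaning in ALERT_PATTERNS:
        if pattern.search(line):
            return level, meaning
    return None


for sample in (
    "Processing 2 job(s)",
    "Shutdown requested, skipping 3 remaining job(s)",
    "Worker registration failed",
):
    print(sample, "->", classify(sample))
```

The same patterns can usually be ported to the query language of the stack you forward logs to.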


Error Recovery

| Failure                  | Worker behavior                           | Engine behavior                                              |
|--------------------------|-------------------------------------------|--------------------------------------------------------------|
| Network partition        | Misses heartbeats, reconnects on recovery | Marks worker expired, re-queues jobs                         |
| Function block exception | Reports failure result to blackboard      | Marks job as FAILED_SAFE or FAILED_UNSAFE based on purity    |
| Worker crash             | Process exits                             | Detects missed heartbeats, re-queues                         |
| Workflow engine restart  | Worker retries API calls                  | Re-accepts worker registrations                              |
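
The table says the worker retries API calls but not how. For readers building their own clients against the engine, a generic retry with exponential backoff and jitter can be sketched as follows. This is not the neops worker's retry logic: `call_with_backoff`, its parameters, and the choice of `ConnectionError` as the retryable error are all assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    call: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable for testing
) -> T:
    """Run `call`, retrying on ConnectionError with capped exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries so many workers do not reconnect at once.
            sleep(delay * random.uniform(0.5, 1.0))
    raise RuntimeError("unreachable")


# Simulated engine that becomes reachable on the third call.
state = {"calls": 0}


def flaky_request() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("engine unavailable")
    return "ok"


print(call_with_backoff(flaky_request, sleep=lambda seconds: None))
```

Jitter matters most after an engine restart, when every worker would otherwise retry on the same schedule.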

Security Considerations

  • Credentials: Device credentials live in the CMS. Workers access them through WorkflowContext. Never store credentials in function block code or environment variables.
  • Network access: Workers need network access to both the workflow engine API and the managed devices. Use network segmentation to limit blast radius.
  • TLS: Configure URL_BLACKBOARD with https:// in production. The underlying HTTP client respects standard TLS settings.
  • Least privilege: Each worker only needs access to the devices its function blocks manage. Use separate workers for different network zones.
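
The TLS recommendation can be enforced with a small check in a deployment pipeline. The sketch below uses only the standard library; `blackboard_url_problems` is a hypothetical helper, and rejecting URLs with embedded credentials is an extra assumption in the spirit of the credentials guidance above.

```python
from urllib.parse import urlsplit


def blackboard_url_problems(url: str) -> list[str]:
    """Return a list of reasons why `url` is unsuitable for production use."""
    parts = urlsplit(url)
    problems = []
    if parts.scheme != "https":
        problems.append(f"scheme is {parts.scheme or 'missing'!r}, expected 'https'")
    if not parts.hostname:
        problems.append("no hostname")
    if parts.username or parts.password:
        problems.append("URL embeds credentials")
    return problems


print(blackboard_url_problems("http://workflow-engine:3030"))
print(blackboard_url_problems("https://workflow-engine:3030"))
```

An empty list means the value of URL_BLACKBOARD passed all checks; anything else can fail the deployment before a worker starts.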