# Production Patterns

Patterns for running neops workers reliably at scale.
## Container Deployment

A minimal Dockerfile for a worker:

```dockerfile
FROM python:3.12-slim

WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN pip install uv && uv sync --no-dev
COPY . .

CMD ["uv", "run", "neops_worker"]
```
Key considerations:

- Pin the Python version to match your development environment.
- The worker imports configuration from the project's config module — ensure it is available in the container image.
- Use `.env` for configuration or inject environment variables via your orchestrator (Docker Compose, Kubernetes).
- Exclude test code via `.dockerignore` rather than a selective `COPY`.
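A `.dockerignore` along these lines keeps tests and local state out of the image (the entries are illustrative; adjust them to your project layout):

```
# .dockerignore — excluded from the build context
tests/
.venv/
.env
__pycache__/
*.pyc
.git/
```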
## Docker Compose

```yaml
services:
  worker:
    build: .
    environment:
      URL_BLACKBOARD: http://workflow-engine:3030
      DIR_FUNCTION_BLOCKS: ./my_function_blocks
      WORKER_NAME: docker-worker-01
    restart: unless-stopped
    depends_on:
      - workflow-engine
```
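Compose can also run several instances of the same worker service. Note that with a fixed `WORKER_NAME` every replica would report the same name; if names must be unique, drop the fixed value and derive one per instance (for example from the container hostname):

```shell
# Start three replicas of the worker service
docker compose up -d --scale worker=3
```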
## Scaling Workers

### Horizontal scaling

Run multiple worker instances. Each worker:

- Registers independently with the workflow engine.
- Polls for jobs it can execute (based on its registered function blocks).
- Processes one job at a time (`max_workers=1`).
The workflow engine distributes jobs across available workers. To handle more concurrent work, add more worker instances.
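The engine's actual dispatch logic is not shown here; the sketch below only illustrates the behaviour just described — one job per idle worker, matched on registered function blocks — with illustrative names and data structures:

```python
def dispatch(jobs, workers):
    """Assign each job to an idle worker that registered the needed
    function block (max_workers=1: a worker takes one job at a time).

    jobs:    list of (job_id, block_name) tuples
    workers: dict mapping worker name -> set of registered block names
    Returns a dict of job_id -> assigned worker name; jobs with no
    capable idle worker stay queued (omitted from the result).
    """
    assignments = {}
    idle = set(workers)
    for job_id, block in jobs:
        # deterministic pick: first capable idle worker by name
        capable = sorted(w for w in idle if block in workers[w])
        if capable:
            worker = capable[0]
            idle.remove(worker)
            assignments[job_id] = worker
    return assignments
```

Adding more worker instances simply enlarges the idle pool, which is why horizontal scaling increases concurrency without any coordination between workers.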
### Specialization

Different workers can register different function block packages:

| Worker | `DIR_FUNCTION_BLOCKS` | Handles |
|---|---|---|
| `config-worker` | `./config_blocks` | Config backup, push, compliance |
| `inventory-worker` | `./inventory_blocks` | Discovery, inventory collection |
| `monitoring-worker` | `./monitoring_blocks` | Health checks, SNMP polling |
This lets you scale each concern independently and isolate failures.
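One way to express this specialization in Compose — service names and environment values here are illustrative, mirroring the table above:

```yaml
services:
  config-worker:
    build: .
    environment:
      URL_BLACKBOARD: http://workflow-engine:3030
      DIR_FUNCTION_BLOCKS: ./config_blocks
      WORKER_NAME: config-worker-01
  inventory-worker:
    build: .
    environment:
      URL_BLACKBOARD: http://workflow-engine:3030
      DIR_FUNCTION_BLOCKS: ./inventory_blocks
      WORKER_NAME: inventory-worker-01
```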
## Health Monitoring

### Heartbeat

The worker sends heartbeats every `HEARTBEAT_INTERVAL` seconds. If the workflow engine stops receiving heartbeats, it marks the worker as expired and re-queues its in-flight job.
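The expiry rule amounts to a timeout check. The sketch below assumes a worker is expired after three missed heartbeats — that threshold is an illustrative assumption, not a documented engine value:

```python
def is_expired(last_heartbeat, now, interval, missed_allowed=3):
    """Return True once the worker has gone silent for longer than
    `missed_allowed` heartbeat intervals.

    last_heartbeat, now: timestamps in seconds (any monotonic clock)
    interval:            HEARTBEAT_INTERVAL in seconds
    missed_allowed:      tolerated consecutive misses (assumed: 3)
    """
    return (now - last_heartbeat) > missed_allowed * interval
```

Allowing a few missed beats avoids flapping on transient network hiccups while still bounding how long a dead worker can hold a job.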
### Log-based monitoring

The worker logs structured events at key lifecycle points:

| Event | Level | Indicates |
|---|---|---|
| `NEOPS Worker starting...` | INFO | Worker initializing |
| `Found N function block(s)` | INFO | Discovery completed |
| `Registering worker with backend...` | INFO | Registration in progress |
| `Processing N job(s)` | INFO | Jobs received and processing |
| `Received SIGTERM. Shutting down...` | INFO | Graceful shutdown initiated |
| `Shutdown requested, skipping N remaining job(s)` | WARNING | Shutdown during job batch |
| `Worker registration failed` | ERROR | Cannot reach workflow engine |
| `Worker expired! Backend rejected ping with 404.` | ERROR | Worker invalidated by backend |
Forward these logs to your monitoring stack (ELK, Grafana Loki, Datadog) for alerting.
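A minimal sketch of turning those log lines into alerts. The event strings come from the table above; the severity policy (page on ERROR events, ticket on the shutdown WARNING) and the `classify` helper are illustrative — real deployments would express this as alert rules in the monitoring stack:

```python
import re

# Known lifecycle events mapped to an example alerting action.
ALERT_RULES = [
    (re.compile(r"Worker expired!"), "page"),
    (re.compile(r"Worker registration failed"), "page"),
    (re.compile(r"Shutdown requested, skipping \d+ remaining job"), "ticket"),
]

def classify(line):
    """Return the alert action for a worker log line, or 'ignore'."""
    for pattern, action in ALERT_RULES:
        if pattern.search(line):
            return action
    return "ignore"
```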
## Error Recovery
| Failure | Worker behavior | Engine behavior |
|---|---|---|
| Network partition | Misses heartbeats, reconnects on recovery | Marks worker expired, re-queues jobs |
| Function block exception | Reports failure result to blackboard | Marks job as FAILED_SAFE or FAILED_UNSAFE based on purity |
| Worker crash | Process exits | Detects missed heartbeats, re-queues |
| Workflow engine restart | Worker retries API calls | Re-accepts worker registrations |
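The "worker retries API calls" behaviour can be approximated with exponential backoff. The helper below, its defaults, and the use of `ConnectionError` are illustrative assumptions, not the worker's actual implementation:

```python
import time

def retry_with_backoff(call, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `call()` and retry on ConnectionError, doubling the delay
    each attempt (1s, 2s, 4s, ...). Re-raises after the last attempt.
    Useful while the workflow engine restarts and briefly refuses
    connections.
    """
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Capping attempts (rather than retrying forever) lets the process exit with an error that your orchestrator's restart policy can then handle.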
## Security Considerations

- Credentials: Device credentials live in the CMS. Workers access them through `WorkflowContext`. Never store credentials in function block code or environment variables.
- Network access: Workers need network access to both the workflow engine API and the managed devices. Use network segmentation to limit blast radius.
- TLS: Configure `URL_BLACKBOARD` with `https://` in production. The underlying HTTP client respects standard TLS settings.
- Least privilege: Each worker only needs access to the devices its function blocks manage. Use separate workers for different network zones.
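A startup guard along these lines can enforce the `https://` requirement before the worker registers. `check_blackboard_url` is a hypothetical helper for illustration, not part of neops:

```python
from urllib.parse import urlparse

def check_blackboard_url(url, require_tls=True):
    """Fail fast if URL_BLACKBOARD is not https:// when TLS is
    required (i.e. in production). Returns the URL unchanged when it
    passes the check.
    """
    scheme = urlparse(url).scheme
    if require_tls and scheme != "https":
        raise ValueError(f"URL_BLACKBOARD must use https://, got {scheme}://")
    return url
```

Failing at startup surfaces a misconfigured plaintext endpoint immediately instead of silently sending credentials-adjacent traffic over HTTP.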