# Operations

## Health Checks

The engine exposes a health endpoint at `GET /health/`. Use it for:

- Container health checks (Docker, Kubernetes)
- Load balancer probes
- Monitoring systems
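For Kubernetes, the endpoint plugs straight into a liveness probe. A sketch, assuming the engine container listens on port 3030 as in the examples below; the timing values are illustrative:

```yaml
livenessProbe:
  httpGet:
    path: /health/
    port: 3030
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3
```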
A healthy engine responds with HTTP 200; any non-2xx response (or no response at all) should be treated as unhealthy.
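For scripting, the status code alone is enough to decide health. A minimal sketch; the base URL is the default from the examples below and may differ in your deployment:

```shell
# check_health <base-url>: succeed only if /health/ answers with a 2xx status.
# -f turns HTTP errors into a non-zero curl exit code; --max-time bounds the probe.
check_health() {
  curl -fsS --max-time 5 "$1/health/" > /dev/null 2>&1
}

if check_health "http://localhost:3030"; then
  echo "engine healthy"
else
  echo "engine unhealthy"
fi
```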
## Monitoring Executions

### Active Executions

```shell
# All active (non-terminal) executions
curl http://localhost:3030/workflow-execution/active

# Filter by state
curl http://localhost:3030/workflow-execution/state/RUNNING
curl http://localhost:3030/workflow-execution/state/LOCKING
```
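When monitoring by hand it is common to wait for an execution to leave a state, for example for a RUNNING execution to finish before retrying. A small polling helper, a sketch with an arbitrary 1-second retry interval:

```shell
# poll_until <seconds> <cmd...>: rerun <cmd> until it succeeds or the deadline passes.
poll_until() {
  deadline=$(( $(date +%s) + $1 ))
  shift
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
}

# Example: wait up to 60s for an execution to drop off the active list.
# poll_until 60 sh -c '! curl -fsS http://localhost:3030/workflow-execution/active | grep -q "<execution-id>"'
```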
### Execution Details

```shell
# Full execution details including jobs
curl http://localhost:3030/workflow-execution/id/<execution-id>
```
### Active Jobs

```shell
# All active jobs (PENDING or POLLED)
curl http://localhost:3030/blackboard/jobs/active

# All jobs across all executions
curl http://localhost:3030/blackboard/jobs
```
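These endpoints all share one base URL, so a tiny wrapper keeps ad-hoc queries short. A sketch; `NEOPS_URL` is a hypothetical convenience variable for this script, not something the engine itself reads:

```shell
# neops_get <path>: GET an engine endpoint relative to the base URL.
NEOPS_URL="${NEOPS_URL:-http://localhost:3030}"   # assumption: engine default address
neops_get() {
  curl -fsS "$NEOPS_URL/$1"
}

# Usage:
#   neops_get workflow-execution/active
#   neops_get blackboard/jobs/active
```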
## Common Issues

### Workflow stuck in SCHEDULED

Cause: the entities are locked by another active execution.

Resolution: wait for the blocking execution to complete, or abort it (see "Aborting Executions" below).
### Workflow stuck in LOCKING

Cause: the CMS is unreachable or not responding to the lock request.

Resolution: check CMS connectivity (`NEOPS_CMS_URL`) and the CMS logs. If the CMS was temporarily down, restart the engine -- it will retry on startup.
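A quick reachability check from the engine host can separate network problems from CMS-side failures. A sketch; it probes the base URL only, since the exact CMS health path isn't documented here, and the fallback URL is purely illustrative:

```shell
# Probe the CMS with the same URL the engine uses (NEOPS_CMS_URL).
CMS_URL="${NEOPS_CMS_URL:-http://localhost:8080}"   # fallback is illustrative only
if curl -fsS --max-time 5 "$CMS_URL" > /dev/null 2>&1; then
  echo "CMS reachable at $CMS_URL"
else
  echo "CMS unreachable at $CMS_URL"
fi
```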
### Jobs not being picked up (many PENDING)

Possible causes:

- No workers online -- check `GET /workers/online`
- No worker has the required FB -- check whether any online worker has the function block registered
- Workers not polling -- check worker logs for connectivity issues
### FAILED_UNSAFE investigation

When an execution ends as FAILED_UNSAFE (meaning side effects may have occurred):

- Get the execution details: `GET /workflow-execution/id/<id>`
- Find the failed job(s) and which device(s) they ran on
- Check which steps executed successfully (those may have side effects)
- Verify the device state manually or through a read-only workflow
### Engine restart recovery

On restart, the engine:

- Recovers active executions from PostgreSQL
- Re-evaluates execution state
- Resumes event processing

In-flight jobs (POLLED by workers) are handled by the stuck-job cleanup. If the worker is still running, it may push results after the engine restarts.
## Aborting Executions

> **Limited implementation:** the abort endpoint is currently a placeholder. Full abort behavior (entity lock release, in-flight job cancellation) is planned. For now, stuck executions eventually time out through the stuck-job cleanup mechanism.
## Database Maintenance

The engine uses MikroORM with PostgreSQL. Standard PostgreSQL maintenance applies:

- Vacuuming -- PostgreSQL auto-vacuum handles this in most cases
- Backup -- standard `pg_dump` for backups
- Disk space -- completed executions and their jobs are retained indefinitely. Consider periodic cleanup of old execution data if disk space is a concern.
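For backups, a scheduled `pg_dump` is usually enough. A cron sketch; the database name, connection settings, and backup path are assumptions to adapt to your deployment:

```
# crontab entry: custom-format dump (-Fc, restorable with pg_restore) every night at 02:00
0 2 * * * pg_dump -Fc -d neops -f /var/backups/neops_$(date +\%F).dump
```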