Operations

Health Checks

The engine exposes a health endpoint at GET /health/. Use it for:

  • Container health checks (Docker, Kubernetes)
  • Load balancer probes
  • Monitoring systems

curl http://localhost:3030/health/

A healthy response:

{
  "status": "ok"
}
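
Since the endpoint is intended for container health checks, it can back a liveness probe directly. A minimal Kubernetes sketch, assuming the engine container serves port 3030 (adjust path, port, and timings to your deployment):

```yaml
# Kubernetes liveness probe sketch -- assumes the engine container
# exposes /health/ on port 3030
livenessProbe:
  httpGet:
    path: /health/
    port: 3030
  initialDelaySeconds: 5
  periodSeconds: 10
```

For plain Docker, the equivalent is a `HEALTHCHECK CMD curl -f http://localhost:3030/health/ || exit 1` instruction in the image's Dockerfile.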

Monitoring Executions

Active Executions

# All active (non-terminal) executions
curl http://localhost:3030/workflow-execution/active

# Filter by state
curl http://localhost:3030/workflow-execution/state/RUNNING
curl http://localhost:3030/workflow-execution/state/LOCKING

Execution Details

# Full execution details including jobs
curl http://localhost:3030/workflow-execution/id/<execution-id>

Active Jobs

# All active jobs (PENDING or POLLED)
curl http://localhost:3030/blackboard/jobs/active

# All jobs across all executions
curl http://localhost:3030/blackboard/jobs
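
The JSON these endpoints return can feed a simple monitoring check, for example alerting when too many executions sit in one state. The sketch below is illustrative only: the `id` and `state` field names are assumptions about the response shape, not a documented schema.

```python
from collections import Counter

def executions_by_state(executions: list[dict]) -> Counter:
    """Tally executions per state from a /workflow-execution/active
    response body. Assumes each execution dict carries a 'state'
    field (an assumption, not a documented schema).
    """
    return Counter(e.get("state", "UNKNOWN") for e in executions)

# Example with a mocked response body:
sample = [
    {"id": "a1", "state": "RUNNING"},
    {"id": "b2", "state": "LOCKING"},
    {"id": "c3", "state": "RUNNING"},
]
print(executions_by_state(sample))  # Counter({'RUNNING': 2, 'LOCKING': 1})
```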

Common Issues

Workflow stuck in SCHEDULED

Cause: The entities targeted by the workflow are locked by another active execution.

Resolution: Either wait for the blocking execution to complete, or abort it:

curl -X DELETE http://localhost:3030/workflow-execution/<blocking-execution-id>

Workflow stuck in LOCKING

Cause: The CMS is unreachable or not responding to the lock request.

Resolution: Check CMS connectivity (NEOPS_CMS_URL) and CMS logs. Restart the engine if the CMS was temporarily down -- the engine will retry on startup.

Jobs not being picked up (many PENDING)

Possible causes:

  1. No workers online -- Check GET /workers/online
  2. No worker has the required FB -- Check if any online worker has the function block registered
  3. Workers not polling -- Check worker logs for connectivity issues
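
Cause 2 can be checked mechanically by comparing the function blocks that pending jobs require against what the online workers report. A sketch, assuming hypothetical field names `fb` on jobs and `function_blocks` on workers (the real response shapes may differ):

```python
def unserved_function_blocks(jobs: list[dict], workers: list[dict]) -> set[str]:
    """Return the function blocks required by pending jobs that no
    online worker has registered. 'fb' and 'function_blocks' are
    assumed field names, not a documented schema.
    """
    required = {j["fb"] for j in jobs}
    available = {fb for w in workers for fb in w.get("function_blocks", [])}
    return required - available

# Example with mocked responses from /blackboard/jobs/active and /workers/online:
jobs = [{"fb": "backup_config"}, {"fb": "reboot_device"}]
workers = [{"function_blocks": ["backup_config"]}]
print(unserved_function_blocks(jobs, workers))  # {'reboot_device'}
```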

FAILED_UNSAFE investigation

When an execution ends as FAILED_UNSAFE (meaning side effects may have occurred):

  1. Get the execution details: GET /workflow-execution/id/<id>
  2. Find the failed job(s) and which device(s) they ran on
  3. Check which steps executed successfully (those may have side effects)
  4. Verify the device state manually or through a read-only workflow
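
Steps 1 through 3 can be scripted against the execution-details response. The sketch below assumes each job dict exposes `id`, `state`, and `device` fields; treat those names as placeholders for the actual schema.

```python
def failed_jobs_and_devices(execution: dict) -> list[tuple[str, str]]:
    """Extract (job id, device) pairs for failed jobs from an
    execution-details payload. Field names are assumptions about
    the GET /workflow-execution/id/<id> response shape.
    """
    return [
        (j.get("id", "?"), j.get("device", "?"))
        for j in execution.get("jobs", [])
        if j.get("state") == "FAILED"
    ]

# Example with a mocked execution-details body:
execution = {
    "jobs": [
        {"id": "j1", "state": "SUCCEEDED", "device": "router-1"},
        {"id": "j2", "state": "FAILED", "device": "router-2"},
    ]
}
print(failed_jobs_and_devices(execution))  # [('j2', 'router-2')]
```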

Engine restart recovery

On restart, the engine:

  1. Recovers active executions from PostgreSQL
  2. Re-evaluates execution state
  3. Resumes event processing

In-flight jobs (POLLED by workers) are handled by the stuck job cleanup. If the worker is still running, it may push results after the engine restarts.

Aborting Executions

curl -X DELETE http://localhost:3030/workflow-execution/<execution-id>

Limited implementation

The abort endpoint is currently a placeholder. Full abort behavior (entity lock release, in-flight job cancellation) is planned. For now, stuck executions will eventually time out through the stuck job cleanup mechanism.

Database Maintenance

The engine uses MikroORM with PostgreSQL. Standard PostgreSQL maintenance applies:

  • Vacuuming -- PostgreSQL auto-vacuum handles this in most cases
  • Backup -- Standard pg_dump for backups
  • Disk space -- Completed executions and their jobs are retained indefinitely. Consider periodic cleanup of old execution data if disk space is a concern.

Version Information

# Engine version and API version
curl http://localhost:3030/version/

# Check worker SDK compatibility
curl -X POST http://localhost:3030/version/compatibility/ \
  -H "Content-Type: application/json" \
  -d '{"version": "1.0.0", "library": "neops_worker_sdk"}'
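
When automating worker rollouts, the compatibility call can gate deployment. A minimal sketch that just mirrors the request body shown above; how the engine signals compatibility in its response is not documented here, so inspect the actual reply before branching on it.

```python
import json

def build_compat_payload(version: str, library: str = "neops_worker_sdk") -> str:
    """Serialize the request body for POST /version/compatibility/,
    matching the curl example's payload.
    """
    return json.dumps({"version": version, "library": library})

print(build_compat_payload("1.0.0"))
# {"version": "1.0.0", "library": "neops_worker_sdk"}
```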