Worker Management

Workers are processes running the Worker SDK that execute function blocks. The engine tracks worker health, manages job assignment, and automatically handles unresponsive workers.

Worker States

```mermaid
stateDiagram-v2
    [*] --> ONLINE: register + heartbeat
    ONLINE --> UNREACHABLE: no ping for 2 min
    UNREACHABLE --> ONLINE: ping received
    UNREACHABLE --> OFFLINE: no ping for 6 min
    OFFLINE --> ONLINE: ping received
    OFFLINE --> Deleted: auto-cleanup after 24h
    Deleted --> [*]
```

| State | Condition | Impact on jobs |
|---|---|---|
| ONLINE | Heartbeat within last 2 minutes | Jobs are assigned normally |
| UNREACHABLE | No heartbeat for 2-6 minutes | No new jobs assigned; existing jobs continue |
| OFFLINE | No heartbeat for 6+ minutes | In-flight jobs are auto-failed |
| Deleted | Offline for 24+ hours | Worker registration removed |

Threshold Cascade

The thresholds are designed to cascade:

0 min   Worker stops pinging
2 min   UNREACHABLE (WORKER_UNREACHABLE_THRESHOLD_MS)
6 min   OFFLINE (3x unreachable, WORKER_OFFLINE_THRESHOLD_MS)
12 min  Stuck jobs failed (2x offline, BLACKBOARD_STUCK_JOB_TIMEOUT)
24 h    Worker deleted (WORKER_OFFLINE_DELETE_THRESHOLD_MS)
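The cascade can be sketched as a pure function of heartbeat age. The constant names come from the table above; the values are the documented defaults, expressed in milliseconds:

```python
# Engine defaults, in milliseconds (names from the threshold cascade above).
WORKER_UNREACHABLE_THRESHOLD_MS = 2 * 60 * 1000                    # 2 min
WORKER_OFFLINE_THRESHOLD_MS = 3 * WORKER_UNREACHABLE_THRESHOLD_MS  # 6 min
WORKER_OFFLINE_DELETE_THRESHOLD_MS = 24 * 60 * 60 * 1000           # 24 h

def classify(ms_since_last_ping: int) -> str:
    """Map the age of the last heartbeat to a worker state."""
    if ms_since_last_ping < WORKER_UNREACHABLE_THRESHOLD_MS:
        return "ONLINE"
    if ms_since_last_ping < WORKER_OFFLINE_THRESHOLD_MS:
        return "UNREACHABLE"
    if ms_since_last_ping < WORKER_OFFLINE_DELETE_THRESHOLD_MS:
        return "OFFLINE"
    return "DELETED"
```

Because each threshold is a multiple of the previous one, a worker always passes through UNREACHABLE before OFFLINE, and its in-flight jobs are only failed well after it has been marked OFFLINE.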

Worker Lifecycle

Registration

Workers register on startup via POST /workers/register and receive a UUID. They then register their function blocks with this UUID, enabling the blackboard to route jobs to them.
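A minimal registration sketch, with the HTTP call injectable for testing. The response shape (`{"uuid": ...}`) and the function-block registration endpoint are assumptions for illustration; only `POST /workers/register` itself is documented above:

```python
import json
import urllib.request

def _http_post(url: str, body: dict) -> dict:
    """Default transport: POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        url, data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def register_worker(engine_url: str, function_blocks: list, post=None) -> str:
    """Register with the engine and return the assigned worker UUID."""
    post = post or _http_post
    resp = post(f"{engine_url}/workers/register", {})
    uuid = resp["uuid"]  # assumed response field
    # Register each function block under the new UUID so the blackboard
    # can route matching jobs here (this endpoint path is hypothetical).
    for fb in function_blocks:
        post(f"{engine_url}/workers/{uuid}/function-blocks", {"name": fb})
    return uuid
```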

Heartbeat

Workers send periodic pings (POST /workers/:uuid/ping). The Worker SDK defaults to every 20 seconds. If a ping returns 404 (worker was cleaned up), the worker re-registers.
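The ping-and-recover behavior can be sketched like this, with the transport and re-registration step injected so the logic stands alone (the real Worker SDK handles this internally):

```python
import time

PING_INTERVAL_S = 20  # Worker SDK default

def heartbeat_once(engine_url: str, uuid: str, post, reregister) -> str:
    """Send one ping and return the (possibly new) worker UUID.

    `post` returns an HTTP status code; `reregister` re-runs registration
    and returns a fresh UUID. Both are injected for illustration.
    """
    status = post(f"{engine_url}/workers/{uuid}/ping")
    if status == 404:
        # The engine cleaned this worker up (e.g. after 24 h offline):
        # register again and continue under the new UUID.
        return reregister()
    return uuid

def heartbeat_loop(engine_url: str, uuid: str, post, reregister, running):
    """Ping every PING_INTERVAL_S seconds while `running()` is true."""
    while running():
        uuid = heartbeat_once(engine_url, uuid, post, reregister)
        time.sleep(PING_INTERVAL_S)
```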

Graceful Shutdown

On SIGTERM/SIGINT, workers call POST /workers/:uuid/unregister. This immediately marks the worker as offline without deleting its function block registrations. The worker can come back online later by re-registering and sending a ping.
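A sketch of the shutdown hook, again with the HTTP call injected; the handler is returned so it can be exercised directly:

```python
import signal

def install_shutdown_hook(engine_url: str, uuid: str, post):
    """Unregister on SIGTERM/SIGINT so the engine marks us offline at once."""
    def _handler(signum, frame):
        # Marks the worker offline but keeps its function-block
        # registrations, so a later ping can bring it back online.
        post(f"{engine_url}/workers/{uuid}/unregister")
        raise SystemExit(0)
    signal.signal(signal.SIGTERM, _handler)
    signal.signal(signal.SIGINT, _handler)
    return _handler
```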

Stuck Job Cleanup

The engine runs a cleanup service every minute (BLACKBOARD_JOB_CHECK_INTERVAL). It finds jobs that have been in POLLED state longer than BLACKBOARD_STUCK_JOB_TIMEOUT (default: 12 minutes) and marks them as failed.

This handles:

  • Workers that crash without graceful shutdown
  • Network partitions between workers and the engine
  • Worker processes that hang during execution
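The cleanup pass amounts to a sweep over polled jobs. A sketch, where the job dicts stand in for the engine's job store rather than its real schema:

```python
from datetime import datetime, timedelta, timezone

STUCK_JOB_TIMEOUT = timedelta(minutes=12)  # BLACKBOARD_STUCK_JOB_TIMEOUT default

def fail_stuck_jobs(jobs, now=None):
    """Mark POLLED jobs older than the timeout as FAILED; return those failed.

    Each job is a dict with "status" and "polled_at" keys (illustrative
    shape, not the engine's actual schema).
    """
    now = now or datetime.now(timezone.utc)
    failed = []
    for job in jobs:
        if job["status"] == "POLLED" and now - job["polled_at"] > STUCK_JOB_TIMEOUT:
            job["status"] = "FAILED"
            failed.append(job)
    return failed
```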

API Reference

| Endpoint | Method | Description |
|---|---|---|
| /workers/register | POST | Register a new worker |
| /workers/:uuid/ping | POST | Send heartbeat |
| /workers/:uuid/unregister | POST | Graceful shutdown |
| /workers | GET | List all workers (add ?findDeleted=true for soft-deleted) |
| /workers/online | GET | List online workers only |
| /workers/:uuid | GET | Worker details |
| /workers/:uuid/function-blocks | GET | Worker's registered FBs |
| /workers/:uuid/jobs | GET | Worker's jobs (filter with ?status=polled) |
| /workers/:uuid | DELETE | Delete a worker |
| /workers/cleanup | POST | Manual cleanup (offline > 24h) |
| /workers/cleanup-offline | POST | Manual cleanup (all currently offline) |

Scaling Workers

  • Horizontal scaling -- Deploy multiple workers with the same function blocks. The blackboard distributes jobs across them.
  • Specialized workers -- Deploy workers with different function block sets (e.g., one worker for Cisco FBs, another for Juniper).
  • Geographic distribution -- Workers can run in different locations as long as they can reach the engine's REST API and the target devices.

Monitoring Worker Health

Use GET /workers/online to check how many workers are available. If jobs are queuing up (many PENDING jobs), add more workers. If workers are frequently going UNREACHABLE, check network connectivity between workers and the engine.
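A simple health check built on those endpoints might look like this. The fetcher is injected, the `/jobs` path and the summary field names are assumptions for illustration (only the per-worker jobs endpoint is documented above):

```python
def health_report(get):
    """Summarize worker/job health from the engine's REST API.

    `get` fetches and JSON-decodes a list for a given path. The "/jobs"
    path and the report field names are illustrative assumptions.
    """
    online = get("/workers/online")
    pending = [j for j in get("/jobs") if j.get("status") == "PENDING"]
    report = {"online_workers": len(online), "pending_jobs": len(pending)}
    if report["pending_jobs"] > 0 and report["online_workers"] == 0:
        report["hint"] = "jobs are queuing with no workers online: add workers"
    return report
```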