Worker Management

Workers are processes running the Worker SDK that execute function blocks. The engine tracks worker health, manages job assignment, and automatically handles unresponsive workers.

Worker States

```mermaid
stateDiagram-v2
    [*] --> ONLINE: register + heartbeat
    ONLINE --> UNREACHABLE: no ping for 2 min
    UNREACHABLE --> ONLINE: ping received
    UNREACHABLE --> OFFLINE: no ping for 6 min
    OFFLINE --> ONLINE: ping received
    OFFLINE --> Deleted: auto-cleanup after 24h
    Deleted --> [*]
```

| State | Condition | Impact on jobs |
|---|---|---|
| ONLINE | Heartbeat within last 2 minutes | Jobs are assigned normally |
| UNREACHABLE | No heartbeat for 2-6 minutes | No new jobs assigned; existing jobs continue |
| OFFLINE | No heartbeat for 6+ minutes | In-flight jobs are auto-failed |
| Deleted | Offline for 24+ hours | Worker registration removed |

Threshold Cascade

The thresholds are designed to cascade:

0 min   Worker stops pinging
2 min   UNREACHABLE (WORKER_UNREACHABLE_THRESHOLD_MS)
6 min   OFFLINE (3x unreachable, WORKER_OFFLINE_THRESHOLD_MS)
12 min  Stuck jobs failed (2x offline, BLACKBOARD_STUCK_JOB_TIMEOUT)
24 h    Worker deleted (WORKER_OFFLINE_DELETE_THRESHOLD_MS)
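The cascade can be sketched as a pure function of heartbeat age. The constant names come from the table above; the values are the documented defaults, expressed in milliseconds:

```python
# Engine defaults, in milliseconds (names from the threshold cascade above).
WORKER_UNREACHABLE_THRESHOLD_MS = 2 * 60 * 1000                    # 2 min
WORKER_OFFLINE_THRESHOLD_MS = 3 * WORKER_UNREACHABLE_THRESHOLD_MS  # 6 min
WORKER_OFFLINE_DELETE_THRESHOLD_MS = 24 * 60 * 60 * 1000           # 24 h

def classify(ms_since_last_ping: int) -> str:
    """Map the age of the last heartbeat to a worker state."""
    if ms_since_last_ping < WORKER_UNREACHABLE_THRESHOLD_MS:
        return "ONLINE"
    if ms_since_last_ping < WORKER_OFFLINE_THRESHOLD_MS:
        return "UNREACHABLE"
    if ms_since_last_ping < WORKER_OFFLINE_DELETE_THRESHOLD_MS:
        return "OFFLINE"
    return "DELETED"
```

Because each threshold is a multiple of the previous one, a worker always passes through UNREACHABLE before OFFLINE, and its in-flight jobs are only failed well after it has been marked OFFLINE.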

Worker Lifecycle

Registration

Workers register on startup via POST /workers/register and receive a UUID. They then register their function blocks with this UUID, enabling the blackboard to route jobs to them.
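A minimal registration sketch, with the HTTP call injectable for testing. The response shape (`{"uuid": ...}`) and the function-block registration endpoint are assumptions for illustration; only `POST /workers/register` itself is documented above:

```python
import json
import urllib.request

def _http_post(url: str, body: dict) -> dict:
    """Default transport: POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        url, data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def register_worker(engine_url: str, function_blocks: list, post=None) -> str:
    """Register with the engine and return the assigned worker UUID."""
    post = post or _http_post
    resp = post(f"{engine_url}/workers/register", {})
    uuid = resp["uuid"]  # assumed response field
    # Register each function block under the new UUID so the blackboard
    # can route matching jobs here (this endpoint path is hypothetical).
    for fb in function_blocks:
        post(f"{engine_url}/workers/{uuid}/function-blocks", {"name": fb})
    return uuid
```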

Heartbeat

Workers send periodic pings (POST /workers/:uuid/ping). The Worker SDK defaults to every 20 seconds. If a ping returns 404 (worker was cleaned up), the worker re-registers.
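The ping-and-recover behavior can be sketched like this, with the transport and re-registration step injected so the logic stands alone (the real Worker SDK handles this internally):

```python
import time

PING_INTERVAL_S = 20  # Worker SDK default

def heartbeat_once(engine_url: str, uuid: str, post, reregister) -> str:
    """Send one ping and return the (possibly new) worker UUID.

    `post` returns an HTTP status code; `reregister` re-runs registration
    and returns a fresh UUID. Both are injected for illustration.
    """
    status = post(f"{engine_url}/workers/{uuid}/ping")
    if status == 404:
        # The engine cleaned this worker up (e.g. after 24 h offline):
        # register again and continue under the new UUID.
        return reregister()
    return uuid

def heartbeat_loop(engine_url: str, uuid: str, post, reregister, running):
    """Ping every PING_INTERVAL_S seconds while `running()` is true."""
    while running():
        uuid = heartbeat_once(engine_url, uuid, post, reregister)
        time.sleep(PING_INTERVAL_S)
```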

Graceful Shutdown

On SIGTERM/SIGINT, workers call POST /workers/:uuid/unregister. This immediately marks the worker as offline without deleting its function block registrations. The worker can come back online later by re-registering and sending a ping.
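A sketch of the shutdown hook, again with the HTTP call injected; the handler is returned so it can be exercised directly:

```python
import signal

def install_shutdown_hook(engine_url: str, uuid: str, post):
    """Unregister on SIGTERM/SIGINT so the engine marks us offline at once."""
    def _handler(signum, frame):
        # Marks the worker offline but keeps its function-block
        # registrations, so a later ping can bring it back online.
        post(f"{engine_url}/workers/{uuid}/unregister")
        raise SystemExit(0)
    signal.signal(signal.SIGTERM, _handler)
    signal.signal(signal.SIGINT, _handler)
    return _handler
```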

Stuck Job Cleanup

The engine runs a cleanup service every minute (BLACKBOARD_JOB_CHECK_INTERVAL). It finds jobs that have been in POLLED state longer than BLACKBOARD_STUCK_JOB_TIMEOUT (default: 12 minutes) and marks them as failed.

This handles:

  • Workers that crash without graceful shutdown
  • Network partitions between workers and the engine
  • Worker processes that hang during execution
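The cleanup pass amounts to a sweep over polled jobs. A sketch, where the job dicts stand in for the engine's job store rather than its real schema:

```python
from datetime import datetime, timedelta, timezone

STUCK_JOB_TIMEOUT = timedelta(minutes=12)  # BLACKBOARD_STUCK_JOB_TIMEOUT default

def fail_stuck_jobs(jobs, now=None):
    """Mark POLLED jobs older than the timeout as FAILED; return those failed.

    Each job is a dict with "status" and "polled_at" keys (illustrative
    shape, not the engine's actual schema).
    """
    now = now or datetime.now(timezone.utc)
    failed = []
    for job in jobs:
        if job["status"] == "POLLED" and now - job["polled_at"] > STUCK_JOB_TIMEOUT:
            job["status"] = "FAILED"
            failed.append(job)
    return failed
```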

API Reference

| Endpoint | Method | Description |
|---|---|---|
| /workers/register | POST | Register a new worker |
| /workers/:uuid/ping | POST | Send heartbeat |
| /workers/:uuid/unregister | POST | Graceful shutdown |
| /workers | GET | List all workers (add ?findDeleted=true for soft-deleted) |
| /workers/online | GET | List online workers only |
| /workers/:uuid | GET | Worker details |
| /workers/:uuid/function-blocks | GET | Worker's registered FBs |
| /workers/:uuid/jobs | GET | Worker's jobs (filter with ?status=polled) |
| /workers/:uuid | DELETE | Delete a worker |
| /workers/cleanup | POST | Manual cleanup (offline > 24h) |
| /workers/cleanup-offline | POST | Manual cleanup (all currently offline) |

Scaling Workers

  • Horizontal scaling -- Deploy multiple workers with the same function blocks. The blackboard distributes jobs across them.
  • Specialized workers -- Deploy workers with different function block sets (e.g., one worker for Cisco FBs, another for Juniper).
  • Geographic distribution -- Workers can run in different locations as long as they can reach the engine's REST API and the target devices.

Monitoring Worker Health

Use GET /workers/online to check how many workers are available. If jobs are queuing up (many PENDING jobs), add more workers. If workers are frequently going UNREACHABLE, check network connectivity between workers and the engine.
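A simple health check built on those endpoints might look like this. The fetcher is injected, the `/jobs` path and the summary field names are assumptions for illustration (only the per-worker jobs endpoint is documented above):

```python
def health_report(get):
    """Summarize worker/job health from the engine's REST API.

    `get` fetches and JSON-decodes a list for a given path. The "/jobs"
    path and the report field names are illustrative assumptions.
    """
    online = get("/workers/online")
    pending = [j for j in get("/jobs") if j.get("status") == "PENDING"]
    report = {"online_workers": len(online), "pending_jobs": len(pending)}
    if report["pending_jobs"] > 0 and report["online_workers"] == 0:
        report["hint"] = "jobs are queuing with no workers online: add workers"
    return report
```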