# Worker Management
Workers are processes running the Worker SDK that execute function blocks. The engine tracks worker health, manages job assignment, and automatically handles unresponsive workers.
## Worker States
```mermaid
stateDiagram-v2
    [*] --> ONLINE: register + heartbeat
    ONLINE --> UNREACHABLE: no ping for 2 min
    UNREACHABLE --> ONLINE: ping received
    UNREACHABLE --> OFFLINE: no ping for 6 min
    OFFLINE --> ONLINE: ping received
    OFFLINE --> Deleted: auto-cleanup after 24h
    Deleted --> [*]
```
| State | Condition | Impact on jobs |
|---|---|---|
| ONLINE | Heartbeat within last 2 minutes | Jobs are assigned normally |
| UNREACHABLE | No heartbeat for 2-6 minutes | No new jobs assigned, existing jobs continue |
| OFFLINE | No heartbeat for 6+ minutes | In-flight jobs are auto-failed |
| Deleted | Offline for 24+ hours | Worker registration removed |
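The table above amounts to classifying a worker by the age of its last heartbeat. A minimal sketch of that classification (the constant and function names here are illustrative, not the engine's actual identifiers):

```python
# Thresholds from the state table; the engine configures these in
# milliseconds via the environment variables named in the comments.
UNREACHABLE_THRESHOLD_S = 2 * 60   # WORKER_UNREACHABLE_THRESHOLD_MS
OFFLINE_THRESHOLD_S = 6 * 60       # WORKER_OFFLINE_THRESHOLD_MS
DELETE_THRESHOLD_S = 24 * 60 * 60  # WORKER_OFFLINE_DELETE_THRESHOLD_MS

def worker_state(seconds_since_last_ping: float) -> str:
    """Map heartbeat age to the worker state from the table above."""
    if seconds_since_last_ping < UNREACHABLE_THRESHOLD_S:
        return "ONLINE"
    if seconds_since_last_ping < OFFLINE_THRESHOLD_S:
        return "UNREACHABLE"
    if seconds_since_last_ping < DELETE_THRESHOLD_S:
        return "OFFLINE"
    return "Deleted"
```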
## Threshold Cascade
The thresholds are designed to cascade:
| Elapsed | Event | Setting |
|---|---|---|
| 0 min | Worker stops pinging | |
| 2 min | Marked UNREACHABLE | `WORKER_UNREACHABLE_THRESHOLD_MS` |
| 6 min | Marked OFFLINE (3x unreachable) | `WORKER_OFFLINE_THRESHOLD_MS` |
| 12 min | Stuck jobs failed (2x offline) | `BLACKBOARD_STUCK_JOB_TIMEOUT` |
| 24 h | Worker deleted | `WORKER_OFFLINE_DELETE_THRESHOLD_MS` |
## Worker Lifecycle
### Registration
Workers register on startup via POST /workers/register and receive a UUID. They then register their function blocks with this UUID, enabling the blackboard to route jobs to them.
### Heartbeat
Workers send periodic pings (POST /workers/:uuid/ping). The Worker SDK defaults to every 20 seconds. If a ping returns 404 (worker was cleaned up), the worker re-registers.
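The ping loop can be sketched as below. This is a simplified illustration, not the Worker SDK's actual implementation; `client` stands in for any object exposing a `register()` call that returns a UUID and a `ping(uuid)` call that returns an HTTP status code:

```python
import time

class HeartbeatLoop:
    """Periodically ping the engine; re-register if the engine has
    cleaned this worker up (ping returns 404)."""

    def __init__(self, client, interval_s: float = 20.0):
        self.client = client
        self.interval_s = interval_s
        self.uuid = client.register()  # POST /workers/register

    def tick(self) -> None:
        # POST /workers/:uuid/ping; a 404 means our registration is gone.
        if self.client.ping(self.uuid) == 404:
            self.uuid = self.client.register()

    def run(self) -> None:
        while True:
            self.tick()
            time.sleep(self.interval_s)
```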
### Graceful Shutdown
On SIGTERM/SIGINT, workers call POST /workers/:uuid/unregister. This immediately marks the worker as offline without deleting its function block registrations. The worker can come back online later by re-registering and sending a ping.
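A minimal sketch of such a shutdown hook, assuming a `client.unregister(uuid)` helper that stands in for the SDK's call to `POST /workers/:uuid/unregister`:

```python
import signal
import sys

def install_shutdown_hook(client, uuid: str) -> None:
    """On SIGTERM/SIGINT, unregister the worker and exit.

    Unregistering marks the worker offline but keeps its function
    block registrations, so it can resume by pinging again later.
    """
    def _handler(signum, frame):
        client.unregister(uuid)  # POST /workers/:uuid/unregister
        sys.exit(0)

    signal.signal(signal.SIGTERM, _handler)
    signal.signal(signal.SIGINT, _handler)
```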
## Stuck Job Cleanup
The engine runs a cleanup service every minute (BLACKBOARD_JOB_CHECK_INTERVAL). It finds jobs that have been in POLLED state longer than BLACKBOARD_STUCK_JOB_TIMEOUT (default: 12 minutes) and marks them as failed.
This handles:
- Workers that crash without graceful shutdown
- Network partitions between workers and the engine
- Worker processes that hang during execution
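The core of the cleanup pass can be sketched as a filter over jobs, assuming each job exposes a status and the time it was polled (field names here are illustrative):

```python
from datetime import datetime, timedelta

STUCK_JOB_TIMEOUT = timedelta(minutes=12)  # BLACKBOARD_STUCK_JOB_TIMEOUT

def find_stuck_jobs(jobs, now: datetime):
    """Return jobs that have sat in POLLED longer than the timeout.

    The engine runs this check every minute and marks the returned
    jobs as failed.
    """
    return [
        job for job in jobs
        if job.status == "POLLED" and now - job.polled_at > STUCK_JOB_TIMEOUT
    ]
```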
## API Reference
| Endpoint | Method | Description |
|---|---|---|
| `/workers/register` | POST | Register a new worker |
| `/workers/:uuid/ping` | POST | Send heartbeat |
| `/workers/:uuid/unregister` | POST | Graceful shutdown |
| `/workers` | GET | List all workers (add `?findDeleted=true` for soft-deleted) |
| `/workers/online` | GET | List online workers only |
| `/workers/:uuid` | GET | Worker details |
| `/workers/:uuid/function-blocks` | GET | Worker's registered FBs |
| `/workers/:uuid/jobs` | GET | Worker's jobs (filter with `?status=polled`) |
| `/workers/:uuid` | DELETE | Delete a worker |
| `/workers/cleanup` | POST | Manual cleanup (offline > 24h) |
| `/workers/cleanup-offline` | POST | Manual cleanup (all currently offline) |
## Scaling Workers
- Horizontal scaling -- Deploy multiple workers with the same function blocks. The blackboard distributes jobs across them.
- Specialized workers -- Deploy workers with different function block sets (e.g., one worker for Cisco FBs, another for Juniper).
- Geographic distribution -- Workers can run in different locations as long as they can reach the engine's REST API and the target devices.
## Monitoring Worker Health
Use GET /workers/online to check how many workers are available. If jobs are
queuing up (many PENDING jobs), add more workers. If workers are frequently going
UNREACHABLE, check network connectivity between workers and the engine.
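As a rough illustration, the guidance above could be scripted against `GET /workers/online` and the job queue; the counts here are assumed inputs and the backlog threshold is arbitrary, not an engine setting:

```python
def scaling_hint(pending_jobs: int, online_workers: int) -> str:
    """Translate queue depth and worker count into a scaling hint."""
    if online_workers == 0:
        return "no workers online: jobs cannot run"
    # Arbitrary threshold: more than 10 pending jobs per worker
    # suggests the pool is undersized.
    if pending_jobs > 10 * online_workers:
        return "job backlog growing: add more workers"
    return "ok"
```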