Execution Model
When you trigger a workflow, the engine does not run it in one shot. It traverses the step tree, creates discrete jobs, distributes them to workers, collects results, and decides what to do next. Understanding this flow helps you design workflows that execute efficiently and fail gracefully.
TL;DR
You submit a workflow. The engine resolves function blocks, acquires entities from the CMS,
locks them, then creates jobs on the blackboard. Workers poll for jobs, execute function
blocks, and push results. The engine advances step by step until all steps complete or a
failure occurs. On failure, the engine classifies it as FAILED_SAFE (nothing changed) or
FAILED_UNSAFE (side effects may have occurred).
Execution Flow
sequenceDiagram
participant User
participant Engine
participant CMS
participant BB as Blackboard
participant Worker
User->>Engine: Execute workflow
Engine->>Engine: Resolve FBs & validate
Engine->>CMS: Acquire & lock entities
CMS->>Engine: Locked entity data
loop For each step
Engine->>BB: Create EXECUTE jobs
Worker->>BB: Poll for matching jobs
BB->>Worker: Job (FB, params, context)
Worker->>Worker: Execute function block
Worker->>BB: Push result
BB->>Engine: Job result event
Engine->>Engine: Update context, decide next step
end
Engine->>CMS: Unlock, apply DB updates
Engine->>Engine: Mark COMPLETED
Step Traversal
The engine processes steps using depth-first traversal -- it fully completes each step (including all nested sub-steps in embedded workflows) before moving to the next one. For each step:
- Evaluate condition -- If the step has a condition, evaluate it against the current context. Skip the step if the condition is false.
- Evaluate assertions -- If the step has assert clauses, evaluate them. Fail the step if any assertion is false.
- Create jobs -- For each entity in scope (determined by runOn and the acquired context), create a job on the blackboard.
- Wait for results -- The engine pauses this execution path until all jobs for this step complete.
- Process results -- Add step results to the context. If a job failed, apply retry/error handling logic.
- Advance -- Move to the next step in the sequence.
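The traversal above can be sketched in a few lines of Python. This is an illustration only, not the engine's code: `Step`, `StepFailed`, and `run_steps` are hypothetical names, a step is assumed to be either a leaf or an embedded workflow, and a synchronous call per entity stands in for creating jobs on the blackboard and waiting for their results.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Context = Dict[str, Any]


class StepFailed(Exception):
    """Raised when an assertion on a step evaluates to false."""


@dataclass
class Step:
    # Hypothetical step model; the real workflow schema may differ.
    name: str
    run: Optional[Callable[[str, Context], Any]] = None  # stand-in for a function block
    condition: Optional[Callable[[Context], bool]] = None
    asserts: List[Callable[[Context], bool]] = field(default_factory=list)
    sub_steps: List["Step"] = field(default_factory=list)  # embedded workflow


def run_steps(steps: List[Step], entities: List[str], context: Context) -> Context:
    for step in steps:
        # Evaluate condition: skip the step if it is false.
        if step.condition is not None and not step.condition(context):
            continue
        # Evaluate assertions: fail the step if any is false.
        if not all(check(context) for check in step.asserts):
            raise StepFailed(step.name)
        if step.sub_steps:
            # Depth-first: finish every nested sub-step before advancing.
            run_steps(step.sub_steps, entities, context)
        elif step.run is not None:
            # One call per entity stands in for "create jobs, wait for results".
            context[step.name] = {e: step.run(e, context) for e in entities}
    return context
```

Because the recursion returns only after every nested sub-step has finished, results from an embedded workflow are already in the context when the next top-level step evaluates its condition.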
Per-Entity Execution
A step with runOn: device in a workflow with 100 devices in scope creates 100 jobs -- one per device. These jobs can execute in parallel across multiple workers.
graph LR
Step["Step: show_version<br/>(runOn: device)"] --> J1["Job: device-1"]
Step --> J2["Job: device-2"]
Step --> J3["Job: device-3"]
Step --> JN["Job: device-N"]
J1 --> W1["Worker 1"]
J2 --> W1
J3 --> W2["Worker 2"]
JN --> W2
Workers pick up jobs for function blocks they have registered. Multiple workers can process jobs for the same step concurrently.
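As a rough illustration of the fan-out and of worker matching, the sketch below models the blackboard as a plain list. `Job`, `create_jobs`, and `poll` are hypothetical names, and first-match polling is an assumption; the real blackboard API and its ordering guarantees are not described here.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Job:
    # Hypothetical job record; real jobs also carry params and context.
    step: str
    function_block: str
    entity: str


def create_jobs(step: str, function_block: str, entities: List[str]) -> List[Job]:
    """Create one job per entity in scope."""
    return [Job(step, function_block, entity) for entity in entities]


def poll(jobs: List[Job], registered: Set[str]) -> Optional[Job]:
    """Remove and return the first job whose function block the worker has registered."""
    for index, job in enumerate(jobs):
        if job.function_block in registered:
            return jobs.pop(index)
    return None


# 100 devices in scope produce 100 jobs for the step.
devices = [f"device-{n}" for n in range(1, 101)]
queue = create_jobs("show_version", "show_version", devices)
```

A worker that has not registered `show_version` gets nothing back from `poll`, while any number of workers that have registered it can drain the same queue concurrently.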
Parallel Execution Strategy
By default, the engine waits for all jobs in a step to complete before advancing to the next step. With the workflow-level setting config.executionStrategy.parallel: true, the engine allows per-entity pipeline execution:
Device 1: [Step A] → [Step B] → [Step C]
Device 2: [Step A] → [Step B] → ...
Device 3: [Step A] → ...
Device 1 can be on Step C while Device 3 is still on Step A. The engine tracks state per entity.
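To see why the setting matters, the following sketch compares per-entity finish times under both strategies. It is a simplified model under stated assumptions: the step durations are made up, there are enough workers that no job ever queues, and no step depends on another entity's result. The function names are hypothetical.

```python
from typing import Dict, List


def barrier_finish_times(durations: Dict[str, List[int]]) -> Dict[str, int]:
    """Default strategy: a step starts only after every entity finished the previous one."""
    n_steps = len(next(iter(durations.values())))
    step_start = 0
    finish: Dict[str, int] = {}
    for i in range(n_steps):
        finish = {entity: step_start + d[i] for entity, d in durations.items()}
        step_start = max(finish.values())  # the slowest entity gates the next step
    return finish


def pipeline_finish_times(durations: Dict[str, List[int]]) -> Dict[str, int]:
    """Parallel strategy: each entity advances through its own steps independently."""
    return {entity: sum(d) for entity, d in durations.items()}


# Made-up durations (in arbitrary time units) for steps A, B, C.
durations = {
    "device-1": [1, 1, 1],
    "device-2": [5, 1, 1],
    "device-3": [1, 4, 1],
}
print(barrier_finish_times(durations))   # {'device-1': 10, 'device-2': 10, 'device-3': 10}
print(pipeline_finish_times(durations))  # {'device-1': 3, 'device-2': 7, 'device-3': 6}
```

In this model the fast device finishes at time 3 instead of 10 with pipelining, because it no longer waits for slower devices at each step boundary.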
Workflow Execution States
Every execution transitions through a state machine:
stateDiagram-v2
classDef goodState stroke:green
classDef warningState stroke:orange
classDef badState stroke:red
class NEW goodState
class READY goodState
class VALID goodState
class LOCKING goodState
class LOCKED goodState
class BLOCKED_WAITING goodState
class RESOURCE_DISCOVERY goodState
class RESOURCES_DISCOVERED goodState
class SCHEDULED goodState
class RUNNING goodState
class COMPLETED goodState
class COMPLETED_ACK goodState
class FAILED_SAFE warningState
class ERROR warningState
class ROLLBACK warningState
class FAILED_UNSAFE badState
class FAILED_UNSAFE_ACK badState
class FAILED_SAFE_ACK warningState
[*] --> NEW
NEW --> VALID: resolved & validated
NEW --> FAILED_SAFE: validation failed
VALID --> READY: dependencies met
READY --> RESOURCE_DISCOVERY: start acquire
RESOURCE_DISCOVERY --> RESOURCES_DISCOVERED: acquire complete
RESOURCES_DISCOVERED --> SCHEDULED: ready for locking
SCHEDULED --> LOCKING: lock request sent
LOCKING --> LOCKED: entities locked
LOCKED --> RUNNING: execution started
RUNNING --> COMPLETED: all steps succeeded
RUNNING --> ERROR: step failed
ERROR --> ROLLBACK: rollback initiated
ERROR --> FAILED_SAFE: pure execution
ERROR --> FAILED_UNSAFE: side effects occurred
ROLLBACK --> FAILED_SAFE: rollback succeeded
ROLLBACK --> FAILED_UNSAFE: rollback failed
COMPLETED --> [*]
FAILED_SAFE --> [*]
FAILED_UNSAFE --> [*]
| State | What's happening |
|---|---|
| NEW | Execution created, waiting for resolution and validation |
| VALID | Workflow definition and function blocks resolved successfully |
| BLOCKED_WAITING | Waiting for external conditions to be met (reserved for future use) |
| READY | All preconditions met, ready to acquire resources |
| RESOURCE_DISCOVERY | Running ACQUIRE jobs to gather entity data |
| RESOURCES_DISCOVERED | Acquisition complete, entities identified |
| SCHEDULED | Waiting for entity locks |
| LOCKING | Lock request sent to CMS |
| LOCKED | Entities locked, execution can begin |
| RUNNING | Steps are actively executing |
| COMPLETED | All steps finished successfully |
| ERROR | A step failed, deciding on recovery action |
| ROLLBACK | Rolling back executed steps |
| FAILED_SAFE | Failed with no irreversible side effects |
| FAILED_UNSAFE | Failed with potential side effects |
Detailed lifecycle reference
For a detailed description of every state and its transitions, see the Workflow Lifecycle reference page.
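The transitions drawn in the state diagram can be written down as a lookup table, which is handy for checking an observed state history in tests or tooling. This is a sketch derived only from the arrows in the diagram on this page; `TRANSITIONS` and `is_valid_path` are hypothetical names, and states that have no arrows in the diagram (BLOCKED_WAITING and the `_ACK` states) are left out.

```python
from typing import Dict, List, Set

# Allowed transitions, copied from the arrows in the state diagram.
TRANSITIONS: Dict[str, Set[str]] = {
    "NEW": {"VALID", "FAILED_SAFE"},
    "VALID": {"READY"},
    "READY": {"RESOURCE_DISCOVERY"},
    "RESOURCE_DISCOVERY": {"RESOURCES_DISCOVERED"},
    "RESOURCES_DISCOVERED": {"SCHEDULED"},
    "SCHEDULED": {"LOCKING"},
    "LOCKING": {"LOCKED"},
    "LOCKED": {"RUNNING"},
    "RUNNING": {"COMPLETED", "ERROR"},
    "ERROR": {"ROLLBACK", "FAILED_SAFE", "FAILED_UNSAFE"},
    "ROLLBACK": {"FAILED_SAFE", "FAILED_UNSAFE"},
    # Terminal states in the diagram.
    "COMPLETED": set(),
    "FAILED_SAFE": set(),
    "FAILED_UNSAFE": set(),
}


def is_valid_path(path: List[str]) -> bool:
    """Return True if the path starts at NEW and every hop is an allowed transition."""
    if not path or path[0] != "NEW":
        return False
    return all(
        nxt in TRANSITIONS.get(cur, set())
        for cur, nxt in zip(path, path[1:])
    )


happy_path = [
    "NEW", "VALID", "READY", "RESOURCE_DISCOVERY", "RESOURCES_DISCOVERED",
    "SCHEDULED", "LOCKING", "LOCKED", "RUNNING", "COMPLETED",
]
```

Here `is_valid_path(happy_path)` holds, while a history that jumps from NEW straight to RUNNING is rejected.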
Error Handling
When a step fails, the engine follows this sequence:
- Classify failure -- Based on the pure/idempotent properties of all executed steps, mark the workflow as FAILED_SAFE or FAILED_UNSAFE.
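One possible reading of the classification rule is sketched below. The exact rule is an assumption: this page says only that the pure/idempotent properties of executed steps are used, and the state diagram labels the ERROR to FAILED_SAFE transition "pure execution". The sketch therefore treats a failure as safe only when every executed step is pure, and it deliberately ignores the idempotent flag because how the engine weighs it is not documented here. `ExecutedStep` and `classify_failure` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ExecutedStep:
    # Hypothetical record of a step that ran before the failure.
    name: str
    pure: bool = False        # assumed meaning: the step has no side effects
    idempotent: bool = False  # recorded, but not used by this sketch


def classify_failure(executed: Iterable[ExecutedStep]) -> str:
    """Assumed rule: safe only if every executed step was pure."""
    if all(step.pure for step in executed):
        return "FAILED_SAFE"
    return "FAILED_UNSAFE"
```

Under this reading, a single executed step that is not pure is enough to mark the whole workflow FAILED_UNSAFE, even if the step that actually failed was itself pure.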
Implementation status
The following are planned but not yet implemented:
- continueOnError -- skip to next step on failure (schema-defined, not enforced)
- retryConfig -- automatic step-level retries (schema-defined, not consumed)
- Rollback -- reverse executed steps (framework placeholder, no logic)
Stuck job detection
If a worker takes too long to complete a job (default: 12 minutes), the engine marks the job as failed. This prevents a single unresponsive worker from blocking an entire workflow. Workers that stop sending heartbeats are first marked as unreachable (2 min) and then offline (6 min). Once a worker is offline, its in-flight jobs are auto-failed by the stuck job cleanup service.
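The thresholds above can be expressed as a small decision function. This is a sketch with several assumptions: both worker thresholds are measured from the last heartbeat, the comparisons are inclusive, and the `"online"` status name and both function names are made up for illustration. Only the 2, 6, and 12 minute values come from this page.

```python
from datetime import timedelta

# Values from this page; how they are configured is not shown here.
UNREACHABLE_AFTER = timedelta(minutes=2)
OFFLINE_AFTER = timedelta(minutes=6)
JOB_TIMEOUT = timedelta(minutes=12)  # default stuck-job timeout


def worker_status(since_last_heartbeat: timedelta) -> str:
    """Map the time since the last heartbeat to a worker status."""
    if since_last_heartbeat >= OFFLINE_AFTER:
        return "offline"
    if since_last_heartbeat >= UNREACHABLE_AFTER:
        return "unreachable"
    return "online"  # assumed name for a healthy worker


def should_fail_job(job_age: timedelta, since_last_heartbeat: timedelta) -> bool:
    """A job is failed if it ran past the timeout or its worker went offline."""
    return (
        job_age >= JOB_TIMEOUT
        or worker_status(since_last_heartbeat) == "offline"
    )
```

With these numbers, a job on a worker that dies silently is failed after about 6 minutes instead of waiting for the full 12 minute timeout, while an unreachable worker that recovers within 6 minutes keeps its in-flight jobs.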