Execution Model

When you trigger a workflow, the engine does not run it in one shot. It traverses the step tree, creates discrete jobs, distributes them to workers, collects results, and decides what to do next. Understanding this flow helps you design workflows that execute efficiently and fail gracefully.

TL;DR

You submit a workflow. The engine resolves function blocks, acquires entities from the CMS, locks them, then creates jobs on the blackboard. Workers poll for jobs, execute function blocks, and push results. The engine advances step by step until all steps complete or a failure occurs. On failure, the engine classifies it as FAILED_SAFE (nothing changed) or FAILED_UNSAFE (side effects may have occurred).

Execution Flow

sequenceDiagram
    participant User
    participant Engine
    participant CMS
    participant BB as Blackboard
    participant Worker

    User->>Engine: Execute workflow
    Engine->>Engine: Resolve FBs & validate
    Engine->>CMS: Acquire & lock entities
    CMS->>Engine: Locked entity data

    loop For each step
        Engine->>BB: Create EXECUTE jobs
        Worker->>BB: Poll for matching jobs
        BB->>Worker: Job (FB, params, context)
        Worker->>Worker: Execute function block
        Worker->>BB: Push result
        BB->>Engine: Job result event
        Engine->>Engine: Update context, decide next step
    end

    Engine->>CMS: Unlock, apply DB updates
    Engine->>Engine: Mark COMPLETED
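
The round trip in the diagram (create job, poll, execute, push result) can be sketched with an in-memory stand-in for the blackboard. This is a minimal illustration, not the engine's actual API: the `Job` and `Blackboard` classes, `worker_tick`, and the `handlers` mapping are all hypothetical names introduced here.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: int
    function_block: str  # name of the function block (FB) to run
    params: dict
    result: object = None
    done: bool = False


@dataclass
class Blackboard:
    """Hypothetical in-memory stand-in for the real blackboard."""
    pending: deque = field(default_factory=deque)
    finished: list = field(default_factory=list)

    def create_job(self, job):
        self.pending.append(job)

    def poll(self, registered_fbs):
        # Hand out the first pending job whose FB the worker has registered.
        for job in list(self.pending):
            if job.function_block in registered_fbs:
                self.pending.remove(job)
                return job
        return None

    def push_result(self, job, result):
        job.result, job.done = result, True
        self.finished.append(job)


def worker_tick(bb, handlers):
    """One poll cycle: take a matching job, run its FB, push the result."""
    job = bb.poll(handlers.keys())
    if job is None:
        return False
    bb.push_result(job, handlers[job.function_block](**job.params))
    return True


# Engine side: create a job. Worker side: poll until nothing matches.
bb = Blackboard()
bb.create_job(Job(1, "show_version", {"device": "device-1"}))
handlers = {"show_version": lambda device: f"{device}: ok"}
while worker_tick(bb, handlers):
    pass
results = {job.job_id: job.result for job in bb.finished}
```

In this sketch the engine would read `bb.finished` to update its context; the real engine is described as receiving a job result event instead.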

Step Traversal

The engine processes steps using depth-first traversal -- it fully completes each step (including all nested sub-steps in embedded workflows) before moving to the next one. For each step:

  1. Evaluate condition -- If the step has a condition, evaluate it against the current context. Skip if false.
  2. Evaluate assertions -- If the step has assert clauses, evaluate them. Fail the step if any assertion is false.
  3. Create jobs -- For each entity in scope (determined by runOn and the acquired context), create a job on the blackboard.
  4. Wait for results -- The engine pauses this execution path until all jobs for this step complete.
  5. Process results -- Add step results to the context. If a job failed, apply retry/error handling logic.
  6. Advance -- Move to the next step in the sequence.
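
The six steps above can be sketched as a recursive loop. This is a simplified illustration under stated assumptions: steps are plain dicts, `condition` and `assert` entries are Python callables, embedded workflows appear under a hypothetical `steps` key, and `run_job` stands in for creating a job and waiting for its result. Retry and error handling for failed jobs are left out.

```python
def run_steps(steps, context, entities, run_job):
    """Depth-first walk over a step list; returns the ordered event log."""
    log = []
    for step in steps:
        # 1. Condition: skip the step when it evaluates to false.
        condition = step.get("condition")
        if condition is not None and not condition(context):
            log.append(("skipped", step["name"]))
            continue
        # 2. Assertions: any false assertion fails the step.
        if not all(check(context) for check in step.get("assert", [])):
            log.append(("failed", step["name"]))
            break
        # Embedded workflow: finish all nested steps before moving on.
        if "steps" in step:
            nested = run_steps(step["steps"], context, entities, run_job)
            log.extend(nested)
            if nested and nested[-1][0] == "failed":
                break
            continue
        # 3-4. One job per entity in scope; wait for all of them.
        results = {entity: run_job(step, entity) for entity in entities}
        # 5. Add the step results to the context.
        context[step["name"]] = results
        log.append(("done", step["name"]))
        # 6. The loop advances to the next step.
    return log


steps = [
    {"name": "precheck", "condition": lambda ctx: ctx["enabled"]},
    {"name": "upgrade", "steps": [
        {"name": "backup"},
        {"name": "install", "condition": lambda ctx: False},
    ]},
    {"name": "verify", "assert": [lambda ctx: "backup" in ctx]},
]
context = {"enabled": True}
log = run_steps(steps, context, ["device-1", "device-2"],
                lambda step, entity: "ok")
```

Because the nested `upgrade` steps run to completion before `verify` starts, the assertion in `verify` can rely on the `backup` result already being in the context.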

Per-Entity Execution

A step with runOn: device in a workflow with 100 devices in scope creates 100 jobs -- one per device. These jobs can execute in parallel across multiple workers.

graph LR
    Step["Step: show_version<br/>(runOn: device)"] --> J1["Job: device-1"]
    Step --> J2["Job: device-2"]
    Step --> J3["Job: device-3"]
    Step --> JN["Job: device-N"]

    J1 --> W1["Worker 1"]
    J2 --> W1
    J3 --> W2["Worker 2"]
    JN --> W2

Workers pick up jobs for function blocks they have registered. Multiple workers can process jobs for the same step concurrently.
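
As a rough sketch of the fan-out, the function below creates one job per in-scope entity whose type matches the step's runOn. The dict shapes, the `type` and `id` keys, and the extra `site` entity are assumptions made for illustration only.

```python
def create_jobs(step, entities_in_scope):
    """Return one job per entity of the type named by the step's runOn."""
    targets = [e for e in entities_in_scope if e["type"] == step["runOn"]]
    return [
        {"step": step["name"], "fb": step["fb"], "entity": e["id"]}
        for e in targets
    ]


# 100 devices plus one entity of a different (hypothetical) type.
scope = [{"type": "device", "id": f"device-{i}"} for i in range(1, 101)]
scope.append({"type": "site", "id": "site-1"})
step = {"name": "show_version", "fb": "show_version", "runOn": "device"}
jobs = create_jobs(step, scope)
```

Each job carries only its own entity, which is what lets separate workers process jobs from the same step without coordinating with each other.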

Parallel Execution Strategy

By default, the engine waits for all jobs in a step to complete before advancing to the next step. With the workflow-level setting config.executionStrategy.parallel: true, the engine allows per-entity pipeline execution:

Device 1:  [Step A] → [Step B] → [Step C]
Device 2:  [Step A] → [Step B] → ...
Device 3:  [Step A] → ...

Device 1 can be on Step C while Device 3 is still on Step A. The engine tracks state per entity.
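
The difference between the two strategies can be illustrated with a small sketch. It assumes the engine records, per entity, how many steps that entity has completed; the function names and the `progress` mapping are hypothetical.

```python
def barrier_order(steps, entities):
    """Default: every entity finishes a step before any starts the next."""
    return [(step, entity) for step in steps for entity in entities]


def next_step_per_entity(progress, steps):
    """Pipeline mode: an entity's next step depends only on its own progress."""
    return {
        entity: steps[done] if done < len(steps) else None
        for entity, done in progress.items()
    }


steps = ["A", "B", "C"]
# Number of steps each entity has already completed.
progress = {"device-1": 2, "device-2": 1, "device-3": 0}
upcoming = next_step_per_entity(progress, steps)
default_order = barrier_order(["A", "B"], ["device-1", "device-2"])
```

In the default mode the step is the outer loop, so a slow entity holds everyone back; in pipeline mode the entity is the unit of progress.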

Workflow Execution States

Every execution transitions through a state machine:

stateDiagram-v2
        classDef goodState stroke:green
        classDef warningState stroke:orange
        classDef badState stroke:red

        class NEW goodState
        class READY goodState
        class VALID goodState
        class LOCKING goodState
        class LOCKED goodState
        class BLOCKED_WAITING goodState
        class RESOURCE_DISCOVERY goodState
        class RESOURCES_DISCOVERED goodState
        class SCHEDULED goodState
        class RUNNING goodState
        class COMPLETED goodState
        class COMPLETED_ACK goodState
        class FAILED_SAFE warningState
        class ERROR warningState
        class ROLLBACK warningState
        class FAILED_UNSAFE badState
        class FAILED_UNSAFE_ACK badState
        class FAILED_SAFE_ACK warningState

    [*] --> NEW
    NEW --> VALID: resolved & validated
    NEW --> FAILED_SAFE: validation failed
    VALID --> READY: dependencies met
    READY --> RESOURCE_DISCOVERY: start acquire
    RESOURCE_DISCOVERY --> RESOURCES_DISCOVERED: acquire complete
    RESOURCES_DISCOVERED --> SCHEDULED: ready for locking
    SCHEDULED --> LOCKING: lock request sent
    LOCKING --> LOCKED: entities locked
    LOCKED --> RUNNING: execution started
    RUNNING --> COMPLETED: all steps succeeded
    RUNNING --> ERROR: step failed
    ERROR --> ROLLBACK: rollback initiated
    ERROR --> FAILED_SAFE: pure execution
    ERROR --> FAILED_UNSAFE: side effects occurred
    ROLLBACK --> FAILED_SAFE: rollback succeeded
    ROLLBACK --> FAILED_UNSAFE: rollback failed
    COMPLETED --> [*]
    FAILED_SAFE --> [*]
    FAILED_UNSAFE --> [*]

| State | What's happening |
| --- | --- |
| NEW | Execution created, waiting for resolution and validation |
| VALID | Workflow definition and function blocks resolved successfully |
| BLOCKED_WAITING | Waiting for external conditions to be met (reserved for future use) |
| READY | All preconditions met, ready to acquire resources |
| RESOURCE_DISCOVERY | Running ACQUIRE jobs to gather entity data |
| RESOURCES_DISCOVERED | Acquisition complete, entities identified |
| SCHEDULED | Waiting for entity locks |
| LOCKING | Lock request sent to CMS |
| LOCKED | Entities locked, execution can begin |
| RUNNING | Steps are actively executing |
| COMPLETED | All steps finished successfully |
| ERROR | A step failed, deciding on recovery action |
| ROLLBACK | Rolling back executed steps |
| FAILED_SAFE | Failed with no irreversible side effects |
| FAILED_UNSAFE | Failed with potential side effects |
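
The transitions drawn in the state diagram can be encoded as a lookup table, which makes illegal jumps easy to reject. This is a sketch built only from the arrows shown in the diagram; `advance` and `is_terminal` are hypothetical helpers, and BLOCKED_WAITING and the `_ACK` states are omitted because the diagram shows no transitions for them.

```python
# Allowed transitions, copied from the arrows in the state diagram.
TRANSITIONS = {
    "NEW": {"VALID", "FAILED_SAFE"},
    "VALID": {"READY"},
    "READY": {"RESOURCE_DISCOVERY"},
    "RESOURCE_DISCOVERY": {"RESOURCES_DISCOVERED"},
    "RESOURCES_DISCOVERED": {"SCHEDULED"},
    "SCHEDULED": {"LOCKING"},
    "LOCKING": {"LOCKED"},
    "LOCKED": {"RUNNING"},
    "RUNNING": {"COMPLETED", "ERROR"},
    "ERROR": {"ROLLBACK", "FAILED_SAFE", "FAILED_UNSAFE"},
    "ROLLBACK": {"FAILED_SAFE", "FAILED_UNSAFE"},
    "COMPLETED": set(),
    "FAILED_SAFE": set(),
    "FAILED_UNSAFE": set(),
}


def advance(current, target):
    """Return the new state, or raise if the diagram has no such arrow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target


def is_terminal(state):
    """A state is terminal when it has no outgoing transitions."""
    return not TRANSITIONS[state]


happy_path = ["VALID", "READY", "RESOURCE_DISCOVERY", "RESOURCES_DISCOVERED",
              "SCHEDULED", "LOCKING", "LOCKED", "RUNNING", "COMPLETED"]
state = "NEW"
for target in happy_path:
    state = advance(state, target)
```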

Detailed lifecycle reference

For a detailed description of every state and its transitions, see the Workflow Lifecycle reference page.

Error Handling

When a step fails, the engine currently performs a single action:

  1. Classify failure -- Based on the pure/idempotent properties of all executed steps, mark the workflow as FAILED_SAFE or FAILED_UNSAFE.
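
One possible reading of the classification rule is sketched below. It rests on assumptions: each executed step is a dict with a hypothetical `pure` flag, a missing flag is treated conservatively as impure, and the idempotent property is ignored because the text does not say how it is weighed.

```python
def classify_failure(executed_steps):
    """Return FAILED_SAFE only if every executed step is marked pure.

    Assumption: a step without a "pure" flag may have had side effects.
    """
    if all(step.get("pure", False) for step in executed_steps):
        return "FAILED_SAFE"
    return "FAILED_UNSAFE"


read_only_run = [{"name": "show_version", "pure": True}]
mixed_run = [{"name": "show_version", "pure": True},
             {"name": "push_config", "pure": False}]
outcomes = (classify_failure(read_only_run), classify_failure(mixed_run))
```

With no executed steps the function returns FAILED_SAFE, which lines up with the diagram's NEW to FAILED_SAFE transition for validation failures.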

Implementation status

The following are planned but not yet implemented:

  • continueOnError -- skip to next step on failure (schema-defined, not enforced)
  • retryConfig -- automatic step-level retries (schema-defined, not consumed)
  • Rollback -- reverse executed steps (framework placeholder, no logic)

Stuck job detection

If a worker takes too long to complete a job (default: 12 minutes), the engine marks the job as failed, so a single unresponsive worker cannot block an entire workflow. Workers that stop sending heartbeats are first marked as unreachable (2 min) and then offline (6 min). Once a worker is offline, the stuck job cleanup service auto-fails its in-flight jobs.
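
The thresholds above can be expressed as a small sketch. It assumes both heartbeat thresholds are measured from the last heartbeat received, which the text does not state explicitly; the constant names, the `online` status, and both functions are hypothetical.

```python
from datetime import timedelta

JOB_TIMEOUT = timedelta(minutes=12)       # default job timeout
UNREACHABLE_AFTER = timedelta(minutes=2)  # assumed: since last heartbeat
OFFLINE_AFTER = timedelta(minutes=6)      # assumed: since last heartbeat


def worker_status(since_last_heartbeat):
    """Map heartbeat silence to a worker status."""
    if since_last_heartbeat >= OFFLINE_AFTER:
        return "offline"
    if since_last_heartbeat >= UNREACHABLE_AFTER:
        return "unreachable"
    return "online"


def should_fail_job(job_runtime, since_last_heartbeat):
    """Fail a job that ran past the timeout or whose worker is offline."""
    return (job_runtime >= JOB_TIMEOUT
            or worker_status(since_last_heartbeat) == "offline")


statuses = [worker_status(timedelta(minutes=m)) for m in (1, 3, 7)]
```

Under these assumptions a job on a silent worker is failed after 6 minutes rather than 12, since the offline check fires before the job timeout does.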