Workflow as a Transaction

Network operations are inherently risky. A config push might succeed on half your devices and fail on the rest. A firmware upgrade might complete, but the verification step might time out. Traditional scripts leave you guessing: what actually happened?

The workflow engine borrows from database transaction theory to bring predictability to this chaos. Every workflow execution is tracked as a transaction with well-defined failure semantics.

The Core Idea

Group a sequence of operations so the engine always knows the state of the world:

What ran -- which steps executed on which devices
What changed -- which steps had side effects (configuration writes, entity modifications)
What's safe -- whether the execution can be retried, rolled back, or needs manual intervention

The engine achieves this through the purity and idempotency contracts declared on function blocks (isPure, isIdempotent), combined with entity locking and atomic state transitions.
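The three questions above can be sketched as queries over a per-execution log. This is a minimal illustration, not the engine's data model: `StepRecord`, `ExecutionLog`, and their field names are hypothetical, and only the notions of pure and idempotent steps come from the text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StepRecord:
    """One executed step on one device (hypothetical shape)."""
    step: str
    device: str
    is_pure: bool        # read-only, no side effects
    is_idempotent: bool  # safe to run again


@dataclass
class ExecutionLog:
    """Hypothetical log of everything a single workflow execution ran."""
    records: list[StepRecord] = field(default_factory=list)

    def what_ran(self) -> list[tuple[str, str]]:
        # Every (step, device) pair that executed.
        return [(r.step, r.device) for r in self.records]

    def what_changed(self) -> list[tuple[str, str]]:
        # Only the non-pure steps can have had side effects.
        return [(r.step, r.device) for r in self.records if not r.is_pure]

    def nothing_changed(self) -> bool:
        # True when no executed step could have modified anything.
        return not self.what_changed()


log = ExecutionLog()
log.records.append(StepRecord("collect_facts", "sw1", True, True))
log.records.append(StepRecord("push_config", "sw1", False, True))
print(log.what_changed())
```

With a log like this, "what's safe" becomes a function of the recorded flags rather than something reconstructed after the fact.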

ACID-Like Properties

Property	How neops implements it
Atomicity	Operations within a workflow are tracked as a logical unit. DB updates are applied atomically on completion.
Consistency	Workflow definitions are validated before execution. Parameter schemas are checked. Assertions can guard steps.
Isolation	Entity locking prevents concurrent workflows from modifying the same devices, interfaces, or groups.
Durability	Every state transition is persisted. Results, logs, and DB updates survive restarts.

What this means for you

If your workflow only reads data before failing → the engine tells you nothing changed (FAILED_SAFE)
If your workflow wrote some config before failing → the engine tells you manual review is needed (FAILED_UNSAFE)
No more guessing what state your network is in after a failure

Failure Classification

When a workflow fails, the engine classifies the failure based on what actually executed:

stateDiagram-v2
        classDef goodState stroke:green
        classDef warningState stroke:orange
        classDef badState stroke:red

        class NEW goodState
        class READY goodState
        class VALID goodState
        class LOCKING goodState
        class LOCKED goodState
        class BLOCKED_WAITING goodState
        class RESOURCE_DISCOVERY goodState
        class RESOURCES_DISCOVERED goodState
        class SCHEDULED goodState
        class RUNNING goodState
        class COMPLETED goodState
        class COMPLETED_ACK goodState
        class FAILED_SAFE warningState
        class ERROR warningState
        class ROLLBACK warningState
        class FAILED_UNSAFE badState
        class FAILED_UNSAFE_ACK badState
        class FAILED_SAFE_ACK warningState

    [*] --> NEW
    NEW --> VALID
    NEW --> FAILED_SAFE
    VALID --> LOCKED: acquire, lock
    LOCKED --> RUNNING
    RUNNING --> COMPLETED
    RUNNING --> FAILED_SAFE
    RUNNING --> FAILED_UNSAFE
    COMPLETED --> [*]
    FAILED_SAFE --> [*]
    FAILED_UNSAFE --> [*]
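
The edges drawn in the diagram can be read as a transition table. The sketch below encodes only those edges; the other states named in the class list (for example LOCKING or ROLLBACK) are left out because the diagram does not show their transitions. `ALLOWED_TRANSITIONS` and `transition` are hypothetical names, not engine APIs.

```python
# Transition table mirroring only the edges drawn in the diagram.
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "NEW": {"VALID", "FAILED_SAFE"},
    "VALID": {"LOCKED"},
    "LOCKED": {"RUNNING"},
    "RUNNING": {"COMPLETED", "FAILED_SAFE", "FAILED_UNSAFE"},
    # Terminal states have no outgoing edges.
    "COMPLETED": set(),
    "FAILED_SAFE": set(),
    "FAILED_UNSAFE": set(),
}


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the edge is not in the table."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target


state = "NEW"
for nxt in ("VALID", "LOCKED", "RUNNING", "FAILED_SAFE"):
    state = transition(state, nxt)
print(state)
```

Rejecting any edge that is not in the table is one way to keep every persisted state change well-defined.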

FAILED_SAFE

The workflow failed, but nothing irreversible happened. This occurs when:

The workflow failed before any steps executed
All executed steps were pure (read-only) -- no side effects at all
The workflow was successfully rolled back

You can safely discard or re-run this workflow without worrying about leftover state.

FAILED_UNSAFE

The workflow failed and some side effects may have occurred. This happens when:

A non-pure step executed before the failure (once auto-retry is implemented, this narrows to non-pure, non-idempotent steps)
The rollback itself failed
An external system was modified in a way that cannot be automatically undone

Manual review is required to assess and correct the state.

The Decision Tree

graph TD
    Fail["Workflow step failed"] --> AnyExec{"Did any step<br/>execute successfully?"}
    AnyExec -->|No| Safe1["FAILED_SAFE"]
    AnyExec -->|Yes| AllPure{"Were all executed<br/>steps pure?"}
    AllPure -->|Yes| Safe2["FAILED_SAFE"]
    AllPure -->|No| AllIdemp{"Were all executed<br/>steps idempotent?"}
    AllIdemp -->|Yes| CanRetry["Auto-retry eligible<br/>(planned)"]
    CanRetry --> RetryOk{"Retry<br/>succeeded?"}
    RetryOk -->|Yes| Completed["COMPLETED"]
    RetryOk -->|No| Unsafe1["FAILED_UNSAFE"]
    AllIdemp -->|No| Unsafe2["FAILED_UNSAFE"]

Implementation status

Today: Only the pure → FAILED_SAFE path is active. If any non-pure step executed, the result is FAILED_UNSAFE regardless of idempotency. The idempotent → auto-retry path shown above is the planned design. The engine already tracks isIdempotentExecution but does not yet use it for retry decisions.
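
The difference between today's behavior and the planned design can be sketched as a single function with a switch. This is an illustration under stated assumptions: `ExecutedStep`, `classify_failure`, the `planned_retry` flag, and the `RETRY_ELIGIBLE` marker are hypothetical, and the sketch stops at retry eligibility rather than modelling the retry itself.

```python
from typing import NamedTuple


class ExecutedStep(NamedTuple):
    """A step that ran before the failure (hypothetical shape)."""
    name: str
    is_pure: bool
    is_idempotent: bool


def classify_failure(executed: list[ExecutedStep],
                     planned_retry: bool = False) -> str:
    """Classify a failed workflow from the steps that executed.

    planned_retry=False follows the behavior described as active today;
    planned_retry=True follows the planned decision tree.
    """
    if not executed:
        return "FAILED_SAFE"      # nothing ran
    if all(s.is_pure for s in executed):
        return "FAILED_SAFE"      # only read-only steps ran
    if planned_retry and all(s.is_idempotent for s in executed):
        return "RETRY_ELIGIBLE"   # hypothetical marker, not an engine state
    return "FAILED_UNSAFE"        # side effects may remain


steps = [
    ExecutedStep("collect_facts", is_pure=True, is_idempotent=True),
    ExecutedStep("push_config", is_pure=False, is_idempotent=True),
]
print(classify_failure(steps))
print(classify_failure(steps, planned_retry=True))
```

In the planned design, a retry that fails would still end in FAILED_UNSAFE, as the decision tree shows.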

Entity Locking

Before execution, the engine acquires exclusive locks on all entities in the workflow's scope:

Acquisition -- The engine queries the CMS for the required entities
Locking -- The CMS grants exclusive write locks
Execution -- Steps run against the locked entities
Release -- On completion (success or failure), locks are released and DB updates are applied atomically

This prevents two workflows from modifying the same device simultaneously. If a device is already locked by another workflow, the new workflow waits in the SCHEDULED state until the lock becomes available.
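
The acquire, execute, release cycle can be sketched with an in-memory lock table. This is a single-process illustration only: in the system described here the CMS grants the locks, and `EntityLockTable`, its methods, and the all-or-nothing acquisition are assumptions made for the example. Where the real engine would keep the workflow waiting, the sketch simply reports that the entities are busy.

```python
from contextlib import contextmanager


class EntityLockTable:
    """Hypothetical in-memory stand-in for per-entity exclusive locks."""

    def __init__(self) -> None:
        self._owners: dict[str, str] = {}  # entity id -> workflow id

    def try_acquire(self, workflow_id: str, entities: list[str]) -> bool:
        # Assumption: all-or-nothing, so a workflow never holds a partial set.
        if any(e in self._owners for e in entities):
            return False
        for e in entities:
            self._owners[e] = workflow_id
        return True

    def release(self, workflow_id: str) -> None:
        # Drop every lock held by this workflow.
        held = [e for e, owner in self._owners.items() if owner == workflow_id]
        for e in held:
            del self._owners[e]

    @contextmanager
    def locked(self, workflow_id: str, entities: list[str]):
        if not self.try_acquire(workflow_id, entities):
            raise RuntimeError("entities busy; the workflow would wait")
        try:
            yield
        finally:
            # Released on success and on failure alike.
            self.release(workflow_id)


table = EntityLockTable()
with table.locked("wf-1", ["device-A"]):
    print(table.try_acquire("wf-2", ["device-A"]))  # device-A is held by wf-1
    print(table.try_acquire("wf-2", ["device-B"]))  # device-B is a different entity
```

The `finally` clause carries the main point: a failing step cannot leave a lock behind.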

Locking granularity

Locks are per-entity: a workflow locking device A does not block a different workflow operating on device B. Locks are exclusive -- only one workflow can hold a lock on a given entity at a time.

Practical Implications

For workflow authors:

Mark read-only function blocks as isPure: true -- a failure after only pure steps have run is classified as FAILED_SAFE
Mark config-push function blocks as isIdempotent: true if re-running them is safe (declarative config, not append-based) -- this will enable automatic retries when that feature is implemented
Order steps so pure (read-only) steps execute first -- if the workflow fails during the pure phase, it is automatically FAILED_SAFE
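
The ordering advice can be checked mechanically. The sketch below counts how many leading steps are pure, which is the part of a workflow where a failure is classified FAILED_SAFE. `Step` and `safe_prefix_length` are hypothetical helpers, and the flags stand in for the isPure and isIdempotent declarations; this is not how neops function blocks are actually defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Hypothetical step declaration mirroring isPure / isIdempotent."""
    name: str
    is_pure: bool = False
    is_idempotent: bool = False


def safe_prefix_length(steps: list[Step]) -> int:
    """Number of leading pure steps.

    A failure while still inside this prefix means only read-only
    steps have executed.
    """
    count = 0
    for step in steps:
        if not step.is_pure:
            break
        count += 1
    return count


workflow = [
    Step("collect_facts", is_pure=True, is_idempotent=True),
    Step("validate_inputs", is_pure=True, is_idempotent=True),
    Step("push_config", is_idempotent=True),    # declarative, safe to re-run
    Step("verify_config", is_pure=True, is_idempotent=True),
]
print(safe_prefix_length(workflow))
```

Moving a read-only check ahead of the first config push lengthens the safe prefix; a verification step that must run after the push naturally stays behind it.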

For operators:

FAILED_SAFE workflows can be safely ignored or re-triggered
FAILED_UNSAFE workflows need investigation -- check the execution details in the Monitor App to see which steps ran on which devices
Entity locks are released even on failure, so subsequent workflows are not blocked