# Workflow as a Transaction
Network operations are inherently risky. A config push might succeed on half your devices and fail on the rest. A firmware upgrade might complete but the verification step might time out. Traditional scripts leave you guessing: what actually happened?
The workflow engine borrows from database transaction theory to bring predictability to this chaos. Every workflow execution is tracked as a transaction with well-defined failure semantics.
## The Core Idea
Group a sequence of operations so the engine always knows the state of the world:
- What ran -- which steps executed on which devices
- What changed -- which steps had side effects (configuration writes, entity modifications)
- What's safe -- whether the execution can be retried, rolled back, or needs manual intervention
The engine achieves this through the pure and idempotent contracts on function blocks, combined with entity locking and atomic state transitions.
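As a sketch of what these contracts might look like in code (the decorator, class, and field names here are illustrative, not the actual neops API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class BlockContract:
    """Execution-safety contract a function block declares to the engine.

    Field names are hypothetical; the real engine's metadata may differ.
    """
    is_pure: bool = False        # no side effects at all (read-only)
    is_idempotent: bool = False  # re-running produces the same end state

def contract(*, is_pure: bool = False, is_idempotent: bool = False):
    """Attach a BlockContract to a function block."""
    def wrap(fn: Callable) -> Callable:
        fn.contract = BlockContract(is_pure, is_idempotent)
        return fn
    return wrap

@contract(is_pure=True)
def read_interface_status(device):
    """Pure: queries device state, changes nothing."""
    ...

@contract(is_idempotent=True)
def push_declarative_config(device, config):
    """Idempotent: applying the same declarative config twice is safe."""
    ...
```

The engine can inspect these declarations after a failure to decide how much of the world may have changed.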
## ACID-Like Properties
| Property | How neops implements it |
|---|---|
| Atomicity | Operations within a workflow are tracked as a logical unit. DB updates are applied atomically on completion. |
| Consistency | Workflow definitions are validated before execution. Parameter schemas are checked. Assertions can guard steps. |
| Isolation | Entity locking prevents concurrent workflows from modifying the same devices, interfaces, or groups. |
| Durability | Every state transition is persisted. Results, logs, and DB updates survive restarts. |
### What this means for you
- If your workflow only reads data before failing → the engine tells you nothing changed (`FAILED_SAFE`)
- If your workflow wrote some config before failing → the engine tells you manual review is needed (`FAILED_UNSAFE`)
- No more guessing what state your network is in after a failure
## Failure Classification
When a workflow fails, the engine classifies the failure based on what actually executed:
```mermaid
stateDiagram-v2
classDef goodState stroke:green
classDef warningState stroke:orange
classDef badState stroke:red
class NEW goodState
class READY goodState
class VALID goodState
class LOCKING goodState
class LOCKED goodState
class BLOCKED_WAITING goodState
class RESOURCE_DISCOVERY goodState
class RESOURCES_DISCOVERED goodState
class SCHEDULED goodState
class RUNNING goodState
class COMPLETED goodState
class COMPLETED_ACK goodState
class FAILED_SAFE warningState
class ERROR warningState
class ROLLBACK warningState
class FAILED_UNSAFE badState
class FAILED_UNSAFE_ACK badState
class FAILED_SAFE_ACK warningState
[*] --> NEW
NEW --> VALID
NEW --> FAILED_SAFE
VALID --> LOCKED: acquire locks
LOCKED --> RUNNING
RUNNING --> COMPLETED
RUNNING --> FAILED_SAFE
RUNNING --> FAILED_UNSAFE
COMPLETED --> [*]
FAILED_SAFE --> [*]
FAILED_UNSAFE --> [*]
```
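The happy-path transitions in the diagram above can be sketched as a guarded transition table (a simplified subset, illustrative only):

```python
# Simplified subset of the workflow state machine shown above.
# A transition is applied only if it appears in this table, which is
# how an engine keeps state changes well-defined.
ALLOWED = {
    "NEW": {"VALID", "FAILED_SAFE"},
    "VALID": {"LOCKED"},
    "LOCKED": {"RUNNING"},
    "RUNNING": {"COMPLETED", "FAILED_SAFE", "FAILED_UNSAFE"},
}

def transition(current: str, target: str) -> str:
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    # In the real engine each transition would be persisted here
    # (durability) before it takes effect.
    return target

state = "NEW"
for nxt in ("VALID", "LOCKED", "RUNNING", "COMPLETED"):
    state = transition(state, nxt)
print(state)  # COMPLETED
```

Terminal states (`COMPLETED`, `FAILED_SAFE`, `FAILED_UNSAFE`) have no outgoing entries, so any further transition attempt raises.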
### FAILED_SAFE
The workflow failed, but nothing irreversible happened. This occurs when:
- The workflow failed before any steps executed
- All executed steps were pure (read-only) -- no side effects at all
- The workflow was successfully rolled back
You can safely discard or re-run this workflow without worrying about leftover state.
### FAILED_UNSAFE
The workflow failed and some side effects may have occurred. This happens when:
- A non-pure, non-idempotent step executed before the failure
- The rollback itself failed
- An external system was modified in a way that cannot be automatically undone
Manual review is required to assess and correct the state.
## The Decision Tree
```mermaid
graph TD
Fail["Workflow step failed"] --> AnyExec{"Did any step<br/>execute successfully?"}
AnyExec -->|No| Safe1["FAILED_SAFE"]
AnyExec -->|Yes| AllPure{"Were all executed<br/>steps pure?"}
AllPure -->|Yes| Safe2["FAILED_SAFE"]
AllPure -->|No| AllIdemp{"Were all executed<br/>steps idempotent?"}
AllIdemp -->|Yes| CanRetry["Auto-retry eligible<br/>(planned)"]
CanRetry --> RetryOk{"Retry<br/>succeeded?"}
RetryOk -->|Yes| Completed["COMPLETED"]
RetryOk -->|No| Unsafe1["FAILED_UNSAFE"]
AllIdemp -->|No| Unsafe2["FAILED_UNSAFE"]
```
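The classification logic in this tree is small enough to write down directly. This is a sketch of the planned design, not the engine's actual code; `StepResult` and the `RETRY_ELIGIBLE` label are invented here for illustration:

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    """Contract flags of one successfully executed step (hypothetical)."""
    is_pure: bool
    is_idempotent: bool

def classify_failure(executed: list[StepResult]) -> str:
    """Classify a failed workflow per the decision tree above.

    The auto-retry branch is represented as "RETRY_ELIGIBLE" because
    that path is planned, not yet active (see Implementation status).
    """
    if not executed:
        return "FAILED_SAFE"        # nothing ran -> nothing changed
    if all(s.is_pure for s in executed):
        return "FAILED_SAFE"        # only read-only steps ran
    if all(s.is_pure or s.is_idempotent for s in executed):
        return "RETRY_ELIGIBLE"     # planned: safe to auto-retry
    return "FAILED_UNSAFE"          # non-idempotent side effects occurred
```

In today's engine the `RETRY_ELIGIBLE` case collapses into `FAILED_UNSAFE`, as described below.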
### Implementation status

Today, only the pure → `FAILED_SAFE` path is active: if any non-pure step executed, the result is `FAILED_UNSAFE` regardless of idempotency. The idempotent → auto-retry path shown above is the planned design. The engine already tracks `isIdempotentExecution` but does not yet use it for retry decisions.
## Entity Locking
Before execution, the engine acquires exclusive locks on all entities in the workflow's scope:
- Acquisition -- The engine queries the CMS for the required entities
- Locking -- The CMS grants exclusive write locks
- Execution -- Steps run against the locked entities
- Release -- On completion (success or failure), locks are released and DB updates are applied atomically
This prevents two workflows from modifying the same device simultaneously. If a device is already locked by another workflow, the new workflow waits in the `SCHEDULED` state until the lock becomes available.
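The acquire/release lifecycle can be sketched as an all-or-nothing lock manager. This is a minimal in-process illustration, not the actual CMS locking API:

```python
import threading

class EntityLockManager:
    """Illustrative per-entity exclusive locking (hypothetical, not the CMS API)."""

    def __init__(self):
        self._guard = threading.Lock()
        self._held: dict[str, str] = {}  # entity id -> holding workflow id

    def try_acquire(self, workflow_id: str, entities: list[str]) -> bool:
        """All-or-nothing: lock every entity, or none (workflow keeps waiting)."""
        with self._guard:
            if any(e in self._held for e in entities):
                return False  # at least one entity is busy -> wait
            for e in entities:
                self._held[e] = workflow_id
            return True

    def release(self, workflow_id: str) -> None:
        """Release on completion, success or failure alike."""
        with self._guard:
            self._held = {e: w for e, w in self._held.items()
                          if w != workflow_id}

mgr = EntityLockManager()
mgr.try_acquire("wf-1", ["device-a", "device-b"])  # True: both free
mgr.try_acquire("wf-2", ["device-b"])              # False: device-b held
mgr.try_acquire("wf-2", ["device-c"])              # True: different entity
mgr.release("wf-1")                                # always runs, even on failure
```

The last two calls also show the per-entity granularity described below: holding device-b does not block an unrelated workflow on device-c.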
### Locking granularity
Locks are per-entity: a workflow locking device A does not block a different workflow operating on device B. Locks are exclusive -- only one workflow can hold a lock on a given entity at a time.
## Practical Implications
For workflow authors:
- Mark read-only function blocks as `isPure: true` -- this gives the engine maximum flexibility for failure classification
- Mark config-push function blocks as `isIdempotent: true` if re-running them is safe (declarative config, not append-based) -- this will enable automatic retries when that feature is implemented
- Order steps so pure (read-only) steps execute first -- if the workflow fails during the pure phase, it is automatically `FAILED_SAFE`
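The ordering advice can be made concrete. The step names and flag dictionaries below are hypothetical, but the pattern is the point: everything before the first side-effecting step is the "free to fail" zone.

```python
# Hypothetical step list for a config-change workflow. Ordering the pure,
# read-only steps first means a failure during validation is classified
# FAILED_SAFE -- nothing has touched the devices yet.
STEPS = [
    ("fetch_running_config",     {"is_pure": True}),   # read-only
    ("validate_intended_config", {"is_pure": True}),   # read-only
    ("push_config",              {"is_idempotent": True}),  # side effects begin
    ("verify_config",            {"is_pure": True}),
]

def first_side_effect_index(steps) -> int:
    """Index of the first step that can change the world."""
    return next((i for i, (_, meta) in enumerate(steps)
                 if not meta.get("is_pure")), len(steps))

print(first_side_effect_index(STEPS))  # 2
```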
For operators:
- `FAILED_SAFE` workflows can be safely ignored or re-triggered
- `FAILED_UNSAFE` workflows need investigation -- check the execution details in the Monitor App to see which steps ran on which devices
- Entity locks are released even on failure, so subsequent workflows are not blocked