Workflow as a Transaction
This document describes the paradigm of "workflow as a transaction" in the context of network operations, particularly within the Neops framework.
Motivation: Safe and Predictable Network Operations
Managing and automating operations in a networked environment carries inherent risks. Actions such as device configuration changes, firmware upgrades, or interface resets can cause unintended side effects, inconsistent system states, or even critical service disruptions. These risks stem from the following realities:
- Irreversibility: Some network actions cannot be undone once performed (e.g., hardware reboots or deletions).
- Side Effects: Configuration changes may have cascading effects that are hard to predict or localize.
- Concurrency: Multiple processes acting on the same resource can create race conditions or data corruption.
- Partial Failures: In multi-step operations, some steps may succeed while others fail, leaving the system in an inconsistent, partially applied state.
To address these risks, Neops employs the principle of "workflow as a transaction".
Workflow as a Transaction
The concept borrows from traditional database transactions: a sequence of operations is grouped so that either all steps succeed or none do. This helps keep execution consistent and predictable. Not all network actions are reversible, so Neops cannot guarantee full ACID (Atomicity, Consistency, Isolation, Durability) semantics, but it aims to get as close as possible through the following mechanisms:
- Atomic Grouping: Operations within a workflow are treated as a logical unit.
- Consistency: Workflows are validated before execution; results are validated afterward.
- Isolation: Resource locking prevents concurrent access to the same network entities (devices, interfaces, groups).
- Durability: Executed changes are logged and persisted for auditability and rollback where feasible.
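The isolation property in the list above can be illustrated with a small sketch. This is not Neops code: `EntityLockRegistry`, its methods, and the entity identifiers are hypothetical names introduced only for this example. It shows all-or-nothing lock acquisition, where a workflow either obtains exclusive locks on every target entity or obtains none.

```python
import threading


class EntityLockRegistry:
    """Hypothetical in-memory registry of exclusive entity locks (illustration only)."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._owners: dict[str, str] = {}  # entity id -> id of the workflow holding it

    def acquire_all(self, workflow_id: str, entities: set[str]) -> bool:
        """Lock every entity for the workflow, or none if any is already taken."""
        with self._guard:
            if any(entity in self._owners for entity in entities):
                return False
            for entity in entities:
                self._owners[entity] = workflow_id
            return True

    def release_all(self, workflow_id: str) -> None:
        """Release every lock held by the given workflow."""
        with self._guard:
            self._owners = {e: w for e, w in self._owners.items() if w != workflow_id}


registry = EntityLockRegistry()
first = registry.acquire_all("wf-1", {"device:sw1", "interface:sw1/eth0"})
second = registry.acquire_all("wf-2", {"device:sw1", "device:sw2"})  # overlaps on sw1
registry.release_all("wf-1")
third = registry.acquire_all("wf-2", {"device:sw1", "device:sw2"})
print(first, second, third)  # True False True
```

Because the check and the assignment happen under one guard, a rejected workflow never holds a partial set of locks, so it can wait or fail without having touched any entity.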
Simplified Workflow Lifecycle
Each workflow execution transitions through a set of defined states:
- NEW: A workflow has been created and awaits validation.
- VALID: The workflow definition and parameters are confirmed to be correct.
- SCHEDULED: Conditions (e.g., time, device availability) have been met for execution.
- LOCKED: Target entities (devices, interfaces, groups) are exclusively locked.
- RUNNING: The workflow is actively being executed across its steps.
- COMPLETED: All steps finished successfully; effects are consistent and durable.
- FAILED_SAFE: The workflow failed without leaving irreversible changes (e.g., it never ran, or all steps were reverted).
- FAILED_UNSAFE: The workflow failed after irreversible or untracked changes occurred (e.g., a firmware installation succeeded but the subsequent verification failed). Manual intervention is required to assess and correct the system.
This model explicitly accounts for interactions with the "real world," where not everything is reversible or predictable.
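The difference between the two failure states can be sketched as follows. This is a minimal illustration under stated assumptions, not the Neops implementation: the `Step` type, `run_workflow`, and the rule "a completed step without an `undo` makes the failure unsafe" are all hypothetical, and the sketch assumes a step that raises has changed nothing itself.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    """Hypothetical workflow step; `undo` is None when the action is irreversible."""
    name: str
    do: Callable[[], object]
    undo: Optional[Callable[[], object]] = None


def run_workflow(steps: list[Step]) -> str:
    """Run steps in order; on failure, revert completed steps in reverse order."""
    done: list[Step] = []
    for step in steps:
        try:
            step.do()
        except Exception:
            return roll_back(done)
        done.append(step)
    return "COMPLETED"


def roll_back(done: list[Step]) -> str:
    """Revert what can be reverted and classify the failure."""
    outcome = "FAILED_SAFE"
    for step in reversed(done):
        if step.undo is None:
            outcome = "FAILED_UNSAFE"  # irreversible change stays in place
            continue
        try:
            step.undo()
        except Exception:
            outcome = "FAILED_UNSAFE"  # revert failed, resulting state is unknown
    return outcome


def fail() -> None:
    raise RuntimeError("verification failed")


config: dict[str, str] = {}
set_vlan = Step("set vlan", lambda: config.update(vlan="10"), lambda: config.pop("vlan", None))
firmware = Step("install firmware", lambda: None)  # no undo: treated as irreversible
verify = Step("verify", fail)

safe = run_workflow([set_vlan, verify])
unsafe = run_workflow([set_vlan, firmware, verify])
print(safe, unsafe, config)  # FAILED_SAFE FAILED_UNSAFE {}
```

In this sketch the rollback keeps reverting earlier steps even after it meets an irreversible one, so the amount of state left for manual correction stays as small as possible.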
stateDiagram-v2
classDef goodState stroke:green
classDef warningState stroke:orange
classDef badState stroke:red
class NEW goodState
class READY goodState
class VALID goodState
class LOCKING goodState
class LOCKED goodState
class BLOCKED_WAITING goodState
class RESOURCE_DISCOVERY goodState
class RESOURCES_DISCOVERED goodState
class SCHEDULED goodState
class RUNNING goodState
class COMPLETED goodState
class COMPLETED_ACK goodState
class FAILED_SAFE warningState
class ERROR warningState
class ROLLBACK warningState
class FAILED_UNSAFE badState
class FAILED_UNSAFE_ACK badState
class FAILED_SAFE_ACK warningState
[*] --> NEW
NEW --> VALID
NEW --> FAILED_SAFE
VALID --> LOCKED: ...
LOCKED --> RUNNING
RUNNING --> COMPLETED
RUNNING --> FAILED_SAFE
RUNNING --> FAILED_UNSAFE
COMPLETED --> [*]
FAILED_SAFE --> [*]
FAILED_UNSAFE --> [*]
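The transitions in the simplified diagram above can also be expressed as a lookup table that rejects illegal state changes. The following Python sketch is illustrative only: the `State` enum and the `advance` function are hypothetical names, and the elided "..." between VALID and LOCKED is collapsed into a single edge rather than filled in.

```python
from enum import Enum


class State(Enum):
    NEW = "NEW"
    VALID = "VALID"
    LOCKED = "LOCKED"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED_SAFE = "FAILED_SAFE"
    FAILED_UNSAFE = "FAILED_UNSAFE"


# Edges copied from the simplified diagram; terminal states have no successors.
TRANSITIONS: dict[State, set[State]] = {
    State.NEW: {State.VALID, State.FAILED_SAFE},
    State.VALID: {State.LOCKED},
    State.LOCKED: {State.RUNNING},
    State.RUNNING: {State.COMPLETED, State.FAILED_SAFE, State.FAILED_UNSAFE},
    State.COMPLETED: set(),
    State.FAILED_SAFE: set(),
    State.FAILED_UNSAFE: set(),
}


def advance(current: State, target: State) -> State:
    """Return the target state if the diagram allows the move, else raise."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


state = State.NEW
for nxt in (State.VALID, State.LOCKED, State.RUNNING, State.COMPLETED):
    state = advance(state, nxt)
print(state.name)  # COMPLETED
```

A call such as `advance(State.NEW, State.RUNNING)` raises `ValueError`, which mirrors the rule that a workflow cannot run before it has been validated and its targets locked.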