Workflow as a Transaction

This document describes the paradigm of "workflow as a transaction" in the context of network operations, particularly within the Neops framework.

Motivation: Safe and Predictable Network Operations

Automating operations in a networked environment carries inherent risk. Actions such as configuration changes, firmware upgrades, or interface resets can cause unintended side effects, inconsistent system states, or critical service disruptions. These risks stem from the following realities:

  • Irreversibility: Some network actions cannot be undone once performed (e.g., hardware reboots or deletions).
  • Side Effects: Configuration changes may have cascading effects that are hard to predict or localize.
  • Concurrency: Multiple processes acting on the same resource can cause race conditions or corrupt data.
  • Partial Failures: In multi-step operations, some steps may succeed while others fail, leaving the system in an inconsistent, partially applied state.

To address these risks, Neops employs the principle of "workflow as a transaction".

Workflow as a Transaction

The concept borrows from traditional database transactions: a sequence of operations is grouped so that either all steps succeed or none take effect. This keeps execution consistent and predictable. Not all network actions are reversible, so Neops cannot guarantee full ACID (Atomicity, Consistency, Isolation, Durability) semantics; it approximates them as closely as possible through:

  • Atomic Grouping: Operations within a workflow are treated as a logical unit.
  • Consistency: Workflows are validated before execution; results are validated afterward.
  • Isolation: Resource locking prevents concurrent access to the same network entities (devices, interfaces, groups).
  • Durability: Executed changes are logged and persisted, which supports auditing and, where feasible, rollback.
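
The four properties above can be illustrated with a minimal Python sketch. This is not the Neops API: `Step`, `run_as_transaction`, the in-process lock table, and the returned outcome strings are hypothetical names chosen for illustration. The sketch assumes every step offers an optional `undo` callable, with `None` marking an irreversible action.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    """One operation in a workflow (hypothetical structure, not a Neops type)."""
    name: str
    apply: Callable[[], None]
    undo: Optional[Callable[[], None]] = None  # None marks an irreversible step


# Hypothetical in-process lock table, keyed by entity name (isolation).
_locks = {}
_locks_guard = threading.Lock()


def _lock_for(entity: str) -> threading.Lock:
    with _locks_guard:
        return _locks.setdefault(entity, threading.Lock())


def _roll_back(applied: list, audit_log: list) -> str:
    """Undo applied steps in reverse order; report whether the undo was complete."""
    clean = True
    for step in reversed(applied):
        if step.undo is None:
            audit_log.append(f"cannot-undo:{step.name}")
            clean = False
            continue
        step.undo()
        audit_log.append(f"undone:{step.name}")
    return "rolled_back" if clean else "partially_applied"


def run_as_transaction(entities: list, steps: list, audit_log: list) -> str:
    """Run all steps as one logical unit while holding locks on the entities."""
    # Acquire locks in a fixed (sorted) order so two workflows cannot deadlock.
    locks = [_lock_for(e) for e in sorted(set(entities))]
    for lock in locks:
        lock.acquire()
    applied = []
    try:
        for step in steps:
            try:
                step.apply()
            except Exception:
                audit_log.append(f"failed:{step.name}")
                return _roll_back(applied, audit_log)
            applied.append(step)
            audit_log.append(f"applied:{step.name}")  # durability: record each change
        return "committed"
    finally:
        for lock in reversed(locks):
            lock.release()
```

With a reversible first step and a failing second step, the runner undoes the first step and returns `"rolled_back"`. If the first step had no `undo`, it would return `"partially_applied"` instead, which is the case where a transaction over real devices cannot be fully atomic.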

Simplified Workflow Lifecycle

Each workflow execution transitions through a set of defined states:

  • NEW: A workflow has been created and awaits validation.
  • VALID: The workflow definition and parameters are confirmed to be correct.
  • SCHEDULED: Conditions (e.g., time, device availability) have been met for execution.
  • LOCKED: Target entities (devices, interfaces, groups) are exclusively locked.
  • RUNNING: The workflow is actively being executed across its steps.
  • COMPLETED: All steps finished successfully; effects are consistent and durable.
  • FAILED_SAFE: The workflow failed without leaving irreversible changes, e.g., it never ran or all steps were reverted.
  • FAILED_UNSAFE: Irreversible or untracked changes occurred, e.g., firmware installation succeeded but subsequent verification failed. Manual intervention is required to assess and correct the system.

This model explicitly accounts for interactions with the "real world," where not everything is reversible or predictable.
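
The distinction between the two failure states can be sketched as a small classification function. This is an illustrative assumption, not Neops code: `StepResult` and `classify_failure` are hypothetical names, and the sketch models only tracked steps, so the "untracked changes" case mentioned above is out of scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepResult:
    """Outcome of one step in a failed workflow (hypothetical structure)."""
    name: str
    executed: bool  # the step changed something on the target
    reverted: bool  # the change was successfully undone


def classify_failure(results: list) -> str:
    """Return the failure state for a workflow that did not complete."""
    for result in results:
        # Any executed step that was not reverted leaves a lasting change.
        if result.executed and not result.reverted:
            return "FAILED_UNSAFE"
    # Nothing ran, or everything that ran was reverted.
    return "FAILED_SAFE"
```

A workflow that never ran has no executed steps and classifies as `FAILED_SAFE`; a firmware installation that succeeded before a failed verification classifies as `FAILED_UNSAFE`.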

stateDiagram-v2
        classDef goodState stroke:green
        classDef warningState stroke:orange
        classDef badState stroke:red

        class NEW goodState
        class READY goodState
        class VALID goodState
        class LOCKING goodState
        class LOCKED goodState
        class BLOCKED_WAITING goodState
        class RESOURCE_DISCOVERY goodState
        class RESOURCES_DISCOVERED goodState
        class SCHEDULED goodState
        class RUNNING goodState
        class COMPLETED goodState
        class COMPLETED_ACK goodState
        class FAILED_SAFE warningState
        class ERROR warningState
        class ROLLBACK warningState
        class FAILED_UNSAFE badState
        class FAILED_UNSAFE_ACK badState
        class FAILED_SAFE_ACK warningState

    [*] --> NEW
    NEW --> VALID
    NEW --> FAILED_SAFE
    VALID --> LOCKED: ...
    LOCKED --> RUNNING
    RUNNING --> COMPLETED
    RUNNING --> FAILED_SAFE
    RUNNING --> FAILED_UNSAFE
    COMPLETED --> [*]
    FAILED_SAFE --> [*]
    FAILED_UNSAFE --> [*]
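
The edges drawn in the diagram can be encoded as a transition table that rejects illegal state changes. This is a minimal sketch, not the Neops implementation: `State`, `TRANSITIONS`, and `advance` are hypothetical names. It mirrors only the edges shown above, so the `VALID` to `LOCKED` edge, which the diagram labels "...", is kept as a single step and intermediate states such as `SCHEDULED` are not modeled.

```python
from enum import Enum


class State(Enum):
    NEW = "NEW"
    VALID = "VALID"
    LOCKED = "LOCKED"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED_SAFE = "FAILED_SAFE"
    FAILED_UNSAFE = "FAILED_UNSAFE"


# Allowed transitions, copied from the edges of the simplified diagram.
TRANSITIONS = {
    State.NEW: {State.VALID, State.FAILED_SAFE},
    State.VALID: {State.LOCKED},  # the diagram elides intermediate states here
    State.LOCKED: {State.RUNNING},
    State.RUNNING: {State.COMPLETED, State.FAILED_SAFE, State.FAILED_UNSAFE},
    State.COMPLETED: set(),
    State.FAILED_SAFE: set(),
    State.FAILED_UNSAFE: set(),
}


def advance(current: State, target: State) -> State:
    """Return the target state, or raise if the diagram has no such edge."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def is_terminal(state: State) -> bool:
    """Terminal states have no outgoing edges."""
    return not TRANSITIONS[state]
```

Keeping the table as data rather than scattered `if` statements makes it straightforward to check a workflow's history against the diagram, and to extend the table if the full lifecycle adds states.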