
Retry & Rollback

The engine provides two mechanisms for handling step failures: automatic retries and rollback.

Retry Configuration

Not Yet Implemented

retryConfig is defined in the workflow schema and can be set on steps, but the execution engine does not yet consume it. Failed steps are not retried based on this configuration. This feature is planned for a future release.

Add retryConfig to a step to configure retry behavior on failure (when implemented):

- type: functionBlock
  label: push_config
  functionBlock: "fb.examples.neops.io/configureDevice:1.0.0"
  retryConfig:
    maxRetries: 3
    delay: 10

| Field | Required | Description |
|---|---|---|
| maxRetries | Yes | Number of retry attempts (1-100) |
| delay | No | Seconds to wait between attempts (0-600) |
| condition | No | JMESPath condition that must be true for a retry to occur |

Planned Retry Behavior

  1. A step fails
  2. The engine checks if the step has retryConfig
  3. If retryCount < maxRetries, the engine waits delay seconds and creates a new job
  4. The retry job replaces the failed job (tracked via replaces/replacedBy references)
  5. If all retries fail, the step is marked as permanently failed
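
The planned sequence above can be sketched roughly as follows. This is only an illustration of the described behavior, not the engine's code: `RetryConfig`, `Job`, and `run_with_retries` are hypothetical names, the step is modeled as a callable that returns `True` on success, and the optional `condition` field is left out.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RetryConfig:
    """Hypothetical mirror of the retryConfig fields."""
    max_retries: int      # maxRetries (1-100)
    delay: float = 0.0    # delay in seconds (0-600)


@dataclass(eq=False)
class Job:
    """Hypothetical job record with replaces/replacedBy references."""
    attempt: int
    succeeded: bool
    replaces: Optional["Job"] = None
    replaced_by: Optional["Job"] = None


def run_with_retries(
    run_step: Callable[[], bool],
    config: Optional[RetryConfig],
    sleep: Callable[[float], None] = time.sleep,
) -> List[Job]:
    """Run a step, retrying on failure while retryCount < maxRetries."""
    jobs = [Job(attempt=0, succeeded=run_step())]
    retry_count = 0
    while (
        not jobs[-1].succeeded
        and config is not None
        and retry_count < config.max_retries
    ):
        sleep(config.delay)              # wait `delay` seconds
        retry_count += 1
        failed = jobs[-1]
        retry = Job(attempt=retry_count, succeeded=run_step(), replaces=failed)
        failed.replaced_by = retry       # the retry job replaces the failed job
        jobs.append(retry)
    # If the last job still failed here, the step is permanently failed.
    return jobs
```

With `maxRetries: 3`, a step that always fails would produce four jobs in this sketch: the original attempt plus three retries.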

Repeat Configuration

Not Yet Implemented

repeatConfig is defined in the workflow schema and can be set on steps, but the execution engine does not yet consume it. Steps always execute once. This feature is planned for a future release.

Repeat a step multiple times (independent of failure), when implemented:

- type: functionBlock
  label: poll_status
  functionBlock: "fb.examples.neops.io/checkStatus:1.0.0"
  repeatConfig:
    repeats: 5
    delay: 60
    condition:
      type: jmes
      jmes: "{{ poll_status.result.data.status != 'ready' }}"

| Field | Required | Description |
|---|---|---|
| repeats | No | Fixed number of repetitions (1-100) |
| delay | No | Seconds to wait between iterations (0-600) |
| condition | No | Continue repeating while the condition is true |

Provide either repeats (fixed count), condition (repeat while true), or both (whichever limit is reached first). Use repeat for polling patterns (wait until a device reboots, check convergence).
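
A minimal sketch of how the two limits could interact is shown below. It rests on assumptions that the schema description does not settle: `repeats` is treated as the total number of executions, the condition is evaluated after each execution (the example condition reads the step's own result), and `run_with_repeats` is a hypothetical helper rather than an engine API.

```python
import time
from typing import Any, Callable, List, Optional


def run_with_repeats(
    run_step: Callable[[], Any],
    repeats: Optional[int] = None,
    condition: Optional[Callable[[Any], bool]] = None,
    delay: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Any]:
    """Repeat a step until the count or the condition stops it."""
    if repeats is None and condition is None:
        return [run_step()]  # no repeat settings: execute once

    results: List[Any] = []
    while True:
        results.append(run_step())
        # Fixed-count limit reached.
        if repeats is not None and len(results) >= repeats:
            break
        # Condition no longer true, so stop repeating.
        if condition is not None and not condition(results[-1]):
            break
        sleep(delay)  # wait before the next iteration
    return results
```

For a polling step, `condition` would be something like `lambda r: r["status"] != "ready"`, with `repeats` acting as an upper bound so the loop cannot run forever.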

Failure Classification

When a step fails (and, once retries are implemented, after they are exhausted), the engine classifies the overall workflow failure:

graph TD
    StepFail["Step failed"] --> COE{"continueOnError?<br/>(planned)"}
    COE -->|Yes| Continue["Continue to next step"]
    COE -->|No| Classify["Classify failure"]
    Classify --> AllPure{"All executed<br/>steps pure?"}
    AllPure -->|Yes| FS["FAILED_SAFE"]
    AllPure -->|No| AnyIdemp{"All executed steps<br/>pure or idempotent?"}
    AnyIdemp -->|Yes| AutoRetry["Eligible for<br/>workflow retry (planned)"]
    AnyIdemp -->|No| FU["FAILED_UNSAFE"]

Implementation status

Today: Only the pure → FAILED_SAFE path is active. Everything else results in FAILED_UNSAFE. The continueOnError and idempotent retry paths shown above are the planned design. The engine already tracks isIdempotentExecution but does not yet use it for decisions, and continueOnError is not yet enforced.

| Result | Meaning | Action |
|---|---|---|
| FAILED_SAFE | No side effects occurred (all steps were pure) | Safe to discard or re-trigger |
| FAILED_UNSAFE | Side effects may have occurred | Investigate in Monitor App |
| COMPLETED | All steps succeeded | Nothing to do |
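
The classification rules can be illustrated with a short sketch. The types and function names here (`ExecutedStep`, `classify_failure_today`, `is_retry_eligible_planned`) are hypothetical and only model the decisions described on this page; they are not the engine's internals.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class WorkflowResult(Enum):
    COMPLETED = "COMPLETED"
    FAILED_SAFE = "FAILED_SAFE"
    FAILED_UNSAFE = "FAILED_UNSAFE"


@dataclass(frozen=True)
class ExecutedStep:
    label: str
    is_pure: bool
    is_idempotent: bool = False  # tracked today, not yet used for decisions


def classify_failure_today(executed: Iterable[ExecutedStep]) -> WorkflowResult:
    """Current behavior: only an all-pure run is FAILED_SAFE."""
    if all(step.is_pure for step in executed):
        return WorkflowResult.FAILED_SAFE
    return WorkflowResult.FAILED_UNSAFE


def is_retry_eligible_planned(executed: Iterable[ExecutedStep]) -> bool:
    """Planned design: a run that is not all-pure but consists only of
    pure or idempotent steps would be eligible for a workflow retry."""
    steps = list(executed)
    all_pure = all(step.is_pure for step in steps)
    all_safe_to_repeat = all(step.is_pure or step.is_idempotent for step in steps)
    return not all_pure and all_safe_to_repeat
```

In this model, a workflow that ran one pure step and one idempotent step before failing is `FAILED_UNSAFE` today, and would become retry-eligible only under the planned design.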

Rollback

Framework Placeholder

Rollback is defined in the execution model (ROLLBACK state, ROLLBACK job type), but the rollback handler is currently a framework placeholder with no execution logic. Function block developers can already implement rollback() in the Worker SDK so that their blocks are ready when the engine enables this feature.

Current workaround: Design a separate "rollback workflow" that reverses the changes made by the original workflow, and trigger it manually when needed.

When fully implemented, the engine will support rollback for steps whose function blocks implement a rollback() method. Failed workflows will trigger rollback jobs in reverse step order.
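
As a rough illustration of the planned ordering, the sketch below walks the executed steps in reverse and keeps only those that offer a rollback. `Step`, `plan_rollback`, and `run_rollback` are hypothetical names, and the `rollback` callable merely stands in for a function block's rollback() method; the real Worker SDK interface may look different.

```python
from typing import Callable, List, NamedTuple, Optional, Sequence


class Step(NamedTuple):
    label: str
    # Stand-in for a function block's rollback(); None if not implemented.
    rollback: Optional[Callable[[], None]] = None


def plan_rollback(executed_steps: Sequence[Step]) -> List[Step]:
    """Return the steps to roll back, in reverse execution order."""
    return [step for step in reversed(executed_steps) if step.rollback is not None]


def run_rollback(executed_steps: Sequence[Step]) -> List[str]:
    """Invoke each rollback in reverse order and return the labels handled."""
    handled = []
    for step in plan_rollback(executed_steps):
        step.rollback()
        handled.append(step.label)
    return handled
```

Steps without a rollback are skipped in this sketch. How the engine will treat such steps is not specified here.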

Designing for Failure

Use isPure liberally. Every read-only function block should be marked pure. This gives the engine maximum information for failure classification.

Order steps strategically. Put pure steps first (data collection, validation) and configuration steps last. If the workflow fails during data collection, it is FAILED_SAFE.

Design idempotent configuration steps. Prefer declarative configuration (replace entire section) over imperative (append line). Declarative configs are naturally idempotent, which will enable automatic retries when that feature is implemented.
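
The difference between the two styles shows up when a step is applied twice, which is exactly what a retry does. The sketch below uses a toy in-memory config (a dict mapping section names to lines); `append_line` and `replace_section` are hypothetical helpers, not function blocks shipped with the engine.

```python
from typing import Dict, List

Config = Dict[str, List[str]]


def _copy(config: Config) -> Config:
    """Copy the config so the helpers never mutate their input."""
    return {section: list(lines) for section, lines in config.items()}


def append_line(config: Config, section: str, line: str) -> Config:
    """Imperative: add one line. Applying it twice duplicates the line."""
    updated = _copy(config)
    updated.setdefault(section, []).append(line)
    return updated


def replace_section(config: Config, section: str, lines: List[str]) -> Config:
    """Declarative: set the whole section. Applying it twice changes nothing."""
    updated = _copy(config)
    updated[section] = list(lines)
    return updated


base: Config = {"ntp": []}

# Declarative: the second application leaves the config unchanged.
declared_once = replace_section(base, "ntp", ["server 10.0.0.1"])
declared_twice = replace_section(declared_once, "ntp", ["server 10.0.0.1"])

# Imperative: the second application adds a duplicate line.
appended_once = append_line(base, "ntp", "server 10.0.0.1")
appended_twice = append_line(appended_once, "ntp", "server 10.0.0.1")
```

Here `declared_once == declared_twice`, while `appended_twice` contains the server line twice, so only the declarative step is safe to re-run after a failure.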