Retry & Rollback
The engine provides two mechanisms for handling step failures: automatic retries and rollback.
Retry Configuration
Not Yet Implemented
retryConfig is defined in the workflow schema and can be set on steps, but the
execution engine does not yet consume it. Failed steps are not retried based on this
configuration. This feature is planned for a future release.
Add retryConfig to a step to configure retry behavior on failure (when implemented):
```yaml
- type: functionBlock
  label: push_config
  functionBlock: "fb.examples.neops.io/configureDevice:1.0.0"
  retryConfig:
    maxRetries: 3
    delay: 10
```
| Field | Required | Description |
|---|---|---|
| maxRetries | Yes | Number of retry attempts (1-100) |
| delay | No | Seconds between attempts (0-600) |
| condition | No | JMESPath condition that must be true for retry |
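The field constraints in the table can be checked before a workflow is submitted. The following is a minimal sketch, not the engine's actual schema validation: the `RetryConfig` class, the `parse_retry_config` helper, and the default delay of 0 are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryConfig:
    """Hypothetical in-memory mirror of the retryConfig fields."""

    max_retries: int                 # maxRetries: required, 1-100
    delay: int = 0                   # delay: optional, seconds, 0-600 (default is an assumption)
    condition: Optional[str] = None  # condition: optional JMESPath expression

    def __post_init__(self) -> None:
        # Reject values outside the documented ranges.
        if not 1 <= self.max_retries <= 100:
            raise ValueError("maxRetries must be between 1 and 100")
        if not 0 <= self.delay <= 600:
            raise ValueError("delay must be between 0 and 600 seconds")


def parse_retry_config(raw: dict) -> RetryConfig:
    """Build a RetryConfig from a parsed YAML mapping of a step's retryConfig."""
    if "maxRetries" not in raw:
        raise ValueError("maxRetries is required")
    return RetryConfig(
        max_retries=raw["maxRetries"],
        delay=raw.get("delay", 0),
        condition=raw.get("condition"),
    )
```

For example, `parse_retry_config({"maxRetries": 3, "delay": 10})` accepts the step shown earlier, while `{"maxRetries": 0}` raises `ValueError`.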
Planned Retry Behavior
- A step fails
- The engine checks if the step has retryConfig
- If retryCount < maxRetries, the engine waits delay seconds and creates a new job
- The retry job replaces the failed job (tracked via replaces/replacedBy references)
- If all retries fail, the step is marked as permanently failed
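The planned retry steps listed above can be sketched as a small loop. This is an illustration of the described behavior only, since the engine does not implement it yet. The `Job` class and `run_with_retries` function are hypothetical names, and the optional `condition` field is left out for brevity.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Job:
    """Hypothetical job record; replaces/replaced_by mirror the job references."""

    job_id: int
    succeeded: bool = False
    replaces: Optional[int] = None
    replaced_by: Optional[int] = None


def run_with_retries(
    execute: Callable[[], bool],
    max_retries: int,
    delay: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Job]:
    """Run a step once, then retry up to max_retries times while it keeps failing."""
    jobs = [Job(job_id=0, succeeded=execute())]
    retry_count = 0
    while not jobs[-1].succeeded and retry_count < max_retries:
        sleep(delay)  # wait the configured delay before the next attempt
        failed = jobs[-1]
        # The retry job replaces the failed job; link both directions.
        retry = Job(job_id=failed.job_id + 1, replaces=failed.job_id)
        failed.replaced_by = retry.job_id
        retry.succeeded = execute()
        jobs.append(retry)
        retry_count += 1
    # If the last job still failed, the step is permanently failed.
    return jobs
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting, for example `run_with_retries(step, max_retries=3, delay=10, sleep=lambda s: None)`.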
Repeat Configuration
Not Yet Implemented
repeatConfig is defined in the workflow schema and can be set on steps, but the
execution engine does not yet consume it. Steps always execute once. This feature is
planned for a future release.
Repeat a step multiple times (independent of failure), when implemented:
```yaml
- type: functionBlock
  label: poll_status
  functionBlock: "fb.examples.neops.io/checkStatus:1.0.0"
  repeatConfig:
    repeats: 5
    delay: 60
    condition:
      type: jmes
      jmes: "{{ poll_status.result.data.status != 'ready' }}"
```
| Field | Required | Description |
|---|---|---|
| repeats | No | Fixed number of repetitions (1-100) |
| delay | No | Seconds between iterations (0-600) |
| condition | No | Continue repeating while condition is true |
Provide either repeats (fixed count), condition (repeat while true), or both (whichever limit is reached first). Use repeat for polling patterns (wait until a device reboots, check convergence).
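The interaction between a fixed count and a condition can be sketched as follows. This is an illustration under stated assumptions, not the engine's behavior: `run_with_repeat` is a hypothetical name, `repeats` is assumed to count total executions, and the condition is assumed to be evaluated against the most recent result.

```python
import time
from typing import Any, Callable, List, Optional


def run_with_repeat(
    execute: Callable[[], Any],
    repeats: Optional[int] = None,
    condition: Optional[Callable[[Any], bool]] = None,
    delay: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Any]:
    """Repeat a step until the count is reached or the condition turns false."""
    if repeats is None and condition is None:
        raise ValueError("provide repeats, condition, or both")
    results: List[Any] = []
    while True:
        results.append(execute())
        # Assumption: repeats counts total executions, including the first one.
        if repeats is not None and len(results) >= repeats:
            break
        # Assumption: the condition is checked against the latest result.
        if condition is not None and not condition(results[-1]):
            break
        sleep(delay)  # wait between iterations
    return results
```

With `repeats=5` and a condition of "status is not ready", a device that reports ready on the third poll stops the loop after three executions. A device that never becomes ready stops it after five.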
Failure Classification
When a step fails (after retries are exhausted, once implemented), the engine classifies the overall workflow failure:
```mermaid
graph TD
    StepFail["Step failed"] --> COE{"continueOnError?<br/>(planned)"}
    COE -->|Yes| Continue["Continue to next step"]
    COE -->|No| Classify["Classify failure"]
    Classify --> AllPure{"All executed<br/>steps pure?"}
    AllPure -->|Yes| FS["FAILED_SAFE"]
    AllPure -->|No| AnyIdemp{"All executed steps<br/>pure or idempotent?"}
    AnyIdemp -->|Yes| AutoRetry["Eligible for<br/>workflow retry (planned)"]
    AnyIdemp -->|No| FU["FAILED_UNSAFE"]
```
Implementation status
Today: Only the pure → FAILED_SAFE path is active. Everything else results in
FAILED_UNSAFE. The continueOnError and idempotent retry paths shown above are the
planned design. The engine already tracks isIdempotentExecution but does not yet
use it for decisions, and continueOnError is not yet enforced.
| Result | Meaning | Action |
|---|---|---|
| FAILED_SAFE | No side effects occurred (all steps were pure) | Safe to discard or re-trigger |
| FAILED_UNSAFE | Side effects may have occurred | Investigate in Monitor App |
| COMPLETED | All steps succeeded | Nothing to do |
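The classification rules can be expressed as a short function. This is a sketch of the logic described in this section, not engine code: `ExecutedStep`, `classify_failure`, the `planned` flag, and the `RETRY_ELIGIBLE` value are hypothetical names, and the continueOnError branch is omitted.

```python
from enum import Enum
from typing import Iterable, NamedTuple


class Outcome(Enum):
    FAILED_SAFE = "FAILED_SAFE"
    FAILED_UNSAFE = "FAILED_UNSAFE"
    RETRY_ELIGIBLE = "RETRY_ELIGIBLE"  # hypothetical name for the planned path


class ExecutedStep(NamedTuple):
    """Hypothetical record of a step that ran before the failure."""

    label: str
    is_pure: bool
    is_idempotent: bool = False


def classify_failure(executed: Iterable[ExecutedStep], planned: bool = False) -> Outcome:
    """Classify a failed workflow from the steps that were executed.

    planned=False models the current behavior (only the pure check is active);
    planned=True also models the planned idempotent retry path.
    """
    steps = list(executed)
    if all(step.is_pure for step in steps):
        return Outcome.FAILED_SAFE
    if planned and all(step.is_pure or step.is_idempotent for step in steps):
        return Outcome.RETRY_ELIGIBLE
    return Outcome.FAILED_UNSAFE
```

With the default `planned=False`, a workflow that ran one pure step and one idempotent configuration step is classified as `FAILED_UNSAFE`, matching the current behavior described above.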
Rollback
Framework Placeholder
Rollback is defined in the execution model (ROLLBACK state, ROLLBACK job type)
but the rollback handler is currently a framework placeholder with no execution logic.
Function block developers can implement rollback() in the Worker SDK in preparation
for when the engine enables this feature.
Current workaround: Design a separate "rollback workflow" that reverses the changes made by the original workflow, and trigger it manually when needed.
When fully implemented, the engine will support rollback for steps whose function blocks implement a rollback() method. Failed workflows will trigger rollback jobs in reverse step order.
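Reverse-order rollback can be sketched with plain callables. This example does not use the Worker SDK, whose rollback() signature is not shown here. `CompletedStep` and `rollback_in_reverse` are hypothetical names, and skipping steps without a rollback is an assumption.

```python
from typing import Callable, List, Optional, Tuple

# (label, rollback callable, or None when the function block has no rollback)
CompletedStep = Tuple[str, Optional[Callable[[], None]]]


def rollback_in_reverse(completed: List[CompletedStep]) -> List[str]:
    """Invoke rollbacks for completed steps, last executed first.

    Returns the labels of the steps that were rolled back, in rollback order.
    """
    rolled_back: List[str] = []
    for label, rollback in reversed(completed):
        if rollback is None:
            continue  # assumption: steps without a rollback are skipped
        rollback()
        rolled_back.append(label)
    return rolled_back
```

The same ordering applies when writing a manual rollback workflow as a workaround: undo the most recent change first.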
Designing for Failure
Use isPure liberally.
: Every read-only function block should be marked pure. This gives the engine maximum information for failure classification.
Order steps strategically.
: Put pure steps first (data collection, validation) and configuration steps last. If the workflow fails during data collection, it is FAILED_SAFE.
Design idempotent configuration steps.
: Prefer declarative configuration (replace entire section) over imperative (append line). Declarative configs are naturally idempotent, which will enable automatic retries when that feature is implemented.
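The difference between imperative and declarative configuration can be shown with a toy config model. This is a sketch only: the `Config` type and both helper functions are hypothetical and do not represent a real device API.

```python
from typing import Dict, List

# Toy model: section name -> list of config lines
Config = Dict[str, List[str]]


def append_line(config: Config, section: str, line: str) -> Config:
    """Imperative: adds the line on every run, so a retry duplicates it."""
    updated = {name: list(lines) for name, lines in config.items()}
    updated.setdefault(section, []).append(line)
    return updated


def replace_section(config: Config, section: str, lines: List[str]) -> Config:
    """Declarative: sets the whole section, so a second run changes nothing."""
    updated = {name: list(existing) for name, existing in config.items()}
    updated[section] = list(lines)
    return updated
```

Applying `replace_section` twice with the same input yields the same config as applying it once, which is the property a retry relies on. Applying `append_line` twice leaves a duplicated line.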