Skip to content

Retries

Argo Workflows offers a range of options for retrying failed steps.

Configuring retryStrategy in WorkflowSpec

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-container-
spec:
  entrypoint: retry-container
  templates:
  - name: retry-container
    retryStrategy:
      limit: "10"
    container:
      image: python:alpine3.6
      command: ["python", -c]
      # fail with a 66% probability
      args: ["import random; import sys; exit_code = random.choice([0, 1, 1]); sys.exit(exit_code)"]

Retry policies

Use retryPolicy to choose which failures to retry:

  • Always: Retry all failed steps
  • OnFailure: Retry steps whose main container is marked as failed in Kubernetes
  • OnError: Retry steps that encounter Argo controller errors, or whose init or wait containers fail
  • OnTransientError: Retry steps that encounter errors defined as transient, or errors matching the TRANSIENT_ERROR_PATTERN environment variable. Available in version 3.0 and later.

For example:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-on-error-
spec:
  entrypoint: error-container
  templates:
  - name: error-container
    retryStrategy:
      limit: "2"
      retryPolicy: "Always"
    container:
      image: python
      command: ["python", "-c"]
      # fail with a 80% probability
      args: ["import random; import sys; exit_code = random.choice(range(0, 5)); sys.exit(exit_code)"]

Conditional retries

v3.2 and after

You can also use expression to control retries. The expression field accepts an expr expression and has access to the following variables:

  • lastRetry.exitCode: The exit code of the last retry, or "-1" if not available
  • lastRetry.status: The phase of the last retry: Error, Failed
  • lastRetry.duration: The duration of the last retry, in seconds

If expression evaluates to false, the step will not be retried.

See example for usage.

Backoff

You can configure the delay between retries with backoff. See example for usage.