Argo Workflows offers a range of options for retrying failed steps.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-container-
spec:
  entrypoint: retry-container
  templates:
    - name: retry-container
      retryStrategy:
        limit: "10"
      container:
        image: python:alpine3.6
        command: ["python", "-c"]
        # fail with a 66% probability
        args: ["import random; import sys; exit_code = random.choice([0, 1, 1]); sys.exit(exit_code)"]
```
Use retryPolicy to choose which failures to retry:
Always: Retry all failed steps
OnFailure: Retry steps whose main container is marked as failed in Kubernetes (this is the default)
OnError: Retry steps that encounter Argo controller errors, or whose init or wait containers fail
OnTransientError: Retry steps that encounter errors defined as transient, or errors matching the TRANSIENT_ERROR_PATTERN environment variable. Available in version 3.0 and later.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-on-error-
spec:
  entrypoint: error-container
  templates:
    - name: error-container
      retryStrategy:
        limit: "2"
        retryPolicy: "Always"
      container:
        image: python
        command: ["python", "-c"]
        # fail with an 80% probability
        args: ["import random; import sys; exit_code = random.choice(range(0, 5)); sys.exit(exit_code)"]
```
v3.2 and after
You can also use expression to control retries. The expression field accepts an expr expression and has access to the following variables:
lastRetry.exitCode: The exit code of the last retry, or "-1" if not available
lastRetry.status: The phase of the last retry: Error, Failed
lastRetry.duration: The duration of the last retry, in seconds
If expression evaluates to false, the step will not be retried.
Note that when expression is specified, retryPolicy will be ignored.
See example for usage.
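As a rough illustration of the expression field described above, the following sketch retries a step only when its last exit code is greater than 1. The workflow and template names are hypothetical, and the use of asInt to convert the string-valued lastRetry.exitCode is an assumption about the helper functions available to the expression; check the documentation for your Argo version before relying on it.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-expression-   # hypothetical name, for illustration only
spec:
  entrypoint: flaky-step
  templates:
    - name: flaky-step              # hypothetical template name
      retryStrategy:
        limit: "5"
        # lastRetry.exitCode is a string, so convert it before comparing
        # (asInt is assumed to be available in the expression environment).
        # Retry only when the last attempt exited with a code above 1.
        expression: "asInt(lastRetry.exitCode) > 1"
      container:
        image: python:alpine3.6
        command: ["python", "-c"]
        # exit with 0, 1, or 2, chosen at random
        args: ["import random; import sys; sys.exit(random.choice([0, 1, 2]))"]
```

With this sketch, an exit code of 0 succeeds, an exit code of 1 fails without a retry, and an exit code of 2 is retried until the limit is reached.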
You can configure the delay between retries with backoff. See example for usage.
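A minimal sketch of a backoff configuration might look like the following. The workflow and template names are hypothetical, and the values are arbitrary. The comments describe the fields as commonly understood (duration as the initial delay, factor as its multiplier, maxDuration as a bound on the total time spent retrying); confirm the exact semantics against the field reference for your version.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-backoff-   # hypothetical name, for illustration only
spec:
  entrypoint: retry-backoff
  templates:
    - name: retry-backoff        # hypothetical template name
      retryStrategy:
        limit: "10"
        backoff:
          duration: "1"          # initial delay; a bare number is read as seconds
          factor: "2"            # multiply the delay by this after each retry
          maxDuration: "1m"      # assumed upper bound on the total retry time
      container:
        image: python:alpine3.6
        command: ["python", "-c"]
        # fail with a 66% probability
        args: ["import random; import sys; exit_code = random.choice([0, 1, 1]); sys.exit(exit_code)"]
```

Under these assumptions, the delays between attempts would grow roughly as 1s, 2s, 4s, and so on, until either the limit or the maxDuration is reached.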