Analysis & Progressive Delivery¶

Argo Rollouts provides several ways to perform analysis to drive progressive delivery. This document describes how to achieve various forms of progressive delivery, varying the point in time analysis is performed, its frequency, and occurrence.

Custom Resource Definitions¶

CRD	Description
Rollout	A `Rollout` acts as a drop-in replacement for a Deployment resource. It provides additional blueGreen and canary update strategies. These strategies can create AnalysisRuns and Experiments during the update, which will progress the update, or abort it.
AnalysisTemplate	An `AnalysisTemplate` is a template spec which defines how to perform a canary analysis, such as the metrics which it should perform, its frequency, and the values which are considered successful or failed. AnalysisTemplates may be parameterized with inputs values.
ClusterAnalysisTemplate	A `ClusterAnalysisTemplate` is like an `AnalysisTemplate`, but it is not limited to its namespace. It can be used by any `Rollout` throughout the cluster.
AnalysisRun	An `AnalysisRun` is an instantiation of an `AnalysisTemplate`. AnalysisRuns are like Jobs in that they eventually complete. Completed runs are considered Successful, Failed, or Inconclusive, and the result of the run affect if the Rollout's update will continue, abort, or pause, respectively.
Experiment	An `Experiment` is limited run of one or more ReplicaSets for the purposes of analysis. Experiments typically run for a pre-determined duration, but can also run indefinitely until stopped. Experiments may reference an `AnalysisTemplate` to run during or after the experiment. The canonical use case for an Experiment is to start a baseline and canary deployment in parallel, and compare the metrics produced by the baseline and canary pods for an equal comparison.

Background Analysis¶

Analysis can be run in the background -- while the canary is progressing through its rollout steps.

The following example gradually increments the canary weight by 20% every 10 minutes until it reaches 100%. In the background, an AnalysisRun is started based on the AnalysisTemplate named success-rate. The success-rate template queries a prometheus server, measuring the HTTP success rates at 5 minute intervals/samples. It has no end time, and continues until stopped or failed. If the metric is measured to be less than 95%, and there are three such measurements, the analysis is considered Failed. The failed analysis causes the Rollout to abort, setting the canary weight back to zero, and the Rollout would be considered in a Degraded. Otherwise, if the rollout completes all of its canary steps, the rollout is considered successful and the analysis run is stopped by the controller.

This example highlights:

Background analysis style of progressive delivery
Using a Prometheus query to perform a measurement
The ability to parameterize the analysis
Delay starting the analysis run until step 3 (Set Weight 40%)

Rollout

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
...
  strategy:
    canary:
      analysis:
        templates:
        - templateName: success-rate
        startingStep: 2 # delay starting analysis run until setWeight: 40%
        args:
        - name: service-name
          value: guestbook-svc.default.svc.cluster.local
      steps:
      - setWeight: 20
      - pause: {duration: 10m}
      - setWeight: 40
      - pause: {duration: 10m}
      - setWeight: 60
      - pause: {duration: 10m}
      - setWeight: 80
      - pause: {duration: 10m}

AnalysisTemplate

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 5m
    # NOTE: prometheus queries return results in the form of a vector.
    # So it is common to access the index 0 of the returned array to obtain the value
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
          )) /
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
          ))

Inline Analysis¶

Analysis can also be performed as a rollout step as an inline "analysis" step. When analysis is performed "inlined," an AnalysisRun is started when the step is reached, and blocks the rollout until the run is completed. The success or failure of the analysis run decides if the rollout will proceed to the next step, or abort the rollout completely.

This example sets the canary weight to 20%, pauses for 5 minutes, then runs an analysis. If the analysis was successful, continues with rollout, otherwise aborts.

This example demonstrates:

The ability to invoke an analysis in-line as part of steps

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
...
  strategy:
    canary:
      steps:
      - setWeight: 20
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: success-rate
          args:
          - name: service-name
            value: guestbook-svc.default.svc.cluster.local

In this example, the AnalysisTemplate is identical to the background analysis example, but since no interval is specified, the analysis will perform a single measurement and complete.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  - name: prometheus-port
    value: 9090
  metrics:
  - name: success-rate
    successCondition: result[0] >= 0.95
    provider:
      prometheus:
        address: "http://prometheus.example.com:{{args.prometheus-port}}"
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
          )) /
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
          ))

Multiple measurements can be performed over a longer duration period, by specifying the count and interval fields:

  metrics:
  - name: success-rate
    successCondition: result[0] >= 0.95
    interval: 60s
    count: 5
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: ...

Note

The count can have 0 as value which means that it will run until the end of the Rollout execution for background analysis (outside of steps). However if the count has value 0 and the analysis is defined in the steps, the analysis won't be executed.

ClusterAnalysisTemplates¶

Important

Available since v0.9.0

A Rollout can reference a Cluster scoped AnalysisTemplate called a ClusterAnalysisTemplate. This can be useful when you want to share an AnalysisTemplate across multiple Rollouts; in different namespaces, and avoid duplicating the same template in every namespace. Use the field clusterScope: true to reference a ClusterAnalysisTemplate instead of an AnalysisTemplate.

Rollout

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
...
  strategy:
    canary:
      steps:
      - setWeight: 20
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: success-rate
            clusterScope: true
          args:
          - name: service-name
            value: guestbook-svc.default.svc.cluster.local

ClusterAnalysisTemplate

apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  - name: prometheus-port
    value: 9090
  metrics:
  - name: success-rate
    successCondition: result[0] >= 0.95
    provider:
      prometheus:
        address: "http://prometheus.example.com:{{args.prometheus-port}}"
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
          )) /
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
          ))

Note

The resulting AnalysisRun will still run in the namespace of the Rollout

Analysis with Multiple Templates¶

A Rollout can reference multiple AnalysisTemplates when constructing an AnalysisRun. This allows users to compose analysis from multiple AnalysisTemplates. If multiple templates are referenced, then the controller will merge the templates together. The controller combines the metrics and args fields of all the templates.

Rollout

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
...
  strategy:
    canary:
      analysis:
        templates:
        - templateName: success-rate
        - templateName: error-rate
        args:
        - name: service-name
          value: guestbook-svc.default.svc.cluster.local

AnalysisTemplate

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 5m
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
          )) /
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
          ))
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: error-rate
    interval: 5m
    successCondition: result[0] <= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code=~"5.*"}[5m]
          )) /
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
          ))

AnalysisRun

# NOTE: Generated AnalysisRun from the multiple templates
apiVersion: argoproj.io/v1alpha1
kind: AnalysisRun
metadata:
  name: guestbook-CurrentPodHash-multiple-templates
spec:
  args:
  - name: service-name
    value: guestbook-svc.default.svc.cluster.local
  metrics:
  - name: success-rate
    interval: 5m
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
          )) /
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
          ))
  - name: error-rate
    interval: 5m
    successCondition: result[0] <= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code=~"5.*"}[5m]
          )) /
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
          ))

Note

The controller will error when merging the templates if:

Multiple metrics in the templates have the same name
Two arguments with the same name have different default values no matter the argument value in Rollout

Analysis Template referencing other Analysis Templates¶

AnalysisTemplates and ClusterAnalysisTemplates may reference other templates.

They can be combined with other metrics:

AnalysisTemplate

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: error-rate
    interval: 5m
    successCondition: result[0] <= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code=~"5.*"}[5m]
          )) /
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
          ))
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: rates
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 5m
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
          )) /
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
          ))
  templates:
  - templateName: error-rate
    clusterScope: false

Or without additional metrics:

AnalysisTemplate

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 5m
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
          )) /
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
          ))
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: error-rate
    interval: 5m
    successCondition: result[0] <= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code=~"5.*"}[5m]
          )) /
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
          ))
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: rates
spec:
  args:
  - name: service-name
  templates:
  - templateName: success-rate
    clusterScope: false
  - templateName: error-rate
    clusterScope: false

The result in the AnalysisRun will have the aggregation of metrics of each template:

AnalysisRun

# NOTE: Generated AnalysisRun from a single template referencing several templates
apiVersion: argoproj.io/v1alpha1
kind: AnalysisRun
metadata:
  name: guestbook-CurrentPodHash-templates-in-template
spec:
  args:
  - name: service-name
    value: guestbook-svc.default.svc.cluster.local
  metrics:
  - name: success-rate
    interval: 5m
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
          )) /
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
          ))
  - name: error-rate
    interval: 5m
    successCondition: result[0] <= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code=~"5.*"}[5m]
          )) /
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
          ))

Note

The same limitations as for the multiple templates feature apply. The controller will error when merging the templates if:

Multiple metrics in the templates have the same name
Two arguments with the same name have different default values no matter the argument value in Rollout

However, if the same AnalysisTemplate is referenced several times along the chain of references, the controller will only keep it once and discard the other references.

Analysis Template Arguments¶

AnalysisTemplates may declare a set of arguments that can be passed by Rollouts. The args can then be used as in metrics configuration and are resolved at the time the AnalysisRun is created. Argument placeholders are defined as {{ args.<name> }}.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: args-example
spec:
  args:
  # required in Rollout due to no default value
  - name: service-name
  - name: stable-hash
  - name: latest-hash
  # optional in Rollout given the default value
  - name: api-url
    value: http://example/measure
  # from secret
  - name: api-token
    valueFrom:
      secretKeyRef:
        name: token-secret
        key: apiToken
  metrics:
  - name: webmetric
    successCondition: result == 'true'
    provider:
      web:
        # placeholders are resolved when an AnalysisRun is created
        url: "{{ args.api-url }}?service={{ args.service-name }}"
        headers:
          - key: Authorization
            value: "Bearer {{ args.api-token }}"
        jsonPath: "{$.results.ok}"

Analysis arguments defined in a Rollout are merged with the args from the AnalysisTemplate when the AnalysisRun is created.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
...
  strategy:
    canary:
      analysis:
        templates:
        - templateName: args-example
        args:
        # required value
        - name: service-name
          value: guestbook-svc.default.svc.cluster.local
        # override default value
        - name: api-url
          value: http://other-api
        # pod template hash from the stable ReplicaSet
        - name: stable-hash
          valueFrom:
            podTemplateHashValue: Stable
        # pod template hash from the latest ReplicaSet
        - name: latest-hash
          valueFrom:
            podTemplateHashValue: Latest

Analysis arguments also support valueFrom for reading metadata fields and passing them as arguments to AnalysisTemplate. An example would be to reference metadata labels like env and region and passing them along to AnalysisTemplate.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
  labels:
    appType: demo-app
    buildType: nginx-app
    ...
    env: dev
    region: us-west-2
spec:
...
  strategy:
    canary:
      analysis:
        templates:
        - templateName: args-example
        args:
        ...
        - name: env
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['env']
        # region where this app is deployed
        - name: region
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['region']

Important

Available since v1.2

Analysis arguments also support valueFrom for reading any field from Rollout status and passing them as arguments to AnalysisTemplate. Following example references Rollout status field like aws canaryTargetGroup name and passing them along to AnalysisTemplate

from the Rollout status

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
  labels:
    appType: demo-app
    buildType: nginx-app
    ...
    env: dev
    region: us-west-2
spec:
...
  strategy:
    canary:
      analysis:
        templates:
        - templateName: args-example
        args:
        ...
        - name: canary-targetgroup-name
          valueFrom:
            fieldRef:
              fieldPath: status.alb.canaryTargetGroup.name

BlueGreen Pre Promotion Analysis¶

A Rollout using the BlueGreen strategy can launch an AnalysisRun before it switches traffic to the new version using pre-promotion. This can be used to block the Service selector switch until the AnalysisRun finishes successfully. The success or failure of the AnalysisRun decides if the Rollout switches traffic, or abort the Rollout completely.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
...
  strategy:
    blueGreen:
      activeService: active-svc
      previewService: preview-svc
      prePromotionAnalysis:
        templates:
        - templateName: smoke-tests
        args:
        - name: service-name
          value: preview-svc.default.svc.cluster.local

In this example, the Rollout creates a pre-promotion AnalysisRun once the new ReplicaSet is fully available. The Rollout will not switch traffic to the new version until the analysis run finishes successfully.

Note: if theautoPromotionSeconds field is specified and the Rollout has waited auto promotion seconds amount of time, the Rollout marks the AnalysisRun successful and switches the traffic to a new version automatically. If the AnalysisRun completes before then, the Rollout will not create another AnalysisRun and wait out the rest of the autoPromotionSeconds.

BlueGreen Post Promotion Analysis¶

A Rollout using a BlueGreen strategy can launch an analysis run after the traffic switch to the new version using post-promotion analysis. If post-promotion Analysis fails or errors, the Rollout enters an aborted state and switches traffic back to the previous stable Replicaset. When post-analysis is Successful, the Rollout is considered fully promoted and the new ReplicaSet will be marked as stable. The old ReplicaSet will then be scaled down according to scaleDownDelaySeconds (default 30 seconds).

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
...
  strategy:
    blueGreen:
      activeService: active-svc
      previewService: preview-svc
      scaleDownDelaySeconds: 600 # 10 minutes
      postPromotionAnalysis:
        templates:
        - templateName: smoke-tests
        args:
        - name: service-name
          value: preview-svc.default.svc.cluster.local

Failure Conditions and Failure Limit¶

failureCondition can be used to cause an analysis run to fail. failureLimit is the maximum number of failed run an analysis is allowed. The following example continually polls the defined Prometheus server to get the total number of errors(i.e., HTTP response code >= 500) every 5 minutes, causing the measurement to fail if ten or more errors are encountered. The entire analysis run is considered as Failed after three failed measurements.

  metrics:
  - name: total-errors
    interval: 5m
    failureCondition: result[0] >= 10
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code=~"5.*"}[5m]
          ))

ConsecutiveSuccessLimit and FailureLimit¶

Important

consecutiveSuccessLimit available since v1.8

You can use either failureLimit to define a limit for the number of failures before the analysis is considered failed, consecutiveSuccessLimit to define the required consecutive number of successes for the analysis to succeed, or both together. One of them has to be applicable (i.e. not disabled, see below for more), otherwise a validation error is thrown.

To disable:

failureLimit, set the field to -1.
consecutiveSuccessLimit, set the field to 0 (the default value).

The default value for both is 0, the meaning of which differs for each one of them. A value of 0 for failureLimit means its logic is applicable and no failures are tolerated. However, a value of 0 for consecutiveSuccessLimit means it's inapplicable or disabled.

Let's go through each case and show what the behavior would look like.

Only FailureLimit applicable¶

The behavior is shown above in the Failure Conditions and Failure Limit section. This is the default behavior if you set nothing of the two fields (with failureLimit having a default value of 0, so no failures are tolerated).

Only ConsecutiveSuccessLimit applicable¶

To have this behavior, you need to have something like

failureLimit: -1
consecutiveSuccessLimit: 4 # Any value > 0

This behavior is essentially waiting for a condition to hold, or an event to happen. That is, keep measuring a metric and keep failing until you measure N consecutive successful measurements, at which point the analysis concludes successfully. This can be useful as an event-driven way of promoting a rollout when used in an inline analysis.

Both FailureLimit and ConsecutiveSuccessLimit applicable¶

To have this behavior, you need to have something like

failureLimit: 3 # Any value >= 0
consecutiveSuccessLimit: 4 # Any value > 0

The behavior is simply waiting to measure N consecutive successful measurements, while being limited by the number of overall failures specified by failureLimit. Above, we need to have at most 3 failures before we get 4 consecutive successful measurements for the analysis to be considered successful.

In case of an analysis that has count specified (that is, runs for a specific amount of time) and that count is reached, the evaluation of success is as follows:

failureLimit is violated and consecutiveSuccessLimit is satisfied: Failure.
failureLimit is violated and consecutiveSuccessLimit is not satisfied: Failure.
failureLimit is not violated and consecutiveSuccessLimit is satisfied: Success.
failureLimit is not violated and consecutiveSuccessLimit is not satisfied: Inconclusive State.

As illustrated, failureLimit takes priority if violated. However, if neither is violated/satisfied, the analysis reaches an inconclusive state.

Note

When terminating analyses prematurely, they are always terminated successfully, unless it happens that failureLimit is enabled and violated, then they terminate in failure. consecutiveSuccessLimit, if enabled, doesn't affect the termination status.

For more clarity, examples of analyses terminated "prematurely":

A background analysis with count not specified when terminated at the end of the rollout.
Any analysis with count specified and not yet reached when the rollout is aborted.

Dry-Run Mode¶

Important

Available since v1.2

dryRun can be used on a metric to control whether or not to evaluate that metric in a dry-run mode. A metric running in the dry-run mode won't impact the final state of the rollout or experiment even if it fails or the evaluation comes out as inconclusive.

The following example queries prometheus every 5 minutes to get the total number of 4XX and 5XX errors, and even if the evaluation of the metric to monitor the 5XX error-rate fail, the analysis run will pass.

  dryRun:
  - metricName: total-5xx-errors
  metrics:
  - name: total-5xx-errors
    interval: 5m
    failureCondition: result[0] >= 10
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code~"5.*"}[5m]
          ))
  - name: total-4xx-errors
    interval: 5m
    failureCondition: result[0] >= 10
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code~"4.*"}[5m]
          ))

RegEx matches are also supported. .* can be used to make all the metrics run in the dry-run mode. In the following example, even if one or both metrics fail, the analysis run will pass.

  dryRun:
  - metricName: .*
  metrics:
  - name: total-5xx-errors
    interval: 5m
    failureCondition: result[0] >= 10
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code~"5.*"}[5m]
          ))
  - name: total-4xx-errors
    interval: 5m
    failureCondition: result[0] >= 10
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code~"4.*"}[5m]
          ))

Dry-Run Summary¶

If one or more metrics are running in the dry-run mode, the summary of the dry-run results gets appended to the analysis run message. Assuming that the total-4xx-errors metric fails in the above example but, the total-5xx-errors succeeds, the final dry-run summary will look like this.

Message: Run Terminated
Run Summary:
  ...
Dry Run Summary: 
  Count: 2
  Successful: 1
  Failed: 1
Metric Results:
...

Dry-Run Rollouts¶

If a rollout wants to dry run its analysis, it simply needs to specify the dryRun field to its analysis stanza. In the following example, all the metrics from random-fail and always-pass get merged and executed in the dry-run mode.

kind: Rollout
spec:
...
  steps:
  - analysis:
      templates:
      - templateName: random-fail
      - templateName: always-pass
      dryRun:
      - metricName: .*

Dry-Run Experiments¶

If an experiment wants to dry run its analysis, it simply needs to specify the dryRun field under its specs. In the following example, all the metrics from analyze-job matching the RegEx rule test.* will be executed in the dry-run mode.

kind: Experiment
spec:
  templates:
  - name: baseline
    selector:
      matchLabels:
        app: rollouts-demo
    template:
      metadata:
        labels:
          app: rollouts-demo
      spec:
        containers:
        - name: rollouts-demo
          image: argoproj/rollouts-demo:blue
  analyses:
  - name: analyze-job
    templateName: analyze-job
  dryRun:
  - metricName: test.*

Measurements Retention¶

Important

Available since v1.2

measurementRetention can be used to retain other than the latest ten results for the metrics running in any mode (dry/non-dry). Setting this option to 0 would disable it and, the controller will revert to the existing behavior of retaining the latest ten measurements.

The following example queries Prometheus every 5 minutes to get the total number of 4XX and 5XX errors and retains the latest twenty measurements for the 5XX metric run results instead of the default ten.

  measurementRetention:
  - metricName: total-5xx-errors
    limit: 20
  metrics:
  - name: total-5xx-errors
    interval: 5m
    failureCondition: result[0] >= 10
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code~"5.*"}[5m]
          ))
  - name: total-4xx-errors
    interval: 5m
    failureCondition: result[0] >= 10
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code~"4.*"}[5m]
          ))

RegEx matches are also supported. .* can be used to apply the same retention rule to all the metrics. In the following example, the controller will retain the latest twenty run results for all the metrics instead of the default ten results.

  measurementRetention:
  - metricName: .*
    limit: 20
  metrics:
  - name: total-5xx-errors
    interval: 5m
    failureCondition: result[0] >= 10
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code~"5.*"}[5m]
          ))
  - name: total-4xx-errors
    interval: 5m
    failureCondition: result[0] >= 10
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code~"4.*"}[5m]
          ))

Measurements Retention for Rollouts Analysis¶

If a rollout wants to retain more results of its analysis metrics, it simply needs to specify the measurementRetention field to its analysis stanza. In the following example, all the metrics from random-fail and always-pass get merged, and their latest twenty measurements get retained instead of the default ten.

kind: Rollout
spec:
...
  steps:
  - analysis:
      templates:
      - templateName: random-fail
      - templateName: always-pass
      measurementRetention:
      - metricName: .*
        limit: 20

Define custom Labels/Annotations for AnalysisRun¶

If you would like to annotate/label the AnalysisRun with the custom labels your can do it by specifying analysisRunMetadata field.

kind: Rollout
spec:
...
  steps:
  - analysis:
      templates:
      - templateName: my-template
      analysisRunMetadata:
        labels:
          my-custom-label: label-value
        annotations:
          my-custom-annotation: annotation-value

Measurements Retention for Experiments¶

If an experiment wants to retain more results of its analysis metrics, it simply needs to specify the measurementRetention field under its specs. In the following example, all the metrics from analyze-job matching the RegEx rule test.* will have their latest twenty measurements get retained instead of the default ten.

kind: Experiment
spec:
  templates:
  - name: baseline
    selector:
      matchLabels:
        app: rollouts-demo
    template:
      metadata:
        labels:
          app: rollouts-demo
      spec:
        containers:
        - name: rollouts-demo
          image: argoproj/rollouts-demo:blue
  analyses:
  - name: analyze-job
    templateName: analyze-job
  measurementRetention:
  - metricName: test.*
    limit: 20

Time-to-live (TTL) Strategy¶

Important

Available since v1.7

ttlStrategy limits the lifetime of an analysis run that has finished execution depending on if it Succeeded or Failed. If this struct is set, once the run finishes, it will be deleted after the time to live expires. If this field is unset, the analysis controller will keep the completed runs, unless they are associated with rollouts using other garbage collection policies (e.g. successfulRunHistoryLimit and unsuccessfulRunHistoryLimit).

apiVersion: argoproj.io/v1alpha1
kind: AnalysisRun
spec:
  ...
  ttlStrategy:
    secondsAfterCompletion: 3600
    secondsAfterSuccess: 1800
    secondsAfterFailure: 1800

Inconclusive Runs¶

Analysis runs can also be considered Inconclusive, which indicates the run was neither successful, nor failed. Inconclusive runs causes a rollout to become paused at its current step. Manual intervention is then needed to either resume the rollout, or abort. One example of how analysis runs could become Inconclusive, is when a metric defines no success or failure conditions.

  metrics:
  - name: my-query
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: ...

Inconclusive analysis runs might also happen when both success and failure conditions are specified, but the measurement value did not meet either condition.

  metrics:
  - name: success-rate
    successCondition: result[0] >= 0.90
    failureCondition: result[0] < 0.50
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: ...

A use case for having Inconclusive analysis runs are to enable Argo Rollouts to automate the execution of analysis runs, and collect the measurement, but still allow human judgement to decide whether or not measurement value is acceptable and decide to proceed or abort.

Delay Analysis Runs¶

If the analysis run does not need to start immediately (i.e. give the metric provider time to collect metrics on the canary version), Analysis Runs can delay the specific metric analysis. Each metric can be configured to have a different delay. In additional to the metric specific delays, the rollouts with background analysis can delay creating an analysis run until a certain step is reached

Delaying a specific analysis metric:

  metrics:
  - name: success-rate
    # Do not start this analysis until 5 minutes after the analysis run starts
    initialDelay: 5m
    successCondition: result[0] >= 0.90
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: ...

Delaying starting background analysis run until step 3 (Set Weight 40%):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: guestbook
spec:
  strategy:
    canary:
      analysis:
        templates:
        - templateName: success-rate
        startingStep: 2
      steps:
      - setWeight: 20
      - pause: {duration: 10m}
      - setWeight: 40
      - pause: {duration: 10m}

Referencing Secrets¶

AnalysisTemplates and AnalysisRuns can reference secret objects in .spec.args. This allows users to securely pass authentication information to Metric Providers, like login credentials or API tokens.

An AnalysisRun can only reference secrets from the same namespace as it's running in. This is only relevant for AnalysisRuns, since AnalysisTemplates do not resolve the secret.

In the following example, an AnalysisTemplate references an API token and passes it to a Web metric provider.

This example demonstrates:

The ability to reference a secret in the AnalysisTemplate .spec.args
The ability to pass secret arguments to Metric Providers

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
spec:
  args:
  - name: api-token
    valueFrom:
      secretKeyRef:
        name: token-secret
        key: apiToken
  metrics:
  - name: webmetric
    provider:
      web:
        headers:
        - key: Authorization
          value: "Bearer {{ args.api-token }}"

Handling Metric Results¶

NaN and Infinity¶

Metric providers can sometimes return values of NaN (not a number) and infinity. Users can edit the successCondition and failureCondition fields to handle these cases accordingly.

Here are three examples where a metric result of NaN is considered successful, inconclusive and failed respectively.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisRun
  ...
    successCondition: isNaN(result) || result >= 0.95
status:
  metricResults:
  - count: 1
    measurements:
    - finishedAt: "2021-02-10T00:15:26Z"
      phase: Successful
      startedAt: "2021-02-10T00:15:26Z"
      value: NaN
    name: success-rate
    phase: Successful
    successful: 1
  phase: Successful
  startedAt: "2021-02-10T00:15:26Z"

apiVersion: argoproj.io/v1alpha1
kind: AnalysisRun
  ...
    successCondition: result >= 0.95
    failureCondition: result < 0.95
status:
  metricResults:
  - count: 1
    measurements:
    - finishedAt: "2021-02-10T00:15:26Z"
      phase: Inconclusive
      startedAt: "2021-02-10T00:15:26Z"
      value: NaN
    name: success-rate
    phase: Inconclusive
    successful: 1
  phase: Inconclusive
  startedAt: "2021-02-10T00:15:26Z"

apiVersion: argoproj.io/v1alpha1
kind: AnalysisRun
  ...
    successCondition: result >= 0.95
status:
  metricResults:
  - count: 1
    measurements:
    - finishedAt: "2021-02-10T00:15:26Z"
      phase: Failed
      startedAt: "2021-02-10T00:15:26Z"
      value: NaN
    name: success-rate
    phase: Failed
    successful: 1
  phase: Failed
  startedAt: "2021-02-10T00:15:26Z"

Here are two examples where a metric result of infinity is considered successful and failed respectively.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisRun
  ...
    successCondition: result >= 0.95
status:
  metricResults:
  - count: 1
    measurements:
    - finishedAt: "2021-02-10T00:15:26Z"
      phase: Successful
      startedAt: "2021-02-10T00:15:26Z"
      value: +Inf
    name: success-rate
    phase: Successful
    successful: 1
  phase: Successful
  startedAt: "2021-02-10T00:15:26Z"

apiVersion: argoproj.io/v1alpha1
kind: AnalysisRun
  ...
    failureCondition: isInf(result)
status:
  metricResults:
  - count: 1
    measurements:
    - finishedAt: "2021-02-10T00:15:26Z"
      phase: Failed
      startedAt: "2021-02-10T00:15:26Z"
      value: +Inf
    name: success-rate
    phase: Failed
    successful: 1
  phase: Failed
  startedAt: "2021-02-10T00:15:26Z"

Empty array¶

Prometheus¶

Metric providers can sometimes return empty array, e.g., no data returned from prometheus query.

Here are two examples where a metric result of empty array is considered successful and failed respectively.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisRun
  ...
    successCondition: len(result) == 0 || result[0] >= 0.95
status:
  metricResults:
  - count: 1
    measurements:
    - finishedAt: "2021-09-08T19:15:49Z"
      phase: Successful
      startedAt: "2021-09-08T19:15:49Z"
      value: '[]'
    name: success-rate
    phase: Successful
    successful: 1
  phase: Successful
  startedAt:  "2021-09-08T19:15:49Z"

apiVersion: argoproj.io/v1alpha1
kind: AnalysisRun
  ...
    successCondition: len(result) > 0 && result[0] >= 0.95
status:
  metricResults:
  - count: 1
    measurements:
    - finishedAt: "2021-09-08T19:19:44Z"
      phase: Failed
      startedAt: "2021-09-08T19:19:44Z"
      value: '[]'
    name: success-rate
    phase: Failed
    successful: 1
  phase: Failed
  startedAt: "2021-09-08T19:19:44Z"

Datadog¶

Datadog queries can return empty results if the query takes place during a time interval with no metrics. The Datadog provider will return a nil value yielding an error during the evaluation phase like:

invalid operation: < (mismatched types <nil> and float64)

However, empty query results yielding a nil value can be handled using the default() function. Here is a succeeding example using the default() function:

successCondition: default(result, 0) < 0.05