Datadog Metrics¶
A Datadog query can be used to obtain measurements for analysis.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: loq-error-rate
spec:
args:
- name: service-name
metrics:
- name: error-rate
interval: 5m
successCondition: result <= 0.01
failureLimit: 3
provider:
datadog:
apiVersion: v2
interval: 5m
query: |
sum:requests.error.rate{service:{{args.service-name}}}
The field apiVersion
refers to the API version of Datadog (v1 or v2). Default value is v1
if this is omitted. See "Working with Datadog API v2" below for more information.
Datadog api and app tokens can be configured in a kubernetes secret in argo-rollouts namespace.
apiVersion: v1
kind: Secret
metadata:
name: datadog
type: Opaque
stringData:
address: https://api.datadoghq.com
api-key: <datadog-api-key>
app-key: <datadog-app-key>
apiVersion
here is different from the apiVersion
from the Datadog configuration above.
Working with Datadog API v2¶
Important
While some basic v2 functionality is working in earlier versions, the new properties of formula
and queries
are only available as of v1.7
Moving to v2¶
If your old v1 was just a simple metric query - no formula as part of the query - then you can just move to v2 by updating the apiVersion
in your existing Analysis Template, and everything should work.
If you have a formula, you will need to update how you configure your metric. Here is a before/after example of what your Analysis Template should look like:
Before:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: log-error-rate
spec:
args:
- name: service-name
metrics:
- name: error-rate
interval: 30s
successCondition: default(result, 0) < 10
failureLimit: 3
provider:
datadog:
apiVersion: v1
interval: 5m
query: "moving_rollup(sum:requests.errors{service:{{args.service-name}}}.as_count(), 60, 'sum') / sum:requests{service:{{args.service-name}}}.as_count()"
After:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: loq-error-rate
spec:
args:
- name: service-name
metrics:
- name: error-rate
# Polling rate against the Datadog API
interval: 30s
successCondition: default(result, 0) < 10
failureLimit: 3
provider:
datadog:
apiVersion: v2
# The window of time we are looking at in DD. Basically we will fetch data from (now-5m) to now.
interval: 5m
queries:
a: sum:requests.errors{service:{{args.service-name}}}.as_count()
b: sum:requests{service:{{args.service-name}}}.as_count()
formula: "moving_rollup(a, 60, 'sum') / b"
Examples¶
Simple v2 query with no formula
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: canary-container-restarts
spec:
args:
# This is set in rollout using the valueFrom: podTemplateHashValue functionality
- name: canary-hash
- name: service-name
- name: restarts.initial-delay
value: "60s"
- name: restarts.max-restarts
value: "4"
metrics:
- name: kubernetes.containers.restarts
initialDelay: "{{ args.restarts.initial-delay }}"
interval: 15s
failureCondition: default(result, 0) > {{ args.restarts.max-restarts }}
failureLimit: 0
provider:
datadog:
apiVersion: v2
interval: 5m
queries:
# The key is arbitrary - you will use this key to refer to the query if you use a formula.
q: "max:kubernetes.containers.restarts{service-name:{{args.service-name}},rollouts_pod_template_hash:{{args.canary-hash}}}"
Tips¶
Datadog Results¶
Datadog queries can return empty results if the query takes place during a time interval with no metrics. The Datadog provider will return a nil
value yielding an error during the evaluation phase like:
invalid operation: < (mismatched types <nil> and float64)
However, empty query results yielding a nil
value can be handled using the default()
function. Here is a succeeding example using the default()
function:
successCondition: default(result, 0) < 0.05
Templates and Helm¶
Helm and Argo Rollouts both try to parse things between {{ ... }}
when rendering templates. If you use Helm to deliver your manifests, you will need to escape {{ args.whatever }}
. Using the example above, here it is set up for Helm:
...<snip>
metrics:
- name: kubernetes.containers.restarts
initialDelay: "{{ `{{ args.restarts.initial-delay }}` }}"
interval: 15s
failureCondition: default(result, 0) > {{ `{{ args.restarts.max-restarts }}` }}
failureLimit: 0
provider:
datadog:
apiVersion: v2
interval: 5m
queries:
q: "{{ `max:kubernetes.containers.restarts{kube_app_name:{{args.kube_app_name}},rollouts_pod_template_hash:{{args.canary-hash}}}` }}"
Rate Limits¶
For the v1
API, you ask for an increase on the api/v1/query
route.
For the v2
API, the Ratelimit-Name you ask for an increase in is the query_scalar_public
.