Canary Deployments

Get started with canary deployment on Kubernetes using Flagger.

Overview

This documentation guides you through the process of setting up Flagger Canary deployment with NGINX for Cloudentity. Flagger is a progressive delivery tool that automates the release process using canary deployments. It provides observability, control, and operational simplicity when deploying microservices on Kubernetes.

Canary deployment is a process which provides a balance between risk and progress, ensuring that new versions can be tested and rolled out in a controlled manner. Each step of the canary is described below.

graph LR
    UserRequest[User makes request] --> Ingress
    Ingress -->|Majority of Requests| StableVersion[V1.0 - Stable Version]
    Ingress -->|Canary Users| NewVersion[V2.0 - New Version]
    NewVersion -->|Performs badly| UserResponse2[Canary aborted]
    NewVersion -->|Performs well| IncreasingCanaryUsers[Increase Canary Users over Time]
    IncreasingCanaryUsers --> StableVersionReplacement[Gradually replaces V1.0]
  • User makes a request: This is the initial user action. It could be accessing a website, using a feature of an application, etc.
  • Request reaches the Ingress: The Ingress is like a traffic controller that manages incoming requests. It decides where to route each user request based on certain rules or strategies.
  • Majority of requests are routed to the stable version (V1.0): To minimize risk, the Ingress initially directs the majority of user requests to the current stable version of the software.
  • A few canary users are routed to the new version (V2.0): A small percentage of users (the “canary users”) are directed to the new version of the software. This tests the new version in a live production environment with a limited user base, reducing potential impact in case of unforeseen issues.
  • Performance evaluation: The new version’s performance is monitored closely.
  • If it performs badly: If any significant issues or performance degradation are detected with the new version, the canary deployment is aborted. In this case, all users are then routed back to the stable version until the issues with the new version are resolved.
  • If it performs well: If the new version operates as expected and no significant issues are found, the percentage of user traffic directed to the new version is gradually increased over time. This is represented by the “Increase Canary Users over Time” step.
  • Gradual replacement of the stable version: As the new version proves to be stable and efficient, it gradually takes over the entire user traffic from the stable version, thus completing the canary deployment process.

Note

For a complete and ready-to-use solution, consider exploring our Cloudentity on Kubernetes via the GitOps approach. Get started with our quickstart guide, and delve deeper with the deployment configuration details.

Prerequisites

Before proceeding, ensure that you have the following tools installed on your system:

  • Kind v0.13.0+ (Kubernetes cluster v1.16+)
  • Helm v3.7+ (required for helm repo update with repository names)

Set Up Kubernetes Kind Cluster

If you don’t have a Kubernetes cluster, you can set up a local one using Kind. Kind allows you to run Kubernetes on your local machine. This step is optional if you already have a Kubernetes cluster set up.

Create a configuration file named kind-config.yaml with the following content:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 30443
    hostPort: 8443
    protocol: TCP

This configuration creates a new Kind cluster with a single control-plane node and maps port 30443 from the container to port 8443 on the host. This mapping is used to access the NGINX ingress on localhost.

Now, you can create a new Kind cluster with this configuration:

kind create cluster --name=my-cluster --config=kind-config.yaml
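
Once the cluster is created, Kind sets your kubectl context to kind-my-cluster, so you can verify that the node is ready:

kubectl get nodes --context kind-my-cluster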

Learn more

Visit the Kind official documentation.

Install NGINX Ingress Controller

Flagger uses the NGINX Ingress Controller to control the traffic routing during the canary deployment. Flagger modifies the Ingress resource to gradually shift traffic from the stable version of the ACP app to the canary version. This allows us to monitor how the system behaves under the canary version without fully committing to it.

To install the NGINX Ingress Controller in the nginx namespace, use the following commands:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update ingress-nginx
helm install nginx-ingress ingress-nginx/ingress-nginx \
  --create-namespace --namespace nginx \
  --set controller.service.type=NodePort \
  --set controller.service.nodePorts.https=30443
  • --set controller.service.type=NodePort sets the type of the service to NodePort, allowing it to be accessible on a static port on the cluster.
  • --set controller.service.nodePorts.https=30443 specifies the node port for HTTPS traffic as 30443. Any HTTPS traffic sent to this port will be forwarded to the NGINX service.
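
Before moving on, you can verify that the controller is running and that its service exposes node port 30443:

kubectl get pods,svc --namespace nginx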

Learn more

Visit the NGINX Ingress Controller official documentation.

Install Prometheus Operator

Prometheus Operator simplifies the deployment and configuration of Prometheus, a monitoring system and time series database. Prometheus collects metrics from monitored targets by scraping metrics HTTP endpoints on these targets.

Flagger uses Prometheus to retrieve metrics about Cloudentity and uses this information to make decisions during the canary deployment. If the metrics indicate that the canary version is causing issues, Flagger will halt the rollout and revert to the stable version.

To install the Prometheus stack in the monitoring namespace, use the following commands:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update prometheus-community
helm install prometheus prometheus-community/kube-prometheus-stack \
  --create-namespace --namespace monitoring \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
  • --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false – By default, the Prometheus Operator only selects ServiceMonitors within its own release. This flag ensures that the Prometheus Operator will also select ServiceMonitors outside of its own release. This is necessary when you want to monitor services in other namespaces or from other releases like Cloudentity.
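
To verify the installation, or to explore the metrics Flagger will query later, you can port-forward the Prometheus service (the same address used in the next steps) and open http://localhost:9090 in your browser:

kubectl port-forward --namespace monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090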

Learn more

Visit the Prometheus Operator Helm Chart repository.

Install Flagger with NGINX Backend

Flagger is a Kubernetes operator that automates the promotion of canary deployments using various service mesh providers, including NGINX. Flagger requires a running Kubernetes cluster and uses Prometheus for monitoring and collecting metrics during the canary deployment process. In this step, install Flagger with the NGINX backend:

helm repo add flagger https://flagger.app
helm repo update flagger
helm upgrade -i flagger flagger/flagger \
  --create-namespace --namespace=flagger \
  --set meshProvider=nginx \
  --set metricsServer=http://prometheus-kube-prometheus-prometheus.monitoring:9090
helm upgrade -i loadtester flagger/loadtester --namespace=flagger
  • --set meshProvider=nginx – sets the service mesh provider. Flagger supports multiple service mesh providers; here, you’re using NGINX.
  • --set metricsServer=http://prometheus-kube-prometheus-prometheus.monitoring:9090 – the flag sets the URL of the metrics server, where Flagger will fetch metrics from during canary analysis. In this case, the Prometheus server’s service address in the monitoring namespace is being used.
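
You can check that both the Flagger controller and the load tester are running:

kubectl get pods --namespace flagger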

Learn more

Visit the Flagger official documentation.

Install Cloudentity Helm chart

In this step, you need to install Cloudentity on Kubernetes using the kube-acp-stack Helm Chart.
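
For reference, a minimal installation could look like the sketch below. The chart repository URL is a placeholder here, so substitute the repository provided with your Cloudentity distribution; the release name acp and the acp namespace are assumed throughout the rest of this guide:

helm repo add cloudentity https://example.com/charts   # placeholder URL - use your Cloudentity chart repository
helm upgrade -i acp cloudentity/kube-acp-stack --create-namespace --namespace acp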

Define Flagger Canary Custom Resource for Cloudentity

In this step, you’ll define a Flagger Canary custom resource. This resource describes the desired state of the Cloudentity deployment and plays a key role in enabling the automated canary deployment process.

This custom resource is broken down into the following sections:

  1. Canary includes configuration about the deployment and service to watch, the ingress reference, service port details, the analysis, and related webhooks.

       • provider – Specifies the service mesh provider, in this case NGINX.
       • targetRef – Identifies the Cloudentity Deployment object that Flagger will manage.
       • ingressRef – Identifies the Cloudentity Ingress object that Flagger will manage.
       • service – Specifies the ports used by the Cloudentity service.
       • analysis – Defines the parameters for the canary analysis:
           • interval – The period of a single stepWeight iteration.
           • stepWeights – The number of steps and their traffic routing weights for the canary service.
           • threshold – The number of times the canary analysis is allowed to fail before it is rolled back.
           • webhooks – Used for running validation tests before the canary is started. In this case, a pre-rollout webhook is configured to check the “alive” status.
           • metrics – The metrics checked at the end of each iteration. Each metric includes a reference to a MetricTemplate, a threshold, and an interval for fetching the metric.
  2. MetricTemplate describes how to fetch the Cloudentity metrics from Prometheus. It contains a custom query that Flagger will use to fetch and analyze metrics from Prometheus.

  3. ServiceMonitor is configured to monitor the canary version of Cloudentity. This enables Flagger and Prometheus to monitor the performance of the canary version during the canary analysis.

Note

The configuration below is an example for the purposes of this article. For a complete list of recommended metrics to monitor during a canary release of Cloudentity, refer to the Recommended Cloudentity Metrics to be Monitored section.

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: acp
spec:
  provider: nginx
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: acp
  ingressRef:
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: acp
  service:
    port: 8443
    targetPort: http
    portDiscovery: true
  analysis:
    interval: 1m
    threshold: 2
    stepWeights: [5, 10, 15, 20, 25, 35, 45, 60, 75, 90]
    webhooks:
      - name: "check alive"
        type: pre-rollout
        url: http://loadtester/
        timeout: 15s
        metadata:
          type: bash
          cmd: "curl -k https://acp-canary.acp:8443/alive"
    metrics:
      - name: "ACP P95 Latency"
        templateRef:
          name: acp-request-duration
        thresholdRange:
          max: 0.25
        interval: 60s
---
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: acp-request-duration
spec:
  provider:
    type: prometheus
    address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
  query: |
    avg(histogram_quantile(0.95, rate(acp_http_duration_seconds_bucket{job="acp-canary"}[{{ interval }}])) > 0) or on() vector(0)
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: acp-canary
spec:
  endpoints:
    - port: metrics
      interval: 10s
  selector:
    matchLabels:
      app.kubernetes.io/name: acp-canary

Save the above YAML configuration in a file named acp-canary.yaml and apply it to your cluster:

kubectl apply -f acp-canary.yaml --namespace acp

Applying these resources creates a second Cloudentity ingress, along with additional services for the canary and primary deployments. Check whether the Canary resource has been successfully initialized by running:

kubectl get canaries --namespace acp

The status should be Initialized.
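
The output should look similar to the following (the timestamp is illustrative):

NAME   STATUS        WEIGHT   LASTTRANSITIONTIME
acp    Initialized   0        2023-05-18T14:10:00Z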

Learn more

Visit the Flagger NGINX Canary Deployments documentation.

Triggering Canary Release, Simulating Latency and Observing Canary Failures

In this step, we will trigger a canary release by making a change in the Cloudentity Helm chart. We will then simulate latency to cause the canary process to fail. We will also learn how to observe the canary process.

Access Ingress on Kind with Local Domain

To access the ingress on the Kind cluster using the local domain, you need to map the domain to localhost in your hosts file.

Open the hosts file in a text editor.

sudo vi /etc/hosts

Add the following line to the file:

127.0.0.1       acp.local

Save your changes and exit the text editor.

Now, you should be able to access the ingress on the Kind cluster at https://acp.local:8443 from your browser.

Make a Change in the Cloudentity Helm Chart to Trigger Canary Release

helm upgrade acp cloudentity/kube-acp-stack --namespace acp \
  --set acp.serviceMonitor.enabled=true \
  --set acp.config.data.logging.level=warn

This starts a canary version of Cloudentity:

kubectl get pods -n acp -l 'app.kubernetes.io/name in (acp, acp-primary)'

NAME                           READY   STATUS    RESTARTS        AGE
acp-69cdf99895-b6xcb           1/1     Running   0               3m6s
acp-primary-868d5889d6-5xj59   1/1     Running   0               11m
  • The pod named acp-<id> is the canary version of Cloudentity, which is analyzed before promotion.
  • The pod named acp-primary-<id> is the currently deployed version of Cloudentity.

Simulate Latency Using tc Command

We want to add latency just to the canary pod in order to simulate issues with the deployed new version.

  1. Get the Cloudentity pod’s network interface index; in the case below it is 48, as indicated by eth0@if48.

    kubectl exec acp-69cdf99895-b6xcb --namespace acp -- ip link show eth0
    
    2: eth0@if48: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
       link/ether 76:07:42:2a:3e:e0 brd ff:ff:ff:ff:ff:ff
    
  2. Identify the pod’s interface name in the Kind cluster. Use the number from the previous command; in this case it is veth3ab723f4, as indicated by the 48 next to it (note that @if2 is not part of the name).

    docker exec my-cluster-control-plane ip link show
    
    48: veth3ab723f4@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
        link/ether da:1f:26:cf:a2:e9 brd ff:ff:ff:ff:ff:ff link-netns cni-43b9239b-e0d1-b12f-16a6-0fcfc72e8df0
    
  3. Set the latency on that interface using its name with the command below:

    docker exec my-cluster-control-plane tc qdisc add dev veth3ab723f4 root netem delay 10ms
    

It’s also important to note that adding latency this way affects every individual packet. If a request consists of multiple packets, the latency adds up, which can be observed later on.
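
You can confirm that the qdisc is in place with:

docker exec my-cluster-control-plane tc qdisc show dev veth3ab723f4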

Observe the Canary Process

During the canary deployment, Flagger gradually shifts traffic from the old version to the new version of Cloudentity while monitoring the defined metrics. You can observe the progress of the canary deployment by querying the Flagger logs with kubectl.

kubectl -n acp describe canary/acp
kubectl logs flagger-5d4f4f785-gjdt6 --namespace flagger --follow

Replace the Flagger pod name above with the name of the Flagger pod running in your cluster.

We will run a simple curl loop against Cloudentity. Since we set the latency to 10 ms, responses from the canary version are delayed. Over time, you can observe more and more requests hitting the canary version of Cloudentity.

If you used the canary manifest from the previous steps, the analysis lasts about 10 minutes and progressively shifts traffic to the canary version, starting at 5% in the first iteration and reaching 90% in the last.

while true; do curl -o /dev/null -k -s -w 'Response time: %{time_total}s\n' https://acp.local:8443/health && sleep 1; done

Response time: 0.006966s
Response time: 0.119855s <- canary version (static delay of 10ms)
Response time: 0.068919s

Now, let’s increase the latency even more, causing the canary release to fail:

docker exec my-cluster-control-plane tc qdisc del dev veth3ab723f4 root
docker exec my-cluster-control-plane tc qdisc add dev veth3ab723f4 root netem delay 100ms

The response times are now above the 250 ms threshold:

while true; do curl -o /dev/null -k -s -w 'Response time: %{time_total}s\n' https://acp.local:8443/health && sleep 1; done
Response time: 0.607907s
Response time: 0.010327s <- primary version (no delay)
Response time: 0.608345s

If the average latency stays above the threshold, the canary analysis fails, as can be seen in the Flagger logs:

{"level":"info","ts":"2023-05-18T14:20:58.628Z","caller":"controller/events.go:33","msg":"New revision detected! Scaling up acp.acp","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:21:58.617Z","caller":"controller/events.go:33","msg":"Starting canary analysis for acp.acp","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:21:58.734Z","caller":"controller/events.go:33","msg":"Pre-rollout check check alive passed","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:21:58.898Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 5","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:22:58.761Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 10","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:23:58.755Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 15","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:24:58.754Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 20","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:25:58.747Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 25","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:26:58.751Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 35","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:27:58.751Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 45","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:28:58.752Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 60","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:31:58.617Z","caller":"controller/events.go:45","msg":"Halt acp.acp advancement ACP P95 Latency 1.34 > 0.25","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:32:58.618Z","caller":"controller/events.go:45","msg":"Halt acp.acp advancement ACP P95 Latency 0.97 > 0.25","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:33:58.616Z","caller":"controller/events.go:45","msg":"Rolling back acp.acp failed checks threshold reached 2","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:33:58.749Z","caller":"controller/events.go:45","msg":"Canary failed! Scaling down acp.acp","canary":"acp.acp"}

Clean Up

Once you’re done with your testing, you may want to clean up the resources you’ve created. Since all our resources are created within the kind cluster, we just need to delete the kind cluster to clean up.

To delete the kind cluster, run the following command:

kind delete cluster --name=my-cluster

This command deletes the Kind cluster named my-cluster, and with it all the resources within the cluster. Please replace my-cluster with the name of your cluster if it’s different. Be aware that this deletes all the resources within the cluster, including any applications or services you’ve deployed.

Also, remember to clean up any changes you’ve made to your /etc/hosts file.
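
For example, on Linux you can remove the entry with sed (review /etc/hosts afterwards to be sure):

sudo sed -i '/acp.local/d' /etc/hosts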

Recommended Cloudentity Metrics to be Monitored

When performing canary deployments, it’s crucial to monitor specific metrics to ensure the new version is performing as expected.

Here are some recommended metrics for the Cloudentity application:

Error Rates

Monitoring the rate of HTTP error responses (5xx status codes) can help identify issues with the new version. The query below computes the percentage of requests returning a 5xx status by subtracting the non-5xx share from 100%.

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: acp-error-rate
spec:
  provider:
    type: prometheus
    address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
  query: |
    (100 - sum(rate(acp_http_duration_seconds_count{job="acp-canary", status_code!~"5.."}[{{ interval }}])) / sum(rate(acp_http_duration_seconds_count{job="acp-canary"}[{{ interval }}])) * 100) or on() vector(0)

Request Duration

This is the time taken to serve a request. It can be helpful in detecting performance regressions in the new version.

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: acp-request-duration
spec:
  provider:
    type: prometheus
    address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
  query: |
    avg(histogram_quantile(0.95, rate(acp_http_duration_seconds_bucket{job="acp-canary"}[{{ interval }}])) > 0) or on() vector(0)

Queue Pending Messages

This metric represents the number of messages currently pending in the queue. A sudden increase might indicate a problem with processing the queue’s messages.

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: acp-pending-messages
spec:
  provider:
    type: prometheus
    address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
  query: |
    sum(acp_redis_error_count{job="acp-canary"}) or on() vector(0)

Queue Processing Time

This metric represents the time it takes to process a message from the queue. Increased processing time can indicate performance issues with the new version.

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: acp-lag-messages
spec:
  provider:
    type: prometheus
    address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
  query: |
    avg(histogram_quantile(0.95, rate(acp_redis_lag_seconds_bucket{job="acp-canary"}[{{ interval }}])) > 0) by (group, stream) or on() vector(0)
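
To use these templates during the canary analysis, reference them from the analysis.metrics list of the Canary resource, just as acp-request-duration is referenced earlier in this guide. The threshold values below are illustrative assumptions and should be tuned to your environment:

    metrics:
      - name: "ACP error rate"
        templateRef:
          name: acp-error-rate
        thresholdRange:
          max: 1 # illustrative assumption: fail when more than 1% of requests return 5xx
        interval: 60s
      - name: "ACP queue processing time"
        templateRef:
          name: acp-lag-messages
        thresholdRange:
          max: 0.5 # illustrative assumption: fail when P95 queue lag exceeds 500 ms
        interval: 60s
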
Updated: Oct 27, 2023