Overview
This documentation guides you through the process of setting up Flagger Canary deployment with NGINX for Cloudentity. Flagger is a progressive delivery tool that automates the release process using canary deployments. It provides observability, control, and operational simplicity when deploying microservices on Kubernetes.
Canary deployment is a process that balances risk and progress, ensuring that new versions can be tested and rolled out in a controlled manner. Each step of the canary process is described below.
- User makes a request: This is the initial user action. It could be accessing a website, using a feature of an application, etc.
- Request reaches the Ingress: The Ingress is like a traffic controller that manages incoming requests. It decides where to route each user request based on certain rules or strategies.
- Majority of requests are routed to the stable version (V1.0): To minimize risk, the Ingress initially directs the majority of user requests to the current stable version of the software.
- Few Canary Users are routed to the new version (V2.0): A small percentage of users (the “canary users”) are directed to the new version of the software. The purpose of this is to test the new version in a live production environment with a limited user base, reducing potential impact in case of unforeseen issues.
- Performance evaluation: The new version’s performance is monitored closely.
- If it performs badly: If any significant issues or performance degradation are detected with the new version, the canary deployment is aborted. In this case, all users are then routed back to the stable version until the issues with the new version are resolved.
- If it performs well: If the new version operates as expected and no significant issues are found, the percentage of user traffic directed to the new version is gradually increased over time. This is represented by the “Increase Canary Users over Time” step.
- Gradual replacement of the stable version: As the new version proves to be stable and efficient, it gradually takes over the entire user traffic from the stable version, thus completing the canary deployment process.
Note
For a complete and ready-to-use solution, consider exploring our Cloudentity on Kubernetes via the GitOps approach. Get started with our quickstart guide, and delve deeper with the deployment configuration details.
Prerequisites
Before proceeding, ensure that you have the following tools installed on your system:
- Docker
- Kind
- kubectl
- Helm
Set Up Kubernetes Kind Cluster
If you don’t have a Kubernetes cluster, you can set up a local one using Kind. Kind allows you to run Kubernetes on your local machine. This step is optional if you already have a Kubernetes cluster set up.
Create a configuration file named kind-config.yaml
with the following content:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
extraPortMappings:
- containerPort: 30443
hostPort: 8443
protocol: TCP
This configuration creates a Kind cluster with a single control-plane node and maps port 30443 from the container to port 8443 on the host. This mapping is later used to access the NGINX ingress on localhost.
Now, you can create a new Kind cluster with this configuration:
kind create cluster --name=my-cluster --config=kind-config.yaml
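The file-and-command steps above can also be scripted in one go. A minimal sketch that writes the configuration shown above via a heredoc and sanity-checks the port mapping before creating the cluster (running `kind create cluster` afterwards still requires Docker):

```shell
# Write the Kind configuration shown above, then verify the port mapping is present.
cat > kind-config.yaml <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 30443
    hostPort: 8443
    protocol: TCP
EOF
grep -q 'containerPort: 30443' kind-config.yaml && \
grep -q 'hostPort: 8443' kind-config.yaml && \
echo "port mapping present"
```

With the file in place, `kind create cluster --name=my-cluster --config=kind-config.yaml` creates a cluster whose ingress node port is reachable on host port 8443.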
Learn more
Visit the Kind official documentation.
Install NGINX Ingress Controller
Flagger uses the NGINX Ingress Controller to control the traffic routing during the canary deployment. Flagger modifies the Ingress resource to gradually shift traffic from the stable version of the ACP app to the canary version. This allows us to monitor how the system behaves under the canary version without fully committing to it.
To install the NGINX Ingress Controller in the nginx
namespace, use the following commands:
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update ingress-nginx
helm install nginx-ingress ingress-nginx/ingress-nginx --create-namespace --namespace nginx --set controller.service.type=NodePort --set controller.service.nodePorts.https=30443
- `--set controller.service.type=NodePort` sets the type of the service to NodePort, allowing it to be accessible on a static port on the cluster.
- `--set controller.service.nodePorts.https=30443` specifies the node port for HTTPS traffic as 30443. Any HTTPS traffic sent to this port is forwarded to the NGINX service.
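If you prefer a values file over `--set` flags, the same settings can be expressed as YAML. A sketch equivalent to the flags above (`nginx-values.yaml` is a hypothetical file name):

```yaml
# nginx-values.yaml - mirrors the two --set flags used above
controller:
  service:
    type: NodePort
    nodePorts:
      https: 30443
```

It would then be passed to Helm with `helm install nginx-ingress ingress-nginx/ingress-nginx --create-namespace --namespace nginx -f nginx-values.yaml`.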
Learn more
Install Prometheus Operator
Prometheus Operator simplifies the deployment and configuration of Prometheus, a monitoring system and time series database. Prometheus collects metrics from monitored targets by scraping metrics HTTP endpoints on these targets.
Flagger uses Prometheus to retrieve metrics about Cloudentity and uses this information to make decisions during the canary deployment. If the metrics indicate that the canary version is causing issues, Flagger will halt the rollout and revert to the stable version.
To install the Prometheus stack in the monitoring namespace, use the following commands:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update prometheus-community
helm install prometheus prometheus-community/kube-prometheus-stack --create-namespace --namespace monitoring --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
- `--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false` ensures that the Prometheus Operator also selects ServiceMonitors outside of its own release. By default, it only selects ServiceMonitors within its own release; this flag is necessary when you want to monitor services in other namespaces or from other releases, such as Cloudentity.
Learn more
Visit the Prometheus Operator Helm Chart repository.
Install Flagger with NGINX Backend
Flagger is a Kubernetes operator that automates the promotion of canary deployments using various service mesh providers, including NGINX. Flagger requires a running Kubernetes cluster and uses Prometheus for monitoring and collecting metrics during the canary deployment process. In this step, install Flagger with the NGINX backend:
helm repo add flagger https://flagger.app
helm repo update flagger
helm upgrade -i flagger flagger/flagger --create-namespace --namespace=flagger --set meshProvider=nginx --set metricsServer=http://prometheus-kube-prometheus-prometheus.monitoring:9090
helm upgrade -i loadtester flagger/loadtester --namespace=flagger
- `--set meshProvider=nginx` sets the service mesh provider to NGINX. Flagger supports multiple providers; in this case, you are specifying NGINX.
- `--set metricsServer=http://prometheus-kube-prometheus-prometheus.monitoring:9090` sets the URL of the metrics server from which Flagger fetches metrics during canary analysis. In this case, it is the Prometheus service address in the monitoring namespace.
Learn more
Visit the Flagger official documentation.
Install Cloudentity Helm chart
In this step, you need to install Cloudentity on Kubernetes using the kube-acp-stack Helm Chart.
Define Flagger Canary Custom Resource for Cloudentity
In this step, you'll define a Flagger Canary custom resource. This resource describes the desired state of the Cloudentity deployment and plays a key role in enabling the automated canary deployment process.
This custom resource is broken down into the following sections:
- `Canary` includes configuration about the deployment and service to watch, the ingress reference, service port details, analysis, and related webhooks.

  | Parameter | Description |
  | --- | --- |
  | provider | Specifies the service mesh provider, in this case NGINX. |
  | targetRef | Identifies the Cloudentity Deployment object that Flagger will manage. |
  | ingressRef | Identifies the Cloudentity Ingress object that Flagger will manage. |
  | service | Specifies the ports used by the Cloudentity service. |
  | analysis | Defines the parameters for the canary analysis, as described below. |
  | interval | Defines the period of a single stepWeight iteration. |
  | stepWeights | Number of steps and their traffic-routing weights for the canary service. |
  | threshold | Number of times a canary analysis is allowed to fail before it is rolled back. |
  | webhooks | Used for running validation tests before a canary is started. In this case, a pre-rollout webhook is configured to check the "alive" status. |
  | metrics | Defines the metrics that are checked at the end of each iteration. Each metric includes a reference to a MetricTemplate, a threshold, and an interval for fetching the metric. |

- `MetricTemplate` describes how to fetch the Cloudentity metrics from Prometheus. It contains a custom query that Flagger will use to fetch and analyze the metrics.

- `ServiceMonitor` is configured to monitor the canary version of Cloudentity. This enables Flagger and Prometheus to observe the performance of the canary version during the canary analysis.
Note
The configuration below is an example for the purposes of this article. For a complete list of recommended metrics to monitor during a canary release of Cloudentity, refer to the Recommended Cloudentity Metrics to be Monitored section.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: acp
spec:
provider: nginx
targetRef:
apiVersion: apps/v1
kind: Deployment
name: acp
ingressRef:
apiVersion: networking.k8s.io/v1
kind: Ingress
name: acp
service:
port: 8443
targetPort: http
portDiscovery: true
analysis:
interval: 1m
threshold: 2
stepWeights: [5, 10, 15, 20, 25, 35, 45, 60, 75, 90]
webhooks:
- name: "check alive"
type: pre-rollout
url: http://loadtester/
timeout: 15s
metadata:
type: bash
cmd: "curl -k https://acp-canary.acp:8443/alive"
metrics:
- name: "ACP P95 Latency"
templateRef:
name: acp-request-duration
thresholdRange:
max: 0.25
interval: 60s
---
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: acp-request-duration
spec:
provider:
type: prometheus
address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
query: |
avg(histogram_quantile(0.95, rate(acp_http_duration_seconds_bucket{job="acp-canary"}[{{ interval }}])) > 0) or on() vector(0)
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: acp-canary
spec:
endpoints:
- port: metrics
interval: 10s
selector:
matchLabels:
app.kubernetes.io/name: acp-canary
Save the above YAML configuration in a file named acp-canary.yaml
and apply it to your cluster:
kubectl apply -f acp-canary.yaml --namespace acp
Applying these resources creates a second Cloudentity ingress, along with additional services for the canary and primary deployments. Check whether the Canary resource has been successfully initialized by running:
kubectl get canaries --namespace acp
The status should be `Initialized`.
Learn more
Visit the Flagger documentation on NGINX Canary Deployments.
Triggering Canary Release, Simulating Latency and Observing Canary Failures
In this step, we will trigger a canary release by making a change in the Cloudentity Helm chart. We will then simulate latency to cause the canary process to fail. We will also learn how to observe the canary process.
Access Ingress on Kind with Local Domain
To access the ingress on the Kind cluster using the acp.local domain, you need to map the domain to localhost in your hosts file.
Open the hosts file in a text editor.
sudo vi /etc/hosts
Add the following line to the file:
127.0.0.1 acp.local
Save your changes and exit the text editor.
Now, you should be able to access the ingress on the Kind cluster via https://acp.local:8443 from your browser.
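The hosts-file step can be made idempotent so that repeated runs do not duplicate the entry. A minimal sketch demonstrated on a temporary copy (substitute /etc/hosts, run with sudo, for real use):

```shell
# Append "127.0.0.1 acp.local" only if it is not already present.
# Demonstrated on a temporary file; point HOSTS_FILE at /etc/hosts for real use.
HOSTS_FILE=$(mktemp)
ENTRY="127.0.0.1 acp.local"
grep -qF "$ENTRY" "$HOSTS_FILE" || echo "$ENTRY" >> "$HOSTS_FILE"
grep -qF "$ENTRY" "$HOSTS_FILE" || echo "$ENTRY" >> "$HOSTS_FILE"  # second run is a no-op
grep -c 'acp\.local' "$HOSTS_FILE"   # prints 1
```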
Make a Change in the Cloudentity Helm Chart to Trigger Canary Release
Upgrade the Helm release with a configuration change, for example, by enabling the ServiceMonitor and lowering the logging verbosity:
helm upgrade acp cloudentity/kube-acp-stack --namespace acp --set acp.serviceMonitor.enabled=true --set acp.config.data.logging.level=warn
This starts a canary version of Cloudentity:
kubectl get pods -n acp -l 'app.kubernetes.io/name in (acp, acp-primary)'
NAME READY STATUS RESTARTS AGE
acp-69cdf99895-b6xcb 1/1 Running 0 3m6s
acp-primary-868d5889d6-5xj59 1/1 Running 0 11m
- The pod named `acp-<id>` is the canary version of Cloudentity, which is analyzed before promotion.
- The pod named `acp-primary-<id>` is the currently deployed version of Cloudentity.
Simulate Latency Using the tc Command
We want to add latency just to the canary pod in order to simulate issues with the deployed new version.
- Get the Cloudentity pod's network interface index. In the example below it is 48, as indicated by `eth0@if48`:

  kubectl exec acp-69cdf99895-b6xcb --namespace acp -- ip link show eth0
  2: eth0@if48: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
      link/ether 76:07:42:2a:3e:e0 brd ff:ff:ff:ff:ff:ff

- Identify the pod's interface name on the Kind node using the index from the previous command. In this case it is `veth3ab723f4`, indicated by the number 48 next to it (note that `@if2` is not part of the interface name):

  docker exec my-cluster-control-plane ip link show
  48: veth3ab723f4@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
      link/ether da:1f:26:cf:a2:e9 brd ff:ff:ff:ff:ff:ff link-netns cni-43b9239b-e0d1-b12f-16a6-0fcfc72e8df0

- Set the latency on that interface:

  docker exec my-cluster-control-plane tc qdisc add dev veth3ab723f4 root netem delay 10ms
Note that adding latency this way affects every individual packet. If a request consists of multiple packets, the per-packet delays add up, which can be observed later on.
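As a rough back-of-envelope for why a 10ms per-packet delay can show up as a much larger response time: every packet in the exchange is delayed independently, so the delays accumulate. The packet count below is a hypothetical illustration, ignoring TCP windowing and handshake effects:

```shell
# Rough model: per-packet delay multiplied by the number of delayed packets.
delay_ms=10
packets=11   # hypothetical number of packets in one request/response exchange
echo "$(( delay_ms * packets ))ms"   # prints 110ms
```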
Observe the Canary Process
During the canary deployment, Flagger gradually shifts traffic from the old version to the new
version of Cloudentity while monitoring the defined metrics. You can observe the
progress of the canary deployment by querying the Flagger logs with kubectl
.
kubectl -n acp describe canary/acp
kubectl logs flagger-5d4f4f785-gjdt6 --namespace flagger --follow
We will start a simple curl loop to connect to Cloudentity. As the latency was set to 10ms, responses from the canary version are delayed, and over time you can observe more and more requests hitting the canary version of Cloudentity.
If you used the canary manifest from the previous steps, the canary analysis lasts 10 minutes and progressively shifts traffic to the canary version, starting at 5% in the first iteration and reaching 90% in the last.
while true; do curl -o /dev/null -k -s -w 'Response time: %{time_total}s\n' https://acp.local:8443/health && sleep 1; done
Response time: 0.006966s
Response time: 0.119855s <- canary version (static delay of 10ms)
Response time: 0.068919s
Now, let's increase the latency even more, causing the canary release to fail:
docker exec my-cluster-control-plane tc qdisc del dev veth3ab723f4 root
docker exec my-cluster-control-plane tc qdisc add dev veth3ab723f4 root netem delay 100ms
The canary response times are now above the 250ms (0.25s) threshold:
while true; do curl -o /dev/null -k -s -w 'Response time: %{time_total}s\n' https://acp.local:8443/health && sleep 1; done
Response time: 0.607907s
Response time: 0.010327s <- primary version (no delay)
Response time: 0.608345s
If the p95 latency stays above the threshold, the canary process fails, as can be observed in the Flagger logs:
{"level":"info","ts":"2023-05-18T14:20:58.628Z","caller":"controller/events.go:33","msg":"New revision detected! Scaling up acp.acp","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:21:58.617Z","caller":"controller/events.go:33","msg":"Starting canary analysis for acp.acp","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:21:58.734Z","caller":"controller/events.go:33","msg":"Pre-rollout check check alive passed","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:21:58.898Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 5","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:22:58.761Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 10","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:23:58.755Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 15","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:24:58.754Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 20","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:25:58.747Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 25","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:26:58.751Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 35","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:27:58.751Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 45","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:28:58.752Z","caller":"controller/events.go:33","msg":"Advance acp.acp canary weight 60","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:31:58.617Z","caller":"controller/events.go:45","msg":"Halt acp.acp advancement ACP P95 Latency 1.34 > 0.25","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:32:58.618Z","caller":"controller/events.go:45","msg":"Halt acp.acp advancement ACP P95 Latency 0.97 > 0.25","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:33:58.616Z","caller":"controller/events.go:45","msg":"Rolling back acp.acp failed checks threshold reached 2","canary":"acp.acp"}
{"level":"info","ts":"2023-05-18T14:33:58.749Z","caller":"controller/events.go:45","msg":"Canary failed! Scaling down acp.acp","canary":"acp.acp"}
Clean Up
Once you’re done with your testing, you may want to clean up the resources you’ve created. Since all our resources are created within the kind cluster, we just need to delete the kind cluster to clean up.
To delete the kind cluster, run the following command:
kind delete cluster --name=my-cluster
This command deletes the Kind cluster named my-cluster
, and with it all the resources within
the cluster. Please replace my-cluster
with the name of your cluster if it’s different. Be aware
that this deletes all the resources within the cluster, including any applications or services
you’ve deployed.
Also, remember to clean up any changes you’ve made to your /etc/hosts
file.
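Removing the entry can be done with sed. A sketch demonstrated on a temporary copy; for the real file you would run `sudo sed -i.bak '/acp\.local/d' /etc/hosts`:

```shell
# Delete any acp.local lines; demonstrated on a temporary copy of a hosts file.
HOSTS_FILE=$(mktemp)
printf '127.0.0.1 localhost\n127.0.0.1 acp.local\n' > "$HOSTS_FILE"
sed -i.bak '/acp\.local/d' "$HOSTS_FILE"
cat "$HOSTS_FILE"   # prints only: 127.0.0.1 localhost
```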
Recommended Cloudentity Metrics to be Monitored
When performing canary deployments, it’s crucial to monitor specific metrics to ensure the new version is performing as expected.
Here are some recommended metrics for the Cloudentity application:
Error Rates
Monitoring the rate of various HTTP error codes (5xx) can help identify issues with the new version.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: acp-error-rate
spec:
provider:
type: prometheus
address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
query: |
(100 - sum(rate(acp_http_duration_seconds_count{job="acp-canary", status_code!~"5.."}[{{ interval }}])) / sum(rate(acp_http_duration_seconds_count{job="acp-canary"}[{{ interval }}])) * 100) or on() vector(0)
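To sanity-check the arithmetic in this query: with hypothetical rates of 100 requests/s in total, of which 98/s are non-5xx, the expression reduces to the percentage of 5xx responses:

```shell
# 100 - (non-5xx rate / total rate) * 100  ->  percentage of 5xx responses
non5xx=98
total=100
awk -v a="$non5xx" -v b="$total" 'BEGIN { printf "%.1f\n", 100 - (a / b) * 100 }'
# prints 2.0, i.e. a 2% 5xx error rate
```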
Request Duration
This is the time taken to serve a request. It can be helpful in detecting performance regressions in the new version.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: acp-request-duration
spec:
provider:
type: prometheus
address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
query: |
avg(histogram_quantile(0.95, rate(acp_http_duration_seconds_bucket{job="acp-canary"}[{{ interval }}])) > 0) or on() vector(0)
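histogram_quantile estimates the quantile by locating the bucket in which the requested rank falls and interpolating linearly within it. A toy illustration of that calculation with hypothetical cumulative buckets (this is a sketch of the idea, not Prometheus's exact code path):

```shell
# Toy histogram_quantile(0.95, ...) with cumulative buckets:
#   le=0.1 -> 90, le=0.25 -> 97, le=0.5 -> 100 (total observations)
awk 'BEGIN {
  rank = 0.95 * 100        # the 95th-percentile rank: observation number 95
  lo = 0.1; hi = 0.25      # rank 95 falls in the (0.1, 0.25] bucket (90 < 95 <= 97)
  below = 90               # observations below this bucket
  in_bucket = 97 - 90      # observations inside this bucket
  printf "%.3f\n", lo + (hi - lo) * (rank - below) / in_bucket
}'
# prints 0.207, i.e. an estimated p95 latency of about 207ms
```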
Queue Pending Messages
This metric represents the number of messages currently pending in the queue. A sudden increase might indicate a problem with processing the queue’s messages.
---
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: acp-pending-messages
spec:
provider:
type: prometheus
address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
query: |
sum(acp_redis_error_count{job="acp-canary"}) or on() vector(0)
Queue Processing Time
This metric represents the time it takes to process a message from the queue. Increased processing time can indicate performance issues with the new version.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: acp-lag-messages
spec:
provider:
type: prometheus
address: http://prometheus-kube-prometheus-prometheus.monitoring:9090
query: |
avg(histogram_quantile(0.95, rate(acp_redis_lag_seconds_bucket{job="acp-canary"}[{{ interval }}])) > 0) by (group, stream) or on() vector(0)