How SaaS Platforms Can Benefit From Canary Deployments with Smoke Tests

Learn how canary deployments work at Cloudentity and how organizations with products delivered as SaaS can benefit from deploying their platforms in a similar way with smoke tests included at every step.

What Canary Deployment Is

Canary deployment is a release process in which a new version of a product is initially available to only a subset of users. During the deployment, this subset grows through predefined thresholds, and only while all acceptance criteria and quality gates are fulfilled. The deployment is finished when 100% of users are using the new version. When the quality gates aren’t met, a rollback takes place and the new version is withdrawn. In that case, all users stay on the same old version until the next canary deployment starts.
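The process described above can be sketched as a small loop. This is only an illustrative model, not how Flagger or Cloudentity implement it: the names `run_canary` and `gate_passes` are made up for this example, and the quality gate is reduced to a single callback.

```python
from typing import Callable, Sequence


def run_canary(
    step_weights: Sequence[int],
    gate_passes: Callable[[int], bool],
) -> tuple[str, int]:
    """Walk through the traffic steps and return (outcome, final canary weight)."""
    for weight in step_weights:
        # Shift `weight` percent of users to the new version, then check the gates.
        if not gate_passes(weight):
            # A failed gate triggers a rollback: everyone is back on the old version.
            return ("rolled_back", 0)
    # Every gate passed, so the new version now serves 100% of users.
    return ("promoted", 100)


# Hypothetical gates: one that always passes, one that breaks at 50% traffic.
happy_path = run_canary([10, 50], lambda weight: True)
failing_path = run_canary([10, 50], lambda weight: weight < 50)
```

In this model a single failed gate ends the deployment, which matches the all-or-nothing outcome described above: either all users end up on the new version, or all of them stay on the old one.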

Blue Green Deployments vs. Canary Deployments

Previous Approach - Blue Green Deployments

The strategy we used previously is called Blue Green Deployment. When something triggered a deployment, we set up the new version on the side, not available publicly. We ran a limited set of API smoke tests in isolation against the new version of the main Cloudentity service. Based on the test results, we promoted the new version of Cloudentity and made it publicly available. The main difference from the current approach is that, in the happy-path deployment flow, we promoted Cloudentity for all users at the same time, which was a global change in all regions. The current approach changes that.

Flagger Blue Green

Current Approach - Canary Deployments

At Cloudentity, we use Flagger in our Continuous Deployment process. Flagger gives us a choice of deployment strategies.

Try it now

Using Cloudentity on-prem? You can set up canary deployments yourself.

We’ve since prepared a more extensive deployment process based on the Canary Release strategy. The approach relies on rolling out the new version gradually, starting with a small number of users. We’ve divided it into two parts, which we call Internal Canaries and Public Canaries. When something triggers a deployment of a new application version, the internal canaries start. When the internal deployment succeeds, the public ones start.

Internal and Public Canaries

Internal Canaries

An Internal Canary is an isolated deployment that users cannot access. In other words, it isn’t available publicly.

When we want to deploy a new version, we first prepare the new version of the application on the side. Once it is created, we redirect some traffic to it. In internal canaries, 99% of the traffic is generated by tests, so almost all metrics in this deployment come from the internal smoke tests. Based on those metrics, we decide whether to increase the traffic weight, promote the application, or stop the promotion and abandon the newest version. Thanks to Flagger, the traffic weight increases step by step. The progressive, percentage-based traffic redirection is set to [1, 2, 3, 4, 5, 96, 97, 98, 99, 100], which means we’ve configured 10 Flagger iterations. In the first 5 iterations, the majority of the traffic goes through the currently deployed application version. In the second half of the canary, the traffic is routed to the newly deployed version, the one that we want to promote.

Why such a percentage?

Before the canary starts, we promote a new version of the tests, so the internal smoke tests are already in the new version. This lets us check backward compatibility: we run the new internal smoke tests against the currently deployed version of Cloudentity. That happens in the first 5 iterations. Once we confirm that there are no backward-compatibility-breaking changes, we start running the tests against the new version of the application. Once the internal canary succeeds, we can start the Public Canaries process.
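To make the two phases easier to see, the snippet below labels each internal step by the version that receives most of the test traffic. It is a sketch built only from the step weights quoted above; the helper name `version_under_test` and the 50% cut-off are assumptions made for illustration.

```python
# Step weights for the internal canary, as listed in this article.
INTERNAL_STEP_WEIGHTS = [1, 2, 3, 4, 5, 96, 97, 98, 99, 100]


def version_under_test(canary_weight: int) -> str:
    """Return which version receives most of the smoke-test traffic at this step."""
    # Assumption: "most of the traffic" simply means more than half of it.
    return "new" if canary_weight > 50 else "current"


# Pair every step with the version the new smoke tests mostly exercise.
plan = [(weight, version_under_test(weight)) for weight in INTERNAL_STEP_WEIGHTS]

for weight, version in plan:
    print(f"canary weight {weight:>3}% -> tests mostly hit the {version} version")
```

The first five steps exercise the new tests against the current version (the backward-compatibility check), and the last five exercise them against the new version.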

Flagger Canary Release

Public Canaries

Public canaries are the deployment process for the publicly available applications, the ones all users are using. The mechanism of the public canary deployment is almost identical to the internal one. To determine the success or failure of the deployment, we use metrics generated by the results and traffic of the common smoke tests and by the traffic from real customers.

The progressive, percentage-based traffic redirection also differs from the internal canary. Here we use more natural values: [5, 10, 15, 20, 25, 35, 45, 60, 70, 90]. The idea is to allow a small number of users to use the newer version of the application at first. As long as our quality gates pass with every iteration, we keep increasing the traffic redirection percentage and allow more users to use the newer version. We increase it up to 90%. If the quality gates also pass at that value, the promotion is almost finished. The next step is to scale down the current version of the application. After that, all users and common smoke tests hit the newer version. During the canary process (even at the beginning), a user who starts hitting the new version of the application keeps hitting the new version for the rest of their active session.
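The session behavior described in the last sentence can be modeled in a few lines. This is a simplified sketch, not the actual routing layer: the `StickyRouter` class is hypothetical, and it tracks sessions in memory, where a real setup would more likely rely on something like a cookie set by the load balancer or service mesh.

```python
import random


class StickyRouter:
    """Weighted routing where a session that reached the canary stays on it."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)   # seeded so the example is repeatable
        self._pinned: set[str] = set()    # sessions already served by the canary

    def route(self, session_id: str, canary_weight: int) -> str:
        # A session that has already seen the new version never goes back.
        if session_id in self._pinned:
            return "canary"
        # Otherwise pick the canary with a probability of `canary_weight` percent.
        if self._rng.random() * 100 < canary_weight:
            self._pinned.add(session_id)
            return "canary"
        return "primary"


router = StickyRouter()
first = router.route("session-a", canary_weight=100)  # always lands on the canary
later = router.route("session-a", canary_weight=0)    # still pinned to the canary
other = router.route("session-b", canary_weight=0)    # never picked at 0% weight
```

Sessions that are still on the old version are re-evaluated on every request, so they can move to the new version as the weight grows, but never the other way around.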

Flagger Canary Release

Why It’s Good to Have Internal and Public Canaries

Assume we’ve introduced a bug or a backward-compatibility issue. Without Internal Canaries, only Public Canaries would start, and at the beginning of the deployment we could expose 5% of the users to a broken environment. We want to avoid such situations. Internal Canaries give us yet another quality gate that protects users from bugs and unwanted behavior in our environment. In our case, the tests that run during the Internal Canaries should detect the bug and prevent the Public Canary deployment from starting. Of course, we follow the shift-left strategy: the majority of scenarios are executed at the pipeline/PR level, where we have many stages with unit, integration, and acceptance tests. On the SaaS, we focus on the smoke tests.

Smoke Tests in Canary Deployments

Smoke tests play an important role in Cloudentity Canary Deployments. Thanks to smoke testing during the Internal and Public Canary Deployments, we are able to check whether the version is stable and whether the defined quality gates are passed.

Internal Smoke Tests in Internal Canary Deployments

We don’t run internal smoke tests all the time; we only run them on demand. We scale the tests up only before the internal canary process and run them for a few minutes. Once the decision is made on whether the canary failed or passed, we scale them down, as we don’t need them again until the next internal canary deployment. During the internal canary, the tests run constantly against the whole stack. The smoke tests consist of UI and REST scenarios.

Previously, in Blue-Green deployments, we ran REST scenarios only, in isolation against one application. The difference is that we run the internal smoke tests against the whole ecosystem. The scenarios are identical to those of the common smoke tests, which run all the time, 24/7, in every region.

Common Smoke Tests in Public Canary Deployments

Unlike internal smoke tests, common smoke tests run all the time. Their responsibility is to constantly monitor the quality of the most valuable flows of our product in every region on the Stage and Production environments. If some tests start to fail during the canary, the promotion stops and the new version of the application is dropped. The tests keep running, however, and they should turn green again, as the previous version was stable.

Our smoke tests should be quick and test the most valuable functionalities of the product. For example, using UI and REST, we check:

  • registration flow

  • login/authorization flows

  • visit various portal flows

  • specific flows which use fission or permissions systems (spicedb/grpc) under the hood

  • and more.

During each test run, we generate metrics that can be used for various quality gates. The tests themselves generate passed/failed metrics for each test, and they also generate traffic. Based on that traffic, we can also fetch our product metrics, such as the number of 5xx status codes per second and the latency.
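As a rough illustration of how such metrics can be derived from raw data, the sketch below computes a failed-test ratio, a 5xx ratio, and a P90 latency from plain Python lists. The function names are made up for this example, and the percentile uses the simple nearest-rank method; a monitoring system such as Prometheus would typically compute these values from counters and histograms instead.

```python
def failed_test_ratio(results: list[bool]) -> float:
    """Share of failed tests, where each entry is True for a passed test."""
    return results.count(False) / len(results) if results else 0.0


def ratio_5xx(status_codes: list[int]) -> float:
    """Share of responses with a 5xx status code."""
    if not status_codes:
        return 0.0
    server_errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return server_errors / len(status_codes)


def p90_nearest_rank(latencies_ms: list[float]) -> float:
    """90th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.9 * n), which avoids floating-point rounding.
    rank = (9 * len(ordered) + 9) // 10
    return ordered[rank - 1]
```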

The following metrics determine whether the canary passes or fails:

  • the ratio of failed tests to all tests, taken from the internal or common smoke test metrics depending on which canary deployment is in progress

  • P90 latency of Cloudentity or Cloudentity-internal and their canary equivalents

  • the ratio of 5xx status codes of Cloudentity or Cloudentity-internal and their canary equivalents

  • basic alive endpoint checks of Cloudentity or Cloudentity-internal and their canary equivalents

  • the number of pending messages, from the Redis perspective, should be less than 5

  • message processing time, from the Redis perspective, should be under 15 seconds
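The list above can be read as a set of checks evaluated at every canary iteration. The sketch below shows one way to express that. Only the two Redis limits come from this article; the remaining thresholds, the `CanaryMetrics` type, and the `failed_gates` function are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryMetrics:
    failed_test_ratio: float         # failed smoke tests / all smoke tests
    p90_latency_ms: float
    error_5xx_ratio: float
    alive: bool                      # result of the basic alive endpoint check
    redis_pending_messages: int
    redis_processing_seconds: float


# Assumed thresholds: the article does not state these values.
MAX_FAILED_TEST_RATIO = 0.0
MAX_P90_LATENCY_MS = 500.0
MAX_5XX_RATIO = 0.01
# Limits stated in the article.
MAX_REDIS_PENDING = 5
MAX_REDIS_PROCESSING_SECONDS = 15.0


def failed_gates(m: CanaryMetrics) -> list[str]:
    """Return the names of all gates that did not pass (empty list = healthy)."""
    checks = {
        "failed_test_ratio": m.failed_test_ratio <= MAX_FAILED_TEST_RATIO,
        "p90_latency_ms": m.p90_latency_ms <= MAX_P90_LATENCY_MS,
        "error_5xx_ratio": m.error_5xx_ratio <= MAX_5XX_RATIO,
        "alive": m.alive,
        "redis_pending_messages": m.redis_pending_messages < MAX_REDIS_PENDING,
        "redis_processing_seconds": m.redis_processing_seconds < MAX_REDIS_PROCESSING_SECONDS,
    }
    return [name for name, passed in checks.items() if not passed]
```

Returning the names of the failed gates, instead of a single boolean, makes it easier to see why a canary was stopped.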

Test Reports on AWS S3 Buckets

At Cloudentity, we can monitor test results based on Prometheus metrics using Grafana. Another way to check and analyze the results of the tests run during canary deployments is to use dynamically generated Allure reports. We’ve decided to store them in AWS S3 buckets, with a separate bucket per region. One bucket contains reports from various smoke tests: next to the reports from common and internal smoke tests, there are others as well.

An example is the authorizers’ smoke tests. The mechanism for generating and sending Allure reports is generic. At the root of each S3 bucket, we dynamically build a JSON navigation file that contains links to all available test iterations. Storing Allure reports in S3 buckets is cheap, and we have a retention mechanism out of the box. Below is an example part of the navigation file from the internal smoke tests:

{
  "description": "The latest report is on the top of this JSON file. In order to find your allure report, please figure out Project Name and Iteration Number. More details you can find in QA section in the confluence.",
  "projects": [
    {
      "name": "3094058-2022-12-12-08-49",
      "iterationUris": [
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/3094058-2022-12-12-08-49/13/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/3094058-2022-12-12-08-49/12/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/3094058-2022-12-12-08-49/11/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/3094058-2022-12-12-08-49/10/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/3094058-2022-12-12-08-49/9/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/3094058-2022-12-12-08-49/8/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/3094058-2022-12-12-08-49/7/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/3094058-2022-12-12-08-49/6/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/3094058-2022-12-12-08-49/5/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/3094058-2022-12-12-08-49/4/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/3094058-2022-12-12-08-49/3/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/3094058-2022-12-12-08-49/2/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/3094058-2022-12-12-08-49/1/index.html"
      ]
    },
    {
      "name": "261f402-2022-12-09-19-23",
      "iterationUris": [
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/261f402-2022-12-09-19-23/12/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/261f402-2022-12-09-19-23/11/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/261f402-2022-12-09-19-23/10/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/261f402-2022-12-09-19-23/9/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/261f402-2022-12-09-19-23/8/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/261f402-2022-12-09-19-23/7/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/261f402-2022-12-09-19-23/6/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/261f402-2022-12-09-19-23/5/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/261f402-2022-12-09-19-23/4/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/261f402-2022-12-09-19-23/3/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/261f402-2022-12-09-19-23/2/index.html",
        "https://acp-tests.tools.us-east-1.us.authz.cloudentity.io/reports/acp-tests-internal/261f402-2022-12-09-19-23/1/index.html"
      ]
    }
  ]
}
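A navigation file with this shape could be maintained with a helper like the one below. This is a sketch rather than the actual generator: `add_report`, `build_report_uri`, and the `reports.example.com` host are hypothetical, and uploading the result to S3 is left out.

```python
import json


def build_report_uri(base_url: str, project: str, iteration: int) -> str:
    """Build a report link following the <base>/<project>/<iteration>/index.html layout."""
    return f"{base_url}/{project}/{iteration}/index.html"


def add_report(nav: dict, project: str, uri: str) -> dict:
    """Record a new report iteration, keeping the newest entries at the top."""
    projects = nav.setdefault("projects", [])
    for entry in projects:
        if entry["name"] == project:
            # Known project: the newest iteration goes first.
            entry["iterationUris"].insert(0, uri)
            return nav
    # New project: it becomes the first entry in the file.
    projects.insert(0, {"name": project, "iterationUris": [uri]})
    return nav


BASE_URL = "https://reports.example.com/reports/acp-tests-internal"  # hypothetical host
OLDER = "261f402-2022-12-09-19-23"
NEWER = "3094058-2022-12-12-08-49"

nav = {"projects": []}
for iteration in (1, 2):
    add_report(nav, OLDER, build_report_uri(BASE_URL, OLDER, iteration))
add_report(nav, NEWER, build_report_uri(BASE_URL, NEWER, 1))

navigation_json = json.dumps(nav, indent=2)  # content of the navigation file
```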

Backward-Compatible Tests

With Canary Releases, our smoke tests need to be backward compatible. During Canary releases, end users are non-deterministically assigned to the old or the new version of the product. The same applies to the smoke tests, especially the UI smoke tests.

Sometimes we want to enhance the user experience of our main functionalities to make life easier for our end users. For example, one main flow changes from the UI perspective. The smoke tests that cover it should then support both versions of that flow and determine intelligently which version of the test should run.

Every Quality Assurance Engineer knows that smoke tests must be written to support both versions of the application. When the team is going to deliver a feature that changes an existing main UI flow, QA and the Frontend Developer need to communicate. QA should write an updated test scenario, and the Frontend Developer should expose a state or flag in the UI. Based on that flag, QA can determine which flow a particular test should run: the one for the old or the one for the new version of the product. When both test versions and the UI state are implemented, our smoke tests can determine during the canary release which version they are running against, and then proceed with the appropriate test flow. In other words, the requirement that new tests run against both the old and the new version is fulfilled. However, if support for the old and new versions is not correctly implemented in the test, the deployment process fails at the internal canaries level. The team is notified about the failure and reacts by fixing the backward-compatibility issue.
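The flag-based selection can be sketched as follows. Everything here is illustrative: the `login-flow-version` key, its `v1`/`v2` values, and the two flow functions are hypothetical stand-ins for whatever state the frontend exposes and whatever steps the real UI tests perform.

```python
from typing import Callable, Mapping


def legacy_login_flow() -> str:
    # Placeholder for the test steps that cover the old UI flow.
    return "legacy"


def redesigned_login_flow() -> str:
    # Placeholder for the test steps that cover the new UI flow.
    return "redesigned"


# Hypothetical flag values mapped to the matching test flow.
FLOWS: dict[str, Callable[[], str]] = {
    "v1": legacy_login_flow,
    "v2": redesigned_login_flow,
}


def select_flow(ui_state: Mapping[str, str]) -> Callable[[], str]:
    """Pick the test flow that matches the version the UI reports."""
    # Assumption: a UI without the flag is the old version.
    version = ui_state.get("login-flow-version", "v1")
    if version not in FLOWS:
        # An unknown value means the test was not updated for this UI version.
        raise ValueError(f"no test flow for UI version {version!r}")
    return FLOWS[version]


# During a canary, the same test may meet either version of the UI.
old_ui_result = select_flow({})()
new_ui_result = select_flow({"login-flow-version": "v2"})()
```

Failing loudly on an unknown flag value mirrors the behavior described above: a test that does not support the deployed UI version should fail at the internal canary stage instead of silently running the wrong flow.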

Summary

Canary deployments are a software release strategy in which new features or changes are rolled out progressively to a small subset of users before a full-scale deployment. This strategy is combined with smoke tests, which are basic tests run at every deployment stage to ensure the functionality of the most critical parts of an application. The combination of these two strategies offers organizations an enhanced approach to maintaining platform security and stability.

Beyond mitigating risks, this strategy also optimizes resources, improves user experience, and strengthens stakeholder confidence. As the digital landscape becomes increasingly complex, such approaches will be indispensable for organizations aiming for seamless, secure, and reliable software deployments.


Updated: Aug 16, 2023