Canary Deployment: So Easy Even Your Grandma Could Do It!
Last Friday, the unicorn startup I work for celebrated surpassing 1 million paying Australian users. Serving such a large user base requires us to exercise extreme caution with every deployment to our production environment. Of course, running regression tests on the staging environment and then smoke testing after each production deployment is a must. Even then, bugs sometimes slip through that we didn't catch. Lately, we've started using a strategy called "canary deployment" to improve how we roll out updates that are mostly refactoring or library upgrades and introduce no new features. For new features, we always use feature flags to enable the feature for a small group of users, which I'll write about in another post.
Canary deployment is straightforward. Instead of immediately giving the update to all users, we roll it out gradually in small stages. Initially, we apply the new update to just 10% of our total traffic. If any problems occur with the update, they only affect a small group of users, which is manageable. We can quickly fix any issues that arise and wait a few days to ensure no new problems emerge. Then we gradually increase the rollout to 20% and repeat the process. After a few stages like this, we eventually release the update to 100% of our users. This method makes the rollout of any new update take weeks, but it ensures that no one loses their job over delivering bad code to users.
Now, how do we apply this strategy? My company utilizes Istio, a widely recognized service mesh solution that integrates seamlessly with Kubernetes. If you're a software engineer reading this, chances are high that your company is also leveraging Istio for similar purposes. Why mention Istio? Because it makes canary deployment so easy to apply. A service mesh like Istio works on a simple idea: it wraps each of the services in your cluster within an Envoy proxy, which controls all the traffic in and out of your services. Thanks to this capability, Istio can handle common tasks like authorization, CORS (Cross-Origin Resource Sharing), whitelisting, and, importantly, traffic routing.
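As a quick illustration of the sidecar idea, Istio can inject the Envoy proxy automatically for every pod in a namespace with a single label. A minimal sketch (the application namespace name here is just a placeholder):

```yaml
# Namespaces labeled istio-injection: enabled get an Envoy sidecar
# automatically injected into every pod scheduled in them.
apiVersion: v1
kind: Namespace
metadata:
  name: application   # placeholder namespace for our web service
  labels:
    istio-injection: enabled
```

Once the label is in place, every pod deployed into that namespace starts with an Envoy container alongside the application container, and all of the pod's traffic flows through it.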
Traffic routing in our company's cluster is straightforward. A service can be accessed externally through an ingress gateway or internally from other services. In either case, we configure one or more VirtualServices to specify routing rules for these services. A VirtualService isn't a service itself; it's a custom Kubernetes resource that Istio provides. It acts as a set of rules applied to manage traffic to and from our services.
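Here's a minimal sketch of a weighted routing rule in a VirtualService. The web service name and the canary-web destination host are illustrative placeholders:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web                 # placeholder name for our web service
  namespace: application
spec:
  hosts:
    - web.application.svc.cluster.local    # assumed entry host for the service
  http:
    - route:
        - destination:
            host: stable-web.application.svc.cluster.local   # stable containers
          weight: 90
        - destination:
            host: canary-web.application.svc.cluster.local   # assumed canary service
          weight: 10
```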
The above snippet is an example of routing rules in a VirtualService. In this configuration, we direct 90% of the traffic to stable-web.application.svc.cluster.local, which represents our stable containers, and 10% to the canary version.
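Moving to the next stage is then just a matter of editing the weights and re-applying the manifest (for example with kubectl apply). A sketch of the http route section at the 20% stage, under the same placeholder hostnames:

```yaml
# Same VirtualService, second stage of the rollout: shift another
# 10% of traffic to the canary once the first stage looks healthy.
http:
  - route:
      - destination:
          host: stable-web.application.svc.cluster.local
        weight: 80
      - destination:
          host: canary-web.application.svc.cluster.local
        weight: 20
```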
Pretty straightforward, isn't it? This is a basic overview of implementing canary deployment using Istio. The post title is just a joke, of course; the actual implementation involves more complex details than I can cover in a single post. That's all for today. See you in the next post!
Hi, as you are setting the Istio weight, when a user request comes in, will it be randomly routed to the canary or non-canary version? If yes, and the user refreshes their page several times, will they see the canary feature turning on and off unpredictably?
Ah, I forgot to mention one thing: canary deployment is usually applied for refactors or library upgrades that introduce no new features. For new features we use another technique called feature flags.
Thanks for asking. I updated the content.