Kubernetes Guide

Clusters & Environments

It doesn't matter if you have a dedicated platform engineering team or if you're a small team where everybody works on application and infrastructure tickets every now and then. The ability to work on each independently is key to preventing one task from blocking other tasks and team members.

To ensure this, we implement a model with two clusters, also referred to as a cluster pair throughout this guide, and clear roles as to which application environments go where. But before we get to that, let's take a look at the status quo.

The legacy approach

On bare metal or virtual machines, a very common approach is to have a cluster per environment and/or per application. Each cluster has one or more hosts acting as load balancers, app servers or database servers dedicated to one of the environments. Teams usually have a development cluster that runs the development application environment, a staging cluster that runs the staging environment and a production cluster that runs the production environment.

The problem with coupling clusters and application environments this way is that it introduces dependencies between infrastructure and application tickets and risks team members blocking each other when, for example, the development cluster and the development application environment are worked on at the same time.

Separating clusters and environments and dropping the legacy approach of one cluster per environment is the single most effective change to increase application delivery velocity.

The Ops-cluster and the Apps-cluster

To solve this problem, this guide implements a pair of clusters. The two separate clusters allow testing changes to the cluster configuration while, at the same time, working on any application environment. To fight the force of habit and prevent people from assuming that the application development environment goes onto the development cluster, we call these clusters the Ops-cluster and the Apps-cluster.

+------------------------+     +--------------------------------------------------------------------+
| Ops-cluster            |     | Apps-cluster                                                       |
|                        |     |                                                                    |
|  +------------------+  |     |  +------------------+  +------------------+  +------------------+  |
|  | [TEST]-env       |  |     |  | [PROD]-env       |  | [TEST]-env       |  | [PROD]-env       |  |
|  | cluster-services |  |     |  | cluster-services |  | application      |  | application      |  |
|  +------------------+  |     |  +------------------+  +------------------+  +------------------+  |
+------------------------+     +--------------------------------------------------------------------+

The Ops-cluster is where changes to the cluster, its configuration and its cluster services are tested before they are applied to the Apps-cluster. Note how, in the diagram above, the cluster services test environment lives on the Ops-cluster while all application environments live on the Apps-cluster. Cluster services are services that are shared between applications and environments. Good examples of cluster services are ingress controllers, monitoring, logging and service meshes.
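As a rough sketch of what this separation could look like in terms of namespaces on the Apps-cluster, the manifests below declare one namespace for a shared cluster service and one per application environment. The names (ingress-nginx, myapp-test, myapp-prod) and labels are made up for this illustration, not prescribed by the guide.

apiVersion: v1
kind: Namespace
metadata:
  name: ingress-nginx           # shared cluster service used by all application environments
  labels:
    purpose: cluster-service
---
apiVersion: v1
kind: Namespace
metadata:
  name: myapp-test              # the application's [TEST]-env
  labels:
    environment: test
---
apiVersion: v1
kind: Namespace
metadata:
  name: myapp-prod              # the application's [PROD]-env
  labels:
    environment: prod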

The Apps-cluster is where all applications and their environments live. If the Apps-cluster is down, you stop what you're doing and fix it. If your colleagues can't work, that's a production-level incident.

By having two clusters with clearly defined and distinct purposes, you can move fast and break things on the Ops-cluster and, at the same time, move fast and break things in any of the application environments on the Apps-cluster, without one blocking the other.

Security and compliance

A common argument for separating application environments onto different clusters is increased security or compliance requirements.

Let's tackle the compliance argument first. Usually it goes something along the lines of: compliance requires us to keep dev and prod strictly separated, and the only way to do this is to use separate clusters. And while it's probably true that compliance requires strict separation, it usually does not dictate how to implement that separation. It usually only requires state-of-the-art measures. And when migrating to Kubernetes, separate clusters are far from the only way and, more importantly, certainly not the state-of-the-art one.

Kubernetes allows declaring security-related configuration, including firewall rules, quotas and more, just like it allows declaring runtime configuration. By declaring every aspect of an application's runtime environment the same way and in the same place, teams gain full visibility.
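For instance, a resource quota for a single application environment's namespace is just another manifest that can be version controlled and reviewed. The namespace name and the limits below are placeholders, not recommendations.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: myapp-test         # hypothetical application environment namespace
spec:
  hard:
    requests.cpu: "4"           # total CPU requests allowed in this namespace
    requests.memory: 8Gi        # total memory requests allowed in this namespace
    pods: "20"                  # maximum number of pods in this namespace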

Let's look at an example of how to approach this step by step. First, make sure you get your teams and namespaces set up right. Then, get RBAC configured for teams and namespaces. The authentication and authorization section has the details on how to do this without killing your devops culture. With namespaces and RBAC in place, look into network policies to control network ingress and egress for an application environment's namespace. If you are required to ensure that dev and prod workloads don't run on the same nodes, you can either use anti-affinity to instruct the scheduler not to put dev and prod workloads on the same nodes while keeping scheduling highly dynamic, or, if you prefer a more static separation, look into different node pools.
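Below is a minimal sketch of the network policy and anti-affinity pieces, reusing the hypothetical namespace and label names from the earlier examples. The network policy restricts ingress to traffic from pods in the same namespace; adjust it to whatever your compliance rules actually require.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only
  namespace: myapp-prod         # hypothetical production environment namespace
spec:
  podSelector: {}               # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}       # only allow traffic from pods in the same namespace

And an excerpt from a hypothetical [PROD]-env pod template that tells the scheduler never to co-locate it with pods labelled as test workloads:

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: kubernetes.io/hostname   # never share a node ...
          namespaceSelector: {}                 # ... considering pods in all namespaces ...
          labelSelector:
            matchExpressions:
              - key: environment
                operator: In
                values: ["test"]                # ... with pods labelled environment=test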

The above gives you declarative configuration with version control and reviewability, and continuous enforcement through the Kubernetes control loop.

If peer reviews are not enough in your environment to enforce security policies, you can additionally look into admission controllers, which allow you to deny manifests that don't meet certain criteria at the Kubernetes API level.
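As one option, recent Kubernetes versions ship a built-in ValidatingAdmissionPolicy that expresses such criteria as CEL rules; external policy engines such as OPA Gatekeeper or Kyverno fill the same role on older clusters. The sketch below, with made-up names and a deliberately simple rule, rejects Deployments that don't carry an environment label.

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-environment-label
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    - expression: "has(object.metadata.labels) && 'environment' in object.metadata.labels"
      message: "Deployments must carry an environment label."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-environment-label
spec:
  policyName: require-environment-label
  validationActions: ["Deny"]   # reject non-compliant manifests at the API level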

The benefits of transparency, reviews, validation and enforcement far outweigh the risk of potential bugs in the Kubernetes code. Manual steps are prone to human error, their comparative lack of visibility makes mistakes less likely to be spotted quickly, and they increase the risk of being exploited by malicious actors.

So not only do the manual steps you depend on in your day-to-day work slow you down, they also pose a higher risk. Shared responsibility, automation and transparency through reviews enable high velocity and improve security.

Multi-region and multi-cloud

There are good reasons to have multiple clusters; just remember that separating your application environments is not one of them.

Among the good reasons are multi-region and multi-cloud setups. To keep latency between control plane components small, it is not recommended to span Kubernetes clusters across wide area network boundaries. In cloud provider terms, clusters are usually spread across multiple zones within one region, but not across regions.

So in cases where you need compute resources in multiple geographical regions for lower end-user latency or higher availability, or even across multiple cloud providers for even higher availability, multiple clusters are the right way to go.

Similar to how cloud providers keep regions independent of each other, you should also keep the clusters independent. And remember, every new region or provider always requires a new Ops- and Apps-cluster pair.