Codemotion Amsterdam 2019 had a huge variety of talks across more than a dozen tracks. One that stood out for me was this talk on DevOps, given by Pat Hermens of Coolblue. Coolblue is one of the biggest online retailers in the Netherlands, generating revenues of €1.3bn in 2018. Since they were founded 20 years ago, they have seen exponential growth. This is reflected not only in their revenues but also in their development team, which doubles in size every 18 months and currently employs 240 developers.
The challenge of scale
Many people might naively wonder: what is it with scale? Why are 1,000 servers so much harder to handle than 10 servers? Pat shared a very pertinent quote from Edsger Dijkstra:
“Apparently, we are too trained to disregard differences in scale, to treat them as ‘gradual differences that are not essential.’ We tell ourselves that what we can do once, we can also do twice and by induction, we fool ourselves into believing that we can do it as many times as needed, but this is just not true!”
Some years back, Pat and a colleague gave a talk at a predecessor to Codemotion. That talk looked at the prerequisites for scaling your development successfully. They called this the “faster to master” checklist.
But simply checking all these boxes is not enough to ensure you can scale. You need to do a few other things.
Four stories of DevOps scaling
Pat shared four stories with us to illustrate the other requirements to ensure successful scale-up: responsibility, autonomy, ownership and failure.
Responsibility
In the past, Coolblue used a hub-and-spoke model for deploying code. The Hosting and Deployment team (effectively DevOps) sat in the centre, with each development team going through them for any decisions or knowledge about deployment. This model became a blocker, since every request had to go through a single team. As a result, informal knowledge sharing began to happen.
Andy in Team A might ask Eve in Team E how to deploy a new DB. The problem is that this informal knowledge sharing doesn’t scale at all. The solution was to turn the Hosting and Deployment team into a Centre of Knowledge; the deployment team then becomes just another development team. Couple that with an automated process for deploying new infrastructure, and you end up giving responsibility to the developers and empowering them.
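Pat didn’t show Coolblue’s tooling, but to make the idea concrete, here is a minimal sketch of what self-service infrastructure deployment can look like, assuming AWS and boto3; all names and parameters are illustrative, not Coolblue’s.

```python
# Hypothetical sketch: self-service database provisioning, so a
# development team doesn't have to queue behind a central Hosting
# and Deployment team. Names and parameters are illustrative.
import boto3

def provision_database(team: str, db_name: str) -> str:
    """Create a small PostgreSQL instance tagged with the owning team."""
    rds = boto3.client("rds")
    response = rds.create_db_instance(
        DBInstanceIdentifier=f"{team}-{db_name}",
        DBInstanceClass="db.t3.micro",
        Engine="postgres",
        AllocatedStorage=20,                    # GiB
        MasterUsername="app",
        ManageMasterUserPassword=True,          # let AWS manage the secret
        Tags=[{"Key": "team", "Value": team}],  # ownership stays explicit
    )
    return response["DBInstance"]["DBInstanceIdentifier"]

if __name__ == "__main__":
    print(provision_database("team-a", "orders"))
```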
Autonomy
Within bounds, autonomy is essential for scaling. All of Coolblue’s systems are called Vanessa-X. Their modern systems, such as Vanessa-de-Prix and Vanessa-Longstocking, are web applications that use Serilog, Splunk and DataDog to provide real-time logging, dashboards and audit trails in an easy-to-integrate fashion. However, the company is still heavily reliant on Vanessa-Optimus-Prime. This is the original system: a monolithic desktop application built in Delphi (which shows how old it is!). It runs on thousands of machines across the company and is still central to how the rest of the system works.
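Serilog is a .NET library, but the underlying idea — structured log events that tools like Splunk and DataDog can index and query — translates to any stack. A minimal Python analogue (not Coolblue’s code) might look like this:

```python
# Minimal sketch of structured logging in the style Serilog popularised:
# each event is a machine-queryable record rather than a free-text line.
# This is a Python analogue for illustration, not Coolblue's actual code.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # extra fields become searchable in Splunk/DataDog-style tools
            **getattr(record, "fields", {}),
        }
        return json.dumps(event)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("vanessa")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order dispatched", extra={"fields": {"order_id": 42, "courier": "bike"}})
```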
An autonomous team was given the challenge of working out how to add Vanessa-Optimus-Prime to the new logging, dashboard and audit system. “Easy,” they thought: each application has a collector class, so we’ll just use that to collect all the data and send it to the logging system. Unfortunately, that killed performance, so it didn’t scale. Their next thought was to use UDP to send the logs to an agent on the network, which would act as an aggregator. But then security gently explained to them why this was not a good idea (basically, unchecked UDP traffic can swamp a network and is a security nightmare). “Aha!” they thought, why not switch to TCP? That way the traffic on the network is controlled. Sadly, they hadn’t allowed for the sheer number of concurrent connections needed: the collector ran out of sockets and hung, which in turn caused all the clients to hang as they waited for connections.
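To make the attempts concrete: a fire-and-forget UDP shipper of the kind the team presumably tried could be as simple as the sketch below (hostname and port are hypothetical). Its appeal — no connection state, so the client never blocks — is exactly what makes the failure modes above easy to overlook.

```python
# Sketch of the fire-and-forget UDP approach: no connection setup, so
# the client never blocks on the collector. The downsides Pat described
# apply: no delivery guarantee, and at Coolblue's volumes the unchecked
# traffic worried the security team. Hostname and port are illustrative.
import json
import socket

AGGREGATOR = ("log-agent.internal", 5140)   # hypothetical on-network agent

def ship_event(event: dict) -> None:
    """Send one log event; silently lost if the agent is down."""
    payload = json.dumps(event).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, AGGREGATOR)

ship_event({"app": "Vanessa-Optimus-Prime", "level": "INFO", "msg": "startup"})
```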
Finally, the team stopped trying to come up with quick fixes and looked at what they could do from an infrastructure viewpoint. They came up with a solution based on AWS API Gateway, connected to a Lambda function that pre-processes the data before sending it on to Amazon CloudWatch. This solution worked well, so they submitted it to DevOps for approval. The only change DevOps made was to insert Kinesis streams to buffer the events, which makes sure the system can’t become too expensive. The system is now capable of handling about 5m log events per week.
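Pat didn’t show code, but the shape of the final pipeline — API Gateway in front, Kinesis buffering the events, a Lambda function pre-processing them into CloudWatch — might be sketched as follows; the log group, stream and field names are hypothetical.

```python
# Hypothetical sketch of the pre-processing Lambda in the pipeline
# API Gateway -> Kinesis -> Lambda -> CloudWatch. Kinesis batches the
# events, so the Lambda runs per batch rather than per log line, which
# is what keeps the cost bounded. All names are illustrative.
import base64
import json
import time
import boto3

logs = boto3.client("logs")
LOG_GROUP = "/coolblue/vanessa-optimus-prime"   # hypothetical
LOG_STREAM = "desktop-clients"                  # hypothetical

def handler(event, context):
    """Triggered by a Kinesis stream; forwards a whole batch to CloudWatch."""
    log_events = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        log_events.append({
            "timestamp": int(time.time() * 1000),
            "message": json.dumps(payload),
        })
    if log_events:
        logs.put_log_events(
            logGroupName=LOG_GROUP,
            logStreamName=LOG_STREAM,
            logEvents=log_events,
        )
    return {"forwarded": len(log_events)}
```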
Without the autonomy to try out multiple approaches, this problem would have taken much longer to solve.
Ownership
Ownership is sometimes scary. People feel exposed if they have to take ownership of important decisions. At Coolblue, the build environment is based on TeamCity. But who actually owns the environment? In fact, each team owns its own unique build environment. The first thing that happens before any build is that the team’s build.ps1 script is called. Each team can configure this script however it chooses, so pretty much any build configuration is feasible. And no one else even needs to know what you are trying!
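Coolblue’s real entry point is a PowerShell script, but the idea — CI always calls one team-owned script, and the team decides what it does — can be sketched as a Python analogue; the build steps shown are purely illustrative.

```python
# Python analogue of a team-owned build entry point: CI always invokes
# this one script, and the team edits the step list as it sees fit.
# (Coolblue's real script is PowerShell; steps here are illustrative.)
import subprocess
import sys

STEPS = [
    ["dotnet", "restore"],
    ["dotnet", "test"],
    ["dotnet", "publish", "-c", "Release"],
]

def main() -> int:
    for step in STEPS:
        print(f"-> {' '.join(step)}")
        result = subprocess.run(step)
        if result.returncode != 0:
            return result.returncode   # fail the build on the first error
    return 0

if __name__ == "__main__":
    sys.exit(main())
```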
Failure
One of Pat’s favourite books is “Failing Forward” by John C. Maxwell. Core to Maxwell’s view is that what matters is how failure is accepted and what changes it triggers. Coolblue owns a fleet of delivery vehicles, and recently they added electric bicycles to it. One of the key aspects of Coolblue’s infrastructure is the dashboard that monitors their services; if anything goes wrong, it sends out a Slack notification. During the trial phase for the bicycles, everyone in the team suddenly received such a notification late one afternoon. The on-call team was able to quickly spot that two processes were hanging. They terminated and restarted these, and within minutes all was happy again.
So, what has that to do with the delivery bikes, you might ask. Well, during the trial phase, the maximum package size allowed on a bike was set to 50cc. One of the most popular lines Coolblue deliver is a particular coffee machine, which comes in a 49cc box. That week, however, the manufacturer had chosen to add a milk frother as a free gift, and suddenly the boxes were too big. When a bike courier came to load the packages for their delivery round, they wouldn’t fit; in turn, that would put all the bikes offline and the whole system would grind to a halt. Knowing this would be a problem, the team had reduced the maximum allowed package size slightly, triggering the system to reassign all the loads properly. However, it turned out that not every instance had been updated properly. Later in the day, those instances hung and triggered the outage.
After this event, the team was required to send an RFO (reason for outage). These RFOs are reviewed each month and any important lessons are learned and shared across all other teams. If there is a major failure, then a War Room is called, where everyone in the company is able to help review the issue. Fortunately, Pat said, there has been no need for any War Rooms for a long time.
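For illustration, the monitoring loop described above — watch the services, post to Slack when one stops answering — could, at its very simplest, look like the sketch below; the webhook URL and service names are hypothetical, not Coolblue’s.

```python
# Minimal sketch of an alerting loop: poll each service's health
# endpoint and post to a Slack incoming webhook when it stops
# answering. URL and service names are hypothetical.
import json
import time
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # hypothetical
SERVICES = {"package-assigner": "http://package-assigner.internal/health"}

def notify(text: str) -> None:
    """Post a plain-text alert to a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

def check_forever(interval: float = 30.0) -> None:
    while True:
        for name, url in SERVICES.items():
            try:
                urllib.request.urlopen(url, timeout=5)
            except OSError:   # connection refused, timeout, HTTP error
                notify(f":rotating_light: {name} is not responding")
        time.sleep(interval)

if __name__ == "__main__":
    check_forever()
```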
Conclusions
Coolblue has been able to scale up pretty effectively. In part, this is down to embracing the “faster to master” checklist. But it is also down to how their company culture embraces the four key concepts of responsibility, autonomy, ownership and failure. Get these right, and scaling becomes much easier.