What happened
The apology letter published by the current President and CEO of Rogers Communications, Tony Staffieri, gives some insight into what happened on Friday, July 8th, 2022: they “narrowed the cause to a network system failure following a maintenance update in our core network, which caused some of our routers to malfunction early Friday morning”.
Cloudflare, a company that can observe a large share of worldwide internet traffic because its main service is acting as a reverse proxy between the web and specific websites, published more details after observing traffic from Rogers networks drop off. They also pointed out the root cause by name: the Border Gateway Protocol (BGP).
Putting one and one together, one arrives at the conclusion that there was a scheduled maintenance update on Friday morning which had unplanned side effects, causing routers to malfunction. These routers were responsible for running the Border Gateway Protocol which, in simple terms, tells the rest of the internet that Rogers’ networks exist, are up and running, and how to reach the nodes inside them.
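To make the role of BGP a bit more concrete, here is a toy Python sketch (not real BGP; the prefixes and AS numbers are documentation values used purely for illustration) of what it means for a network’s route announcements to disappear from the rest of the internet’s routing tables:

```python
# Toy model only: real BGP involves peering sessions, policies, and much more.
# The point: an autonomous system (AS) announces the IP prefixes it owns; if
# those announcements are withdrawn, nobody knows how to reach the AS anymore.

# A simplified "internet routing table": prefix -> AS path (origin AS last)
routing_table = {
    "203.0.113.0/24": ["AS64500", "AS64496"],   # made-up prefix originated by AS64496
    "198.51.100.0/24": ["AS64500", "AS64511"],  # made-up prefix originated by AS64511
}

def withdraw_routes(table: dict, origin_asn: str) -> dict:
    """Drop every prefix whose AS path originates at the given AS."""
    return {prefix: path for prefix, path in table.items() if path[-1] != origin_asn}

# If AS64496 withdraws its announcements, traffic for 203.0.113.0/24 has nowhere to go.
routing_table = withdraw_routes(routing_table, "AS64496")
print(routing_table)  # only 198.51.100.0/24 remains reachable
```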
If it sounds familiar to you, you are right
BGP routes traffic both within autonomous systems and between them. Facebook’s/Meta’s outage in October 2021 was mainly caused by a problem with the routes between Facebook’s own data centers, as communicated in their post-mortem. There, a small misconfiguration made it into the system, causing a cascading effect of systems not being able to communicate with each other.
Remediation promises and why they smell
Before getting into the promises made by Rogers on how to prevent such a disaster in the future, we would like to point out that BGP itself is an illustration of the underlying issue. It is also called the three-napkin protocol, referencing the initial design documents: three ketchup-stained napkins. The Washington Post has a great article on this. An adapted quote from Nobel laureate Milton Friedman applies: “There is nothing more permanent than a temporary solution”. BGP was supposed to be that temporary solution, but it was expanded over the years and patched where possible. The last major hit came in 2014, when scalability issues surfaced and were, once again, patched over.
This is an ever-repeating story, and it is also the reason we still have so many COBOL-run systems out there today.
People do not want to touch the essence of this protocol, as they do not want to mess with a running system. And this is where the problem lies.
The last three bullet points of the previously mentioned letter, which summarize what Rogers is planning to do about it, are:
- Fully restore all services: While this has been nearly done, we are continuing to monitor closely to ensure stability across our network as traffic returns to normal.
- Complete root cause analysis and testing: Our leading technical experts and global vendors are continuing to dig deep into the root cause and identify steps to increase redundancy in our networks and systems.
- Make any necessary changes: We will take every step necessary, and continue to make significant investments in our networks to strengthen our technology systems, increase network stability for our customers, and enhance our testing.
While item 1 appears canonical, item 2 is worth talking about. A lack of redundancy in networks and systems is not the issue here. You can double the number of routers, but if the faulty update hits all of them, they will all go down. This approach, commonly called “wallet tuning”, would not have prevented what happened to Rogers here.
Item 3 leaves room for interpretation, but reads mainly as a reinforcement of the previous item.
What impressions are left with technical readers, given the information provided
The outage outline and the proposed remediation leave four impressions on an expert reader. The following subsections discuss each one.
Impression 1: A rollback was not possible
The apology letter mentioned “We disconnected the specific equipment and redirected traffic [...]”. This means there was no rollback path for the change, or rolling back was so involved that they would rather take machines offline. We would like to emphasize that this is not an unsolved problem in computer science. Rollbacks, and good mechanisms to perform them, are common practice, but they appear not to have been applied here. Modern scalable software orchestration systems such as Kubernetes show how this can be achieved, and Kubernetes is open source. Most routers run a version of Linux, so a containerized approach is possible.
Hence, despite the technology being available, the updates performed on this critical infrastructure do not appear to be designed to be rolled back.
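As a minimal sketch of the idea, assuming hypothetical apply_config and is_healthy helpers that would, in reality, talk to a router’s management interface or an orchestrator such as Kubernetes:

```python
# Minimal sketch of an update step that can automatically be rolled back.
# apply_config and is_healthy are hypothetical placeholders.

def apply_config(device: str, config: str) -> None:
    """Placeholder: push a configuration version to the device."""
    print(f"applying {config} to {device}")

def is_healthy(device: str) -> bool:
    """Placeholder: e.g. check that BGP sessions are up and routes are announced."""
    return True

def update_with_rollback(device: str, new_config: str, known_good_config: str) -> bool:
    apply_config(device, new_config)
    if is_healthy(device):
        return True                          # keep the new configuration
    apply_config(device, known_good_config)  # revert to the last known-good state
    return False

if __name__ == "__main__":
    ok = update_with_rollback("router-01", "bgp_v2.conf", "bgp_v1.conf")
    print("update kept" if ok else "rolled back")
```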
Impression 2: Staged update propagation to reduce blast radius was not applied
Even with rollback mechanisms in place, a rollback may be too slow. This is why staged updates/rollouts are common practice: an update is applied to only a small subset of the network at first, to determine whether there are potential issues. Every major cloud provider supports this (AWS example). The idea is simple: define indicators that a service is healthy, and let the update sit on a subset of your infrastructure until all of these indicators give a green light. Afterwards, propagate further until you reach 100% of your infrastructure. If an issue shows up in between, roll back the affected nodes. The damage is thus contained to certain areas.
Needless to say, the graphs of the Cloudflare blog post show that everything happened at once. This indicates that no staged update mechanism was applied.
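A minimal Python sketch of such a staged rollout could look as follows; deploy, rollback, and healthy are hypothetical placeholders for real deployment and monitoring tooling:

```python
# Sketch of a staged rollout: apply the update to a small fraction of the fleet,
# verify health, and only then widen the blast radius.
from typing import Sequence

STAGES = (0.01, 0.10, 0.50, 1.00)  # 1% -> 10% -> 50% -> 100% of the fleet

def deploy(node: str) -> None:
    print(f"updating {node}")

def rollback(node: str) -> None:
    print(f"rolling back {node}")

def healthy(node: str) -> bool:
    return True  # placeholder for real health indicators (BGP sessions, traffic levels, ...)

def staged_rollout(nodes: Sequence[str]) -> bool:
    done = 0
    for fraction in STAGES:
        target = max(1, int(len(nodes) * fraction))
        for node in nodes[done:target]:
            deploy(node)
        if not all(healthy(node) for node in nodes[:target]):
            for node in nodes[:target]:
                rollback(node)          # only the touched subset is affected
            return False
        done = target                   # indicators are green: widen the rollout
    return True

if __name__ == "__main__":
    staged_rollout([f"router-{i:02d}" for i in range(20)])
```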
Impression 3: Updates are done in larger batches instead of many small increments.
Summarizing the previous two impressions, it becomes apparent that updates on these systems are done in large batches instead of small increments. Even without proper rollback mechanisms, small changes should be reversible within minutes, or an hour at most. The duration of the outage indicates that many areas were touched at the same time, and that it was faster to disconnect systems and go into fire-fighting mode than to kick off a rollback.
Impression 4: Modeling and unit-testing of changes at the router layer may not have been applied.
Computer programmers and IT professionals of the modern age have tools at their disposal that were a dream in the 80s, 90s and 2000s. They can easily spin up multiple containers on their local machine, define network rules between them, and test out scenarios. The effort needed is minimal: a couple of Dockerfiles and a bit of orchestration, which in the most minimal case can be done with docker-compose.
Unfortunately, many people do not use these powers to catch potential issues beyond the application level. Their testing stops at unit testing, and maybe some behavior-driven automated testing. Although potentially a bit more involved, modeling the routers, their communication, and the whole update process should be possible using a container-based test bed.
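A sketch of what such a test bed could look like inside a testing pipeline, assuming a hypothetical docker-compose.yml that models a few routers and the links between them; the compose file path, service name, and update script are made up:

```python
# Sketch of a container-based test bed, driven from a test script.
import subprocess

COMPOSE_FILE = "testbed/docker-compose.yml"  # hypothetical path

def compose(*args: str) -> None:
    subprocess.run(["docker", "compose", "-f", COMPOSE_FILE, *args], check=True)

def testbed_is_healthy() -> bool:
    """Placeholder: e.g. exec into the containers and verify that routes still propagate."""
    return True

def test_update_in_testbed() -> bool:
    compose("up", "-d", "--build")                             # spin up the modeled routers
    try:
        compose("exec", "-T", "router-a", "/apply-update.sh")  # hypothetical update script
        return testbed_is_healthy()
    finally:
        compose("down", "-v")                                  # always tear the test bed down
```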
What are the steps that should be taken, and how would they apply to your own organization
The impressions given in the previous section are also the recipe for fixing the issue. Rollbacks, staged and small-batch update policies, and proper modeling are not only applied in practice today; enterprise-grade examples are available, mostly as open source. There is no excuse not to adopt them.
What holds people back?
For one, legacy systems were built during a different time, and most of them are still run with the “do-not-touch-unless-you-really-need-to” principle. The investment into slowly migrating them to a more modern architecture usually loses out to short-term wallet tuning, i.e. increasing redundancy. Redundancy is simple, and more intuitive than “let us have a system where we push updates multiple times a day, and where we continuously test the resilience of our production systems using, e.g., a chaos monkey”. Although many IT/DevOps specialists we know count “The Phoenix Project” among their favorite reads, not many are applying the lessons learned.
Another issue is that, while some tooling is available to achieve the goal of “many builds per day” with good staging and rollout mechanisms, proper pre-deployment modeling and configuration checking on every layer of your infrastructure is still a fairly gray area. How do you know the effect a configuration change in one piece of software or one router has on the entire infrastructure? How do the different services interact with each other, and where are the configurations? Do we even version these configurations? Tackling these questions puts you on the road to success.
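As a minimal illustration of what “version the configurations and check them before deployment” can look like, here is a toy policy check over a directory of configuration files; the directory layout and the two example rules are made up:

```python
# Minimal illustration of pre-deployment configuration checking: walk the
# versioned configuration files and fail if a forbidden setting is found.
import sys
from pathlib import Path

CONFIG_DIR = Path("configs")   # hypothetical directory kept under version control

# Each policy: (human-readable description, predicate over the file text)
POLICIES = [
    ("telnet must not be enabled", lambda text: "enable telnet" not in text),
    ("BGP sessions must use an authentication password",
     lambda text: "router bgp" not in text or "password" in text),
]

def check_configs() -> int:
    failures = 0
    for path in CONFIG_DIR.glob("**/*.conf"):
        text = path.read_text()
        for description, ok in POLICIES:
            if not ok(text):
                print(f"{path}: policy violated: {description}")
                failures += 1
    return failures

if __name__ == "__main__":
    sys.exit(1 if check_configs() else 0)
```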
Software developers once asked similar questions. The instability of computers in the late 90s and early 2000s was beyond frustrating, but the standardization of CI/CD practices, paired with automated and sophisticated static analysis of code, has made working with computers as enjoyable as it is today.
IT work needs to be put on that road as well, and proper modeling and static analysis of one’s infrastructure are at the forefront of achieving the stability we want and deserve. It is possible, and given today’s trend towards “infrastructure as code”, it becomes easier and easier. Our tool CoGuard is built to check configurations on every layer of the infrastructure and to understand their dependencies and interconnections (patent pending). The dependencies can be formulated as policies, which make it possible to catch misconfigurations before deployment. Pre-deployment modeling also lets you see where you lack redundancy and where there are potential security issues with respect to the traffic between nodes. Configurations give all of this information away, and policies can be checked in seconds.
Conclusion
The changes needed to prevent another outage like this one cannot be made by Rogers in a day. They need to be a focused effort. Hopefully that focus will not just be an increase in the number of routers or servers, plus one added unit test to capture the specific issue that happened on July 8th.
The observations and recommendations of the previous subsections do not come from an academic ivory tower; they are practically achievable and applicable.
The third item of the message by the CEO of Rogers states that they will make any necessary changes. Given the impressions laid out in this article, the focus should be on increasing the agility of the system. Some work items include:
- Model the infrastructure responsible for the Border Gateway Protocol communication.
- Capture all configuration files, both at the level of the applications running on the routers and of the underlying infrastructure pieces.
- Create a set of checks which fail if a certain configuration or configuration combination is present. These interconnection checks can be easily formulated and checked by the CoGuard engine.
- Identify the areas preventing a good rollback and staging mechanism, and address them. Stability in this process is ensured by policy checking with each change. Most of the time, this means containerizing the functionality and applications on the affected devices.
- Create ways to model your environment using the newly created containers, and include spinning them up in your testing pipeline.
- Once you have addressed everything that stands in the way of a good rollback and staging mechanism, and you can model changes in advance in a safe environment, script these mechanisms into your CI/CD pipeline and test them (see the sketch after this list). This will likely cause more configurations to be considered by your policy-checking engine; hence, ensure that everything is captured in your code base.
- Establishing the previous item enables you to increase deployment frequency. Start enforcing that changes come out often and in smaller batches instead of all at once.
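The sketch referenced in the list above is a hedged illustration of a CI/CD gate that ties the pieces together; the imported modules and functions are hypothetical and stand in for the policy checks, the container test bed, and the staged rollout discussed earlier.

```python
# Sketch of a CI/CD gate: every change must pass the policy checks and the
# container test bed before a staged rollout is even attempted.
import sys

from policy_checks import check_configs      # hypothetical: the policy-check sketch
from testbed import test_update_in_testbed   # hypothetical: the docker-compose test bed
from rollout import staged_rollout           # hypothetical: the staged-rollout sketch

def pipeline_gate(nodes: list[str]) -> int:
    if check_configs():
        print("configuration policy violations found; refusing to deploy")
        return 1
    if not test_update_in_testbed():
        print("update broke the modeled network; refusing to deploy")
        return 1
    if not staged_rollout(nodes):
        print("staged rollout rolled back; investigate before retrying")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(pipeline_gate([f"router-{i:02d}" for i in range(20)]))
```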
Each of the above seven items is hard, but very rewarding once established. Plus, they enable businesses to make changes faster and to adapt to the competitive landscape without having to wait too long for IT. The ROI is beyond measure.
We would love to say that this outage is a wake-up call, but there have been other outages before this one. Hence, this is another poke to start doing things differently in the IT world, and not just fight issues with the intuitive reaction of “let’s just throw more money at it”. While Rogers shares an effective oligopoly with Bell and Telus in Canada, the business with the best agility, and hence stability, will always win. And this is not only true for communications companies.