The landscape of IT is undergoing a significant shift from manual setup and installation processes towards container orchestration and programmatic resource provisioning. Storing infrastructure details in code repositories improves change management and makes testing easier.
However, as Mick Jagger aptly put it, "Old habits die hard." Despite the availability of numerous open-source container orchestration tools, there is a tendency not to fully leverage their functionality, especially in high-stakes environments where availability and reliability are paramount. This can prove to be a costly mistake.
Following a recent slashing incident involving the Launchnodes node operator, we'll delve into some critical slashing concepts that the event highlighted.
Reading Between The Lines
“The root cause of the slashing boiled down to executing non-optimal fallback procedures during datacenter connectivity issues. In an attempt to restore validator connectivity, multiple validator client instances (an initial instance and a manually activated fallback instance) were pointed to a single Web3signer instance without slashing protection enabled at the Web3signer level and without blocking the initial instance from the signer (e.g. via firewall rules); this caused double votes to occur for the loaded validators, which led to attester slashings of 20 validators.”
Lido on Ethereum Launchnodes Slashing Incident
Let’s analyze the information provided in the post mortem, to better understand the infrastructure setup and automation processes and how the “non-optimal fallback procedures” can be identified in configuration rules.
Quote #1: “an initial instance and a manually activated fallback instance”
Anything that has to be done manually belongs in the “break glass” category, i.e. when every automated process you have fails, you acquire the root credentials and your SREs save the day. This should be a very rare event.
Fallback mechanisms, more commonly referred to as failover mechanisms, should NOT be manually activated. They should all be automatically managed and orchestrated.
In a clustered architecture, this can be done by using, e.g., a default backend in Kubernetes. Where possible, the clustered architecture should have multiple replica instances, which appear to the user as one machine accessed through the ingress controller.
Quote #2: “were pointed to a single Web3signer instance without slashing protection enabled”
This sentence implies that some signer instances had a slashing protection database connected and others did not. Every signer needs to be connected to a slashing protection database to avoid the risk and high cost of double-signing.
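To make the idea concrete, here is a minimal sketch of what a slashing protection check does for attestations. It is not Web3signer's implementation: the class names, the in-memory store and the string signing roots are assumptions made for illustration, and a real signer persists this history in a database and applies further conditions. The sketch only refuses the two slashable attester patterns, a double vote (a different attestation for the same target epoch) and a surround vote.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Attestation:
    source_epoch: int
    target_epoch: int
    signing_root: str  # stand-in for the hash of the data being signed


@dataclass
class SlashingProtectionStore:
    """Hypothetical in-memory stand-in for a slashing protection database."""

    signed: dict[str, list[Attestation]] = field(default_factory=dict)

    def may_sign(self, pubkey: str, new: Attestation) -> bool:
        for old in self.signed.get(pubkey, []):
            if old.target_epoch == new.target_epoch:
                # Double vote: same target epoch but different content.
                if old.signing_root != new.signing_root:
                    return False
                continue  # identical re-sign is harmless
            new_surrounds_old = (
                new.source_epoch < old.source_epoch
                and old.target_epoch < new.target_epoch
            )
            old_surrounds_new = (
                old.source_epoch < new.source_epoch
                and new.target_epoch < old.target_epoch
            )
            if new_surrounds_old or old_surrounds_new:
                return False
        return True

    def sign(self, pubkey: str, new: Attestation) -> bool:
        """Record and approve the request only if it is not slashable."""
        if not self.may_sign(pubkey, new):
            return False
        self.signed.setdefault(pubkey, []).append(new)
        return True


# Two validator clients ask the same signer to attest for the same key.
store = SlashingProtectionStore()
first = store.sign("validator-key", Attestation(10, 11, "0xaa"))
second = store.sign("validator-key", Attestation(10, 11, "0xbb"))
print(first, second)
```

With a shared history in front of the signer, the second, conflicting request from the fallback instance is refused instead of being signed.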
For IaC to work in practice, created resources must not be manually modified after they are live (immutable infrastructure). Any modifications need to be made in the IaC code. This means that the build scripts and configurations for your deployments (test, staging, UAT, prod) are as identical as possible (though there are parameters for DNS, IPs, secrets, etc.). You should use the same build scripts no matter where you deploy. Otherwise you risk a failure during a failover (apologies for the pun).
It appears that the Web3signer instance was created using a different script, i.e., one that did not enable slashing protection. When the existing validator clients were connected to it, it behaved differently from the previous Web3signers.
This is why, at CoGuard, we do not discriminate between different environments. The checks must pass in each/all environments.
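As a rough illustration of applying the same check to every environment, the sketch below runs one rule over a set of signer configurations. The environment names and the `slashing_protection_enabled` key are hypothetical placeholders and not actual Web3signer option names or CoGuard rules.

```python
# Hypothetical setting that every signer must carry, regardless of environment.
REQUIRED_SIGNER_SETTINGS = {"slashing_protection_enabled": True}


def find_violations(environments: dict[str, dict[str, object]]) -> list[str]:
    """Return one message per environment setting that deviates from the rule."""
    violations = []
    for env_name, settings in sorted(environments.items()):
        for key, expected in REQUIRED_SIGNER_SETTINGS.items():
            actual = settings.get(key)  # None when the key is missing
            if actual != expected:
                violations.append(
                    f"{env_name}: {key} is {actual!r}, expected {expected!r}"
                )
    return violations


# Hypothetical environments; the fallback signer was built differently.
signer_configs = {
    "staging": {"slashing_protection_enabled": True},
    "prod": {"slashing_protection_enabled": True},
    "prod-fallback": {"slashing_protection_enabled": False},
}

for message in find_violations(signer_configs):
    print(message)
```

Because the rule set does not depend on the environment name, a fallback instance built from a divergent script is flagged before it is ever needed.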
Quote #3: “without blocking the initial instance from the signer (e.g. via firewall rules)”
For a while, one of the common technical interview questions out there was “Implement a singleton in C++/Java/etc.”. In the object-oriented world, a Singleton is a very common way of ensuring that certain operations are controlled by a single object to keep track of state and to access a single data source.
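For readers who have not met the pattern, a minimal Python sketch looks like the following. The `KeyStore` class and its `loaded_keys` attribute are hypothetical names chosen for this example, and the sketch is not thread-safe.

```python
class KeyStore:
    """Hypothetical class that should exist at most once per process."""

    _instance = None

    def __new__(cls):
        # Create the single instance on first use, then always hand it back.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.loaded_keys = set()
        return cls._instance


a = KeyStore()
b = KeyStore()
a.loaded_keys.add("validator-key-1")
print(a is b, "validator-key-1" in b.loaded_keys)
```

Every caller receives the same object, so state such as the set of loaded keys lives in exactly one place.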
This abstraction also translates into the infrastructure world. For current DevOps hires, the question would be: “How do you ensure a certain container is present exactly once?“
In Kubernetes, for example, there are two mechanisms to achieve that: StatefulSets and ReplicaSets.
StatefulSets are stricter than ReplicaSets, since they give each pod a stable identity and control the order in which pods are deployed and updated.
In the context of slashing, it is important to avoid loading the same validator keys into more than one validator client, i.e. the validators should follow the Singleton pattern. This can be achieved, e.g., by mounting the keys into the containers via Persistent Volumes. Combined with a StatefulSet, one can define a “persistentVolumeClaimRetentionPolicy”, which controls whether the volume claims are retained or deleted when the StatefulSet is scaled down or deleted, so that access to the volumes is revoked or granted at the appropriate time.
Hence, the control to ensure that only one validator is using a certain key pair can be achieved at the container level and does not require dynamic adjustment of firewall rules.
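A sketch of such a setup, written as a Python dictionary that mirrors a StatefulSet manifest, together with a small static check, could look as follows. The resource names, image, mount path and storage size are placeholders, and the choice of the `ReadWriteOncePod` access mode (which limits a volume to a single pod) is an assumption of this example rather than something stated in the post mortem.

```python
# Hypothetical manifest; names, image and sizes are placeholders.
validator_statefulset = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "validator"},
    "spec": {
        "serviceName": "validator",
        "replicas": 1,  # exactly one validator client holds the keys
        "selector": {"matchLabels": {"app": "validator"}},
        "template": {
            "metadata": {"labels": {"app": "validator"}},
            "spec": {
                "containers": [{
                    "name": "validator-client",
                    "image": "example.invalid/validator-client:1.0",
                    "volumeMounts": [
                        {"name": "validator-keys", "mountPath": "/keys"}
                    ],
                }],
            },
        },
        "volumeClaimTemplates": [{
            "metadata": {"name": "validator-keys"},
            "spec": {
                "accessModes": ["ReadWriteOncePod"],
                "resources": {"requests": {"storage": "1Gi"}},
            },
        }],
        "persistentVolumeClaimRetentionPolicy": {
            "whenDeleted": "Retain",
            "whenScaled": "Retain",
        },
    },
}


def check_single_key_holder(manifest: dict) -> list[str]:
    """Return the reasons why the manifest may allow two holders of one key."""
    problems = []
    if manifest.get("kind") != "StatefulSet":
        problems.append("workload is not a StatefulSet")
    spec = manifest.get("spec", {})
    if spec.get("replicas", 1) != 1:
        problems.append("replicas must be exactly 1")
    claims = spec.get("volumeClaimTemplates", [])
    if not claims:
        problems.append("keys are not mounted from a volume claim template")
    for claim in claims:
        modes = claim.get("spec", {}).get("accessModes", [])
        if modes != ["ReadWriteOncePod"]:
            name = claim.get("metadata", {}).get("name", "<unnamed>")
            problems.append(f"claim {name} is not restricted to a single pod")
    return problems


print(check_single_key_holder(validator_statefulset))
```

A check like this can run against the repository before anything is deployed, so the single-holder property does not depend on someone remembering to adjust firewall rules during an outage.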
Conclusions
The technology to avoid this situation exists. It is mostly open source. People just need to be aware of all the possibilities IaC and containers provide. The main benefits of IaC are:
- Deployment: removing the manual provisioning interaction with internal and public cloud providers means a quicker deployment speed.
- Recovery: identifying issues in the configuration of infrastructure can mean a quicker recovery from failures.
- Consistency: deploying the same resources each time, resolving infrastructure fragility.
- Modification: modifying resources can have a quick turnaround time.
- Version Control: storing infrastructure code in version control systems.
- Visibility: writing configuration as code serves as documentation for the infrastructure.
Just as developers have learned the patterns of object-oriented programming over the years, we need to ensure that DevOps specialists are aware of the patterns available to them in the currently provided tooling. One of the keys is static analysis of the infrastructure-as-code files and the configurations they contain.
For the specific case of slashing, we at CoGuard have created a rule set that checks the patterns being deployed and verifies that resources are as isolated as possible. Slashing-related infrastructure configuration, together with infrastructure, container and configuration security, is among the checks we run with Chainproof to ensure that companies have robust infrastructures where slashing risk is minimized.
Getting Started with CoGuard
CoGuard is a static analysis tool for infrastructure and application configurations. It looks for misconfigurations, security vulnerabilities and deviations from best practices in your IaC code, container configurations, application configurations and network settings. Developers can get started with their infrastructure or deployment repository. Security teams can connect directly to cloud management consoles.