A bug in the Nethermind client took down 8% of the validators in the Ethereum network. This should serve as a warning to all blockchain networks dominated by a single client. Many networks, including Cardano, face the risk of outages: if a bug causes the dominant client to crash fatally, the network may no longer be able to produce blocks. Client diversity mitigates this risk. Decentralization is not only about block production or on-chain governance but also about the diversity of teams that build alternative clients. Let's explain this often-neglected topic.
One Team, One Client
A blockchain network is typically launched by a single team that delivers the software, the so-called client, implementing the rules of the protocol. This team knows the protocol design best and defines key features such as the monetary policy and the reward mechanism.
Volunteers (operators) install this client on their nodes, and the network gradually becomes more distributed and decentralized. The more nodes run the client, the more resilient the network is to the failure of any single node.
This is a common and expected scenario. The result is a distributed network in which all operators have the same client from the same team.
Nodes that produce new blocks matter more than user nodes that do not. What counts is therefore not only the total number of nodes but also how decentralized the block producers are, since these nodes are critical to the network and can become a single point of failure.
In the context of Cardano (but also Bitcoin), this means that the more pools the network has, the more resilient it is. Later we will explain what role the client plays in this.
In the Cardano network, there are randomly drawn slot leaders who mint blocks. If by chance the drawn slot leader doesn't mint the block due to a problem with the node or internet connection, it doesn't matter too much. In the next round, a different slot leader will be drawn, who will probably succeed in minting a new block.
The network seems to be resilient to individual node failures.
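To make the slot-leader argument concrete, here is a deliberately simplified Python simulation. It is a sketch under stated assumptions, not Cardano's actual leader election: Ouroboros Praos elects leaders privately using a VRF, so some slots have no leader and some have several, whereas this toy draws exactly one stake-weighted leader per slot. The pool names, stake values, and the `minted_blocks` function are hypothetical.

```python
import random
from typing import Dict, Set

def minted_blocks(stake: Dict[str, float], offline: Set[str], slots: int, seed: int = 0) -> int:
    """Count slots whose drawn leader was online and therefore minted a block.

    Simplification (assumption): exactly one leader per slot, drawn with
    probability proportional to stake. Real Praos leader election differs.
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    pools = list(stake)
    weights = [stake[p] for p in pools]
    minted = 0
    for _ in range(slots):
        leader = rng.choices(pools, weights=weights, k=1)[0]
        if leader not in offline:
            minted += 1  # an offline leader simply misses its slot
    return minted

# Hypothetical network: 100 pools with equal stake, 5 of them offline.
stake = {f"pool{i}": 1.0 for i in range(100)}
offline = {f"pool{i}" for i in range(5)}
print(minted_blocks(stake, offline, slots=1000))
```

With 5 of 100 equally staked pools offline, roughly 95% of slots still produce a block: the chain just skips the slots that happened to go to the offline pools.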
However, such a network is not immune to critical bugs in the client source code. Because every node runs the same client, a single bug can crash all of them at approximately the same time. In that case, block minting may stop completely. Users will not be able to submit new transactions, because there may be no node left in the network that can process them (that is, temporarily store them in the mempool).
In the picture, you can see that all the nodes in the network have failed and are unavailable. New blocks are not minted. Users cannot interact with each other.
If the operation of the network is dependent on a single client implementation, the risk of failure is relatively high.
The client implementation can be of high quality, which reduces the risk. The IOG team uses formal methods in client development, which is good news. Unfortunately, the risk of failure remains real.
Why Is Client Diversity Important?
The preventive measure against such a network failure is client diversity. There need to be additional clients implemented by other teams, possibly in different programming languages.
Different clients must be compatible with each other. This means they must behave identically with respect to the communication protocol and the network rules. Clients must act the same on the outside but may work differently on the inside.
The original team (in Cardano's case, IOG) must make the formal specification of the protocol publicly available.
Other teams need to know exactly how to implement the client. It is relatively difficult to infer the required behavior from network communication or from the source code of the original implementation, even though that code is freely available to everyone. It is always better to have a formal specification in hand.
Formal specifications are crucial because they provide an unambiguous description of how a protocol should function. This includes details about the protocol's behavior, its expected inputs and outputs, and the sequence of operations it should perform.
Having a formal specification is beneficial for teams implementing a client. It provides a reference point that can guide the development process and help ensure that the client is implemented correctly.
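One way a formal specification serves as a reference point is conformance testing: each client's output is compared against the rule exactly as the specification states it. The Python sketch below assumes a simple linear fee rule of the form `a * size + b`, similar in shape to Cardano's basic minimum-fee formula. The parameter values, the two client functions, and `conformance_failures` are hypothetical and serve only as an illustration.

```python
from typing import Callable, List, Tuple

# Illustrative parameter values only; a real client reads them from protocol parameters.
FEE_A = 44        # per-byte coefficient, in lovelace
FEE_B = 155_381   # constant term, in lovelace

def spec_min_fee(tx_size_bytes: int) -> int:
    """Reference rule as written in the (assumed) specification: a * size + b."""
    return FEE_A * tx_size_bytes + FEE_B

def client_x_min_fee(tx_size_bytes: int) -> int:
    # Hypothetical client X: structured differently, but equivalent to the spec.
    fee = FEE_B
    fee += FEE_A * tx_size_bytes
    return fee

def client_y_min_fee(tx_size_bytes: int) -> int:
    # Hypothetical client Y: bug that charges only for whole kilobytes.
    whole_kib = tx_size_bytes // 1024
    return FEE_A * 1024 * whole_kib + FEE_B

def conformance_failures(impl: Callable[[int], int], sizes: List[int]) -> List[Tuple[int, int, int]]:
    """List (size, expected, actual) for every test vector where impl disagrees with the spec."""
    failures = []
    for size in sizes:
        expected, actual = spec_min_fee(size), impl(size)
        if actual != expected:
            failures.append((size, expected, actual))
    return failures

TEST_SIZES = [0, 200, 1024, 16_384]
print(conformance_failures(client_x_min_fee, TEST_SIZES))  # []
print(conformance_failures(client_y_min_fee, TEST_SIZES))  # [(200, 164181, 155381)]
```

Client Y agrees with the specification at every whole-kilobyte size and fails only on the 200-byte vector, which is why checking against a specification pays off only when the test inputs include unusual cases.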
In the image below, you can see that the original team, in addition to the client implementation, also published a formal protocol specification. Three other teams, labeled A, B, and C, released three clients A, B, and C.
Not only the existence of alternative client implementations matters, but also their adoption by node operators.
Ideally, the individual clients would be represented in the network as evenly as possible. In our example, each client would have a 25% share.
In the picture below you can see a network in which no client has dominance. Operators use 4 implementations (versions) of the client, each with 25% representation in the network. This network is significantly more resilient to the fatal failure of a particular client.
If one client implementation failed fatally and its nodes were unable to resume operation, only 25% of the nodes in the network would be affected. The remaining 75% of nodes, running other clients, would continue to work. Block production would be slower, but it would not stop.
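The arithmetic behind these percentages is simple enough to write down. This Python sketch computes the share of nodes that keep running when one client fails. It assumes, as a simplification, that stake is spread across clients in proportion to node count, so the result also roughly approximates how much of normal block production would remain. The skewed distribution is a hypothetical example.

```python
from typing import Dict

def surviving_share(client_share: Dict[str, float], failed_client: str) -> float:
    """Fraction of nodes still running after every node of failed_client crashes."""
    total = sum(client_share.values())
    return (total - client_share[failed_client]) / total

def worst_case_share(client_share: Dict[str, float]) -> float:
    """Surviving fraction if the most widely used client is the one that fails."""
    dominant = max(client_share, key=client_share.get)
    return surviving_share(client_share, dominant)

balanced = {"original": 25, "A": 25, "B": 25, "C": 25}
skewed = {"original": 85, "A": 5, "B": 5, "C": 5}  # hypothetical distribution

print(worst_case_share(balanced))          # 0.75
print(round(worst_case_share(skewed), 2))  # 0.15
```

In the balanced case, even the worst outcome leaves three quarters of the network running; with one dominant client, a single bug can take down most of it.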
In the picture below, you can see a situation where the implementation of client A fails. The network can continue minting blocks thanks to other client implementations.
It is unlikely that different teams will make the same mistake in implementation and introduce a bug that can bring down the entire node. Multiple client implementations are a safeguard against an identical bug in the source code.
A bug can lie dormant in the code for a long time without anyone noticing it. Unusual circumstances in the network can then trigger it. For example, an unusual transaction may appear that causes a node to crash during validation.
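The following Python sketch illustrates how such a dormant bug can hit one implementation but not another. Two hypothetical clients enforce the same toy rule (outputs must be non-negative and must not exceed inputs). Client A uses `min()` on the outputs, which raises an exception on an unusual transaction with no outputs; client B does not. The `Tx` type, the rule, and the node names are illustrative assumptions, and an unhandled exception stands in for a node crash.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Tx:
    inputs: List[int]                                   # values being spent
    outputs: List[int] = field(default_factory=list)    # values being created

def validate_a(tx: Tx) -> bool:
    # Hypothetical client A. Latent bug: min() raises ValueError when tx.outputs is empty.
    return min(tx.outputs) >= 0 and sum(tx.outputs) <= sum(tx.inputs)

def validate_b(tx: Tx) -> bool:
    # Hypothetical client B. Same rule, but all() on an empty list simply returns True.
    return all(v >= 0 for v in tx.outputs) and sum(tx.outputs) <= sum(tx.inputs)

def surviving_nodes(nodes: Dict[str, Callable[[Tx], bool]], tx: Tx) -> List[str]:
    """Return the names of nodes that process tx without crashing."""
    alive = []
    for name, validate in nodes.items():
        try:
            validate(tx)
        except Exception:
            continue  # an unhandled exception models a fatal node crash
        alive.append(name)
    return alive

nodes = {"node1": validate_a, "node2": validate_a, "node3": validate_b, "node4": validate_b}
print(surviving_nodes(nodes, Tx(inputs=[10], outputs=[7, 3])))  # ['node1', 'node2', 'node3', 'node4']
print(surviving_nodes(nodes, Tx(inputs=[10], outputs=[])))      # ['node3', 'node4']
```

For ordinary transactions both clients behave identically, so the bug in client A stays invisible until the unusual transaction arrives; at that point only the nodes running client A go down.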
In short, a network that relies on multiple independent consensus and execution clients ensures that the majority of nodes keep operating effectively, securely, and without disruption. This holds even if one of the many client versions fails fatally, provided that this version does not have a dominant position. Client diversity makes the network more resilient by eliminating the single point of failure that arises when too many nodes run the same client software.
When multiple teams from different parts of the world work on different clients, the network also stops depending on a single team, which removes another single point of failure. Each team is likely to have its own approach to solving problems, which adds to the robustness of the network.
The diversity of clients, perhaps a little surprisingly for some, increases the decentralization of the network.
There is often a debate about how to decentralize control over core development teams (e.g., the IOG team). What this debate tends to overlook is that IOG's influence rests on the fact that its implementation of the Cardano client has a dominant position in the Cardano network.
Having multiple teams ensures that no single team has undue influence over the network’s development or direction. This aligns with the principle of decentralization, which aims to prevent any single entity from having too much control.
If there is only one team that controls the development of the client with a dominant position in the network, this is a certain form of centralization. The team is a single point of failure and at the same time (often) an unelected entity that controls the direction and properties of the protocol.
Having multiple teams raises many questions. For example, who should fund the development of alternative clients? Who should coordinate the teams, and how? How can operators be motivated to run alternative clients rather than the client from the original team?
Node operators tend to use the client from the team that launched the network, as that team knows the protocol best and the client is the most time-tested. It is logical behavior, but it leads to centralization and the risks described above.
More Cardano Clients On The Horizon
In the Cardano ecosystem, several teams plan to build an alternative Cardano client. The teams are largely dependent on funding from Catalyst.
The team building the TypeScript implementation is probably the furthest along. Their version is still far from becoming a full-fledged client, but it handles basic operations such as fetching blocks and syncing the chain.
Perhaps the Cardano Foundation and Emurgo could consider building an alternative client. Both entities received a portion of the funds from the initial ADA sale, so funding would be available.
Alternatively, members of the Intersect organization could also consider it.
Building an alternative client is complex and expensive. Additionally, IOG regularly deploys upgrades that add new functionality. Every alternative client will need to keep pace with the IOG client, which in practice means implementing every upgrade (all necessary functionality).
Having more clients on the network also means being more vigilant about network upgrades. Network upgrades are currently handled by the hard-fork combinator. A wider diversity of clients can complicate this process.
It follows that teams building alternative Cardano clients should be an integral part of the ecosystem, and their funding should very likely come from the ADA held in the Cardano treasury.
This is a challenge for the community. Decentralization is the responsibility of the community, so it should ensure that Cardano is decentralized at the client level.
All major blockchains, including Bitcoin and Ethereum, face the risk of fatal failure because their operation depends on one dominant version of the client from the team that launched the network.
Ethereum performs best in client diversity: five alternative Ethereum clients each have more than a 1% share. However, the Geth client still has a presence of almost 80% in the network. Bitcoin essentially depends on Bitcoin Core, a descendant of the original client created by Satoshi. Cardano depends on the client from the IOG team.
An anomaly may appear that causes nodes to fail. Cardano experienced this in January 2023. However, the affected nodes restarted automatically and were up and running within minutes, so the operation of the network was not significantly disrupted. The network 'healed' itself.
Ethereum has already had several incidents; in one of them, the network temporarily stopped finalizing blocks. The Solana network has had to be restarted several times.
If blockchain networks are to become the financial backbone of the world, it is necessary to ensure their 100% reliability. This is currently not guaranteed. The Cardano community should think about what steps lead to a higher diversity of clients.