The fall of old dogmas and the emergence of a new paradigm for building resilient computer systems for critical infrastructures
Given the increasing reach of the Internet and its role in the day-to-day life of ordinary citizens, the security of the computer systems that make up the Internet has been a growing concern. On the one hand, the tools used to attack the Internet are more accessible and the attacks increasingly powerful and stealthy; on the other, attackers have shifted their attention to critical infrastructure such as the power grid. In recent years, many aspects of these critical systems have changed, and the changes have made the systems more vulnerable to cyber attacks. A successful attack on the computer systems of the power grid would be catastrophic.
It is therefore imperative not only to make such critical computer systems resilient to severe attacks but also to ensure that appropriate defence mechanisms kick in automatically throughout the life of the systems.
My research over the last few years has focused on finding answers to two questions:
• How to make a computer system resilient to severe attacks?
• How to guarantee that this resilience comes into play automatically throughout the life of the system?
When I started my research in 2005, the dogma was that a resilient computer system should consist of a set of computers configured in such a way that a successful attack on a few of them will not affect the entire network. This technique is called intrusion tolerance and keeps a system operational despite successful attacks (intrusions) on any subset of machines that make up the system.
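One common way to realize this idea is to replicate a service and accept an answer only when enough replicas agree on it. The following is a minimal sketch under that assumption, not a description of any specific system discussed here; the function name `voted_reply`, the threshold `f` (the maximum number of compromised replicas) and the sample replies are all hypothetical.

```python
from collections import Counter

def voted_reply(replies, f):
    """Return the value reported by at least f + 1 replicas, or None.

    If at most f replicas are compromised, any value backed by f + 1
    replicas is vouched for by at least one correct replica.
    """
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= f + 1 else None

# Four replicas, one of which (the third) is assumed to be compromised.
print(voted_reply(["open", "open", "closed", "open"], f=1))
```

The guarantee holds only while the number of compromised replicas stays at or below `f`, which is exactly the condition examined in the rest of this article.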
Moreover, it was also believed that intrusion tolerance should be combined with complete absence of timing assumptions, which means, in simple terms, that the speed at which computers work (processor speed, communication time, and so on) could change arbitrarily without affecting the ability of a system to tolerate intrusions.
The starting point of my research work was a simple intuition: if the speed of a computer system can change at random, how is it possible to prevent a human (and thus arbitrarily fast) attacker from compromising the computers one by one until every single machine is compromised?
I formalized this intuition by producing a theoretical proof that it is impossible to build a secure intrusion-tolerant system without any timing assumption. The proof generated some controversy in the research community because it contradicted the prevalent belief that intrusion-tolerant systems should make no timing assumption.
To test my theory, I analysed several intrusion-tolerant systems that were allegedly free of any timing assumption. After careful analysis, however, I concluded that these systems were in fact built on hidden timing assumptions. They assumed that, from time to time, a cleaning process would be executed to remove the effects of intrusions, and that this process would be carried out often enough and fast enough to thwart any attack. In other words, there was a timing assumption, except that it was implicit rather than explicit. The serious problem with this timing assumption was that an attacker could easily sidestep it by simply turning off the cleaning process after compromising a computer.
Having thus confirmed my initial intuition, I began thinking about a solution. I decided to retain the idea of a cleaning process, but to use it more securely by making it tamper-proof. I named this paradigm proactive resilience, which, in essence, consists of adding a secure component to each computer of an intrusion-tolerant system. It is this secure component that triggers the cleaning process and guarantees that the process is executed in time and is beyond the reach of any attacker, so long as the secure component is isolated and cannot be compromised. This may seem a tall claim, but because the component is very simple and is used only for cleaning operations and nothing else, it is easy to guarantee that it is not compromised. In practice, this component can be a hardware board (similar to the graphics board of a regular computer) or an application deployed in an isolated virtual machine.
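The argument can be illustrated with a toy simulation. It is a sketch under simplifying assumptions that are not part of the original work: time advances in discrete steps, the attacker compromises one more computer every `attack_time` steps, a cleaning round restores all computers at once every `clean_period` steps, and the system fails as soon as more than `f` computers are compromised. All names and numbers are hypothetical.

```python
def survives(n, f, attack_time, clean_period, horizon, cleaning_protected):
    """Return True if at most f of n computers are ever compromised.

    cleaning_protected=False models an attacker who switches off the
    cleaning process on every computer it compromises, so compromised
    computers are never restored.
    """
    compromised = 0
    for t in range(1, horizon + 1):
        # Cleaning only has an effect when the attacker cannot disable it.
        if cleaning_protected and t % clean_period == 0:
            compromised = 0
        # The attacker takes over one more computer every attack_time steps.
        if t % attack_time == 0 and compromised < n:
            compromised += 1
        if compromised > f:
            return False
    return True

# Tamper-proof cleaning that runs often enough: the system survives.
print(survives(4, 1, attack_time=10, clean_period=5, horizon=1000,
               cleaning_protected=True))    # True
# Same schedule, but the attacker can turn cleaning off: the system falls.
print(survives(4, 1, attack_time=10, clean_period=5, horizon=1000,
               cleaning_protected=False))   # False
# Tamper-proof cleaning that runs too rarely: the system also falls.
print(survives(4, 1, attack_time=10, clean_period=50, horizon=1000,
               cleaning_protected=True))    # False
```

The three cases mirror the reasoning above: cleaning helps only if it cannot be disabled and if it is triggered fast enough relative to the attacker, which is itself a timing assumption, now made explicit.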
The last part of my research work was a practical, real-life demonstration. The proactive resilience paradigm was used in the construction of the CRUTIAL Information Switch (CIS), an intrusion-tolerant firewall (a protective device) developed as part of the European project CRUTIAL. The goal of the CIS is to protect the information infrastructure of the power grid from cyber attacks. Its effectiveness has been validated both by the project's reviewers (a panel of specialists) and by the research community in general, through publication in prestigious international journals and magazines.
As in any other research, much more remains to be done. Currently, I am working on other application scenarios for proactive resilience (antivirus solutions and the Internet's Domain Name System, or DNS, for example), and fresh results will appear soon. Stay tuned!
(Paulo Sousa, University of Lisbon, www.atomiumculture.eu)