What are Self Healing Systems & How Can You Develop One?
When people get injured, their bodies self-heal. What if technology could do the same?
Companies are racing to develop self-healing systems, which could improve quality, cut costs and boost customer trust. For example, IBM is experimenting with ‘self-managing’ products that configure, protect and heal themselves.
What Is A Self Healing System?
A self-healing system can detect errors in its own functioning and change itself without human intervention, thereby restoring itself to a better-functioning state. There are three levels of self-healing systems (application, system and hardware), each of which has its own size and resource requirements.
Application Level
In typical applications, problems are documented in an ‘exceptions log’ for further examination. Most problems are minor and can be ignored. Serious problems may require the application to stop (for example, an inability to connect to a database that has been taken offline).
By contrast, self-healing applications incorporate design elements that resolve problems. For example, applications that use Akka arrange elements in a hierarchy and assign an actor’s problems to its supervisor. Many such libraries and frameworks facilitate applications that self-heal by design.
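The supervision idea can be sketched in a few lines of Python. This is not Akka's API; it is a minimal illustration of the pattern, and the `Worker` and `Supervisor` classes, the `max_restarts` limit and the "bad" message are all hypothetical names chosen for the example.

```python
class Worker:
    """A hypothetical child component that may fail while handling a message."""

    def __init__(self, name):
        self.name = name
        self.processed = 0

    def handle(self, message):
        if message == "bad":
            raise ValueError(f"{self.name} cannot process {message!r}")
        self.processed += 1
        return f"{self.name} handled {message}"


class Supervisor:
    """Owns a worker and decides what to do when the worker fails."""

    def __init__(self, worker_factory, max_restarts=3):
        self.worker_factory = worker_factory
        self.max_restarts = max_restarts
        self.restarts = 0
        self.worker = worker_factory()

    def send(self, message):
        try:
            return self.worker.handle(message)
        except Exception as error:
            # The failure is handed to the supervisor instead of
            # stopping the whole application.
            if self.restarts >= self.max_restarts:
                raise RuntimeError("restart budget exhausted") from error
            self.restarts += 1
            self.worker = self.worker_factory()  # replace the failed child
            return None


supervisor = Supervisor(lambda: Worker("worker-1"))
results = [supervisor.send(m) for m in ["a", "bad", "b"]]
```

Here the failing message costs one restart, and the replacement worker goes on to handle the next message. A real framework would also offer other strategies, such as resuming the child or escalating further up the hierarchy.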
System Level
Unlike application level self-healing, system level self-healing does not depend on a programming language or specific components. Rather, it can be generalized and applied to all services and applications, independent of their internal components.
The most common system level errors include process failures (often resolved by redeploying or restarting) and response time issues (often resolved by scaling up or down). Self-healing systems conduct health checks on different components and automatically attempt fixes (such as redeploying) to return to their desired states.
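A health-check pass of this kind might look roughly like the following sketch. The `Service` class is a hypothetical stand-in for a deployed process, and its `is_healthy` and `restart` methods are assumptions made for the example; a real check would more likely probe a port or an HTTP endpoint.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical stand-in for a deployed process."""
    name: str
    running: bool = True
    restarts: int = 0

    def is_healthy(self):
        # Assumption: health is a simple flag in this sketch.
        return self.running

    def restart(self):
        self.running = True
        self.restarts += 1


def heal(services):
    """Run one health-check pass and restart anything unhealthy.

    Returns the names of the services that were restarted.
    """
    restarted = []
    for service in services:
        if not service.is_healthy():
            service.restart()
            restarted.append(service.name)
    return restarted


services = [Service("api"), Service("worker", running=False)]
restarted = heal(services)
```

In practice a pass like this would run on a schedule, so that each service is checked repeatedly and returned to its desired state whenever it drifts.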
Hardware Level
Hardware level self-healing redeploys services from an unhealthy node to a healthy one. It also conducts health checks on different components. Since true hardware level self-healing (for example, a machine that can heal failed memory or repair a broken hard disk) does not exist, current hardware level solutions are essentially system level solutions.
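Moving services off an unhealthy node can be sketched as a small placement function. The node and service names, the dictionary layout and the "least loaded healthy node" rule are all assumptions made for illustration; real schedulers weigh many more factors.

```python
def redeploy(placement, node_health):
    """Move services off unhealthy nodes.

    placement:   dict mapping service name -> node name
    node_health: dict mapping node name -> True if the node passed its check
    Returns a new placement dict; the input is left unchanged.
    """
    healthy_nodes = sorted(n for n, ok in node_health.items() if ok)
    if not healthy_nodes:
        raise RuntimeError("no healthy node available")

    # Count how many services each healthy node already runs.
    load = {node: 0 for node in healthy_nodes}
    for node in placement.values():
        if node in load:
            load[node] += 1

    new_placement = {}
    for service, node in placement.items():
        if node_health.get(node, False):
            new_placement[service] = node  # node is fine, leave it alone
        else:
            # Assumption: pick the healthy node with the fewest services.
            target = min(healthy_nodes, key=lambda n: load[n])
            load[target] += 1
            new_placement[service] = target
    return new_placement


placement = {"api": "node-a", "db": "node-b", "cache": "node-b"}
node_health = {"node-a": True, "node-b": False, "node-c": True}
new_placement = redeploy(placement, node_health)
```

With `node-b` marked unhealthy, its two services are spread across the remaining healthy nodes while `api` stays where it is.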
Reactive Versus Preventive Healing
Reactive Healing
Reactive healing is healing in response to an error and is already in widespread use. For example, redeploying an application to a new physical node in response to an error, thereby preventing downtime, is reactive healing.
The desirable level of reactive healing depends on how much risk a system can tolerate. For example, if a system relies on a single data center, the chance of the entire data center losing power and taking every node down may be so slim that designing a system to respond to it is unnecessary and expensive. However, if it is a critical system, it may make sense to design it to recover automatically after such an event.
Preventive Healing
Preventive healing proactively prevents errors. Take the example of preventing processing time errors by using real-time data. You send an HTTP request to check the health of a service and measure how long it takes to respond. If the response takes more than 500 milliseconds, you design the system to scale the service up; if it takes less than 100 milliseconds, you design the system to scale it down, which frees resources for other uses.
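The threshold rule can be sketched as follows. The 500 and 100 millisecond limits come from the example above; the function names, the one-instance step and the `min_instances` and `max_instances` bounds are assumptions added for illustration.

```python
import time
import urllib.request

SCALE_UP_MS = 500    # slower than this: add capacity
SCALE_DOWN_MS = 100  # faster than this: remove capacity


def measure_response_ms(url, timeout=5.0):
    """Time one HTTP GET against a health endpoint, in milliseconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return (time.perf_counter() - start) * 1000.0


def scaling_decision(response_ms, instances, min_instances=1, max_instances=10):
    """Return the new instance count for one measured response time."""
    if response_ms > SCALE_UP_MS:
        return min(instances + 1, max_instances)
    if response_ms < SCALE_DOWN_MS:
        return max(instances - 1, min_instances)
    return instances  # within the acceptable band, change nothing
```

For instance, `scaling_decision(650, 2)` returns 3 and `scaling_decision(80, 2)` returns 1, while a 250 millisecond response leaves the count unchanged. The measurement function is separate so that the decision logic can be exercised without a live service.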
However, relying on real-time data alone can be troublesome if response times fluctuate widely, because the system will scale up and down constantly (this can use a lot of resources in a rigid architecture, and a smaller amount of resources in a microservices architecture).
Combining real-time and historical data is a better (and also more complex) preventive healing approach. Using our response time example, you design a system that stores response time, memory and CPU information and uses an appropriate algorithm to process it alongside real-time data to predict future needs. So, if memory usage has been increasing steadily for the past hour and reaches a critical point of 90 percent, your system determines that scaling is appropriate, thereby preventing errors.
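One way to combine stored history with the latest reading is sketched below. The source only calls for "an appropriate algorithm", so the least-squares slope used here is an assumption, as are the `MemoryTrend` class, the window of 12 samples (for example, one every five minutes over an hour) and the sample values. Only the 90 percent critical point comes from the text.

```python
from collections import deque


class MemoryTrend:
    """Keeps recent memory readings and decides whether scaling is due."""

    def __init__(self, window=12, critical_percent=90.0):
        self.samples = deque(maxlen=window)  # historical data
        self.critical_percent = critical_percent

    def record(self, percent):
        self.samples.append(percent)

    def slope(self):
        """Least-squares slope per sample; 0.0 with fewer than 2 samples."""
        n = len(self.samples)
        if n < 2:
            return 0.0
        mean_x = (n - 1) / 2
        mean_y = sum(self.samples) / n
        num = sum((x - mean_x) * (y - mean_y)
                  for x, y in enumerate(self.samples))
        den = sum((x - mean_x) ** 2 for x in range(n))
        return num / den

    def should_scale(self):
        """Scale only if usage is trending up and has hit the critical point."""
        if not self.samples:
            return False
        latest = self.samples[-1]  # real-time data
        return latest >= self.critical_percent and self.slope() > 0


rising = MemoryTrend()
for percent in [62, 66, 70, 74, 78, 82, 86, 90]:
    rising.record(percent)

falling = MemoryTrend()
for percent in [99, 97, 95, 93, 91, 90]:
    falling.record(percent)
```

In this sketch `rising.should_scale()` is true because usage has climbed steadily to 90 percent, while `falling.should_scale()` is false even though its latest reading is also 90 percent, since the trend points downward.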
Designing Self-Healing Systems: Three Principles & a Five-Point Roadmap
Principles