Self-Healing Code: A Journey Through Auto-Remediation
Managing modern product infrastructure and applications is daunting: the bigger the infrastructure, the more complicated the operational challenges you face. Things break, daemons die, services stop, and clusters fall, and on top of that you have to write root cause analysis (RCA) documents and runbooks on how to fix the same problem in the future. If you keep adding monitoring, you end up with a huge pile of alerts and failures every day.
To ensure the availability of your product as you scale, either you automate or you die. This is where auto-remediation comes into the picture.
Auto-remediation, or self-healing, is a workflow that is triggered by alerts or events and responds by executing actions that can prevent or fix the problem.
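As a rough illustration of that workflow, the sketch below maps incoming events to remediation actions. It is not taken from the talk: the `Remediator` class, the event keys (`host`, `service`, `state`), and the `restart` placeholder are all hypothetical names chosen for this example, and the action only describes what it would do rather than doing it.

```python
from typing import Callable, Dict, List

# Hypothetical event shape: {"host": ..., "service": ..., "state": ...}
Event = Dict[str, str]
Action = Callable[[Event], str]


class Remediator:
    """Maps (service, state) pairs to remediation actions."""

    def __init__(self) -> None:
        self._actions: Dict[str, List[Action]] = {}

    def on(self, service: str, state: str, action: Action) -> None:
        # Register an action to run when this service reports this state.
        self._actions.setdefault(f"{service}:{state}", []).append(action)

    def handle(self, event: Event) -> List[str]:
        # Run every action registered for the event; unknown events are ignored.
        key = f"{event['service']}:{event['state']}"
        return [action(event) for action in self._actions.get(key, [])]


def restart(event: Event) -> str:
    # Placeholder action: describes the fix instead of performing it.
    return f"restart {event['service']} on {event['host']}"


if __name__ == "__main__":
    remediator = Remediator()
    remediator.on("apache", "CRITICAL", restart)
    alert = {"host": "web01", "service": "apache", "state": "CRITICAL"}
    print(remediator.handle(alert))  # ['restart apache on web01']
```

Keeping the registry separate from the actions means a new alert type can be handled by registering another function, without touching the dispatch logic.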
The simplest example of auto-remediation is restarting a service (say, Apache) when it is down. Imagine an automated action that is triggered by a monitoring system to restart the service and prevent an application outage. In addition, it creates a task and sends a notification so that an engineer can find the root cause during business hours instead of in the middle of the night. Furthermore, event-driven automation can be used for assisted troubleshooting, so that when you get an alert it already includes the related logs, monitoring metrics and graphs, and so on.
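The restart example can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation from the talk: it assumes a systemd host where `systemctl is-active --quiet` and `systemctl restart` are available, and the names `remediate_service`, `run_command`, and the `notify` callback are hypothetical. In a real setup the notification step would create a ticket or page someone instead of printing.

```python
import subprocess
from typing import Callable, Dict, List


def run_command(argv: List[str]) -> int:
    # Run a command and return its exit code without raising on failure.
    return subprocess.run(argv, check=False).returncode


def remediate_service(
    service: str,
    run: Callable[[List[str]], int] = run_command,
    notify: Callable[[str], None] = print,
) -> Dict[str, object]:
    """Restart `service` if it is down and report what happened."""
    check = ["systemctl", "is-active", "--quiet", service]

    # Exit code 0 means the unit is active, so there is nothing to fix.
    if run(check) == 0:
        return {"service": service, "action": "none", "healthy": True}

    run(["systemctl", "restart", service])
    healthy = run(check) == 0

    # Leave a trail so the root cause can be investigated later.
    outcome = "recovered" if healthy else "still down"
    notify(f"{service} was down, restart attempted, service {outcome}")
    return {"service": service, "action": "restarted", "healthy": healthy}
```

A monitoring system could call `remediate_service("apache2")` from its event handler; the unit name is an assumption and varies by distribution (for example `httpd` on Red Hat based systems). Passing `run` and `notify` as parameters keeps the logic testable without touching a real service.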
- Basic knowledge of Python
- Basic knowledge of Saltstack (a Python-based open source configuration management tool)
- Knowledge of Nagios
- Basic Bash or Python scripting knowledge
Presentation - https://speakerdeck.com/arusing/self-healing-code-a-journey-through-auto-remediation
Arun Singh is a seasoned Site Reliability Engineer and DevOps professional, currently employed at Adobe Systems India Pvt Ltd. He has extensive experience in Linux administration, several monitoring tools, Python and Bash programming, and SQL and NoSQL databases. He has developed an expert-level understanding of Saltstack, a configuration management tool. His recent work revolves around improving the efficiency and productivity of site reliability engineers and DevOps professionals in their day-to-day work.