
Automated root cause analysis and its criticality in the IT landscape

Written by Stephanie Jones | Jul 19, 2019 5:35:26 AM

Innovation – Is it complex?

“Innovation is the ability to see change as an opportunity - not a threat.” This visionary statement by Steve Jobs cast a spell on the business world, and a series of revolutionary technologies took shape. The last two decades have witnessed technological evolution to an unimaginable extent. Annual deployments, on-premise IT infrastructure, a big room of servers flashing red lights during downtime: these are no longer the reality. Physical servers have largely been replaced by cloud servers. Although cloud infrastructure may offer automatic, self-correcting capabilities, identifying an anomaly and its corrective measure remains a challenge in microservice and cloud computing environments. Root Cause Analysis (RCA) addresses this operational complexity in IT process operations.

Anomaly Detection and RCA

If a critical application displays an error because a database is unavailable, the IT operations team must identify and understand the root cause of that error. Anomaly detection locates a problem, whereas RCA is the process of understanding the precise reason that caused it: defects are categorized and then analyzed. In the IT process automation landscape, RCA enables timely corrective action, which prevents rework and saves time and money.
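To make the distinction concrete, here is a minimal sketch of the detection half only. It assumes a simple z-score rule over response times; the function name, the sample values and the threshold of 3 are illustrative assumptions, not something the article prescribes. The rule flags that something is wrong but says nothing about why, which is the part RCA has to supply.

```python
from statistics import mean, stdev

def find_anomalies(baseline, observed, threshold=3.0):
    """Return (index, value) pairs in `observed` whose z-score against
    the baseline exceeds `threshold`. Hypothetical helper for illustration."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A flat baseline gives no scale to measure deviation against.
        return []
    return [
        (i, value)
        for i, value in enumerate(observed)
        if abs(value - mu) / sigma > threshold
    ]

# Hypothetical response times in milliseconds.
baseline_ms = [102, 98, 101, 99, 100, 103, 97, 100]
observed_ms = [101, 99, 180, 104]

print(find_anomalies(baseline_ms, observed_ms))
```

The baseline is kept separate from the observed values so that a large outlier cannot inflate the standard deviation it is being compared against.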

What is Automated RCA?

Automated RCA identifies defects, reviews them and carries out an extensive analysis without manual intervention. Without automation, the analysis is expensive, because it involves working through unstructured data by hand.

Wondering how to make processes more seamless and automated? RCA can be automated by applying classification algorithms from the data mining domain. Information about the factors that caused a problem can be extracted from the process execution data recorded by the IT support system. Because the entire data mining process is automated, the user does not need specialized analytical skills to improve the business process.
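As a rough illustration of the idea, the sketch below ranks the attributes of process execution records by information gain, the split criterion that decision-tree classifiers commonly use. It is not a full classifier, and the record fields (`db_pool`, `region`, `outcome`) are hypothetical names chosen for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(records, attribute, target="outcome"):
    """Reduction in entropy of `target` after splitting on `attribute`."""
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder

def rank_factors(records, target="outcome"):
    """Order attributes from most to least informative about the outcome."""
    attributes = [key for key in records[0] if key != target]
    return sorted(
        attributes,
        key=lambda a: information_gain(records, a, target),
        reverse=True,
    )

# Hypothetical process execution data.
records = [
    {"db_pool": "exhausted", "region": "eu", "outcome": "failed"},
    {"db_pool": "exhausted", "region": "us", "outcome": "failed"},
    {"db_pool": "healthy", "region": "eu", "outcome": "ok"},
    {"db_pool": "healthy", "region": "us", "outcome": "ok"},
]

print(rank_factors(records))
```

In this toy data the database pool state separates failures from successes completely, while the region carries no information, so the pool state is ranked first as the likely contributing factor.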

Tools used for Automated RCA

A few of the tools and techniques used to extract and analyze data in the automated RCA landscape include:

  • Per-analysis extraction using simple database queries
  • Regular automated extractions using ETL tools
  • Failure Mode and Effects Analysis (FMEA)
  • Direct data integration for extraction using existing interfaces
  • Kaizen or Continuous Improvement
  • Impact analysis
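The first item in the list, per-analysis extraction with a simple database query, might look like the sketch below. It uses an in-memory SQLite database, and the `process_executions` table with its `step` and `status` columns is an assumed schema for illustration; a real IT support system will have its own.

```python
import sqlite3

def failed_executions_by_step(conn):
    """Count failed executions per process step, most frequent first."""
    query = """
        SELECT step, COUNT(*) AS failures
        FROM process_executions
        WHERE status = 'failed'
        GROUP BY step
        ORDER BY failures DESC, step ASC
    """
    return conn.execute(query).fetchall()

def demo_connection():
    """Build an in-memory database with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE process_executions "
        "(id INTEGER PRIMARY KEY, step TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO process_executions (step, status) VALUES (?, ?)",
        [
            ("db_connect", "failed"),
            ("db_connect", "failed"),
            ("render_page", "ok"),
            ("auth", "failed"),
            ("auth", "ok"),
        ],
    )
    return conn

conn = demo_connection()
print(failed_executions_by_step(conn))
```

A query like this is run once per analysis; scheduling it to run regularly is what the ETL option in the list automates.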

Factors contributing to defects in the IT landscape

Let’s look at the factors that contribute to defects in the IT environment. A few of them are:

  • Insufficient requirements
  • Lack of testing tools
  • Communication gap
  • Negligence
  • Inaccurate assumption
  • Design gaps
  • Deployment issue
  • Unsuitable environment
  • Useless test data

The RCA Process Flow

The process followed by automated RCA is as follows:

 

Adoption of Automated RCA – How far has it reached?

A shortage of skilled staff deters organizations from carrying out active process analysis, which is why they tend to adopt automated RCA to enable fact-based optimization strategies and to assure process quality, effectiveness and compliance. Automating RCA is essential to reducing mean time to repair (MTTR). For example, a major service interruption of an enterprise software solution can hamper existing IT operations. A service platform that automates RCA can help avoid such disruption, for instance by testing point-of-sale (POS) business processes across the network.
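For readers who want to track the metric, MTTR is commonly computed as total repair time divided by the number of incidents. The sketch below assumes each incident is recorded as a (detected, resolved) pair of timestamps; the incident data is made up for illustration.

```python
from datetime import datetime, timedelta

def mean_time_to_repair(incidents):
    """Average time between detection and resolution across incidents.

    `incidents` is a list of (detected, resolved) datetime pairs.
    """
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)

# Hypothetical incidents: one took 30 minutes to repair, one took 90.
incidents = [
    (datetime(2019, 7, 1, 9, 0), datetime(2019, 7, 1, 9, 30)),
    (datetime(2019, 7, 2, 14, 0), datetime(2019, 7, 2, 15, 30)),
]

print(mean_time_to_repair(incidents))
```

Shortening the diagnosis step of each incident is how automated RCA lowers this average.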

Automated RCA – Is it unavoidable?

Automated RCA deals with applications that are a complex combination of code and infrastructure. When the code functions properly, it communicates with the database, firewall and storage; when application code with new features does not perform as expected, it signals an error. Code signal anomalies can appear as log entries, slowdowns or poor customer experiences. Application Performance Monitoring (APM) tools, for example, surface code signals: they can detect an issue but cannot identify its root cause. Log files, although mostly reviewed manually, offer a glimpse of what went wrong. When an error is detected through log files, however, developers often cannot determine whether the issue is code related or infrastructure related; a log file can point to a possible error but cannot establish the root cause behind it. Automated RCA helps in these situations by digging down to the actual root cause of the error.

Real use cases highlighting the criticality of RCA

Consider a use case that shows the critical role of automated RCA in the IT landscape. A company’s customer service application started producing data errors that cost the company more than $5,000 an hour. The errors were found in the database log file, suggesting that the failure was due to faulty storage. However, replacing the storage system did not fix the issue entirely, which suggested that the problem lay elsewhere.

Finally, through automated RCA, the root cause was traced to a faulty power supply. While it searched for the root cause, the company lost approximately half a million dollars in a few weeks, and in the interim its customers lost faith in the company. In situations like this, automated RCA is 90% faster and more effective in analyzing the triggering factors behind errors.

Overwhelming consequences of IT downtime

The consequences of an IT issue can be devastating. According to Gartner, the average cost of unscheduled IT downtime is $5,600 per minute. In fact, one hour of lost productivity in a company of 5,000 employees costs the company approximately $300,000. In such a scenario, it is essential to locate the root cause of the issue and fix it immediately. That is difficult, and with modern IT service delivery infrastructure the complexities are threefold.
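The figures above are easy to turn into a quick estimate. The sketch below simply multiplies out the per-minute rate quoted in the text and derives the per-employee hourly cost implied by the $300,000 figure; the function names are illustrative, and real costs will vary by business.

```python
COST_PER_MINUTE_USD = 5_600  # average downtime cost quoted in the text

def downtime_cost(minutes, cost_per_minute=COST_PER_MINUTE_USD):
    """Estimated cost in USD of an outage lasting `minutes`."""
    return minutes * cost_per_minute

def hourly_cost_per_employee(hourly_loss_usd, employees):
    """Productivity loss per employee for one hour of downtime."""
    return hourly_loss_usd / employees

print(downtime_cost(60))                         # one hour at the quoted rate
print(hourly_cost_per_employee(300_000, 5_000))  # implied by the second figure
```

At the quoted rate, one hour of downtime comes to $336,000, and the second figure works out to $60 per employee per hour.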

The issue could be network-related, storage-related or related to database performance. For that matter, even a simple misconfiguration of an application can cause damage beyond imagination. To complicate matters further, IT experts are mostly domain experts with little knowledge of the overall picture, which makes it difficult to reach the root cause promptly and precisely.

Manual RCA used to be the norm, and with it the views of experts can differ. One might believe that the issue lies with the web server, whereas another might attribute it to a hardware failure. The best possible solution to this problem is to automate RCA through AIOps technology. RCA combined with IT process automation can correlate all symptoms and run tests to locate the root cause, helping IT organizations resolve service outages accurately. Automated RCA involves collecting and correlating monitoring data, log records, events and other related information. It can locate the root cause of a service issue in less than 30 seconds, compared with the days or months a manual RCA might take. This reduces outage time and, with it, the business impact considerably.

The goal of almost all IT organizations these days is to resolve issues before they affect employees, customers and the business. With the emergence of public, private and hybrid cloud, automated RCA has reduced the dependence on humans to track minor problems.
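One simple way to picture symptom correlation is with a dependency map. The sketch below is an assumption-laden illustration, not how any particular AIOps product works: it treats an alerting component as a probable root cause when nothing it depends on, directly or indirectly, is also alerting. The component names mirror the storage and power supply use case described earlier, and the map is assumed to be free of cycles.

```python
def reachable(dependencies, start):
    """All components that `start` depends on, directly or transitively."""
    seen = set()
    stack = list(dependencies.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dependencies.get(node, []))
    return seen

def probable_root_causes(dependencies, alerting):
    """Alerting components with no alerting component beneath them."""
    alerting = set(alerting)
    return sorted(
        component
        for component in alerting
        if not (reachable(dependencies, component) & alerting)
    )

# Hypothetical dependency map: each component lists what it depends on.
dependencies = {
    "customer_app": ["database"],
    "database": ["storage"],
    "storage": ["power_supply"],
    "power_supply": [],
}

print(probable_root_causes(
    dependencies, ["customer_app", "database", "storage", "power_supply"]
))
print(probable_root_causes(dependencies, ["customer_app", "database", "storage"]))
```

The second call shows the limit of the approach: if the power supply is not monitored, the correlation stops at storage, which is the same wrong turn the manual investigation took.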

Albert Einstein once opined, “We cannot solve our problems with the same thinking we used when we created them.” So true! Enterprises believe this wholeheartedly, which is why we have witnessed a transition from manual RCA to automated RCA!

Automated RCA is inevitable

When a defect occurs in a single microservice, it creates a ripple effect that impacts the other microservices. Each of these microservices in turn raises incidents, which finally results in alert storms. When alerts start flooding the inbox of the Incident Manager, it is difficult to prioritize where to act. Because devices, services and applications are all interconnected, it is hard to know which alert points to the real problem. An automated monitoring solution can help the DevOps team strategize better and act on actionable information, which is critical in the IT landscape. Ask yourself today: has your inefficiency grown to the point where it is time to automate?
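A first step in taming an alert storm is to collapse it into a handful of incidents. The sketch below groups alerts that arrive within a fixed window of the previous alert and reports the earliest alert in each group as the place to start looking. The 60-second window, the service names and the timestamps are all assumptions made for the example.

```python
def group_alert_storm(alerts, window_seconds=60):
    """Group (timestamp, service) alerts; an alert joins the current group
    when it arrives within `window_seconds` of the previous alert."""
    groups = []
    for timestamp, service in sorted(alerts):
        if groups and timestamp - groups[-1][-1][0] <= window_seconds:
            groups[-1].append((timestamp, service))
        else:
            groups.append([(timestamp, service)])
    return groups

def summarize(groups):
    """One summary row per group: where it started and how large it grew."""
    return [
        {
            "first_alert": group[0][1],
            "started_at": group[0][0],
            "alert_count": len(group),
        }
        for group in groups
    ]

# Hypothetical alerts as (seconds since start of monitoring, service name).
alerts = [
    (12, "cart"),
    (0, "payments"),
    (5, "checkout"),
    (40, "notifications"),
    (900, "search"),
]

for row in summarize(group_alert_storm(alerts)):
    print(row)
```

Here five alerts reduce to two incidents, and the Incident Manager sees that the first burst began with the payments service rather than having to read every message.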