Since log analysis entails looking for a needle in a haystack, the main question is: What is the fastest way to find the needle?
Logs are ideal for storage and performance purposes because they pack a great deal of information into a small space. The flip side is that this density makes log files hard to read, which becomes costly during software downtime, when getting to the root cause of an incident turns cumbersome.
Addressing this problem, SiliconANGLE Media’s livestreaming studio theCUBE and root-cause-as-a-service provider Zebrium Inc. recently aired an exclusive event on why finding the root cause need not be a painful process: All that is needed is automating the observer.
During the event, Dave Vellante, chief research officer of Wikibon Inc. and theCUBE industry analyst, led several sessions highlighting root cause as a service, or RCaaS, and how Cisco Systems Inc. validated RCaaS at 95.8% accuracy. Expert guests included Zebrium’s Larry Lancaster, founder and chief technology officer, and Rod Bagg, founder and vice president of engineering, as well as Cisco’s Atri Basu (pictured, left), resident philosopher, and Necati Çehreli (pictured, right), technical leader of the customer experience innovation, automation and disruption team. (* Disclosure below.)
“It’s one thing to observe the stack end to end, but who is automating the observers?” Vellante asked in his introduction to the event. “Zebrium is using unsupervised machine learning to detect anomalies and pinpoint root causes, and delivering it as an automated service.”
In case you missed it, here are three key insights from the “Root Cause as a Service” event:
1) Not automating the observer is a recipe for failure.
Whenever a software failure or incident strikes, finding the root cause requires a DevOps engineer, site reliability engineer or developer to go through log files manually. Not only is this tedious, but it’s also costly, because the mean time to resolve, or MTTR, can range from hours to days. Automating the observer therefore becomes necessary to avoid failure, according to Bagg.
“It’s great to know that something went wrong, but the root cause of why it’s happening is going to be buried in log files … to get there fast, you better automate or you’re just doomed for failure, and that’s where we come in,” Bagg stated.
Automating the observer is a stepping stone toward tackling downtime fast, and Zebrium addresses this through RCaaS, which enables root cause analysis in minutes. Bagg pointed out how RCaaS could have saved an SRE at an AIOps company hours of downtime.
“He hadn’t put that integration in, so it wasn’t in his dashboard when he had this incident, but it was certainly in ours,” Bagg said. “It literally would’ve saved him hours and hours. They had this issue going on for over 24 hours, and we had the answer right there in five minutes.”
2) RCaaS takes away the pain of going through logs, both on-prem and in the cloud.
Working through logs manually is not easy, because even with a keen eye and significant expertise, finding the right context still takes hours. Millions or even billions of lines of software and infrastructure log data have to be analyzed to unravel the details of the problem.
Since there is a limit to what a person can filter manually, RCaaS uses unsupervised machine learning to take away the pain of digging through logs, and it can be deployed both on-prem and in the cloud, according to Lancaster.
“Observability is a property of a system, but the problem is if it’s too complicated, you just push the bottleneck up to your eyeball,” he stated. “You can run it on-prem, just like we run it in our cloud. You can run it in your cloud or your own infrastructure.”
RCaaS not only offers an end-to-end view, but also provides a detailed root cause analysis for fast resolution. Even though this sounds too good to be true, Cisco verified the benefits: After testing Zebrium’s RCaaS solution, the company found that it surfaced the correct root cause indicators in more than 95% of incidents, according to Lancaster.
“People have been trying to figure out how to automate this human part of finding the root cause indicators for a long time, and until Zebrium came along, I would argue no one’s really done it right,” he said. “So [Cisco] ran that data through the Zebrium software, and what they found was that in more than 95% of those incidents, Zebrium reflected the correct root cause indicators at the correct time.”
3) Log analysis doesn’t need to be a black-and-white process.
Since software logs are dense and esoteric, they offer few visual cues. As a result, log analysis comes across as a black-and-white process in which reading between the lines is difficult. Zebrium’s RCaaS, however, adds color to this process for better and faster insights, according to Cisco’s Basu.
“If you think about log analysis, it is indeed black and white,” he stated. “You’re looking at it on a terminal screen where the background is black, and the text is white. But what Zebrium does is it provides a lot of color and context to the whole process by using their interactive histogram and summaries of every incident.”
Even though log analysis plays an instrumental role when unearthing details of software downtime, it is a labor-intensive and time-consuming process.
The roughly 8,000 engineers in Cisco’s support arm, the Technical Assistance Center, used to spend a combined 24,000 hours a day on log analysis, which compromised their efficiency, according to Basu.
“The anecdotal evidence was that, on average, an engineer will spend three out of their eight hours reviewing logs either online or offline,” he pointed out. “ … 8,000-plus engineers, and so three hours a day; that’s 24,000 man-hours a day spent on log analysis.”
After facing challenges around maintaining its internal automation system, Cisco wanted to automate 50% of its log analysis, but RCaaS drove it to 95%, according to Cisco’s Çehreli.
“With a sample set of close to 200 SaaS, we found out the majority of the time, almost 95% of the time, the engineer could find the log they were looking for in Zebrium’s analysis,” he stated.