QualiFYI | Ep. 9: Rookout CTO Reveals the Secret to Accelerating Software Debugging

Author:
Arta Shita

Published:
April 29, 2021

In this episode, Liran Haimovitch, Co-founder and CTO of Rookout, discusses the software debugging process and some of the most common challenges engineers face along the way. Check out the video and full transcript below!

The Secret to Accelerating Software Debugging

Arta Shita: Hello everyone, and welcome back to QualiFYI. I'm your host Arta Shita and I am joined here today by Liran Haimovitch, the Co-founder and CTO of Rookout, the leading cloud debugging solution. Liran, welcome.

Liran Haimovitch: Hi Arta. It is great being here.

Q: So excited to have you. What's the one thing you've noticed everybody hates most about debugging?

A: I think the one thing people hate most about debugging is the uncertainty of it. Quite often, you're not sure how close you are to solving the problem. Just today we solved a bug we've been working on for three weeks, and we actually spent two and a half weeks chasing the wrong direction. We were sure we had a problem in one whole part of the application. And then, after two and a half weeks of work, we found that there was a missing log somewhere, and everything we thought we knew about it was completely wrong.

And just yesterday, we restarted the search for the log, and this uncertainty is just so annoying. It's so painful because you're not sure where you are. You're not sure if you're making progress. You're not sure when you're going to get it over with. And quite often, especially for the important bugs, it's not just between you and the bug. You have a team, you have a boss. If a customer reported that bug, then Customer Success and the Account Executives are interested too. Everybody's asking about the bug, and people just don't like uncertainty. And the worst thing is that you're constantly missing out on data. You're constantly missing the data you need to understand the bug and to know where you are on that journey.

Q: Speaking of challenges, what's the most common challenge when trying to solve a customer issue such as this?

A: So the biggest challenge tends to be about getting a reproduction of the issue. While the customer might be very loud about reporting a bug, they're not always the most cooperative about helping you solve that bug. They are not always very cooperative regarding the data itself. The data you need might be very sensitive from a security perspective. It might be very difficult to reproduce the bug due to the chain of events that caused it. And going back to a lack of understanding of the bug, if you don't know what the bug is, then you find it harder to reproduce it. And so the first thing you want to do when a customer reports a bug is to reproduce it, and especially to reproduce it in an environment where you can observe it and understand it.

And unfortunately, quite often, reproducing the bug is the hardest thing, especially if you can't easily observe the bug in a convenient environment. So let's say, for instance, a customer reported a bug in the production environment on their account, but you don't have access to the bug in that environment. Then you're probably going to be trying to reproduce the bug in the production environment on your account, or even more so in a development environment, in a different account.

And then you might be spending most of the time just trying to figure out what in that account, what in that customer, what in that environment is causing the bug. You're going to be spending so much time doing that. But if you could just observe the bug in its natural habitat, so to speak, if you could just see what's happening in the production environment, in that specific customer account, you're going to have a much easier time reproducing the bug and understanding it. And that's much of what we're doing at Rookout. We are empowering engineers to debug in any environment they choose. And we provide all the safeguards they need from a stability, performance, availability, and security perspective, so that they can operate wherever is needed and reproduce the bug wherever it's most convenient for them.

Q: So, how do industry trends like relying on open source and transforming to cloud native microservices-based applications affect your ability to debug?

A: What we're seeing is that these new industry trends, which are allowing us to deliver higher quality software at larger scale with larger teams than ever before, make debugging much harder. And they make debugging much harder on the development side of the software development life cycle. The reason is that those tools and techniques are very scale-oriented. They are great for running in the cloud. They are great for operating at scale. They're not so good for running on your home laptop. It's much harder to spin up multiple microservices on your laptop, especially as you get more and more microservices. It gets harder to spin up those cloud dependencies, whether it's databases or queues or whatever. And we find that more and more engineers spend more of their time running their software in the cloud and then tweaking it, testing it, and so on.

And you just don't have that level of control, the level of visibility you're used to, when operating in the cloud. And in a way, open source is making that even worse. Let's say you're running your microservices in the cloud. You can always add a log line. Sure, it's going to take you 20 minutes to redeploy that microservice to the Google cluster: you rebuild the application and the container image, and you run the deployment. But 20 minutes is not that bad. If you have to add a log line to an open source package, though, that means you have to clone the original, figure out how to build it, how to package it, and how to make the new custom version a dependency of your application, which can easily take you hours, sometimes even days.

Now, when we were working locally, it wasn't that bad because you could just use the debugger, any debugger. It doesn't care if it's your code, third-party code, or open source code. The debugger can work through all of that, but adding logs is not as easy. And that's actually one of the benefits of Rookout, because using Rookout, you can debug your own code. You can debug other microservices. You can even debug open source code very easily, just like you would with a traditional debugger, except you can do it in the cloud.

Q: Very convenient, especially with how working remotely has completely changed in the last year for obvious reasons.

A: Yeah. We're seeing that. We're seeing that with our customers, we're seeing it internally. And the move to remote work is making it very important to empower engineers to be more independent. In the past, when engineers were working in the same room or in the room next door, it was much easier to ask somebody else a question, or to get privileges from somebody else, or to ask somebody: How is this working? What could you do? Do you know what this is doing? Can you prove this out for me? It was much easier to work collaboratively. Now with remote work, people are working in different areas, sometimes in different time zones. And so it's much more critical to empower people to work independently, because synchronization is much, much harder. Again, it's very important to empower engineers to independently get the control and visibility they need without relying on other people, without being overly dependent.

Q: So, can you share any funny anecdotes with debugging? I'm sure you have several.

A: So, I think one of the first bugs we actually saw a customer resolve using Rookout in production was a bug they had been chasing for almost eight months. It was about accessing an internal portal. Most of the company could access that internal portal, but a handful of employees couldn't. And they debugged it for months and months and months. Still, those guys couldn't get into the system, no matter what. They went to the login page, they clicked on the Google login, and they got an "error" page, and nobody could figure out what was going on. So they deployed Rookout and they were using our non-breaking breakpoints. They were debugging in an event-driven way, seeing the login requests coming in and going through the code. And they had an if statement, a conditional statement, in there.

And that conditional statement was: if the cookies got too big, they were truncated. And they had decided that cookies over 4k were too big. Because the code passed an unnecessary flag to Google, they actually got the full Google profile for the employees. Those few employees apparently had very big Google profiles, and all of a sudden those cookies were huge. They went over 4k and they got truncated. Now within that if statement, where they were truncating the cookies, they had actually added a comment. And that comment was: make sure to add a log line, so we know if this is happening; this might be a bug. And they chased that missing log line for eight months, all because somebody forgot to add that log line. And actually we're seeing that with almost every customer we're talking to today. Engineers are living in what we like to call "logging FOMO".
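Editor's note: the truncation branch described above can be pictured with a small Python sketch. This is not the customer's code. The function name `fit_cookie`, the logger name, and the exact 4096-byte limit are all assumptions made for illustration; the only point is that the branch that truncates also records that it did so.

```python
import logging

# Hypothetical logger name, chosen for this illustration only.
logger = logging.getLogger("session_cookies")

# Assumed limit, standing in for the "4k" mentioned in the anecdote.
MAX_COOKIE_BYTES = 4096


def fit_cookie(value: str, max_bytes: int = MAX_COOKIE_BYTES) -> str:
    """Return the cookie value, truncated to max_bytes if it is too large."""
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # The log line that was missing in the anecdote: make truncation visible.
    logger.warning(
        "cookie truncated: %d bytes exceeds limit of %d", len(encoded), max_bytes
    )
    # Cut at the byte limit and drop any partial trailing character.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

With this sketch, a value under the limit passes through unchanged, while an oversized value is cut down and leaves a warning behind that someone investigating a failed login could search for.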

Q: Logging FOMO?

A: Yeah, because we're afraid. We're afraid something's going to go wrong and we're not going to have the logs in place to fix it. And we're afraid that pushing those new logs is going to take us weeks and months, and we're going to be blind without those logs. I mean, that's how Splunk and Elastic became billion-dollar companies: engineers are just throwing in logs they don't need out of the fear that they might, someday, be needed.

And the thing is, it's not even helping, because you're just pushing logs not because you need them, but because you think you might, maybe someday, need them. And at the end of the day, you're just going through trash because there's so much unnecessary noise, so much unnecessary logging. That's something we need to change in the way we develop software. We need to make it possible to efficiently and agilely create new logs, create new metrics, create spans, without going through cumbersome processes, such as releasing a new version for every small change we're trying to make, for every new piece of data we want to collect.

Q: It's kind of like a false safety blanket, creating all of these logs for essentially no reason.

A: Yeah. Because you're just hoping that maybe this is going to be useful someday. But, at the end of the day, if you're not able to add those logs when you need them, chances are all of that logging is not going to help you, because you never know what you're truly going to need.

Arta Shita: Thank you so much for your time, Liran. I'm excited to learn more about Rookout and how we can work together one day. So, thank you for your time and really appreciate having you as a guest.

Liran Haimovitch: Thanks for everything.

Arta Shita: Visit quali.com to learn more about Infrastructure Automation at Scale.

Topics: Dev/Test
