Troubleshooting Black Holes

October 7, 2019

Let me tell you about Troubleshooting Terry. Terry’s a fictional software engineer on my team.

At the start of last week, Terry asked whether a certain piece of (seemingly legacy) code was still in use. He needed to know because the answer affected the scope of his work: if the code was still in use, there would be another case that he’d have to handle.

I didn’t know. I told him this, and also asked what he had done to figure this out on his own. As it turns out, he hadn’t really done anything. He said he would do some more investigation.

Since I was curious about the answer as well, I did some investigation on my own. After a slight trek through the code, some documentation, and a database table, I figured out that this code was essentially unused. Good to know.

A few days later, Terry came back to me with a snippet of a conversation he was having with someone else. It turns out that he had asked our PM about what he should do about the legacy code. The PM wasn’t really sure, and was trying to redirect him to someone else who might know.

At this point, I realized that Terry had fallen into a troubleshooting black hole. Terry was stuck troubleshooting his problem, spinning his wheels without getting an actual result.

Cause and Effect

What causes a troubleshooting black hole? What does it lead to?

Ultimately, a troubleshooting black hole is caused when someone is unable to get the information they need to fix their problem. The specific reason for this might vary. Maybe they can’t, maybe they won’t, maybe they don’t know how.

Regardless of the reason, the end result is that stuff doesn’t get done. Or it gets done very slowly.

Reducing Black Holes

There are a few ways to reduce these troubleshooting black holes. Which ones apply depends on your role and on what your actual problem is.

As An Engineer

Recognizing Black Holes

The first skill that you need to learn is how to recognize when you’re in a troubleshooting black hole.

You always have the potential to enter one of these black holes whenever you run into a problem. The key thing that helps you avoid the black hole is having an idea of the size of the problem.

Knowing roughly how big the problem is allows you to estimate how long it will take to reach a solution or some kind of checkpoint. From there, you can timebox.

Of course, you may not know how to estimate the size of the problem you’re facing. In this case, you can ask someone to get an idea of how long they think you should spend working on it before checking back in for more help. You can also set a hard fallback, such as 15-60 minutes, after which you’ll talk to someone if you haven’t been able to understand the issue and come up with an acceptable solution.
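The fallback rule can be sketched as a small helper that picks a check-in time. This is only an illustration: the function name `check_in_time` and the 30-minute default (one point inside the 15-60 minute range) are assumptions, not a prescribed value.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed default, chosen from within the 15-60 minute fallback range.
FALLBACK_MINUTES = 30


def check_in_time(start: datetime, estimate_minutes: Optional[int] = None) -> datetime:
    """Return the time at which to stop and ask for help.

    Uses the estimate if one exists (your own or a teammate's);
    otherwise falls back to the hard default.
    """
    minutes = estimate_minutes if estimate_minutes is not None else FALLBACK_MINUTES
    return start + timedelta(minutes=minutes)
```

In practice a calendar reminder does the same job. What matters is that the deadline is set before you start digging, not after you already feel stuck.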

The other key indicator that you are stuck is that you aren’t actually making any progress or aren’t clear on the progress you’ve made. You’re often at this point when you’re going in circles, don’t know where to continue investigating, and/or are unable to clearly explain the issue.

Troubleshooting

Naturally, your ability to troubleshoot determines whether you’re making progress or just stuck. Your troubleshooting ability is roughly equivalent to the rate at which you’re able to find useful information.

Almost all the information you’re looking for when troubleshooting falls into one of two categories: what is supposed to be happening and what is actually happening. It’s important to be clear which is which, otherwise you’ll likely end up making incorrect assumptions later on.

When it comes to figuring out what is supposed to be happening, you often want to find some authority on the topic. This could be documentation, the person who owns/is an expert at that topic, etc. Understanding what is supposed to be happening helps you orient yourself more quickly, allowing incorrect things to jump out more easily.

The key thing to remember about this information is that it is never guaranteed to be correct. This is pretty much by definition: if the intended behavior were identical to the actual behavior, you wouldn’t have an issue. The amount of drift between intended and actual behavior depends on the quality of your information (or of its source).

It makes sense to construct your own understanding of the issue when drift is high or likely to be high. It also makes sense when there’s a high cost to figuring out the intended behavior from another source.

When it comes to figuring out what is actually happening, you often need to rely on your own ability to investigate. This is, more or less, your ability to figure out what is going on when the relevant code is being run.

One of the most important skills here is the ability to read code. This includes not just understanding pieces of code, but being able to navigate between different parts of the code base and understand how everything fits together. The more accurately you can do this, the better you will be at statically determining what is actually happening.

Sometimes you will not be able to statically determine what is going on, either because you don’t yet have the skill or because you made a mistake somewhere. This is normal, but you have to be able to recognize when you’ve misinterpreted something. You improve your understanding dynamically, by modifying or interacting with the relevant code to confirm whether what it’s doing matches your expectations. Ways to do this include changing the code, adding/reading logs, and/or using any tools/endpoints available to investigate the state of your system.
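As a rough illustration of the dynamic approach, the sketch below adds a temporary log line to a suspected-legacy branch and then exercises the code to see whether that branch is ever reached. The function `apply_discount`, the `legacy_flag` field, the logger name, and the sample orders are all hypothetical; only the standard `logging` module is real.

```python
import logging

logger = logging.getLogger("checkout")  # hypothetical logger name


def apply_discount(order):
    """Hypothetical function with a branch we suspect is legacy."""
    if order.get("legacy_flag"):
        # Temporary log: records every time the suspected-dead branch runs.
        logger.warning("legacy discount path hit for order %s", order["id"])
        return order["total"] * 0.9
    return order["total"]


class ListHandler(logging.Handler):
    """Collects log messages in memory so they can be inspected afterwards."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def legacy_hits(orders):
    """Run the orders through apply_discount and return any legacy-path logs."""
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        for order in orders:
            apply_discount(order)
    finally:
        logger.removeHandler(handler)
    return handler.messages
```

If a representative set of inputs produces no such log lines, that is evidence, though not proof, that the branch is unused. It is the "what is actually happening" half of the picture, to be weighed against whatever the documentation or the code’s owner says is supposed to happen.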

As A Leader

As a leader, your biggest concern is not whether or not your team members are falling into troubleshooting black holes. They will, no matter what you do.

Instead, your concern should be whether your team members fall into troubleshooting black holes without you finding out until much later. This is about mitigating the result of these black holes, not preventing them entirely. You should also teach your team members how to better mitigate troubleshooting issues themselves.

Establish Forcing Functions

A leader’s implicit assumption is that when someone on their team can’t figure something out, they will speak up. Unfortunately, troubleshooting black holes are created when this behavior doesn’t happen.

A simple forcing function is to just have your team member tell you the issue’s resolution or ask you for more help when they aren’t making progress. In particular, you should establish some explicit bounds—tell them to get back to you by a certain time.

If you can’t trust your team member to resurface the issue, you can also go to them instead. Just rely on your own inner timer or set an explicit reminder. Both approaches work, but having your team member come to you is a bit better, as it helps to get your team in the habit of resolving issues.

Show, Don’t Tell

For some issues, your team member ends up in a troubleshooting black hole because they don’t know how to troubleshoot that particular issue. They don’t have the right skillset. In these situations, rather than telling them how to fix the problem, you want to walk them through how they can resolve this on their own. You want to teach them the skill.

It’s important to do this side-by-side in a way that they can observe what you’re doing and thinking. This also allows your team member to talk with you to clarify the points that they didn’t know or are confused by. Depending on the issue, it can be better to let your team member drive this investigation while you help show them what they should be paying attention to.

Clarify and Reinterpret

Other times, your team member has the right troubleshooting skills. They just don’t know how to use those skills properly, or they apply them inefficiently. This usually indicates that they didn’t understand what they were working with, or that they haven’t fully mastered the skill.

In this case, it helps to replay the fix. This can be a literal replay, where your team member walks you through each step, or it can be just a discussion. During this replay, you’ll likely hear a lot of low-level details about each individual step that your team member took. Your goal is to help them understand how each of these steps contributes either to their understanding of the problem or to their solution.

You do this by frequently pausing to check their understanding. One way to check is to ask them to explain the current state or situation. This might mean explaining what is currently going on or explaining how they are interpreting something. Another way is to summarize for them: “So you wanted to do X because of Y?”
