Reasoning through unreasonable performance degradation
Friday Oct 9, 2020
We’ve all run into a service or application that has suffered from performance degradation. Usually it is noticeable straight after a big release and fairly easy to pinpoint to some change in your code.
But what if it isn’t that straightforward? How could you go about finding the root cause, or maybe even multiple causes, of the issue? Let’s have a look at what to look for and how to prepare.
One note before we dive in: the pointers below are written with garbage-collected (GC) languages in mind. They may or may not apply to non-GC languages.
Recognizing performance issues
When facing an issue we need to get an idea of what kind of problem we’re dealing with. We can judge that by some basic metrics on the machine level and some generic behaviour patterns.
Our baseline
When maintaining an application it really helps to have a general sense of what your application’s metrics look like. Even when it is stable, have an occasional look at the metrics. This will help you spot issues and abnormalities. You might even catch issues before they cause functional impact. Later on we will have a look at what to measure.
It is hard to know exactly what to look for or what to remember, so here are a few things I would start with.
- Are there trends? Per day or per week?
For example: you could have usage spikes every day around 10am that are perfectly normal.
- What is a normal range for my usage?
For example: my memory usage is usually between 400MB and 600MB.
It would be too much to expect to have this overview right from the start. Just spend a few minutes every day and you’ll develop a feel for the levels over time.
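If you want something more concrete than a gut feeling, you can derive a rough "normal range" from historical samples. This is a minimal sketch, assuming you can export your metric as a plain list of numbers; `normal_range`, `is_abnormal` and the sample values are hypothetical and only there for illustration.

```python
from statistics import quantiles


def normal_range(samples, low_pct=5, high_pct=95):
    """Return the (low, high) band that most historical samples fall into."""
    # n=100 yields 99 cut points; cut point i-1 is the i-th percentile.
    cuts = quantiles(samples, n=100, method="inclusive")
    return cuts[low_pct - 1], cuts[high_pct - 1]


def is_abnormal(value, band):
    """True when a new reading falls outside the band."""
    low, high = band
    return value < low or value > high


# Made-up memory readings in MB, standing in for exported metrics.
history_mb = [420, 455, 480, 510, 530, 560, 590, 470, 505, 540]
band = normal_range(history_mb)
print(band, is_abnormal(900, band))
```

A percentile band is only one way to do this. It ignores daily or weekly trends, so a perfectly normal 10am spike could still be flagged.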
Now on to the issues we might encounter. In general I see four types of issues, so let’s have a look.
Memory pressure
The first one we’ll have a look at is memory pressure. This usually translates to both high CPU and high (but stable-ish) memory usage. The application likely becomes intermittently unresponsive and generally slow compared to its normal throughput.
This is usually the result of lots of memory being allocated and then released again shortly after. The garbage collector (GC) is sort of keeping up, but it has to run far more often than usual. The GC is the cause of the high CPU in this case, so we can focus on finding the cause of the memory churn first.
Be on the lookout for high load on an endpoint or message queue(s). Or look for methods/call stacks that create lots of short-lived objects.
We’ll go into how to find the probable call stack later on.
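To check whether the GC really is running more often than usual, you can count collections around a suspect piece of code. The sketch below assumes CPython and its `gc.callbacks` hook; `GcMonitor` is a hypothetical helper name. It only sees CPython's cyclic collector, so treat the numbers as a rough signal. Other runtimes usually expose similar information through GC logs or runtime counters.

```python
import gc
import time


class GcMonitor:
    """Counts garbage collections and the time spent in them (CPython only)."""

    def __init__(self):
        self.collections = 0
        self.total_seconds = 0.0
        self._started_at = None

    def _on_gc(self, phase, info):
        # CPython calls this with phase "start" before and "stop" after a collection.
        if phase == "start":
            self._started_at = time.perf_counter()
        elif phase == "stop" and self._started_at is not None:
            self.collections += 1
            self.total_seconds += time.perf_counter() - self._started_at
            self._started_at = None

    def __enter__(self):
        gc.callbacks.append(self._on_gc)
        return self

    def __exit__(self, *exc_info):
        gc.callbacks.remove(self._on_gc)
        return False


with GcMonitor() as monitor:
    rows = [[i] for i in range(200_000)]  # allocation-heavy stand-in for real work
print(f"{monitor.collections} collections, {monitor.total_seconds:.4f}s in GC")
```

Wrapping a single endpoint or message handler this way and comparing the counts against a quiet period gives you a first hint of where the churn comes from.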
Memory leak
Closely related to memory pressure is a memory leak. You’ll see the same CPU profile: high usage. The difference is in the memory profile. You’ll see memory increase until it hits the max; your graphs may even show it going past the max you’ve set for your runtime. At that point you’ll probably get out-of-memory exceptions.
A simple, albeit forced, example would be a globally defined list that you keep adding new objects to without ever removing anything. Find the variables that hold instances of a specific class and reason about whether each of them ever goes out of scope. If there are many places where this happens, focus on the ones that are in the (probable) call stack for our issue.
Loading a large amount of data could have the same effect, e.g. loading many millions of rows or a large file into memory. Garbage collection will go into overdrive, causing CPU to spike, and memory will rise with each row.
CPU load
CPU load is pretty straightforward: you’ll see high or maxed-out CPU usage for extended periods of time. If there is also a memory component, I would first look at memory pressure or memory leak causes. However, if memory is stable and close to normal, there is a lot of processing happening without I/O bottlenecks.
The easiest way to reproduce this is an infinite loop without I/O and with little object creation.
Possible hotspots to look at are loops. Also check the number of threads that are doing work.
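One way to see which threads are busy, and where, is to dump the current stack of every thread a few times and look for frames that keep showing up. Below is a minimal sketch, assuming CPython (it relies on `sys._current_frames()`); `busy_loop` and `dump_thread_stacks` are hypothetical names, and the loop takes a stop flag so the demo can end.

```python
import sys
import threading
import time
import traceback


def busy_loop(stop):
    """CPU-bound loop without I/O and with almost no object creation."""
    counter = 0
    while not stop.is_set():
        counter += 1


def dump_thread_stacks():
    """Return {thread name: list of 'function (file:line)' frames, outermost first}."""
    names = {t.ident: t.name for t in threading.enumerate()}
    stacks = {}
    for ident, frame in sys._current_frames().items():
        name = names.get(ident, f"unknown-{ident}")
        stacks[name] = [
            f"{fs.name} ({fs.filename}:{fs.lineno})"
            for fs in traceback.extract_stack(frame)
        ]
    return stacks


stop = threading.Event()
worker = threading.Thread(target=busy_loop, args=(stop,), name="worker", daemon=True)
worker.start()
time.sleep(0.1)
for name, stack in dump_thread_stacks().items():
    print(name, "->", stack[-1])  # innermost frame of each thread
stop.set()
worker.join()
```

Most runtimes have an equivalent thread dump facility; the idea of sampling it a few times and comparing the results stays the same.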
Resource depletion
This is probably one of the easier ones to detect and to find a root cause for. If you notice issues in your throughput or response times without any obvious CPU and/or memory spikes, you probably have an I/O issue somewhere: slow REST calls, slow database queries. Add some proper timers on all incoming and outgoing requests and follow the call stack. You’ll want those metrics anyway for monitoring purposes.
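A timer around each incoming and outgoing call does not need to be fancy. The sketch below keeps durations in an in-memory dictionary, which is an assumption made to keep the example self-contained; in practice you would hand them to whatever metrics library you already use. `timed`, `slowest` and the metric name are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

durations = defaultdict(list)  # metric name -> list of durations in seconds


@contextmanager
def timed(name):
    """Record how long the wrapped block took, even when it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[name].append(time.perf_counter() - start)


def slowest(limit=3):
    """Return (name, worst duration) pairs, slowest first."""
    worst = {name: max(values) for name, values in durations.items()}
    return sorted(worst.items(), key=lambda item: item[1], reverse=True)[:limit]


with timed("db.load_orders"):
    time.sleep(0.05)  # stand-in for a slow database query
print(slowest())
```

Once every boundary has a timer like this, following the call stack mostly means following the largest numbers.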
Narrowing the search
Now that we have an idea of what to look for, let’s see how we can narrow down what to investigate. There are a few tools available to us that can help us out here.
- Change/recent release
- Configuration changes
- Environment changes (OS, infrastructure, …)
- Alerts and spikes in metrics
- Logs
The first point is the lowest-hanging fruit: if you have an obvious changeset that caused the issue, you can use it to narrow down the changes to look at. Beware that some changes can have an effect on other parts of your application. For example, removing a database field that was no longer used could cause an index to be dropped that is still in use.
Points 2 and 3 are all about identifying whether the cause is inside or outside your app. Being reasonably sure that the cause is inside your application already reduces the areas to look at.
The last two points are all about finding hotspots in your code: the areas that you need to prioritize in your investigation. You will need the right metrics in place; at the end of the article I’ll give you some tips on where I would put metrics to make your life easier.
As for logs, they can indicate a misbehaving dependency. For example, you might have a library dealing with attachments that uses native calls to inspect the bytes. It might log that it is trying to allocate larger memory blocks than it has encountered before. After that it is a matter of tracking down where you use that dependency and investigating that piece of code to try to optimize it. Or, if it is a bug, you can check whether there is a fix in a newer version.
Ironically, to narrow down the search in your code, you have to expand the search in your logs and metrics.
No single root cause
There might not be a single root cause, or the root cause might be obscured by other issues. If you have trouble tracking a single thing down, simply start fixing the smaller things you found. They might alleviate the issue and may even make the root cause more pronounced.
Worst case, you end up with a cleaner codebase. Best case, one of those fixes was the root cause and you can reason back from it to explain the issue you observed.
Also, never exclude parts of the code without investigation. You should prioritize and assess the risk/chance, or you have no place to start. But these things can be hiding in the smallest parts, which you might just overlook.
How to prepare
To round this off here are two tips to put yourself in a better position when issues do occur.
The first is to have your error handling in place. If you really don’t want to propagate an error, at least don’t swallow it without some kind of logging.
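As a small illustration of "not propagated, but still logged", here is a sketch using Python's standard `logging` module. The logger name `orders` and the `parse_quantity` function are hypothetical.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical logger name


def parse_quantity(raw, default=0):
    """Parse a quantity, falling back to a default instead of raising."""
    try:
        return int(raw)
    except ValueError:
        # Deliberately not propagated, but logged with the stack trace so it
        # still shows up when you go looking for anomalies.
        logger.exception("Could not parse quantity %r, falling back to %d", raw, default)
        return default
```

A call like `parse_quantity("abc")` still returns the default, but it now leaves an error entry with the traceback in the log instead of disappearing silently.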
The second is to add metrics at every point in your code where requests/messages come in and go out. This will help you segment your code by showing whether calls are still happening and whether they’re still at the old throughput levels.
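To make the "old throughput levels" comparison concrete, here is a rough sketch of counters at the entry and exit points, compared against a baseline. Everything in it is an assumption for illustration: the `ThroughputCounters` class, the point names, the baseline numbers and the 50% tolerance.

```python
from collections import Counter


class ThroughputCounters:
    """Counts calls per entry/exit point within one measurement window."""

    def __init__(self):
        self.counts = Counter()

    def mark(self, point):
        self.counts[point] += 1

    def dropped_points(self, baseline, tolerance=0.5):
        """Return points whose count fell below tolerance * baseline count."""
        return sorted(
            point
            for point, expected in baseline.items()
            if self.counts[point] < expected * tolerance
        )


# Hypothetical baseline: calls per window during a normal period.
baseline = {"http.in": 100, "queue.out": 100}

window = ThroughputCounters()
for _ in range(100):
    window.mark("http.in")    # requests still arrive at the usual rate
for _ in range(10):
    window.mark("queue.out")  # but far fewer messages leave
print(window.dropped_points(baseline))
```

If requests still come in at the normal rate but fewer go out, the code between those two points is the segment to investigate first.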
Still not sure what to add? Here is some extra reading that may help you set up metrics before you need them.
The four golden signals by Google SRE.
Good luck with your next performance issue; may it never come up.