Blog   |   Automation   |   August 2, 2016

Streamlining Diagnostic Processes

By Mitchell Kowalchick, DevOps Engineer at CloudCheckr

Actionable Diagnostic Steps You Can Take Right Now

In this post I want to walk through some easy-to-implement techniques that can really help your diagnostic approach.  The concept may seem simple, but many organizations never put a convenient prioritization system in place, and that makes things difficult for the administrative team.  In subsequent posts I will try to detail the diagnostic approach at a higher level, but for now let’s start with the basics.
 

Naming

Naming strategies, while often overlooked, act as an important prioritization tool.  When things become chaotic it is important to stabilize your key servers first.  In my experience, tiered or ranked designations in server names are the most helpful.  Not only do these conventions let you focus on your vital resources, they also let you quickly sort through the infrastructure to find where the problems are occurring.  Naming belongs to the identification portion of the problem-solving cycle: by separating your infrastructure into individual fragments, you can troubleshoot more effectively.
Consider Maslow’s hierarchy of needs, a psychological paradigm in which one must satisfy the lower level before moving up to the next.  For example, most people need to be able to breathe before thinking about getting a Frosty at Wendy’s.  The same idea can be applied to structuring your infrastructure and naming convention so that vital resources are prioritized and core functionality is easy to find.  For example, priority servers can be designated tier 1, while more aesthetic or less profitable servers might be dubbed tier 4.  This assists in both preventative and emergency maintenance: the daily routine becomes spot-checking the tiers from most important to least, which improves resource management.
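To make the tier idea concrete, here is a minimal Python sketch of a spot-check ordering.  It assumes a hypothetical convention in which every server name starts with "t<tier>-" (for example "t1-db-01"); the server names, the pattern, and the fallback tier of 99 are all made up for illustration, so adapt them to whatever convention you actually use.

```python
import re

# Hypothetical convention: names start with "t<tier>-", e.g. "t1-db-01".
TIER_PATTERN = re.compile(r"^t(\d+)-")

# Names that do not follow the convention are checked last.
UNKNOWN_TIER = 99


def tier_of(name):
    """Return the tier number encoded in a server name."""
    match = TIER_PATTERN.match(name)
    return int(match.group(1)) if match else UNKNOWN_TIER


def check_order(names):
    """Sort servers by tier (most important first), then by name."""
    return sorted(names, key=lambda n: (tier_of(n), n))


# Made-up server names for illustration only.
servers = ["t4-blog-01", "t1-db-01", "t2-api-02", "t1-auth-01", "legacy-box"]

for name in check_order(servers):
    print(name)
```

With a list like this, the daily routine is just walking the output from top to bottom: tier 1 servers come first, and anything that does not follow the convention ends up at the bottom where it is easy to spot and rename.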
 

Timing

Identifying trends can greatly reduce the amount of investigation required to find the problem, and it is typically much less taxing to perform.  Let’s take a look at some of CloudCheckr’s heat map reports for RDS and one of our servers.
The server in the heat map is clearly unhealthy, and for many it can be difficult to find a starting point.  If we look a bit closer, though, the heat map shows some very clear trends between 5 and 7 am and between 5 and 8 pm.  Now that we have some smaller pieces to work with, we can dig into the logs recorded at those times to get a clearer picture of the problem.  Once logged onto the server, I was able to identify the processes, queries, and jobs that were the most taxing.  From there we can dig in further and determine whether the server simply has too many queries being forced upon it or whether an indexing or code-related issue is causing a bottleneck on the database.
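If you do not have a heat map handy, the same trend can be pulled out of raw metrics.  The Python sketch below groups readings by hour of day and flags the hours whose average load crosses a threshold.  The sample readings, the 75 percent threshold, and the function name are assumptions for illustration; they are not taken from the CloudCheckr report.

```python
from collections import defaultdict
from statistics import mean


def busy_hours(samples, threshold):
    """Return the hours of day (0-23) whose average reading exceeds threshold.

    samples is an iterable of (hour_of_day, reading) pairs, for example
    CPU utilization percentages collected over several days.
    """
    by_hour = defaultdict(list)
    for hour, reading in samples:
        by_hour[hour].append(reading)
    return sorted(h for h, readings in by_hour.items() if mean(readings) > threshold)


# Made-up data: two days of hourly readings with load spikes in the
# 5-7 am and 5-8 pm windows (hours 5, 6, 17, 18, 19).
SPIKE_HOURS = {5, 6, 17, 18, 19}
samples = [
    (hour, 90 if hour in SPIKE_HOURS else 30)
    for _day in range(2)
    for hour in range(24)
]

print(busy_hours(samples, threshold=75))
```

The flagged hours tell you which slices of the logs to read first, which is exactly what the heat map does visually.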
 

Gauge the effectiveness of your attempted fix

After attempting a fix it is very important to watch the trend fall back to a healthy state.  In many cases the initial fix will bring smaller issues to the surface, but they can be diagnosed in a similar fashion.  Developers often need to prioritize the more beneficial changes before moving on to larger infrastructure changes.  So, when possible, acknowledge the effectiveness of the change and, if the server is healthy enough, move on to another, more costly issue rather than losing time in the nuances of one particular server.  Making small, timely, and effective changes will typically deliver much greater performance and stability.
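One simple way to gauge a fix is to compare readings from the same problem window before and after the change.  The Python sketch below does just that; the readings, the "healthy below 70" cutoff, and the function name are hypothetical, so pick numbers that match what healthy looks like for your own server.

```python
from statistics import mean


def fix_effect(before, after, healthy_below):
    """Compare readings from the same time window before and after a fix."""
    avg_before = mean(before)
    avg_after = mean(after)
    return {
        "before": avg_before,
        "after": avg_after,
        # Did the average load in the problem window go down at all?
        "improved": avg_after < avg_before,
        # Is it now low enough to move on to the next issue?
        "healthy": avg_after < healthy_below,
    }


# Made-up readings from one problem window, before and after a fix.
result = fix_effect(before=[92, 88, 95], after=[61, 58, 64], healthy_below=70)
print(result)
```

If the result comes back improved and healthy, that is the signal to move on to the next, more costly issue rather than polishing this server further.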
 
For Mitch’s thoughts on the DevOps position and where it is headed, see his previous blog post. You can also learn more about the Top 3 Areas You Need to Optimize for DevOps.