Most enterprise databases today run on shared storage volumes (SAN, NAS, etc.) that are accessed over the network or via Fibre Channel connections. Shared storage helps keep storage infrastructure and management costs relatively low, but it also creates cross-silo finger-pointing when there are performance issues. In this blog post we will explore a real-world example of how to skip the finger-pointing and get right down to figuring out how to fix the problem.

One Rotten Apple Can Ruin The Whole Bunch

This story dates back to June of 2012, but I just came across it, so it is new to me. One of our customers had an event that impacted the performance of multiple databases, all of which were connected to the same NetApp storage array. When there is a database performance issue, the DBAs will often point the finger at the storage team, and the storage team will tell the DBA team that everything looks good on their side. This finger-pointing between silos is a common occurrence among the various groups (network, storage, database, application support, etc.) within enterprise organizations.

In the chart below (screen grab taken from AppDynamics for Databases) you can see that there was a significant increase in I/O activity on dw_logvol. This issue impacted the performance of the entire NetApp storage array.

As it turns out, dw_logvol was used as a temporary storage location for web logs. A process would copy log files to this location, decompress them, and insert them into an Oracle data warehouse for long-term storage. This process normally would not impact the performance of anything else connected to the same storage array, but in this case there happened to be corrupted log files that could not be properly decompressed. The result was repeated attempts to retransmit and decompress the same files.
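The failure mode here is worth sketching: a pipeline that blindly retries any archive that fails to decompress will hammer the storage volume forever on a corrupt file. A minimal Python illustration (purely hypothetical, not the customer's actual pipeline) is to validate each archive up front and quarantine the ones that can never succeed:

```python
import gzip
import io

def is_valid_gzip(data: bytes) -> bool:
    """Return True only if the entire payload decompresses cleanly."""
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as f:
            while f.read(1 << 16):  # stream through the whole archive
                pass
        return True
    except (OSError, EOFError):
        return False

def triage_archives(archives: dict) -> tuple:
    """Split archives into processable and corrupt, instead of
    re-fetching and re-decompressing the corrupt ones forever."""
    good, corrupt = [], []
    for name, data in archives.items():
        (good if is_valid_gzip(data) else corrupt).append(name)
    return good, corrupt
```

With a check like this, the corrupt files would have been set aside after one pass rather than driving a sustained I/O spike across the whole array.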

Context and Collaboration to the Rescue

Storage teams normally don’t have access to application context, and application teams normally don’t have access to storage metrics. In this case, though, both teams were able to collaborate and quickly identify the problem because they had a monitoring solution that was available to everyone. The fix was easy: remove the corrupted files and replace them with uncorrupted versions. You can see activity return to normal in the chart below.

Modern application architectures require collaboration across all silos in order to identify and fix issues in a timely manner. One of the key enablers of cross-silo collaboration is intelligent monitoring at each layer of the application and of the infrastructure components that provide the underlying resources. AppDynamics provides end-to-end visibility in an analytics-based solution that helps you identify, isolate, and remediate issues. Try AppDynamics for Databases and Storage for free today and bring a new level of collaboration to your organization.

A few weeks ago I was presenting at CMG Performance and Capacity 2013, and during my presentation we (myself and a few audience members) got slightly side-tracked. Our conversation somehow turned into a question of why it is so hard to get performance data from the network and storage teams. Audience members were asking me why, when they requested this type of data, they were typically stonewalled by these organizations.

I didn’t have a good answer for this question and in fact I have run into the same problem. Back when I was working in the Financial Services sector I was part of a team that was building a master dashboard that collected data from a bunch of underlying tools and displayed it in a drill-down dashboard format. It was, and still is, a great example of how to provide value to your business by bringing together the most relevant bits of data from your tooling ecosystem.

This master dashboard was focused on applications and included as many of the components involved in delivering each application as possible. Web servers, application servers, middleware, databases, OS metrics, business metrics, etc. were all included, and the key performance indicators (KPIs) for each component were available within the dashboard. The entire premise of this master dashboard relied upon getting access to the tools that collected the data, either via API or through database queries.
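The aggregation pattern behind a dashboard like that is straightforward: map each component to a fetcher that pulls KPIs from the underlying tool, and tolerate any single tool being unavailable. A rough sketch (the component names and stubbed fetchers below are illustrative, not the actual dashboard's code) might look like:

```python
def build_dashboard(app_name, kpi_sources):
    """Aggregate per-component KPIs into one drill-down structure.

    kpi_sources maps a component name to a callable that returns a
    {kpi_name: value} dict, typically backed by a tool's API or a
    database query in a real deployment.
    """
    dashboard = {"application": app_name, "components": {}}
    for component, fetch in kpi_sources.items():
        try:
            dashboard["components"][component] = fetch()
        except Exception as exc:  # one tool being down shouldn't sink the view
            dashboard["components"][component] = {"error": str(exc)}
    return dashboard

# Stubbed fetchers standing in for real tool integrations:
sources = {
    "web": lambda: {"requests_per_sec": 120, "error_rate": 0.01},
    "database": lambda: {"avg_query_ms": 42},
}
board = build_dashboard("trading-portal", sources)
```

The key design point is that the dashboard is only as complete as the data sources it can reach, which is exactly why being refused access to network and storage metrics hurt so much.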

The only problems that our group faced in getting access to the data we needed were with the network and storage teams. Why was that? Was it because these teams did not have the data we needed? Was it because they did not want anyone to see when they were experiencing issues? Or was it for some other reason?

I know the network team had the data we required because they eventually agreed to provide the KPIs we had asked for. This was great, but the process was very painful and took way longer than it should have. Still, we eventually got access to the data. The big problem is that we never got access to the storage data. To this day I still don’t know why we were blocked at every turn but I’m hoping that some of the readers of this blog can share their insight.

Back in the day, when my team was chasing the storage team for access to their monitoring data, there weren’t really any tools that we could find for performance monitoring of storage arrays besides the tools that came with the arrays. These days I would have been able to get the data I needed for NetApp storage by using AppDynamics for Databases (which includes NetApp performance monitoring capabilities). You can read more about it by clicking here.

Have you been stonewalled by the network or storage or some other team? Did you ever get what you were after? Based upon my experiences talking with a lot of different folks at different organizations this seems to be a significant problem today. Are you on a network or storage team? Does your team freely share the data they have? Please share your experience, insight, or questions in the comments below. Just to clarify, I hold no ill will against any of you network or storage professionals out there. I’d just like to hear some opinions and gain some perspective.

The other day I had the opportunity to speak with a good friend of mine who also happens to be a DBA at a global Financial Services company. We were discussing database performance and I was surprised when he told me that the most common cause of database performance issues (from his experience) was a direct result of contention on shared storage arrays.

After recovering from my initial surprise, I had an opportunity to really think things through and realized that this makes a lot of sense. Storage requirements in most companies are growing at an ever-increasing pace (big data, anyone?). Storage teams have to rack, stack, allocate, and configure new storage quickly to meet demand, and they don’t have time to do a detailed analysis of the anticipated workload of every application that will connect to and use the storage. And therein lies the problem.

Workloads can be really unpredictable and can change considerably over time within a given application. Databases that once played nicely together on the same spindles can become the worst of enemies and sink the performance of multiple applications at the same time. So what can you do about it? How can you know for sure if your storage array is the cause of your application/database performance issues? Well, if you use NetApp storage then you’re in luck!

AppDynamics for Databases remotely connects (i.e. no agent required) to your NetApp controllers and collects the performance and configuration information that you need to identify the root cause of performance issues. Before we take a look at the features, let’s look at how it gets set up.

The Config

Step 1: Prepare the remote user ID and privileges on the NetApp controller. The following commands are used for the configuration.
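The exact commands depend on your Data ONTAP version, but as a rough sketch, on a 7-Mode controller a dedicated API user might be created along these lines (the role, group, and user names here are illustrative; check your ONTAP documentation for the precise capabilities your version requires):

```
useradmin role add appd_api_role -a login-http-admin,api-*
useradmin group add appd_api_group -r appd_api_role
useradmin user add appd_monitor -g appd_api_group
options httpd.admin.enable on
```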

Step 2: Configure AppDynamics to monitor the NetApp controller. Notice that we configure AppDynamics with the username and password created in step 1.

Step 3: Enjoy your awesome new monitoring (yep, it’s that easy).

The Result

After an incredibly difficult 2 minutes of configuration work we are ready for the payoff. In the AppDynamics for Databases main menu you will see a section for all of your NetApp agents.

Let’s do a “drill-up” from the NetApp controller to our impacted database. Clicking into our monitored instance we see the following activity screen.

By clicking on the purple latency line inside the red box in the image above, we can drill into the volume that has the highest response time. Notice in the screen grab below that there is a link at the bottom of the page where we can drill up into the database attached to this storage volume. This relationship is built automatically by AppDynamics for Databases.

Clicking on the “Launch In Context” link, we are immediately transferred to the Oracle instance activity page shown below.

In just the same way that we can drill up from storage to database, we can also drill down from database to storage. Notice the screen grab below from an Oracle instance activity screen. Clicking on the “View NetApp Volume Activity” link will launch the NetApp activity screen shown earlier for the volumes associated with this Oracle instance. It’s that easy to switch between the views you need to solve your application’s performance issues.

Imagine being able to detect an end user problem, drill down through the code execution, identify the slow SQL query, and isolate the storage volume that is causing the poor performance. That’s exactly what you can do with AppDynamics.

Storage monitoring in AppDynamics for Databases is another powerful feature that enables application support, database support, and storage support to get on the same page and restore service as quickly as possible. If you have databases connected to NetApp storage you need to take a free trial of AppDynamics for Databases today.