March 16, 2017

DevOps The Hard Way

I'm going to be doing more management and 'glue' business over the next year or so. Part of this business is selling and personifying the value of DevOps. Like Cloud, this is something that is insufficiently understood at a deeper and more nuanced level. So, as is customary, I'm going to tell a story.

The story, like most of mine, comes from an experience that burned. Something that left scars and had me wondering how people could get into this mess. And so, like the man says, share your scars. The situation was that I was on the very cutting edge of what I could do with Essbase + Essbase Studio. The requirement for drill-through was fairly obvious. We all knew the limits of how much data we could squirrel into a multidimensional cube. So I used Essbase Studio to map back to the Oracle DB and bring back some records. Now it turns out that my customers wanted something on the order of 10,000 records in this detail. Well, that doesn't seem like much. You could grab 10,000 records across a dozen columns, cut and paste them from one Excel spreadsheet to another, right? That should only take a few seconds. Not the drill-through. That data had to come over the network. Well, you could copy a spreadsheet with a dozen MB of data from a network drive to your desktop, right? That should only take a minute. Not the drill-through. We had to fulfill a query request from a database.

My queries were taking 7 minutes and 30 seconds and then dying.

I had to find out why. Thus began my painful birth into DevOps. The first thing I had to learn was the difference between a view and a materialized view. Well, that wasn't so difficult to learn. But I had always assumed that my DBA was materializing data for me. Well, he didn't have enough disk space to do that for ad-hoc queries. So that meant I had to learn the procedure for requesting new disk space from the DBAs. How much did I need? I don't know. A terabyte? Impossible! Impossible? I can go to Best Buy and get a terabyte. Yeah, but one live terabyte means four other terabytes according to our backup and DR, and we're at the limit of the current server, which means we'd have to get a SAN device and... well, how about an NFS drive? Nope. Can't have an NFS drive; that would slow down everything. I need local storage. Well, we'll get back to you. But how can you be sure that the database is the bottleneck? I don't know.
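To make that first lesson concrete, here's a minimal sketch of the view vs. materialized view distinction, using Python and SQLite purely for illustration (Oracle has CREATE MATERIALIZED VIEW natively; SQLite does not, so the materialization is faked by hand, and the table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A detail table standing in for the Oracle fact table (names invented).
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("west", i * 1.0) for i in range(1000)] +
                [("east", i * 2.0) for i in range(1000)])

# A plain view: the aggregation re-runs against the base table
# every single time you query it.
cur.execute("CREATE VIEW totals_v AS "
            "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

# A 'materialized' view, done by hand: compute the aggregation once
# and store the result as a real table that queries can hit directly.
cur.execute("CREATE TABLE totals_m AS SELECT * FROM totals_v")

live = dict(cur.execute("SELECT region, total FROM totals_v").fetchall())
stored = dict(cur.execute("SELECT region, total FROM totals_m").fetchall())
assert live == stored  # same answer; only the per-query cost differs
```

The answers are identical; the difference is that the stored table costs disk space up front, which is exactly where the conversation with the DBAs began.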

I had to find out where. What is timing out? Was it the Excel add-in? No. Was it the Java middleware? Maybe. Who knows how to read the profile of the Java middleware? Well, there's no documentation for that; you'll have to call the engineers at Oracle. OK. Open up a service request and get an appointment. Who has access to the middle tier? Get access to the middle tier so you can log on. Oh, by the way, the one support engineer is in Mumbai. That means you stay late, past 7pm Pacific time, to get your answers, when he's available. OK, change the profile, add in this line for the timeout. That didn't work? Oh, you have to get the latest patch. Will it work with the version of Essbase Studio we're running here? Oh snap, we're going to have to burn a new version for you, but you're going to have to upgrade your Java app server. OK, now the explicit timeout is 15 minutes.

Still times out.

I had to find out how. What is the mechanism that creates the timeout? Get this new tool called Fiddler; it will help you debug the HTTP stream. Debugging HTTP streams? Well, maybe it's the size of the download that's stopping things. OK, did that. It's not the size. Well, the corporate standard timeout is 10 minutes. What corporate standard? The corporate standard on the firewalls between the users and the data center. Well, can we get an exception? Maybe.
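The diagnostic question in that paragraph — is the transfer dying because of how much data there is, or because of how long it takes? — can be sketched with a toy simulation. No real network is involved here; the numbers and the two-argument failure model are invented for illustration:

```python
def fetch(chunks, secs_per_chunk, timeout):
    """Simulated download through a hop with an idle/response timeout.

    Returns (bytes_received, timed_out). The tell: if transfers always
    die at the same wall-clock time regardless of payload size, the
    limit is a timeout; if they always die at the same byte count,
    it's a size cap.
    """
    received = 0
    elapsed = 0.0
    for chunk in chunks:
        elapsed += secs_per_chunk
        if elapsed > timeout:
            return received, True   # the hop cut us off mid-transfer
        received += len(chunk)
    return received, False          # transfer completed

# A small payload finishes; a large one dies after the same 450 seconds,
# no matter how many bytes remain -- so it's a timeout, not a size cap.
small = [b"x" * 100] * 10      # 1 KB in 10 chunks
large = [b"x" * 100] * 1000    # 100 KB in 1000 chunks
print(fetch(small, 1.0, 450))  # completes
print(fetch(large, 1.0, 450))  # cut off partway through
```

Fiddler was doing the real-world version of this: watching the stream to see whether the connection died by byte count or by clock.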

So it basically took six weeks for me to deal with the various network engineers, database admins, support staff and their management to prod them all to buy what I was trying to sell, which was the viability of this entire project. My only leverage was that I was consistently riding herd on the problem and I was a very expensive third-party contractor. So the project was late, and the overhead of having to justify business as usual in the various departments was the only thing that motivated people to go to extraordinary lengths to solve the problem. Everybody wanted the problem to be somebody else's problem. And until we found out exactly what the problem was, everyone was pointing fingers until the last possible minute. It turned out to be a default in one of the load balancers that everyone assumed was set to 10 minutes, but which communicated 7.5 minutes as an override to the other. Those machines required firmware upgrades as well.
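The lesson about layered timeouts can be stated abstractly: every hop between the user and the database enforces its own limit, and the effective timeout is the minimum of the whole chain, whether anyone knows all the values or not. Here's a toy model of the situation in the story; the hop names and numbers are illustrative, not any vendor's configuration format:

```python
# Each hop between the Excel client and the database enforces its own
# timeout. Values are illustrative, in seconds.
timeout_chain = {
    "excel_addin": 3600,
    "corporate_firewall": 600,   # the '10 minute corporate standard'
    "load_balancer": 450,        # the forgotten 7.5-minute default
    "java_middleware": 900,      # the one we patched up to 15 minutes
    "oracle_db": 3600,
}

def effective_timeout(chain):
    """A request dies at the strictest hop, regardless of the others."""
    return min(chain.values())

def culprit(chain):
    """The hop whose limit actually kills the request."""
    return min(chain, key=chain.get)

print(effective_timeout(timeout_chain))  # 450 seconds, i.e. 7m30s
print(culprit(timeout_chain))
```

Raising the middleware timeout to 15 minutes changed nothing because the minimum of the chain was still the load balancer's 7.5 minutes — which is exactly where the queries were dying.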

I have been accustomed, throughout my entire career in BI, to being responsible for the entire data supply chain. That I could do. But middle-tier service configurations, firewall settings and DR disk availability were all above my pay grade. I was not paid to know, and I was too expensive to be paid to learn. In that way, I'm accustomed to being like the wily developer whose time is too valuable to waste learning these operational details. At the same time, I was equally demanding of all those dependencies. Give me more memory on the app server! Open up the damned ports I want! Get more disk, you lummox! Of course, let me not forget the memory constraints on the end-user machines.

All of this was a terrestrial implementation, and it had other setbacks too, but it was a fascinating six-month engagement. I of course learned a lot about these other systems with respect to how they affected my entire piece of the data warehousing applications. I sensed that I had the capacity to understand, but I'd never remember unless I had some responsibility and permission to make changes. That would be impossible without the cloud. But even when I had the cloud, it was about more than just having control of the associated systems; it was about really understanding how they worked. That's a story for another day. What was clear was that it was very difficult to manage all of the departmental areas, and to get the priority within those departments (at their various locations) to solve a showstopper problem in this one application. It was 2011 and we were testing the very limits of the IT capabilities of a global corporation. DevOps might be a cool thing to talk about with web startups, i.e., a DevOps engineer would be cool for your website, but I saw the fundamental management problem that had everything to do with the way multimillion-dollar Enterprise applications were built and maintained, and essentially why they were one-shot deals.
