
Coinciding with a break in the London rain, September 25th marked our first UK LogicMonitor User Group. (Earlier this year we started regional User Group events as a way for customers to interact with our team, hear about product updates, and share experiences with other users.)

CEO Kevin McGibben with the LM UK team.

At the London event, attended by teams from some of the leading online media, SaaS, and MSP companies in the London area, we began with a “lunch-and-learn” to review enhancements in our latest product release. We spent the balance of the afternoon in a live LogicMonitor portal, demonstrating the latest features, fielding questions, and providing a sneak preview of some future additions.

- News of extended Engineering support hours (9am-1am GMT) was well received.

- There was unanimous interest in a native iOS application.

- Most of the feedback was positive, and we elicited valid critiques that will help us continually improve.

- We agreed on ways to more proactively notify customers of product updates.

All in all it was a fantastic event. Good venue, lots of war stories, many laughs.

We’re looking at adding LM User Group events in Q4 in Los Angeles and Q1 of next year in NYC. If you are interested in interacting with our team and other LM users, don’t be shy – let us know where you want to see us next!

As the new hire here at LogicMonitor brought in to support the operations of the organization, I had two immediate tasks: learn how LogicMonitor’s SaaS-based monitoring keeps watch over our customers’ servers, and, at the same time, learn our own infrastructure.

I’ve been a SysA for longer than I care to admit, and when you start a new job in a complex environment, there can be a humbling period while you spin up before you can provide value to the company. There’s often a steep and sometimes painful learning curve to adopting an organization’s technologies and architecture philosophies and making them your own before you can claim to be an asset to the firm.

But this time was different. With LogicMonitor’s founder, Steve Francis, sitting to my right, and its Chief Architect to my left, I was encouraged to dive into our own LogicMonitor portal to see our infrastructure health. A portal, by the way, is an individualized web site where our customers go to see their assets. From your portal, you get a fantastic view of all your datacenter resources from servers, storage and switches to applications, power, and load balancers just to name a few. And YES, we use remote instances of LogicMonitor to watch our own infrastructure. In SysA speak, we call this ‘eating our own dog food’.

As soon as I was given a login, I figured I’d kill two birds with one stone: familiarize myself with our infrastructure and see how our software worked. Starting at the top, I looked at our Cisco switches to see what was hooked up to what. LogicMonitor has already done the leg-work of getting hooks into the APIs on datacenter hardware, so one has only to point a collector at a device by IP or hostname, tell it what it is (Linux or Windows host, Cisco or HP switch, etc.), provide some credentials, and ‘Voila!’ out come stats and pretty graphs. Before me on our portal was all the monitoring information one could wish for from a Cisco switch.

On the first switch I looked at, I noticed that its internal temperature sensor had been reading erratic temperatures. The temperatures were still within Cisco’s spec, and they hadn’t triggered an alert yet, but they certainly weren’t as steady as they had been for the months leading up to that time. For a sanity check, I looked at the same sensor in the switch right next to it. The temperature was just as erratic. Checking the same sensors in another pair of switches in a different datacenter showed steady temperature readings for months.

Using the nifty ‘smart-graph’ function of LogicMonitor, I was able to adjust the graph to look at just the data range I wanted. I even added the temperature sensor’s output to a new dashboard view. With my new-found data, I shared a graph with Jeff and Steve and asked, “Hey, guys, I’m seeing these erratic temperatures on our switches in Virginia. Is this normal?”
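What made those readings jump out was variability, not level: the temperatures were still in spec, but far noisier than their months-long baseline. A minimal sketch of that kind of “erratic sensor” check (the sample values and threshold are made up for illustration, not our actual data):

```python
from statistics import stdev

def is_erratic(readings, baseline, factor=3.0):
    """Flag a sensor whose recent variability far exceeds its baseline.

    readings: recent temperature samples (degrees C)
    baseline: historical samples known to be steady
    factor:   how many times noisier than baseline counts as erratic
    """
    return stdev(readings) > factor * stdev(baseline)

# Months of steady ~22 C readings vs. a recent erratic stretch (hypothetical)
steady = [22.0, 22.1, 21.9, 22.0, 22.1, 22.0]
recent = [22.0, 25.5, 20.1, 26.8, 21.0, 27.3]

print(is_erratic(recent, steady))  # True: recent swings dwarf the baseline
```

Note that both sets of readings can sit comfortably inside the vendor’s temperature spec; it is the change in variability that signals something is wrong with the cooling.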

Jeff took a three-second glance, scowled, and said, “No, that’s not right! Open a ticket with our datacenter provider and have them look at that!”

That task was a little harder. Convincing a datacenter operator that they have a problem with their HVAC when all their systems show normal takes a little persistence. Armed with my graphs, I worked my way up the food chain of our DC provider’s support staff. The technician I reached checked the input and output air temperature of our cabinet, and verified there were no foreign objects disturbing air flow. All good there. We double-checked on our end that we hadn’t made any changes that would increase load on our systems and cause the temperature fluctuation. No changes here. But on a hunch, he swapped in a floor tile that allowed more air through to our cabinet. And behold, the result:

Looking at our graph, you’ll notice the temperature was largely stable before Sept. 13. I was poking around in LogicMonitor for the first time on Sept. 18 (literally, the FIRST time) and created the ticket, which was resolved on Friday, Sept. 21. You can see the moment the temps drop and go stable again after the new ventilation tile was fitted. (In case you’re wondering: clicking the data sources at the bottom of the graph toggles their appearance on the graph. I ‘turned off’ the sw-core1&2.lax6 switches since they were in another data center.)

And I’ll leave you with this: Monitoring can be an onerous task for SysAs. We usually have to build it and support it ourselves, and then we’re the only ones who can understand it enough to actually use it. Monitoring frequently doesn’t get the time it deserves until it’s too late and there’s an outage. LogicMonitor makes infrastructure monitoring easy and effective in a short period of time. We’ve built it, we support it, and we’ve made it easy to understand so your SysA can work on their infrastructure.

You released new code with all sorts of new features and improvements. Yay!

Now, after the obvious questions like “Does it actually work in production?”, this is also the time to assess: did it impact my infrastructure performance (and thus my scalability, and thus my scaling costs) in any way?

This is yet another area where good monitoring and trending is essential.

As an example, we did a release last night on a small set of servers.

Did that help or hurt our scalability?

CPU load dropped for the same workload (we have other graphs showing which particular Java application this improvement was attributable to, but this shows the overall system CPU):

There was an improvement in a variety of MySQL performance metrics, such as the table open rate (table opens are fairly intensive).
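The before/after comparisons above boil down to a difference of averages across the release boundary. A quick sketch of that calculation (the sample numbers here are hypothetical, not our actual metrics):

```python
from statistics import mean

def release_impact(before, after):
    """Percent change in a metric's average across a release boundary.

    A negative result means the metric dropped after the release --
    an improvement for load-type metrics like CPU or table opens.
    """
    return 100.0 * (mean(after) - mean(before)) / mean(before)

# Hypothetical CPU-load samples from before and after last night's release
cpu_before = [62.0, 60.5, 61.2, 63.1]
cpu_after  = [48.3, 47.9, 49.0, 48.6]

print(f"CPU load change: {release_impact(cpu_before, cpu_after):+.1f}%")
```

In practice the monitoring system does this visually, of course; the point is that the same workload consuming less CPU after a release is a direct, measurable scalability win.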

But…not everything was improved:

While the overall disk performance and utilization is the same, the workload is much more spiky. (For those of you wondering how we get up to 2000 write operations per second – SSDs rock.)

And of course, the peak workloads are what constrain the server usage – with this change in workload, a server that was running at a steady 60% utilization may find itself spiking to 100% – leading to queuing in other parts of the system, and general Bad Things.
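The spiky-workload point can be made concrete: two workloads with identical averages can leave very different peak headroom, and it is the peak that causes queuing when it hits capacity. A small illustration with made-up utilization samples:

```python
from statistics import mean

def peak_headroom(samples, capacity=100):
    """How far the *peak* sample sits below capacity (in utilization points).

    Two workloads with the same average can have very different peaks;
    the peak is what determines when a server starts queuing.
    """
    return capacity - max(samples)

# Same ~60% average utilization, very different shapes (hypothetical)
steady = [60, 61, 59, 60, 60, 60]
spiky  = [20, 100, 30, 95, 25, 90]

print(mean(steady), mean(spiky))  # both workloads average 60%
print(peak_headroom(steady))      # plenty of headroom at peak
print(peak_headroom(spiky))       # zero headroom: already saturating
```

This is why averaged graphs alone can be misleading: the steady and spiky servers look identical on a coarse average, but only one of them is at risk of hitting 100%.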

As it is, we saw this change in the workload and we can clearly attribute it to the code release. So now we can fix it before it is applied to more heavily loaded servers, where it might have an operational impact.

This keeps our Ops team happy, our customers happy, and, since it means we don’t have to spend more money on hardware for the same level of scale, our business people happy.

“A cloud is made of billows upon billows upon billows that look like clouds. As you come closer to a cloud you don’t get something smooth, but irregularities at a smaller scale.”

~Benoit Mandelbrot

The cloud, as seen by the end user, is a wondrous tool full of seamless functionality and performance limited only by their internet connection. The truth is, the “water particles” which make up these clouds are machines. And machines fail. Through the use of cloud providers like Amazon’s EC2, Rackspace and others, we get to add a layer of abstraction between the machines and ourselves and share in the wonder of end users.

There’s a catch: Adding layers of abstraction creates complexity, and complexity increases the potential for problems. In addition, while you no longer need to worry about the state of the physical machine, if your cloud instance runs out of CPU, memory, or disk space, your application will take a hit. So, whether shipping hand-built servers to data centers across the globe or spinning up new machines from a cloud provider, the need for management and monitoring is paramount. But fear not! Now, thanks to LogicMonitor’s hosted, full-stack datacenter monitoring being integrated with RightScale’s cloud computing management, you can have your RightScale-managed hosts automatically added into your LogicMonitor portal! The next time a huge surge of traffic forces you to spin up a few hosts, monitoring them is taken care of.

Between the cloud management services provided by RightScale and the full stack, SaaS-based data center monitoring provided by LogicMonitor, you can know exactly what’s happening with your devices, both physical and… nebulous.