50Micron.com

Ranting and raving about storage and technology.





A friend passed away recently… On going through his computer files, we found years worth of photos with a .ccc suffix… Ransomware… With two teenagers in the house, my biggest fear is some network replicating bug that takes down my entire network.

Apparently it hit him a while ago, and he didn’t tell me. (I was his IT guy, but he hated the idea that he might have made a mistake). Years of pictures, potentially important, lost, probably forever… (As the files in his user directory had been restored long after the computer in question had been wiped, there was no indication of which virus caused the problem.)

So: what to do about backups?

The only totally secure system is one disconnected from the network and powered off. The minute you connect *ANY* computer to the internet, it becomes vulnerable. Sure, there are steps you can take to prevent data loss: antivirus, a good firewall, etc. But eventually you're probably going to run into a site with embedded malware on it, and it's all over.

Personally, I like the idea of off-host, disconnected backups.

Every morning I wake up, stumble downstairs, get a cup of coffee and a bowl of cereal, and sit down for my morning staff meeting. (Where I find out the messes I’m going to have to clean up from the night before)

While I'm sitting there, I take the 4TB drive out of the removable bay and replace it with the OTHER 4TB drive I've got sitting on a shelf. One marked Odd, one marked Even. At midnight every day, Acronis True Image kicks off a disk-image backup of my boot drive and important data drive. (My game drives and multimedia drives are ignored, because Steam, Origin, and iTunes pretty much cover those; it only takes bandwidth to recover them from the cloud, and they can't be modified by my computer.)

I like this approach because (A) I'm never out more than 24 hours' worth of work, and (B) I know there is no virus on the planet that can infect a hard drive sitting on a shelf in a plastic case. (Though I bet some idiot somewhere is trying to figure that one out.)

So my RPO (Recovery Point Objective) is usually "within 24 hours." My RTO (Recovery Time Objective) is about 5-6 hours for a full restore. (I back up the data drive even though it's synced to Office 365 because, as we all know, corruption mirrors just as fast as good data.)

A few hundred dollars in hard drives plus a $25 removable drive bay for my PC, and I'm protected.
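For what it's worth, the odd/even rotation check could be scripted in a few lines. This is a minimal sketch, not what Acronis actually does; the mount point and LABEL file are made-up conventions for illustration.

```python
from datetime import date
from pathlib import Path

# Hypothetical mount point for whichever 4TB disk is in the removable bay.
BACKUP_MOUNT = Path("/mnt/backup")

def expected_label(today: date) -> str:
    """Even-numbered calendar days use the 'Even' disk; odd days use 'Odd'."""
    return "Even" if today.day % 2 == 0 else "Odd"

def right_disk_inserted(today: date) -> bool:
    """Each disk carries a one-line LABEL file; refuse to back up to the wrong one."""
    label_file = BACKUP_MOUNT / "LABEL"
    return label_file.is_file() and label_file.read_text().strip() == expected_label(today)
```

A pre-backup hook like this would catch the morning you forget to swap disks, instead of silently overwriting yesterday's image.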

So my question is this: What do you use for your home/home-office backups? Acronis is getting a bit long in the tooth, and I’m considering alternatives.

My wife used to tell me that I'm the only person she knew who could relax, after working all day in front of a computer, by sitting in front of a computer.

She doesn't know IT geeks very well, does she? 😉

She's right. I've recently rediscovered computer gaming, and PC building in general. Just for grins, I decided I was going to build myself a no-holds-barred monster PC for both gaming and work-related stuff.

So, a couple of interesting bits. I started out with a single GTX 980 and an all-in-one watercooler for the CPU only. Then the upgrade bug hit and I took it the rest of the way: a second GPU; a custom watercooling loop (including, for some reason, RAM coolers, which are pretty but don't do a lot); a third GPU; then backing that out because 3-way SLI isn't as stable as I would have liked it to be; and most importantly, modifying the case to add a window, which for some reason wasn't available in the case I purchased.

The Nanoxia Deep Silence 6 is an amazing case: all 1mm steel, weighs a ton, but quiet as hell. I got sick of my office sounding like a server room, so I opted for more passive cooling options. (The watercooling loop is quiet, just a couple of fans that all run at slow speed unless I'm gaming.)

The end result is a computer that runs Rise of the Tomb Raider on Ultra graphics across a 5760×1080 "Surround" display without breaking 55 degrees.

Next up: replacing the 3x 23″ monitors with 3x 27″ 4K monitors. 🙂 (Might need that third GPU for that. Good thing I kept it.) 🙂

Surprised to find this blog still here. It’s been…oh…a long time since I’ve ventured into the blogging world. Work has kept me busy…going into year 4 of a six month contract and making all sorts of discoveries of late.

Discovery #1 – Brocade is still a third-rate switch company. The hardware is fairly bulletproof, when it comes to reliability… But they’re still married to the idea of “local-switching” as an alternative to building a backplane that’s worth a damn. Sorry, I’ll take the Cisco MDS 9700 series any day of the week and twice on Sunday.

Discovery #2 – Well not a discovery really. EMC Symmetrix (Symmetrix/VMAX) is still the flagship storage array. If you put anything else in, you’re going cheap. Not that there’s anything wrong with that, but it’s time to admit that that’s what you’re doing.

I say that having now worked with HP 3PAR, which I'd put as equivalent to the CLARiiON/VNX line in stature and performance, and the HP XP7 (Hitachi G1000), which is higher-end but a bloody nightmare to manage. (I don't know if it's Hitachi, or just HP's version of Hitachi, that makes it a nightmare; I'll have to wait until I get my hands on an actual Hitachi to see.)

Let me be clear: this is a personal preference. Both arrays, 3PAR to some extent and the XP7 to a greater extent, seem to be trying to steer people away from using the CLI to manage their storage. GUIs are fine, but they don't offer the level of control you need to micromanage the hell out of your storage (as I like to). GUIs also make scripting changes more difficult, and more prone to error.

I haven't had a chance to really beat the daylights out of the XP7 yet, but I will in the months to come, and I'll report further as I go.

When I have an application that goes down (and face it, it does happen), I want the person responsible for getting it back up and running to be within choking distance. And if he's within choking distance, the servers need to be as well, because otherwise he's powerless to actually fix the problem, and I'm putting my business in the hands of someone paid minimum wage (or only slightly better, night-time computer-operator wages) and his ability to go out and physically push a button (and let's hope it's the right one).

If you don’t hold your data, you don’t really own it. If you don’t hold your data it can go away at any point.

Several years ago I was renting space in a datacenter up in Springfield, partly for a little web-hosting business I was running, but also so I could run some equipment for testing and training. (The hosting almost paid for the space, so it wasn't out of line.)

Someone on the datacenter network had a PXE server running to install software. On the public network.

Well, the hosting company, which was incompetent to its core, didn't put their customers in separate VLANs, as would normally be done in shared environments.

They also did "cloud application" hosting on crappy single-CPU, single-PSU Supermicro servers that came with PXE boot enabled.

They lost a half-dozen servers before they realized what was going on. Lost, as in: the servers PXE-booted, wiped their drives, and started installing a custom application that belonged on another customer's systems. (Thankfully I had my environment firewalled off from the datacenter network, so I was pretty safe.)

That was customer data that was just GONE. No backups, just missing servers. Servers that they were paid to keep safe and secure.

The hilarious part is that the chance of this happening in real life is non-zero. Not that it's likely to happen, but it's impossible, statistically speaking, to completely rule out.

Now, there are "big" cloud providers like AWS or… well… AWS. The chances of your datacenter getting lost there are lower; they're not going to disappear, and they're a pretty together company, so the odds are in your favor.

But what if it were to happen?

Say I'm a small business (I am, actually), and because I'm cheap, I want to outsource all of my datacenter operations to "Bob's Clouds and Stuff": email, database, custom widget application, all of it.

The migration is easy: virtualize my systems and upload them, right? (The smarter way is to create new ones and migrate to them, but that's a different story.)

But what if Bob decides that he's done, that he's going to shut everything down and run to Aruba because his ex-wife is after him for 10 years of back child support? Or he comes down with a rash no one can identify and dies?

Ok, a little far-fetched, but you get the drift. What's a small business's recourse if its cloud provider just folds? Do you have any? Can you pay the lawyers to fight out who owns what while you're not making any money, because your entire operation has been "turned off"?

It's a horrifically overstated problem, but it brings out the potential downside to cloud computing: you don't actually have control. You are putting your data, your livelihood, your company's very being, in the hands of someone else who may or may not care.

I’m a control freak. Anyone who knows me or has tried unsuccessfully to have me committed in the past 20 years knows that.

I want control of my data. I want it in my hot little hands. I want to have tapes. I want to know where they are and I want to have instant access to them at 2am if I wake up and find I’ve had a nightmare about all of my data being gone.

Last week I had to sit through one of those "competitive sales pitch" meetings. You know, where Company A compares their product to Company B and, of course, tries to make you draw the conclusion that Company A's product is light-years ahead of the competition, even if it isn't.

Now, I'm under NDA, so I can't disclose the brands, or in fact anything about the specs involved, but I can speak to the tone of the meeting.

It was mean, and spiteful, and nasty, and put me off Company A’s product entirely. (Needless to say, we’re not buying any)

Listen. I know every hardware vendor thinks their product is the best thing since sliced bread (and really, whose isn't?). But if you're going to do a comparison, make it about how great your product is, not how lousy your competitor's is. Otherwise you come off as petty, bitter, spiteful, and not very believable.

Show me the numbers. And not the marketing numbers, the real numbers. You say your array can do 1.5 million IOPS? Show me the breakdown. You say your switch can do sub-microsecond switching? Don't forget to clarify that that's only between adjacent ports. You say your backup software can back up a multi-terabyte system? Show me that it can restore it as well.

And don't show me slides with pictures of your parts and talk about how much better-looking, prettier, or better laid out your hardware is. It means nothing. Functionality is everything. Yes, you've combined multiple redundant components into one chip, but now, if that one chip fails, you lose 8x the functionality. (I.e., the only thing you've taken out of the system is the redundancy.)

I'm a big proponent of "you get what you pay for," especially in enterprise systems. Show me a vendor selling their hardware for 10% of what another "comparable" vendor is, and my first question is "What's missing?"

When I used to teach, I always told the story of making the roast. It’s a parable, but it works.

As follows:

I was making a roast one day, and I cut the ends off it before I put it in the pan. My kid asks, "Why did you cut the ends off the roast?"

“Because that’s how my mom did it.”

Curiosity got the better of me and I asked my mom “Mom, why do you cut the ends off a roast when you make it?”

“Because that’s how Grandma did it.”

Again, curiosity: I call my grandmother and ask HER, "Grandma, why do you cut the ends off the roast?"

“Oh, well my pan is too short.”

<head meets desk>

There is an inherent danger in doing things the way they've always been done without giving thought to why. Situations change, technology evolves, and suddenly the "way you've always done it" becomes the most inefficient way possible because some new method has come along, or even worse, becomes the WRONG way to do something because the underlying technology has changed.

"Hard" vs. "soft" zoning comes to mind. No one in their right mind does hard zoning anymore; most vendors discourage it, and a few won't even support it.

But 15 years ago, it was best practice. Things change, technology changes, so people MUST change along with it.

Plumbers define a slow drain as one down which your teenage kid has tried to wash a clump of hair vaguely resembling a tribble.

Ed Mazurek from Cisco TAC defines it quite differently:

…When there are slow devices attached to the fabric, the end devices do not accept the frames at the configured or negotiated rate. These slow devices, referred to as "slow drain" devices, lead to ISL credit shortage in the traffic destined for these devices, and they can congest significant parts of the fabric.

Having fought this fight, (and being that I’m *STILL* fighting this fight) I can say quite simply that when you have a slow-drain device on your fabric, you have a potential land-mine…

My (Non TAC) take is this:

Say I have the following configuration:

HostA is a blade server with 2x 4G HBAs
StorageA is an EMC VMAX with 8G FA ports.
HostA is zoned to StorageA at a 2:1 ratio (1 HBA : 2 FAs).

Now, when HostA requests storage from StorageA (for instance, say you have an evil software/host-based replication package, which shall remain nameless, that likes to do reads and writes in 256K blocks), StorageA is going to assemble and transmit the data as fast as it can. *IF* you have 32G of storage bandwidth staring down the barrel of 8G of host bandwidth, the host might not be able to accept the data as fast as the storage is sending it. The host has to tell the switch that it's ready to receive data (Receive-Ready, or "R_RDY"). The switch tries to hold on to the data as long as it can, but there are only so many buffer credits on a switch. If the switch runs out of buffer credits, this can affect other hosts on that switch, or if it's bad enough, even across the fabric.
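The arithmetic behind that "32G staring down 8G" claim, spelled out (numbers taken straight from the example zoning above):

```python
def oversubscription_ratio(hba_count: int, hba_gbps: int,
                           fas_per_hba: int, fa_gbps: int) -> float:
    """Storage-side bandwidth divided by host-side bandwidth for one host."""
    host_bw = hba_count * hba_gbps                  # 2 HBAs x 4G = 8G
    storage_bw = hba_count * fas_per_hba * fa_gbps  # 4 FAs x 8G = 32G
    return storage_bw / host_bw

# HostA: 2x 4G HBAs, each zoned to two 8G FAs.
oversubscription_ratio(2, 4, 2, 8)  # -> 4.0, i.e. 4:1 against the host
```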

So it's possible to find yourself having trouble with one host, only to have EMC or Cisco support point the finger at a completely unrelated host and tell you, "That's the offender. Kill it dead."

Symptoms

Random SCSI Aborts on hosts that are doing relatively low IO.

When a slow-drain device is affecting your fabric, IO simply isn't moving efficiently across it. In bigger environments, like a core-edge design, you'll see random weirdness on a completely unrelated server, on a completely unrelated switch. The slow-drain device in that situation causes traffic to back up to (and beyond) the ISL, and other IO gets held because the ISL can't move data off the core switch. So a host attached to Switch1 can effectively block traffic from moving between Switch2 and Switch3. (Because Switch2, being the core switch, is now ALSO out of B2B credits.)

The default timeout for waiting on B2B credits is 500ms. After that, the switch will drop the frame, forcing the storage to resend. If the host doesn't receive the requested data within the HBA's configured timeout, it will send a SCSI abort (ABTS) to the array. (You'll see this in any trace you pull.)

Now the array will respond to the host's ABTS and resend the frames that were lost. Here's the kicker: if the array's response gets caught in the same congestion, the host will try to abort again, forcing the array to RESEND the whole thing one more time.

After a pre-configured number of abort attempts, the host gives up and flaps the link.
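That drop/abort/flap escalation can be sketched as a toy state function. The 500ms credit timeout is from the discussion above; the abort limit here is a made-up example value, since the real number depends on the HBA driver.

```python
CREDIT_DROP_TIMEOUT_MS = 500   # switch default: drop a frame stuck this long
MAX_ABORT_ATTEMPTS = 5         # hypothetical pre-configured abort limit

def next_step(frame_wait_ms: int, aborts_so_far: int) -> str:
    """What happens to one congested exchange, per the sequence above."""
    if frame_wait_ms < CREDIT_DROP_TIMEOUT_MS:
        return "delivered"      # credits freed up in time
    if aborts_so_far < MAX_ABORT_ATTEMPTS:
        return "scsi-abort"     # switch dropped it; host sends ABTS, array resends
    return "link-flap"          # host gives up and bounces the link
```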

Poor performance in otherwise well-performing hosts.

The hardest part about this one is that the host's iostat will say it's waiting for disk, and the array will show its usual 5-10ms response times, but there really is no good way of measuring how long it takes data to move from one end of the fabric to the other.

I had a colleague who used to swear that (IOStat wait-time – Array Wait Time) = SAN wait time.

The problem with that theory is that so many things happen between the host pulling IO off the fabric and it being "ready" to the OS. (Read/write queues at the driver level come to mind.)
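My colleague's formula, written out, with the same caveat: everything between fabric and OS gets lumped into the "SAN" number.

```python
def naive_san_wait_ms(iostat_await_ms: float, array_response_ms: float) -> float:
    """(iostat wait time - array wait time) = 'SAN' wait time, per the theory above.
    In practice this also includes driver/queue time on the host side."""
    return iostat_await_ms - array_response_ms

naive_san_wait_ms(18.0, 6.0)  # -> 12.0ms blamed on the fabric, rightly or not
```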

There are a few, rather creative ways to mitigate a slow-drain…

You can hard-set the host ports to a lower speed.

Well, ok, I'm lying. This does the opposite of fixing the problem: it masks the issue. Hard-setting a host down to, say, 2Gb doesn't prevent the slow drain. What it DOES do is prevent the host from requesting data as quickly (or, for that matter, as often). I did this and saw it work, even though every ounce of logic I've got says it shouldn't. (It should, by all measures, make the issue much worse by slowing the rate at which the host can take data off the SAN.)

You can set the speed of the storage ports down.

Yes, realistically, this will work. If you absolutely have to, this will help. By reducing the ratio of storage bandwidth to host bandwidth from 4:1 to 2:1, you prevent the storage from putting as much data on the network at any given time. This prevents the back-up and should keep B2B credits from running out. However, there is a simpler option, and that is…

1:1 Zoning

It's been EMC's best practice for ages: single initiator, single target. While locking down the storage ports will work and will alleviate the bandwidth problem, simplifying your zoning will do the same job, with the added bonus of being easier to manage. The only downside is that some people don't like the host losing ½ of its bandwidth when a director fails. (With 1:2 zoning, you lose ¼ of your bandwidth when a director fails, not ½.)
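The director-failure math, for the skeptical (assuming the 2-HBA host from the earlier example):

```python
def bandwidth_lost_on_fa_failure(hba_count: int, fas_per_hba: int) -> float:
    """One failed FA/director takes out one path out of (hba_count * fas_per_hba)."""
    total_paths = hba_count * fas_per_hba
    return 1 / total_paths

bandwidth_lost_on_fa_failure(2, 1)  # 1:1 zoning -> 0.5 (half the bandwidth gone)
bandwidth_lost_on_fa_failure(2, 2)  # 1:2 zoning -> 0.25 (only a quarter)
```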

Reduce the queue depth on the server

Yes, this will work. Going from the EMC-recommended 32 down to 16, or even 8, restricts the number of IOs the host can have out on the fabric at any given time. This will reduce congestion…
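To get a rough feel for what queue depth changes, multiply it by the IO size (using the 256K replication IOs from the earlier example; these are illustrative numbers, not a sizing rule):

```python
def inflight_kib(queue_depth: int, io_size_kib: int) -> int:
    """Maximum data a host can have outstanding per path: depth x IO size."""
    return queue_depth * io_size_kib

inflight_kib(32, 256)  # EMC-recommended depth: 8192 KiB in flight
inflight_kib(8, 256)   # throttled to 2048 KiB, easing the B2B credit squeeze
```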

And lastly, my favorite:

Implement QoS on the array.

EMC supports QoS on the VMAX arrays out of the box. So, if you can, limit each host to the bandwidth its HBAs are capable of. (If you have multiple arrays, you'll have to do some clever math to figure out the best set-point.) This lets you keep the 1:2 zoning (2 FAs for each HBA) and prevents the slow-drain device from affecting your whole environment.
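The "clever math" might look something like this sketch. The ~100 MB/s-per-Gb figure is the usual rough approximation for FC line rate, and splitting the cap evenly across arrays is my assumption, not an EMC recommendation:

```python
def qos_cap_mbps(hba_count: int, hba_gbps: int, array_count: int) -> float:
    """Cap each host's per-array bandwidth so the total never exceeds its HBAs."""
    host_bw_mbps = hba_count * hba_gbps * 100  # ~100 MB/s per Gb of FC (approx.)
    return host_bw_mbps / array_count

qos_cap_mbps(2, 4, 2)  # 2x 4G HBAs shared across 2 arrays -> 400.0 MB/s each
```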

Set the “NO CREDIT DROP TIMEOUT” to 100ms. on each host-edge switch

This one is dangerous: it causes the switch to drop frames much faster when there are no buffer credits. That has the upside of forcing a slow-drain device to fall on its face BEFORE it can affect other hosts, in theory… But remember that the other hosts are experiencing the same types of timeouts; they'll get dropped too.

There's a great article on Cisco.com about what slow drain is, in much more detail than I could hope to get into here, in case you need help sleeping at night.

A couple of years ago my PDC died. It was the only physical box in my environment, and of course it was the one server that died.

I was 2,700 miles away, I wasn't going to be back any time soon, and stuff was broken. (Thankfully, customer data was on the Linux web-hosting environment, so nothing was lost there, except their backups.)

My setup involves 1 physical server and about 14 VMs (on two physical hosts). The physical server does a number of things: in addition to being the PDC/Infrastructure Master, etc., it holds my backups and gives me a place to run consoles for various management agents… etc.

It died after rebooting from a power failure in the hosted datacenter I was throwing good money away on. (Don't EVEN get me started.)

Anyway, technical mumbo-jumbo.

Recovered the original DC as a domain member using the following steps:

3. On DC1, run DCPROMO to remove Active Directory. (There were a couple of minor gotchas here; like an idiot I didn't write them down, but they were easy fixes, easily googleable (is too a word).) This removes all AD membership and makes it a standalone workstation.