D-Day for replacement is April 18th (taking a day off from my job to go and do the same things, just for hobby, feels really weird, yes), with a 6 AM wake up call, flight to AMS, 8/10 hours to do everything and a flight back to LON (LTN to be precise, because I didn’t double check before hitting “Buy”).

Now to the sad part: there is no (easy) way to just move the drives to the new server and have everything working, so I have to reinstall it from the ground up. This means my stuff (including this blog, because loose-coupling is a thing but I decided to run its DB and NFS from another country… …for some reason) will be down (or badly broken) during that time window and possibly longer, depending how much I manage to do while I’m onsite.

The timing couldn’t be better for a clean start, as in the last few months I had been considering the option to move away (escape) from Proxmox (which, as an example, is so flexible that its management port number is hardcoded everywhere and can’t be changed) to something else, most likely oVirt or OpenNebula. Haven’t taken a decision yet, but I’ve really fallen in love with the latter: it’s perfect for the cloud-native minds and runs on Debian, whereas oVirt would force me to move to the RPM side of the world.

Deeply apologise in advance for my rants on Twitter while I try to accomplish this mission. Stay tuned.

Giorgio

* I.AM.100%.CLOUD. There are two things you can’t (yet) do in the cloud: physical backup of your assets that live in the cloud and testing stuff which requires VT extensions. This is what I’m doing here: ZA is my bare-metal lab.

** this is not ZA Rev2. It was supposed to be, but it came in with a faulty backplane so I pushed for it to be entirely replaced. I don’t have a picture of the new one with me at the time of writing but… yeah, it looks exactly the same (with better cable management).

In term of diagnosing and solving this problem HP’s tech support has been useless so far, so in the last few weeks I’ve been digging deeper and deeper into this, and here are my findings (in logical, and not chronological, order).

Let’s start from scratch, for the benefit of who has not been following this from the very beginning: I’ve installed, tested and shipped the machine with the main drives only (Samsung 850 EVO SSD), as the capacity ones I wanted to use (SATA 2.5″, 1TB, 7200rpm) turned out not being easy to find on the market.

When I was finally able to buy 2+1 drives of the exact HGST model I was after, I screwed them to their caddies and shipped them to the colocation: when they confirmed the drives had been placed into the server, I rebooted it and configured them in a mirrored mdraid array.

Then I noticed that power consumption had gone up from 0.3 to 0.5 Amps:

The raid (re)build was still ongoing and CPU usage was high, so I just ignored this, even if even during previous spikes I had never seen such an high power consumption. To mi surprise, the morning after power usage was still 0.5A, even if the rebuild had finished hours before and load average was back to 0.0something.

With no evidence of something being wrong with the system itself, I blamed the drives (HGST HTS721010A9E630) and started researching for someone else facing the same issue with them. Nothing came out, as expected, and got confirmation from some docs that the power usage to be expected was way lower than what I was seeing.

By chance, I found some threads on the HP forums mentioning situations where non-genuine hard drives were causing “high noise”. Being unable to check the noise by myself without travelling to the colocation, I went ahead and had a look at the fans speed in my iLO, to realise all of them were running at 100%: at that stage I didn’t knew the pre-upgrade reading (now I do: 19%), but while testing it at home (in a way less controlled and warmer environment than the datacenter) I had never seen anything above 30%.

At this stage, I had finally found the cause for that huge power usage: extremely high fan speed. It was now time to try and explain the latter. First thing I checked, of course, were temperatures around the system: everything was good according to the iLO, no alarms nor criticals (not even warnings) and SMART readings were fine, with 20/21C on every drive. Nothing was explaining why the DL320 was trying so hard to cool itself down.

Then I found this article, where David described the same problem and found the perfect name for this phenomenon: Thermal Runway. Based on his description, looks like I’ve been very lucky, as other HP ProLiant servers are even shutting themselves down due to wrong temperature readings. Needless to say, my hard drives P/N were in its list of known bad ones.

Scraping the IPMI details, I found the sensor who was causing this whole thing: “05-HD Max”, which was at 58C. I’ve researched its details, and looks like it’s not a physical sensor, but rather an average of all of the SMART readings. With the temps for my four drives being around 22/23C max according to SMART, there was no way their average could have been 58C. Making things worse, this sensor has an hardcoded, non editable warning threshold at 60C.

With no clue on what to do next, I tried asking HGST if there was a firmware upgrade available (the DL320 G8 is on latest version of everything), but after 15 days, a number of emails and multiple levels of escalation they didn’t even manage to understand what I was asking for, so I decided to give up with them.

At this stage, with all the details I was able to gather I logged a support case to HP, and at the same time bought two new Seagate HDDs (ST500LM021-1KJ15), just to learn, after trying them, that they cause the same problem.

After a very honest first answer where HP’s tech support told me that the system was speeding up FANs as the drives were not recognised as HP genuine, they changed their mind and started pretending the 58C reading was real, and my drives were really running so hot.

I was lost again, and started wondering what did prehistoric people do before the cloud came, when they had this kind of hardware issues. Their first step was probably to go in front of the broken server, so I jumped on a plane and did the same.

(a picture of MY-ZA while undergoing surgery)

First thing, I was able to confirm the 58C reading was definitely wrong (as expected, anyway, but I was looking for a proof to show HP), and SMART was right: drives were super-cold, even if extracted while running. Moreover that sensor was jumping from 24C to 58C in 2/3 seconds after placing them in, which is rather hard (just think about the thermal shock).

Second, I tried to put the drives in different positions (and on a different port of the P420i RAID controller), and the issue was still there.

As last resort, I connected them to the onboard B120i HBA, and the system started working properly. Sensor 05 back to normal, drives running ok, etc. Not a good solution tough, as I’ve paid for the P420i + cache and under no circumstance I will do without it.

Fortunately, while upgrading my iLO4 to firmware 2.55, I noticed that after resetting it sensor 05 was temporarily disappearing, until the next operating system reboot. With this sensor disappearing, everything goes back to normal: fans to 30%, consumption to 0.3A, my bank account not at risk anymore.

sensor 05 has disappeared: 03, 04, … 06.

So, even if not particularly good looking and clean, I had found a solution: resetting the iLO. I went ahead and installed freeipmi, then made sure “bmc-device –cold-reset” is run 30 seconds after the system boots.

I’m still holding some kind of hope in HP support: I asked them to provide me with a way to permanently disable that sensor or raise its threshold, at my risk (read: voiding warranty).

It’s hard to describe how frustrated I am with both with HP servers, policies and support: not being able to test all existing parts and so having some “genuine” and some non genuine ones is okay, but artificially messing up a temperature reading to increase power consumption (and thus costs) and force their customers not using parts from 3rd parties can only be defined with a word: sabotage.

Giorgio

Share this:

Like this:

Let me start from the beginning: during my relocation last year, I left my desktop computer behind. It hadn’t been my primary machine for a while and I was probably powering it on only once a month, but it was still my core repository for backups and long term storage.

As I went 100% cloud years ago (no USB drives, no external HDDs, etc) my “current” dataset is now online, synchronised with my laptop(s). Still, there are some hundreds of GBs of “cold” (as in: I will probably never need them again) pictures/docs/archives that I want to be able to access, even remotely, at any time. After exploring some mid-range NAS solutions, I ended up realising that despite having a reliable internet connection, my flat was not the best location for hosting it, so started looking around for a decent colocation space.

It didn’t take much time to figure out that space and power in a datacenter are so expensive that a NAS isn’t suitable nor effective for this purpose.

I’m sure you’re now wondering what the hell I am doing here. The answer is easy, and anybody with an engineering mindset can probably confirm: sometimes we need to spend time and energy in experiments even if we know they will fail, because what we want to figure out is how exactly they will fail.

To be honest, even if I knew this choice was sub-optimal at the very least, I was like: “Hey, what could go wrong? It’s just a server”.

Well, now I know the answer: anything – (and if you cross this with Murphy’s law…).

My background is in traditional IT, but looks like I quickly forgot about the pain of having to deal with bare metal. To make sure this doesn’t happen again, here’s a quick reminder that might also help you all:

Servers are expensive: this is a $2800 machine (I’ve paid roughly 50% of that), that will cost around 70/80$ per month just by colocation and bandwidth. Moreover, in 2 years time it will be obsolete.

They’re slow, reaaaaaaalllllly slooooooww. This thing wastes 10+ minutes just to get to the operating system boot. Don’t forget this if you’re doing something that requires a lot of reboots (like trying different RAID configurations, updating a newly installed Windows, etc). We’re now in an era where the boot time of an instance is shorter than what it takes to you, slow and inefficient human, to copy and paste connection details in your SSH client.

What about the risk? Well, it’s huge. I have onsite support, but no spare parts. So, should something bad happen, the downtime will be counted in hours, at least.

They don’t scale. This “thing” has already reached the maximum amount of RAM it can hold. What if I need more? I have two options, double the colocation space (and thus cost) and buy a similar second server, or buy a larger one to replace it and begin a slow, complex and painful migration.

Agility? What? – You must manage it as you would do with a pet. If something breaks, repair it, if the OS is out of date, upgrade it. Well, in a world where if an instance is broken you immediately spin up a new one, having to fix an OS doesn’t seem appropriate.

SSDs do have a well defined lifespan. This is not something you care about if you’re using a cloud hosting service, but here you should keep it in mind, as they will eventually die. Both at the same time, as their load will be similar.

After having spent the last 7 days (evenings to be fair, as I have a job during the day) on this project, I think I have definitely debunked the theory about cloud not being effective for personal workloads.

Project failed, time to terminat…

…no, wait, you can’t terminate a bare metal server: it’s an investment, it’s a long term decision, you can’t just roll back as you would do with a cloud instance.

Oh, God.

* don’t even try to understand my host naming convention. There are no standards, names are just random letters. Servers are cattle, not pets, right?