I’ve been a strong Telegram advocate since its launch in 2013, mainly because of the advanced features and technical state of the art compared to competitors – as a consequence, I’ve been looking very closely at their infrastructure for the last few years.

The two large scale outages that recently hit their users and the sequence of events following them made me ask some questions around their platform. For most of them I have only found additional question marks rather than answers, but here it is what I have so far.

Let’s start from the outages: in case you missed that, on March 29th and April 29th this year, Telegram went down in their Amsterdam datacenter due to a power failure, causing disruptions, according to their official communications, to users in EMEA, MENA, Russia and CIS.

Power outage at a @telegram server cluster causing issues in Europe. Working to fix it from our side, but a lot depends on when the datacenter provider puts the power equipment in order.

Repairs are ongoing after a massive power outage in the Amsterdam region that affected many services. Telegram users in Europe, MENA, Russia and the CIS are currently unable to connect. We apologize and will update you on the progress. https://t.co/WyfYimZoKV

Zooming in on the latter: it’s still ongoing at time of writing this article (8:30AM UTC), and is showing up with clients unable to connect to the platform and both https://www.telegram.org/ (website) and https://api.telegram.org/ (api endpoint) failing with an HTTP error code 500.

Let’s start with the items that, to me, don’t add up: first and foremost, the outage. In case of “massive power outage” in the Amsterdam area, I would expect to see a traffic drop in AMS-IX, the largest Internet Exchange in the region, but there is none (it should be showing around 01 AM):

There are indeed reports of an outage that affected Amsterdam (below the one from Schiphol Airport), but no (public) reports of consequent large datacenter failures.

Who’s involved in running large scale platforms will be surprised by at least two things here: the fact that they are serving an huge geographical area from a single datacenter and their inability to reactively reroute traffic to the other locations they are operating, even in case of extended outage (no DR plans?).

A quick search on Twitter shows that even if the official communication states the issue is only affecting the EMEA region, users from Canada, US, Australia, Japan and other countries are facing it as well.

I used Host-Tracker to have a deeper look into this: an HTTP check to Telegram’s API endpoint and their website fails with an HTTP 500 error from every location across the world:

I went ahead and began digging to find out more about their infrastructure, network and the other locations they are running from.

And here comes the second huge question mark: the infrastructure.

A bunch of DNS lookups across the main endpoints show they are always resolving to the same v4 and v6 IPs, in a way that doesn’t look related to the source location of my queries.

They look to be announced by AS62041 (owned by Telegram LLP): this kind of DNS scheme made me think they were running an anycast based network, so next logical step has been analysing latencies from multiple locations.

Turns out, latency is averaging 20/30ms from EMEA, 100/150ms from AMER, and 250/300ms from APAC: as if from all of those countries you were being routed to the Amsterdam datacenter.

What I’m seeing in terms of latency is confirmed by analysing reverse lookups of routers found in the different paths to Telegram: in my trace from Australia the last visible hop is et3-1-2.amster1.ams.seabone.net (notice that “ams”), most of the traces from US are landing on xcr1.att.cw.net (195.2.1.14) which 1 millisecond away from my lab in Amsterdam and a couple of samples from US and Canada are running all the way up to ae-2-3201.ear3.Amsterdam1.Level3.net, which is self-explaining.

Important to highlight, there are no outliers: I couldn’t find a single example of very low latency from APAC / AMER, that would have proved the existence of a local point of presence. A summary of my tests in the table below:

To get the full picture, I decided to dig into AS62041 main upstream carriers (CW AS1273, TI Sparkle AS6762, Level3 AS3356) and see how they were handing over internet traffic to Telegram.

Turns out, CW is always preferring the path to xcr1.att.cw.net/195.2.1.14 (tested from some locations across the world), our little router-friend in Amsterdam. TI Sparkle always lands on amster1.ams.seabone.net and Level3 only has paths to ear3.Amsterdam1 (tested from Asia and US). Level3’s BGP communities are interesting: routes are tagged as “Europe Backbone” and “Level3_Customer Netherlands Amsterdam”:

Telegram is also peering with Hurricane Electric (AS6939): their routers in US, JP, AU have a next hop of ams-ix-gw.telegram.org/80.249.209.69 for 149.154.164.0/22. That hop seems to be Telegram’s AMS-IX facing router, and the IP is definitely part of AMS-IX:

As said in the opening, there are definitely more questions than answers in the article. It’s as if there was no Telegram infrastructure outside Amsterdam, and over there it was running in a single datacenter. This would explain why users across the world are seeing an outage that should only affect EMEA and close areas, and why Telegram is not taking steps to reroute users to another datacenter/location during the failure in AMS.

Am I missing something very obvious? Please let me know!

UPDATE: With the help of some friends and random people, I found out more details. Find them (with -ongoing- updates) in the dedicated post.

D-Day for replacement is April 18th (taking a day off from my job to go and do the same things, just for hobby, feels really weird, yes), with a 6 AM wake up call, flight to AMS, 8/10 hours to do everything and a flight back to LON (LTN to be precise, because I didn’t double check before hitting “Buy”).

Now to the sad part: there is no (easy) way to just move the drives to the new server and have everything working, so I have to reinstall it from the ground up. This means my stuff (including this blog, because loose-coupling is a thing but I decided to run its DB and NFS from another country… …for some reason) will be down (or badly broken) during that time window and possibly longer, depending how much I manage to do while I’m onsite.

The timing couldn’t be better for a clean start, as in the last few months I had been considering the option to move away (escape) from Proxmox (which, as an example, is so flexible that its management port number is hardcoded everywhere and can’t be changed) to something else, most likely oVirt or OpenNebula. Haven’t taken a decision yet, but I’ve really fallen in love with the latter: it’s perfect for the cloud-native minds and runs on Debian, whereas oVirt would force me to move to the RPM side of the world.

Deeply apologise in advance for my rants on Twitter while I try to accomplish this mission. Stay tuned.

Giorgio

* I.AM.100%.CLOUD. There are two things you can’t (yet) do in the cloud: physical backup of your assets that live in the cloud and testing stuff which requires VT extensions. This is what I’m doing here: ZA is my bare-metal lab.

** this is not ZA Rev2. It was supposed to be, but it came in with a faulty backplane so I pushed for it to be entirely replaced. I don’t have a picture of the new one with me at the time of writing but… yeah, it looks exactly the same (with better cable management).

In term of diagnosing and solving this problem HP’s tech support has been useless so far, so in the last few weeks I’ve been digging deeper and deeper into this, and here are my findings (in logical, and not chronological, order).

Let’s start from scratch, for the benefit of who has not been following this from the very beginning: I’ve installed, tested and shipped the machine with the main drives only (Samsung 850 EVO SSD), as the capacity ones I wanted to use (SATA 2.5″, 1TB, 7200rpm) turned out not being easy to find on the market.

When I was finally able to buy 2+1 drives of the exact HGST model I was after, I screwed them to their caddies and shipped them to the colocation: when they confirmed the drives had been placed into the server, I rebooted it and configured them in a mirrored mdraid array.

Then I noticed that power consumption had gone up from 0.3 to 0.5 Amps:

The raid (re)build was still ongoing and CPU usage was high, so I just ignored this, even if even during previous spikes I had never seen such an high power consumption. To mi surprise, the morning after power usage was still 0.5A, even if the rebuild had finished hours before and load average was back to 0.0something.

With no evidence of something being wrong with the system itself, I blamed the drives (HGST HTS721010A9E630) and started researching for someone else facing the same issue with them. Nothing came out, as expected, and got confirmation from some docs that the power usage to be expected was way lower than what I was seeing.

By chance, I found some threads on the HP forums mentioning situations where non-genuine hard drives were causing “high noise”. Being unable to check the noise by myself without travelling to the colocation, I went ahead and had a look at the fans speed in my iLO, to realise all of them were running at 100%: at that stage I didn’t knew the pre-upgrade reading (now I do: 19%), but while testing it at home (in a way less controlled and warmer environment than the datacenter) I had never seen anything above 30%.

At this stage, I had finally found the cause for that huge power usage: extremely high fan speed. It was now time to try and explain the latter. First thing I checked, of course, were temperatures around the system: everything was good according to the iLO, no alarms nor criticals (not even warnings) and SMART readings were fine, with 20/21C on every drive. Nothing was explaining why the DL320 was trying so hard to cool itself down.

Then I found this article, where David described the same problem and found the perfect name for this phenomenon: Thermal Runway. Based on his description, looks like I’ve been very lucky, as other HP ProLiant servers are even shutting themselves down due to wrong temperature readings. Needless to say, my hard drives P/N were in its list of known bad ones.

Scraping the IPMI details, I found the sensor who was causing this whole thing: “05-HD Max”, which was at 58C. I’ve researched its details, and looks like it’s not a physical sensor, but rather an average of all of the SMART readings. With the temps for my four drives being around 22/23C max according to SMART, there was no way their average could have been 58C. Making things worse, this sensor has an hardcoded, non editable warning threshold at 60C.

With no clue on what to do next, I tried asking HGST if there was a firmware upgrade available (the DL320 G8 is on latest version of everything), but after 15 days, a number of emails and multiple levels of escalation they didn’t even manage to understand what I was asking for, so I decided to give up with them.

At this stage, with all the details I was able to gather I logged a support case to HP, and at the same time bought two new Seagate HDDs (ST500LM021-1KJ15), just to learn, after trying them, that they cause the same problem.

After a very honest first answer where HP’s tech support told me that the system was speeding up FANs as the drives were not recognised as HP genuine, they changed their mind and started pretending the 58C reading was real, and my drives were really running so hot.

I was lost again, and started wondering what did prehistoric people do before the cloud came, when they had this kind of hardware issues. Their first step was probably to go in front of the broken server, so I jumped on a plane and did the same.

(a picture of MY-ZA while undergoing surgery)

First thing, I was able to confirm the 58C reading was definitely wrong (as expected, anyway, but I was looking for a proof to show HP), and SMART was right: drives were super-cold, even if extracted while running. Moreover that sensor was jumping from 24C to 58C in 2/3 seconds after placing them in, which is rather hard (just think about the thermal shock).

Second, I tried to put the drives in different positions (and on a different port of the P420i RAID controller), and the issue was still there.

As last resort, I connected them to the onboard B120i HBA, and the system started working properly. Sensor 05 back to normal, drives running ok, etc. Not a good solution tough, as I’ve paid for the P420i + cache and under no circumstance I will do without it.

Fortunately, while upgrading my iLO4 to firmware 2.55, I noticed that after resetting it sensor 05 was temporarily disappearing, until the next operating system reboot. With this sensor disappearing, everything goes back to normal: fans to 30%, consumption to 0.3A, my bank account not at risk anymore.

sensor 05 has disappeared: 03, 04, … 06.

So, even if not particularly good looking and clean, I had found a solution: resetting the iLO. I went ahead and installed freeipmi, then made sure “bmc-device –cold-reset” is run 30 seconds after the system boots.

I’m still holding some kind of hope in HP support: I asked them to provide me with a way to permanently disable that sensor or raise its threshold, at my risk (read: voiding warranty).

It’s hard to describe how frustrated I am with both with HP servers, policies and support: not being able to test all existing parts and so having some “genuine” and some non genuine ones is okay, but artificially messing up a temperature reading to increase power consumption (and thus costs) and force their customers not using parts from 3rd parties can only be defined with a word: sabotage.

Like this:

Exactly one year ago today I was sitting in a room in Amazon’s London Holborn office, attending the New Hire induction and waiting for my manager to pick me up and introduce me to the rest of the Technical Account Managers team.

It has been one year already – it’s about time to tell my story, and share my experience in this (amazing) reality.

(this is me at this year’s London Summit, looking for something, somewhere)

Looking back at the first year (or, in Amazonian terms: “those first 365 day one’s.”), I can easily highlight a few different phases. Here they are, in a more or less chronological order.

—

Phase 1: “lost” (in an hexagonal office)

Technical Account Managers (TAM) spend a lot of time with customers, and only drop into the AWS office when required. As a new starter this can be a little daunting, especially when trying to get set up – configuring your mobile, using the vast array of internal tools you have at your fingertips and the simple things, like finding the toilet.

The good news is: everybody is always happy to help you. Literally: everybody. In my first days I had phone calls with most of my team mates, shadowing sessions in front of customers, and even asked a mix of random people in the office for various kinds of help: they always guided me, as if it was a single, big family and that helped me, and I never really felt lost (yeah, I know, but it looked as a good title for this chapter…).

(about the toilet, if you’re wondering: I realised that as our office was hexagonal – or kind of -, everything was “straight on and then on the left”)

I’ll skip phase 1.5, the official training: we spend about two to three weeks in classes with Support Engineers before getting hands on with the day to day job. The training is what you’d expect from training, but it provides a great opportunity to meet and learn from tenured colleagues. This is also when I personally went from getting lost in the London office to getting lost in the Seattle campus (every. single. time.).

Phase 2: the ramp up (aka: “OMG I don’t know anything”)

The ramp up that comes after the training is exciting: you’re back, you’ve had 2/3 weeks to try to learn as much as possible and after three weeks of training, you think you know what you are doing – you’ve learnt the theory, you know how to use the tools, you think you know what to do when, and you’re ready to get on with it.

In theory.

What you realise at this point is that yes, it’s true, and you’re working with Amazon Web Services. If you work with cloud, you hear this name daily, and becoming part of it doesn’t simply feel real for a while.

One of the first matters I understood was that the only thing I was bringing with me in AWS was my brain: your past experience can definitely help, but Amazon is so different from other companies that you have to learn, literally from scratch, almost everything. If you’ve been hired it’s because you share the mindset, so it’s not hard and it’s not an obstacle, it’s just something to keep in mind.

The main differences? First, and by far, is our “Customer Obsession”. We obsess over our customers, and not over our technology: every discussion we have ends up focusing what’s best for our customers, and how we can improve their experience. We work every day making sure we help them doing what’s best for their platforms – not for us – and we spend our time listening to them and trying to figure out how to make their life easier.

The second one is definitely what’s summarised in our “Everyday is Day One” motto, which is much more tangible than you would expect from something that is written on every wall in an HQ. Our customers and us are moving so quickly that you must always be ready to wake up and start as if you were in a completely new world. You learn new things daily and the technology you were using / evangelising three months before could not be the best one for a given use case anymore.

This is all about change and how it becomes part of your daily routine.

Phase 3: the First Customer

After a few months you’re ready to onboard your first customer. I had spent some time shadowing and helping a more tenured colleague, and in November I was ready for onboarding my first “very own” account.

At that point in time I was confident on my daily tasks, had already had to deal with critical situations, and everything was looking good. But the first customer you onboard onto AWS Enterprise Support is just different: you’re starting a journey together, with some pre-defined goals and some others that will eventually show up.

It’s just matter of weeks, and you will start knowing your customer’s team members by first name, and recognising who’s logging a support case just by looking at their writing style.

Yes, that’s a very close relationship: some of my colleagues love to say that we work for Amazon, but on behalf of our customers.

Phase 4: the first event

You don’t really feel part of the customer’s team until you go through your first event. An event could be anything, from a planned traffic spike or feature launch, to, ehm, yes, an unplanned downtime.

Let’s pick a feature launch: it’s something big, the customer’s development teams have been working for months on it, the marketing team is heavily pushing and the operational teams do have a single focus, making sure everything will work smoothly.

This is where our teams become glued together with the customer’s: we share a goal, we share a focus, we setup “war rooms” and make sure everything is in place and properly architected for when the big day arrives. The TAM acts here as a customer facing frontman for an army of Support Engineers, Subject Matter Experts, Service Team Engineers, and many more – and during this kind of events, everyone comes together.

And then it happens – detailed and obsessive planning ensure everything works smoothly and meets expectations, leaving plenty of time to celebrate – and to realise that none of this would be possible without the super close relationship we develop with our customers.

Phase 5: personal development

This is not really a phase (mainly because it never ends), but after you’ve been in the company for 6/8 months you begin having really clear ideas on how things work, where you want to go and what you want to do.

AWS is a world of opportunities, for any kind of person: in this first year I joined a team which is helping our customers with the migration of strategic workloads and presented at the AWS Summit in London.

I’m currently trying to decide what to target next.

Phase 6: retrospective

As said, technology is evolving quickly, and so are we and our customers. When you reach the one-year mark, you try to look back and this is when you really understand where you used to be, and where you are now.

Where your customers were, and where they are now: the distance they have most likely covered in a single year looks unbelievable.

Phase 7: writing a blog post about your first year

Come on, I’m just joking.

—

Time to wrap up: I’m enjoying my new working life, my team, my mentor(s), my manager(s) and the extended Enterprise Support team. I have the opportunity every day to work with exciting customers, to actually be part of my customer’s teams and to experience the latest innovations first hand.

There is a question I get asked a lot, especially from people who know my background: do I miss being hands on, had to do with operations? Not really. First, we have time and business needs for testing and using any new product we launch, so I still spend some time actually “playing” with stuff. Second, despite the name, this role is super-technical – we get to see a lot of operations, development and devops.

If you are reading this and looking for a new and interesting challenge, or would like to consider joining the AWS team, then get in touch.