This is the 166th article in the Spotlight on IT series. If you'd be interested in writing an article on the subject of backup, security, storage, virtualization, mobile, networking, wireless, DNS, or MSPs for the series, PM Eric to get started.

It was the summer of 2011, and I had a work placement lined up in the IT department of a haulage company in the south of England. The day started much the same as every other: It was 8:30 a.m. and I looked like something out of “The Walking Dead,” eyes half closed, a tablet in one hand with an RDP connection open to our monitoring server, and a cup of coffee in the other. I walked into the IT office and started checking through emails and any reminders I’d set the previous day. I soon got down to business and started following up on open tickets, when I noticed that fateful limited connectivity warning plastered on the networking icon at the bottom of my screen.

I thought nothing of it, and tried to ping outbound, with no luck… Rebooted, and again, no luck.

I logged onto one of my colleagues' machines; he was on the afternoon shift and, consequently, not in the office. His machine had only local access too. I went up to the work floor and tested it on a couple of machines — still no luck.

By this point I was starting to get worried. The company had their infrastructure set up so that other geographically remote sites had an MPLS (multiprotocol label switching) line into head office, where I was located. From there, there were two Internet pipes, a primary and a backup. In theory, if the primary connection failed, the backup should have taken over automatically with no human input necessary.
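To make that idea concrete, here is a rough sketch of what "watch the primary, swap to the backup" amounts to. It is purely illustrative: the gateway addresses are made up, it assumes a Linux box with the standard ping and ip commands, and it is emphatically not the setup we were actually running.

#!/usr/bin/env python3
"""Toy primary/backup failover watchdog (illustrative only).

Pings a host beyond the network edge; after several consecutive failures,
repoints the default route at the backup gateway. Real failover normally
lives on the edge routers (VRRP, route tracking, BGP), not in a script.
"""

import subprocess
import time

# Hypothetical addresses, not the company's real ones.
PRIMARY_GW = "203.0.113.1"   # gateway on the primary Internet pipe
BACKUP_GW = "198.51.100.1"   # gateway on the backup pipe
PROBE = "8.8.8.8"            # something well beyond our own edge
FAILURES_BEFORE_SWITCH = 3
CHECK_INTERVAL_SECONDS = 10


def probe_ok() -> bool:
    """Send one ICMP echo with a short timeout; True if a reply came back."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", PROBE],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def use_gateway(gateway: str) -> None:
    """Point the default route at the given gateway (Linux, needs root)."""
    subprocess.run(["ip", "route", "replace", "default", "via", gateway], check=True)


def main() -> None:
    on_primary = True
    failures = 0
    while True:
        failures = 0 if probe_ok() else failures + 1
        if on_primary and failures >= FAILURES_BEFORE_SWITCH:
            print("Primary pipe looks dead; failing over to the backup.")
            use_gateway(BACKUP_GW)
            on_primary = False
            failures = 0
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    main()

The point of the sketch is only that something, somewhere, has to notice the primary is dead and act on it; as you'll see, in our case nothing ever did.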

I spent about an hour trying to find the root (no pun intended) of the problem, trawling through endless amounts of cables, both fibre and Ethernet, managed and unmanaged switches, patch panels, and our routers and edge firewalls. I couldn’t find anything wrong.

Then the rather sharp-eyed IT manager I was working under, who was in the server room with me looking for the cause of the fault, spotted something odd. We were, logically speaking at least, checking equipment close to the network edge. We had two Ethernet cables, mixed in with the hundreds of others, that were responsible for connecting our patching and switching infrastructure to the outside world and the other sites — one for the primary Internet pipe and one for the backup.

You know those little green and orange LEDs on Ethernet ports that make server rooms look really cool? Well, this one wasn’t flashing. The mind-boggling thing was that everything was switched on, and connected — no cables had come out of place and no appliances had blown.

I realised that it might be a software issue and ran downstairs to call our carrier, who had no idea we were offline or that anything was wrong. So much for 24/7 managed monitoring...

By this time, I’d been in the office for around four hours. We had calls coming in left, right, and centre from our other sites and employees complaining about not being able to access the Internet.

Thankfully, our telephony and data switching networks were separate. All we could do was tell them we were working on the issue — even though we didn’t know what was wrong.

At about 1 p.m., six hours after the issue arose, the carrier got back to us and informed us that an engineer had inadvertently disabled that particular port on that appliance after, as he put it, “detecting unusually high amounts of traffic” entering our data network.

I couldn’t believe what I was hearing. There’s making a mistake, and then there’s just plain incompetence. His actions had killed us off for what turned out to be almost a whole day of productivity. But what could I do? They hadn’t technically breached our service level agreement. Around 1:15 p.m. the connection came back up, and we were online. I politely tried to enquire as to why our backup line hadn’t kicked in, and they gave me the usual first-line support speech: they’d look into it and get back to me.

Email, for us, was our lifeline. We were lucky in that our spam filtering was done off-site, which meant mail was queued off-site and de-queued when we came back online. Seeing all that mail pouring into our transport server at once was pretty interesting, but it went fine nonetheless. No mail was dropped or lost, thankfully!

After doing some investigation into past failures and events that the company had experienced before I had arrived, I found that it wasn’t the first time this had occurred. Our backup line had never worked and never kicked in — it was effectively dead weight, draining funds from our budget.

Unfortunately, I left the company at the end of the summer, shortly before they switched to a far better carrier, which was able to offer a connection eight times better than the one we had, for the same price we had been paying the old carrier.

Even the big players make mistakes; when trying to locate a fault, sometimes you need to go out and look at all parties involved, not just you and your own company. I guess the bigger your infrastructure is, the more you have that can go wrong!

Thankfully, no long-term damage came of the day’s events. It’s funny to think that such a large problem could be caused by such a small thing. I won’t deny it: I did enjoy the hands-on experience. The hardware that makes up a network looks complicated from a distance, but it's rather simple to map out and work with up close.

Have you had any experiences dealing with ISP incompetence? Share your story in the comments below!

I used to work for an ISP, so I have seen a lot of ISP incompetence. Fortunately, it wasn't for the company I worked for, but rather all of our competition. The worst was one company that had their equipment powered way too much (against regulations kind of "way too much"), and it constantly interfered with our connections.

Have a similar issue with our ISP, a major provider. In this case their router is dropping our connection. We've proven it to them, they have done site tests and see that there is a problem but they still refuse to replace the device, instead wanting to "monitor the situation further". Really?????

Does anyone have a dictionary? If you look up ISP incompetence, it will answer with Bell Canada. They are absolutely useless and without question the WORST ISP in Canada. Their tech support is terrible - it's based in India. If you get someone good who has spent time outside of India, they understand what you're going through and really do their best to help. That's a 1 in 100 event! Most of them read from their scripts and have poor English, both spoken and understood. Technically, I doubt most of them are even at A+ level. It's not their fault; they've been hired to do a job and not properly trained. Getting back to Canada, the Bell Canada field techs can be pretty bad. They can't grasp that they are not dealing with residential lines but with business lines, and that there is an expectation of service. The other major ISPs in Canada (Telus, Aliant, Sasktel and MTS) are pretty good in all aspects. Sasktel probably ranks top for ease of ordering lines, experience dealing with their people and reliability of service - the others are close behind.

If you want a backup line, use a different carrier. And pick ones that use different physical lines. For example, it probably doesn't make sense to use multiple DSL providers if both are using AT&T's DSL lines. Be wary of connections that are just resold/wholesale versions of someone else's broadband. Resold broadband isn't bad per se, but you want two connections that are less likely to both go down at the same time.

We once had our ISP assign 4 of our 10 static IPs to another account, effectively cutting off our static VPN tunnels to 3 remote sites and our email services. Fortunately, like you, we had off-site spam filtering in place which queued our messaging. It took three days to get everything back in working order, because the ISP did not want to disconnect their new client and kept trying to convince us to use a new set of IP addresses. 3 months later we cancelled our services with them and moved over to a "major" telecom (it took that long to do the fiber build-out). So glad to not have to work with the smaller ISP anymore.

Oh, we had something similar years ago, the year I started in fact. The local carrier had a vault up the street and in a fit of genius decided to use one of those "unused" modules in the equipment stack for a new customer. Unfortunately for us, the module was not unused; in fact, it was our only internet feed. We called as soon as we were sure it wasn't internal. They did some checking and figured out what happened. However, they then refused to swap the module back as it would take down their new customer! We had to stay down until a new module could be ordered and installed. It took about a day and a half to get it fixed.

Needless to say that did nothing to help the frayed relationship we had with them. They had trouble keeping us up in the first place since the vault regularly flooded. We did eventually switch off of them and ended up on TW Telecom's local fiber and have loved it ever since.

We had a very similar situation this past week. Thankfully our backup worked like it should have, though. The reason we were given for the outage on our main feed was that "some maintenance was done" in a nearby town. They were down for somewhere around 12 hours, as near as we could tell.

Try doing that with a satellite connection on a ship. Add in being docked in a country that is on the list of places recommended not to travel to & you can kiss goodbye to any site visits from engineers as well!!

I am the 1st (and only) IT guy for a fair-sized halfway house for fellow vets. I have been there a year and it's getting much better as things get done. I am also stuck with our primary provider and backup being the same folks, with a 5 year contract signed a year before my arrival. Best I could manage was backup on DSL, primary on fiber, since they are on different poles ;-) . It's the auto-failover never working that gets me up early. Thanks for sharing!

15 Compaq iPAQs (yes, still running Win98SE in 2012) and a basement full of bricked Dell Dimension 4100s. The Compaqs went to a museum and the Dimensions are for the day program folks to destroy.

Well, that sounds very familiar. Our previous and even current carrier's tech support are somewhat of a disappointment. You would think (or at least hope) that they would train their techs to handle consumer and business services differently, but not at our current company. When they installed our service, they totally forgot to configure our static IPs on our modem, so we had internet for a few moments, but when they reset the modem, poof... lost internet. You will all get a huge kick out of this, though, because I did. Later, when we had a huge latency issue, we had multiple techs come out because none of them could find a problem, even though there was an issue at the tap outside, which I told them about. But anyway, what I was getting at was that when all 5 techs tried to test the internet with their laptops hardwired into the modem, none of them could, because they didn't have "The Smarts" to set their laptops to one of our static IPs.

It was unbelievable that none of the techs knew that. As you can all imagine, I was not happy that I had to babysit these techs since I had a lot of other things to take care of.

After about a month of monitoring, and me calling every few days, they finally dispatched a line tech, who came and fixed the issue at the tap: a faulty splitter and connection. I could see it myself, but I kept getting the same canned response: "it is within normal operating conditions."

I was happy that they finally fixed the issue, but was still amazed at how everyone except the line tech talked to me like I had no clue what I was doing.

I don't quite understand how it took your team half a day to discover the port had been disabled. Was the ISP managing the switch for you and you weren't able to see the traffic?

Even a simple ICMP test to different devices on network and off network would have given you the results necessary to ascertain where the problem lay.
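Something as simple as the sketch below (hypothetical addresses, plain ping, Linux syntax) would have shown which hop stopped answering:

#!/usr/bin/env python3
"""Ping a chain of hosts from the inside out; the first one that stops
answering tells you roughly where the fault sits. Addresses are made up."""

import subprocess

# Ordered from nearest to farthest; adjust to your own topology.
HOPS = [
    ("core switch", "192.168.0.2"),
    ("edge firewall", "192.168.0.1"),
    ("ISP gateway", "203.0.113.1"),
    ("external host", "8.8.8.8"),
]


def ping(host: str) -> bool:
    """One echo request with a short timeout (Linux ping syntax)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


for name, address in HOPS:
    status = "OK" if ping(address) else "NO REPLY  <- fault is at or before this hop"
    print(f"{name:15} {address:15} {status}")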

Glad to see though that the company moved on from such an abysmal ISP.

Unfortunately, yes. We had no access; the provider was managing it for us. There were only two of us in the office that day: me and the manager. One tech was on holiday, the other manager was out of the office, and the other tech was working off-site.

If you want a backup line, use a different carrier.

Yes, and to everyone else who suggested that: I would never have a backup circuit from the same provider again. It's not worth the hassle.

We used to have 2 dedicated analog circuits. Time came and we didn't need one of them any more. Of course Verizon turned off the wrong one, killing the connection between two of our larger sites. After two hours of calls trying to find anyone who even knew what I was talking about, I went to the CO where the problem was and found a tech there (normally the office is not staffed). I knocked and the guy came to the door and reconnected the circuit within a few minutes. None of my phone calls ever got any reasonable response.

Since then the Verizon field staff have bailed me out a few times, even calling me when they get an order that's all wrong because they knew it wasn't something I would do.

If you want a backup line, use a different carrier.

I agree with you, but sometimes it's all that's available. And even going with another wholesaler can still save you from accounting errors, or from the kind of misconfiguration that was at the heart of this story.

For one office I supported, the only Internet available was at&t DSL: the landlord said no to satellite, the cable company wanted $120,000 to bring in cable, and cell phone coverage was awful. So I chose at&t DSL and backed it up with a DSL reseller. When the accountant "forgot" to pay the at&t bill and at&t disconnected us, the reseller kept us going...

Way, way back in the day I worked for a warehousing and shipping company. We had a dozen facilities linked via 56k leased lines (Gandalf RS-232 muxes, Wyse terminals, Okidata fanfold printers :-) ). It never failed: the days I passed an Ameritech truck parked next to the vault at the end of our street, we would have problems. It seemed an impossible task for them to hook someone else up without wrecking at least one of our lines.