A Journey

We had three days of wildly fluctuating internet. The fluctuation looked almost like a sine wave. Nobody could figure out what had gone wrong or where the problem was.

NOTE : This is not just another ‘masala’ post. I am writing down what I actually experienced.

DAY 1 : August 26

It all started on August 26th, sometime in the early morning hours, when browsing speeds and bandwidth usage touched their lowest levels in the last month. I keep monitoring the bandwidth usage (bandwidth graphs and download progress bars appeal to me for some unknown reason; I keep staring at progress bars whenever I download something and get lost in a kind of dreamworld), so I was surprised to see such low usage: everyone was on campus, and usage should have been at its peak. It returned to normal after a short while and browsing was fine, but the pattern kept repeating itself. I went to attend my class, returned at 11:30AM, and rushed to the server room to check out what was going on. By that time the server room was swamped with phone calls from different research centers.

Nobody was actually able to figure out what was going on. All we knew was that there was heavy broadcast traffic coming from one segment of the network. We suspected it was the same problem we had faced the previous week. But isolating the problematic area is a heck of a job, and nobody was ready to check the network devices at the leaf level, for two reasons: (1) it would take almost a day to check the individual NICs in all the labs, and (2) there was no guarantee that the problem would be resolved.

We took the tough decision to shut down the network in the entire problematic segment. This worked and the rest of the network was fine. No fluctuations. But it turned out to be the wrong decision. We hadn't informed the people in the affected segment (which unfortunately included major research centers at IIIT, i.e. CVIT, CDE, CVEST, LTRC (temp) etc.), and immediately we had to face phone calls from the HODs. One thing I learnt from this situation is that internet connectivity is equally important for everyone at IIIT, including faculty members, even though we keep blaming students for being addicted to the internet. Internet here is not an addiction, it's a need. We had to bring the segment back up, and the rest of the network started fluctuating again. Everybody left for lunch.

As time passed, the frustration among the users grew and everybody was almost shouting. Everybody wanted to know why it was taking so long to solve the problem. After lunch one of the admins went to the problematic area and started debugging at the individual switch level. But he had a really tough time, as most of the switches at the leaf level are unmanaged (you can't see any error reports unless you plug into the individual switch). And we have a lot of switches (by a lot I mean a real lot of switches). And the switches are cascaded in such a dangerous manner that isolating a problem becomes way more difficult. By evening that day we had isolated two research labs and three other segments which were generating heavy broadcast traffic. We shut them off and everybody left for the day. There was a kind of blackout in those segments. No internet, no LAN.

During the night, I kept monitoring the network. A lot of people pinged me and complained about DNS resolution. Web pages were loading at high speed, but name resolution was taking a long time. I looked at the logs and the traffic. Everything was fine except that the nameserver was being swamped with name resolution requests from the mail servers. I tried a few hacks but nothing worked.
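For anyone who wants to put a number on "name resolution was taking a lot of time", here is a minimal Python sketch that times a single lookup. It is not what I ran that night, just an illustration; the function name `time_lookup` and the `resolver` parameter are my own, and the result depends on whatever caching the operating system does.

```python
import socket
import time
from typing import Callable, Optional


def time_lookup(
    host: str,
    resolver: Callable[[str, Optional[int]], object] = socket.getaddrinfo,
) -> Optional[float]:
    """Return the seconds one name lookup took, or None if it failed.

    `resolver` is a hypothetical hook so the function can be tried
    without a network; by default it is the standard socket.getaddrinfo.
    """
    start = time.perf_counter()
    try:
        resolver(host, None)  # port is not needed for a plain lookup
    except OSError:  # socket.gaierror is a subclass of OSError
        return None
    return time.perf_counter() - start
```

Calling `time_lookup("example.com")` (a placeholder hostname) a few times and comparing it with the page load time gives a rough idea of whether the delay sits in DNS or elsewhere.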

DAY 2 : August 27

I didn't have any class that day. Admin XYZ called me at around 10:30AM and asked me to come to the server room if possible. I was sleeping; I am hardly ever awake at that hour. But I didn't want to miss the opportunity. I got up quickly and rushed to the server room, wasting as little time as possible. I was there at 11:00AM.

The admins suspected some problem with the proxy, as the fluctuation persisted even after cutting off the problematic areas. By the time I reached the server room, they had switched over to the standby proxy machine. And to start from zero, the entire network except the main building was shut down. We waited for almost half an hour. Everything worked absolutely fine. No fluctuations at all. So the main building was fine.

At around 11:40AM, the network was restored in all the hostels. We waited for another half an hour. No fluctuation yet. But a hell of a lot of phone calls were making the situation more tense, and everybody, including senior members, was rushing to the server room. We had suspected attacks from the hostels on the servers in the labs. But we were wrong. The problem was in the library building. But where?

Till lunch time, there was no network anywhere except the main building and the hostels. As time passed, the issue became more and more serious. It became difficult to answer phone calls from senior members, as the words “Heavy Broadcast” had become irritating for them. They had been hearing them for the last two days. But nobody actually knew the exact answer. The origin of the broadcast was still not known.

Admin XYZ rushed to the library switch. XYZ was now in live contact with admin PQR in the server room, restoring the network in the research centers one by one. Restore the network in one research center, wait for half an hour. If there is no fluctuation, proceed; otherwise revert. Using this technique (it was the only option we had), we restored the network in all the centers except two. Connections to these two centers also cascade to other areas. That meant a complete outage in the two research centers. Everybody left for the day, leaving the two research centers in the dark.
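The restore-and-wait routine above is basically a loop, and it can be sketched in a few lines of Python. This is only an illustration, not a script the admins used; everything was done by hand at the switches. The names `restore_one_by_one`, `enable`, `disable` and `is_stable` are made up for the sketch, and `is_stable` stands in for half an hour of watching the traffic.

```python
from typing import Callable, Iterable, List, Tuple


def restore_one_by_one(
    segments: Iterable[str],
    enable: Callable[[str], None],
    disable: Callable[[str], None],
    is_stable: Callable[[], bool],
) -> Tuple[List[str], List[str]]:
    """Re-enable segments one at a time and keep only the harmless ones.

    Returns (restored, suspects). All callbacks are hypothetical hooks.
    """
    restored: List[str] = []
    suspects: List[str] = []
    for segment in segments:
        enable(segment)
        # In reality this check meant waiting about half an hour.
        if is_stable():
            restored.append(segment)
        else:
            disable(segment)  # revert and leave this segment dark
            suspects.append(segment)
    return restored, suspects


if __name__ == "__main__":
    # Made-up segment names; "lab-c" plays the misbehaving one.
    bad = {"lab-c"}
    active = set()
    restored, suspects = restore_one_by_one(
        ["lab-a", "lab-b", "lab-c", "lab-d"],
        enable=active.add,
        disable=active.discard,
        is_stable=lambda: not (active & bad),
    )
    print("restored:", restored)
    print("suspects:", suspects)
```

The cost is obvious from the loop: one full wait per segment, which is why it ate most of the day. It also assumes a bad segment shows its effect within the waiting window, which is not guaranteed when the fluctuation comes and goes.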

The network stabilized a bit, and the fluctuation was no longer frequent (almost none). I monitored the network up to 2AM. I didn't sleep after that because I had a class at 8:30AM.

DAY 3 : August 28

I had a class up to 10AM and rushed directly to the server room after it. We had already narrowed things down to a smaller region. Now the problem was smaller and there were fewer people after us. Admin ABC was sent down with a student to inspect individual switches. That's the problem with unmanaged switches: you have to go and check each and every switch for error messages. Anyway, we kept narrowing down the problematic area till lunch. I left for lunch and went back to my room, as I hadn't slept the previous night. I don't know what happened in the afternoon. I missed that 🙁 At 6:30PM, I called admin XYZ and asked about the status. He told me that the problem had been isolated. Only two very small labs were left.

Three days, and the problem was still there. People were really losing their patience. Anyway, the network worked perfectly everywhere except in those two labs. The good thing was that these labs were at the leaf level and were not cascading connections any further.

DAY 4 : August 29

I had a lab from 10AM to 11AM, but it ran until 11:45AM. By the time I reached the server room, the problem was already resolved. Everyone was connected and there were no more complaints. Rawat sir updated me on a few decisions which are beyond the scope of this post. The problem was the routing queries from one of the ISPs connected to those labs at the leaf level.

It really took almost four days to debug this problem. Debugging a network, especially one which is randomly cascaded, has more than one entry point, has no perimeter and has a lot of unmanaged switches, is a real challenge.

Anyway, it was again a learning experience for me. I used to blame people for not being able to solve network problems quickly. I just realized that it's very easy to blame.