I need to take a minute to pat myself on the back (and to pat my co-worker Tom's back as well). Several months ago, an end user came to me complaining that his network connectivity was intermittently, horribly slow. Sometimes he could work just fine; other times, it would take several minutes just to send a basic email. The problem only affected people in one specific area of the network, and there was no discernible rhyme or reason to when things would slow down.
This problem is so complicated that it has been an issue for longer than I've worked at the company. My predecessor -- one of the guys who helped build the network -- worked on it for several months himself and never found the cause. He eventually became so convinced that no problem existed that he refused to do any more work on it and declared it was all in the end users' heads. (Maybe that's part of why he got canned.)
This problem was first reported to me about three months ago. I did the cursory troubleshooting and couldn't find anything wrong. In retrospect, I failed to isolate it quickly for two reasons: it was intermittent, and the physical layout of the network is poorly documented. After that first attempt, I told the end user that a solution would be slow in coming, but that I would keep working on it as time permitted.
My next step was to set up a manually launched process that would capture data while the problem was actually happening. I wrote a small program that pinged various servers in the building whenever the end user clicked a shortcut, and I told him to click that shortcut every time the network slowed down and then send me an email, so I could look at the file the program produced. For you geeks out there, it was a .bat file that pinged several different servers on the network and appended the output -- complete with ping times and a date-time stamp for each run -- to a .txt file, something along the lines of the sketch below. Even with that data, I still didn't find any rhyme or reason to the latency.
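For the record, this is only a minimal reconstruction of the idea; the server names and log file path are placeholders, not the actual hosts involved:

    @echo off
    rem Minimal sketch of the logging script described above.
    rem Server names and the log file path are placeholders.
    echo ==== Run at %date% %time% ==== >> C:\pinglog.txt
    ping -n 4 server1 >> C:\pinglog.txt
    ping -n 4 server2 >> C:\pinglog.txt
    ping -n 4 server3 >> C:\pinglog.txt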
Eventually, Tom and I did some network sniffing, which included mapping the route from the network backbone to the end user's machine and checking the speed of every link along the way. We were astonished by the results. There was a fiber run from the backbone to that segment of the network, but it had been built with old technology: the hardware on each end of the fiber was a 10 megabit device running at half-duplex. Here we've got this state-of-the-art fiber connection, and it's running at the slowest allowable rate. That also likely explains why the problem was intermittent: a whole segment sharing 10 megabits at half-duplex feels fine under light load, but as soon as several users push traffic at once, collisions pile up and throughput craters.
We figured we'd have to break open the checkbook to implement a solution, but Tom found an available 100 megabit fiber connection on the backbone, and we also turned up a spare 100 megabit fiber-to-copper converter. After about 15 minutes of downtime (and countless man-hours of work), the problem was fixed. Needless to say, we're pretty proud of ourselves, and I plan to spend the next few weeks using a network monitoring tool to hunt for similar bottlenecks. Out of necessity, Tom and I are throwing out the assumption that a fiber connection to the backbone means a fast connection to the network's core.
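For anyone planning a similar hunt: where the gear is managed and speaks SNMP, a sweep along these lines can pull interface speed and duplex settings from every switch. This is only a rough sketch using the net-snmp command-line tools; the switch names and the community string are placeholders for whatever your network actually uses:

    @echo off
    rem Rough sketch: query each switch for interface speed (in Mbit/s)
    rem and duplex status via SNMP. Requires the net-snmp tools; the
    rem switch names and "public" community string are placeholders.
    for %%H in (switch1 switch2 switch3) do (
        echo ---- %%H ----
        snmpwalk -v 2c -c public %%H IF-MIB::ifHighSpeed
        snmpwalk -v 2c -c public %%H EtherLike-MIB::dot3StatsDuplexStatus
    )

Anything reporting half-duplex, or an ifHighSpeed of 10, is a candidate for the same treatment we gave this link.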
To make things even better, fixing this link has also cleared up a couple of other issues that we hadn't connected with the slow segment. It took a lot of work, but we have made several users very happy for a very long time. Like I said, enterprise networks can be very complicated creatures, and things definitely can be different than they appear!