Weird Network Fault Due to an Unknown Thing on the End of the Wire

I was called in to help fix a network that had been discombobulated. I didn't end up fixing it, but one staffer there did the trick by disconnecting a switch with a bunch of wires plugged into it.

The symptoms were basic: WiFi was working, but the wired network was not, and the internet was not working.

The LAN had been reconfigured incorrectly, and that was the first thing to fix. The first step was to determine what the original configuration was, and we found that via the printer, which had a static IP address. Why? Because nobody likes it when the printer changes IP addresses, so most printers are configured with static IP addresses.

Of course, that makes it tough to change the IP addresses on a network, but in the big picture, it's more important that everyone can print than it is for the sysadmin to renumber the network (which nobody notices).

So the first thing to do was re-establish the LAN.

Next, the uplink to the internet had to be fixed. That was done by talking to the ISP. Once that was done, the information needed to be documented on a piece of paper to be taped to the router.

At this point, the WiFi was able to get to the internet, but the wired nodes had problems. They seemed to get IP addresses over DHCP, but couldn't get a connection to any website. I was at a loss as to what to do. Then, the staffer at the office went to unplug a switch. That fixed everything.

We did a bit of testing over the phone, unplugging and plugging until it failed. That narrowed it down.

So, I was guessing a device on there was interfering, perhaps by having the router's address. I went up and swapped the switch out to see if that was it - nope. Then I unplugged everything, plugged a laptop into the switch, and then tried to find the downlink. We tried different combinations of wires, and eventually came to the conclusion that one or the other of a pair of cables was the problem.

We didn't know what was at the other end of the cables, but as long as they were kept out of the switch, everything was okay.

The bad cables, when plugged in, wouldn't cause the port's LED to illuminate. So it wasn't a computer at the other end causing a problem. It could be a wiring fault, or maybe even a non-ethernet device at the other end. Whatever it was, it caused the switch to misbehave and the network to fail.

Lesson Learned: finding the problem with "binary chopping"

The main lesson learned - which I didn't do, but will do in the future - is to shut down the failing network. Shut down the computers and switches. Bring up the internet connection and the router, making sure it works. Then, bring up half the network, and see if it functions. If it does, the problem is in the other half. If it fails, shut it down, and bring up the other half.

(Note that you must focus on the switches closest to the backbone first, so you can test the internet.)

Leaving the good half up, bring up half of the remaining, faulty set. Test, and determine which half contains the fault.

Repeat this process of bringing up half of the remainder of the bad network until you find the problem.

As bad as it sounds, you will reach the fault in minimal time.
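The procedure above can be sketched in a few lines of Python. This is only an illustration, not a tool I used: the function name find_bad_port, the callback network_works, and the port labels are all made up for the example. It also assumes exactly one bad port and that a test with the bad port powered up always fails. Under that assumption, a failed test already tells you the fault is in the half you just brought up, so the sketch skips the extra check of the other half.

```python
from typing import Callable, List, Sequence, Tuple


def find_bad_port(
    ports: Sequence[str],
    network_works: Callable[[List[str]], bool],
) -> Tuple[str, int]:
    """Binary-chop a list of ports down to the single bad one.

    network_works(powered) is a hypothetical callback: it stands in for
    "power up exactly these ports and check that the internet is reachable".
    Returns the bad port and the number of tests performed.
    Assumes exactly one bad port.
    """
    known_good: List[str] = []   # ports proven harmless, left powered up
    suspects = list(ports)       # ports that may still contain the fault
    tests = 0

    while len(suspects) > 1:
        middle = len(suspects) // 2
        half, rest = suspects[:middle], suspects[middle:]
        tests += 1
        if network_works(known_good + half):
            # This half is fine: leave it up, the fault is in the rest.
            known_good += half
            suspects = rest
        else:
            # The network failed: the fault is in the half just powered up.
            suspects = half

    return suspects[0], tests


# Simulated 256-port LAN where one (made-up) port breaks everything.
all_ports = [f"port-{i}" for i in range(256)]
bad_port = "port-200"
found, tests_used = find_bad_port(all_ports, lambda powered: bad_port not in powered)
print(found, tests_used)
```

In real life, the callback is you walking around with a laptop, powering up switches and loading a web page.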

As an example, consider if we have a LAN with a total of 256 ethernet ports.

First, you power up 128 ports to find which half the bad device is in...
256 / 2 = 128

Then you power up 64 of the remaining ports...
128 / 2 = 64

64 / 2 = 32

32 / 2 = 16

16 / 2 = 8

8 / 2 = 4

4 / 2 = 2

At this point, only two ports are unknown, and you can just unplug one to see if it helps.

So it takes 7 halving steps plus one final check, or 8 tests in all. If each test takes 10 minutes, your entire diagnosis takes about 80 minutes. That's not bad.

Also, note that even if the fault causes a total network failure, the network is basically functional for most of the diagnosis. The only time problems happen is when the bad fraction of the network is powered up.