Exsqueeze me?

Earlier this week, whilst jacked into the Matrix, we suddenly lost all connections to our SharePoint servers. At almost the exact same moment, a call came in from the local site hosting our SharePoint servers, saying their network had just gone down… hard.

Since I got the call, it fell to me to get the ball rolling on reaching all the right people to get help for the site. I managed to get hold of both the regional and local LAN folks, and we immediately jumped on a con call with some of the other local people to figure out what the heck had caused the network to go down so suddenly. The regional LAN team did some digging and discovered that one of the VLANs was seeing an awful lot of duplicate IP addresses, all tracing back to a single MAC address. At first they were stymied as to why this was happening, and how it could take the network down the way it did, especially since it was happening on only one of the Cisco router cores and not the other.
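That "many IPs, one MAC" pattern is easy to spot once you have an ARP-table dump in hand. Here's a minimal sketch of the idea, assuming the entries have already been parsed into (IP, MAC) pairs; the function name, threshold, and sample addresses are all made up for illustration:

```python
from collections import defaultdict

def find_suspect_macs(arp_entries, threshold=5):
    """Group ARP entries by MAC and flag any MAC claiming many IPs."""
    ips_by_mac = defaultdict(set)
    for ip, mac in arp_entries:
        ips_by_mac[mac.lower()].add(ip)
    return {mac: sorted(ips) for mac, ips in ips_by_mac.items()
            if len(ips) >= threshold}

# Hypothetical ARP entries: one MAC answering for seven addresses,
# plus a couple of normal one-IP-per-MAC hosts.
entries = [("10.1.20.%d" % i, "00:1a:2b:3c:4d:5e") for i in range(1, 8)]
entries += [("10.1.20.101", "aa:bb:cc:dd:ee:01"),
            ("10.1.20.102", "aa:bb:cc:dd:ee:02")]
print(find_suspect_macs(entries))
```

Anything this flags is worth tracing back through the CAM table to a physical port, which is more or less what the LAN team did by hand.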

After a bit more digging, one of the local people on-site found something that made all of us cringe. He discovered an ethernet cable plugged from one port of the core switch into a second port of the same core switch, which just happened to be on the same VLAN. The loop was causing the VLAN to flap and, for some ungodly reason, generating all the duplicate addresses. Almost as soon as he unplugged the cable and shut both ports (neither was labeled as actually being in use), things began to improve… or so we thought.
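This is the kind of loop that port-level safeties are meant to catch. On Cisco IOS gear, for example, BPDU Guard will err-disable an access port the moment a looped cable echoes a BPDU back into it, which would have shut this down before the VLAN ever started flapping. A hypothetical config fragment (the interface name is invented, and exact syntax varies by platform):

```
! Err-disable any PortFast-enabled port that receives a BPDU
spanning-tree portfast bpduguard default
!
interface GigabitEthernet1/0/24
 description ACCESS-PORT
 spanning-tree portfast
 spanning-tree bpduguard enable
```

Whether safeties like this were configured on the affected core is exactly the question the commenter below is asking.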

The LAN team also took the offending Core down, since it was the only one programmed with the VLAN that was flapping and causing all the duplicate IPs. Almost as soon as that happened, things went back to normal on the network, since the first Core seemed to be OK. There was one catch. Someone at the site, in their infinite wisdom, had decided to host half of the wireless access points on one Core and the other half on the other. So as soon as the offending Core went down, half the APs went with it. The regional LAN team was scratching their heads as to who in their right mind would set up the AP controllers this way instead of dual-linking them to both Cores for redundancy. When they brought the 2nd Core back up, the gremlins came back out, and everything that came back up went RIGHT back down again.

They managed to pull the configs from the 2nd Core, since not only were the wireless APs tied to it, but some of the VLANs as well. The LAN team was slowly realizing this issue was really starting to involve the words “massive” and “cluster”. They power cycled the first Core after taking the 2nd Core down again, injected the configs, and some things began slowly coming back up. A bunch of servers had to be rebooted because the constant flapping and switching between Cores had effectively made them go “Bah, screw it!” and drop offline. So several other teams had to be brought in to reboot the servers and get all their services started back up again. After 5 hours, my team finally got their SharePoint back, as did all the other Tier 3 teams, and they were by and large happy.

This didn’t mean the site itself was out of the woods just yet. Only 72 of the site’s 1,000+ wireless APs were active, and the LAN team tried to figure out why. They found that some of the DHCP scopes weren’t working, so they restarted a couple and got about 60% of the APs back up and pulling IPs. As for the rest, it took a while, but they eventually figured out that power cycling the PoE switches the remaining down APs were attached to did the trick. Simply doing a shut/no shut didn’t do anything; power cycling not only brought the APs back up, it also made the switches fail over to the good Core instead of their default, the bad one, which we had since taken down for Cisco to examine.
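In this case shut/no shut didn’t cut it and whole switches had to be bounced, but it’s worth noting that on many Cisco PoE switches you can cold-boot an attached AP per-port by toggling inline power, which beats a walk to the closet when only a few APs are stuck. A hypothetical fragment (port number invented):

```
interface GigabitEthernet2/0/7
 power inline never
! PoE is now cut and the AP loses power
 power inline auto
! Power is restored and the AP cold-boots
```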

All told, this whole bit of “fun” took over 21 hours to fix and get the site’s LAN back into a largely working state. They’re still running off one Core, and they’re going to let Cisco go through the failed one with a fine-tooth comb to see why it essentially went insane when a cable was plugged into two ports on the same VLAN.

5 thoughts on “Exsqueeze me?”

  1. I’d be much more interested in finding out who’s responsible for the bullshit single-homing in the first place, /then/ go after the fucknugget that looped the switch, then go after the fucknugget that didn’t have the config set up with safeties to prevent that loop!
