Recently I was called out to one of my customers, a local college, which had been fighting a variety of disruptive network issues for several days following a campus-wide power-down. After a short phone call, we agreed that I’d better get out there and take a look first-hand at what was going on. Remote work is almost always an option, but I always feel that getting my eyes directly on a problem helps. Sometimes I feel like “The Network Whisperer” when I get out to a customer in trouble — something just speaks to me and points me in the right direction.
I walked into a war room full of weary faces. The staff were exasperated and had been making ad-hoc changes for two days straight trying to make sense of the problems. They described the most severe issues: an administrative VLAN with spotty egress to the rest of the network, affecting a number of faculty, and major problems on the student wireless network that were affecting thousands of students’ ability to access resources necessary for their studies (as well as the important stuff like Facebook and Netflix…). My customer had spent a full day on the phone with Cisco TAC and had made no progress whatsoever, which frustrated them further.
During our phone call, I had asked the staff to use the time while I was on my way over to write up on a whiteboard all the symptoms they had noted, along with the observations they’d made and the steps they’d taken so far. We looked over the list and dove in.
The first issue was the administrative VLAN. The notes said things like:
- Frequent ping loss, sometimes 1 or 2, sometimes 8-10 at a time when pinging to or from VLAN 140 to another VLAN
- No problems pinging within VLAN 140
- Problem is moving, affecting different clients at different times for different durations
- Spanning tree stable
- HSRP stable
- No link errors
I called this one before I even had my laptop out. “That’s an IP conflict with the default gateway,” I told them. I’m a firm believer in the idea that a properly placed network capture can diagnose just about any problem. Fortunately, this well-prepared customer had already gotten a machine patched into the collapsed core and loaded Wireshark up before my arrival, so we quickly SPAN’d the problem VLAN over to it and started capturing ARP traffic using a capture filter of simply “arp” to limit the stream of packets. We saw many normal ARP responses for the gateway IP from the HSRP MAC address 0000.0c07.ac8c, but as anticipated within about 2 minutes we saw some additional ARP responses from another MAC address. The customer ran the MAC address through the SolarWinds UDT tool, and in just a few seconds we had the switch port and jack identified. It turned out to be a copier that had “mysteriously” been assigned the gateway IP of .1 even though the customer expected it to be set for dynamic IP assignment. They rebooted the device and it came up at the correct DHCP-reserved address.
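The check we did by eye in Wireshark can be sketched in a few lines of Python. This is only an illustration, not the tooling we used on site: it assumes the ARP replies have already been exported as (sender IP, sender MAC) pairs, and the gateway address 192.0.2.1, the second MAC address, and the function names are all made up for the example. The one real detail is the HSRP version 1 virtual MAC format, 0000.0c07.acXX, where XX is the group number in hex, which is why group 140 shows up as 0000.0c07.ac8c.

```python
from collections import defaultdict


def hsrp_v1_mac(group: int) -> str:
    """Return the HSRP version 1 virtual MAC (0000.0c07.acXX) for a group number."""
    if not 0 <= group <= 255:
        raise ValueError("HSRPv1 group numbers range from 0 to 255")
    return f"0000.0c07.ac{group:02x}"


def unexpected_arp_senders(replies, gateway_ip, expected_macs):
    """Count ARP replies that claim gateway_ip from a MAC we do not expect.

    replies is an iterable of (sender_ip, sender_mac) pairs taken from
    ARP replies; expected_macs is the set of MACs allowed to answer.
    """
    counts = defaultdict(int)
    for sender_ip, sender_mac in replies:
        mac = sender_mac.lower()
        if sender_ip == gateway_ip and mac not in expected_macs:
            counts[mac] += 1
    return dict(counts)


# Hypothetical sample data: the addresses below are placeholders,
# not values from the actual capture.
GATEWAY_IP = "192.0.2.1"
sample_replies = [
    ("192.0.2.1", "0000.0c07.ac8c"),   # normal reply from the HSRP virtual MAC
    ("192.0.2.1", "0000.0c07.ac8c"),
    ("192.0.2.1", "0011.2233.4455"),   # a second device claiming the gateway IP
    ("192.0.2.50", "0011.2233.9999"),  # unrelated host, ignored
]

conflicts = unexpected_arp_senders(sample_replies, GATEWAY_IP, {hsrp_v1_mac(140)})
print(conflicts)
```

Any MAC that shows up in the result is a device answering for the gateway address that should not be, and that MAC is what you then chase down to a switch port.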
To prevent this from recurring, I recommended that my customer verify that all devices like printers and copiers that have web-based administrative interfaces have an appropriate password protecting that interface. Many people don’t think about the implications of someone getting into a printer (“what can they do, change the banner page?”), but this was a perfect example. Gaining access to a printer or other such device and re-IP’ing it to conflict with the gateway will cause plenty of pain on the subnet and usually take a while to figure out.
In addition to securing the web interfaces, I explained that the layer 2 security features available in their Cisco switches, such as DHCP Snooping and Dynamic ARP Inspection, would have helped to prevent this from happening.
We moved on to the issues in the student access VLAN. We had the following notes on the whiteboard:
- DHCP server seeing hundreds of Renew requests per second from the same hosts
- DHCP server logging hundreds of thousands of IP conflicts in a few hours
- Users reporting spotty connectivity
- Skype sessions keep breaking
- Slow performance
This one was not quite so obvious, but I had a hunch we were looking for a rogue DHCP server, and since we already had the capture station set up I swung the SPAN over to the student VLAN. Without realizing it, I’d left the “arp” capture filter in place when I fired up the capture. Serendipity smiled on me, though, and seeing just the ARPs quickly identified a problem node. Some 59% of the ARP traffic on that VLAN was coming in the form of responses from a single MAC address with a Sony OUI, which seemed to be proxy ARPing on behalf of many other nodes. We found that port and shut it off until someone could go investigate, and remarkably all of the issues in the above list cleared immediately. I didn’t get a complete explanation of the specific problem with this student’s PC, but the student claimed not to know anything about the cause.
As future protection from this sort of thing, I again recommended that the customer look at enabling layer 2 security features. DAI, IP Source Guard, and broadcast storm control probably would have stepped in to protect the network from whatever accidental or malicious event resulted in this node essentially performing a man-in-the-middle on the entire VLAN.
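What made the problem node jump out was simple arithmetic on the ARP replies: one MAC address accounted for an outsized share of them and was answering for many different IPs. Here is a rough sketch of that tally, assuming the replies are available as (sender IP, sender MAC) pairs. The function names, the threshold, and every address in the sample are invented for illustration and do not come from the real capture.

```python
from collections import Counter, defaultdict


def arp_reply_summary(replies):
    """Summarize ARP replies per sender MAC.

    replies is an iterable of (sender_ip, sender_mac) pairs.
    Returns a list of (mac, share_of_replies, distinct_ip_count),
    busiest sender first.
    """
    counts = Counter()
    ips_by_mac = defaultdict(set)
    for sender_ip, sender_mac in replies:
        mac = sender_mac.lower()
        counts[mac] += 1
        ips_by_mac[mac].add(sender_ip)
    total = sum(counts.values())
    return [
        (mac, n / total, len(ips_by_mac[mac]))
        for mac, n in counts.most_common()
    ]


def macs_answering_for_many_ips(summary, max_ips=1):
    """Return MACs that replied on behalf of more than max_ips addresses."""
    return [mac for mac, _share, ip_count in summary if ip_count > max_ips]


# Hypothetical sample: one node answers for three different IPs.
sample_replies = [
    ("198.51.100.10", "aaaa.bbbb.0001"),
    ("198.51.100.11", "aaaa.bbbb.0001"),
    ("198.51.100.12", "aaaa.bbbb.0001"),
    ("198.51.100.20", "cccc.dddd.0002"),
    ("198.51.100.21", "eeee.ffff.0003"),
]

summary = arp_reply_summary(sample_replies)
suspects = macs_answering_for_many_ips(summary)
for mac, share, ip_count in summary:
    print(f"{mac}  {share:.0%} of replies  {ip_count} distinct IPs")
```

A host normally answers only for its own address, so a MAC replying for more than one IP deserves a second look. Routers running proxy ARP and HSRP or VRRP gateways are legitimate exceptions and would need to be excluded in a real network.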
A few minutes after solving the second problem, as I was trying to wrap up, reports started rolling in that clients on the student network couldn’t reach the college’s website, which is hosted externally. This is critical for the students to access curricula and online learning tools. The network manager indicated this had also happened on “move in” day. They suspected a NAC solution they use, but after a few minutes on the phone with that vendor we ruled it out as a possible cause.
I tried pinging the web server from a variety of the core’s SVIs, and every one worked except the student VLAN’s. We determined that every other VLAN could reach the college’s website, and the student VLAN could reach everything except sites hosted at this external provider. I took our sample client that was exhibiting the problem and gave it a static NAT in the firewall to a different public IP. It was immediately able to load the page.
At this point I realized that when we’d fixed the student network’s previous issues, we’d probably released a flood of previously failing traffic toward the website, which may well have tripped IPS or DoS defenses on the hosting provider’s servers. We called them up, and lo and behold they’d blacklisted the specific IP used to PAT the student VLAN. This also jibed with the similar event on “move in” day, when thousands of students started hitting the same site from the same source IP in the course of just a couple of hours. The provider whitelisted the college’s entire /24 and we called it a day.
A Good Day
In two hours, I’d diagnosed and resolved issues that the customer had been suffering through for two days and that Cisco TAC couldn’t even identify after a full day. I mention this not to brag, but to make a few observations. First, extensive experience is obviously an asset in troubleshooting complex problems. For example, knowing the signs of a gateway IP conflict makes one “easy” to solve. Next, expert-level knowledge of technologies, features, and protocols helps not only to quickly pick the right tool to diagnose problems, but also to identify mitigations that prevent them from happening in the future. Finally, sometimes there is no substitute for having boots on the ground. Had I not been watching the way the captured packets rolled in, or had I not been there to “play” with the problem laptop in problem #3 and see just how it was behaving, we might have spent hours of additional time working these issues over the phone.
This experience reminded me a lot of the CCIE lab troubleshooting section: rapid-fire problems under high pressure that were each quite simple to resolve, but to the untrained or inexperienced eye could take a lifetime to diagnose. Some (many?!) days I feel like I should never have been awarded a CCNA, but this day I embodied what it means to be an expert in my field. I came in, I assessed the situation, I worked calmly and quickly, and I resolved the issues and restored the customer’s network to full operation, while providing them insight on how to improve the network’s resilience in the future. If only every day were this easy…