Saturday, January 21, 2012

Cisco's bug in ARP behavior - A story of firewall configuration oddisey...

Well, a very interesting situation happened to me. I was changing an old firewall with a new one and after switching I added secondary IP address to a firewall's public interface with the intention that all the traffic that was comming to that secondary IP address is redirected to one internal server. All nice except that it didn't work! I created such configuration numerous times and it simply had to work, but it didn't! The similar functionality for the primary IP address worked as expected, but secondary didn't! This was driving me nuts, until I finally figured out what was happening. This is a story of how I finally resolved this problem.

In cases when something isn't working as expected I use Wireshark, or better yet, tcpdump, and start to watch what's happening with packets on interfaces. I use tcpdump as it is more ubiquitous than Wireshark, i.e. available with default install on many operating systems, Unix like I mean. Now using this tool has one drawback, at least on Linux. The point where it catches packets is before PREROUTING chain and there is no way (at least I don't now how) to see packet flows between different chains. This is actually a restriction of Linux kernel so any other tool (Wireshark included) will behave the same.

Additional complication in this particular case was that to debug firewall I had to run it in production. This makes things quite complicated because when you run something in production that doesn't (fully) work as expected there will be many unhappy people and in the end you don't have much time to experiment, you have to revert old firewall so that people can continue to work. In the end this translates into longer debug period as you have relatively short time windows in which you can debug. Oh yeah, and it didn't helped that this was done at 8pm, outside of usual working hours, for various reasons I won't go into now.

So, using tcpdump I saw that packets with secondary address were reaching the firewall interface and they mysteriously disappeared within a box! Naturally, based on that I concluded that something strange is happening within Linux itself.

I have to admit that usually this would be a far reached hypothesis as it would mean that there is a bug in a relatively simple NAT configuration and it had to be due to the bug which would certainly be known. Quick googling revealed nothing at all and added a further confirmation that this hypothesis isn't good. But what kept me going in that direction was that I decided to use Firewall Builder as a tool to manage firewall and firewall rules. This was my first use of this tool ever (very good tool by the way!). The reason I selected that tool was that this firewall is used by one customer for which I intended to allow him to change rules by himself (so that I have less work to do :)). I wasn't particularly happy with the way rules are generated by that tool, and so I suspected that maybe it messed something, or I didn't define rules as it expects me to. To see if this is true, I flushed all the rules on the firewall and quickly generated a test case by manually entering iptables rules. Well, it turned out that it doesn't work either.

The next hypothesis was that FWBuilder somehow messed something within /proc filesystem. For example, it could be that I told him to be overlay restrictive. But trying out different combinations and poking throughout /proc/sys/net didn't help either, the things were not working and that was it!

Finally, at a moment of despair I started again tcpdump but this time I requested it to show me MAC addresses too. And then I noticed that destination MAC address doesn't belong to firewall! I rarely use this mode of tcpdump operation as L2 simply works, but now I realized what the problem was. The core of the problem was that the router (which is some Cisco) didn't change MAC address assigned to secondary IP address when I switched firewall. What I would usually do in such situation is to restart Cisco. But, since this router was within cabinet that I didn't have access to, and also I didn't have access to its power supply, it was not an option. Yet, it turned out that it is easy to "persuade" some device to change MAC address, just send it a gratuitous ARP response:
arping -s myaddr -I eth0 ciscoaddr
Both addresses are IP addresses, with myaddr being the address for which I want to change MAC address and ciscoaddr is device where I want this mapping to be changed. And that was it! Cisco now had correct MAC address and thing worked as expected. The primary address worked correctly because firewall itself sent a packet to Cisco and in that way changed MAC address belonging to primary IP address.

To conclude, this is a short version of everything that happened as I also used iptables' logging functionality (that obviously didn't help, as there was nothing to log for a start :)). Finally, there's left only one question to answer, i.e. How did I saw packets with "wrong" MAC address it tcpdump output? First, switch shouldn't send it to me, and second, interface should drop them before OS sees them (and by extension tcpdump). Well, switch was sending those frames to me because it's entry for MAC address expired and it didn't know where the old MAC address is, so it sent every frame to all the outputs/ports (apart from the receiving one, of course). The reason input interface didn't drop the packet was that sniffing tool places interface into promiscuous mode, and so it sees every frame that reaches it. Easy, and interesting how things combine to create problem and false clues. :)

No comments:

About Me

scientist, consultant, security specialist, networking guy, system administrator, philosopher ;)

Blog Archive