Everything about nothing: bug

Saturday, January 21, 2012

Cisco's bug in ARP behavior - A story of firewall configuration oddisey...

Well, a very interesting situation happened to me. I was changing an old firewall with a new one and after switching I added secondary IP address to a firewall's public interface with the intention that all the traffic that was comming to that secondary IP address is redirected to one internal server. All nice except that it didn't work! I created such configuration numerous times and it simply had to work, but it didn't! The similar functionality for the primary IP address worked as expected, but secondary didn't! This was driving me nuts, until I finally figured out what was happening. This is a story of how I finally resolved this problem.

In cases when something isn't working as expected I use Wireshark, or better yet, tcpdump, and start to watch what's happening with packets on interfaces. I use tcpdump as it is more ubiquitous than Wireshark, i.e. available with default install on many operating systems, Unix like I mean. Now using this tool has one drawback, at least on Linux. The point where it catches packets is before PREROUTING chain and there is no way (at least I don't now how) to see packet flows between different chains. This is actually a restriction of Linux kernel so any other tool (Wireshark included) will behave the same.

Additional complication in this particular case was that to debug firewall I had to run it in production. This makes things quite complicated because when you run something in production that doesn't (fully) work as expected there will be many unhappy people and in the end you don't have much time to experiment, you have to revert old firewall so that people can continue to work. In the end this translates into longer debug period as you have relatively short time windows in which you can debug. Oh yeah, and it didn't helped that this was done at 8pm, outside of usual working hours, for various reasons I won't go into now.

So, using tcpdump I saw that packets with secondary address were reaching the firewall interface and they mysteriously disappeared within a box! Naturally, based on that I concluded that something strange is happening within Linux itself.

I have to admit that usually this would be a far reached hypothesis as it would mean that there is a bug in a relatively simple NAT configuration and it had to be due to the bug which would certainly be known. Quick googling revealed nothing at all and added a further confirmation that this hypothesis isn't good. But what kept me going in that direction was that I decided to use Firewall Builder as a tool to manage firewall and firewall rules. This was my first use of this tool ever (very good tool by the way!). The reason I selected that tool was that this firewall is used by one customer for which I intended to allow him to change rules by himself (so that I have less work to do :)). I wasn't particularly happy with the way rules are generated by that tool, and so I suspected that maybe it messed something, or I didn't define rules as it expects me to. To see if this is true, I flushed all the rules on the firewall and quickly generated a test case by manually entering iptables rules. Well, it turned out that it doesn't work either.

The next hypothesis was that FWBuilder somehow messed something within /proc filesystem. For example, it could be that I told him to be overlay restrictive. But trying out different combinations and poking throughout /proc/sys/net didn't help either, the things were not working and that was it!

Finally, at a moment of despair I started again tcpdump but this time I requested it to show me MAC addresses too. And then I noticed that destination MAC address doesn't belong to firewall! I rarely use this mode of tcpdump operation as L2 simply works, but now I realized what the problem was. The core of the problem was that the router (which is some Cisco) didn't change MAC address assigned to secondary IP address when I switched firewall. What I would usually do in such situation is to restart Cisco. But, since this router was within cabinet that I didn't have access to, and also I didn't have access to its power supply, it was not an option. Yet, it turned out that it is easy to "persuade" some device to change MAC address, just send it a gratuitous ARP response:

arping -s myaddr -I eth0 ciscoaddr

Both addresses are IP addresses, with myaddr being the address for which I want to change MAC address and ciscoaddr is device where I want this mapping to be changed. And that was it! Cisco now had correct MAC address and thing worked as expected. The primary address worked correctly because firewall itself sent a packet to Cisco and in that way changed MAC address belonging to primary IP address.

To conclude, this is a short version of everything that happened as I also used iptables' logging functionality (that obviously didn't help, as there was nothing to log for a start :)). Finally, there's left only one question to answer, i.e. How did I saw packets with "wrong" MAC address it tcpdump output? First, switch shouldn't send it to me, and second, interface should drop them before OS sees them (and by extension tcpdump). Well, switch was sending those frames to me because it's entry for MAC address expired and it didn't know where the old MAC address is, so it sent every frame to all the outputs/ports (apart from the receiving one, of course). The reason input interface didn't drop the packet was that sniffing tool places interface into promiscuous mode, and so it sees every frame that reaches it. Easy, and interesting how things combine to create problem and false clues. :)

Tuesday, January 17, 2012

Interesting problem with OSSEC, active response and mail delivery...

We had a problem that manifested itself in such a way that mail messages didn't come from certain domains, or more specifically from certain mail servers. Furthermore, no clue was given in the mail log to know what went wrong and to make things worse, logs from the remote mail server were inaccessible to see there what actually happened. Finally, the worse thing was that this happened sporadically. It turned out that this was consequence of a circumstances and a bug with ossec active response. This post explains what happened.

We changed DNS domain several months ago, let me call the new domain newdomain.hr, and the old one olddomain.hr. DNS was reconfigured so that it correctly handled requests for a new domain, but we had to leave old domain because of some Web server. The the old domain was changed so that when someone asked which is a mail exchanger for olddomain.hr it would receive response: mail.newdomain.hr. Finally, domain olddomain.hr was removed from the mail server. This was the first error, and now I think that it is better either to leave old domain on mail server or to not return any response! Actually, if you want to get rid of the old domain, it is the best to remove it from the mail server and that DNS server doesn't return any response for a mail exchanger of a given domain. If you know how mail works, you'll know that by changing MX record for old domain from mail.olddomain.hr to mail.newdomain.hr didn't change anything!

Anyway, that's the part concerning mail. Now, about OSSEC. It has a possibility of active response, i.e. to block offending IP addresses for a certain amount of time, 10 minutes by default. One class of offending IP addresses are those that try to deliver mail messages which require mail server to be open relay. Since mail server is properly configured it rejects those messages with a message 'Relay denied'. After mail server rejects such delivery attempt OSSEC kicks in and blocks offending IP address for 10 minutes.

This, by itself didn't have to be a problem because blocking rules are automatically removed after 10 minutes. But, there is a bug in the removal script that manifested itself in the logs like follows (found on agent in /var/ossec/logs/active-responses.log):

Unable to run (iptables returning != 1): 1 - /var/ossec/active-response/bin/firewall-drop.sh delete - 203.83.62.99 1326738019.2370422 3301

For some reason removal of IP address from block list wasn't successful and that basically meant that the source mail host is blocked indefinitely!

Majority of mail servers that to generate such 'Relay denied' messages are truly spammers and if some of them were indefinitely blocked that was actually good. But, this particular source mail server that was blocked is very popular one with many users serving many different domains, so now when some other user tried to send an email that was legal and had correct address, IPtables blocked access and the mail couldn't be delivered. There was nothing in the logs of destination mail server. Also, sending user didn't receive any response message since mail was being temporary put on hold on the source server.

This particular problem was solved by completely removing the old domain. Now, source mail servers won't even try to deliver mails for the old domain and thus OSSEC won't block legitimate servers. Furthermore, the sending users will get notification immediately about non-existent mail address.

Everything about nothing

Saturday, January 21, 2012

Cisco's bug in ARP behavior - A story of firewall configuration oddisey...

Tuesday, January 17, 2012

Interesting problem with OSSEC, active response and mail delivery...

About Me

Blog Archive