Everything about nothing: debug


Thursday, June 28, 2012

Another internal error trying to access IPA Web UI

I just tried to access IPA's Web UI and got an 'Internal Server Error' dialog box.

Looking into the log file (/var/log/httpd/error_log), I found the following entry, which was obviously the reason the dialog box appeared:

[Thu Jun 28 21:10:28 2012] [error] [client 192.168.178.1] gss_acquire_cred() failed: Unspecified GSS failure. Minor code may provide more information (, No key table entry found for HTTP/ipa.example-domain.local.localdomain@EXAMPLE-DOMAIN.HR), referer: https://ipa.example-domain.local/ipa/ui/

It's immediately obvious that something is wrong with the name of the IPA server: somehow .localdomain was appended to it!? At first, I thought the problem was in Firefox, and that the values of the keys network.negotiate-auth.trusted-uris and network.negotiate-auth.delegation-uris had to end with a dot so that no domain would be appended. But a quick test showed that I was wrong: when I added the dots, nothing worked any more. :)

So, I thought there must be something on the server that caused that behavior. Then I looked into /etc/resolv.conf, and there it was:

search localdomain example-domain.local

So, this search statement caused localdomain to be appended to the IPA server's FQDN. I removed the statement and tried again, but the error was still there. Then it occurred to me that Apache had probably kept the resolver configuration it read at startup, so I restarted it. And, lo and behold, everything worked.
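To make the mechanism a bit more concrete, here is a minimal Python sketch of the order in which a glibc-style stub resolver tries names, following the search and ndots rules described in resolv.conf(5). It is a simplified model (the function name search_candidates is made up for illustration), and I didn't track down which code path in Apache or the Kerberos library actually produced the bad name; it only shows that ipa.example-domain.local.localdomain is the very next candidate whenever the lookup of the name as-is doesn't succeed, and that a name with a trailing dot is never expanded.

```python
def search_candidates(name, search, ndots=1):
    """Return the names a glibc-style stub resolver would try, in order.

    Simplified model of the resolv.conf(5) 'search' and 'ndots' rules;
    ndots=1 is the documented default.
    """
    if name.endswith("."):
        # A trailing dot marks an absolute name: the search list is not used.
        return [name]
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then with each search domain.
        return [name] + expanded
    # Too few dots: try the search domains first, the bare name last.
    return expanded + [name]


if __name__ == "__main__":
    search = ["localdomain", "example-domain.local"]  # the search line from resolv.conf
    for candidate in search_candidates("ipa.example-domain.local", search):
        print(candidate)
```

With the search line above, the second name tried for ipa.example-domain.local is exactly the one from the Apache log.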

You might wonder where this search statement came from. Well, I play tricks with my network setup, and in this case DHCP was used to obtain the list of DNS servers, which I later manually changed to 127.0.0.1. But I forgot to remove the search statement, and so the error occurred. Playing games with the network setup obviously bites sometimes... ;)

Sunday, February 12, 2012

Who's listening on an interface...

While I was trying to deduce whether arpwatch honors multiple -i options and listens on multiple interfaces, I had a problem detecting which interface exactly it listens on. To determine that, I started arpwatch in the following way:

arpwatch -i wlan0 -i em1

but that didn't tell me which interface the command bound to. Using the -d option (debug) didn't help either. So, my first attempt was to look at the files opened by the command. That can be done using the lsof command, or by looking directly into arpwatch's proc directory. So, I found the PID of the command (use ps for that; in this particular case it was 23833) and then went into the directory /proc/PID/fd. In there, I saw the following content:

# cd /proc/23833/fd
# ls -l
total 0
lrwx------. 1 root root 64 Feb 12 12:32 0 -> socket:[27121497]
lrwx------. 1 root root 64 Feb 12 12:32 1 -> socket:[27121498]

This wasn't very useful! Actually, it tells me that arpwatch closed its stdin and stdout descriptors, and those descriptor numbers were reused for the sockets it opened. The lsof command also didn't show anything directly useful:

# lsof -p 23833 -n
COMMAND    PID USER   FD   TYPE             DEVICE SIZE/OFF     NODE NAME
arpwatch 23833 root cwd    DIR              253,2     4096   821770 /var/lib/arpwatch
arpwatch 23833 root rtd    DIR              253,2     4096        2 /
arpwatch 23833 root txt    REG              253,2    34144 2672471 /usr/sbin/arpwatch
arpwatch 23833 root mem    REG              253,2   168512 2228280 /lib64/ld-2.14.90.so
arpwatch 23833 root mem    REG              253,2 2068608 2228301 /lib64/libc-2.14.90.so
arpwatch 23833 root mem    REG              253,2   235944 2675058 /usr/lib64/libpcap.so.1.1.1
arpwatch 23833 root mem    REG                0,6          27121497 socket:[27121497] (stat: No such file or directory)
arpwatch 23833 root    0u pack           27121497      0t0      ALL type=SOCK_RAW
arpwatch 23833 root    1u unix 0xffff8802e3024ac0      0t0 27121498 socket

It did show that a raw socket is in use, but nothing more than that. So, the next step was to find out how to list all raw sockets. netstat can list all open and listening sockets, and its man page says that the option --raw (or -w) shows raw sockets:

# netstat -w
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State

That wasn't very useful either, but by default netstat doesn't show listening sockets, so I repeated the command with the -l option:

# netstat -w -l
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
raw 0 0 *:icmp *:* 7

So, this is definitely not what I'm looking for. This raw socket listens for ICMP messages, and arpwatch certainly isn't capturing those. The man page also says that netstat reads information about raw sockets from the /proc/net/raw file, so I looked at its content:

# cat /proc/net/raw
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops
1: 00000000:0001 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 47105 2 ffff880308e18340 0
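The hex fields in that file can be decoded with a few lines of Python. This is a minimal sketch for IPv4 entries only (the helper names are my own). For raw sockets the "port" half of local_address holds the IP protocol number, so 0001 is protocol 1, ICMP, which is what netstat showed as *:icmp; the st value 07 is the kernel's TCP_CLOSE state number, which netstat printed verbatim.

```python
import socket
import struct


def decode_ipv4_field(field):
    """Decode an 'AAAAAAAA:PPPP' address field from /proc/net/raw (IPv4 only)."""
    addr_hex, port_hex = field.split(":")
    # The kernel prints the network-order address as a host-order integer,
    # so packing it back in native byte order restores the original bytes.
    addr = socket.inet_ntoa(struct.pack("=I", int(addr_hex, 16)))
    return addr, int(port_hex, 16)


def parse_proc_net_raw(text):
    """Parse /proc/net/raw; for raw sockets the 'port' is the IP protocol number."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 10:
            continue
        local_addr, protocol = decode_ipv4_field(fields[1])
        rows.append({
            "local_addr": local_addr,
            "protocol": protocol,          # 1 = ICMP
            "state": int(fields[3], 16),   # 7 = TCP_CLOSE, i.e. not connected
            "uid": int(fields[7]),
            "inode": int(fields[9]),
        })
    return rows


if __name__ == "__main__":
    with open("/proc/net/raw") as f:
        for row in parse_proc_net_raw(f.read()):
            print(row)
```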

Also not useful! There was an inode listed (47105), but how do you find out information about a particular inode? I looked throughout the /proc file system but didn't find anything. I also checked the lsof manual but wasn't able to find anything useful (though I didn't read the manual from start to finish, I just searched for the word inode!).
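In hindsight, an inode can be tied to a process by brute force: every open socket shows up in /proc/PID/fd as a symlink whose target is socket:[INODE] (exactly like the arpwatch descriptors above), so scanning those links finds the owner. As far as I know, this is roughly what lsof and ss -p do internally. Below is a minimal Python sketch (the function names are hypothetical); run it as root, otherwise other users' fd directories can't be read and are silently skipped.

```python
import os
import re
import sys

SOCKET_LINK = re.compile(r"^socket:\[(\d+)\]$")


def socket_inode(link_target):
    """Return the inode from a 'socket:[N]' symlink target, or None."""
    m = SOCKET_LINK.match(link_target)
    return int(m.group(1)) if m else None


def pids_holding_inode(inode, proc="/proc"):
    """Return (pid, fd) pairs whose /proc/<pid>/fd/<fd> points at socket:[inode]."""
    owners = []
    for entry in os.listdir(proc):
        if not entry.isdigit():  # only numeric entries are processes
            continue
        fd_dir = os.path.join(proc, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:  # process exited, or permission denied
            continue
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:  # descriptor closed while scanning
                continue
            if socket_inode(target) == inode:
                owners.append((int(entry), int(fd)))
    return owners


if __name__ == "__main__":
    for pid, fd in pids_holding_inode(int(sys.argv[1])):
        print(f"pid {pid} fd {fd}")
```

Given the /proc/23833/fd listing above, looking up 27121497 would point at arpwatch (PID 23833), descriptor 0.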

Then I remembered the Linux-specific ss command, which provides information about sockets! Its man page says that the option -0 (or -f link) shows PACKET sockets, so I tried:

# ss -f link
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port

Again nothing, but it quickly occurred to me that it doesn't show listening sockets by default, so I tried with -l (and -n to avoid any name resolution):

# ss -f link -ln
Netid State      Recv-Q Send-Q     Local Address:Port       Peer Address:Port
p_raw UNCONN     0      0                      *:em1                    *
p_dgr UNCONN     0      0                [34958]:wlan0                  *

Woohoo, I was onto something, finally! I could see a raw socket bound to the em1 interface (note that I started arpwatch with the intention that it listens on both wlan0 and em1!). Only, I still didn't know exactly who was using it. I only saw that the other socket is of datagram type, meaning it receives frames with the link-layer header already removed, so it was probably not used by arpwatch. The ss man page helped again: it says to use the -p option to find out which process owns a socket, so I tried:

# ss -f link -lnp
Netid State      Recv-Q Send-Q     Local Address:Port       Peer Address:Port
p_raw UNCONN     0      0                      *:em1                    *      users:(("arpwatch",23833,0))
p_dgr UNCONN     0      0                [34958]:wlan0                  *      users:(("wpa_supplicant",1425,9))

Wow!! That was it! I found out that arpwatch listens on only a single interface, and later I confirmed it by looking into the source! I also saw that the other socket is used by wpa_supplicant, i.e. for wireless network management purposes.

One final thing bothered me: where does ss get this information from? That's easy to find out with strace! :) Using strace, I found out that ss reads the /proc/net/packet file:

# cat /proc/net/packet
sk               RefCnt Type Proto Iface R Rmem   User   Inode
ffff880308cec000 3      3    0003   2     1 0      0      27121497
ffff880004358800 3      2    888e   7     1 0      0      25292685

Maybe I would have got there earlier if I had looked more closely at the files available in /proc/net when /proc/net/raw turned out to be the wrong file! But it doesn't matter, this search was fun and educational. :)
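Once you know about /proc/net/packet, it's easy to decode by hand. Below is a minimal Python sketch of a parser (the names are mine, and only a couple of EtherType values are labeled). On Linux, Type 3 is SOCK_RAW and 2 is SOCK_DGRAM, Proto is the EtherType in hex, and Iface is an interface index that socket.if_indextoname() turns back into a name. That also explains the ss output: 0003 is ETH_P_ALL, shown by ss as *, and [34958] is simply 0x888e, the 802.1X (EAPOL) EtherType, which fits wpa_supplicant.

```python
import socket

# Socket type numbers on Linux.
SOCK_TYPES = {2: "SOCK_DGRAM", 3: "SOCK_RAW"}
# A couple of EtherType values seen in the output above.
ETHERTYPES = {0x0003: "ETH_P_ALL", 0x888E: "EAPOL"}


def parse_proc_net_packet(text):
    """Parse /proc/net/packet into a list of dicts, one per packet socket."""
    entries = []
    for line in text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 9:
            continue
        entries.append({
            "type": SOCK_TYPES.get(int(fields[2]), fields[2]),
            "proto": int(fields[3], 16),  # EtherType, printed in hex
            "ifindex": int(fields[4]),    # 0 means not bound to one interface
            "uid": int(fields[7]),
            "inode": int(fields[8]),
        })
    return entries


def iface_name(ifindex):
    """Translate an interface index into a name ('*' for unbound sockets)."""
    if ifindex == 0:
        return "*"
    try:
        return socket.if_indextoname(ifindex)
    except OSError:  # interface no longer exists
        return str(ifindex)


if __name__ == "__main__":
    with open("/proc/net/packet") as f:
        for e in parse_proc_net_packet(f.read()):
            proto = ETHERTYPES.get(e["proto"], f"0x{e['proto']:04x}")
            print(f"{e['type']:<10} {proto:<10} {iface_name(e['ifindex']):<8} inode={e['inode']}")
```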

Saturday, January 21, 2012

Cisco's bug in ARP behavior - A story of firewall configuration odyssey...

Well, a very interesting situation happened to me. I was replacing an old firewall with a new one, and after switching I added a secondary IP address to the firewall's public interface, with the intention that all the traffic coming to that secondary IP address is redirected to one internal server. All nice, except that it didn't work! I have created such a configuration numerous times and it simply had to work, but it didn't! The similar functionality for the primary IP address worked as expected, but for the secondary address it didn't! This was driving me nuts until I finally figured out what was happening. This is the story of how I finally resolved the problem.

When something isn't working as expected, I use Wireshark, or better yet tcpdump, and start to watch what's happening with packets on the interfaces. I use tcpdump because it is more ubiquitous than Wireshark, i.e. it is available in the default install of many Unix-like operating systems. This tool has one drawback, at least on Linux: the point where it captures packets is before the PREROUTING chain, and there is no way (at least none I know of) to see packets flow between the different chains. This is actually a restriction of the Linux kernel, so any other tool (Wireshark included) will behave the same.

An additional complication in this particular case was that to debug the firewall I had to run it in production. This makes things quite complicated, because when you run something in production that doesn't (fully) work as expected there will be many unhappy people, and you don't have much time to experiment: you have to revert to the old firewall so that people can continue to work. In the end this translates into a longer debugging period, as you only have relatively short time windows in which you can debug. Oh yeah, and it didn't help that this was done at 8pm, outside of usual working hours, for various reasons I won't go into now.

So, using tcpdump I saw that packets for the secondary address were reaching the firewall's interface and then mysteriously disappeared within the box! Naturally, based on that I concluded that something strange was happening within Linux itself.

I have to admit that usually this would be a far-fetched hypothesis, as it would mean that there is a bug in a relatively simple NAT configuration, and such a bug would certainly be known. Quick googling revealed nothing at all, which further confirmed that this hypothesis wasn't good. But what kept me going in that direction was that I had decided to use Firewall Builder as a tool to manage the firewall and its rules. This was my first use of this tool ever (a very good tool, by the way!). The reason I selected it was that this firewall is used by a customer whom I intended to allow to change the rules himself (so that I have less work to do :)). I wasn't particularly happy with the way the rules are generated by that tool, so I suspected that maybe it messed something up, or that I didn't define the rules the way it expects. To see if this was true, I flushed all the rules on the firewall and quickly created a test case by manually entering iptables rules. Well, it turned out that this didn't work either.

The next hypothesis was that FWBuilder somehow messed something up within the /proc filesystem. For example, it could be that I told it to be overly restrictive. But trying out different combinations and poking throughout /proc/sys/net didn't help either; things were not working, and that was it!

Finally, in a moment of despair, I started tcpdump again, but this time I asked it to show me MAC addresses too (tcpdump's -e option). And then I noticed that the destination MAC address didn't belong to the firewall! I rarely use this mode of tcpdump operation, as L2 simply works, but now I realized what the problem was. The core of the problem was that the router (some Cisco) didn't change the MAC address assigned to the secondary IP address when I switched firewalls. What I would usually do in such a situation is restart the Cisco. But since this router was in a cabinet I didn't have access to, and I also didn't have access to its power supply, that was not an option. Yet it turned out that it is easy to "persuade" a device to change a MAC address: just send it an ARP packet announcing the new mapping:

arping -s myaddr -I eth0 ciscoaddr

Both arguments are IP addresses: myaddr is the address whose MAC address I want to change, and ciscoaddr is the device where I want this mapping to be changed. And that was it! The Cisco now had the correct MAC address and things worked as expected. The primary address worked correctly because the firewall itself sent packets to the Cisco, and in that way updated the MAC address belonging to the primary IP address.
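For the curious, here is roughly what such a packet looks like on the wire. This is a minimal Python sketch, not arping's implementation: it builds an Ethernet-broadcast ARP request whose sender fields carry the IP address being moved and the new MAC address, and a receiver that already has an entry for that sender IP updates it (RFC 826). Sending needs Linux AF_PACKET sockets and root (CAP_NET_RAW), and the function names are hypothetical.

```python
import socket
import struct


def mac_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_arp_request(src_mac, src_ip, dst_ip):
    """Ethernet-broadcast ARP request: 'who has dst_ip? tell src_ip (at src_mac)'."""
    sha = mac_bytes(src_mac)
    # Ethernet header: broadcast destination, our source MAC, EtherType 0x0806 (ARP).
    eth = b"\xff" * 6 + sha + struct.pack("!H", 0x0806)
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                         # hardware type: Ethernet
        0x0800,                    # protocol type: IPv4
        6, 4,                      # hardware / protocol address lengths
        1,                         # operation: request
        sha, socket.inet_aton(src_ip),            # sender: the mapping to announce
        b"\x00" * 6, socket.inet_aton(dst_ip),    # target: the device to update
    )
    return eth + arp


def send_frame(ifname, frame):
    """Send a raw Ethernet frame (Linux only, needs root or CAP_NET_RAW)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as s:
        s.bind((ifname, 0))
        s.send(frame)
```

For example, send_frame("eth0", build_arp_request("52:54:00:12:34:56", "192.0.2.10", "192.0.2.1")) would announce 192.0.2.10 at that MAC address to 192.0.2.1; all of these addresses are placeholders.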

To conclude, this is a short version of everything that happened, as I also used iptables' logging functionality (which obviously didn't help, as there was nothing to log in the first place :)). Finally, only one question is left to answer: how did I see packets with the "wrong" MAC address in the tcpdump output? First, the switch shouldn't have sent them to me, and second, the interface should have dropped them before the OS (and by extension tcpdump) saw them. Well, the switch was sending those frames to me because its entry for the old MAC address had expired and it didn't know where that address was, so it sent every such frame out of all its ports (apart from the receiving one, of course). The reason the input interface didn't drop the packets was that the sniffing tool puts the interface into promiscuous mode, so it sees every frame that reaches it. Easy, and interesting how things combine to create a problem and false clues. :)
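The switch part of the puzzle is easy to model. Below is a toy Python sketch of a learning switch's MAC address table with aging; the class name and the 300-second default (a common aging time on many switches) are assumptions, not the configuration of the switch involved. Once the old firewall's entry ages out, frames addressed to its MAC are flooded out of every other port, including the one where the new firewall, in promiscuous mode, happily captured them.

```python
import time


class LearningSwitch:
    """Toy model of an Ethernet switch's MAC address table with aging."""

    def __init__(self, ports, aging_time=300.0, clock=time.monotonic):
        self.ports = set(ports)
        self.aging_time = aging_time  # seconds before an unused entry expires
        self.clock = clock
        self.table = {}  # MAC address -> (port, time last seen as a source)

    def receive(self, in_port, src_mac, dst_mac):
        """Learn the sender's port, then return the set of output ports."""
        now = self.clock()
        self.table[src_mac] = (in_port, now)
        entry = self.table.get(dst_mac)
        if entry is not None and now - entry[1] <= self.aging_time:
            port = entry[0]
            # Known destination: forward out of one port (never back out the ingress).
            return set() if port == in_port else {port}
        # Unknown or aged-out destination: drop the stale entry and flood.
        self.table.pop(dst_mac, None)
        return self.ports - {in_port}
```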

Tuesday, December 6, 2011

Problems with resolver library...

I just had a problem that manifested itself in a very strange way. I couldn't open a Web page hosted on the local network, while everything else seemingly worked. The behavior was the same in Chrome and Firefox, and in due course I realized that every application had this problem. On the other hand, resolving with nslookup worked flawlessly. This was very confusing. To add to the confusion, tcpdump showed that no DNS requests were being sent to the network at all! So, the problem was obviously somewhere in the local resolver. At first I suspected nscd, which used to be the caching daemon on Fedora, but in Fedora 16 this daemon is not installed. So, how to debug this situation? A quick Google query didn't yield anything useful.

The manual page of resolv.conf has a section saying that you can use the directive options debug. But trying it yielded no output! Nor were there any results when setting the same option via the RES_OPTIONS environment variable. This is strange and needs additional investigation, both to find out why, and more importantly to learn how to debug the local resolver.

In the meantime I figured out that the ping command behaves the same as the browser, and since ping is much smaller, it is easier to debug using strace. So, while running ping under strace, I noticed the following line in the output:

open("/lib64/libnss_mdns4_minimal.so.2", O_RDONLY|O_CLOEXEC) = 3

which immediately rang a bell that the problem could be in nsswitch! And indeed, opening /etc/nsswitch.conf I saw the following line:

hosts: files mdns4_minimal [NOTFOUND=return] dns myhostname

which basically says that if mdns4_minimal returns not found, dns is not tried. It seems that mdns4_minimal is used whenever the domain name ends in .local, which was true in my case. So, I changed that line to:

hosts: files dns

and everything works as expected.
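The effect of the [NOTFOUND=return] action is easy to see in a small simulation. This is a minimal Python sketch, not glibc's code: it models only STATUS=action items (no "!" negation), and the service behaviors in the demo are assumptions, in particular that mdns4_minimal answers NOTFOUND for .local names it cannot resolve and UNAVAIL for everything else, so DNS is still consulted for other names. The address 192.0.2.5 is a placeholder.

```python
import re

# Default reaction to each status, as documented in nsswitch.conf(5).
DEFAULT_ACTIONS = {"SUCCESS": "return", "NOTFOUND": "continue",
                   "UNAVAIL": "continue", "TRYAGAIN": "continue"}


def parse_sources(line):
    """Turn 'hosts: files mdns4_minimal [NOTFOUND=return] dns' into [(service, actions), ...]."""
    _, _, spec = line.partition(":")
    sources = []
    for token in re.findall(r"\[[^\]]*\]|\S+", spec):
        if token.startswith("["):
            # Action items modify the service listed just before them.
            for item in token[1:-1].split():
                status, _, action = item.partition("=")
                sources[-1][1][status.upper()] = action.lower()
        else:
            sources.append((token, dict(DEFAULT_ACTIONS)))
    return sources


def lookup(line, name, services):
    """Query each service in order until its action for the status says 'return'."""
    status, result = "UNAVAIL", None
    for service, actions in parse_sources(line):
        status, result = services[service](name)
        if actions[status] == "return":
            break
    return status, result


if __name__ == "__main__":
    # Hypothetical service behaviors, for illustration only.
    services = {
        "files": lambda n: ("NOTFOUND", None),
        "mdns4_minimal": lambda n: ("NOTFOUND", None) if n.endswith(".local") else ("UNAVAIL", None),
        "dns": lambda n: ("SUCCESS", "192.0.2.5"),
        "myhostname": lambda n: ("NOTFOUND", None),
    }
    for line in ("hosts: files mdns4_minimal [NOTFOUND=return] dns myhostname",
                 "hosts: files dns"):
        print(line, "->", lookup(line, "www.example-domain.local", services))
```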

Since I hadn't explicitly installed mdns, I decided to remove it. But then it turned out that wine (Windows Emulator) depends on it, so I left it.
