
Microsoft patches, breaking things…

Posted in General, Support Calls on May 10th, 2005

I am sure you have heard it before and you are going to hear it again: Microsoft patches break things. In this case, it had to do with a patch that was installed on an Exchange server on April 13, 2005 (MS05-019). We got a trouble ticket stating that users in CALA (that’s short for Central America Latin America) were unable to pull mail. To make the issue more interesting, SOMETIMES the users could get their email. Oh great, one of those tickets: an intermittent issue.

So, quick summary: users in CALA are intermittently having mail issues, and only mail issues. We performed our typical troubleshooting (pings and traceroutes between the hosts) and didn’t see anything at the IP layer that could be causing this. One interesting thing about the connectivity between the Exchange server and the users is a VPN tunnel. Why would a VPN tunnel pick on packets related only to mail (more specifically, Exchange/Outlook mail)? A good question, and one we would need to answer. If not the VPN tunnel, then what is picking on poor ol’ Exchange mail?

So, the next level of troubleshooting involved packet captures! Luckily the CALA IT guys were very eager to get their trouble repaired, and so were we. We were given VNC access to the Exchange server (Wow, who does that? Hey, I’ll take it!) and access to a host in Chile with an account to test with. We were also given the OK to install Ethereal on both the server and the client involved. We ran Ethereal to capture the packets involved in sending and receiving email; however, we didn’t even make it that far. Outlook had gotten to the point where it would not even open on the client machine: it just hung, and Task Manager showed ‘Not Responding’ for Outlook. Thinking more about this, maybe if we had waited it would have opened up, but oh well.

So we cracked open the traces on the client, and they looked ugly. You could see packets coming in to the client out of order, which generally should not be a problem, but it’s definitely something to make note of since SOME applications do not handle that very well. On the server side, packets were coming in out of order as well. You could also see the terminating end on each side of the tunnel sending an ICMP MTU exceeded message (fragmentation needed but DF set) back to each party (Exchange server and client) every time that machine sent a 1514 byte packet, because the tunnel MTU was exceeded. This is what caused packets to reach each machine out of order.

Say the server sends a few 60 byte packets, a 1220 byte packet, a 1514 byte packet and a few more smaller packets of around 514 bytes. The 1514 byte packet gets dropped and the ICMP MTU exceeded message is sent back to the server. In that short time frame, the other packets go on through, and then what was the 1514 byte packet comes through as a 1490 byte packet with a smaller one trailing behind it. This was causing the packets to arrive out of order.
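The reordering described above can be reproduced with a small toy model. This is only a sketch under stated assumptions, not a reconstruction of what tcpip.sys actually does: the header sizes (14-byte Ethernet II, 20-byte IPv4, 20-byte TCP, no options) are assumed, the tunnel MTU of 1476 is inferred from the 1490 byte frame in the trace (1490 minus the Ethernet header), and `simulate_tunnel` is a hypothetical name. Retransmission timing is not modelled; resent data is simply placed after everything that got through on the first try.

```python
# Toy model of a tunnel that drops oversized DF packets.
# Assumed header sizes (not taken from the original traces):
ETH, IP, TCP = 14, 20, 20          # Ethernet II, IPv4, TCP, no options
HEADERS = ETH + IP + TCP


def simulate_tunnel(frame_sizes, tunnel_mtu):
    """Return arrival order as (original sequence number, frame size) pairs.

    Frames that fit the tunnel pass straight through. A frame that is too
    big is dropped; the sender then resends its data in frames that fit,
    and those arrive after the frames that were already in flight.
    """
    max_frame = tunnel_mtu + ETH        # largest frame the tunnel accepts
    max_payload = max_frame - HEADERS   # TCP data that fits in such a frame
    passed, resent = [], []
    for seq, size in enumerate(frame_sizes):
        if size <= max_frame:
            passed.append((seq, size))
            continue
        payload = size - HEADERS        # data carried by the dropped frame
        while payload > 0:
            chunk = min(payload, max_payload)
            resent.append((seq, chunk + HEADERS))
            payload -= chunk
    return passed + resent


if __name__ == "__main__":
    for seq, size in simulate_tunnel([60, 1220, 1514, 514], tunnel_mtu=1476):
        print(seq, size)
```

With the frame sizes from the example, frame 2 (the 1514 byte one) shows up last, as a 1490 byte frame followed by a small 78 byte one, after frame 3 has already arrived.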

Outlook apparently did not like this since it was no longer opening up. So, we did a quick test by pinging the Exchange server, from the client machine in Chile, with the following:
ping -l 1472 -f 10.10.1.5
error received: Packet needs to be fragmented but DF set.
ping -l 1462 -f 10.10.1.5
error received: Packet needs to be fragmented but DF set.
We finally arrived at the following:
ping -l 1448 -f 10.10.1.5 (this means ping the Exchange server with 1448 bytes of data and don’t allow the packet to be fragmented)
Reply from 10.10.1.5: bytes=56 (sent 1448) time=151ms
Reply from 10.10.1.5: bytes=56 (sent 1448) time=150ms
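
The manual trial and error above can be turned into a binary search. The sketch below is an illustration only: `largest_df_payload` and `path_mtu` are hypothetical helper names, and `probe` stands in for whatever actually sends a don't-fragment ping of a given size and reports whether a reply came back. The conversion assumes a 20-byte IPv4 header without options plus the 8-byte ICMP header, so a 1448 byte payload corresponds to a 1476 byte IP packet.

```python
IP_HEADER = 20    # IPv4 header without options (assumption)
ICMP_HEADER = 8   # ICMP echo header


def largest_df_payload(probe, low=0, high=1472):
    """Binary-search the largest ping payload that passes with DF set.

    probe(size) must return True when a don't-fragment ping carrying
    `size` bytes of data gets a reply, and False otherwise.
    Returns None if even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always shrinks
        if probe(mid):
            low = mid                 # mid fits, look for something larger
        else:
            high = mid - 1            # mid is too big
    return low


def path_mtu(payload):
    """Convert a ping payload size into the matching IP packet size."""
    return payload + IP_HEADER + ICMP_HEADER


if __name__ == "__main__":
    # Stand-in probe that behaves like the tunnel in this post:
    # anything above 1448 bytes of data is rejected.
    def fake_probe(size):
        return size <= 1448

    best = largest_df_payload(fake_probe)
    print(best, path_mtu(best))
```

A real probe would wrap a call to `ping -f -l <size> <host>` and check for a reply; that part is left out here because it depends on the platform and on how the output is parsed.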

Now that we had arrived at our new MTU, we set it on the client machine (MTU information is covered in the MS Support doc located at the end of this post). We rebooted the machine to ensure that the change took effect, and it did! In my traces, I could see that packets were now coming in order to each machine. We weren’t going to do this on the 6000 machines in CALA that were experiencing the mail issue, so what was the next step?

Call Microsoft, we did. We explained the issue to MS and they were on it like white on rice. They explained that a patch that went out around April 13th (coincidence) causes issues in situations like ours, in which ICMP MTU exceeded messages are sent to a patched machine. We checked the server and workstations for patch MS05-019, which MS noted as the possible culprit. MS stated that significant changes had been made to tcpip.sys and the manner in which packets are handled. Hmm. Anyway, they produced a hotfix, which we were asked to install on both the server and the client machines, to repair the mishandling of packets within tcpip.sys in our specific situation.

So, we installed the hotfix on both machines and voila! Well, that sounds easy, but we had to wait a day for the CALA guys to get approval, do it that night and get back with us first thing in the morning. CALA was back in service. Mind you, we used a different client machine to load the hotfix on, since we had already changed the MTU on the first one. We wanted to leave that one alone in case we needed to perform more testing, but that turned out not to be the case.

This problem went on for a week. We had Cisco looking at the tunnel as a possible cause, and we removed IP route caching from the interfaces involved (WAN and LAN), which did not help. We also tried clearing the DF bit on packets as they entered the tunnel so that they would be fragmented instead of dropped; still no luck. Mind you, I did not sniff while all of these changes were made; however, if a change did not work, we immediately removed it and returned to our original configuration. We did not want to knock all these guys completely out of service.

This was definitely an interesting problem on our network and one worthy of a write-up. Hope this helps shed light in some fashion. Keep in mind that tighter control over those MS patches, along with appropriate testing by the CALA crew, might have prevented this issue. In today’s world, where we need to apply patches in such volume and as quickly as possible, it’s hard to get that testing done correctly and quickly, so this is the troubleshooting price we pay. All in all, I took away a lot from the experience, and now I have some new CALA friends.

References:
The KB article:
http://support.microsoft.com/default.aspx?scid=898060
The Security Bulletin associated with MS05-019:
http://www.microsoft.com/technet/security/bulletin/MS05-019.mspx

Until next time, good day..

Chuck

Feel free to email me or post a reply.
Email: c h u c k (at) g o f i x i t (dot) c o m

Copyright gofixit.com 2005 - 2009