CodeBetter.Com

Steve Hebert's Development Blog

Steve's Blog - From .Net to dotMath and everything in between.

MS Windows Services for Unix + Client for NFS + EMC = Kernel Memory Leak

Here's a topic I thought I'd lend a little Google juice to, since Microsoft has created a hotfix.  This problem is nasty: difficult to diagnose and difficult to track down.

We have an application that shares a NAS device with Unix servers.  In both our pre-production and production environments, a group of Windows 2000 boxes would go belly up after a week of use.  These machines would gradually become slow and then suddenly be unable to communicate on the network.  The event logs showed that the system had stopped communicating over the network and was logging repeated errors.  It looked like someone had tripped over a network cable whenever these systems went down.  To make diagnosis worse, I did not have physical access to the boxes, only VNC access.  Here's the path we took to diagnose the problem; perhaps it will save someone else some time.

After a few crashes we saw that socket creation was being denied because resources were low.  This led us to look at System PTEs (page table entries).  We monitored free System PTEs in perfmon and saw that the leak didn't start for the first 4 hours, after which the count declined steadily at a rate loosely tied to traffic volume.  Without any traffic we would see PTEs decrease at a rate of ~5/hour; with traffic the rate ranged from 60 to 100 PTEs per hour.  The PTEs always decremented in blocks of 10.
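For anyone chasing a similar leak, here's a minimal sketch of how to turn a handful of perfmon readings into a leak rate.  It fits a straight line through (hours, free PTEs) samples and reports the slope as PTEs lost per hour.  The function name and the sample readings are hypothetical, made up for illustration; they are not the numbers from our boxes.

```python
from typing import Sequence, Tuple


def leak_rate_per_hour(samples: Sequence[Tuple[float, float]]) -> float:
    """Return PTEs lost per hour, from (hours, free_ptes) samples.

    Uses an ordinary least-squares slope, negated so that a leak
    (a falling counter) comes out as a positive number.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must cover more than one point in time")
    return -num / den


# Hypothetical readings: (hours since boot, free system PTEs).
samples = [(4.0, 20000), (5.0, 19920), (6.0, 19840), (7.0, 19760)]
rate = leak_rate_per_hour(samples)
print(f"leaking about {rate:.0f} PTEs/hour")
```

Fitting a line rather than subtracting the first reading from the last smooths out the stair-step pattern you get when the counter drops in fixed-size blocks.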

At this point we weren't sure what was causing it.  The typical culprit is a driver, since drivers consume kernel memory.  After spending a couple of days trying to track this down, we found that Windows Services for Unix was at fault.  We contacted MS support and they shipped us the hotfix.  The problem disappeared and we haven't seen the behavior since.

I find it hard to believe that Microsoft has had this product in the field for so long and is only now seeing a leak this critical.  For some reason we only saw the problem with our EMC/NFS connection.  We have a Solaris/NFS connection in development that has never exhibited the problem.  I guess Microsoft doesn't test Windows Services for Unix against small 3rd party vendors like EMC. </sarcasm>  We spent a ton of time tracking this problem and questioned everything on these systems.  It's interesting to note that the problem also happens in Windows 2003, but because 2K3 always has significantly more System PTEs than Win2k, the box will take much longer to fail.
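To put a rough number on "much longer to fail", here's a back-of-the-envelope sketch.  It assumes a constant leak rate and treats the box as dead only when free PTEs hit zero, which is a simplification (sockets started failing for us before that).  The starting PTE count below is a hypothetical figure for illustration; the 60-100 PTEs/hour range and the 4-hour delay are the ones we observed.

```python
def hours_until_exhausted(free_ptes: float,
                          leak_per_hour: float,
                          delay_hours: float = 0.0) -> float:
    """Estimate hours from boot until free PTEs reach zero.

    Assumes nothing leaks for the first delay_hours, then a constant
    leak of leak_per_hour PTEs per hour.
    """
    if leak_per_hour <= 0:
        raise ValueError("leak_per_hour must be positive")
    return delay_hours + free_ptes / leak_per_hour


# Hypothetical starting pool; check your own box in perfmon.
start_ptes = 15000
fast = hours_until_exhausted(start_ptes, 100, delay_hours=4)
slow = hours_until_exhausted(start_ptes, 60, delay_hours=4)
print(f"{fast / 24:.1f} to {slow / 24:.1f} days")

# A box that starts with twice the PTEs lasts roughly twice as long.
bigger = hours_until_exhausted(2 * start_ptes, 100, delay_hours=4)
```

Under these assumptions the time to failure scales with the size of the starting pool, which is why the same leak stays hidden much longer on a box with more System PTEs.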



