
The curious case of the major XenApp outage.


Here’s a really strange and interesting issue I faced late last week, one that resulted in a few head-scratching moments and late nights.

An issue began two weeks ago in a XenApp 4.5 farm used for hosted desktops: intermittently, non-paged pool bytes would shoot through the roof, the event logs would become inundated with event ID 333 errors, and the servers would lock up completely.

The running sessions could no longer open new applications, performance was extremely poor, and RDP’ing to the XenApp server would result in an RPC error. Disconnecting the sessions remotely would also result in an RPC error, or TSAdmin would be completely unable to connect to the server. We had no choice but to dump the servers using NMI and pray for a usable crash dump.

No changes had been made to the environment in a number of weeks; the last change had been a “Citrix Ready” printer driver from Lexmark. As the days progressed the issue got worse and worse, with more servers going offline day by day. Although we did initially catch a number of crash dumps, we hit a bad run of luck, with most of them being corrupt on restart.

By day six, nine servers had gone offline throughout the day, and I was pulled in to help resolve this massive issue.

 

 

I fired up the Windows Debugging Tools and managed to get a look at a crash dump fresh from a locked-up server.

Using !vm, I pulled the virtual memory statistics at the point of the crash:


[Screenshot: !vm output showing non-paged pool usage]


So we had a serious non-paged pool leak, as we suspected, but what exactly was chewing up all that non-paged pool?

Running !poolused 2, I was able to drill down into the drivers using non-paged pool and see which pool tag was consuming the largest amount, as below:


[Screenshot: !poolused 2 output]


Reviewing the list, I was immediately alarmed by the amount of RAM in use by the “Ica” pool tag. Having reviewed hundreds of memory dumps, I had never seen the Ica pool tag listed in the top 20, never mind using 99,721,328 bytes (~95 MB).

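For anyone following along at home, the debugger side of this was only a couple of commands. A rough sketch of the session, with !vm giving the virtual memory summary and !poolused listing consumption per pool tag (the output will obviously vary per dump):

!vm
!poolused 2
!poolused 4

!poolused 2 sorts the tags by non-paged pool usage; !poolused 4 gives the same list sorted by paged pool usage.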
It’s fairly obvious who owns the Ica pool tag, but just to be on the safe side, and to drill down to the owning driver, I ran the following command in the drivers folder to find references to the “Ica” pool tag.

findstr /m /l Ica *.sys

[Screenshot: findstr results]


So we got quite a few hits for the Ica pool tag. Quite a number of the above drivers are Microsoft’s, which is not surprising in the grand scheme of things, as we all know where the RDP protocol originated.

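As an aside, if the tag had been less obvious, the pooltag.txt file that ships with the Debugging Tools for Windows lists the known Microsoft pool tags, so a quick search of it would have been another pointer. The install path below is an assumption, so adjust it to your own install:

findstr /i /c:"Ica" "C:\Program Files\Debugging Tools for Windows\triage\pooltag.txt"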
So with a little more information to hand, I set about googling this chain of events to see whether it was either documented or hotfixed. The search yielded quite a lot of articles, including a private hotfix and a hotfix rollup pack.

Drilling down into the technotes to see if I could find a potential cause for this issue, I was left a little wanting by the lack of information available:

Servers with Hotfix Rollup Pack 6 installed can run out of non-paged pool memory in the ICA Pool tag. The issue occurs when users log on to the server and its frequency increases with the amount of time since the most recent restart and the number of logons since the restart.

What irked me here was the lack of information, and the fact that these servers had been running HFRP 6 for roughly 18 months with no issues similar to this.

Why all of a sudden are we losing servers all over the place to such an issue?



I dug further into the hotfix notes with help from my good friend and all-round cool Citrite James Denne; the hotfix notes specifically stated:

When a server is in low memory condition the <Redacted>() spins in an infinite loop by constantly allocating an OUTBUF and trying to write it on the stack. This problem happen when system is going in and out in low memory condition.

So there’s a great explanation of the issue from the horse’s mouth, but again there was a niggling problem in the back of my head…

These servers weren’t spinning in and out of low memory; surely our pool usage reporting would have caught that?



I was satisfied to see a hotfix was available, but in the back of my head I was concerned about the change that may have caused this issue: it was still unclear what was causing this low memory condition to spin the infinite loop, and why we couldn’t see the low memory scenario before it happened. Being a massive issue, we had to make a quick turnaround here. We had a choice of going to HFRP 7 or using the private hotfix available. I chose the private hotfix, for two reasons:

  • Mass deploying a rollup pack to fix one problem is like tapping in a nail with a sledgehammer.
  • My experience with hotfix rollup packs is that they fix your issues, but introduce at least one new one.

We took all the servers offline for emergency maintenance that night, and cautiously waited for the morning to come to see if our issue was resolved.

And so we patiently waited…



Once hotfixed and rebooted, we arrived at the office early to watch as the user sessions began to filter into the farm. All was quiet for the first hour or so, but then the phones started.

Once the user load hit 15–16 users per XenApp server, a number of servers again began logging event ID 333 errors, as below:


[Screenshot: event ID 333 entries in the event log]

Dammit.



Frantically, we connected to the console of a server to check the pool usage, but again there were no alerts on pool size? As below, the Ica pool tag was nowhere to be seen:


[Screenshot: PoolMon showing healthy pool usage]


And the Ica tag was at a much more respectable/expected value, as below:


[Screenshot: the Ica pool tag at a normal size]

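For reference, we were keeping an eye on this live with PoolMon, along these lines (switches quoted from memory, so check the tool’s help before relying on them):

poolmon -b

The -b switch sorts the display by byte usage, and pressing p once it’s running cycles the view between paged, non-paged and both.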

Our next clue came when users logged on: they were greeted with the following error:


[Screenshot: profile error shown at logon]

So we’ve fixed our Ica memory leak; now what the hell is happening?



If memory usage for the pools was OK, but we were still getting errors about flushing to the registry, and new users couldn’t load their profiles, my hunch was that there had to be something wrong with the registry hives…

I opened a command prompt in the “Documents and Settings” folder and ran the following command:

dir /s /a ntuser.dat

With a quick glance, I found the following:


[Screenshot: dir output listing ntuser.dat file sizes]


The “Citrix Print Manager Service” user account had a registry hive of over 50 MB? What in the name of Superman is hiding in that registry hive?

To rectify the issue immediately, we stopped the above print manager service and forced the hive to be unloaded with delprof. Once we had done this, user profiles began to load again on each affected server, but we were now unable to use client-redirected printing.

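For reference, the immediate remediation on each affected server went roughly like this; the service display name and delprof switches are from memory, so verify them before running anything in your own environment:

rem stop the service holding the bloated hive open
net stop "Citrix Print Manager Service"
rem quietly remove the stale, unloaded profiles, ignoring errors
delprof /q /i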
To regedit!



I mounted the registry hive of a profile that had failed to delete and drilled down to see what all the fuss was about. As this was now firmly in printing land, I went looking for keys matching the Lexmark driver change from a number of weeks ago (the mounting itself is sketched below).

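Mounting a hive offline is just a reg load away. A rough sketch, with an illustrative profile path and a placeholder key path rather than the exact ones from this environment:

rem load the orphaned hive under a temporary name
reg load HKU\BloatedHive "C:\Documents and Settings\SomeServiceAccount\ntuser.dat"
rem export the suspect key to a text (.reg) file for comparison; the key path here is a placeholder
reg export "HKU\BloatedHive\Software\<path to the printing keys>\PSPlugIn" psplugin.reg
rem unload the hive again when finished
reg unload HKU\BloatedHive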
What I found was extremely interesting: for each client-redirected printer ever mapped with the Lexmark driver, we had an entry under both the PCLPlugin and PSPlugIn keys:


[Screenshot: PCLPlugin registry entries]


Although this was a really questionable practice from Lexmark, I examined the keys under each entry of the PCLPlugin key, and they consisted of just a few small binary values, none of which were more than a KB or two.

Upon looking at the same keys under PSPlugin, I found a new key called GDL. This GDL key was absolutely massive, and there was one for each and every time a client printer had been redirected using the Lexmark V2 driver.


[Screenshot: the GDL key under PSPlugin]


I exported both the user’s hive and the PSPlugin key to text, and the comparison is below:


[Screenshot: size comparison of the user hive vs. the PSPlugin key]


The GDL key itself was over 3 MB per redirected printer!?


[Screenshot: GDL key size per redirected printer]

So there we have it.



The root cause was as follows:

This Lexmark driver has a weird tendency to store an unbelievable amount of crap in the registry for users.

The Citrix Print Manager Service also suffers this fate when it maps a redirected printer.

As more and more users were testing in production (GRRRR!) / beginning to use a new printing solution on a customer site, this registry hive grew and grew, ultimately flooding the maximum registry size of 30% of the paged pool RAM.

As the registry hive size was growing out of control, the Ica driver ran into a low memory situation and ultimately caused the infinite loop.

The Ica loop and non-paged pool saturation were masking the printer driver bloat in the registry.

As the days went on, more and more servers began to saturate the maximum registry size and go offline.

Corrective actions:

  • Enforce a policy not to allow native drivers, in any way, shape or form, when redirecting printers, where possible.
  • Obtain the latest driver from Lexmark if you have Lexmark printers.
  • Give Lexmark an earful for not testing their drivers.

Lessons Learned:

  • Don’t test things in production.
  • Don’t trust a vendor just because they are “Citrix Ready”; it’s up to them to test these things, and they regularly don’t.
  • Create a monitor for registry size (Perfmon > System > % Registry Quota In Use); a command-line sketch follows below.
  • Install the debugging tools on the XenApp 4.5 servers, as this issue is going to become more prevalent. *

* This isn’t going to get any better.
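For the registry size monitor mentioned in the list above, even the built-in typeperf will do the job. A minimal sketch, sampling the counter every five minutes for an hour; wrap it in whatever scheduling and alerting you already have:

typeperf "\System\% Registry Quota In Use" -si 300 -sc 12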

As vendors move further and further towards 64-bit architectures, they can and will forget about the extremely restrictive memory sizes available in 32-bit versions of Windows; 64-bit Windows has so much memory available for the pools that they can be as sloppy as they want without much concern.

The death knells for Server 2003, Windows XP and XenApp 4.5 have been tolling for some time.

You are going to see pool floods and other such nasties more and more in the coming months before you finally decommission your old Server 2003 environment. My advice to you is:

  • Get very comfortable with the following tools:
    • PoolMon.
    • Process Explorer.
    • The Windows Debugging Tools.
  • Have a good read of the following article: 333. Read it, then read it again.
  • Never be afraid to have a look at a dump file yourself (a starter sketch follows after this list).
  • Throw an issue at every vendor possible during troubleshooting, it’s in their interest to prove it’s not their software at fault.
  • Understand your pool sizes and limitations.
  • Never trust a printer driver.
  • Never, ever, ever trust a Vendor to behave accordingly or follow the Citrix Ready standards.
  • If you absolutely, positively need to run something on Server 2003 or XP, consider using XenDesktop hosted apps to isolate the problem to a single kernel, away from the bulk of your task workers.

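And on the dump-file point above, getting a first look is less scary than it sounds. A minimal sketch, assuming the dump landed in the default location:

windbg -z C:\WINDOWS\MEMORY.DMP

Once the dump loads, .symfix followed by .reload points the debugger at the Microsoft symbol server, !analyze -v gives you the automated first pass, and the !vm / !poolused commands from earlier in this post take you the rest of the way.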