Yesterday, one of our production sites began to crash at random intervals. We narrowed the issue down to one specific user who was logged in at the time of each crash and was clicking through a number of (again random) pages.
Post-mortem debugging using crash dumps and WinDbg showed that the last exceptions on the stack were, again, random and fairly minor.
The only thing they had in common was that they were unhandled, and so ended up in the Application_Error method of the web project’s HttpApplication-derived class.
So what happened?
In the end, it boils down to a feature in Internet Information Services called “Rapid Fail Protection”. If enabled (the default), IIS stops the application pool and serves “503 Service Unavailable” responses once it sees X worker process failures, in our case triggered by the unhandled exceptions, within Y minutes (both configurable).
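To make the “X failures in Y minutes” rule concrete, here is a minimal Python sketch of that kind of threshold. It is a toy model only: the class name, the sliding-window bookkeeping, and the default values are assumptions made for illustration, not a description of how IIS implements the feature internally.

```python
from collections import deque


class RapidFailModel:
    """Toy model (hypothetical): stop the pool after `max_failures`
    failures inside a window of `interval_seconds`."""

    def __init__(self, max_failures=5, interval_seconds=300):
        # Illustrative values; the real ones come from the pool's settings.
        self.max_failures = max_failures
        self.interval_seconds = interval_seconds
        self.failures = deque()
        self.stopped = False

    def record_failure(self, timestamp):
        """Record a failure at `timestamp` (in seconds).
        Returns True if the pool is stopped afterwards."""
        if self.stopped:
            return True
        self.failures.append(timestamp)
        # Forget failures that have fallen out of the window.
        while self.failures and timestamp - self.failures[0] >= self.interval_seconds:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            self.stopped = True
        return self.stopped

    def status_code(self):
        # A stopped pool answers every request with 503.
        return 503 if self.stopped else 200
```

With a threshold of 3 failures per 60 seconds, failures at 0, 10 and 20 seconds stop the model's pool, while failures spaced 100 seconds apart never do. That matches what we saw: individually minor errors, but enough of them in a short time to take the site down.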
Of course, the best fix is to catch exceptions properly. However, if you ever have application pools stopping under mysterious circumstances, check whether Rapid Fail Protection is turned on.
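One way to check is to look at the `failure` element of each pool in IIS's applicationHost.config (normally found under %windir%\System32\inetsrv\config). The Python sketch below reads those attributes from an embedded sample, so it runs anywhere. The pool names and the sample XML are made up for illustration, and the helper `rapid_fail_settings` is hypothetical. A real file would also carry inherited values in `applicationPoolDefaults`, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape of applicationHost.config; pool names are hypothetical.
SAMPLE_CONFIG = """\
<configuration>
  <system.applicationHost>
    <applicationPools>
      <add name="ExamplePool">
        <failure rapidFailProtection="true"
                 rapidFailProtectionInterval="00:05:00"
                 rapidFailProtectionMaxCrashes="5" />
      </add>
      <add name="OtherPool" />
    </applicationPools>
  </system.applicationHost>
</configuration>
"""


def rapid_fail_settings(config_xml):
    """Return {pool name: attributes of its <failure> element}.
    An empty dict means the pool sets nothing explicitly."""
    root = ET.fromstring(config_xml)
    settings = {}
    for pools in root.iter("applicationPools"):
        for pool in pools.findall("add"):
            failure = pool.find("failure")
            settings[pool.get("name")] = (
                dict(failure.attrib) if failure is not None else {}
            )
    return settings


if __name__ == "__main__":
    for name, attrs in rapid_fail_settings(SAMPLE_CONFIG).items():
        print(name, attrs or "(no explicit failure settings)")
```

To run it against a real server, read the config file's contents and pass them to `rapid_fail_settings` instead of the sample. The same settings are also shown in IIS Manager under the application pool's Advanced Settings.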