Everything is fine, or so you think. Then the CPU spikes on one box while the other web front ends are fine. Do you know how to dig in and figure out what’s up?
Here are a few things I’d do…
So a worker process goes crazy. First you need the process ID of the w3wp.exe worker process, and the app pool it belongs to.
1. Add the Process ID to your Task Manager.
Start Task Manager:
You’re likely already in it since you’ve identified it’s a worker process that’s taking all your CPU.
Click View | Select Columns | check "PID (Process Identifier)"
(See more details on "Identifying worker processes in IIS 6 & IIS 7 for debugging" and look at Remote debugging.)
2. Next get a list of all of your worker processes
You can use the WP (worker process) object to list running worker processes in appcmd.exe from IIS 7:
%systemroot%\system32\inetsrv\APPCMD list wps
You’ll get a list of the running worker processes, each with its process ID and app pool name:
WP "3577" (apppool:DefaultAppPool)
WP "9823" (apppool:Team)
WP "7235" (apppool:My)
WP "533" (apppool:G32c2cd87s235sd3f2ub9sads3234)
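If you have many worker processes, it can help to turn that listing into a lookup table so the PID you spotted in Task Manager maps straight to an app pool. The following is a minimal Python sketch, assuming you have saved the appcmd output as text. The function name parse_worker_processes is hypothetical, and the pattern accepts both the "apppool:" label shown above and the "applicationPool:" label that some appcmd versions print.

```python
import re

# Matches lines such as: WP "9823" (apppool:Team)
# Accepts either "apppool:" or "applicationPool:" as the label (assumption).
WP_LINE = re.compile(r'^\s*WP\s+"(\d+)"\s+\((?:apppool|applicationPool):(.+)\)\s*$')


def parse_worker_processes(text):
    """Return a dict of {pid: app_pool_name} from `appcmd list wps` output."""
    pools = {}
    for line in text.splitlines():
        match = WP_LINE.match(line)
        if match:
            pools[int(match.group(1))] = match.group(2)
    return pools


# Sample text copied from the listing above.
sample = '''WP "3577" (apppool:DefaultAppPool)
WP "9823" (apppool:Team)
WP "7235" (apppool:My)
WP "533" (apppool:G32c2cd87s235sd3f2ub9sads3234)'''

pools = parse_worker_processes(sample)
# Look up the app pool for the PID that Task Manager shows as busy.
print(pools.get(9823))
```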
3. Once we have the app pool we can look at running requests and connections.
If you can take the box out of rotation (assuming it’s being load balanced) you can go into IIS and stop the non-problematic app pools. Then look at the inbound requests.
%systemroot%\system32\inetsrv\APPCMD list requests
REQUEST "fb0000238022230e" (url:GET /wait.aspx?time=10000,time:4276 msec
You can narrow this list by filtering on just the failing app pool.
%systemroot%\system32\inetsrv\APPCMD list request /apppool.name:Team
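On a busy pool the request list gets long, and the interesting entries are the ones that have been running the longest. Here is a rough Python sketch that picks those out of saved appcmd output. It assumes each line follows the "REQUEST ... (url:VERB path,time:N msec" shape shown above; the function name slow_requests, the threshold, and the second sample line are made up for illustration.

```python
import re

# Matches the start of lines such as:
# REQUEST "fb0000238022230e" (url:GET /wait.aspx?time=10000,time:4276 msec
REQUEST_LINE = re.compile(r'^REQUEST "([^"]+)" \(url:(\S+) (.+?),\s*time:(\d+) msec')


def slow_requests(text, threshold_ms):
    """Return (elapsed_ms, verb, url) tuples at or above threshold_ms, slowest first."""
    rows = []
    for line in text.splitlines():
        match = REQUEST_LINE.match(line.strip())
        if match:
            elapsed = int(match.group(4))
            if elapsed >= threshold_ms:
                rows.append((elapsed, match.group(2), match.group(3)))
    return sorted(rows, reverse=True)


# First line is from the listing above; the second is a hypothetical fast request.
sample = """REQUEST "fb0000238022230e" (url:GET /wait.aspx?time=10000,time:4276 msec
REQUEST "fb0000238022230f" (url:GET /default.aspx,time:120 msec"""

for elapsed, verb, url in slow_requests(sample, threshold_ms=1000):
    print(elapsed, verb, url)
```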
4. If the machine is still connected, running netstat will give you a list of the current connections to the server. For example, is it an indexer or a client connecting to web services that’s causing the CPU to spike? More likely it’s a timer job on one of your own boxes. On a busy box you may see tons of connections, so you may need to pipe or redirect the output to a text file.
netstat [-a] [-e] [-n] [-o] [-p Protocol] [-r] [-s] [Interval]
Displays active TCP connections, ports on which the computer is listening, Ethernet statistics, the IP routing table, IPv4 statistics (for the IP, ICMP, TCP, and UDP protocols), and IPv6 statistics (for the IPv6, ICMPv6, TCP over IPv6, and UDP over IPv6 protocols). Used without parameters, netstat displays active TCP connections.
-a : Displays all active TCP connections and the TCP and UDP ports on which the computer is listening.
-e : Displays Ethernet statistics, such as the number of bytes and packets sent and received. This parameter can be combined with -s.
-n : Displays active TCP connections, however, addresses and port numbers are expressed numerically and no attempt is made to determine names.
-o : Displays active TCP connections and includes the process ID (PID) for each connection. You can find the application based on the PID on the Processes tab in Windows Task Manager. This parameter can be combined with -a, -n, and -p.
-p Protocol : Shows connections for the protocol specified by Protocol. In this case, the Protocol can be tcp, udp, tcpv6, or udpv6. If this parameter is used with -s to display statistics by protocol, Protocol can be tcp, udp, icmp, ip, tcpv6, udpv6, icmpv6, or ipv6.
-s : Displays statistics by protocol. By default, statistics are shown for the TCP, UDP, ICMP, and IP protocols. If the IPv6 protocol for Windows XP is installed, statistics are shown for the TCP over IPv6, UDP over IPv6, ICMPv6, and IPv6 protocols. The -p parameter can be used to specify a set of protocols.
-r : Displays the contents of the IP routing table. This is equivalent to the route print command.
Interval : Redisplays the selected information every Interval seconds. Press CTRL+C to stop the redisplay. If this parameter is omitted, netstat prints the selected information only once.
/? : Displays help at the command prompt.
I’ve personally used netstat to look at outgoing connections for troubleshooting connections to the gateway or firewall.
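Once the netstat output is in a text file, a few lines of Python can tell you which remote hosts hold the most connections. This is a minimal sketch, assuming the file was produced with "netstat -ano" and has the usual five columns for TCP rows (protocol, local address, foreign address, state, PID). It groups by local port rather than by PID, because inbound HTTP connections on IIS are generally owned by the HTTP.sys kernel driver and so tend to show up under the System process instead of w3wp.exe. The function name and the sample addresses are hypothetical.

```python
from collections import Counter


def remote_hosts_on_port(netstat_text, local_port):
    """Count TCP connections per remote host for one local port."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # TCP rows have five fields: proto, local, foreign, state, PID.
        # Header and UDP rows do not match and are skipped.
        if len(fields) != 5 or fields[0] != "TCP":
            continue
        port = fields[1].rsplit(":", 1)[-1]
        if port == str(local_port):
            # Drop the remote port; rsplit also copes with [ipv6]:port.
            counts[fields[2].rsplit(":", 1)[0]] += 1
    return counts


# Hypothetical sample in the shape of `netstat -ano` output.
sample = """Active Connections

  Proto  Local Address          Foreign Address        State           PID
  TCP    10.0.0.5:80            10.0.0.21:51544        ESTABLISHED     4
  TCP    10.0.0.5:80            10.0.0.21:51545        ESTABLISHED     4
  TCP    10.0.0.5:80            10.0.0.40:60211        TIME_WAIT       0
  TCP    10.0.0.5:443           10.0.0.33:49822        ESTABLISHED     4
"""

for host, count in remote_hosts_on_port(sample, 80).most_common():
    print(host, count)
```

If one host dominates the list, that is a good candidate for the indexer, client, or timer job mentioned above.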
If your app pool is shared by multiple web applications, you will need to narrow this down further. Most people will want to either turn on tracing in IIS or dig into the ULS logs. I’d suggest the ULS logs, but don’t forget the low-hanging fruit of Event Viewer.
5. Dig into Event Viewer and look for WWW, W3SVC, HTTP, and worker process related events in the Application and System logs. If you’re not seeing events, make sure they are being logged (IIS application pool recycling events can be logged), and turn on tracing if you need to.
The IIS 7.0 Health Model has been published on TechNet and contains details about most of the Event Log error codes that are logged for worker process and service (WAS) level conditions. It also includes suggested diagnostics and workaround steps for each error condition. Understanding these will help you track down whether the issue is in the application, the worker process, or somewhere else.
The following are common:
Failed to start/restart the worker process:
- The configuration is invalid.
- The application pool identity has wrong account name or password.
- The maximum number of worker processes is reached or out of resources.
- Worker process cycles over and over again, starts and fails.
- Process can’t start due to service app pool account password issue
IIS initialization failed:
- The configuration section is invalid
- A module DLL listed in has invalid path, or failed to load
- Web.config related DLL issues
- A module failed to initialize
- A module, or application component has generated a debug break, or memory access violation, causing the process to terminate abruptly.
- Out of memory
- Unexpected error
Developers may want to attach Visual Studio for debugging; it is worth learning how to debug ASP.NET app issues (memory leaks, crashes, deadlocks, etc.).
Otherwise, set up IIS tracing so that next time you capture more of what’s going on.
The worker process monitoring feature in IIS lets you monitor sites, application pools, server worker processes, application domains, and requests.
4. Once you have the list, you can easily correlate the process ID from Task Manager with the app pool to determine the web app or the problem area. You can get a list of all of the details for your app pools and how they correlate to by digging into
Spencer Harbar has a great writeup. Here’s his explanation of how to use ULS Viewer for troubleshooting, including working with the correlation ID, which you might be getting from a page or web part that’s throwing errors.
For SharePoint 2010, by default, the ULS log is at C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\LOGS
You can check the directory and try to read those logs. I was quite used to doing that with Notepad :)
ULS Viewer can be used in different modes. The log can be read from log files, from the real-time ULS log, or even from the clipboard. Here’s an example:
On a machine running SharePoint 2010, run ULS Viewer. Click File, Open From, then choose ULS (this can also be done by simply pressing Ctrl+U). The logs are shown in real time immediately. You can filter by message level by clicking the icons in the middle. This can tell you what is going on inside SharePoint.
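If you only have the raw log files and a correlation ID from an error page, you can also pull the matching entries out with a short script. The Python sketch below assumes the ULS file is tab-delimited with a header row that includes a "Correlation" column, which is how SharePoint 2010 logs are commonly laid out; check your own files before relying on it. The function name entries_for_correlation and the sample rows are hypothetical.

```python
def entries_for_correlation(lines, correlation_id):
    """Return ULS rows (as dicts keyed by header name) matching a correlation ID."""
    lines = iter(lines)
    # The first line is assumed to be the tab-delimited header row.
    header = [col.strip() for col in next(lines).split("\t")]
    corr_index = header.index("Correlation")
    wanted = correlation_id.lower()
    matches = []
    for line in lines:
        cols = [col.strip() for col in line.rstrip("\n").split("\t")]
        if len(cols) > corr_index and cols[corr_index].lower() == wanted:
            matches.append(dict(zip(header, cols)))
    return matches


# Hypothetical sample rows in the assumed ULS layout.
sample = [
    "Timestamp\tProcess\tTID\tArea\tCategory\tEventID\tLevel\tMessage\tCorrelation",
    "01/01/2011 10:00:00.00\tw3wp.exe (0x2657)\t0x0A10\tSharePoint Foundation\t"
    "General\t8nca\tMedium\tApplication error when access /default.aspx\t"
    "1b2c3d4e-0000-0000-0000-000000000001",
    "01/01/2011 10:00:01.00\tOWSTIMER.EXE (0x0B44)\t0x0C20\tSharePoint Foundation\t"
    "Timer\t5utx\tVerbose\tTimer job heartbeat\t"
    "1b2c3d4e-0000-0000-0000-000000000002",
]

for entry in entries_for_correlation(sample, "1B2C3D4E-0000-0000-0000-000000000001"):
    print(entry["Timestamp"], entry["Level"], entry["Message"])
```

To run it against a real file, pass an open file object instead of the sample list, for example open(path, encoding="utf-8", errors="replace"), adjusting the encoding to whatever your log files actually use.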
5. Tools – Code leaks can be prevented and detected. The most popular tools are FxCop and SPDisposeCheck.
I like using wfetch from the IIS Resource Kit to debug HTTP(S) and simulate connections (it displays the request and the response), and the free Fiddler2, a web debugging proxy that logs all HTTP(S) traffic between your computer and the Internet and lets you inspect it. Both let you simulate and walk through web requests to see what users are doing. It can also be helpful to try emulators and different browsers: recently we had an issue with a Windows Phone 7 device, and we downloaded the emulator to track it down.
If you’re hitting the page with a browser and not getting very good errors, make sure you’ve disabled friendly HTTP error messages in Internet Explorer. Those friendly error messages are no help for troubleshooting.
6. Turn the Dev Dashboard to on demand to catch the next one early – Things can go a lot of different ways from here: debugging in Visual Studio, FxCop, SPDisposeCheck, and other tools and mechanisms for checking the code. In our case we set the developer dashboard to ondemand in our intranet production environment, so we can now tell right from our desktop browser what might be starting to go south. That way we can quickly determine whether it’s the web part, the SQL query, or an app server.
stsadm -o setproperty -pn developer-dashboard -pv ondemand
(The value can also be "on" or "off".)