Wednesday, 18 January 2012

Don’t Be a Balloon!

I had a problem just before the festive period where one of our production SQL Servers was randomly grinding to a halt; queries which usually returned in milliseconds were running for minutes.  It took 4 days of performance monitoring / troubleshooting and some help from the SQL community (when I thought all was lost) to find the cause of the problem and get it resolved!

To give a little bit of background information about this particular environment: we run multiple SQL Server clusters in various geographical locations across the globe, which we replicate to from a central SQL Server cluster.  Each of these ‘child’ SQL Servers also replicates certain transactional articles back to the central environment.  Each server runs SQL Server 2005 Standard Edition SP3 x64 on Windows Server 2003, and they have different CPU / memory allocations designed to support their respective workloads.

The server encountering the problem had 50% more memory than another server and was running half the load.  The problem first came to light when a couple of overnight jobs which normally take a few minutes were running for several hours before I had to stop them.  I started investigating the issue, looking at the usual DMVs, with no joy.

Below are a few of the DMVs I used during my investigations.

sys.dm_os_wait_stats
sys.dm_os_waiting_tasks
sys.dm_os_performance_counters
sys.dm_os_memory_allocations
sys.dm_io_virtual_file_stats
sys.dm_os_schedulers
sys.dm_os_sys_memory
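
For anyone who wants to do the same sort of digging, here is a rough sketch of the kind of top-waits query you can run against sys.dm_os_wait_stats (the excluded wait types are just an example list, trim it to suit your environment):

-- Cumulative waits since the last restart (or since the stats were last cleared),
-- with a handful of common benign wait types filtered out; adjust the list to taste.
SELECT TOP 10
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'SQLTRACE_BUFFER_FLUSH',
                        'BROKER_TASK_STOP', 'WAITFOR', 'BROKER_RECEIVE_WAITFOR')
ORDER BY wait_time_ms DESC;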

I recently read a fantastic book – Performance Tuning with SQL Server Dynamic Management Views by Louis Davidson and Tim Ford – which, although it did not lead me to the problem, allowed me to dismiss several possible causes.

I then used SQL Sentry Performance Advisor to monitor the server, without much joy.  One thing I did notice is that the buffer cache would intermittently drop from its normal size of 6GB to 0!  I used perfmon to monitor the SQLServer:Buffer Manager counters (Buffer cache hit ratio, Page life expectancy and Database pages) and, just as SQL Sentry highlighted, these would intermittently drop to 0, then build back up.
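
If you would rather watch those counters from inside SQL Server than in perfmon, the same Buffer Manager counters are exposed through sys.dm_os_performance_counters; something along these lines will do (the LIKE on object_name is there because named instances prefix the object with MSSQL$InstanceName):

-- The same counters perfmon shows under SQLServer:Buffer Manager, read from T-SQL.
-- Note that Buffer cache hit ratio has to be divided by its '... base' counter to
-- give a percentage; Page life expectancy and Database pages are raw values.
SELECT object_name, counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name IN ('Buffer cache hit ratio',
                       'Buffer cache hit ratio base',
                       'Page life expectancy',
                       'Database pages');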

I tried finding a correlation between any jobs / processes and the dramatic degradation in performance, again with no joy.  I then found errors in the error log similar to the one below:

Source: spid1s

Message: A significant part of sql server process memory has been paged out. This may result in a performance degradation. Duration: 115227 seconds. Working set (KB): 1635000, committed (KB): 6453712, memory utilization: 25%.
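
If you want to hunt for that message rather than scrolling through the log by hand, the undocumented xp_readerrorlog accepts search strings (its parameters vary a little between versions, so treat this as a sketch rather than gospel):

-- Search the current SQL Server error log for the 'paged out' message.
-- Parameters: log file number (0 = current), log type (1 = SQL Server, 2 = Agent),
-- then a search string. xp_readerrorlog is undocumented, so behaviour can differ
-- slightly between builds.
EXEC master.dbo.xp_readerrorlog
    0,
    1,
    'A significant part of sql server process memory has been paged out';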

Why was SQL Server releasing memory?  I checked the server and the locked pages in memory privilege was not granted to the SQL Server service account, so I added it and restarted the service (after repointing the applications to another server), only to find that it didn’t make any difference.  Whilst trawling the super information B road (sorry, running joke I had at a previous employer in the days when the ‘super information highway’ was reeeaaalllly slow) I found that on 64-bit systems locked pages in memory is not supported on SQL Server 2005 Standard Edition unless you are at SP3 CU4.  Great (or so I thought), let’s plan to deploy CU4 along with the locked pages in memory privilege and we should be OK.

Me being me, and I think this is a general DBA thing (maybe not, it might just be my inner geek), I wasn’t happy just applying the CU.  Something somewhere was causing SQL Server to release memory; it had never been a problem before and had only recently started happening.
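
One way of checking whether SQL Server is being leaned on by external memory pressure (a sketch of the sort of check I mean, using the resource monitor ring buffer that is available from SQL Server 2005 SP2 onwards) is below; the records show the low/high physical memory notifications SQLOS has reacted to:

-- Memory pressure notifications recorded by SQLOS. The timestamp column is in
-- milliseconds since the server started (compare it with ms_ticks in
-- sys.dm_os_sys_info to turn it into a wall-clock time), and the record column
-- is XML describing the RESOURCE_MEMPHYSICAL_LOW / _HIGH notifications.
SELECT timestamp, CAST(record AS xml) AS record
FROM sys.dm_os_ring_buffers
WHERE ring_buffer_type = 'RING_BUFFER_RESOURCE_MONITOR'
ORDER BY timestamp DESC;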

I stumbled across an excellent article from Jonathan Kehayias about locked pages in memory, and it was the Considerations for Virtual Environments section that gave me my big break.  Jonathan describes a scenario called memory overcommit as below:

“Memory overcommit is a scenario where the memory allocated to the virtual machines running on the host exceeds the total amount of physical RAM available in the server.

When memory overcommit occurs, one of the first ways that the hypervisor reacts is to make use of a special driver, known as a balloon driver, which is installed in the VM as part of the VM tools. In essence, the hypervisor sets the balloon driver the task of reducing memory consumption in the VM to a target level, and the balloon driver responds by acquiring memory in the VM.”

Now I will hold my hand up here: I hadn’t been in my new job long (it was my second month at the time of the issue) and I am still learning the environments to this day, but that is no excuse.  This server is one of several which run my company’s services for the whole of the UK, and at the time I didn’t know that it was a virtual cluster, which I should have.  Another issue is that I didn’t, and still don’t, have access to the vCenter for this server, so I couldn’t see if memory overcommit and/or memory ballooning was a problem (or so I thought).

I resorted to Twitter under the #sqlhelp hashtag, and Jonathan came to my rescue; after a few questions for further info he told me about some perf counters.  If the VM Tools are installed on a VM then some VM counters are exposed to perfmon.  You can see where I’m going with this one, can’t you?!

Adding the ballooning counter, I was shocked: all 6GB of the buffer cache had been ‘ballooned’ to disk on the ESX host!  This explained the dramatic degradation in performance.  After a call to the team who manage the VMware environment we moved the cluster to other hosts which had not overcommitted memory, and performance was restored to normal levels.  Now, I have always been sceptical about virtual infrastructure and SAN storage being used for my database servers.  However, with careful design to make sure there is no contention and there is adequate resource with room for growth so environments can scale, I had never had a problem of this magnitude before.

There are a few things I have learned from this problem:

1 – Make sure you have access to all the hardware which the servers you manage run on (SAN / VMware etc. etc.); this makes life so much easier when troubleshooting!

2 – Make sure you are involved in every aspect of design and are part of the core decision-making team for new environments!  Or, if you inherit environments when you move jobs or through acquisitions, check and double check everything!

And last and most importantly of all!

3 – Trust no one!  I learned this at a very early stage in my SQL Server career (thank you, Mr David C. Riley).  Everyone starts off equal, equally useless (to be polite), until they prove themselves otherwise!

I really hope this helps someone with the same problem or prevents someone from encountering the same issue.

Chris
