Cleveland DBA

SQL Server administration in Cleveland, OH

Error: “The file … cannot be moved by this RESTORE operation.”

This morning, I was trying to restore a database from a SQL 2005 backup to a new server.  The restore involved a full backup and a differential.  I restored the full backup with NORECOVERY via SSMS, but when I tried to restore the differential, also via SSMS, I got the error in this post's title: "The file … cannot be moved by this RESTORE operation."

Okaaay.  I was a bit befuddled until I scripted out the two restore commands that SSMS was actually generating.  The full restore is fine:

RESTORE DATABASE [TestDiff]
	FROM  DISK = N'F:\myserver_backup\TestDiff.bak'
	WITH  FILE = 1,
		MOVE N'TestDiff' TO N'F:\myserver_data\TestDiff.mdf',
		MOVE N'TestDiff2' TO N'F:\myserver_data\TestDiff_1.ndf',
		MOVE N'TestDiff_log' TO N'F:\myserver_logs\TestDiff_2.ldf',
		NORECOVERY,  NOUNLOAD,  STATS = 10
GO

But look at the differential restore:

RESTORE DATABASE [TestDiff]
	FROM  DISK = N'F:\myserver_backup\TestDiff.diff'
	WITH  FILE = 1,
		MOVE N'TestDiff' TO N'F:\myserver_data\TestDiff.mdf',
		MOVE N'TestDiff2' TO N'F:\myserver_data\TestDiff.ndf',
		MOVE N'TestDiff_log' TO N'F:\myserver_logs\TestDiff.ldf',
		NOUNLOAD,  STATS = 10
GO

Notice anything wrong? The physical filenames in the MOVE clauses aren’t the same as the ones used in the first restore. So SQL Server rightfully assumes you’re trying to move data files while performing a differential restore and spits out an error. If you change the physical filenames in the differential restore to match the first restore, or remove the MOVE clauses altogether, it works:

RESTORE DATABASE [TestDiff]
	FROM  DISK = N'F:\myserver_backup\TestDiff.diff'
	WITH  FILE = 1,
		MOVE N'TestDiff' TO N'F:\myserver_data\TestDiff.mdf',
		MOVE N'TestDiff2' TO N'F:\myserver_data\TestDiff_1.ndf',
		MOVE N'TestDiff_log' TO N'F:\myserver_logs\TestDiff_2.ldf',
		NOUNLOAD,  STATS = 10
GO

-- or

RESTORE DATABASE [TestDiff]
	FROM  DISK = N'F:\myserver_backup\TestDiff.diff'
	WITH  FILE = 1,
		--MOVE N'TestDiff' TO N'F:\myserver_data\TestDiff.mdf',
		--MOVE N'TestDiff2' TO N'F:\myserver_data\TestDiff.ndf',
		--MOVE N'TestDiff_log' TO N'F:\myserver_logs\TestDiff.ldf',
		NOUNLOAD,  STATS = 10
GO

This appears to be a bug in SSMS, but I’ve only gotten it to happen under specific circumstances: the backups had to come from a SQL 2005 instance (the target could be SQL 2005 or SQL 2008), and the database being restored had to have multiple data files.  Those were the two conditions I personally hit, though I didn’t do exhaustive testing.  If anyone finds any other scenarios where this happens, feel free to leave a comment.  Hopefully this saves others some time.

June 21, 2011 | Troubleshooting

How is SQL Server using all that memory, anyway?

I want to stick with the subject of memory because I think it’s probably the most misunderstood (and sometimes downright mysterious) component of SQL Server. I know I, for one, could benefit from a better understanding of its inner workings. So today I’d like to share that little bit of code I mentioned last week, which I find very handy for getting an overall picture of how memory is being used by a SQL instance. Personally I run it as a complete script, but I’ll break it down here and provide a little description of each section. I should say, too, that I didn’t write any of this from scratch. For most of it I started with code found here, here, and here, and modified it to suit.

So, without further ado, let’s get started. The first section provides a big-picture look at how SQL is currently allocating memory among the database page cache, procedure cache, and miscellaneous memory usage. It also provides the total memory usage, and I added in the maximum available workspace memory.

--HOW MUCH MEMORY IS MY SQL SERVER USING?

declare @plan_cache_size float, @obj_data_size float , @avail_workspace_size float

-- plan cache size

select @plan_cache_size = (cntr_value*8)/1024.0 FROM sys.dm_os_performance_counters
	WHERE object_name like '%Plan Cache%' and counter_name = 'Cache Pages' and instance_name = '_Total'

-- DATABASE PAGE CACHE (PAGES, INCLUDING INDEXES)

select @obj_data_size = (cntr_value*8)/1024.0  FROM sys.dm_os_performance_counters
	WHERE object_name like '%Buffer Manager%' and counter_name like 'Database pages%'

-- Maximum workspace available for sorts, hashes, etc

select @avail_workspace_size = (cntr_value)/1024.0 FROM sys.dm_os_performance_counters
	WHERE object_name like '%Memory Manager%' and counter_name = 'Maximum Workspace Memory (KB)'

-- CACHE TOTALS
select   @obj_data_size [Database_Page_Cache_MB]
 , @plan_cache_size [Procedure_Cache_MB]
 , [Misc_Memory_Usage_MB]
 , [Misc_Memory_Usage_MB] + @obj_data_size + @plan_cache_size [Total_Memory_Usage_MB]
 , @avail_workspace_size [Maximum_Workspace_MB]
from
(
 select sum(cntr_value)/1024.0 [Misc_Memory_Usage_MB]
 from sys.dm_os_performance_counters
 where object_name like '%memory%'
 and (
   counter_name like '%Connection Memory (KB)%'
   or
   counter_name like '%Granted Workspace Memory (KB)%'
   or
   counter_name like '%Lock Memory (KB)%'
   or
   counter_name like '%Optimizer Memory (KB)%'
   or
   counter_name like '%SQL Cache Memory (KB)%'
 )
) x
GO

The output:

The next section uses sys.dm_os_memory_cache_counters to examine how the procedure cache is being broken down.

-- How the Procedure cache is being used

SELECT  TOP 6
	LEFT([name], 20) as [name],
	LEFT([type], 20) as [type],
	([single_pages_kb] + [multi_pages_kb])/1024 AS cache_mb,
	[entries_count]
FROM sys.dm_os_memory_cache_counters
order by single_pages_kb + multi_pages_kb DESC

The output:

The third part of the script goes deeper into the procedure cache and displays the top 25 cached plans by size. Because it displays some application code, I’m not going to include the results here.

-- Top cached plans by size

select top 25
cacheobjType
, ObjType
, (pagesUsed * 8192)/1024.0/1024.0 [space_used_MB]
, db_name(dbid) [database_name]
, object_name(objid, dbid) [object_name]
, [sql]
from master..syscacheobjects (nolock)
order by pagesUsed desc
GO

And, finally, the last section shows us how much of the buffer pool is being utilized by each database, including the dirty and clean page counts, sorted by total memory.

-- Buffer Pool Memory Per DB

SELECT
(CASE WHEN ([database_id] = 32767) THEN 'Resource Database' ELSE DB_NAME (database_id) END) AS 'Database Name',
SUM(CASE WHEN ([is_modified] = 1) THEN 1 ELSE 0 END) AS DirtyPageCount,
SUM(CASE WHEN ([is_modified] = 1) THEN 0 ELSE 1 END) AS CleanPageCount,
count(*) AS TotalPageCount,
cast(count(*) * 8192.0 / (1024.0 * 1024.0) as decimal(8,2)) as BufferPoolMB
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY TotalPageCount desc
GO

And the output:

And there you have it. I hope you find this useful!

May 2, 2011 | General, SQL Tuning, Troubleshooting

Available Workspace Memory

My research into the Resource Governor question and its answer brought up another question for me:

SELECT * FROM sys.dm_exec_query_resource_semaphores

Where does that total_memory_kb number for the default pool actually come from? After all, I have max server memory set to 26000MB for the instance, so why 18175MB?  The BOL description for this field wasn’t very helpful, and googling for total_memory_kb didn’t yield much, either.  Most of what I found was unrelated to this particular DMV.

So I started looking at sys.dm_os_performance_counters, and there I finally found something that looked promising.

 select cntr_value FROM sys.dm_os_performance_counters 
WHERE object_name like '%Memory Manager%' and counter_name = 'Maximum Workspace Memory (KB)' 

If you add up the two total_memory_kb values for the default pool (there’s one row per resource semaphore), you get the maximum workspace memory. Woot!
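If you want to sanity-check that on your own instance, here’s a minimal sketch of the comparison.  It assumes pool_id 2 is the default resource pool, which it normally is (sys.resource_governor_resource_pools will confirm it on SQL 2008):

-- Sum the semaphores for the default pool...
SELECT SUM(total_memory_kb) AS default_pool_total_kb
FROM sys.dm_exec_query_resource_semaphores
WHERE pool_id = 2;   -- default pool

-- ...and compare against the Maximum Workspace Memory counter
SELECT cntr_value AS max_workspace_kb
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Memory Manager%'
  AND counter_name = 'Maximum Workspace Memory (KB)';

Per the observation above, the two results should line up.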

That still didn’t tell me how that particular value was derived, but it gave me something better to search on. And that search returned this article at SQLMag.com which says (emphasis mine):

The SQL Server: Memory Manager object includes the following workspace-related PerfMon counters: Memory Grants Pending, which tells you the number of waiting queries; Memory Grants Outstanding, which counts the number of queries that have grants and are currently executing; Granted Workspace Memory (KB), which measures the total amount of grants SQL Server has made to running queries; and Maximum Workspace Memory (KB), which measures the maximum amount of memory SQL Server is currently willing to devote to workspace-memory grants. In general, SQL Server is able to devote between 25 and 75 percent of the buffer pool’s virtual memory to workspace memory. The value varies based on the demand for memory for other uses (such as database page cache and plan cache). If demand is low, SQL Server devotes a larger fraction of memory to workspace memory needs; if demand is high, SQL Server devotes a smaller fraction to workspace memory.

So there you have it. I feel better, don’t you feel better?

That entire article is worth reading for a better understanding of SQL Server memory. In fact, it helped me put together a bit of code to show me how memory is being used by SQL Server. More on that in a future post (I’ve got to give you a reason to keep coming back, right?).

April 28, 2011 | General, SQL Tuning, Troubleshooting

Aha!

I found the answer to my Resource Governor quandary in this outstanding article, Understanding SQL Server Memory Grant, on MSDN.  The juicy bit:

The server calculates the memory limit for one query. By default, this is 25% (20% on 32bit SQL 2005) of total query memory (which is set by memory broker as about 90% of server memory). This per-query limit helps to prevent one query from dominating the whole server. This percentage is configurable on SQL 2008.

And how is it configurable, you ask?  By using Resource Governor to override it.
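If you’re curious what that override looks like, it’s the REQUEST_MAX_MEMORY_GRANT_PERCENT setting on a workload group.  A rough sketch, using a group named Reports as the example:

-- Raise (or lower) the per-query memory grant cap for one workload group
ALTER WORKLOAD GROUP [Reports]
	WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 40);

-- Changes don't take effect until Resource Governor is reconfigured
ALTER RESOURCE GOVERNOR RECONFIGURE;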

Really though, read the article, it’s extremely informative.  I’d like to give the author a great big hug.

April 25, 2011 | Features, Troubleshooting

Resource Governor causing memory problem??

So.  Last week I blogged about the experiment I was running with Resource Governor for one of our newly upgraded development databases.  I had enabled Resource Governor and created a procedure to categorize sessions, but I hadn’t set any limits and everyone was sharing the same default pool.  The plan, if you’ll recall, was to monitor the Resource Governor stats to get a feel for the resources different apps were using.

This week, users have been doing a lot of testing in this environment.  A developer called me on Wednesday morning and said that users were complaining of very slow performance.  So I took a look and saw several sessions waiting on RESOURCE_SEMAPHORE.  Now, I’ve been supporting this application’s database for years and I’ve never seen any memory waits, so this got my attention quickly.  One of the sessions running was one of the reports that I’d enabled Resource Governor to address in the first place.  A look at sys.dm_exec_query_memory_grants told me this query was requesting almost 19GB of memory (or, put another way, all of the available memory).  What the…?
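For reference, here’s a trimmed-down sketch of what I was looking at; these columns all come straight from sys.dm_exec_query_memory_grants:

-- Who is asking for (and getting) memory grants right now?
SELECT session_id,
	requested_memory_kb / 1024 AS requested_mb,
	granted_memory_kb / 1024 AS granted_mb,
	wait_time_ms,
	query_cost
FROM sys.dm_exec_query_memory_grants
ORDER BY requested_memory_kb DESC;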

I captured the query (and asked the user to kill his report) and ran it in our production (SQL 2005) database.  In that environment it was only requesting 7GB of memory.  Still a lot, but doable in a system with 40GB available.  The query plans were slightly different between the 2 environments, but updating stats didn’t change that at all.  I decided to use Resource Governor as it was intended, and I set a 40% memory grant limit for the Reports workload group.  I re-ran the query and this time it was only requesting 7GB of RAM.

The immediate problem was remediated, but it niggled at the back of my brain; why had the query requested that much memory in the first place?  Thursday morning I got another call.  This time, when I checked the system, a different application was requesting all of the available memory.  I could have set a restriction for that workload group, too, but this was just treating the symptoms.  There had to be a root cause.

I started searching the Great Gazoogle for known memory issues in SQL 2008, but found nothing that explained why queries were suddenly using a lot more memory than they did in SQL 2005.  Think.  Think.  Think.  What if Resource Governor is the problem?  So I ran a series of tests:  with Resource Governor disabled, with it enabled and the memory limit set to 40%, and enabled with the memory limit set to 100% (or unlimited).  I also ran it in our 2005 environment just for comparison purposes, and I monitored sys.dm_exec_query_memory_grants for each test.  Here are the very interesting results.

Again I say:  what the…?  With Resource Governor disabled the query only uses 4.8GB of memory.  Enabling Resource Governor actually appears to cause the query to request more memory.  It almost looks like, rather than being a limit, the memory grant % is telling the query how much to use (putting ideas into its head).  I know that can’t be the case, but that’s how it looks.  I thought maybe I’d misunderstood the settings, that maybe I’d mis-configured Resource Governor, but I don’t think I did.

Memory Grant %
Specify the maximum amount of memory that a single request can take from the pool. Range is 0 to 100.

And technically, that’s what SQL Server did.  I said “this group can use 100% of the memory,” and it gave that query 100% of the memory.  Or tried to, anyway.  But if it really needed all that memory, why doesn’t it ask for it with Resource Governor disabled?

So far I haven’t been able to find any bugs to explain this behavior.  And a post to SQLServerCentral’s forums hasn’t yielded any responses.  I guess for now I’m going to leave it disabled and keep searching for an explanation.  If anyone out there has any insight, please post a comment.  I’d really appreciate it.

April 22, 2011 | Features, Troubleshooting

Deadlocks and trace flag 1204

I recently got an email from yet another developer about an application that would sporadically die for no apparent reason.  The application had logic to retry automatically, and would then run without error.  Given the intermittent nature of the problem, and the fact that the process was terminating, rather than hanging or timing out, to me it smacked of deadlocking.  So I decided to turn on trace flags 1204 and 3605 to confirm my suspicions.

DBCC TRACEON (1204, 3605, -1)

Flag 1204 records deadlock information.  Flag 3605 sends it to the error log.  And the -1 isn’t a trace flag at all; it’s the parameter that enables the flags globally, for all sessions, not just the session running the command.
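A quick way to confirm they actually took effect globally (just a sanity-check sketch):

-- Status = 1 and Global = 1 means the flag is on for all sessions
DBCC TRACESTATUS (1204, 3605, -1);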

I enabled the trace flags and sure enough, deadlocks.

Once a deadlock occurs and the trace information is written to the errorlog, actually making sense of it can be tricky.  I’ve found that, rather than trying to view it through Management Studio, it’s easier to make a copy of the actual ERRORLOG file and open it in Notepad or Wordpad.  It makes the information much more readable.  Let’s take a look at a typical deadlock trace output.

2011-02-25 16:00:45.96 spid4s      Deadlock encountered .... Printing deadlock information
2011-02-25 16:00:45.96 spid4s      Wait-for graph
2011-02-25 16:00:45.96 spid4s
2011-02-25 16:00:45.96 spid4s      Node:1

2011-02-25 16:00:45.96 spid4s      PAGE: 75:1:5994649             CleanCnt:3 Mode:IX Flags: 0x2
2011-02-25 16:00:45.96 spid4s       Grant List 0:
2011-02-25 16:00:45.96 spid4s         Owner:0x48A1FCE0 Mode: IX       Flg:0x0 Ref:0 Life:02000000 SPID:981 ECID:0 XactLockInfo: 0x4173350C
2011-02-25 16:00:45.96 spid4s         SPID: 981 ECID: 0 Statement Type: UPDATE Line #: 1
2011-02-25 16:00:45.96 spid4s         Input Buf: Language Event: UPDATE MyTable SET MyPatApplication = NULL WHERE MyPatApplication = '' AND MySendNotice = -1
2011-02-25 16:00:45.96 spid4s       Requested By:
2011-02-25 16:00:45.96 spid4s         ResType:LockOwner Stype:'OR'Xdes:0x1E202E00 Mode: S SPID:1122 BatchID:0 ECID:0 TaskProxy:(0x196BC378) Value:0x4a7f5d80 Cost:(0/4232)
2011-02-25 16:00:45.96 spid4s
2011-02-25 16:00:45.96 spid4s      Node:2

2011-02-25 16:00:45.96 spid4s      OBJECT: 75:1886017850:0        CleanCnt:2 Mode:X Flags: 0x0
2011-02-25 16:00:45.96 spid4s       Grant List 1:
2011-02-25 16:00:45.96 spid4s         Owner:0x1EC887A0 Mode: X        Flg:0x0 Ref:22 Life:02000000 SPID:1122 ECID:0 XactLockInfo: 0x1E202E24
2011-02-25 16:00:45.97 spid4s         SPID: 1122 ECID: 0 Statement Type: SELECT Line #: 1
2011-02-25 16:00:45.97 spid4s         Input Buf: Language Event: SELECT MyTable.MyFolderID, CLName1, MDesc, MNarrative,  MArrangement FROM MyTable
2011-02-25 16:00:45.97 spid4s       Requested By:
2011-02-25 16:00:45.97 spid4s         ResType:LockOwner Stype:'OR'Xdes:0x417334E8 Mode: IX SPID:981 BatchID:0 ECID:0 TaskProxy:(0x4AD92378) Value:0x131f63c0 Cost:(0/28112)
2011-02-25 16:00:45.97 spid4s
2011-02-25 16:00:45.97 spid4s      Victim Resource Owner:
2011-02-25 16:00:45.97 spid4s       ResType:LockOwner Stype:'OR'Xdes:0x1E202E00 Mode: S SPID:1122 BatchID:0 ECID:0 TaskProxy:(0x196BC378) Value:0x4a7f5d80 Cost:(0/4232)

The first thing we see is that this deadlock occurred between 2 processes, Node 1 and Node 2.  Deadlocks can happen between more than 2 processes as well, and I’ll show an example of that a little later.  For now, we’ll focus on Node 1 of this deadlock.  I’m not going to go into every bit of information, but I’ll show you the juicy bits.

 

2011-02-25 16:00:45.96 spid4s      PAGE: 75:1:5994649             CleanCnt:3 Mode:IX Flags: 0x2
2011-02-25 16:00:45.96 spid4s       Grant List 0:
2011-02-25 16:00:45.96 spid4s         Owner:0x48A1FCE0 Mode: IX       Flg:0x0 Ref:0 Life:02000000 SPID:981 ECID:0 XactLockInfo: 0x4173350C
2011-02-25 16:00:45.96 spid4s         SPID: 981 ECID: 0 Statement Type: UPDATE Line #: 1
2011-02-25 16:00:45.96 spid4s         Input Buf: Language Event: UPDATE MyTable SET MyPatApplication = NULL WHERE MyPatApplication = '' AND MySendNotice = -1
2011-02-25 16:00:45.96 spid4s       Requested By:
2011-02-25 16:00:45.96 spid4s         ResType:LockOwner Stype:'OR'Xdes:0x1E202E00 Mode: S SPID:1122 BatchID:0 ECID:0 TaskProxy:(0x196BC378) Value:0x4a7f5d80 Cost:(0/4232)

In the first line, “PAGE: 75:1:5994649” is the resource this process had locked: a page lock on page 5994649 in file 1 of database ID 75.  We can figure out which object that page belongs to by querying sys.databases to find out which database is number 75, and then using the DBCC PAGE command, dbcc page (75, 1, 5994649, 3), to get the ID of the object on that page.  Then we query sys.objects to find the name of the object.  Also in the first line, we see that the lock Mode on this resource was “IX”, or intent exclusive.
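Spelled out, that lookup goes something like this (a sketch only; database 75 and the page number come from the trace above, and you plug the ObjectId from the DBCC PAGE header into the last step):

-- Which database is ID 75?
SELECT name FROM sys.databases WHERE database_id = 75;

-- Trace flag 3604 sends DBCC PAGE output to the client so you can see it
DBCC TRACEON (3604);
DBCC PAGE (75, 1, 5994649, 3);   -- look for "Metadata: ObjectId" in the page header

-- Then, in that database, translate the object ID into a name
-- SELECT name FROM sys.objects WHERE object_id = <ObjectId from DBCC PAGE>;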

In the fourth line, we can see that this Node was SPID 981, and it was performing an update statement.  On the next line we see the exact statement that SPID 981 was executing at the time.

Beginning with the 6th line, we see information about the process that was requesting a lock on the resource in question:

 

2011-02-25 16:00:45.96 spid4s       Requested By:
2011-02-25 16:00:45.96 spid4s         ResType:LockOwner Stype:'OR'Xdes:0x1E202E00 Mode: S SPID:1122 BatchID:0 ECID:0 TaskProxy:(0x196BC378) Value:0x4a7f5d80 Cost:(0/4232)

Here we see that SPID 1122 was requesting a shared lock on the same resource owned by SPID 981.

The information on Node 2 follows the same pattern:

2011-02-25 16:00:45.96 spid4s      OBJECT: 75:1886017850:0        CleanCnt:2 Mode:X Flags: 0x0
2011-02-25 16:00:45.96 spid4s       Grant List 1:
2011-02-25 16:00:45.96 spid4s         Owner:0x1EC887A0 Mode: X        Flg:0x0 Ref:22 Life:02000000 SPID:1122 ECID:0 XactLockInfo: 0x1E202E24
2011-02-25 16:00:45.97 spid4s         SPID: 1122 ECID: 0 Statement Type: SELECT Line #: 1
2011-02-25 16:00:45.97 spid4s         Input Buf: Language Event: SELECT MyTable.MyFolderID, CLName1, MDesc, MNarrative,  MArrangement FROM MyTable
2011-02-25 16:00:45.97 spid4s       Requested By:
2011-02-25 16:00:45.97 spid4s         ResType:LockOwner Stype:'OR'Xdes:0x417334E8 Mode: IX SPID:981 BatchID:0 ECID:0 TaskProxy:(0x4AD92378) Value:0x131f63c0 Cost:(0/28112)

SPID 1122 held an exclusive lock on OBJECT: 75:1886017850:0, which is object ID 1886017850 in database ID 75 (an object-level lock this time, rather than a page lock), while executing a SELECT statement.  And SPID 981 was requesting an intent-exclusive lock on that same resource.

In the final section:

 

2011-02-25 16:00:45.97 spid4s      Victim Resource Owner:
2011-02-25 16:00:45.97 spid4s       ResType:LockOwner Stype:'OR'Xdes:0x1E202E00 Mode: S SPID:1122 BatchID:0 ECID:0 TaskProxy:(0x196BC378) Value:0x4a7f5d80 Cost:(0/4232)

We see that SQL Server chose SPID 1122 to be the victim of this deadlock, terminating its batch and rolling back its transaction.  The victim is generally the session that is cheapest to roll back (unless a session has overridden that with DEADLOCK_PRIORITY), which is why a SELECT statement usually loses out to an INSERT/UPDATE/DELETE.
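Incidentally, if you ever need to influence that choice yourself, a session can volunteer to be the preferred victim; a minimal sketch:

-- Run this in the session you'd rather have chosen as the deadlock victim
SET DEADLOCK_PRIORITY LOW;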

Two-node deadlocks are pretty straightforward to read once you get the hang of it.  Deadlocks with more nodes can be confusing, but just follow the chain.  For example:

 

2011-02-25 16:00:45.95 spid4s      Deadlock encountered .... Printing deadlock information
2011-02-25 16:00:45.95 spid4s      Wait-for graph
2011-02-25 16:00:45.95 spid4s
2011-02-25 16:00:45.95 spid4s      Node:1

2011-02-25 16:00:45.95 spid4s      PAGE: 75:1:5994649             CleanCnt:3 Mode:IX Flags: 0x2
2011-02-25 16:00:45.95 spid4s       Grant List 0:
2011-02-25 16:00:45.95 spid4s         Owner:0x48A1FCE0 Mode: IX       Flg:0x0 Ref:0 Life:02000000 SPID:981 ECID:0 XactLockInfo: 0x4173350C
2011-02-25 16:00:45.95 spid4s         SPID: 981 ECID: 0 Statement Type: UPDATE Line #: 1
2011-02-25 16:00:45.95 spid4s         Input Buf: Language Event: UPDATE MyTable SET MyPatApplication = NULL WHERE MyPatApplication = '' AND MySendNotice = -1
2011-02-25 16:00:45.95 spid4s       Requested By:
2011-02-25 16:00:45.95 spid4s         ResType:LockOwner Stype:'OR'Xdes:0x1E202E00 Mode: S SPID:1122 BatchID:0 ECID:0 TaskProxy:(0x196BC378) Value:0x4a7f5d80 Cost:(0/4232)
2011-02-25 16:00:45.95 spid4s
2011-02-25 16:00:45.95 spid4s      Node:2

2011-02-25 16:00:45.95 spid4s      OBJECT: 75:1886017850:0        CleanCnt:3 Mode:X Flags: 0x0
2011-02-25 16:00:45.95 spid4s       Wait List:
2011-02-25 16:00:45.95 spid4s         Owner:0x53909C00 Mode: IS       Flg:0x2 Ref:1 Life:00000000 SPID:977 ECID:0 XactLockInfo: 0x44EA8794
2011-02-25 16:00:45.95 spid4s         SPID: 977 ECID: 0 Statement Type: SELECT Line #: 1
2011-02-25 16:00:45.95 spid4s         Input Buf: Language Event: (@P1 nvarchar(31))SELECT * FROM MyTable WHERE MyFolderID = @P1
2011-02-25 16:00:45.95 spid4s       Requested By:
2011-02-25 16:00:45.95 spid4s         ResType:LockOwner Stype:'OR'Xdes:0x417334E8 Mode: IX SPID:981 BatchID:0 ECID:0 TaskProxy:(0x4AD92378) Value:0x131f63c0 Cost:(0/28112)
2011-02-25 16:00:45.95 spid4s
2011-02-25 16:00:45.95 spid4s      Node:3

2011-02-25 16:00:45.95 spid4s      OBJECT: 75:1886017850:0        CleanCnt:3 Mode:X Flags: 0x0
2011-02-25 16:00:45.95 spid4s       Grant List 1:
2011-02-25 16:00:45.95 spid4s         Owner:0x1EC887A0 Mode: X        Flg:0x0 Ref:22 Life:02000000 SPID:1122 ECID:0 XactLockInfo: 0x1E202E24
2011-02-25 16:00:45.95 spid4s         SPID: 1122 ECID: 0 Statement Type: SELECT Line #: 1
2011-02-25 16:00:45.95 spid4s         Input Buf: Language Event: SELECT MyTable.MyFolderID, CLName1, MDesc, MNarrative,  MArrangement FROM MyTable
2011-02-25 16:00:45.95 spid4s       Requested By:
2011-02-25 16:00:45.95 spid4s         ResType:LockOwner Stype:'OR'Xdes:0x44EA8770 Mode: IS SPID:977 BatchID:0 ECID:0 TaskProxy:(0x294C2378) Value:0x53909c00 Cost:(0/0)
2011-02-25 16:00:45.95 spid4s
2011-02-25 16:00:45.95 spid4s      Victim Resource Owner:
2011-02-25 16:00:45.95 spid4s       ResType:LockOwner Stype:'OR'Xdes:0x44EA8770 Mode: IS SPID:977 BatchID:0 ECID:0 TaskProxy:(0x294C2378) Value:0x53909c00 Cost:(0/0)

Here SPID 981 held something that SPID 1122 wanted, SPID 1122 held something that SPID 977 wanted, and SPID 981 was in turn queued behind SPID 977’s waiting request on that same object: a three-way cycle.  As a result, SPID 977 was chosen as the victim.

In the end, the developer was able to eliminate a lot of deadlocks by adding NOLOCK to his select statements, and his end-users are much happier.

March 1, 2011 | Troubleshooting

Call me Ishmael

This week I’ve been chasing my own white whale.  Another developer came to me with this little problem.  A query he was trying to run over a linked server to an Oracle database was returning unexpected results.  Here’s his original query:

select count(order_num) from openquery(PROD, 'select t2.order_num
from myschema.mytable t2
where open_date < to_date(''01-JAN-2011'', ''DD-MON-YYYY'')
and nature = ''057''
and dept_code = ''7240''
and status_code like ''O%''')a

The resulting count was 200. However, running that same query directly on Oracle returned 658.  So I moved the count() function inside the remote query (which is a more efficient query anyway, but that’s beside the point).

select * from openquery(PROD, 'select count(order_num)
from myschema.mytable t2
where open_date < to_date(''01-JAN-2011'', ''DD-MON-YYYY'')
and nature = ''057''
and dept_code = ''7240''
and status_code like ''O%''')a 

The result?  658.  And running a simple select *, rather than count(*), also returns 658 records.  A little poking around on Oracle’s Metalink site netted a few bugs with the Oracle OLEDB provider, including one specifically related to count(*).  The bug description said it was 100% reproducible with no workaround.  And normally I might let it go at that and move on.  But the thing is, it’s not 100% reproducible.  If I run that same original query against the same table in our QA environment, I get the correct count.  In fact, if I run it against a different column in that same table in the Prod environment, I get the correct count.  Running test queries against many columns in several tables in each environment yielded mixed results, with seemingly no rhyme or reason behind which columns/tables returned a correct count, and which returned a count of 200.  Now, I realize that this is called a “bug” for a reason, but computers just don’t produce random and arbitrary results.  So I determined to figure out why.

I ran a 10046 event trace in Oracle along with a Profiler trace for both the “good” and “bad” scenarios.  In the “good” test, SQL Server requests rows (OLEDB Call Event) in batches of 100:

<inputs>
    <hReserved>0x0000000000000000</hReserved>
    <lRowsOffset>0</lRowsOffset>
    <cRows>100</cRows>

</inputs>

And you see an “OLEDB DataRead Event” for each row fetched:

<inputs>
    <hRow>0x0000000000000259</hRow>
    <hAccessor>0x000000001127D3E0</hAccessor>

</inputs>

For the last batch, it requests 100 records again, but only gets 4 back:

<hresult>265926</hresult>

<outputs>
    <pcRowsObtained>4</pcRowsObtained>
    <prghRows>
        <HROW>0x000000000000025A</HROW>
        <HROW>0x000000000000025B</HROW>
        <HROW>0x000000000000025C</HROW>
        <HROW>0x000000000000025D</HROW>
    </prghRows>

</outputs>

That hresult of 265926 means “Reached start or end of rowset or chapter.”  And that all corresponds with what we see in the Oracle trace file.  Each fetch is recorded with a line like this:

FETCH #3:c=0,e=2772,p=0,cr=73,cu=0,mis=0,r=100,dep=0,og=4,tim=18884103172270

where r=100 is the number of rows returned.  The final fetch shows r=4.  This is what we would expect to see. 

Now let’s look at the “bad” test.  Here we see no DataRead events in the Profiler trace.  The very last OLEDB Call Event

<inputs>
    <hReserved>0x0000000000000000</hReserved>
    <lRowsOffset>0</lRowsOffset>
    <cRows>100</cRows>

</inputs>

shows 99 records returned.

<hresult>265926</hresult>

<outputs>
    <pcRowsObtained>99</pcRowsObtained>
    <prghRows>
        <HROW>0x0000000000000066</HROW>
        […]
        <HROW>0x00000000000000C8</HROW>
    </prghRows>

</outputs>

But what’s different here is that the Oracle trace shows 100 records being returned in that last fetch.  So why did SQL Server only receive 99 records?  I thought perhaps an OLEDB trace would shed some light, but that’s like reading Sanskrit.

Further examination of the Profiler traces yielded one difference that seemed relevant.  From the “good” test:

<inputs>
    <cPropertyIDSets>1</cPropertyIDSets>
    <rgPropertyIDSets>
        <DBPROPIDSET>
            <cPropertyIDs>1</cPropertyIDs>
            <rgPropertyIDs>
                <DBPROPID>DBPROP_ISequentialStream</DBPROPID>
            </rgPropertyIDs>
            <guidPropertySet>DBPROPSET_ROWSET</guidPropertySet>
        </DBPROPIDSET>
    </rgPropertyIDSets>
</inputs>
next line:
<hresult>0</hresult>
<outputs>
    <pcPropertySets>1</pcPropertySets>
    <prgPropertySets>
        <DBPROPSET>
            <cProperties>1</cProperties>
            <guidPropertySet>DBPROPSET_ROWSET</guidPropertySet>
            <rgProperties>
                <DBPROP>
                    <dwPropertyID>DBPROP_ISequentialStream</dwPropertyID>
                    <dwOptions>0</dwOptions>
                    <dwStatus>0</dwStatus>
                    <colid>DB_NULLID</colid>
                    <vValue>
                        <VARIANT>
                            <vt>VT_BOOL</vt>
                            <boolVal>-1</boolVal>
                        </VARIANT>
                    </vValue>
                </DBPROP>
            </rgProperties>
        </DBPROPSET>
    </prgPropertySets>
</outputs>
next line:
<inputs>
    <riid>IID_IRowsetInfo</riid>
</inputs>
next line:
<hresult>0</hresult>

That DBPROP_ISequentialStream.  I’m thinking that’s the critical factor that makes the OLEDB provider handle the data differently in the “good” tests, and allows it to process the data correctly.  The only problem is:  I can’t prove it.  Yet.  So far the Great Gazoogle has provided me with little helpful information on this property and its impact.  But the search continues, if in a slightly less obsessed manner.  And hopefully things will work out better for me than they did for Ahab.

Incidentally, there does seem to be a workaround.  If, when you create the linked server, you include a FetchSize in the provider string, @provstr=N'FetchSize=500', with a FetchSize of 200 or more, this should get you the correct result.  At least, it did for me.
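For what it’s worth, here’s roughly what that looks like when defining the linked server.  This is only a sketch: the server name and TNS alias below are placeholders, and I’m assuming the OraOLEDB.Oracle provider.

-- Linked server to Oracle with an explicit FetchSize in the provider string
EXEC sp_addlinkedserver
	@server     = N'PROD',
	@srvproduct = N'Oracle',
	@provider   = N'OraOLEDB.Oracle',
	@datasrc    = N'PRODTNS',        -- hypothetical TNS alias
	@provstr    = N'FetchSize=500';  -- a FetchSize of 200 or more gave me the correct count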

February 25, 2011 | Troubleshooting

Heisenbug

From Wikipedia:

A heisenbug (named after the Heisenberg Uncertainty Principle) is a computer bug that disappears or alters its characteristics when an attempt is made to study it.

Yep, sounds about right. The timeout problem has not happened once since I put monitoring into place on Monday morning.

February 18, 2011 | Troubleshooting

The waiting is the hardest part

A developer came to me last week with a problem.  A relatively simple query within his application that normally takes under a second to run is intermittently taking upwards of 30 seconds.  This, in turn, is causing his application to time out.  He said he could increase the timeout on his application to work around the problem, but wanted to know if I could help fix it, first.

As I said, the query itself is a very simple one, joining 3 tables, the largest of which only contains about half a million records.  And the query only returns about a thousand rows.  So I started going through the usual suspects:

The Code:  The first thing I did was check out the query itself to ensure the developer wasn’t doing anything silly like applying functions to columns in the where clause, etc.  I also saw that he was using literal values, thereby ruling out parameter sniffing.

Blocking:  Given the intermittent nature of the problem, blocking initially looked like a viable cause.  However, the developer is using nolock on all tables in the query.  To make doubly sure, I used the blocked process threshold in conjunction with a server-side trace to track any blocking on the server longer than 10 seconds.  There was no blocking at all occurring in the related database. 
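For reference, enabling that threshold is just a couple of sp_configure calls (a sketch; the server-side trace or event session that captures the blocked process report is set up separately):

-- Report any block lasting longer than 10 seconds
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'blocked process threshold', 10;  -- seconds
RECONFIGURE;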

Statistics/Query plan:  Statistics are updated nightly on this database.  During the day, the data is static, so stats shouldn’t be the problem.  Another server-side trace confirmed that the query was using the exact same query plan for good runs as it did for bad runs. 

Waits:  That brings me to waits.  I decided to use sp_WhoIsActive to try to trap the problem when it happens and see what the process is waiting for.  The problem was actually catching it in the act.  Up to this point, by the time we were aware of the problem, it was gone.  To catch it, I’d need to have the procedure already running in the background, recording data. 

My first step was using the following script to create a table to hold the output.

DECLARE @wia_table VARCHAR(4000) ;
DECLARE @schema_out VARCHAR(4000) ;

SET @wia_table = 'DBA_Temp.dbo.WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112) ;

EXEC sp_WhoIsActive
   @get_transaction_info = 1,
   @get_plans = 1,
   @get_additional_info = 1,
   @get_task_info  = 2,
   @RETURN_SCHEMA = 1,
   @SCHEMA = @schema_out OUTPUT ;

SET @schema_out = REPLACE(@schema_out, '<table_name>', @wia_table) ;

EXEC(@schema_out) ;

The next part was a script to record the sp_WhoIsActive output for the application login every 10 seconds.  Given that the bad executions run over 30 seconds, I figured that every 10 seconds would be sufficient to catch the problem while minimizing overhead.  The script stops itself at 5pm.

DECLARE
    @wia_table VARCHAR(4000) ,
    @stoptime DATETIME

SET @wia_table = 'DBA_temp.dbo.WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112)
SET @stoptime = cast(CONVERT(VARCHAR, GETDATE(), 112) + ' 17:00:00'  as datetime )

WHILE GETDATE() < @stoptime
    BEGIN
      
       EXEC sp_WhoIsActive
           @get_transaction_info = 1,
           @get_plans = 1,
           @get_additional_info = 1,
           @get_task_info  = 2,
           @filter = 'my_user', @filter_type = 'login',        
           @destination_table = @wia_table;

       WAITFOR DELAY '00:00:10'

    END ;
GO
 

And finally, I created a job to run both of these scripts, scheduling it to run daily at 7am.  Between this job, the server-side trace, and the blocking trace, I think I have all my bases covered.  The only thing left now is to wait… (Get it?  Wait??)
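For completeness, the Agent job wrapper is nothing fancy; a rough sketch follows.  The job, step, and schedule names here are made up, and each @command would hold one of the two scripts above.

USE msdb;
GO
EXEC dbo.sp_add_job @job_name = N'WhoIsActive capture';

EXEC dbo.sp_add_jobstep
	@job_name  = N'WhoIsActive capture',
	@step_name = N'Create today''s output table',
	@subsystem = N'TSQL',
	@command   = N'/* script 1: build the destination table */',
	@on_success_action = 3;   -- go to the next step

EXEC dbo.sp_add_jobstep
	@job_name  = N'WhoIsActive capture',
	@step_name = N'Collect every 10 seconds until 5pm',
	@subsystem = N'TSQL',
	@command   = N'/* script 2: the WHILE loop */';

EXEC dbo.sp_add_jobschedule
	@job_name  = N'WhoIsActive capture',
	@name      = N'Daily 7am',
	@freq_type = 4,               -- daily
	@freq_interval = 1,
	@active_start_time = 70000;   -- 07:00:00

EXEC dbo.sp_add_jobserver @job_name = N'WhoIsActive capture';
GO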

February 15, 2011 | Troubleshooting