Cleveland DBA

SQL Server administration in Cleveland, OH

Aha!

I found the answer to my Resource Governor quandary in this outstanding article, Understanding SQL Server Memory Grant, on MSDN.  The juicy bit:

The server calculates the memory limit for one query. By default, this is 25% (20% on 32bit SQL 2005) of total query memory (which is set by memory broker as about 90% of server memory). This per-query limit helps to prevent one query from dominating the whole server. This percentage is configurable on SQL 2008.
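To put rough numbers on that: on a box with, say, 40GB of memory for SQL Server, total query memory would be about 90% of that, roughly 36GB, and the default 25% per-query cap would work out to somewhere around 9GB for any single query.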

And how is it configurable, you ask?  By using Resource Governor to override it.
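In SQL 2008 terms, the override is just the REQUEST_MAX_MEMORY_GRANT_PERCENT setting on a workload group.  A minimal sketch, using a hypothetical Reports group:

-- Raise the per-query grant cap from the default 25% for one workload group
ALTER WORKLOAD GROUP Reports
WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 50) ;

-- Changes don't take effect until a reconfigure
ALTER RESOURCE GOVERNOR RECONFIGURE ;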

Really though, read the article, it’s extremely informative.  I’d like to give the author a great big hug.

April 25, 2011 Posted by | Features, Troubleshooting | , , | 1 Comment

Resource Governor causing memory problem??

So.  Last week I blogged about the experiment I was running with Resource Governor for one of our newly upgraded development databases.  I had enabled Resource Governor and created a classifier function to categorize sessions, but I hadn’t set any limits and everyone was sharing the same default pool.  The plan, if you’ll recall, was to monitor the Resource Governor stats to get a feel for the resources different apps were using.
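For anyone curious what that setup looks like, here’s a minimal sketch along those lines — the group and function names are illustrative, not my actual code:

-- Workload group in the default pool, so no limits are actually imposed yet
CREATE WORKLOAD GROUP Reports USING [default] ;
GO

-- Classifier functions live in master and must be schema-bound
CREATE FUNCTION dbo.rg_classifier ()
RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @grp sysname = N'default' ;
    IF APP_NAME() LIKE N'%Report%'
        SET @grp = N'Reports' ;
    RETURN @grp ;
END ;
GO

-- Hook up the classifier and turn Resource Governor on
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rg_classifier) ;
ALTER RESOURCE GOVERNOR RECONFIGURE ;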

This week, users have been doing a lot of testing in this environment.  A developer called me on Wednesday morning and said that users were complaining of very slow performance.  So I took a look and saw several sessions waiting on RESOURCE_SEMAPHORE.  Now, I’ve been supporting this application’s database for years and I’ve never seen any memory waits, so this got my attention quickly.  One of the sessions running was one of the reports that I’d enabled Resource Governor to address in the first place.  A look at sys.dm_exec_query_memory_grants told me this query was requesting almost 19GB of memory (or, put another way, all of the available memory).  What the…?
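The check itself is nothing fancy; something along these lines shows who is asking for (or waiting on) big grants:

-- Current memory grants, biggest requests first
SELECT  session_id ,
        requested_memory_kb / 1024 AS requested_mb ,
        granted_memory_kb / 1024   AS granted_mb ,
        ideal_memory_kb / 1024     AS ideal_mb ,
        wait_time_ms ,
        query_cost
FROM    sys.dm_exec_query_memory_grants
ORDER BY requested_memory_kb DESC ;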

I captured the query (and asked the user to kill his report) and ran it in our production (SQL 2005) database.  In that environment it was only requesting 7GB of memory.  Still a lot, but doable in a system with 40GB available.  The query plans were slightly different between the two environments, but updating stats didn’t change that at all.  I decided to use Resource Governor as it was intended, and set a 40% memory grant limit for the Reports workload group.  I re-ran the query and this time it only requested 7GB of RAM.

The immediate problem was remediated, but it niggled at the back of my brain; why had the query requested that much memory in the first place?  Thursday morning I got another call.  This time, when I checked the system, a different application was requesting all of the available memory.  I could have set a restriction for that workload group, too, but this was just treating the symptoms.  There had to be a root cause.

I started searching the Great Gazoogle for known memory issues in SQL 2008, but found nothing that explained why queries were suddenly using a lot more memory than they did in SQL 2005.  Think.  Think.  Think.  What if Resource Governor is the problem?  So I ran a series of tests: with Resource Governor disabled, with it enabled and the memory limit set to 40%, and with it enabled and the limit set to 100% (effectively unlimited).  I also ran the query in our 2005 environment just for comparison purposes, and I monitored sys.dm_exec_query_memory_grants during each test.
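Switching between those configurations amounts to nothing more than this (again, the Reports group name is just the one the query happens to land in):

-- Test 1: Resource Governor off entirely
ALTER RESOURCE GOVERNOR DISABLE ;

-- Test 2: enabled, per-query grant capped at 40% for the group
ALTER WORKLOAD GROUP Reports WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 40) ;
ALTER RESOURCE GOVERNOR RECONFIGURE ;   -- RECONFIGURE also re-enables it

-- Test 3: enabled, cap raised to 100% (effectively no per-query limit)
ALTER WORKLOAD GROUP Reports WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 100) ;
ALTER RESOURCE GOVERNOR RECONFIGURE ;

Here are the very interesting results.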

Again I say:  what the…?  With Resource Governor disabled the query only uses 4.8GB of memory.  Enabling Resource Governor actually appears to cause the query to request more memory.  It almost looks like, rather than being a limit, the memory grant % is telling the query how much to use (putting ideas into its head).  I know that can’t be the case, but that’s how it looks.  I thought maybe I’d misunderstood the settings, that maybe I’d misconfigured Resource Governor, but I don’t think I did.

Memory Grant %
Specify the maximum amount of memory that a single request can take from the pool. Range is 0 to 100.

And technically, that’s what SQL Server did.  I said “this group can use 100% of the memory,” and it gave that query 100% of the memory.  Or tried to, anyway.  But if it really needed all that memory, why doesn’t it ask for it with Resource Governor disabled?

So far I haven’t been able to find any bugs to explain this behavior.  And a post to SQLServerCentral’s forums hasn’t yielded any responses.  I guess for now I’m going to leave it disabled and keep searching for an explanation.  If anyone out there has any insight, please post a comment.  I’d really appreciate it.

April 22, 2011 Posted by | Features, Troubleshooting | , , , | 1 Comment

Call me Ishmael

This week I’ve been chasing my own white whale.  Another developer came to me with this little problem.  A query he was trying to run over a linked server to an Oracle database was returning unexpected results.  Here’s his original query:

select count(order_num) from openquery(PROD, 'select t2.order_num
from myschema.mytable t2
where open_date < to_date(''01-JAN-2011'', ''DD-MON-YYYY'')
and nature = ''057''
and dept_code = ''7240''
and status_code like ''O%''')a

The resulting count was 200.  However, running that same query directly on Oracle returned 658.  So I moved the count() function inside the remote query (which is a more efficient query anyway, but that’s beside the point).

select * from openquery(PROD, 'select count(order_num)
from myschema.mytable t2
where open_date < to_date(''01-JAN-2011'', ''DD-MON-YYYY'')
and nature = ''057''
and dept_code = ''7240''
and status_code like ''O%''')a 

The result?  658.  And running a simple select *, rather than count(*), also returns 658 records.  A little poking around on Oracle’s Metalink site netted a few bugs with the Oracle OLEDB provider, including one specifically related to count(*).  The bug description said it was 100% reproducible with no workaround.  And normally I might let it go at that and move on.  But the thing is, it’s not 100% reproducible.  If I run that same original query against the same table in our QA environment, I get the correct count.  In fact, if I run it against a different column in that same table in the Prod environment, I get the correct count.  Running test queries against many columns in several tables in each environment yielded mixed results, with seemingly no rhyme or reason behind which columns/tables returned a correct count, and which returned a count of 200.  Now, I realize that this is called a “bug” for a reason, but computers just don’t produce random and arbitrary results.  So I set out to figure out why.

I ran a 10046 event trace in Oracle along with a Profiler trace for both the “good” and “bad” scenarios.  In the “good” test, SQL Server requests rows (OLEDB Call Event) in batches of 100:

<inputs>
    <hReserved>0x0000000000000000</hReserved>
    <lRowsOffset>0</lRowsOffset>
    <cRows>100</cRows>

</inputs>

And you see an “OLEDB DataRead Event” for each row fetched:

<inputs>
    <hRow>0x0000000000000259</hRow>
    <hAccessor>0x000000001127D3E0</hAccessor>

</inputs>

For the last batch, it requests 100 records again, but only gets 4 back:

<hresult>265926</hresult>

<outputs>
    <pcRowsObtained>4</pcRowsObtained>
    <prghRows>
        <HROW>0x000000000000025A</HROW>
        <HROW>0x000000000000025B</HROW>
        <HROW>0x000000000000025C</HROW>
        <HROW>0x000000000000025D</HROW>
    </prghRows>

</outputs>

That hresult of 265926 means “Reached start or end of rowset or chapter.”  And that all corresponds with what we see in the Oracle trace file.  Each fetch is recorded with a line like this:

FETCH #3:c=0,e=2772,p=0,cr=73,cu=0,mis=0,r=100,dep=0,og=4,tim=18884103172270

where r=100 is the number of rows returned.  The final fetch shows r=4.  This is what we would expect to see. 

Now let’s look at the “bad” test.  Here we see no DataRead events in the Profiler trace.  The very last OLEDB Call Event

<inputs>
    <hReserved>0x0000000000000000</hReserved>
    <lRowsOffset>0</lRowsOffset>
    <cRows>100</cRows>

</inputs>

shows 99 records returned.

<hresult>265926</hresult>

<outputs>
    <pcRowsObtained>99</pcRowsObtained>
    <prghRows>
        <HROW>0x0000000000000066</HROW>
        […]
        <HROW>0x00000000000000C8</HROW>
    </prghRows>

</outputs>

But what’s different here is that the Oracle trace shows 100 records being returned in that last fetch.  So why did SQL Server only receive 99 records?  I thought perhaps an OLEDB trace would shed some light, but that’s like reading Sanskrit.

Further examination of the Profiler traces yielded one difference that seemed relevant.  From the “good” test:

<inputs>
    <cPropertyIDSets>1</cPropertyIDSets>
    <rgPropertyIDSets>
        <DBPROPIDSET>
            <cPropertyIDs>1</cPropertyIDs>
            <rgPropertyIDs>
                <DBPROPID>DBPROP_ISequentialStream</DBPROPID>
            </rgPropertyIDs>
            <guidPropertySet>DBPROPSET_ROWSET</guidPropertySet>
        </DBPROPIDSET>
    </rgPropertyIDSets>
</inputs>
next line:
<hresult>0</hresult>
<outputs>
    <pcPropertySets>1</pcPropertySets>
    <prgPropertySets>
        <DBPROPSET>
            <cProperties>1</cProperties>
            <guidPropertySet>DBPROPSET_ROWSET</guidPropertySet>
            <rgProperties>
                <DBPROP>
                    <dwPropertyID>DBPROP_ISequentialStream</dwPropertyID>
                    <dwOptions>0</dwOptions>
                    <dwStatus>0</dwStatus>
                    <colid>DB_NULLID</colid>
                    <vValue>
                        <VARIANT>
                            <vt>VT_BOOL</vt>
                            <boolVal>-1</boolVal>
                        </VARIANT>
                    </vValue>
                </DBPROP>
            </rgProperties>
        </DBPROPSET>
    </prgPropertySets>
</outputs>
next line:
<inputs>
    <riid>IID_IRowsetInfo</riid>
</inputs>
next line:
<hresult>0</hresult>

That DBPROP_ISequentialStream.  I’m thinking that’s the critical factor that makes the OLEDB provider handle the data differently in the “good” tests, and allows it to process the data correctly.  The only problem is:  I can’t prove it.  Yet.  So far the Great Gazoogle has provided me with little helpful information on this property and its impact.  But the search continues, if in a slightly less obsessed manner.  And hopefully things will work out better for me than they did for Ahab.

Incidentally, there does seem to be a workaround.  If, when you create the linked server, you include a FetchSize in the provider string, @provstr=N'FetchSize=500', with the FetchSize being 200 or more, this should get you the correct result.  At least, it did for me.
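For illustration, the linked server definition would look roughly like this — the TNS alias and FetchSize value are placeholders, not necessarily what you’d want in production:

-- Oracle linked server using the Oracle OLEDB provider, with FetchSize in the provider string
EXEC sp_addlinkedserver
    @server     = N'PROD' ,
    @srvproduct = N'Oracle' ,
    @provider   = N'OraOLEDB.Oracle' ,
    @datasrc    = N'PRODTNS' ,          -- TNS alias (placeholder)
    @provstr    = N'FetchSize=500' ;    -- rows fetched per round trip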

February 25, 2011 Posted by | Troubleshooting | , , , , , | Leave a comment

Heisenbug

From Wikipedia:

A heisenbug (named after the Heisenberg Uncertainty Principle) is a computer bug that disappears or alters its characteristics when an attempt is made to study it.

Yep, sounds about right. The timeout problem has not happened once since I put monitoring into place on Monday morning.

February 18, 2011 Posted by | Troubleshooting | , | Leave a comment

The waiting is the hardest part

A developer came to me last week with a problem.  A relatively simple query within his application that normally takes under a second to run is intermittently taking upwards of 30 seconds.  This, in turn, is causing his application to time out.  He said he could increase the timeout on his application to work around the problem, but wanted to know if I could help fix it first.

As I said, the query itself is a very simple one, joining 3 tables, the largest of which contains only about half a million records.  And it only returns about a thousand rows.  So I started going through the usual suspects:

The Code:  The first thing I did was check out the query itself to ensure the developer wasn’t doing anything silly like applying functions to columns in the where clause, etc.  I also saw that he was using literal values, thereby ruling out parameter sniffing.
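Just to illustrate the sort of thing I mean by “silly” (the table and dates below are made up):

-- Non-sargable: wrapping the column in a function means an index on open_date can't be seeked
SELECT  order_num
FROM    dbo.orders
WHERE   YEAR(open_date) = 2010 ;

-- Sargable rewrite: same logic, but the column is left alone
SELECT  order_num
FROM    dbo.orders
WHERE   open_date >= '20100101'
  AND   open_date <  '20110101' ;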

Blocking:  Given the intermittent nature of the problem, blocking initially looked like a viable cause.  However, the developer is using nolock on all tables in the query.  To make doubly sure, I used the blocked process threshold in conjunction with a server-side trace to track any blocking on the server longer than 10 seconds.  There was no blocking at all occurring in the related database. 
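For reference, the threshold itself is just an sp_configure setting (10 seconds here, to match the trace):

-- 'blocked process threshold' is an advanced option
EXEC sp_configure 'show advanced options', 1 ;
RECONFIGURE ;

-- Fire a blocked process report for anything blocked longer than 10 seconds
EXEC sp_configure 'blocked process threshold', 10 ;
RECONFIGURE ;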

Statistics/Query plan:  Statistics are updated nightly on this database.  During the day, the data is static, so stats shouldn’t be the problem.  Another server-side trace confirmed that the query was using the exact same query plan for good runs as it did for bad runs. 

Waits:  That brings me to waits.  I decided to use sp_WhoIsActive to try to trap the problem when it happens and see what the process is waiting for.  The problem was actually catching it in the act.  Up to this point, by the time we were aware of the problem, it was gone.  To catch it, I’d need to have the procedure already running in the background, recording data. 

My first step was using the following script to create a table to hold the output.

DECLARE @wia_table VARCHAR(4000) ;
DECLARE @schema_out VARCHAR(4000) ;

SET @wia_table = 'DBA_Temp.dbo.WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112) ;

EXEC sp_WhoIsActive
   @get_transaction_info = 1,
   @get_plans = 1,
   @get_additional_info = 1,
   @get_task_info  = 2,
   @RETURN_SCHEMA = 1,
   @SCHEMA = @schema_out OUTPUT ;

SET @schema_out = REPLACE(@schema_out, '<table_name>', @wia_table) ;

EXEC(@schema_out) ;

The next part was a script to record the sp_WhoIsActive output for the application login every 10 seconds.  Given that the bad executions run over 30 seconds, I figured that every 10 seconds would be sufficient to catch the problem and minimize overhead.  The script ends itself at 5pm.

DECLARE
    @wia_table VARCHAR(4000) ,
    @stoptime DATETIME

SET @wia_table = 'DBA_temp.dbo.WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112)
SET @stoptime = cast(CONVERT(VARCHAR, GETDATE(), 112) + ' 17:00:00'  as datetime )

WHILE GETDATE() < @stoptime
    BEGIN
      
       EXEC sp_WhoIsActive
           @get_transaction_info = 1,
           @get_plans = 1,
           @get_additional_info = 1,
           @get_task_info  = 2,
           @filter = 'my_user', @filter_type = 'login',        
           @destination_table = @wia_table;

       WAITFOR DELAY '00:00:10'

    END ;
GO
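Once data starts landing in the table, checking what the query was waiting on is just a matter of something like this (the date suffix is whatever day you’re interested in):

SELECT  collection_time ,
        session_id ,
        wait_info ,
        sql_text
FROM    DBA_Temp.dbo.WhoIsActive_20110215
WHERE   wait_info IS NOT NULL
ORDER BY collection_time ;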
 

And finally, I created a job to run both of these scripts, scheduling it to run daily at 7am.  Between this job, the server-side trace, and the blocking trace, I think I have all my bases covered.  The only thing left now is to wait… (Get it?  Wait??)

February 15, 2011 Posted by | Troubleshooting | , | 2 Comments