Cleveland DBA

SQL Server administration in Cleveland, OH

Call me Ishmael

This week I’ve been chasing my own white whale.  Another developer came to me with this little problem.  A query he was trying to run over a linked server to an Oracle database was returning unexpected results.  Here’s his original query:

select count(order_num) from openquery(PROD, 'select t2.order_num
from myschema.mytable t2
where open_date < to_date(''01-JAN-2011'', ''DD-MON-YYYY'')
and nature = ''057''
and dept_code = ''7240''
and status_code like ''O%''')a

The resulting count was 200. However, running that same query directly on Oracle returned 658. So I moved the count() function inside the remote query (which is more efficient anyway, but that’s beside the point).

select * from openquery(PROD, 'select count(order_num)
from myschema.mytable t2
where open_date < to_date(''01-JAN-2011'', ''DD-MON-YYYY'')
and nature = ''057''
and dept_code = ''7240''
and status_code like ''O%''')a 

The result? 658. And running a simple select *, rather than a count(*), also returns 658 records. A little poking around on Oracle’s Metalink site netted a few bugs with the Oracle OLEDB provider, including one specifically related to count(*). The bug description said it was 100% reproducible with no workaround. And normally I might let it go at that and move on. But the thing is, it’s not 100% reproducible. If I run that same original query against the same table in our QA environment, I get the correct count. In fact, if I run it against a different column in that same table in the Prod environment, I get the correct count. Running test queries against many columns in several tables in each environment yielded mixed results, with seemingly no rhyme or reason behind which columns/tables returned a correct count and which returned a count of 200. Now, I realize that this is called a “bug” for a reason, but computers just don’t produce random and arbitrary results. So I determined to figure out why.

I ran a 10046 event trace in Oracle along with a Profiler trace for both the “good” and “bad” scenarios. In the “good” test, SQL Server requests rows (OLEDB Call Event) in batches of 100:

<inputs>
    <hReserved>0x0000000000000000</hReserved>
    <lRowsOffset>0</lRowsOffset>
    <cRows>100</cRows>

</inputs>

And you see an “OLEDB DataRead Event” for each row fetched:

<inputs>
    <hRow>0x0000000000000259</hRow>
    <hAccessor>0x000000001127D3E0</hAccessor>

</inputs>

For the last batch, it requests 100 records again, but only gets 4 back:

<hresult>265926</hresult>

<outputs>
    <pcRowsObtained>4</pcRowsObtained>
    <prghRows>
        <HROW>0x000000000000025A</HROW>
        <HROW>0x000000000000025B</HROW>
        <HROW>0x000000000000025C</HROW>
        <HROW>0x000000000000025D</HROW>
    </prghRows>

</outputs>

That hresult of 265926 means “Reached start or end of rowset or chapter.” And that all corresponds with what we see in the Oracle trace file. Each fetch is recorded with a line like this:

FETCH #3:c=0,e=2772,p=0,cr=73,cu=0,mis=0,r=100,dep=0,og=4,tim=18884103172270

where r=100 is the number of rows returned.  The final fetch shows r=4.  This is what we would expect to see. 

Now let’s look at the “bad” test. Here we see no DataRead events in the Profiler trace. The very last OLEDB Call Event

<inputs>
    <hReserved>0x0000000000000000</hReserved>
    <lRowsOffset>0</lRowsOffset>
    <cRows>100</cRows>

</inputs>

shows 99 records returned.

<hresult>265926</hresult>

<outputs>
    <pcRowsObtained>99</pcRowsObtained>
    <prghRows>
        <HROW>0x0000000000000066</HROW>
        […]
        <HROW>0x00000000000000C8</HROW>
    </prghRows>

</outputs>

But what’s different here is that the Oracle trace shows 100 records being returned in that last fetch. So why did SQL Server only receive 99 records? I thought perhaps an OLEDB trace would shed some light, but that’s like reading Sanskrit.

Further examination of the Profiler traces yielded one difference that seemed relevant.  From the “good” test:

<inputs>
    <cPropertyIDSets>1</cPropertyIDSets>
    <rgPropertyIDSets>
        <DBPROPIDSET>
            <cPropertyIDs>1</cPropertyIDs>
            <rgPropertyIDs>
                <DBPROPID>DBPROP_ISequentialStream</DBPROPID>
            </rgPropertyIDs>
            <guidPropertySet>DBPROPSET_ROWSET</guidPropertySet>
        </DBPROPIDSET>
    </rgPropertyIDSets>
</inputs>
next line:
<hresult>0</hresult>
<outputs>
    <pcPropertySets>1</pcPropertySets>
    <prgPropertySets>
        <DBPROPSET>
            <cProperties>1</cProperties>
            <guidPropertySet>DBPROPSET_ROWSET</guidPropertySet>
            <rgProperties>
                <DBPROP>
                    <dwPropertyID>DBPROP_ISequentialStream</dwPropertyID>
                    <dwOptions>0</dwOptions>
                    <dwStatus>0</dwStatus>
                    <colid>DB_NULLID</colid>
                    <vValue>
                        <VARIANT>
                            <vt>VT_BOOL</vt>
                            <boolVal>-1</boolVal>
                        </VARIANT>
                    </vValue>
                </DBPROP>
            </rgProperties>
        </DBPROPSET>
    </prgPropertySets>
</outputs>
next line:
<inputs>
    <riid>IID_IRowsetInfo</riid>
</inputs>
next line:
<hresult>0</hresult>

That DBPROP_ISequentialStream property. I’m thinking that’s the critical factor that makes the OLEDB provider handle the data differently in the “good” tests and allows it to process the data correctly. The only problem is: I can’t prove it. Yet. So far the Great Gazoogle has provided me with little helpful information on this property and its impact. But the search continues, if in a slightly less obsessed manner. And hopefully things will work out better for me than they did for Ahab.

Incidentally, there does seem to be a workaround. If you include a FetchSize of 200 or more in the provider string when you create the linked server, e.g. @provstr=N'FetchSize=500', you should get the correct result. At least, I did.
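For reference, here’s a rough sketch of creating a linked server that way with sp_addlinkedserver. The provider and data source shown are assumptions (the Oracle OLEDB provider and a made-up TNS alias); the relevant piece is the FetchSize in @provstr.

EXEC sp_addlinkedserver
    @server     = N'PROD',
    @srvproduct = N'Oracle',
    @provider   = N'OraOLEDB.Oracle',    -- assuming the Oracle OLEDB provider
    @datasrc    = N'PRODDB',             -- hypothetical TNS alias
    @provstr    = N'FetchSize=500';      -- 200 or more returned the correct count for me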

February 25, 2011 | Troubleshooting

Heisenbug

From Wikipedia:

A heisenbug (named after the Heisenberg Uncertainty Principle) is a computer bug that disappears or alters its characteristics when an attempt is made to study it.

Yep, sounds about right. The timeout problem has not happened once since I put monitoring into place on Monday morning.

February 18, 2011 | Troubleshooting

The waiting is the hardest part

A developer came to me last week with a problem. A relatively simple query within his application, one that normally takes under a second to run, is intermittently taking upwards of 30 seconds. This, in turn, is causing his application to time out. He said he could increase the timeout on his application to work around the problem, but wanted to know if I could help fix it first.

As I said, the query itself is a very simple one, joining 3 tables, the largest of which only contains about half a million records. And the query itself only returns about a thousand rows. So I started going through the usual suspects:

The Code:  The first thing I did was check out the query itself to ensure the developer wasn’t doing anything silly like applying functions to columns in the where clause, etc.  I also saw that he was using literal values, thereby ruling out parameter sniffing.

Blocking:  Given the intermittent nature of the problem, blocking initially looked like a viable cause. However, the developer is using nolock on all tables in the query. To make doubly sure, I used the blocked process threshold in conjunction with a server-side trace to track any blocking on the server lasting longer than 10 seconds. There was no blocking at all occurring in the related database.
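For anyone who hasn’t used it, the threshold itself is just an sp_configure setting. A minimal sketch (the accompanying server-side trace that captures the Blocked process report event isn’t shown here):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'blocked process threshold', 10;   -- seconds; any block held this long fires a Blocked process report event
RECONFIGURE;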

Statistics/Query plan:  Statistics are updated nightly on this database.  During the day, the data is static, so stats shouldn’t be the problem.  Another server-side trace confirmed that the query was using the exact same query plan for good runs as it did for bad runs. 

Waits:  That brings me to waits.  I decided to use sp_WhoIsActive to try to trap the problem when it happens and see what the process is waiting for.  The problem was actually catching it in the act.  Up to this point, by the time we were aware of the problem, it was gone.  To catch it, I’d need to have the procedure already running in the background, recording data. 

My first step was using the following script to create a table to hold the output.

DECLARE @wia_table VARCHAR(4000) ;
DECLARE @schema_out VARCHAR(4000) ;

-- Dated table name, e.g. DBA_Temp.dbo.WhoIsActive_20110215
SET @wia_table = 'DBA_Temp.dbo.WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112) ;

-- Ask sp_WhoIsActive for the CREATE TABLE schema that matches these options
EXEC sp_WhoIsActive
   @get_transaction_info = 1,
   @get_plans = 1,
   @get_additional_info = 1,
   @get_task_info  = 2,
   @RETURN_SCHEMA = 1,
   @SCHEMA = @schema_out OUTPUT ;

-- Substitute the real table name into the schema and create the table
SET @schema_out = REPLACE(@schema_out, '<table_name>', @wia_table) ;

EXEC(@schema_out) ;

The next part was a script to record the sp_WhoIsActive output for the application login every 10 seconds.  Given that the bad executions run over 30 seconds, I figured that every 10 seconds would be sufficient to catch the problem and minimize overhead.  The script ends itself at 5pm.

DECLARE
    @wia_table VARCHAR(4000) ,
    @stoptime DATETIME

SET @wia_table = 'DBA_temp.dbo.WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112)
SET @stoptime = cast(CONVERT(VARCHAR, GETDATE(), 112) + ' 17:00:00'  as datetime )

WHILE GETDATE() < @stoptime
    BEGIN
      
       EXEC sp_WhoIsActive
           @get_transaction_info = 1,
           @get_plans = 1,
           @get_additional_info = 1,
           @get_task_info  = 2,
           @filter = 'my_user', @filter_type = 'login',        
           @destination_table = @wia_table;

       WAITFOR DELAY '00:00:10'

    END ;
GO
 

And finally, I created a job to run both of these scripts, scheduling it to run daily at 7am.  Between this job, the server-side trace, and the blocking trace, I think I have all my bases covered.  The only thing left now is to wait… (Get it?  Wait??)
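While we wait, here’s a hypothetical sketch of what that Agent job could look like. The job, step, and schedule names are made up, and each @command placeholder would hold one of the two scripts above.

USE msdb;
GO
EXEC sp_add_job         @job_name = N'WhoIsActive capture';
EXEC sp_add_jobstep     @job_name = N'WhoIsActive capture',
                        @step_name = N'Create daily output table',
                        @subsystem = N'TSQL',
                        @command = N'/* script 1 above */',
                        @on_success_action = 3;      -- go to the next step
EXEC sp_add_jobstep     @job_name = N'WhoIsActive capture',
                        @step_name = N'Poll sp_WhoIsActive until 5pm',
                        @subsystem = N'TSQL',
                        @command = N'/* script 2 above */';
EXEC sp_add_jobschedule @job_name = N'WhoIsActive capture',
                        @name = N'Daily 7am',
                        @freq_type = 4,              -- daily
                        @freq_interval = 1,
                        @active_start_time = 070000; -- 7:00 AM
EXEC sp_add_jobserver   @job_name = N'WhoIsActive capture';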

February 15, 2011 | Troubleshooting

Plan guides and parameterization

We’ve all had this experience. A third-party application is running a piece of SQL that simply isn’t choosing the best query plan. If only it would use a hash join instead of a nested loop. Or maybe it’s blocking other processes and a nolock hint would help. The problem is, you can’t change the code.

Or can you?

SQL Server plan guides effectively allow you to optimize performance of queries you can’t modify directly by attaching hints to them. Let’s say you have an application running the following query:

[Screenshot: the original query, the same statement used in Step 1 below]

And after some testing, you decide it would run much better if it used MERGE JOINs.  So you add a hint.

[Screenshot: the same query with an OPTION (MERGE JOIN) hint]
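In text form (reconstructing from the Step 1 call below), the hinted statement would look like this:

SELECT E.EmployeeID, C.LastName, C.FirstName, D.GroupName, E.Title, P.PayFrequency, P.Rate
FROM
    HumanResources.Employee E
    INNER JOIN Person.Contact C ON E.ContactID = C.ContactID
    INNER JOIN HumanResources.EmployeePayHistory P ON E.EmployeeID = P.EmployeeID
        AND P.RateChangeDate =
            (SELECT MAX(RateChangeDate) FROM HumanResources.EmployeePayHistory P2
                WHERE P2.EmployeeID = P.EmployeeID)
    INNER JOIN HumanResources.EmployeeDepartmentHistory DH ON E.EmployeeID = DH.EmployeeID
        AND DH.EndDate IS NULL
    INNER JOIN HumanResources.Department D ON D.DepartmentID = DH.DepartmentID
WHERE
    D.GroupName = 'Manufacturing'
    AND E.CurrentFlag = 1
OPTION (MERGE JOIN)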

That’s all well and good, but you can’t go into the application to add that hint.  And then there’s the added complication of literal values in the SQL instead of variables.  So the optimizer will see every execution of this query for ‘Manufacturing’ as completely different from an execution for ‘Quality Assurance’.  In order for one plan guide to cover any value, you’ll have to parameterize the query.  Now, you could force parameterization at the database level, but that’s a mighty big hammer for this little nail.  So instead, we’ll create a plan guide in 3 steps.

  • Step 1: We’ll use sp_get_query_template to generate a parameterized version of this query
  • Step 2: We’ll force parameterization of any query that matches this form.  
  • Step 3: We can create the plan guide. 

Sound complicated?  It’s not.  Look:

/*    Step 1: Create a parameterized version of the query. */

DECLARE @stmt nvarchar(max)
DECLARE @params nvarchar(max)
EXEC sp_get_query_template
N'SELECT E.EmployeeID, C.LastName, C.FirstName, D.GroupName, E.Title, P.PayFrequency, P.Rate
FROM 
    HumanResources.Employee E 
    inner join Person.Contact C on E.ContactID = C.ContactID
    inner join HumanResources.EmployeePayHistory P on E.EmployeeID = p.EmployeeID
            AND P.RateChangeDate = 
                (SELECT MAX(RateChangeDate) FROM HumanResources.EmployeePayHistory P2
                    WHERE P2.EmployeeID = P.EmployeeID)
    inner join HumanResources.EmployeeDepartmentHistory DH on E.EmployeeID = DH.EmployeeID
            AND DH.EndDate IS NULL
    inner join HumanResources.Department D on D.DepartmentID = DH.DepartmentID
WHERE
    D.GroupName = ''Manufacturing''    AND
    E.CurrentFlag = 1
',
@stmt OUTPUT,            --try selecting these 2 variables to see how they get parameterized
@params OUTPUT

/*    Step 2: Force parameterization of any query that matches this form.  So, any time this 
    sql statement is executed, regardless of the literal values used, it will be parameterized.  
    Therefore all iterations of the query will look the same to the optimizer. This is critical to
    using plan guides.  */
    
EXEC sp_create_plan_guide N'MyTemplateGuide', 
@stmt, 
N'TEMPLATE',             
NULL, 
@params, 
N'OPTION(PARAMETERIZATION FORCED)'

/*    Step 3: Now that we've parameterized the query, we can apply a plan guide to it.  In this
    instance, we're giving it a hint to use a MERGE JOIN */

EXEC sp_create_plan_guide N'MyPlanGuide', 
@stmt, 
N'SQL', 
NULL, 
@params, 
N'OPTION(MERGE JOIN)'

 

Pretty straightforward, right?  And please note that if your query is already using parameters, you don’t need to do steps 1 and 2.  Now, let’s try that original SQL again.  Notice the original code is now executing as though it had the MERGE JOIN hint:

[Screenshot: the new execution plan, now using merge joins]

 

You can see what plan guides are in your current database by querying sys.plan_guides.

 

[Screenshot: querying sys.plan_guides]
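A simple query along these lines (the column list is just a sample) shows what’s defined:

-- Plan guides defined in the current database
SELECT name, scope_type_desc, is_disabled, query_text, hints
FROM sys.plan_guides;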

And you can disable/enable a particular plan guide using sp_control_plan_guide.

[Screenshot: disabling a plan guide with sp_control_plan_guide]
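For example, to disable and then re-enable the guide we created in Step 3:

EXEC sp_control_plan_guide @operation = N'DISABLE', @name = N'MyPlanGuide';
EXEC sp_control_plan_guide @operation = N'ENABLE',  @name = N'MyPlanGuide';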

And that’s it.  You’re a hero.

February 9, 2011 | SQL Tuning, T-SQL

The obligatory first post

This past weekend I attended SQL Saturday here in sunny Cleveland.  And while there were probably more fun ways to spend a Saturday, I have to say I’m very glad I went.  There were some great sessions by Steve Jones, Thomas LaRock, Grant Fritchey, and Jeremiah Peschka and I learned a ton of information.

However, I would say that 2 things stand out as the most important lessons I came away with.

1) sp_whoisactive.  I don’t know where I’ve been.  Under a rock, apparently.  This stored procedure by Adam Machanic is AWESOME.  Period.  I downloaded it this morning and within seconds was able to provide my users with the exact SQL from a couple of reports that were killing our system.  Out of the box, this procedure lists details about all active sessions, including the time since the last batch, the spid, full SQL text, login, CPU utilization, tempdb usage, disk usage, and a lot more.  I haven’t even explored all the options yet.  This is my new sp_who, my new activity monitor, my new [insert name of pretty third party tool here].  If you haven’t checked it out yet, do it now!

2) This one is a bit embarrassing.  I haven’t updated my resume since… ok I’m not going to say exactly when.  Suffice it to say it’s been years.  Years.  There, I said it.  My head hangs in shame.  Why so long?  I don’t really have a good explanation.  I’ve been at the same employer for over 10 years.  I’m not looking to leave at this point.  I realize that just because I have no intention of leaving, that doesn’t mean my employer might not have other ideas.  And the longer I leave it, the more difficult the task becomes.  Key points get forgotten.  So the second important (and probably the *most* important) thing I took away from SQL Saturday is that I have to make more of an effort in building and maintaining a “brand” for myself, which includes keeping my resume up to date.
Which leads me to this blog.  This is more for me than anyone else.  A place to document things I know, things I learn, questions I face, and maybe even help someone else.  And this is my first post, done.  Glad we got that out of the way.

February 8, 2011 | Professional Development