Cleveland DBA

SQL Server administration in Cleveland, OH

Composite indexes – Does column order matter?

When it comes to creating composite indexes, a common bit of advice we're given is to put the most selective column first.  And I've always just taken this at face value.  But recently I got to thinking: I'd never actually tested this theory myself.  Does the rule really hold up?

So let’s take a look.  I started by creating a test table in AdventureWorks using existing data.  My goal was to create a table with one high-cardinality column and at least one very low-cardinality column.

USE AdventureWorks;
GO

SET NOCOUNT ON

CREATE TABLE myBigTable (
ID int IDENTITY(1,1),
SURNAME VARCHAR(40),
LOCATION CHAR(3))

INSERT INTO myBigTable (SURNAME, LOCATION)
SELECT LastName, StateProvinceCode FROM Person.Contact CROSS JOIN Person.StateProvince ;
GO

SELECT COUNT(DISTINCT ID) AS DistinctIDs,
       COUNT(DISTINCT SURNAME) AS DistinctNames,
       COUNT(DISTINCT LOCATION) AS DistinctLocs
FROM myBigTable;
GO

Perfect.  ID is unique and location is very low cardinality.  I’ll use those 2 columns in my test indexes.

CREATE INDEX i_highcard ON myBigTable (ID, location);  -- high cardinality first
CREATE INDEX i_lowcard ON myBigTable (location, ID);    -- low cardinality first
GO

Before we go any further, let’s take a look at the physical size of the indexes, just to see if there’s any difference there.

SELECT name, index_id FROM sys.indexes WHERE object_id = OBJECT_ID('myBigTable');
SELECT * FROM sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID('myBigTable'), NULL , NULL, 'LIMITED');
GO
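
If you'd rather see the two indexes side by side instead of picking through the full DMV output, a variation like this works too (just a sketch against the same objects; the heap row is filtered out):

SELECT i.name AS index_name,
       ps.page_count,
       ps.fragment_count,
       ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID('myBigTable'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i ON i.object_id = ps.object_id AND i.index_id = ps.index_id
WHERE i.index_id > 0;   -- skip the heap itself
GO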

Judging by the fragment_count and page_count values, I’d say there’s no difference in size.  Now, on to query performance.  First we’ll query the table for a particular ID value.  I’ll use the INDEX hint to force the optimizer to use each index.

SET STATISTICS IO ON
SELECT * FROM myBigTable WITH (INDEX = i_highcard) WHERE ID = 77321
SELECT * FROM myBigTable WITH (INDEX = i_lowcard) WHERE ID = 77321

Check out those execution plans.  When we query based on the first column in the index, the optimizer is able to do an index seek.  But when we query based on the second column in the index, as in the second query, the optimizer doesn’t know where to enter the index, so it’s forced to do an index scan.  (In fact, the optimizer even suggests we create an index on ID.)  We can also see the impact of these different execution plans in the IO statistics.

From the first query:
Table 'myBigTable'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

And from the second query:

Table 'myBigTable'. Scan count 5, logical reads 9563, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

That’s a lot more IO the database has to do.  And we all know, when it comes to IO, less is more.
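
By the way, that missing-index suggestion in the second plan also shows up in the missing index DMVs.  Here's a quick sketch if you want to see it outside the graphical plan (standard DMV columns only, nothing fancy):

SELECT mid.statement AS table_name,
       mid.equality_columns,
       mid.inequality_columns,
       mid.included_columns
FROM sys.dm_db_missing_index_details AS mid
WHERE mid.statement LIKE '%myBigTable%';
GO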

Now, those results were pretty predictable.  But what about when both indexed columns are in the where clause?  Based on what I’ve been told, the index with the high-cardinality column first will be more efficient.  But I want to see that for myself.

SELECT * FROM myBigTable WITH (INDEX = i_highcard) WHERE ID = 77321 AND LOCATION = '04'
SELECT * FROM myBigTable WITH (INDEX = i_lowcard) WHERE ID = 77321 AND LOCATION = '04'

First, the IO stats:

Table 'myBigTable'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'myBigTable'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Exactly the same.  What about the execution plans?

Also the same.  Even down to the cost.
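
For the record, if you'd rather compare the two plans as XML instead of eyeballing the graphical output, SET STATISTICS XML returns the actual plan alongside each result set.  Same queries as above:

SET STATISTICS XML ON;
SELECT * FROM myBigTable WITH (INDEX = i_highcard) WHERE ID = 77321 AND LOCATION = '04';
SELECT * FROM myBigTable WITH (INDEX = i_lowcard) WHERE ID = 77321 AND LOCATION = '04';
SET STATISTICS XML OFF;
GO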

That surprises me.  I really thought the i_highcard index would be more efficient.  So I'm back to my original question: when it comes to composite indexes, does column order matter?  And I guess the answer is: it depends.  SQL Server can only seek into the index when the leading column appears in the predicate, and it only builds a histogram on that leading column, so a highly selective first column helps when that column shows up in your WHERE clauses on its own.  But if you're filtering on both columns, the order doesn't seem to matter.  This is another reason why it's extremely important to know your application and how it accesses the data.
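
To illustrate that last point, here's one more quick sketch using the same table and indexes: filter on LOCATION alone and the situation reverses.  (I'm selecting only the indexed columns so both indexes cover the query and the comparison stays fair.)

SELECT ID, LOCATION FROM myBigTable WITH (INDEX = i_lowcard) WHERE LOCATION = '04';   -- seek: LOCATION leads this index
SELECT ID, LOCATION FROM myBigTable WITH (INDEX = i_highcard) WHERE LOCATION = '04';  -- scan: LOCATION is the second key column
GO

If most of your queries filter on location by itself, the "wrong" column order suddenly becomes the right one.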

April 7, 2011 | General, SQL Tuning

The obligatory first post

This past weekend I attended SQL Saturday here in sunny Cleveland.  And while there were probably more fun ways to spend a Saturday, I have to say I’m very glad I went.  There were some great sessions by Steve Jones, Thomas LaRock, Grant Fritchey, and Jeremiah Peschka and I learned a ton of information.

However, I would say that 2 things stand out as the most important lessons I came away with.

1) sp_whoisactive.  I don't know where I've been.  Under a rock, apparently.  This stored procedure by Adam Machanic is AWESOME.  Period.  I downloaded it this morning and within seconds was able to provide my users with the exact SQL from a couple of reports that were killing our system.  Out of the box, this procedure lists details about all active sessions, including the time since the last batch, the spid, the full SQL text, login, CPU utilization, tempdb usage, disk reads and writes, and a lot more.  I haven't even explored all the options yet.  This is my new sp_who, my new activity monitor, my new [insert name of pretty third party tool here].  If you haven't checked it out yet, do it now!  (There's a quick usage sketch at the end of this post.)

2) This one is a bit embarrassing.  I haven't updated my resume since… ok, I'm not going to say exactly when.  Suffice it to say it's been years.  Years.  There, I said it.  My head hangs in shame.  Why so long?  I don't really have a good explanation.  I've been at the same employer for over 10 years, and I'm not looking to leave at this point.  But I realize that just because I have no intention of leaving doesn't mean my employer might not have other ideas.  And the longer I leave it, the more difficult the task becomes.  Key points get forgotten.  So the second important (and probably the *most* important) thing I took away from SQL Saturday is that I have to make more of an effort in building and maintaining a "brand" for myself, which includes keeping my resume up to date.
Which leads me to this blog.  This is more for me than anyone else.  A place to document things I know, things I learn, questions I face, and maybe even help someone else.  And this is my first post, done.  Glad we got that out of the way.
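
As promised, here's that sp_whoisactive sketch.  The defaults give you plenty; the one parameter shown is from the procedure's documentation as I remember it, so double-check it against your copy:

EXEC dbo.sp_WhoIsActive;                  -- one row per active session, including the full SQL text
EXEC dbo.sp_WhoIsActive @get_plans = 1;   -- also pull the query plan for each session
GO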

February 8, 2011 | Professional Development