Over the years I have spent working with SQL Server, the personal time investment that has repaid itself more than any other is becoming intimately familiar with execution plans and how they can be used to diagnose and correct performance problems. There is nothing I enjoy more than talking about execution plans, the query optimizer, and how internals knowledge can be applied to solve real problems. Living in New Zealand, I don’t get to speak on this topic as often as I would like, primarily due…(read more)
I am very pleased to announce that I will be speaking at the SQL Intersection conference in Las Vegas again this year. This time around, I am giving a full-day workshop, "Mastering SQL Server Execution Plan Analysis" as well as a two-part session, "Parallel Query Execution" during the main conference. The workshop is a pre-conference event, held on Sunday 9 November (straight after this year’s PASS Summit). Being on Sunday gives you the whole Monday off to recover and before the…(read more)
Have SQL Server performance problems? What are you going to look at? Can you quickly isolate the root cause? Or do you ever find yourself stuck, not sure of what to do next? That’s what the No More Guessing seminar is all about. This seminar teaches you to use: Baselining: An analytical methodology based on statistical analysis and evidence rather than graphs and your eyes. Performance counters . The right set of counters to get you where you need to be without being overwhelmed. Wait stats: Including…(read more)
Nested loops join query plans can be a lot more interesting (and complicated!) than is commonly realized. One query plan area I get asked about a lot is prefetching. It is not documented in full detail anywhere, so this seems like a good topic to address in a blog post. The examples used in this article are based on questions asked by Adam Machanic . Test Query The following query uses the AdventureWorks sample database (as usual). Run it a couple of times if necessary so you see the query plan when…(read more)
I’m glad there is no minimum length requirement for T-SQL Tuesday blog posts , because this one will be short. I was in the classroom for almost 11 hours today, and I need to be back tomorrow morning at 7:30. Way long ago, back in SQL 2000 (or was it earlier?) when a query indicated that SQL Server was going to use a nonclustered index to get row pointers, and then look up those rows in the underlying table, the plan just had a very linear look to it. The operator that indicated going from the nonclustered…(read more)
I love SQL Server execution plans. It is often easy to spot the cause of a performance problem just by looking at one. The task is considerably easier if the plan includes run-time information (a so-called ‘actual’ execution plan), but even a compiled plan can be very useful. Nevertheless, there are still times where the execution plan does not tell the whole story, and we need to think more deeply about query execution to really understand a performance problem. This post looks at one such example, based on a recent question posted on the SQL Performance Q & A site.
The Execution Plan
This plan is reasonably large (20MB cached plan size) but not massively complex once you break it down (click on the image above to view it full-size in a new window). The context of the question is that this query usually executes in less than a minute, but sometimes it runs for nearly twenty minutes – though the plan appears identical.
There are many different things to look for in execution plans. What you choose to look at first is as much a matter of personal preference as anything, but many people are drawn to high-cost operators, so I will start there. In this plan, the cost of one operator dominates all others, shown as being responsible for 100% of the cost of the query. It is highlighted in red in Plan Explorer; I have expanded the relevant plan section (the top right) below:
There is no doubt that this seek is busy little thing. It is executed 249,484 times, though it only produces a grand total of 167,813 rows over all iterations of the loop join – an average of just 0.7 rows per seek. There are all sorts of interesting details in the plan about this seek – I could write a whole blog post about it – but two details that stand out are the “Force Seek: True” and “Partitioned: True” attributes. These tell us that the base table is partitioned, and the query writer had to use a FORCESEEK table hint to get this plan.
Without this hint, the optimizer would almost certainly choose a Hash Match or Merge Join rather than Nested Loops. This is understandable given the optimizer’s cost model and the simplifying assumptions it makes (such as assuming every query starts with a cold buffer cache). That’s fine, but we can see from the query plan that the inner-side table has 643 million rows. Left to its own devices, the optimizer would estimate that it would be faster to perform a sequential scan of 643 million rows (with large-block read-ahead) than it would be to run a quarter-million randomly-distributed seeks driven by a Nested Loops join.
I doubt that the optimizer’s reasoning here is sound (at least on any reasonably modern hardware) but there we go. The query author probably knows that a good fraction of this table is likely to be in cache, so with all that in mind, I think we can reasonably assume at this stage that the FORCESEEK hint is genuinely needed here, and this part of the plan is at least reasonably optimal.
Important note: The seek certainly does not account for 100% of the runtime cost of this query. Remember cost percentages are always estimates – even in ‘actual’ plans. It can be useful to check the reasons for high-estimated-cost operators, but they should never be used as a primary tuning metric.
This query was executed on SQL Server 2012, so there is a handy warning triangle on the Sort operator indicating that one or more sort runs had to be spilled to physical tempdb disk. The plan clearly shows this spilling is a result of an inaccurate cardinality estimate at the Filter operator (the estimates are not bad at all prior to this). The Sort expects 9,217 rows totalling approximately 5MB, but actually encountered 61,846 rows in 35MB. As you may know, memory for sorts and hashes is allocated before execution starts, and generally cannot expand dynamically at run time.
The spilled sort is undesirable, of course, but it is unlikely to be a major cause of the occasional poor performance given the small size of the spilled data. Nevertheless, this might be a good place to split this query up. The idea would be to write the results of the query (up to and including the Filter) to a temporary heap table using SELECT INTO, and then create a clustered index with the same keys as the Sort operator. The temporary table would not be large, and may well perform better overall than the spilled sort, including the cost of creating the clustered index. Of course, creating this index will involve a sort, but it will be one based on the known cardinality of the temporary heap table. The part of the plan that could be replaced by a temporary table is shown below:
I am a big fan of simplifications like this. Smaller execution plans tend to optimize better for all sorts of reasons, and the source code usually becomes easier to maintain as well. I should mention there is another warning triangle in the 2012 execution plan (shown on the root icon), which relates to some implicit conversions that I will mention later.
The execution plan was captured with Plan Explorer, so we can also easily see I/O statistics for the two executions. The first is for a fast (sub-60-second) run:
Overall, these I/O numbers show pretty much what we would expect: a decent number of logical reads associated with the seeks into the Trans table (but certainly not 100% of the total, ha ha), a very small number of physical reads, and a small amount of read-ahead activity on a couple of tables.
The second set of I/O data is from a slow run (18 minutes or so):
The very obvious difference is the appearance of a work table, with 178 million logical reads and 130 million LOB logical reads. It seems very likely this work table, and its 300 million logical reads, is responsible for the dramatic decrease in query performance. But given that the execution plans are identical (right down to the XML) what is causing this?
My answer to that question (on the Q & A site) was that it is related to the increased level of read-ahead activity, but to see why that is the case, we will need to reproduce the issue and dig a bit deeper.
Before we really get going on this, it will be useful to take a look at what the execution plan is doing in outline. We saw the first part of the plan earlier when looking at the spilling sort. The data set at that point (which we would like to write to a temporary table, remember) essentially represents source data for a second query, which uses a series of Nested Loops Left Joins to lookup information from other tables:
The inner side of each join involves some reasonably involved logic, which is thankfully not important to the present discussion. What is important is that the result of each lookup is a LOB data type. This begins to shed some light on the LOB logical reads reported against the work table, but it does not explain why the work table (and the 300 million associated reads) do not appear when the query runs quickly (with the same execution plan).
Reproducing the problem
The first part of the repro involves creating six tables that represent the lookup tables in the original query plan. Each table will have 10,000 rows, consisting of a sequential reference number and a second column containing a 2048 single-byte-character string. The source table used to drive the lookups will be a regular Numbers table containing just a single integer column.
The next step is to ensure that each lookup table is optimally organized for read-ahead:
The original query translates into our simplified test rig as:
The broad idea there is to concatenate our 2048-character column to itself five times and include a Unicode character that was used in the original query as a delimiter that could not appear in the source data. Each lookup performs the same basic operation against its target table, and the final result is the result of concatenating all the intermediate results. The query hints are necessary to get the right plan shape, just because my test rig tables are so much smaller than the real ones.
Note that the Unicode delimiter means the 2048-character single-byte data is implicitly converted to Unicode, doubling in size. It is not a crucial feature of the test, but it did appear in the original query and explains the type conversion warnings in the execution plan I mentioned earlier. The execution plan for the test query is (click to enlarge if necessary):
I should also stress that the CONCAT operator (new in SQL Server 2012) is not crucial either. If you are using an earlier version of SQL Server, an equivalent query (for present purposes) is shown below. I’m going to stick with CONCAT for the remainder of the post, however.
Warm cache results
With all data in memory, the test query (in either form) completes in about 1.6 seconds on my laptop. The result shows that each output row contains 147,468 bytes of Unicode character data. A typical set of I/O statistics follows:
Nothing too exciting to see there, but this is just our baseline.
Cold cache results
With no data in memory, the test query now runs for 18.6 seconds – almost 12x slower. The I/O statistics show the expected (but still mysterious!) work table and its associated reads:
The Extended Events wait statistics show SQL Server spent very little of that time waiting on my laptop’s slow hard drive – just 402 ms:
The are a number of factors in play here that we will look at in turn.
Nested Loops Prefetching
One of the reasons the optimizer prefers Hash Match and Merge Join for larger inputs is that the data access patterns tend to favour large sequential read-ahead. Both hash and merge tend to scan (range-scan in the case of a seek) their inputs, and the SQL Server Storage Engine automatically issues read-ahead when it detects this type of access. There is nothing in the execution plan to show that a base table will be read with read-ahead, it just happens.
A very basic implementation of Nested Loops join would not benefit from read-ahead at all on its inner side. The outer (driving) side of the loops join might well be a scan or range-scan of an index, and so benefit from automatic read-ahead, of course. The inner side is executed once per outer row, resulting in a rapid succession of small index seeks for different values. These small seeks will typically not be large enough to trigger the automatic read-ahead mechanism. Indeed, in our test, each inner side seek is for precisely one value.
SQL Server improves on this by implementing a second read-ahead mechanism especially for Nested Loops joins (not all N-L joins, it is a cost-based decision the optimizer makes). The basic idea is to buffer extra rows from the outer side of the join, and to use the row values in the buffer to drive read-ahead for the inner side. The effect is that the Nested Loops join becomes a partly blocking operator as outer-side rows are read into the buffer and read-ahead issued based on buffered index key values.
This read-ahead may be either order-preserving or not, and is indicated in the execution plan by the Nested Loop attributes With Ordered Prefetch and With Unordered Prefetch, respectively. When unordered prefetch occurs, the inner side is processed in whatever order the asynchronous reads happen to complete. With ordered prefetching, the mechanism is careful to ensure that the order of rows entering the join is preserved on the output.
In the test rig, the ORDER BY clause means there is a need to preserve row order, so Ordered Prefetch is used:
The issue described in this post is not specific to ordered prefetching – the same behaviour is just as likely with unordered prefetching. The point is that Nested Loops prefetching is one of the requirements.
Documented trace flags 652 and 8744 may be used (with care, and after serious testing) to disable automatic read-ahead and Nested Loops prefetching respectively. This is sometimes beneficial where all data is expected to be in memory (in which case read-ahead processing consumes resources better used by query execution) or where the I/O subsystem is extremely fast. In case you were wondering, there is no background thread for prefetching – all the work of checking whether the data is in memory, and issuing I/O if not, is performed by the worker thread executing the query.
I should stress that read-ahead and Nested Loops prefetching is generally A Very Good Thing with typical storage solutions (e.g. SANs) and both work best (or at all) when indexes have low logical fragmentation.
The issue described here also requires that a large object data type is manufactured before prefetching. The Compute Scalar operators in the test execution plan perform that function:
By ‘manufactured’, I mean that the source columns are not LOB types, but the expression output is – notice the implicit conversion to nvarchar(max). To be clear about it, the issue we are analysing here does not occur when Nested Loops prefetching occurs with an expression that was a LOB type to begin with.
The Outer Join
The optimizer is quite good, generally speaking, at moving scalar expressions around. If the query had featured inner joins (whether by query design or through optimizer activities) the chances are quite good that the problematic expressions (the LOB manufacturers) would have moved beyond the prefetching, and so out of harm’s way. It is quite tricky to preserve NULL-extension and other outer-join semantics properly when moving expressions above an outer join, so the optimizer generally does not even try. In essence, the outer join represents an optimization barrier to the LOB-manufacturing expressions.
When Nested Loops prefetching occurs with a manufactured LOB, the question arises of where to store the created LOBs when buffering rows for prefetch. If the source data were already a LOB type, the execution engine would already have memory structures in place to handle them. When prefetching encounters a manufactured LOB, it needs to store it somewhere, since the engine is no longer processing a stream of one row at a time. It turns out that there is a small memory buffer set aside for this eventuality, which empirical tests show to be 24KB.
However, this 24KB (directly allocated, not via workspace memory grant) is shared across all concurrently executing prefetching joins in the query. With six such joins in the test rig plan and large manufactured LOBs, the buffer stands no chance. As a result, query execution engages a bail-out option: a work table created in tempdb. Though the pages of the worktable may in fact remain memory-resident, overheads (including latching and using general-purpose code interfaces for access to the buffered rows) mean this is very much slower than using the direct-memory cache.
As with most internal work tables, the logical reads reported on this work table indicate the number of rows processed (not 8KB pages, as for regular I/O statistics). This fact, together with the large number of items processed via the worktable in our test, accounts for the millions of reads reported.
The creation and use of the work table depends on run time conditions and timing. If execution finds the data it needs is already in memory, the prefetch checks are still performed, but no asynchronous read requests end up being posted. The 24KB buffer is never filled, so the need to create a work table never arises. The more prefetch that actually occurs, the higher the chances that the buffer will fill. It is quite possible to experience a low level of prefetch with manufactured LOBs without the engine needing to bail out to a work table, especially if the LOBs are not very big and the I/O system is quite fast.
We can rewrite the query to avoid feeding manufactured LOB data to the prefetch buffer. The idea is to use OUTER APPLY to return the data that contributes to the concatenation, rather than the result of the concatenation. We can then perform the CONCAT operation (which handles NULLs nicely without extra work) after the join, avoiding the prefetch buffer issue completely. In SQL Server versions prior to 2012, we would need to use direct string concatenation, and handle rows that are NULL-extended explicitly using ISNULL or COALESCE.
The execution plan for the rewritten query looks visually similar to the problematic one:
However, the Compute Scalars no longer manufacture a LOB data type, they just emit column and variable references:
All the concatenation work (and LOB manufacture) is performed by the final top-level Compute Scalar in a single monster expression [Expr1056]:
Warm cache results
With all data in memory, the new query completes in 1.8 seconds (very slightly up on 1.6 seconds before):
Cold cache results
When all data must be fetched from disk, the query issues optimal prefetching and completes in 7.3 seconds (down from 18.6 seconds) with no work table:
The Extended Events wait statistics now show 3.8 seconds spent waiting for my laptop’s slow spinning disk (which is a good thing!)
Work tables can appear in STATISTICS IO output for a wide range of reasons, but if you encounter one with a very large number of reads – particularly LOB reads – you may be encountering this issue. The rewrite proposed above may not always be possible, but you should be able to refactor your query to avoid the issue now you know it exists.
I am not a fan of doing large amounts of string manipulation in SQL Server. I am always particularly suspicious of the perceived need to split or concatenate large volumes of strings.
I am, however, a fan of always using explicit data types (rather than relying on implicit conversions) and generating relatively small query plans that offer the query optimizer clear and obvious choices. By necessity, this often means writing small SQL queries in logical steps (and no, long chains of common table expressions do not count!)
The real world does not always make these things possible, of course, but it is good to have goals 🙂
© 2013 Paul White – All Rights Reserved
Screenshots acquired using SnagIt by TechSmith
Query plan details obtained using Plan Explorer PRO by SQLSentry
I have just published a four-part series for SQLPerformance.com on the Halloween Problem. Some of you will never have heard of this issue, and those that have might associate it only with T-SQL
UPDATE queries. In fact, the Halloween problem affects execution plans for
INSERT, UPDATE, DELETE and
This is a topic I have been meaning to write about properly for years, ever since I read Craig Freedman’s 2008 blog post on the topic, which ended with the cryptic comment:
“…although I’ve used update statements for all of the examples in this post, some insert and delete statements also require Halloween protection, but I’ll save that topic for a future post.”
That future post never materialized, sadly, so I thought I would have a go. The four parts of the series are summarized and linked below, I hope you find the material interesting.
The SQL standard and three-phase separation
Logical update processing
The Halloween problem
Avoiding the problem in UPDATE statements
Constraint checking and phase separation
MERGE contains several optimizations the other DML statements do not
Hole-filling with merge join
Hole-filling with nested loops
Avoiding an extra B-tree navigation
Avoiding the join
Early optimization approaches
The SQL Server optimizer approach
The case of the redundant sort
HP levels and properties
Plan changes for Halloween Protection
Heaps and forwarded records
As always, I appreciate your comments and feedback.
I remember one of the most surprising changes in SQL Server 2000 2005 was how the graphical plans showed the use of a nonclustered index to seek, and the plan included something that looked like a JOIN to find rows in the base table. Here’s an example. Although I used SQL Server 2008R2, the graphical plan will be pretty similar to what it looked like back in SQL Server 2000 2005. My code will make a copy of a table in the AdventureWorks2008 database, and then build an index on one of the columns….(read more)