Category Archives: Performance

Minimising Data Movement in PDW Using Query Optimisation Techniques

This is a white paper that I put together recently about APS / PDW Query Optimisation. You may have seen it at http://blogs.technet.com/b/dataplatforminsider/archive/2014/11/14/aps-best-practice-how-to-optimize-query-performance-by-minimizing-data-movement.aspx as well, but in case you haven’t, read on!

I think the significance of this paper is big, because most people who deal with data warehouses (and PDW even more so) haven’t spent much time thinking about Query Optimisation techniques, and certainly not about how they can leverage features of SQL Server’s Query Optimizer to minimise data movement (which is probably the largest culprit for poor performance in a PDW environment).

Oh, and I have another one that I’m writing too…

 


The Analytics Platform System, with its MPP SQL Server engine (SQL Server Parallel Data Warehouse) can deliver performance and scalability for analytics workloads that you may not have expected from SQL Server. But there are key differences in working with SQL Server PDW and SQL Server Enterprise Edition that one should be aware of in order to take full advantage of the SQL Server PDW capabilities. One of the most important considerations when tuning queries in Microsoft SQL Server Parallel Data Warehouse is the minimisation of data movement. This post shows a useful technique regarding the identification of redundant joins through additional predicates that simulate check constraints.

Microsoft’s PDW, part of the Analytics Platform System (APS), offers scale-out technology for data warehouses. This involves spreading data across a number of SQL Server nodes and distributions, such that systems can host up to many petabytes of data. To achieve this, queries which use data from multiple distributions to satisfy joins must leverage the Data Movement Service (DMS) to relocate data during the execution of the query. This data movement is both a blessing and a curse; a blessing because it is the fundamental technology which allows the scale-out features to work, and a curse because it can be one of the most expensive parts of query execution. Furthermore, tuning to avoid data movement is something with which many SQL Server query tuning experts have little experience, as it is unique to the Parallel Data Warehouse edition of SQL Server.

Regardless of whether data in PDW is stored in a column-store or row-store manner, or whether it is partitioned or not, there is a decision to be made as to whether a table is to be replicated or distributed. Replicated tables store a full copy of their data on each compute node of the system, while distributed tables distribute their data across distributions, of which there are eight on each compute node. In a system with six compute nodes, there would be forty-eight distributions, with an average of less than 2.1% (100% / 48) of the data in each distribution.

When deciding whether to distribute or replicate data, there are a number of considerations to bear in mind. Replicated data uses more storage and also has a larger management overhead, but can be more easily joined to data, as every SQL node has local access to replicated data. By distributing larger tables according to the hash of one of the table columns (known as the distribution key), the overhead of both reading and writing data is reduced – effectively reducing the size of databases by an order of magnitude.
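To make the distinction concrete, here is a minimal sketch of the DDL involved – the table and column names are made up purely for illustration – with a small dimension table created as replicated and a larger fact table distributed on a hash of CustomerKey.

CREATE TABLE dbo.DimStore
(
    StoreKey int NOT NULL,
    StoreName varchar(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);

CREATE TABLE dbo.FactSales
(
    SalesKey bigint NOT NULL,
    CustomerKey int NOT NULL,
    StoreKey int NOT NULL,
    Amount money NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerKey));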

Having decided to distribute data, choosing which column to use as the distribution key is driven by factors including the minimisation of data movement and the reduction of skew. Skew is important because if a distribution has much more than the average amount of data, this can affect query time. However, the minimisation of data movement is probably the most significant factor in distribution-key choice.

Joining two tables together involves identifying whether rows from each table match according to a number of predicates, but to do this, the two rows must be available on the same compute node. If one of the tables is replicated, this requirement is already satisfied (although it might need to be ‘trimmed’ to enable a left join), but if both tables are distributed, then the data is only known to be on the same node if one of the join predicates is an equality predicate between the distribution keys of the tables, and the data types of those keys are exactly identical (including nullability and length). More can be read about this in the excellent whitepaper about Query Execution in Parallel Data Warehouse at http://gsl.azurewebsites.net/Portals/0/Users/Projects/pdwau3/sigmod2012.pdf

To avoid data movement in commonly-performed joins, data warehouse designers often need to get creative. This could involve adding extra columns to tables, such as adding the CustomerKey to many fact data tables (and using this as the distribution key) – since orders, items, payments, and the other information required for a given report are all ultimately about a customer – and then adding additional predicates to each join to alert the PDW Engine that only rows within the same distribution could possibly match. This is thinking that is alien to most data warehouse designers, who would typically feel that adding CustomerKey to a table not directly related to a Customer dimension goes against best-practice advice.

Another technique commonly used by PDW data warehouse designers, but rarely seen in other SQL Server data warehouses, is splitting a table in two, either vertically or horizontally. Both are relatively common in PDW as a way of avoiding some of the problems that can otherwise occur, such as skew and unnecessary data movement.

Splitting a table vertically is frequently done to reduce the impact of skew when the ideal distribution key for joins is not evenly distributed. Imagine the scenario of identifiable customers and unidentifiable customers – an increasingly common situation as stores run loyalty programs that allow them to identify a large portion (but not all) of their customers. For the analysis of shopping trends, it could be very useful to have data distributed by customer, but if half the customers are unknown, there will be a large amount of skew.

To solve this, sales could be split into two tables, such as Sales_KnownCustomer (distributed by CustomerKey) and Sales_UnknownCustomer (distributed by some other column). When analysing by customer, the table Sales_KnownCustomer could be used, including the CustomerKey as an additional (even if redundant) join predicate. A view performing a UNION ALL over the two tables could be used to allow reports that need to consider all Sales.

The query overhead of having the two tables is potentially high, especially if we consider tables for Sales, SaleItems, Deliveries, and more, which might all need to be split in two to avoid skew while minimising data movement – using CustomerKey as the distribution key when the customer is known (to allow customer-based analysis), and SalesKey when the customer is unknown.

By distributing on a common key the impact is to effectively create mini-databases which are split out according to groups of customers, with all of the data about a particular customer residing in a single database. This is similar to the way that people scale out when doing so manually, rather than using a system such as PDW. Of course, there is a lot of additional overhead when trying to scale out manually, such as working out how to execute queries that do involve some amount of data movement.

By splitting up the tables into ones for known and unknown customers, queries that looked something like the following:

SELECT …
FROM Sales AS s
JOIN SaleItems AS si
ON si.SalesKey = s.SalesKey
JOIN Delivery_SaleItems AS dsi
ON dsi.LineItemKey = si.LineItemKey
JOIN Deliveries AS d
ON d.DeliveryKey = dsi.DeliveryKey

…would become something like:

SELECT …
FROM Sales_KnownCustomer AS s
JOIN SaleItems_KnownCustomer AS si
ON si.SalesKey = s.SalesKey
AND si.CustomerKey = s.CustomerKey
JOIN Delivery_SaleItems_KnownCustomer AS dsi
ON dsi.LineItemKey = si.LineItemKey
AND dsi.CustomerKey = s.CustomerKey
JOIN Deliveries_KnownCustomer AS d
ON d.DeliveryKey = dsi.DeliveryKey
AND d.CustomerKey = s.CustomerKey
UNION ALL
SELECT …
FROM Sales_UnknownCustomer AS s
JOIN SaleItems_UnknownCustomer AS si
ON si.SalesKey = s.SalesKey
JOIN Delivery_SaleItems_UnknownCustomer AS dsi
ON dsi.LineItemKey = si.LineItemKey
AND dsi.SalesKey = s.SalesKey
JOIN Deliveries_UnknownCustomer AS d
ON d.DeliveryKey = dsi.DeliveryKey
AND d.SalesKey = s.SalesKey

I’m sure you can appreciate that this becomes a much larger effort for query writers, and the existence of views to simplify querying back to the earlier shape could be useful. If both CustomerKey and SalesKey were being used as distribution keys, then joins between the views would require both, but this can be incorporated into logical layers such as Data Source Views much more easily than using UNION ALL across the results of many joins. A DSV or Data Model could easily define relationships between tables using multiple columns so that self-service reporting environments leverage the additional predicates.

The use of views should be considered very carefully, as it is easily possible to end up with views that nest views that nest views that nest views, and an environment that is very hard to troubleshoot and performs poorly. With sufficient care and expertise, however, there are some advantages to be had.

The resultant query would look something like:

SELECT …
FROM Sales AS s
JOIN SaleItems AS si
ON si.SalesKey = s.SalesKey
AND si.CustomerKey = s.CustomerKey
JOIN Delivery_SaleItems AS dsi
ON dsi.LineItemKey = si.LineItemKey
AND dsi.CustomerKey = s.CustomerKey
AND dsi.SalesKey = s.SalesKey
JOIN Deliveries AS d
ON d.DeliveryKey = dsi.DeliveryKey
AND d.CustomerKey = s.CustomerKey
AND d.SalesKey = s.SalesKey

Joining multiple sets of tables which have been combined using UNION ALL is not the same as performing a UNION ALL of sets of tables which have been joined. Much like any high school mathematics teacher will happily explain that (a*b)+(c*d) is not the same as (a+c)*(b+d), additional combinations need to be considered when the logical order of joins and UNION ALLs is changed.

[Diagram: joining two UNION ALL views means every combination of the underlying tables must be considered]

Notice that when we have (TableA1 UNION ALL TableA2) JOIN (TableB1 UNION ALL TableB2), we must perform joins not only between TableA1 and TableB1, and TableA2 and TableB2, but also TableA1 and TableB2, and TableB1 and TableA2. These last two combinations do not involve tables with common distribution keys, and therefore we would see data movement. This is despite the fact that we know that there can be no matching rows in those combinations, because some are for KnownCustomers and the others are for UnknownCustomers. Effectively, the relationships between the tables would be more like the following diagram:

[Diagram: the intended relationships – KnownCustomer tables only ever match KnownCustomer tables, and UnknownCustomer tables only ever match UnknownCustomer tables]

There is an important stage of Query Optimization which must be considered here, and which can be leveraged to remove the need for data movement when this pattern is applied – that of Contradiction.

The contradiction algorithm is an incredibly useful but underappreciated stage of Query Optimization. Typically it is explained using an obvious contradiction such as WHERE 1=2. Notice the effect on the query plans of using this predicate.
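For example, consider these two queries against a hypothetical dbo.Sales table (the table and column names here are just for illustration); the second can never return any rows:

SELECT SalesKey, Amount
FROM dbo.Sales;

SELECT SalesKey, Amount
FROM dbo.Sales
WHERE 1 = 2;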

[Screenshots: query plans with and without the WHERE 1=2 predicate]

Because the Query Optimizer recognises that no rows can possibly satisfy the predicate WHERE 1=2, it does not access the data structures seen in the first query plan.

This is useful, but many readers may not expect queries with such an obvious contradiction to appear in their code.

But suppose the views that perform a UNION ALL are expressed in this form:

CREATE VIEW dbo.Sales AS
SELECT *
FROM dbo.Sales_KnownCustomer
WHERE CustomerID > 0
UNION ALL
SELECT *
FROM dbo.Sales_UnknownCustomer
WHERE CustomerID = 0;

Now, we see a different kind of behaviour.

Before the predicates are used, the query on the views is rewritten as follows (with SELECT clauses replaced by ellipses).

SELECT …
FROM (SELECT …
FROM (SELECT …
FROM [sample_vsplit].[dbo].[Sales_KnownCustomer] AS T4_1
UNION ALL
SELECT …
FROM [tempdb].[dbo].[TEMP_ID_4208] AS T4_1) AS T2_1
INNER JOIN
(SELECT …
FROM (SELECT …
FROM [sample_vsplit].[dbo].[SaleItems_KnownCustomer] AS T5_1
UNION ALL
SELECT …
FROM [tempdb].[dbo].[TEMP_ID_4209] AS T5_1) AS T3_1
INNER JOIN
(SELECT …
FROM (SELECT …
FROM [sample_vsplit].[dbo].[Delivery_SaleItems_KnownCustomer] AS T6_1
UNION ALL
SELECT …
FROM [tempdb].[dbo].[TEMP_ID_4210] AS T6_1) AS T4_1
INNER JOIN
(SELECT …
FROM [sample_vsplit].[dbo].[Deliveries_KnownCustomer] AS T6_1
UNION ALL
SELECT …
FROM [tempdb].[dbo].[TEMP_ID_4211] AS T6_1) AS T4_2
ON (([T4_2].[CustomerKey] = [T4_1].[CustomerKey])
AND ([T4_2].[SalesKey] = [T4_1].[SalesKey])
AND ([T4_2].[DeliveryKey] = [T4_1].[DeliveryKey]))) AS T3_2
ON (([T3_1].[CustomerKey] = [T3_2].[CustomerKey])
AND ([T3_1].[SalesKey] = [T3_2].[SalesKey])
AND ([T3_2].[SaleItemKey] = [T3_1].[SaleItemKey]))) AS T2_2
ON (([T2_2].[CustomerKey] = [T2_1].[CustomerKey])
AND ([T2_2].[SalesKey] = [T2_1].[SalesKey]))) AS T1_1

Whereas with the inclusion of the additional predicates, the query simplifies to:

SELECT …
FROM (SELECT …
FROM (SELECT …
FROM [sample_vsplit].[dbo].[Sales_KnownCustomer] AS T4_1
WHERE ([T4_1].[CustomerKey] > 0)) AS T3_1
INNER JOIN
(SELECT …
FROM (SELECT …
FROM [sample_vsplit].[dbo].[SaleItems_KnownCustomer] AS T5_1
WHERE ([T5_1].[CustomerKey] > 0)) AS T4_1
INNER JOIN
(SELECT …
FROM (SELECT …
FROM [sample_vsplit].[dbo].[Delivery_SaleItems_KnownCustomer] AS T6_1
WHERE ([T6_1].[CustomerKey] > 0)) AS T5_1
INNER JOIN
(SELECT …
FROM [sample_vsplit].[dbo].[Deliveries_KnownCustomer] AS T6_1
WHERE ([T6_1].[CustomerKey] > 0)) AS T5_2
ON (([T5_2].[CustomerKey] = [T5_1].[CustomerKey])
AND ([T5_2].[SalesKey] = [T5_1].[SalesKey])
AND ([T5_2].[DeliveryKey] = [T5_1].[DeliveryKey]))) AS T4_2
ON (([T4_1].[CustomerKey] = [T4_2].[CustomerKey])
AND ([T4_1].[SalesKey] = [T4_2].[SalesKey])
AND ([T4_2].[SaleItemKey] = [T4_1].[SaleItemKey]))) AS T3_2
ON (([T3_2].[CustomerKey] = [T3_1].[CustomerKey])
AND ([T3_2].[SalesKey] = [T3_1].[SalesKey]))
UNION ALL
SELECT …
FROM (SELECT …
FROM [sample_vsplit].[dbo].[Sales_UnknownCustomer] AS T4_1
WHERE ([T4_1].[CustomerKey] = 0)) AS T3_1
INNER JOIN
(SELECT …
FROM (SELECT …
FROM [sample_vsplit].[dbo].[SaleItems_UnknownCustomer] AS T5_1
WHERE ([T5_1].[CustomerKey] = 0)) AS T4_1
INNER JOIN
(SELECT …
FROM (SELECT …
FROM [sample_vsplit].[dbo].[Delivery_SaleItems_UnknownCustomer] AS T6_1
WHERE ([T6_1].[CustomerKey] = 0)) AS T5_1
INNER JOIN
(SELECT …
FROM [sample_vsplit].[dbo].[Deliveries_UnknownCustomer] AS T6_1
WHERE ([T6_1].[CustomerKey] = 0)) AS T5_2
ON (([T5_2].[CustomerKey] = [T5_1].[CustomerKey])
AND ([T5_2].[SalesKey] = [T5_1].[SalesKey])
AND ([T5_2].[DeliveryKey] = [T5_1].[DeliveryKey]))) AS T4_2
ON (([T4_1].[CustomerKey] = [T4_2].[CustomerKey])
AND ([T4_1].[SalesKey] = [T4_2].[SalesKey])
AND ([T4_2].[SaleItemKey] = [T4_1].[SaleItemKey]))) AS T3_2
ON (([T3_2].[CustomerKey] = [T3_1].[CustomerKey])
AND ([T3_2].[SalesKey] = [T3_1].[SalesKey]))) AS T1_1

This may seem more complex – it’s certainly longer – but within each branch of the UNION ALL we are back to the original, preferred shape of the join. This is a powerful rewrite of the query.

[Diagram: with the contradictions detected, only the KnownCustomer-to-KnownCustomer and UnknownCustomer-to-UnknownCustomer joins remain]

Furthermore, the astute PDW-familiar reader will quickly realise that the UNION ALL of two local queries (queries that don’t require data movement) is also local, and that therefore, this query is completely local. The TEMP_ID_NNNNN tables in the first rewrite are more evidence that data movement has been required.

When the two plans are shown using PDW’s EXPLAIN keyword, the significance is shown even more clearly.
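Getting hold of those plans is simply a matter of prefixing the query with EXPLAIN, which returns an XML plan describing each step of the distributed query – the query below is just a placeholder with made-up columns to show the shape of it. Any shuffle-move or broadcast-move steps in that XML are the data movement to look out for.

EXPLAIN
SELECT s.SalesKey, d.DeliveryKey
FROM dbo.Sales AS s
JOIN dbo.Deliveries AS d
    ON d.SalesKey = s.SalesKey;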

The first plan appears as follows, and it is obvious that there is a large amount of data movement involved.

[EXPLAIN output for the original view definitions, showing the data movement steps]

[EXPLAIN output for the view definitions that include the additional predicates, showing no data movement]

The queries passed in are identical, but the altered definitions of the views have removed the need for any data movement at all. This should allow your query to run a little faster. Ok, a lot faster.

Summary

When splitting distributed tables vertically to avoid skew, views over those tables should include predicates which reiterate the conditions that cause the data to be populated into each table. This provides additional information to the PDW Engine that can remove unnecessary data movement, resulting in much-improved performance, both for standard reports using designed queries, and ad hoc reports that use a data model.

 

Check us out at www.lobsterpot.com.au or talk to me via Twitter at @rob_farley

How TOP wrecks performance (part 1)

The TOP keyword in the SELECT clause may not be ANSI standard, but I am sure we have all used it. But do you know that it comes with two optional keywords? And have you used them? They are PERCENT and WITH TIES. Let’s first look at the basic functionality of TOP, and how it affects performance – which in fact, contrary to the title of this post, is usually quite good for performance. Here is a simple example, using the AdventureWorks2012 sample database (which can be downloaded from Codeplex…(read more)
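As a quick taster of those optional keywords, here is a sketch against the same AdventureWorks2012 database (Production.Product is used here simply because it has an obvious ORDER BY candidate):

SELECT TOP (10) WITH TIES p.ProductID, p.Name, p.ListPrice
FROM Production.Product AS p
ORDER BY p.ListPrice DESC;

SELECT TOP (10) PERCENT p.ProductID, p.Name, p.ListPrice
FROM Production.Product AS p
ORDER BY p.ListPrice DESC;

WITH TIES can return more than ten rows if the tenth price is shared, while PERCENT returns a proportion of the rows rather than a fixed count.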

SQL Server Hardware

From Red-Gate website you can download (for free) the ebook SQL Server Hardware written by Glenn Berry (@GlennAlanBerry).

This ebook will provide the fundamental knowledge and resources you need to make intelligent decisions about the choice, optimal installation and configuration of SQL Server hardware, the operating system and the SQL Server RDBMS.

The full editorial review is available here.

This book should not be missing from your digital library! And obviously you can buy the paperback on Amazon.

Execution plans, laundry, and a giveaway

In just a week from now, SQL Saturday #337 will kick off in Portland, Oregon. And I will be there – the third time already for me to be speaking in Portland. For a European, Portland is not the most logical location. But the organization over there is a pretty smart bunch of people. They figure that being close to Seattle gives them a great opportunity – so whenever they get the chance, they will reserve a Saturday just before the PASS Summit in Seattle for their own event. And then they make sure…(read more)

SQL Spatial: Getting “nearest” calculations working properly

If you’ve ever done spatial work with SQL Server, I hope you’ve come across the ‘nearest’ problem.

You have five thousand stores around the world, and you want to identify the one that’s closest to a particular place. Maybe you want the store closest to the LobsterPot office in Adelaide, at -34.925806, 138.605073. Or our new US office, at 42.524929, -87.858244. Or maybe both!

You know how to do this. You don’t want to use an aggregate MIN or MAX, because you want the whole row, telling you which store it is. You want to use TOP, and if you want to find the closest store for multiple locations, you use APPLY. Let’s do this (but I’m going to use addresses in AdventureWorks2012, as I don’t have a list of stores). Oh, and before I do, let’s make sure we have a spatial index in place. I’m going to use the default options.

CREATE SPATIAL INDEX spin_Address ON Person.Address(SpatialLocation);

And my actual query:

WITH MyLocations AS
(SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)),
                       ('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326))
               ) t (Name, Geo))
SELECT l.Name, a.AddressLine1, a.City, s.Name AS [State], c.Name AS Country
FROM MyLocations AS l
CROSS APPLY (
    SELECT TOP (1) *
    FROM Person.Address AS ad
    ORDER BY l.Geo.STDistance(ad.SpatialLocation)
    ) AS a
JOIN Person.StateProvince AS s
    ON s.StateProvinceID = a.StateProvinceID
JOIN Person.CountryRegion AS c
    ON c.CountryRegionCode = s.CountryRegionCode
;

[Screenshot: query results – the nearest address for each of the two locations]

Great! This is definitely working. I know both those City locations, even if the AddressLine1s don’t quite ring a bell. I’m sure I’ll be able to find them next time I’m in the area.

But of course what I’m concerned about from a querying perspective is what’s happened behind the scenes – the execution plan.

[Screenshot: execution plan scanning Person.Address for each location and sorting by distance]

This isn’t pretty. It’s not using my index. It’s sucking every row out of the Address table TWICE (which sucks), and then it’s sorting them by the distance to find the smallest one. It’s not pretty, and it takes a while. Mind you, I do like the fact that it saw an indexed view it could use for the State and Country details – that’s pretty neat. But yeah – users of my nifty website aren’t going to like how long that query takes.

The frustrating thing is that I know that I can use the index to find locations that are within a particular distance of my locations quite easily, and Microsoft recommends this for solving the ‘nearest’ problem, as described at http://msdn.microsoft.com/en-au/library/ff929109.aspx.

Now, in the first example on this page, it says that the query there will use the spatial index. But when I run it on my machine, it does nothing of the sort.

[Screenshot: execution plan for the documentation’s example query – parallel, but still not using the spatial index]

I’m not particularly impressed. But what we see here is that parallelism has kicked in. In my scenario, it’s split the data up into 4 threads, but it’s still slow, and not using my index. It’s disappointing.

But I can persuade it with hints!

If I tell it to FORCESEEK, or use my index, or even turn off the parallelism with MAXDOP 1, then I get the index being used, and it’s a thing of beauty! Part of the plan is here:

[Screenshot: part of the execution plan that seeks into the spatial index]

It’s massive, and it’s ugly, and it uses a TVF… but it’s quick.
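For reference, the kind of hinting being described looks roughly like this when applied to the single-point form of the query from that documentation page – treat it as a sketch rather than the exact statement being run:

DECLARE @g geography = geography::Point(-34.925806, 138.605073, 4326);

SELECT TOP (1) ad.AddressID, ad.AddressLine1, ad.City
FROM Person.Address AS ad WITH (INDEX(spin_Address))   -- or WITH (FORCESEEK)
WHERE ad.SpatialLocation.STDistance(@g) IS NOT NULL
ORDER BY ad.SpatialLocation.STDistance(@g)
OPTION (MAXDOP 1);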

The way it works is to hook into the GeodeticTessellation function, which essentially finds where the point is, and works outward through the spatial index cells that surround it. This then provides a framework to be able to see into the spatial index for the items we want. You can read more about it at http://msdn.microsoft.com/en-us/library/bb895265.aspx#tessellation – including a bunch of pretty diagrams. One of those times when we have a much more complex-looking plan, but just because of the good that’s going on.

This tessellation stuff was introduced in SQL Server 2012. But my query isn’t using it.

When I try to use the FORCESEEK hint on the Person.Address table, I get the friendly error:

Msg 8622, Level 16, State 1, Line 1
Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.

And I’m almost tempted to just give up and move back to the old method of checking increasingly large circles around my location. After all, I can even leverage multiple OUTER APPLY clauses just like I did in my recent Lookup post.

WITH MyLocations AS
(SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)),
                       ('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326))
               ) t (Name, Geo))
SELECT
    l.Name,
    COALESCE(a1.AddressLine1,a2.AddressLine1,a3.AddressLine1),
    COALESCE(a1.City,a2.City,a3.City),
    s.Name AS [State],
    c.Name AS Country
FROM MyLocations AS l
OUTER APPLY (
    SELECT TOP (1) *
    FROM Person.Address AS ad
    WHERE l.Geo.STDistance(ad.SpatialLocation) < 1000
    ORDER BY l.Geo.STDistance(ad.SpatialLocation)
    ) AS a1
OUTER APPLY (
    SELECT TOP (1) *
    FROM Person.Address AS ad
    WHERE l.Geo.STDistance(ad.SpatialLocation) < 5000
    AND a1.AddressID IS NULL
    ORDER BY l.Geo.STDistance(ad.SpatialLocation)
    ) AS a2
OUTER APPLY (
    SELECT TOP (1) *
    FROM Person.Address AS ad
    WHERE l.Geo.STDistance(ad.SpatialLocation) < 20000
    AND a2.AddressID IS NULL
    ORDER BY l.Geo.STDistance(ad.SpatialLocation)
    ) AS a3
JOIN Person.StateProvince AS s
    ON s.StateProvinceID = COALESCE(a1.StateProvinceID,a2.StateProvinceID,a3.StateProvinceID)
JOIN Person.CountryRegion AS c
    ON c.CountryRegionCode = s.CountryRegionCode
;

But this isn’t friendly-looking at all, and I’d use the method recommended by Isaac Kunen, who uses a table of numbers for the expanding circles.

It feels old-school though, when I’m dealing with SQL 2012 (and later) versions. So why isn’t my query doing what it’s supposed to? Remember the query…

WITH MyLocations AS
(SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)),
                       ('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326))
               ) t (Name, Geo))
SELECT l.Name, a.AddressLine1, a.City, s.Name AS [State], c.Name AS Country
FROM MyLocations AS l
CROSS APPLY (
    SELECT TOP (1) *
    FROM Person.Address AS ad
    ORDER BY l.Geo.STDistance(ad.SpatialLocation)
    ) AS a
JOIN Person.StateProvince AS s
    ON s.StateProvinceID = a.StateProvinceID
JOIN Person.CountryRegion AS c
    ON c.CountryRegionCode = s.CountryRegionCode
;

Well, I just wasn’t reading http://msdn.microsoft.com/en-us/library/ff929109.aspx properly.

The following requirements must be met for a Nearest Neighbor query to use a spatial index:

  1. A spatial index must be present on one of the spatial columns and the STDistance() method must use that column in the WHERE and ORDER BY clauses.

  2. The TOP clause cannot contain a PERCENT statement.

  3. The WHERE clause must contain a STDistance() method.

  4. If there are multiple predicates in the WHERE clause then the predicate containing STDistance() method must be connected by an AND conjunction to the other predicates. The STDistance() method cannot be in an optional part of the WHERE clause.

  5. The first expression in the ORDER BY clause must use the STDistance() method.

  6. Sort order for the first STDistance() expression in the ORDER BY clause must be ASC.

  7. All the rows for which STDistance returns NULL must be filtered out.

Let’s start from the top.

1. Needs a spatial index on one of the columns that’s in the STDistance call. Yup, got the index.

2. No ‘PERCENT’. Yeah, I don’t have that.

3. The WHERE clause needs to use STDistance(). Ok, but I’m not filtering, so that should be fine.

4. Yeah, I don’t have multiple predicates.

5. The first expression in the ORDER BY is my distance, that’s fine.

6. Sort order is ASC, because otherwise we’d be starting with the ones that are furthest away, and that’s tricky.

7. All the rows for which STDistance returns NULL must be filtered out. But I don’t have any NULL values, so that shouldn’t affect me either.

…but something’s wrong. I do actually need to satisfy #3. And I do need to make sure #7 is being handled properly, because there are some situations (eg, differing SRIDs) where STDistance can return NULL. It says so at http://msdn.microsoft.com/en-us/library/bb933808.aspx – “STDistance() always returns null if the spatial reference IDs (SRIDs) of the geography instances do not match.” So if I simply make sure that I’m filtering out the rows that return NULL…

…then it’s blindingly fast, I get the right results, and I’ve got the complex-but-brilliant plan that I wanted.
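In other words, the query ends up looking something like this – the only change from the original is the extra predicate inside the APPLY:

WITH MyLocations AS
(SELECT * FROM (VALUES ('LobsterPot Adelaide', geography::Point(-34.925806, 138.605073, 4326)),
                       ('LobsterPot USA', geography::Point(42.524929, -87.858244, 4326))
               ) t (Name, Geo))
SELECT l.Name, a.AddressLine1, a.City, s.Name AS [State], c.Name AS Country
FROM MyLocations AS l
CROSS APPLY (
    SELECT TOP (1) *
    FROM Person.Address AS ad
    WHERE l.Geo.STDistance(ad.SpatialLocation) IS NOT NULL
    ORDER BY l.Geo.STDistance(ad.SpatialLocation)
    ) AS a
JOIN Person.StateProvince AS s
    ON s.StateProvinceID = a.StateProvinceID
JOIN Person.CountryRegion AS c
    ON c.CountryRegionCode = s.CountryRegionCode
;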

[Screenshot: the execution plan, now using the spatial index]

It just wasn’t overly intuitive, despite being documented.

@rob_farley

SSIS Lookup transformation in T-SQL

There is no equivalent to the SSIS Lookup transformation in T-SQL – but there is a workaround if you’re careful.

The big issue that you face is about the number of rows that you connect to in the Lookup. SQL Books Online (BOL) says:

  • If there is no matching entry in the reference dataset, no join occurs. By default, the Lookup transformation treats rows without matching entries as errors. However, you can configure the Lookup transformation to redirect such rows to a no match output. For more information, see Lookup Transformation Editor (General Page) and Lookup Transformation Editor (Error Output Page).
  • If there are multiple matches in the reference table, the Lookup transformation returns only the first match returned by the lookup query. If multiple matches are found, the Lookup transformation generates an error or warning only when the transformation has been configured to load all the reference dataset into the cache. In this case, the Lookup transformation generates a warning when the transformation detects multiple matches as the transformation fills the cache.

This is very important. It means that every row that enters the Lookup transformation comes out. This could be coming out of the transformation as an error, or through a ‘No Match’ output, with an ignored failure, or having found a row. But it will never return multiple copies of the row, even if it has matched two rows. This last point is inherently different to what happens in T-SQL. In T-SQL, any time you do a join, whether an INNER JOIN or an OUTER JOIN, if you match multiple rows on the right hand side, you get a copy of the row from the left for each match. When doing Lookups in the world of ETL (as you would with SSIS), this is a VeryBadThing.
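To see that difference concretely, here’s a contrived little illustration (the table variables and their columns are made up) in which a single ‘fact’ row matches two rows on the right:

DECLARE @fact TABLE (FactID int, DimBK char(3), Amount money);
DECLARE @dim  TABLE (DimKey int, DimBK char(3));

INSERT @fact VALUES (1, 'ADL', 100.00);
INSERT @dim  VALUES (10, 'ADL'), (11, 'ADL');

SELECT f.FactID, f.Amount, d.DimKey
FROM @fact AS f
LEFT JOIN @dim AS d
    ON d.DimBK = f.DimBK;   -- two rows come back for the single fact row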

You see, there’s an assumption with ETL systems that things are under control in your data warehouse. It’s this assumption that I want to look at in this post. I do actually think it’s quite a reasonable one, but I also recognise that a lot of people don’t feel that it’s something they can rely on. Either way, I’ll show you a couple of ways that you can implement some workarounds, and it also qualifies this post for this month’s T-SQL Tuesday, hosted by Dev Nambi.

Consider that you have a fact row, and you need to do a lookup into a dimension table to find the appropriate key value (I might know that the fact row corresponds to the Adelaide office, but having moved recently, I would want to know whether it’s the new version of the office or the old one). I know that ‘ADL’ is unique in my source system – quite probably because of a unique constraint in my OLTP environment – but I don’t have that guarantee in my warehouse. Actually, I know that I will have multiple rows for ADL. Only one is current at any point in time, but can I be sure that if I try to find the ADL record for a particular point in time, I will only find one row?

A typical method for versioning dimension records (a Type 2 scenario) is to have a StartDate and EndDate for each version. But implementing logic to make sure there can never be an overlap is tricky. It’s easy enough to test, particularly since LAG/LEAD functions became available, but putting an actual constraint in there is harder – even more so if you’re dealing with something like Microsoft’s Parallel Data Warehouse, which doesn’t support unique constraints (this is totally fair enough, when you consider that the rows for a single table can be spread across hundreds of sub-tables).
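For the testing side, a check along these lines will surface gaps and overlaps – this is only a sketch, and it assumes a dimtable with BusinessKey, StartDate and EndDate columns where the convention is that EndDate should equal the next version’s StartDate:

WITH v AS
(
    SELECT d.BusinessKey, d.StartDate, d.EndDate,
           LEAD(d.StartDate) OVER (PARTITION BY d.BusinessKey
                                   ORDER BY d.StartDate) AS NextStartDate
    FROM dimtable AS d
)
SELECT v.BusinessKey, v.StartDate, v.EndDate, v.NextStartDate
FROM v
WHERE v.EndDate <> v.NextStartDate;   -- overlaps and gaps both show up here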

If we know that we have contiguous StartDate/EndDate ranges, with no gaps and no overlaps, then we can confidently write a query like:

FROM facttable f
LEFT JOIN dimtable d
ON d.BusinessKey = f.DimBK
AND d.StartDate <= f.FactDate
AND f.FactDate < d.EndDate

By doing a LEFT JOIN, we know that we’re never going to eliminate a fact by failing to match it (and can introduce an inferred dimension member), but if we have somehow managed to have overlapping records, then we could inadvertently get a second copy of our fact row. That’s going to wreck our aggregates, and the business will lose faith in the environment that has been put in.

Of course, your dimension management is sound. You will never have this problem. Really. But what happens if someone has broken the rules and manually tweaked something? What if there is disagreement amongst BI developers about the logic that should be used for EndDate values (some prefer to have a gap of a day, as in “Jan 1 to Jan 31, Feb 1 to Feb 28”, whereas others prefer to have the EndDate value the same as the next StartDate)? There’s definitely potential for inconsistency between developers.

Whatever the reason, if you suddenly find yourself with the potential for two rows to be returned by a ‘lookup join’ like this, you have a problem. Clearly the SSIS Lookup transform ensures that there is never a second row considered to match, but T-SQL doesn’t offer a join like that.

But it does give us APPLY.

We can use APPLY to reproduce the same functionality as a join, by using code such as:

FROM facttable f
OUTER APPLY (SELECT * FROM dimtable d
             WHERE d.BusinessKey = f.DimBK
             AND d.StartDate <= f.FactDate
             AND f.FactDate < d.EndDate) d1

But because we now have a fully-fledged correlated table expression, we can be a little more tricky, and tweak it with TOP,

FROM facttable f
OUTER APPLY (SELECT TOP (1) * FROM dimtable d
             WHERE d.BusinessKey = f.DimBK
             AND d.StartDate <= f.FactDate
             AND f.FactDate < d.EndDate) d1

, which leaves us being confident that the number of rows in the set produced by our FROM clause is exactly the same number as we have in our fact table. The OUTER APPLY (rather than CROSS APPLY) makes sure we don’t lose rows, and the TOP (1) ensures that we never match more than one.

But still I feel like we have a better option than having to consider which method of StartDate/EndDate logic is used.

What we want is the most recent version of the dimension member at the time of the fact record. To me, this sounds like a TOP query with an ORDER BY and a filter,

FROM facttable f
OUTER APPLY (SELECT TOP (1) * FROM dimtable d
             WHERE d.BusinessKey = f.DimBK
             AND d.StartDate <= f.FactDate
             ORDER BY d.StartDate DESC) d1

, and you will notice that I’m no longer using the EndDate at all. In fact, I don’t need to bother having it in the table at all.

Now, the worst scenario that I can imagine is that I have a fact record that has been backdated to before the dimension member appeared in the system. I’m sure you can imagine it, such as when someone books vacation time before they’ve actually started with a company. The dimension member StartDate might be populated with when they actually start with the company, but they have activity before their record becomes ‘current’.

Well, I solve that with a second APPLY.

FROM facttable f
OUTER APPLY (SELECT TOP (1) * FROM dimtable d
             WHERE d.BusinessKey = f.DimBK
             AND d.StartDate <= f.FactDate
             ORDER BY d.StartDate DESC) d1
OUTER APPLY (SELECT TOP (1) * FROM dimtable d
             WHERE d1.BusinessKey IS NULL
             AND d.BusinessKey = f.DimBK
             AND d.StartDate > f.FactDate
             ORDER BY d.StartDate ASC) d1a

Notice that I correlate the second APPLY to the first one, with the predicate “d1.BusinessKey IS NULL”. This is very important, and addresses a common misconception, as many people will look at this query and assume that the second APPLY will be executed for every row. Let’s look at the plan that would come about here.

 

[Screenshot: execution plan showing a Table Scan of facttable, two Nested Loops operators, and a Filter in front of the second seek into dimtable]

I don’t have any indexes on facttable – I’m happy enough to scan the whole table, but I want you to notice the two Nested Loop operators and the lower branches for them. A Nested Loop operator sucks data from its top branch, and for every row that comes in, requests any matching rows from the lower one.

We already established that the APPLY with TOP is not going to change the number of rows, so the number of rows that the left-most Nested Loop is pulling from its top branch is the same as the one on its right, which also matches the rows from the Table Scan. And we know that we do want to check dimtable for every row that’s coming from facttable.

But we don’t want to be doing a Seek in dimtable a second time for every row that the Nested Loop pulls from facttable.

Luckily, that’s another poor assumption. People misread this about execution plans all the time.

When taught how to read an execution plan, many will head straight to the top-right, and whenever they hit a join operator, head to the right of that branch. And it’s true that the data streams do start there. It’s not the full story though, and it’s shown quite clearly here, through that Filter operator.

That Filter operator is no ordinary one, but has a Startup Expression Predicate property.

[Screenshot: the Filter operator’s properties, showing the Startup Expression Predicate]

This means that the operator only requests rows from its right, if that predicate is satisfied. In this case, it means if it didn’t find a matching row the first time it looked in dimtable. Therefore, the second Index Seek won’t get executed except in very rare situations. And we know (but the QO doesn’t) that it will typically be none at all, and that the estimated cost is not going to be 33%, but much more like 0%.

So now you have a way of being able to do lookups that will not only guarantee that one row (at most) will be picked up, but you also have a pattern that will let you do a second lookup for those times when you don’t have the first.

And keep your eye out for Startup Expression Predicates – they can be very useful for knowing which parts of your execution plan don’t need to get executed…

@rob_farley

SQL 2014 does data the way developers want

A post I’ve been meaning to write for a while, good that it fits with this month’s T-SQL Tuesday, hosted by Joey D’Antoni (@jdanton)

Ever since I got into databases, I’ve been a fan. I studied Pure Maths at university (as well as Computer Science), and am very comfortable with Set Theory, which undergirds relational database concepts. But I’ve also spent a long time as a developer, and appreciate that databases don’t exactly fit within the stuff I learned in my first year of uni, particularly the “Algorithms and Data Structures” subject, in which we studied concepts like linked lists. Writing in languages like C, we used pointers to quickly move around data, without a database in sight. Of course, if we had a power failure all this data was lost, as it was only persisted in RAM. Perhaps it’s why I’m a fan of database internals, of indexes, latches, execution plans, and so on – the developer in me wants to be reassured that we’re getting to the data as efficiently as possible.

Back when SQL Server 2005 was approaching, one of the big stories was around CLR. Many were saying that T-SQL stored procedures would be a thing of the past because we now had CLR, and that it was obviously going to be much faster than using the abstracted T-SQL. Around the same time, we were seeing technologies like Linq-to-SQL produce poor T-SQL equivalents, and developers had had a gutful. They wanted to move away from T-SQL, having lost trust in it. I was never one of those developers, because I’d looked under the covers and knew that despite being abstracted, T-SQL was still a good way of getting to data. It worked for me, appealing to both my Set Theory side and my Developer side.

CLR hasn’t exactly become the default option for stored procedures, although there are plenty of situations where it can be useful for getting faster performance.

SQL Server 2014 is different though, through Hekaton – its In-Memory OLTP environment.

When you create a table using Hekaton (that is, a memory-optimized one), the table you create is the kind of thing you’d’ve made as a developer. It creates code in C leveraging structs and pointers and arrays, which it compiles into fast code. When you insert data into it, it creates a new instance of a struct in memory, and adds it to an array. When the insert is committed, a small write is made to the transaction log to make sure it’s durable, but none of the locking and latching behaviour that typifies transactional systems is needed. Indexes are done using hashes and Bw-trees (which avoid locking through the use of pointers), and each update is handled as a delete-and-insert.
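For a sense of what that looks like from the T-SQL side, here’s a minimal sketch of a memory-optimized table – the names and bucket count are made up, and the database needs a MEMORY_OPTIMIZED_DATA filegroup before this will run:

CREATE TABLE dbo.ShoppingCart
(
    CartID      int       NOT NULL,
    UserID      int       NOT NULL,
    CreatedDate datetime2 NOT NULL,

    CONSTRAINT pk_ShoppingCart PRIMARY KEY NONCLUSTERED HASH (CartID)
        WITH (BUCKET_COUNT = 1000000),
    INDEX ix_ShoppingCart_UserID NONCLUSTERED (UserID)   -- range index, implemented as a Bw-tree
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);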

This is data the way that developers do it when they’re coding for performance – the way I was taught at university before I learned about databases. Being done in C, it compiles to very quick code, and although these tables don’t support every feature that regular SQL tables do, this is still an excellent direction that has been taken.

@rob_farley

SQL TuneIn Zagreb 2014 – Session material

I spent the last few days in Zagreb, Croatia, at the third edition of the SQL TuneIn conference, and I had a very good time here. Nice company, good sessions, and awesome audiences. I presented my “Understanding Execution Plans” precon to a small but interested audience on Monday. Participants have received a download link for the slide deck. On Tuesday I had a larger crowd for my session on cardinality estimation. The slide deck and demo code used for that presentation will be available through…(read more)

Columnstore Case Study #2: Columnstore faster than SSAS Cube at DevCon Security

Preamble

This is the second in a series of posts documenting big wins encountered using columnstore indexes in SQL Server 2012 & 2014.  Many of these can be found in my big deck along with details such as internals, best practices, caveats, etc.  The purpose of sharing the case studies in this context is to provide an easy-to-consume quick-reference alternative. See also Columnstore Case Study #1: MSIT SONAR Aggregations

Why Columnstore?

As stated previously, if we’re looking for a subset of columns from one or a few rows, given the right indexes, SQL Server can do a superlative job of providing an answer. If we’re asking a question which by design needs to hit lots of rows—DW, reporting, aggregations, grouping, scans, etc., SQL Server has never had a good mechanism—until columnstore.

Columnstore indexes were introduced in SQL Server 2012. However, they’re still largely unknown. Some adoption blockers existed; yet columnstore was nonetheless a game changer for many apps.  In SQL Server 2014, potential blockers have been largely removed & they’re going to profoundly change the way we interact with our data.  The purpose of this series is to share the performance benefits of columnstore & to document why columnstore is a compelling reason to upgrade to SQL Server 2014.
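For reference, creating one of these indexes is a single statement. Here’s a minimal sketch against a made-up DW fact table – noting that in SQL Server 2012 the base table becomes read-only while a nonclustered columnstore index exists, while SQL Server 2014 adds updatable clustered columnstore indexes:

CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_FactEvents
ON dbo.FactEvents
(
    EventDateKey, CustomerKey, SiteKey, EventTypeKey, DurationSeconds
);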

The Customer

DevCon Security provides home & business security services & has been in business for 135 years. I met DevCon personnel while speaking to the Utah County SQL User Group on 20 February 2012. (Thanks to TJ Belt (b|@tjaybelt) & Ben Miller (b|@DBADuck) for the invitation which serendipitously coincided with the height of ski season.)

The App: DevCon Security Reporting: Optimized & Ad Hoc Queries

DevCon users interrogate a SQL Server 2012 Analysis Services cube via SSRS. In addition, the SQL Server 2012 relational back end is the target of ad hoc queries; this DW back end is refreshed nightly during a brief maintenance window via conventional table partition switching.

SSRS, SSAS, & MDX

Conventional relational structures were unable to provide adequate performance for user interaction for the SSRS reports. An SSAS solution was implemented requiring personnel to ramp up technically, including learning enough MDX to satisfy requirements.

Ad Hoc Queries

Even though the fact table is relatively small—only 22 million rows & 33GB—the table was a typical DW table in terms of its width: 137 columns, any of which could be the target of ad hoc interrogation. As is common in DW reporting scenarios such as this, it is often nearly impossible to optimize for such queries using conventional indexing.

DevCon DBAs & developers attended PASS 2012 & were introduced to the marvels of columnstore in a session presented by Klaus Aschenbrenner (b|@Aschenbrenner)

The Details

Classic vs. columnstore before-&-after metrics are impressive.

 

 

 

Scenario      | Conventional Structures           | Columnstore   | Δ
SSRS via SSAS | 10 – 12 seconds                   | 1 second      | >10x
Ad Hoc        | 5 – 7 minutes (300 – 420 seconds) | 1 – 2 seconds | >100x

Here are two charts characterizing this data graphically.  The first is a linear representation of Report Duration (in seconds) for Conventional Structures vs. Columnstore Indexes. 

[Chart: Report Duration (seconds), Conventional Structures vs. Columnstore Indexes, linear scale]

As is so often the case when we chart such significant deltas, the linear scale doesn’t expose some of the dramatically improved values corresponding to the columnstore metrics.  Just to make it fair, here’s the same data represented logarithmically; yet even here the values corresponding to 1 – 2 seconds aren’t visible.

[Chart: Report Duration (seconds), Conventional Structures vs. Columnstore Indexes, logarithmic scale]

The Wins

  1. Performance: Even prior to columnstore implementation, at 10 – 12 seconds canned report performance against the SSAS cube was tolerable. Yet the 1 second performance afterward is clearly better. As significant as that is, imagine the user experience re: ad hoc interrogation. The difference between several minutes vs. one or two seconds is a game changer, literally changing the way users interact with their data—no mental context switching, no wondering when the results will appear, no preoccupation with the spinning mind-numbing hurry-up-&-wait indicators.  As we’ve commonly found elsewhere, columnstore indexes here provided performance improvements of one, two, or more orders of magnitude.
  2. Simplified Infrastructure: Because in this case a nonclustered columnstore index on a conventional DW table was faster than an Analysis Services cube, the entire SSAS infrastructure was rendered superfluous & was retired.
  3. PASS Rocks: Once again, the value of attending PASS is proven out. The trip to Charlotte combined with eager & enquiring minds led directly to this success story. Find out more about the next PASS Summit here, hosted this year in Seattle on November 4 – 7, 2014.

DevCon BI Team Lead Nathan Allan provided this unsolicited feedback:

“What we found was pretty awesome. It has been a game changer for us in terms of the flexibility we can offer people that would like to get to the data in different ways.”

Summary

For DW, reports, & other BI workloads, columnstore often provides significant performance enhancements relative to conventional indexing.  In this post, the second in a series of reports on columnstore implementations, I have documented results from DevCon Security, a live customer production app for which performance increased by factors of 10x to 100x for all report queries, including canned reports, and for which time-to-results for ad hoc queries dropped from 5 – 7 minutes to 1 – 2 seconds. As a result of columnstore performance, the customer retired their SSAS infrastructure.

I invite you to consider leveraging columnstore in your own environment. Let me know if you have any questions.

Updated sp_indexinfo

It was time to give sp_indexinfo some love. The procedure is meant to be the “ultimate” index information procedure, providing lots of information about all indexes in a database or all indexes for a certain table. Here is what I did in this update: Changed the second query that retrieves missing index information so it generates the index name (based on schema name, table name and column names – limited to 128 characters). Re-arranged and shortened column names to make output more compact and more…(read more)