Category Archives: pdw

APS Best Practice: How to Optimize Query Performance by Minimizing Data Movement

by Rob Farley, LobsterPot Solutions

The Analytics Platform System, with its MPP SQL Server engine (SQL Server Parallel Data Warehouse), can deliver performance and scalability for analytics workloads that you may not have expected from SQL Server. But there are key differences between working with SQL Server PDW and SQL Server Enterprise Edition that one should be aware of in order to take full advantage of PDW’s capabilities. One of the most important considerations when tuning queries in Microsoft SQL Server Parallel Data Warehouse is the minimisation of data movement. This post shows a useful technique: identifying redundant joins through additional predicates that simulate check constraints.

Microsoft’s PDW, part of the Analytics Platform System (APS), offers scale-out technology for data warehouses. This involves spreading data across a number of SQL Server nodes and distributions, such that systems can host up to many petabytes of data. To achieve this, queries which use data from multiple distributions to satisfy joins must leverage the Data Movement Service (DMS) to relocate data during the execution of the query. This data movement is both a blessing and a curse; a blessing because it is the fundamental technology which allows the scale-out features to work, and a curse because it can be one of the most expensive parts of query execution. Furthermore, tuning to avoid data movement is something with which many SQL Server query tuning experts have little experience, as it is unique to the Parallel Data Warehouse edition of SQL Server.

Regardless of whether data in PDW is stored in a column-store or row-store manner, or whether it is partitioned or not, there is a decision to be made as to whether a table is to be replicated or distributed. Replicated tables store a full copy of their data on each compute node of the system, while distributed tables distribute their data across distributions, of which there are eight on each compute node. In a system with six compute nodes, there would be forty-eight distributions, with an average of just over 2% (100% / 48) of the data in each distribution.

When deciding whether to distribute or replicate data, there are a number of considerations to bear in mind. Replicated data uses more storage and also has a larger management overhead, but can be more easily joined to other data, as every SQL node has local access to replicated data. Distributing larger tables according to the hash of one of the table columns (known as the distribution key) reduces the overhead of both reading and writing data – effectively shrinking the amount of data each distribution must handle by more than an order of magnitude.
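
To make the distinction concrete, here is a minimal sketch of how the two options are declared in PDW; the table and column names are purely illustrative:

CREATE TABLE dbo.DimCustomer
(
    CustomerKey   int          NOT NULL,
    CustomerName  varchar(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);                -- full copy stored on every compute node

CREATE TABLE dbo.FactSales
(
    SalesKey      bigint NOT NULL,
    CustomerKey   int    NOT NULL,
    SalesAmount   money  NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerKey),         -- rows spread across the distributions by hash of CustomerKey
      CLUSTERED COLUMNSTORE INDEX);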

Having decided to distribute data, choosing which column to use as the distribution key is driven by factors including the minimisation of data movement and the reduction of skew. Skew matters because if one distribution holds much more than the average amount of data, a query can only run as fast as its slowest distribution. However, the minimisation of data movement is probably the most significant factor in distribution-key choice.
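
A quick way to check for skew is PDW’s DBCC PDW_SHOWSPACEUSED command, which reports rows and space per distribution for a table, making an uneven spread easy to spot (the table name below is illustrative):

-- One result row per distribution; large differences in the row counts indicate skew
DBCC PDW_SHOWSPACEUSED ("dbo.FactSales");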

Joining two tables involves identifying whether rows from each table match according to a number of predicates, but to do this, the two rows must be available on the same compute node. If one of the tables is replicated, this requirement is already satisfied (although it might need to be ‘trimmed’ to enable a left join), but if both tables are distributed, then the data is only known to be on the same node if one of the join predicates is an equality predicate between the distribution keys of the tables, and the data types of those keys are exactly identical (including nullability and length). More can be read about this in the excellent whitepaper about Query Execution in Parallel Data Warehouse.

To avoid data movement in commonly-performed joins, data warehouse designers often need to be creative. This could involve adding extra columns to tables, such as adding the CustomerKey to many fact data tables (and using it as the distribution key), since the orders, items, payments, and other information required for a given report are all ultimately about a customer, and then adding additional predicates to each join to alert the PDW Engine that only rows within the same distribution could possibly match. This thinking is alien to most data warehouse designers, who would typically feel that adding CustomerKey to a table not directly related to a Customer dimension goes against best-practice advice.

 

Another technique commonly used by PDW data warehouse designers, but rarely seen in other SQL Server data warehouses, is splitting a table into two, either vertically or horizontally. Both approaches are relatively common in PDW to avoid some of the problems that can otherwise occur.

Splitting a table vertically is frequently done to reduce the impact of skew when the ideal distribution key for joins is not evenly distributed. Imagine the scenario of identifiable customers and unidentifiable customers, which is increasingly the situation as stores run loyalty programs that allow them to identify a large portion (but not all) of their customers. For the analysis of shopping trends, it could be very useful to have data distributed by customer, but if half the customers are unknown, there will be a large amount of skew.

To solve this, sales could be split into two tables, such as Sales_KnownCustomer (distributed by CustomerKey) and Sales_UnknownCustomer (distributed by some other column). When analysing by customer, the table Sales_KnownCustomer could be used, including the CustomerKey as an additional (even if redundant) join predicate. A view performing a UNION ALL over the two tables could be used to allow reports that need to consider all Sales.
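
A minimal sketch of such a split (column lists trimmed, and names illustrative) is simply two tables that differ in their distribution keys:

CREATE TABLE dbo.Sales_KnownCustomer
(
    SalesKey     bigint NOT NULL,
    CustomerKey  int    NOT NULL,    -- always an identified customer (> 0)
    SalesAmount  money  NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerKey));

CREATE TABLE dbo.Sales_UnknownCustomer
(
    SalesKey     bigint NOT NULL,
    CustomerKey  int    NOT NULL,    -- always the 'unknown' placeholder (0)
    SalesAmount  money  NOT NULL
)
WITH (DISTRIBUTION = HASH(SalesKey));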

The query overhead of having the two tables is potentially high, especially if we consider tables for Sales, SaleItems, Deliveries, and more, which might all need to be split into two to avoid skew while minimising data movement: using CustomerKey as the distribution key when the customer is known (to allow customer-based analysis), and SalesKey when the customer is unknown.

Distributing on a common key effectively creates mini-databases that are split out according to groups of customers, with all of the data about a particular customer residing in a single one. This is similar to the way that people scale out when doing so manually, rather than using a system such as PDW. Of course, there is a lot of additional overhead when trying to scale out manually, such as working out how to execute queries that do involve some amount of data movement.

By splitting the tables into ones for known and unknown customers, queries that looked something like the following:

SELECT …
FROM Sales AS s
JOIN SaleItems AS si
   ON si.SalesKey = s.SalesKey
JOIN Delivery_SaleItems AS dsi
   ON dsi.LineItemKey = si.LineItemKey
JOIN Deliveries AS d
   ON d.DeliveryKey = dsi.DeliveryKey

…would become something like:

SELECT …
FROM Sales_KnownCustomer AS s
JOIN SaleItems_KnownCustomer AS si
   ON si.SalesKey = s.SalesKey
   AND si.CustomerKey = s.CustomerKey
JOIN Delivery_SaleItems_KnownCustomer AS dsi
   ON dsi.LineItemKey = si.LineItemKey
   AND dsi.CustomerKey = s.CustomerKey
JOIN Deliveries_KnownCustomer AS d
   ON d.DeliveryKey = dsi.DeliveryKey
   AND d.CustomerKey = s.CustomerKey
UNION ALL
SELECT …
FROM Sales_UnknownCustomer AS s
JOIN SaleItems_UnknownCustomer AS si
   ON si.SalesKey = s.SalesKey
JOIN Delivery_SaleItems_UnknownCustomer AS dsi
   ON dsi.LineItemKey = si.LineItemKey
   AND dsi.SalesKey = s.SalesKey
JOIN Deliveries_UnknownCustomer AS d
   ON d.DeliveryKey = dsi.DeliveryKey
   AND d.SalesKey = s.SalesKey

I’m sure you can appreciate that this becomes a much larger effort for query writers, and the existence of views to simplify querying back to the earlier shape could be useful. If both CustomerKey and SalesKey were being used as distribution keys, then joins between the views would require both, but this can be incorporated into logical layers such as Data Source Views much more easily than using UNION ALL across the results of many joins. A DSV or Data Model could easily define relationships between tables using multiple columns so that self-service reporting environments leverage the additional predicates.

The use of views should be considered very carefully, as it is easy to end up with views that nest views that nest views that nest views, and an environment that is very hard to troubleshoot and performs poorly. With sufficient care and expertise, however, there are some advantages to be had.

 

The resultant query would look something like:

SELECT …
FROM Sales AS s
JOIN SaleItems AS si
   ON si.SalesKey = s.SalesKey
   AND si.CustomerKey = s.CustomerKey
JOIN Delivery_SaleItems AS dsi
   ON dsi.LineItemKey = si.LineItemKey
   AND dsi.CustomerKey = s.CustomerKey
   AND dsi.SalesKey = s.SalesKey
JOIN Deliveries AS d
   ON d.DeliveryKey = dsi.DeliveryKey
   AND d.CustomerKey = s.CustomerKey
   AND d.SalesKey = s.SalesKey

Joining multiple sets of tables which have been combined using UNION ALL is not the same as performing a UNION ALL of sets of tables which have been joined. Much as any high school mathematics teacher will happily explain that (a*b)+(c*d) is not the same as (a+c)*(b+d), additional combinations need to be considered when the logical order of joins and UNION ALLs is changed.

Notice that when we have (TableA1 UNION ALL TableA2) JOIN (TableB1 UNION ALL TableB2), we must perform joins not only between TableA1 and TableB1, and TableA2 and TableB2, but also TableA1 and TableB2, and TableB1 and TableA2. These last two combinations do not involve tables with common distribution keys, and therefore we would see data movement. This is despite the fact that we know that there can be no matching rows in those combinations, because some are for KnownCustomers and the others are for UnknownCustomers. Effectively, the relationships between the tables would be more like the following diagram:

There is an important stage of Query Optimization which must be considered here, and which can be leveraged to remove the need for data movement when this pattern is applied – that of Contradiction.

The contradiction algorithm is an incredibly useful but underappreciated stage of Query Optimization. Typically it is explained using an obvious contradiction such as WHERE 1=2. Notice the effect on the query plans of using this predicate.

Because the Query Optimizer recognises that no rows can possibly satisfy the predicate WHERE 1=2, it does not access the data structures seen in the first query plan.
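
As a minimal illustration (the table name is just an example), a query of the following form is optimised down to a constant scan that returns no rows:

SELECT SalesKey, CustomerKey
FROM dbo.Sales
WHERE 1 = 2;    -- the optimizer detects the contradiction, so the table is never read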

This is useful, but many readers might not expect queries that use such an obvious contradiction to appear in their code.

But suppose the views that perform a UNION ALL are expressed in this form:

CREATE VIEW dbo.Sales AS
SELECT *
FROM dbo.Sales_KnownCustomer
WHERE CustomerKey > 0
UNION ALL
SELECT *
FROM dbo.Sales_UnknownCustomer
WHERE CustomerKey = 0;

Now, we see a different kind of behaviour.

Before the predicates are used, the query on the views is rewritten as follows (with SELECT clauses replaced by ellipses).

SELECT …
FROM   (SELECT …
        FROM   (SELECT …
                FROM   [sample_vsplit].[dbo].[Sales_KnownCustomer] AS T4_1
                UNION ALL
                SELECT …
                FROM   [tempdb].[dbo].[TEMP_ID_4208] AS T4_1) AS T2_1
               INNER JOIN
               (SELECT …
                FROM   (SELECT …
                        FROM   [sample_vsplit].[dbo].[SaleItems_KnownCustomer] AS T5_1
                        UNION ALL
                        SELECT …
                        FROM   [tempdb].[dbo].[TEMP_ID_4209] AS T5_1) AS T3_1
                       INNER JOIN
                       (SELECT …
                        FROM   (SELECT …
                                FROM   [sample_vsplit].[dbo].[Delivery_SaleItems_KnownCustomer] AS T6_1
                                UNION ALL
                                SELECT …
                                FROM   [tempdb].[dbo].[TEMP_ID_4210] AS T6_1) AS T4_1
                               INNER JOIN
                               (SELECT …
                                FROM   [sample_vsplit].[dbo].[Deliveries_KnownCustomer] AS T6_1
                                UNION ALL
                                SELECT …
                                FROM   [tempdb].[dbo].[TEMP_ID_4211] AS T6_1) AS T4_2
                               ON (([T4_2].[CustomerKey] = [T4_1].[CustomerKey])
                                   AND ([T4_2].[SalesKey] = [T4_1].[SalesKey])
                                       AND ([T4_2].[DeliveryKey] = [T4_1].[DeliveryKey]))) AS T3_2
                       ON (([T3_1].[CustomerKey] = [T3_2].[CustomerKey])
                           AND ([T3_1].[SalesKey] = [T3_2].[SalesKey])
                               AND ([T3_2].[SaleItemKey] = [T3_1].[SaleItemKey]))) AS T2_2
               ON (([T2_2].[CustomerKey] = [T2_1].[CustomerKey])
                   AND ([T2_2].[SalesKey] = [T2_1].[SalesKey]))) AS T1_1

Whereas with the inclusion of the additional predicates, the query simplifies to:

SELECT …
FROM   (SELECT …
        FROM   (SELECT …
                FROM   [sample_vsplit].[dbo].[Sales_KnownCustomer] AS T4_1
                WHERE  ([T4_1].[CustomerKey] > 0)) AS T3_1
               INNER JOIN
               (SELECT …
                FROM   (SELECT …
                        FROM   [sample_vsplit].[dbo].[SaleItems_KnownCustomer] AS T5_1
                        WHERE  ([T5_1].[CustomerKey] > 0)) AS T4_1
                       INNER JOIN
                       (SELECT …
                        FROM   (SELECT …
                                FROM   [sample_vsplit].[dbo].[Delivery_SaleItems_KnownCustomer] AS T6_1
                                WHERE  ([T6_1].[CustomerKey] > 0)) AS T5_1
                               INNER JOIN
                               (SELECT …
                                FROM   [sample_vsplit].[dbo].[Deliveries_KnownCustomer] AS T6_1
                                WHERE  ([T6_1].[CustomerKey] > 0)) AS T5_2
                               ON (([T5_2].[CustomerKey] = [T5_1].[CustomerKey])
                                   AND ([T5_2].[SalesKey] = [T5_1].[SalesKey])
                                       AND ([T5_2].[DeliveryKey] = [T5_1].[DeliveryKey]))) AS T4_2
                       ON (([T4_1].[CustomerKey] = [T4_2].[CustomerKey])
                           AND ([T4_1].[SalesKey] = [T4_2].[SalesKey])
                               AND ([T4_2].[SaleItemKey] = [T4_1].[SaleItemKey]))) AS T3_2
               ON (([T3_2].[CustomerKey] = [T3_1].[CustomerKey])
                   AND ([T3_2].[SalesKey] = [T3_1].[SalesKey]))
        UNION ALL
        SELECT …
        FROM   (SELECT …
                FROM   [sample_vsplit].[dbo].[Sales_UnknownCustomer] AS T4_1
                WHERE  ([T4_1].[CustomerKey] = 0)) AS T3_1
               INNER JOIN
               (SELECT …
                FROM   (SELECT …
                        FROM   [sample_vsplit].[dbo].[SaleItems_UnknownCustomer] AS T5_1
                        WHERE  ([T5_1].[CustomerKey] = 0)) AS T4_1
                       INNER JOIN
                       (SELECT …
                        FROM   (SELECT …
                                FROM   [sample_vsplit].[dbo].[Delivery_SaleItems_UnknownCustomer] AS T6_1
                                WHERE  ([T6_1].[CustomerKey] = 0)) AS T5_1
                               INNER JOIN
                               (SELECT …
                                FROM   [sample_vsplit].[dbo].[Deliveries_UnknownCustomer] AS T6_1
                                WHERE  ([T6_1].[CustomerKey] = 0)) AS T5_2
                               ON (([T5_2].[CustomerKey] = [T5_1].[CustomerKey])
                                   AND ([T5_2].[SalesKey] = [T5_1].[SalesKey])
                                       AND ([T5_2].[DeliveryKey] = [T5_1].[DeliveryKey]))) AS T4_2
                       ON (([T4_1].[CustomerKey] = [T4_2].[CustomerKey])
                           AND ([T4_1].[SalesKey] = [T4_2].[SalesKey])
                               AND ([T4_2].[SaleItemKey] = [T4_1].[SaleItemKey]))) AS T3_2
               ON (([T3_2].[CustomerKey] = [T3_1].[CustomerKey])
                   AND ([T3_2].[SalesKey] = [T3_1].[SalesKey]))) AS T1_1

This may seem more complex – it’s certainly longer – but the joins are now performed within each branch of the UNION ALL, pairing known-customer tables only with known-customer tables and unknown with unknown, which is the original, preferred shape of the query. This is a powerful rewrite.

Furthermore, the astute PDW-familiar reader will quickly realise that the UNION ALL of two local queries (queries that don’t require data movement) is also local, and that this query is therefore completely local. The TEMP_ID_NNNNN tables in the first rewrite are further evidence that data movement was required.

When the two plans are shown using PDW’s EXPLAIN keyword, the significance becomes even clearer.
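
For readers unfamiliar with the syntax, EXPLAIN is simply prefixed to the query and returns the distributed (DSQL) plan as XML, in which data movement shows up as operations such as SHUFFLE_MOVE or BROADCAST_MOVE. A minimal sketch, with illustrative names:

EXPLAIN
SELECT s.SalesKey, d.DeliveryKey
FROM dbo.Sales AS s
JOIN dbo.Deliveries AS d
   ON d.SalesKey = s.SalesKey;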

The first plan appears as follows, and it is obvious that a large amount of data movement is involved.

The queries passed in are identical, but the altered definitions of the views have removed the need for any data movement at all. This should allow your query to run a little faster. Ok, a lot faster.

Summary

When splitting distributed tables vertically to avoid skew, views over those tables should include predicates which reiterate the conditions that cause the data to be populated into each table. This provides additional information to the PDW Engine that can remove unnecessary data movement, resulting in much-improved performance, both for standard reports using designed queries, and ad hoc reports that use a data model.

Introducing the Microsoft Analytics Platform System – the turnkey appliance for big data analytics

At the Accelerate your Insights event last week, Satya Nadella introduced the new Microsoft Analytics Platform System (APS) as Microsoft’s solution for delivering “Big Data in a box.” APS is an evolution of our SQL Server Parallel Data Warehouse (PDW) appliance that builds upon the high performance and scale capabilities of that MPP version of SQL Server, and now introduces a dedicated Hadoop region to the appliance in addition to the SQL Server PDW capabilities. The Hadoop region within the appliance is based on the Hortonworks Data Platform for Windows, but adds key capabilities enterprises expect of a Tier 1 appliance, such as high availability through the appliance design and Windows Server failover clustering, security through Active Directory, and a unified appliance management experience through System Center. Completing the APS package, and seamlessly unifying the data in SQL Server PDW with data in Hadoop, is PolyBase, a groundbreaking query technology developed by Dr. David DeWitt and his team in Microsoft’s Gray Systems Lab.
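
To give a sense of what this looks like in practice, here is a minimal sketch of a PolyBase query using the external table DDL from recent APS releases; the exact syntax varies by appliance version, and all names and locations below are purely illustrative:

CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP, LOCATION = 'hdfs://hadoop-head-node:8020');

CREATE EXTERNAL FILE FORMAT PipeDelimited
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

CREATE EXTERNAL TABLE dbo.WebClickstream
(
    CustomerKey  int,
    ClickTime    datetime2,
    Url          varchar(400)
)
WITH (LOCATION = '/data/clickstream/', DATA_SOURCE = HadoopCluster, FILE_FORMAT = PipeDelimited);

-- A standard T-SQL join across Hadoop data and a PDW table
SELECT c.CustomerName, COUNT(*) AS Clicks
FROM dbo.WebClickstream AS w
JOIN dbo.DimCustomer AS c
   ON c.CustomerKey = w.CustomerKey
GROUP BY c.CustomerName;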

Microsoft continues to work with industry leading hardware partners Dell, HP and Quanta to deliver APS as a turnkey appliance that also delivers the best value in the industry for a data warehouse appliance.

Go to the APS product site to learn more, or watch the short product introduction video.

O’Reilly Strata: Busting Big Data Adoption Myths with Halo 4 – Part 3

Structured or unstructured – what to store? How about all of it?

I spoke at the O’Reilly Strata Conference last year and am back again this year to share thoughts about big data. It’s been a remarkable year for Microsoft’s data platform team. While we’ve had a long history in the data space, last year was noteworthy for several reasons.

  1. We announced our commitment to the Hadoop community and partnered with Hortonworks to provide enterprises with an Apache Hadoop-compatible implementation on Windows.
  2. We released new versions of our business intelligence (BI) tools, providing business users access to powerful analytics through familiar tools including Excel 2013, Power View and PowerPivot. The columnar in-memory analytics engine technology in PowerPivot, available in Excel, is the same technology that is inside the SQL Server Analysis Services Tabular model. We’ve made it very easy in SQL Server 2012 to migrate PowerPivot models into Analysis Services, so you can take solutions and insights developed in Excel and scale them up to server-based deployments which can serve many users.
  3. And tomorrow you can start ordering SQL Server 2012 Parallel Data Warehouse (PDW), the next generation of our parallel processing data warehousing appliance. The latest version of PDW includes a technology we call PolyBase. PolyBase makes it very easy to meld Apache Hadoop with structured relational data using a standard T-SQL query, enabling large businesses to do efficient information production at scale with the existing skills and knowledge they already have in-house.

While that may read like a list of features, here’s why I think they are noteworthy. A recent Microsoft study revealed that 38% of respondents’ current data stores contain unstructured data and 53% rated increased amounts of unstructured data to analyze as extremely important. This trend is only increasing as businesses realize their unstructured data holds the key to new value not accessible in their existing structured data. In my personal experience, over the last two years, many businesses have shifted from thinking of big data as a challenge to perceiving it as an opportunity.

I’m often asked, “Where is the ultimate value in big data and how do I tap into it?” There are two key measures in my mind: 1) time to insight, and 2) return on accessible data. These measures are, in turn, enabled through a process I call information production.

Information production is the process of converting data or information from one domain into another. Consider the following example. Assume you have an ambulance fleet which is equipped with GPS units which collect telemetry. Information production techniques allow you to convert the raw GPS telemetry – a sequence of records containing {Timestamp, Latitude, Longitude} elements – into an incident response time. The magic of information production is that it takes data which is difficult to deal with in traditional information systems, such as raw GPS telemetry, and transforms it into information, (incident ID and response time), which is both more structured and more business relevant. Once we’ve produced the incident response time, we can logically join it with models which predict patient outcome as a function of response time.
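
As a toy illustration of that transformation (all table and column names here are hypothetical), the response time could be produced by pairing each incident’s dispatch time with the first telemetry reading reported near the incident location:

SELECT i.IncidentID,
       DATEDIFF(SECOND, i.DispatchTime, MIN(t.ReadingTime)) AS ResponseSeconds
FROM dbo.Incidents AS i
JOIN dbo.GpsTelemetry AS t
   ON t.AmbulanceID = i.AmbulanceID
  AND t.ReadingTime >= i.DispatchTime
  AND ABS(t.Latitude  - i.Latitude)  < 0.0005    -- within roughly 50 meters of the incident
  AND ABS(t.Longitude - i.Longitude) < 0.0005
GROUP BY i.IncidentID, i.DispatchTime;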

Great information production tools allow you to reduce the time to insight. They allow you to get from a hunch to validation very quickly. In fact, there is an emerging class of information production tools which stimulate hypothesis generation by finding correlations in diverse data sets which may hold the key to new value.

Valuable answers require logically joining different data sets – something every database person is familiar with. In traditional databases the “accessible” data is constrained to data which is contained within the database. This data has been normalized, cleaned, and indexed so it can be used to efficiently answer a fixed set of questions over that data domain.

Big data and information production enable a much larger definition of accessible data though. Going back to the ambulance example, where would you get the patient outcome model to determine how many lives you could save by reducing response time? By using accessible demographic and population data, you could determine how many heart attack victims’ lives could be saved by moving or adding ambulances.

We will know big data has made it big when it makes everyday experiences better. In fact, one of the things I spoke about in my Strata talk today is how our Halo 4 team is using our HDInsight and big data tools to create a better gaming experience. By using a preview of the Windows Azure HDInsight Service to do Hadoop-based analysis on their unstructured data, the team gained invaluable insight into usage patterns and as a result had the agility to make changes to improve the overall gaming experience.

Information production is a key part of our vision and product offerings and enables you to achieve fast time to insight and greater return on accessible data. Our BI tools, like PowerPivot and Power View, are geared to making it easier for a certain class of users to reduce their time to insight and then to be able to effectively share those insights with others. Data Explorer, which we released in preview yesterday, makes it easy to find, transform, and join information to both ease information production and increase the range of accessible data. HDInsight Service, our Apache Hadoop based service, available in preview on Windows Azure, is being used by our Halo 4 team and external customers to realize new value from their unstructured data. SQL Server 2012 PDW with its PolyBase technology enables extremely large scale information production for the largest business needs.

For all of us involved in big data, and me personally, it is an incredibly exciting time. The next 5-10 years are going to be breathtaking.

You can find out more about our big data solutions by visiting www.microsoft.com/bigdata, or for those interested in reading the Halo 4 case study on their cloud-based big data solution  – it’s now available online here.

Dave Campbell
Technical Fellow, STB

See more from the Busting Big Data Adoption Myths blog series.

Seamless insights on structured and unstructured data with SQL Server 2012 Parallel Data Warehouse

In the fast-evolving new world of Big Data, you are being asked to answer a new set of questions that require immediate responses on data that has changed in volume, variety, complexity and velocity. A modern data platform must be able to answer these new questions without costing IT millions of dollars to deploy complex and time-consuming systems.

On November 7, we unveiled details for SQL Server 2012 Parallel Data Warehouse (PDW), our scale-out Massively Parallel Processing (MPP) data warehouse appliance, which has evolved to fully embrace this new world. SQL Server 2012 PDW is built for big data and will provide a fundamental breakthrough in data processing using familiar tools to do seamless analysis on relational and Hadoop data at the lowest total cost of ownership.

  • Built for Big Data: SQL Server 2012 PDW is powered by PolyBase, a breakthrough in data processing that enables integrated queries across Hadoop and relational data. Without manual intervention, the PolyBase query processor can accept a standard SQL query and join tables from a relational source with data from a Hadoop source, returning a combined result seamlessly to the user. Going a step further, integration with Microsoft’s business intelligence tools allows users to join structured and unstructured data together in familiar tools like Excel to answer questions and make key business decisions quickly.
  • Next-generation Performance at Scale: By upgrading the primary storage engine to a new updateable version of the xVelocity columnstore, users can gain in-memory performance (up to 50x faster) on datasets that scale linearly from small all the way up to 5 petabytes of structured data.
  • Engineered for Optimal Value: In SQL Server 2012 PDW, we optimized the hardware specifications required of an appliance through software innovations to deliver significantly greater value, with a cost per terabyte roughly 2.5x lower. Through features delivered in Windows Server 2012, SQL Server 2012 PDW has built-in performance, reliability, and scale for storage using economical high-density disks. Further, Windows Server 2012 Hyper-V virtualizes and streamlines an entire server rack of control functions down to a few nodes. Finally, xVelocity columnstore provides both compression and the potential to eliminate the rowstore copy, reducing storage usage by up to 70%. As a result of these innovations, SQL Server 2012 PDW has a price per terabyte that is significantly lower than all offers in the market today.

With SQL Server 2008 R2 Parallel Data Warehouse, Microsoft already demonstrated high performance at scale when customers like HyVee improved their performance 100 times by moving from SQL Server 2008 R2 to SQL Server 2008 R2 Parallel Data Warehouse. SQL Server 2012 Parallel Data Warehouse takes a big leap forward in performance, scale, and the ability to do big data analysis while lowering costs. For the first time, customers of all shapes and sizes, from the low end to the highest data capacity requirements, can find a data warehouse appliance within their reach.

We are very excited about SQL Server 2012 PDW, which will be released broadly in the first half of 2013, and invite you to learn more through the following resources:

  • Watch the latest PASS Summit 2012 Keynote or sessions here
  • Microsoft Official Blog Post on PASS Summit 2012, authored by Ted Kummert here
  • Read customer examples of SQL Server 2008 R2 PDW (HyVee)
  • Visit HP’s Enterprise Data Warehouse for SQL Server 2008 R2 Parallel Data Warehouse site
  • Find out more about Dell’s SQL Server 2008 R2 Parallel Data Warehouse here