Category Archives: Parallel Data Warehouse

Architecture of the Microsoft Analytics Platform System

In today’s world of interconnected devices and ever-broader access to data, the ability to glean ambient insight from a variety of data sources is made difficult by the sheer variety and speed with which that data is delivered. Think about it for a minute: your servers continue to provide interesting data about the operations happening in your business, but now you also have data coming from the temperature sensors in the A/C units, the power supplies, and the networking equipment in the data center, which can be combined to show that spikes in temperature and traffic have a dramatic effect on the life of a server. This kind of contextual data is growing to offer larger and more detailed insights into the operations and management of your business. Looking to the future, Pew Research has released a report that predicts 50 billion connected devices by 2025 – five devices for every person expected to be alive. With data coming from sources ranging from manufacturing equipment to jet airliners, from mobile phones to your scale, and from things we haven’t even imagined yet, the question really becomes: how do you take advantage of all of these data sources to gain insight into the current and future trends in your business?

In April 2014, Microsoft announced the Analytics Platform System (APS) as Microsoft’s “Big Data in a Box” solution for addressing this question. APS is an appliance solution with hardware and software that is purpose-built and pre-integrated to address the overwhelming variety of data while giving customers the opportunity to tap into this vast trove of information. The primary goal of APS is to enable the loading and querying of terabytes and even petabytes of data in a performant way, using a Massively Parallel Processing version of Microsoft SQL Server (SQL Server PDW) and Microsoft’s Hadoop distribution, HDInsight, which is based on the Hortonworks Data Platform.

Basic Design

An APS solution is comprised of three basic components:

  1. The hardware – the servers, storage, networking and racks.
  2. The fabric – the base software layer for operations within the appliance.
  3. The workloads – the individual workload types offering structured and unstructured data warehousing.

The Hardware

Utilizing commodity servers, storage, drives and networking devices from our three hardware partners (Dell, HP, and Quanta), Microsoft is able to offer a high-performance, scale-out data warehouse solution that can grow to very large data sets while providing redundancy of each component to ensure high availability. Starting with standard servers and JBOD (Just a Bunch Of Disks) storage arrays, APS can grow from a simple two-node-with-storage configuration to 60 nodes. At scale, that means a warehouse that houses 720 cores, 14 TB of RAM, 6 PB of raw storage and ultra-high-speed networking using Ethernet and InfiniBand networks, while offering the lowest price per terabyte of any data warehouse appliance on the market (Value Prism Consulting).

Fabric

The fabric layer is built using technologies from the Microsoft portfolio that enable rock-solid reliability, management and monitoring without having to learn anything new. Starting with Microsoft Windows Server 2012, the appliance builds a solid foundation for each workload by providing a virtual environment based on Hyper-V that also offers high availability via Failover Clustering, all managed by Active Directory. Combining this base technology with Cluster Shared Volumes (CSV) and Windows Storage Spaces, the appliance is able to offer a large and expandable base fabric for each of the workloads while reducing the cost of the appliance by not requiring specialized or proprietary hardware. Each of the components offers full redundancy to ensure high availability in failure cases.

Workloads

Building upon the fabric layer, the current release of APS offers two distinct workload types – structured data through SQL Server Parallel Data Warehouse (PDW) and unstructured data through HDInsight (Hadoop). These workloads can be mixed within a single appliance, offering customers the flexibility to tailor the appliance to the needs of their business.

SQL Server Parallel Data Warehouse is a massively parallel processing, shared-nothing, scale-out solution for Microsoft SQL Server that eliminates the need to ‘forklift’ additional very large and very expensive hardware into your datacenter as the volume of data flowing into your warehouse increases. Instead of having to expand from a large multi-processor and connected-storage system to a massive multi-processor and SAN-based solution, PDW uses the commodity hardware model with distributed execution to scale out to a wide footprint. This scale-out model of execution has proven to be a very effective and economical way to grow a workload.

HDInsight is Microsoft’s offering of Hadoop for Windows, based on the Hortonworks Data Platform from Hortonworks. See the HDInsight portal for details on this technology. HDInsight is now offered as a workload on APS to allow for on-premises Hadoop that is optimized for data warehouse workloads. By offering HDInsight as a workload on the appliance, the burden of defining, constructing and managing a Hadoop cluster has been minimized. And by using PolyBase, Microsoft’s SQL Server to HDFS bridge technology, customers can not only manage and monitor Hadoop through tools they are familiar with, but they can for the first time use Active Directory to manage access to the data stored within Hadoop – offering the same ease of user management offered in SQL Server.

Massively-Parallel Processing (MPP) in SQL Server

Now that we’ve laid the groundwork for APS, let’s dive into how we load and process data at such high performance and scale. The PDW region of APS is a scale-out version of SQL Server that enables parallel query execution to occur across multiple nodes simultaneously. The effect is the ability to break what appears to be a very large operation into tasks that can be managed at a smaller scale. For example, a query against 100 billion rows in a SQL Server SMP environment would require the processing of all of the data in a single execution space. With MPP, the work is spread across many nodes, breaking the problem into smaller, more manageable tasks. In a four-node appliance (see the picture below), each node is only asked to process roughly 25 billion rows – a much quicker task.

To accomplish such a feat, APS relies on a couple of key components to manage and move data within the appliance – a table distribution model and the Data Movement Service (DMS).

The first is the table distribution model that allows for a table to be either replicated to all nodes (used for smaller tables such as language, countries, etc.) or to be distributed across the nodes (such as a large fact table for sales orders or web clicks). By replicating small tables to each node, the appliance is able to perform join operations very quickly on a single node without having to pull all of the data to the control node for processing. By distributing large tables across the appliance, each node can process and return a smaller set of data returning only the relevant data to the control node for aggregation.

To create a table in APS that is distributed across the appliance, the user simply specifies the key on which the table is distributed:

CREATE TABLE [dbo].[Orders]
(
  [OrderId] BIGINT NOT NULL  -- illustrative column; the table's remaining columns go here
)
WITH
(
  -- Hash-distribute incoming rows across the compute nodes on OrderId.
  DISTRIBUTION = HASH([OrderId])
)

This allows the appliance to split the data and place incoming rows onto the appropriate node in the appliance.
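By contrast, a small dimension table can be replicated so that a complete copy resides on every compute node, keeping joins against it local to each node. A minimal sketch, using a hypothetical lookup table (the name and columns are illustrative, not from this post):

-- Replicate a small lookup table to every compute node so joins
-- against it never require data movement across the appliance.
CREATE TABLE [dbo].[Country]
(
  [CountryCode] CHAR(2) NOT NULL,
  [CountryName] NVARCHAR(100) NOT NULL
)
WITH
(
  DISTRIBUTION = REPLICATE
)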

The second component is the Data Movement Service (DMS), which manages the routing of data within the appliance. DMS works in partnership with the SQL Server query engine (which creates the execution plan) to distribute the plan to each node. DMS then aggregates the results back on the control node of the appliance, which can perform any final processing before returning the results to the caller. DMS is essentially the traffic cop within APS that enables queries to be executed and data to be moved within the appliance across 2–60 nodes.
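To see DMS at work, PDW supports an EXPLAIN statement that returns the distributed execution plan, including any data movement steps, as XML. A sketch, under the assumption that the Orders table above also carries a CustomerId column and that a hypothetical Customer table is distributed on a different key:

-- When the join columns are not both distribution keys, the plan
-- typically includes a DMS operation (such as a SHUFFLE_MOVE) that
-- redistributes rows between nodes before the join runs locally.
EXPLAIN
SELECT c.[Region], COUNT(*) AS [OrderCount]
FROM [dbo].[Orders] o
JOIN [dbo].[Customer] c ON o.[CustomerId] = c.[CustomerId]
GROUP BY c.[Region]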

Performance

With the introduction of clustered columnstore indexes (CCI) in SQL Server, APS is able to take advantage of the performance gains to better process and store data within the appliance. In typical data warehouse workloads, we commonly see very wide table designs that eliminate the need to join tables at scale (to improve performance). A clustered columnstore index allows SQL Server to store data in columnar format rather than row format. This approach enables queries that don’t utilize all of the columns of a table to retrieve the data from memory or disk more efficiently for processing – increasing performance.
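In PDW, the columnar layout is declared when the table is created. A minimal sketch, with illustrative columns:

-- Store the fact table as a clustered columnstore index: data is kept
-- column by column, so a query touching only a few columns reads far
-- less from memory or disk, and the columnar layout compresses well.
CREATE TABLE [dbo].[FactSales]
(
  [OrderId] BIGINT NOT NULL,
  [ProductId] INT NOT NULL,
  [Quantity] INT NOT NULL,
  [Amount] DECIMAL(18,2) NOT NULL
)
WITH
(
  DISTRIBUTION = HASH([OrderId]),
  CLUSTERED COLUMNSTORE INDEX
)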

By combining CCI tables with parallel processing and the fast processors and storage systems of the appliance, customers are able to improve overall query performance and data compression quite significantly versus a traditional single-server data warehouse. Oftentimes, this means reductions in query execution times from many hours to a few minutes or even seconds. The net result is that companies are able to take advantage of the exhaust of structured or non-structured data in real or near real time to empower better business decisions.

To learn more about the Microsoft Analytics Platform System, please visit us on the web at http://www.microsoft.com/aps.

ICYMI: Data platform momentum

The last couple months have seen the addition of several new products that extend Microsoft’s data platform offerings.  

At the end of January, Quentin Clark outlined his vision for the complete data platform, exploring the various inputs that are driving new application patterns, new considerations for handling data of all shapes and sizes, and ultimately changing the way we can reveal business insights from data.

In February, we announced the general availability of Power BI for Office 365, and you heard from Kamal Hathi about how this exciting release simplifies business intelligence and how, with features like Power BI sites and Power BI Q&A, Power BI helps anyone, not just experts, gain value from their data. You also heard from Quentin Clark about how Power BI helps make big data work for everyone by bringing together easy access to data, robust tools that everyone can use, and a complete data platform.

In March, we announced that SQL Server 2014 would be generally available beginning April 1, and shared how companies are already taking advantage of the in-memory capabilities and hybrid cloud scenarios that SQL Server enables. Shawn Bice explored the platform continuum, and how with this latest release, developers can continue to use SQL Server on-premises while also dipping their toes into the possibilities of the cloud with Microsoft Azure. Additionally, Microsoft Azure HDInsight was made generally available with support for Hadoop 2.2, making it easy to deploy Hadoop in the cloud.

And earlier this month at the Accelerate your insights event in San Francisco, CEO Satya Nadella discussed Microsoft’s drive towards a data culture. In addition, we announced two other key capabilities that extend the robustness of our data platform: the Analytics Platform System, an evolution of the Parallel Data Warehouse with the addition of a Hadoop region for your unstructured data, and a preview of the Microsoft Azure Intelligent Systems Service to help tap into the Internet of Your Things. In case you missed it, watch the keynotes on-demand, and don’t miss out on experiencing the Infinity Room, meant to inspire you with the extraordinary things that can be found in your data.

On top of our own announcements, we’ve been recently honored to be recognized by Gartner as a Leader in the 2014 Magic Quadrants for Data Warehouse Database Management Systems and Business Intelligence and Analytics Platforms. And SQL Server 2014, in partnership with Hewlett Packard, set two world records for data warehousing performance and price/performance.

With these enhancements across the entire Microsoft data platform, there is no better time than now to dig in. Learn more about our data platform offerings. Brush up on your technical skills for free on the Microsoft Virtual Academy. Connect with other SQL Server experts through the PASS community. Hear from Microsoft’s engineering leaders about Microsoft’s approach to developing the latest offerings. Read about the architecture of data-intensive applications in the cloud computing world from Mark Souza, which one commenter noted was a “great example for the future of application design/architecture in the Cloud and proof that the toolbox of the future for Application and Database Developers/DBAs is going to be bigger than the On-Prem one of the past.” And finally, come chat in-person – we’ll be hanging out at the upcoming PASS Business Analytics and TechEd events and are eager to hear more about your data opportunities, challenges, and of course, successes.

What can your data do for you?

Evolving your SQL Server Data Warehouse to the Next Generation, High Performance Solution, the SQL Server Parallel Data Warehouse (PDW)

Last week, we highlighted a whitepaper that focused on the performance benefits of SQL Server PDW and how it differs from traditional SQL Server. We’ve seen many SQL Server customers adopt PDW as the next-generation platform for their data warehouse infrastructure for many of the reasons noted in that whitepaper. This week, we want to follow up by sharing our guide for migrating from a traditional data warehouse to PDW.

This whitepaper is based on real-world best practices and showcases how you can leverage your existing knowledge and skills around SQL Server to deploy SQL Server PDW as your next-generation data warehouse solution. Download the paper to find out more.

How Does SQL Server Parallel Data Warehouse (PDW) Deliver the Performance that it Does?

Last week, we introduced you to SQL Server PDW, the version of SQL Server built specifically for high performance data warehousing that delivers performance gains of up to 50x compared to traditional data warehouses. The next logical question we often get is “how is this possible?” Is it just SQL Server running on special hardware? And the answer is yes…but there is a lot more to it than that.

PDW has a shared-nothing, Massively Parallel Processing (MPP) architecture that allows for better performance when loading and querying data simultaneously, as well as when performing the types of complex queries that are common amongst today’s analytically driven enterprises. The MPP architecture is also what enables PDW’s seamless, linear scalability from terabytes to 6 petabytes of data. This is fundamentally different from the Symmetric Multi-Processing (SMP) architecture that traditional SQL Server is based on, and it is really at the heart of what’s different about PDW and how it is able to deliver the groundbreaking performance that it does.

To learn more about how PDW works and some of the performance gains we’ve seen in typical data warehouse scenarios, we encourage you to download this whitepaper today!


Get to Know the SQL Server that’s Purpose Built for High Performance Data Warehousing and Big Data Analytics – the SQL Server Parallel Data Warehouse Appliance

If your business relies on data, you know that it is a constant challenge to store, manage, and analyze it effectively as your data continues to grow. It’s also expensive to keep enough data on “hot” storage where it is readily available for analysis. Even when you have the data you need on hot storage, it can take hours or even days to run analysis and reports on today’s symmetric multi-processing (SMP) systems. Adding to the challenge, businesses today are struggling to figure out how to fold the value of non-relational Hadoop data into their analysis.

As a result, business analysts are held back from making faster and more accurate data-driven business decisions that are needed to compete in today’s marketplace. This is the modern data challenge and one that some of our SQL Server customers are facing today. But help is here for those of you who know and love SQL Server (or even for those of you who don’t).

Get to know the SQL Server 2012 Parallel Data Warehouse (PDW), Microsoft’s next-generation platform for data warehousing and Big Data integration. With PDW’s massively parallel processing (MPP) design, queries commonly complete 50 times faster than on traditional data warehouses built on symmetric multi-processing (SMP) database management systems. 50 times faster means that queries complete in minutes instead of hours, or seconds instead of minutes. With this breakthrough performance, your business analysts can generate more extensive results faster, and can easily perform ad-hoc queries or drill down into the details. As a result, your business can make better decisions, faster.

To learn more, go to http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/upgrade-to-pdw.aspx.

Insight Through Integration: SQL Server 2012 Parallel Data Warehouse – PolyBase Demo

SQL Server 2012 PDW has a feature called PolyBase that enables you to integrate Hadoop data with PDW data. By using PDW with PolyBase capabilities, a user can (see the sketch after this list):

  1. Use an external table to define a table structure for Hadoop data.
  2. Query Hadoop data by running SQL statements.
  3. Integrate Hadoop data with PDW data by running a PDW query that joins Hadoop data to a relational PDW table.
  4. Persist Hadoop data in PDW by querying Hadoop and saving the results to a PDW table.
  5. Use Hadoop as an online data archive by exporting PDW data to Hadoop. Since the data remains online in Hadoop, users can still retrieve it by querying it from PDW.
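The sketch below walks through the first four steps with hypothetical table names and locations; the exact external table syntax varies across PDW versions and appliance configurations, so treat it as illustrative rather than definitive. (Step 5, archiving to Hadoop, uses a similar export path.)

-- 1. Define an external table over delimited files stored in HDFS.
CREATE EXTERNAL TABLE [dbo].[HdfsClickstream]
(
  [UserId] INT,
  [Url] VARCHAR(1000),
  [EventDate] DATE
)
WITH
(
  LOCATION = 'hdfs://MyHadoopCluster:5000/data/clickstream/',
  FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
)

-- 2 and 3. Query the Hadoop data with ordinary SQL, joining it to a
-- relational PDW table in a single statement.
SELECT u.[UserName], COUNT(*) AS [Clicks]
FROM [dbo].[HdfsClickstream] h
JOIN [dbo].[Users] u ON h.[UserId] = u.[UserId]
GROUP BY u.[UserName]

-- 4. Persist the Hadoop data in PDW by saving the query results to a
-- distributed PDW table (CREATE TABLE AS SELECT).
CREATE TABLE [dbo].[Clickstream]
WITH (DISTRIBUTION = HASH([UserId]))
AS SELECT * FROM [dbo].[HdfsClickstream]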

In the video below, which highlights a solution to a problem that involves sending help to evacuate potential victims of a hurricane, Microsoft SQLCAT Senior Program Manager Murshed Zaman demonstrates how to answer a customer question using relational data from SQL Server Parallel Data Warehouse 2012 (PDW 2012) and non-relational data stored inside Hadoop. The demo shows how you can analyze data by combining the capabilities of Power View and Power Pivot for Excel, Hadoop, and PDW. This video focuses on the PolyBase feature of SQL Server Parallel Data Warehouse 2012. Power Pivot and Power View were added to the demonstration to help visualize the data results.

For step-by-step instructions on creating the PowerView report please visit Cindy Gross’ blog "Hurricane Sandy Mash-Up: Hive, SQL Server, PowerPivot & Power View."

For more information on SQL Server Parallel Data Warehouse Appliance visit http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/pdw.aspx

PASS BAC PREVIEW SERIES: SQL Professionals and the World of Self-service BI and Big Data

As the PASS Business Analytics Conference approaches, we are excited to host this new blog series featuring content from some of the amazing speakers that you will experience at PASS BAC, April 10-12 in Chicago. This new conference is the place where Business Analytics professionals will gather to connect, share and learn in world class sessions covering a wide range of important topics. You can find out more about the sessions, the speakers, or register here.

*     *     *

FEATURED PASS BUSINESS ANALYTICS CONFERENCE SPEAKER: Cindy Gross

A member of Microsoft SQLCAT, Cindy Gross (@SQLCindy) works with Microsoft Business Intelligence, SQL Server, Hadoop, and Hive to make data more accessible to decision makers and generate more business value. She has extensive data platform experience, with roles that span engineering and direct customer engagement, and was honored to earn the title of SQL Server 2008 Microsoft Certified Master. As an active member of the technical community, she regularly speaks at local and national events and contributes to whitepapers, blogs, technical articles, and books.

Cindy will be co-presenting at PASS BAC with Eduardo Gamez of Intel in a 60 minute Breakout Session: How Intel Integrates Self-Service BI with IT for Better Business Results.

The IT buzzword of the year seems to be “big data.” If you listen to the hype, big data will solve all the world’s problems and do so at a bargain price. While that’s obviously an exaggeration, the hype does point to an underlying change in the way we approach our data and how we make that data available within a business. Self-service BI allows business experts to use familiar tools such as Excel to quickly create and share data models and reports that are highly impactful to their portion of the business. They may base their data models and reports on data sources that weren’t available in the recent past, providing new insight not previously available. External data, archival data, previously unknown and unexplored data, big data, and large volumes of data are cheaper to store and shared more easily than ever before. This changes how IT interacts with business users.

For anyone looking for the latest and greatest in BI, or someone trying to cut costs at every opportunity, the concepts of big data sound like a siren song: Scale out, distributed storage and processing, inexpensive hardware, elasticity, schema on read instead of write, flexibility, speed of end-to-end answers, data too big to store affordably in existing systems. As a result, some people are trying to replace their relational and multi-dimensional systems such as SQL Server and Analysis Services with big data technologies such as Hadoop as if they are completely interchangeable. This presents a problem, as they fill different needs, have different core strengths, and often work together in a complementary fashion rather than opposing each other.

For example, the SQL Server PDW V2 appliance has a feature, codenamed PolyBase, which allows a single T-SQL query to combine SQL Server PDW data with Hadoop data in a single result set. A frequent use case in the big data world, used by the likes of Yahoo! and Klout, is to do the initial data processing in a batch-based, scaled-out Hadoop system, then pull the aggregated data into Analysis Services, where individual queries are fast and the business can easily and quickly get the answers needed for their standard questions. Self-service BI tools such as PowerPivot (highly compressed in-memory data models in Excel 2013 or in Excel + SharePoint) and Power View (easy, powerful visualizations within Excel reports) can see Hadoop data, including Microsoft’s distribution of Hadoop called HDInsight, as just another ODBC data source if you use the Hadoop ecosystem component Hive to impose a “rows and columns” structure on the data. Once a PowerPivot data model is deemed to be “important enough” to be shared, or if it needs more memory than is available on a client machine, it can be easily imported into a server-side Analysis Services tabular model, which can also be a data source for Power View reports, Excel spreadsheets, and other BI tools.

The bottom line here: your users may be accessing and incorporating big data into their self-service BI reports without even knowing it, and you, the IT professional, need to know if/when/how to adjust your tools, skill sets, and processes to handle these new realities.
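As one concrete illustration of the Hive step mentioned above, a Hive external table can project a schema onto files already sitting in HDFS, after which Excel can reach the data through the Hive ODBC driver. A minimal HiveQL sketch, with hypothetical names and paths:

-- HiveQL: impose a tabular schema on raw tab-delimited log files in
-- HDFS. Hive does not move or convert the data; the schema is applied
-- on read, so the table is queryable immediately.
CREATE EXTERNAL TABLE weblogs (
  user_id INT,
  url STRING,
  click_time STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/weblogs';

-- PowerPivot can then issue queries over ODBC, for example:
-- SELECT url, COUNT(*) FROM weblogs GROUP BY url;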

Sometimes Hadoop and other big data technologies are not the only – or the best – way to provide that processing power or data sources for all those self-service reports. If the data is already well-known, well-curated, and loaded into an existing system that answers the important questions for the business units involved, then copying or moving that data to a new Hadoop-based system may not be the best or most economical answer. SQL Server systems can handle many terabytes of data and the new Parallel Data Warehouse (PDW) V2 appliances will have several petabytes of storage per appliance.

But, if you have new/expanded data sources (such as keeping more columns or rows than the existing relational system is designed for), data that isn’t well defined or has multiple formats or interpretations, or you aren’t sure yet what questions you will ask, then one of the big data technologies such as Hadoop with the Hive component may be a great fit for exploring the data. With Hadoop you can easily and quickly change the structure you impose on the data and do a lot of batch and ad hoc processing across a scaled out system of relatively inexpensive nodes. Each individual program or query is exploring an entire data set (for SQL Server folks this is like a table scan), so individual queries, especially if they go against subsets of the data, are likely to be slower than well-known, well-indexed queries on a relational system such as SQL Server. But the end-to-end time from when you decide you need to explore some particular combination of often previously unknown and/or uncurated data sets to when you are able to provide some sort of programmatic answer to a set of often previously undefined questions is likely faster in Hadoop. This fits well with the idea of self-service BI.

Once you figure out what data is most valuable and the sorts of questions that provide the most value, you may be able to pull a subset of the data (fewer columns and/or fewer rows) into SQL Server and/or Analysis Services, add indexes or aggregations, and run those individual queries faster. As a SQL Server or BI implementer it is very important to understand enough about Hadoop and the other big data technologies to discern which projects need to include SQL Server, Analysis Services or PowerPivot, StreamInsight, PDW, or Hadoop, and at what stage. The more you know about both current systems and new ones, the more successful you, your team, and your company will be at implementing impactful projects at a reasonable cost.

The upcoming PASS Business Analytics Conference is a great opportunity to dip your toes or really dive into analytics, self-service BI, and big data. Whether you are new to these areas or looking to gain more depth you will find sessions to help meet your goals, starting with a keynote by Dr. Steven Levitt, author of Freakonomics. This is a great opportunity to network with other professionals working with Microsoft BI and Business Analytics including data scientists, architects, analysts, implementers, and business decision makers. I look forward to talking to you at the conference and learning more about your BI/BA needs and solutions!

Microsoft in Leaders Quadrant of Gartner* Magic Quadrant for Data Warehouse Database Management Systems

Hello SQL Server community! My name is Eron Kelly and I recently took on the role of General Manager of Product Marketing for the Data Platform at Microsoft. I’ve been at Microsoft for over 12 years in various product marketing roles on BizTalk, Exchange, Office 365 and Windows Azure. I’m excited that my first blog post in my new role is to announce Microsoft’s positioning as a Leader in the Magic Quadrant for Data Warehouse Database Management Systems that was published on January 31, 2013.

First off, thank you to all the customers who spoke to Gartner on our behalf for this Magic Quadrant. Your engagement with us on SQL Server has made it a better product for data warehousing, and we believe that our position in the Leaders Quadrant in this report reflects the core values of our platform: ease of use, high scale, high availability and market leading TCO. This placement recognizes our strategy to provide customers with data warehouses of all sizes – from the mid-market to the largest mission critical, tier one deployments – that have the best price for performance in market.

This is different from the approach of other vendors in the data warehousing space, who can deliver similar performance at a much higher price. Customers like Hy-Vee, which saw a 100x query performance improvement, and AMD, which adds a terabyte of data to its data warehouse each week, are seeing great performance with the SQL Server Parallel Data Warehouse Appliance.

We are pleased that Gartner has recognized us as a leader in Data Warehouse Database Management Systems and that customers are affirming our approach. In the coming year, we will continue to focus on delivering the highest value to our customers through innovations in mission critical scale, performance, and high availability while continuing to deliver on lowering the total cost of ownership of the data platform. Thank you for your continued support and I look forward to talking with you more in the future.

Eron Kelly
General Manager
SQL Server

* Disclaimer: Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

CROSSMARK Uses SQL Server 2008 R2 Parallel Data Warehouse to Quickly Deliver Business Insights

With the growth of the consumer goods industry, sales and marketing campaigns have created large and complex databases that are hard to sift through without the right tools. As retailers approach the busy holiday shopping season, they need to have insights into the effectiveness of their campaigns. Retailers need to know what market trends are affecting their customers to maximize the reach of these campaigns and they need to be able to sort through all of this data quickly to find useful and actionable insights.

Every now and then, we like to highlight how our customers are using Microsoft’s database platform solutions to meet these types of needs in real time. One such customer is CROSSMARK, a provider of sales and marketing services for manufacturers and retail companies, which recently launched a new self-service data portal powered by SQL Server 2008 R2 Parallel Data Warehouse (PDW) to bring this data and these insights to its customers. SQL Server PDW’s on-demand data access allows CROSSMARK’s customers to leverage shopper insights and data to inform the strategies and tactics behind more effective sales and marketing campaigns that boost sales and profitability.

Before implementing SQL Server PDW, CROSSMARK’s legacy platform had a bottleneck: its data reporting didn’t scale, forcing employees to spend valuable time on data reporting instead of working with customers. Now, with SQL Server PDW, CROSSMARK can easily scale its resources to handle the millions of in-store activities processed each year, allowing CROSSMARK employees to spend more time with customers and less time with the data.

CROSSMARK is also on-track to implement SQL Server BI tools including Power View and PowerPivot to provide more business intelligence tools to its customers.

To read more about CROSSMARK, take a look at this Customer Spotlight feature on News Center.

Seamless insights on structured and unstructured data with SQL Server 2012 Parallel Data Warehouse

In the fast evolving new world of Big Data, you are being asked to answer a new set of questions that require immediate responses on data that has changed in volume, variety, complexity and velocity. A modern data platform must be able to answer these new questions without costing IT millions of dollars to deploy complex and time consuming systems.

On November 7, we unveiled details for SQL Server 2012 Parallel Data Warehouse (PDW), our scale-out Massively Parallel Processing (MPP) data warehouse appliance, which has evolved to fully embrace this new world. SQL Server 2012 PDW is built for big data and will provide a fundamental breakthrough in data processing using familiar tools to do seamless analysis on relational and Hadoop data at the lowest total cost of ownership.

  • Built for Big Data: SQL Server 2012 PDW is powered by PolyBase, a breakthrough in data processing that enables integrated queries across Hadoop and relational data. Without manual intervention, the PolyBase query processor can accept a standard SQL query and join tables from a relational source with data from a Hadoop source, returning a combined result seamlessly to the user. Going a step further, integration with Microsoft’s business intelligence tools allows users to join structured and unstructured data together in familiar tools like Excel to answer questions and make key business decisions quickly.
  • Next-generation Performance at Scale: By upgrading the primary storage engine to a new updateable version of the xVelocity columnstore, users can gain in-memory performance (up to 50x faster) on datasets that scale linearly from small deployments all the way up to 5 petabytes of structured data.
  • Engineered for Optimal Value: In SQL Server 2012 PDW, we optimized the hardware specifications required of an appliance through software innovations to deliver significantly greater cost savings – roughly 2.5x lower cost per terabyte. Through features delivered in Windows Server 2012, SQL Server 2012 PDW has built-in performance, reliability, and scale for storage using economical high-density disks. Further, Windows Server 2012 Hyper-V virtualizes and streamlines an entire server rack of control functions down to a few nodes. Finally, the xVelocity columnstore provides both compression and the potential to eliminate the rowstore copy, reducing storage usage by up to 70%. As a result of these innovations, SQL Server 2012 PDW has a price per terabyte that is significantly lower than any offer in the market today.

With SQL Server 2008 R2 Parallel Data Warehouse, Microsoft already demonstrated high performance at scale, with customers like Hy-Vee improving their performance 100 times by moving from SQL Server 2008 R2 to SQL Server 2008 R2 Parallel Data Warehouse. SQL Server 2012 Parallel Data Warehouse takes a big leap forward in performance, scale, and the ability to do big data analysis while lowering costs. For the first time, customers of all shapes and sizes – from modest data needs to the highest data capacity requirements – can find a data warehouse appliance within their reach.

We are very excited about SQL Server 2012 PDW, which will be released broadly in the first half of 2013, and we invite you to learn more through the following resources:

  • Watch the latest PASS Summit 2012 Keynote or sessions here
  • Microsoft Official Blog Post on PASS Summit 2012, authored by Ted Kummert here
  • Read customer examples of SQL Server 2008 R2 PDW (Hy-Vee)
  • Visit HP’s Enterprise Data Warehouse for SQL Server 2008 R2 Parallel Data Warehouse site
  • Find out more about Dell’s SQL Server 2008 R2 Parallel Data Warehouse here