Category Archives: Dave Campbell

Go ‘From Data to Insights’ on March 26th in an Exclusive Webcast from Microsoft and The Economist

Big data brings big challenges, but also opportunities – for people and companies to gain insight, spot trends and make decisions in ways never before thought possible. In the battle of the buzzwords, "big data" is poised to render "guesstimation" obsolete.

To better understand the impact of big data on the future of global business, Microsoft is hosting an exclusive webcast briefing, "From Data to Insights", produced in association with The Economist.

In the webcast, you’ll hear from Tom Standage, digital editor of The Economist, on the social and economic benefits of mining data. He’ll be followed by a moderated discussion featuring two Microsoft data experts, VP/Technical Fellow Dave Campbell and Technical Fellow Raghu Ramakrishnan, offering an insider’s view of the trends and technologies driving the business of big data, as well as Microsoft’s big data strategy.

Register now for this big-impact webcast on big data! The webcast takes place on March 26th, 2013, from 9:30 to 10:30 am PDT.

Nitin Saxena
National Application Platform Marketing Lead
Microsoft Corporation

O’Reilly Strata: Busting Big Data Adoption Myths with Halo 4–Part 3

Structured or unstructured – what to store? How about all of it?

I spoke at the O’Reilly Strata Conference last year and am back again this year to share thoughts about big data. It’s been a remarkable year for Microsoft’s data platform team. While we’ve had a long history in the data space, last year was noteworthy for several reasons.

  1. We announced our commitment to the Hadoop community and we partnered with Hortonworks to provide enterprises with an Apache Hadoop-compatible implementation on Windows.
  2. We released new versions of our business intelligence (BI) tools, giving business users access to powerful analytics through familiar tools including Excel 2013, Power View and PowerPivot. The columnar in-memory analytics engine in PowerPivot, available in Excel, is the same technology that powers the SQL Server Analysis Services Tabular model. We’ve made it very easy in SQL Server 2012 to migrate PowerPivot models into Analysis Services, so you can take solutions and insights developed in Excel and scale them up to server-based deployments that can serve many users.
  3. And tomorrow you can start ordering SQL Server 2012 Parallel Data Warehouse (PDW), the next generation of our parallel-processing data warehousing appliance. The latest version of PDW includes a technology we call PolyBase. PolyBase makes it very easy to meld Apache Hadoop data with structured relational data using a standard T-SQL query, as sketched below, enabling large businesses to do efficient information production at scale with the skills and knowledge they already have in-house.
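
To make that last point concrete, here is a rough sketch of the kind of query PolyBase enables. The table and column names are invented for illustration, and dbo.WebClickLogs is assumed to have already been defined as an external table over files in Hadoop; the exact DDL for declaring it varies by release.

    -- Hypothetical example: dbo.WebClickLogs is an external table defined over
    -- files in Hadoop via PolyBase; dbo.Customers is an ordinary relational table.
    SELECT c.CustomerId,
           c.Region,
           COUNT(*) AS ClickCount
    FROM dbo.WebClickLogs AS w
    JOIN dbo.Customers AS c
        ON c.CustomerId = w.CustomerId
    GROUP BY c.CustomerId, c.Region
    ORDER BY ClickCount DESC;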

While that may read like a list of features, here’s why I think they are noteworthy. A recent Microsoft study revealed that 38% of respondents’ current data stores contain unstructured data and 53% rated increased amounts of unstructured data to analyze as extremely important. This trend is only increasing as businesses realize their unstructured data holds the key to new value not accessible in their existing structured data. In my personal experience, over the last two years, many businesses have shifted from thinking of big data as a challenge to perceiving it as an opportunity.

I’m often asked, “Where is the ultimate value in big data and how do I tap into it?” There are two key measures in my mind: 1) time to insight, and 2) return on accessible data. These measures are, in turn, enabled through a process I call information production.

Information production is the process of converting data or information from one domain into another. Consider the following example. Assume you have an ambulance fleet equipped with GPS units that collect telemetry. Information production techniques allow you to convert the raw GPS telemetry – a sequence of records containing {Timestamp, Latitude, Longitude} elements – into an incident response time. The magic of information production is that it takes data which is difficult to deal with in traditional information systems, such as raw GPS telemetry, and transforms it into information (incident ID and response time) that is both more structured and more business-relevant. Once we’ve produced the incident response time, we can logically join it with models which predict patient outcome as a function of response time.
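
As a sketch of what that production step might look like once the raw telemetry has been landed in a queryable store, the following query derives a response time per incident. The schema here (dbo.Dispatches, dbo.GpsTelemetry, and their columns) is invented purely for illustration.

    -- Hypothetical schema: dbo.Dispatches records when an ambulance was sent to an
    -- incident; dbo.GpsTelemetry holds the raw {Timestamp, Latitude, Longitude}
    -- stream, tagged with the ambulance and, once matched, the incident it served.
    SELECT d.IncidentId,
           DATEDIFF(SECOND, d.DispatchTime, MIN(t.Timestamp)) AS ResponseTimeSeconds
    FROM dbo.Dispatches AS d
    JOIN dbo.GpsTelemetry AS t
        ON  t.AmbulanceId = d.AmbulanceId
        AND t.IncidentId  = d.IncidentId      -- telemetry points taken at the incident scene
        AND t.Timestamp  >= d.DispatchTime
    GROUP BY d.IncidentId, d.DispatchTime;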

Great information production tools allow you to reduce the time to insight. They let you get from a hunch to validation very quickly. In fact, there is an emerging class of information production tools that stimulate hypothesis generation by finding correlations in diverse data sets which may hold the key to new value.

Valuable answers require logically joining different data sets – something every database person is familiar with. In traditional databases the “accessible” data is constrained to data which is contained within the database. This data has been normalized, cleaned, and indexed so it can be used to efficiently answer a fixed set of questions over that data domain.

Big data and information production, though, enable a much larger definition of accessible data. Going back to the ambulance example, where would you get the patient outcome model to determine how many lives you could save by reducing response time? By drawing on accessible demographic and population data, you could determine how many heart attack victims’ lives could be saved by moving or adding ambulances.

We will know big data has made it big when it makes everyday experiences better. In fact, one of the things I spoke about in my Strata talk today is how our Halo 4 team is using our HDInsight and big data tools to create a better gaming experience. By using a preview of the Windows Azure HDInsight Service to do Hadoop-based analysis on their unstructured data, the team gained invaluable insight into usage patterns and, as a result, had the agility to make changes that improved the overall gaming experience.

Information production is a key part of our vision and product offerings, enabling you to achieve fast time to insight and greater return on accessible data. Our BI tools, like PowerPivot and Power View, are geared toward making it easier for business users to reduce their time to insight and then effectively share those insights with others. Data Explorer, which we released in preview yesterday, makes it easy to find, transform, and join information, both easing information production and increasing the range of accessible data. HDInsight Service, our Apache Hadoop-based service available in preview on Windows Azure, is being used by our Halo 4 team and external customers to realize new value from their unstructured data. SQL Server 2012 PDW, with its PolyBase technology, enables extremely large-scale information production for the largest business needs.

For all of us involved in big data, and me personally, it is an incredibly exciting time. The next 5-10 years are going to be breathtaking.

You can find out more about our big data solutions by visiting www.microsoft.com/bigdata, and for those interested in reading the Halo 4 case study on their cloud-based big data solution, it’s now available online.

Dave Campbell
Technical Fellow, STB

See more from the Busting Big Data Adoption Myths blog series.

Microsoft Talks Big Data at O’Reilly® Strata

For those of you into big data, the O’Reilly® Strata Conference in Santa Clara, CA is the place to be. If you are there, visit the Microsoft booth to meet with Microsoft’s big brains for big data – and while you’re there, enter to win an Xbox 360 with Kinect.

On Thursday, you won’t want to miss the keynote starting at 9:05AM PT sharp from Dave Campbell, Technical Fellow, Microsoft. Dave will share how big data is a game changer for the Halo 4 team and how they use big data technologies to enhance the gaming experience.

Also on Thursday, stick around for another keynote by Kate Crawford, Principal Researcher at Microsoft Research, at 9:45AM PT. Kate will discuss the biases that can sometimes exist in big data and how organizations might work beyond them to achieve their goals.

To coincide with Strata, we’re kicking off a short blog series that aims to “bust” some of the myths surrounding big data and discuss how our solutions can help your organization get started doing big data analytics today. We look forward to seeing you at the conference and be sure to check back here tomorrow morning for more!

Microsoft Big Data Team

Breakthrough performance with in-memory technologies

In a blog post earlier this year on “The coming database in-memory tipping point”, I mentioned that Microsoft was working on several in-memory database technologies. At the SQL PASS conference this week, Microsoft unveiled a new in-memory database capability, code-named “Hekaton”[1], which is slated to be released with the next major version of SQL Server. Hekaton dramatically improves the throughput and latency of SQL Server’s transaction processing (TP) capabilities. It is designed to meet the requirements of the most demanding TP applications, and we have worked closely with a number of companies to prove these gains. Hekaton’s technology adoption partners include financial services firms, online gaming companies and other businesses with extremely demanding TP requirements. What is most impressive about Hekaton is that it achieves these breakthrough improvements in TP capabilities without requiring a separate data management product or a new programming model. It’s still SQL Server!

As I mentioned in the “tipping point” post, much of the energy around in-memory data management systems thus far has centered on columnar storage and analytical workloads. As that post notes, Microsoft already ships this form of technology in our xVelocity analytics engine and xVelocity columnstore index, and the xVelocity columnstore index will be updated in SQL Server 2012 Parallel Data Warehouse (PDW v2) to support updatable clustered columnstore indexes. Hekaton, in contrast, is a row-based technology squarely focused on transaction processing (TP) workloads. Note that these two approaches are not mutually exclusive: Hekaton alongside SQL Server’s existing xVelocity columnstore index and xVelocity analytics engine makes for a powerful combination.
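
For reference, turning a table columnar is a single statement. The sketch below uses the clustered (updatable) form mentioned above against a hypothetical fact table; the nonclustered form shipping in SQL Server 2012 looks similar but is read-only.

    -- Hypothetical fact table; a clustered columnstore index stores the whole
    -- table in columnar form and, in PDW v2, remains updatable.
    CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales
        ON dbo.FactSales;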

The fact that Hekaton and the xVelocity columnstore index are built into SQL Server, rather than shipped as a separate data engine, is a conscious design choice. Other vendors are either introducing separate in-memory optimized caches or building a unification layer over a set of technologies and introducing it as a completely new product. This adds complexity, forcing customers to deploy and manage a completely new product or, worse yet, to manage both a “memory-optimized” product for the hot data and a “storage-optimized” product for the application data that is not cost-effective to keep primarily in memory.

Hekaton is designed around four architectural principles:

1) Optimize for main memory data access: Storage-optimized engines (such as SQL Server’s current OLTP engine) retain hot data in a main-memory buffer pool based on access frequency. The data access and modification capabilities, however, are built around the assumption that data may be paged in or out to disk at any point. That assumption necessitates layers of indirection in buffer pools, extra code for sophisticated storage allocation and defragmentation, and logging of every minute operation that could affect storage. With Hekaton you place the tables used in the extreme TP portion of an application in memory-optimized main-memory structures. The remaining application tables, such as reference data details or historical data, are left in traditional storage-optimized structures. This approach lets you memory-optimize hotspots without having to manage multiple data engines.

Hekaton’s main-memory structures do away with the overhead and indirection of the storage-optimized design while still providing the full ACID properties expected of a database system. For example, durability in Hekaton is achieved through streamlined logging and checkpointing that uses only efficient sequential IO.

2) Accelerate business logic processing: Given that the free ride on CPU clock rate is over, Hekaton must be more efficient in how it utilizes each core. Today, SQL Server compiles queries and stored procedures into a set of data structures that are evaluated by an interpreter in its query processor. With Hekaton, queries and procedural logic in T-SQL stored procedures are compiled directly into machine code, with aggressive optimizations applied at compilation time. This allows the stored procedure to be executed at the speed of native code.

3) Provide frictionless scale-up: It’s common to find 16 to 32 logical cores even on a 2-socket server nowadays. Storage-optimized engines rely on a variety of mechanisms such as locks and latches to provide concurrency control. These mechanisms often have significant contention issues when scaling up with more cores. Hekaton implements a highly scalable concurrency control mechanism and uses a series of lock-free data structures to eliminate traditional locks and latches while guaranteeing the correct transactional semantics that ensure data consistency.

4) Built-in to SQL Server: As I mentioned earlier, Hekaton is a new capability of SQL Server. This lays the foundation for a powerful customer scenario which has been proven out in our customer testing. Many existing TP systems have certain transactions or algorithms which benefit from Hekaton’s extreme TP capabilities: for example, the matching algorithm in financial trading, resource assignment or scheduling in manufacturing, or matchmaking in gaming scenarios. Hekaton enables optimizing these aspects of a TP system for in-memory processing while the cooler data and processing continue to be handled by the rest of SQL Server.

To make it easy to get started, we’ve built an analysis tool you can run to identify the hot tables and stored procedures in an existing transactional database application. As a first step you can migrate the hot tables to Hekaton as in-memory tables. Doing this simply requires the following T-SQL statements[2]:

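As a rough sketch of the shape of that DDL (the table and column names are invented, and, per footnote [2], the exact syntax may change), a hot table declared as a memory-optimized table looks something like this:

    -- Hypothetical hot table moved into Hekaton. The database needs a
    -- memory-optimized filegroup before memory-optimized tables can be created.
    CREATE TABLE dbo.GameSessions
    (
        SessionId  INT        NOT NULL
            PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
        PlayerId   INT        NOT NULL,
        StartTime  DATETIME2  NOT NULL
    )
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);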

While Hekaton’s memory-optimized tables must fit entirely in main memory, the database as a whole need not. These in-memory tables can be used in queries just like any regular table, while already delivering optimized, contention-free data operations at this stage.

After migrating to optimized in-memory storage, stored procedures operating on these tables can be transformed into natively compiled stored procedures, dramatically increasing the processing speed of in-database logic. Recompiling these stored procedures is, again, done through T-SQL, as shown below:

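Again as a rough sketch (the procedure name and body are illustrative, and the same syntax caveat applies), converting a procedure to native compilation looks something like this:

    -- Hypothetical procedure compiled to machine code. Natively compiled
    -- procedures are schema-bound and execute inside an atomic block.
    CREATE PROCEDURE dbo.usp_RecordSessionStart
        @SessionId INT,
        @PlayerId  INT
    WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
    AS
    BEGIN ATOMIC WITH
        (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')

        INSERT INTO dbo.GameSessions (SessionId, PlayerId, StartTime)
        VALUES (@SessionId, @PlayerId, SYSUTCDATETIME());
    END;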

What can you expect as a performance gain from Hekaton? Customer testing has demonstrated throughput gains of 5X up to 50X on the same hardware, delivering extreme TP performance on mid-range servers. The actual speedup depends on multiple factors, such as how much data processing can be migrated into Hekaton and directly sped up, and how much cross-transaction contention is removed as a result of that speedup and other Hekaton optimizations such as lock-free data structures.

Hekaton is now in private technology preview with a small set of customers. Keep following our product blogs for updates and a future public technology preview.

Dave Campbell
Technical Fellow
Microsoft SQL Server


[1] Hekaton comes from the Greek word ἑκατόν, meaning “hundred”. Our design goal for the original Hekaton proof-of-concept prototype was to achieve a 100x speedup for certain TP operations.

[2] The syntax for these operations will likely change. The examples demonstrate how easy it will be to take advantage of Hekaton’s capabilities.