Category Archives: hadoop

Best of 2014: Top 10 Data Exposed Channel 9 Videos for Data Devs

Have you been watching Data Exposed over on Channel 9? If you’re a data developer, Data Exposed is a great place to learn more about what you can do with data: relational and non-relational, on-premises and in the cloud, big and small.

On the show, Scott Klein and his guests demonstrate features, discuss the latest news, and share their love for data technology – from SQL Server, to Azure HDInsight, and more!

We rounded up the year’s top 10 most-watched videos from Data Exposed. Check them out below – we hope you learn something new!

  • Introducing Azure Data Factory: Learn about Azure Data Factory, a new service for data developers and IT pros to easily transform raw data into trusted data assets for their organization at scale.
  • Introduction to Azure DocumentDB: Get an introduction to Azure DocumentDB, a NoSQL document database-as-a-service that provides rich querying, transactional processing over schema free data, and query processing and transaction semantics that are common to relational database systems.
  • Introduction to Azure Search: Learn about Azure Search, a new fully-managed, full-text search service in Microsoft Azure which provides powerful and sophisticated search capabilities to your applications.
  • Azure SQL Database Elastic Scale: Learn about Azure SQL Database Elastic Scale, .NET client libraries and Azure cloud service packages that provide the ability to easily develop, scale, and manage the stateful data tiers of your SQL Server applications.
  • Hadoop Meets the Cloud: Scenarios for HDInsight: Explore real-life customer scenarios for big data in the cloud, and gain some ideas of how you can use Hadoop in your environment to solve some of the big data challenges many people face today.
  • Azure Stream Analytics: See the capabilities of Azure Stream Analytics and how it helps make working with mass volumes of data more manageable.
  • The Top Reasons People Call Bob Ward: Scott Klein is joined by Bob Ward, Principal Escalation Engineer for SQL Server, to talk about the top two reasons why people want to talk to Bob Ward and the rest of his SQL Server Services and Support team.
  • SQL Server 2014 In-Memory OLTP Logging: Learn about In-Memory OLTP, a memory-optimized and OLTP-optimized database engine integrated into SQL Server. See how transactions and logging work on memory-optimized tables, and how a system can recover in-memory data in case of a system failure.
  • Insights into Azure SQL Database: Get a candid and insightful behind-the-scenes look at Azure SQL Database, the new service tiers, and the process around determining the right set of capabilities at each tier.
  • Using SQL Server Integration Services to Control the Power of Azure HDInsight: Join Scott and several members of the #sqlfamily to talk about how to control the cloud from on-premises SQL Server.

Interested in taking your learning to the next level? Try SQL Server or Microsoft Azure now.

The Best of Both Worlds: Top Benefits and Scenarios for Hybrid Hadoop, from On-Premises to the Cloud

Historically, Hadoop has been a big data platform that you deploy either on-premises with your own hardware or in the cloud, managed by a hosting vendor. Deploying on-premises affords you specific benefits, like control and flexibility over your deployment. But the cloud provides other benefits, like elastic scale, fast time to value, and automatic redundancy.

With the recent announcement that Hortonworks Data Platform 2.2 is generally available, Microsoft and Hortonworks have partnered to deliver Hadoop on hybrid infrastructure, both on-premises and in the cloud. This gives customers the best of both worlds: the control and flexibility of on-premises deployments and the elasticity and redundancy of the cloud.

What are some of the top scenarios or use cases for Hybrid Hadoop? And what are the benefits of taking advantage of a hybrid model?

  • Elasticity: Easily scale out during peak demand times by quickly spinning up more Hadoop nodes (with HDInsight)
  • Reliability: Use the cloud as an automated disaster recovery solution that geo-replicates your data.
  • Breadth of Analytics Offerings: If you’re already working with on-prem Hortonworks offerings, you now have access to a suite of turn-key data analytics and management services in Azure, like HDInsight, Machine Learning, Data Factory, and Stream Analytics.

To get started, customers need Hortonworks Data Platform 2.2 with Apache Falcon configured to move data from on-premises into Azure.  Detailed instructions can be found here.

We are excited to be working with Hortonworks to bring Hadoop and big data to a hybrid cloud.

How to Hadoop: 4 Resources to Learn and Try Cloud Big Data

Are you curious about how to begin working with big data using Hadoop? Perhaps you know you should be looking into big data analytics to power your business, but you’re not quite sure about the various big data technologies available to you, or you need a tutorial to get started.  

  1. If you want a quick overview on why you should consider cloud Hadoop: read this short article from MSDN Magazine that explores the implications of combining big data and the cloud and provides an overview of where Microsoft Azure HDInsight sits within the broader ecosystem.

  2. If you’re a technical leader who is new to Hadoop: check out this webinar about Hadoop in the cloud, and learn how you can take advantage of the new world of data and gain insights that were not possible before.

  3. If you’re on the front lines of IT or data science and want to begin or expand your big data capabilities: check out the ‘Working with big data on Azure’ Microsoft Press eBook, which provides an overview of the impact of big data on businesses, a step-by-step guide for deploying Hadoop clusters and running MapReduce in the cloud, and covers several use cases and helpful techniques.

  4. If you want a deeper tutorial for taking your big data capabilities to the next level: master the ins and outs of Hadoop for free on the Microsoft Virtual Academy with this ‘Implementing Big Data Analysis’ training series.

What questions do you have about big data or Hadoop? Are there any other resources you might find helpful as you learn and experiment? Let us know. And if you haven’t yet, don’t forget to claim your free one-month Microsoft Azure trial.

Expanding the Availability and Deployment of Hadoop Solutions on Microsoft Azure

Author: Shaun Connolly
VP Corporate Strategy – Hortonworks

Data growth threatens to overwhelm existing systems and bring those systems to their knees. That’s one of the big reasons we’ve been working with Microsoft to enable a Modern Data Architecture for Windows users and Microsoft customers.

A history of delivering Hadoop to Microsoft customers

Hortonworks and Microsoft have been partnering to deliver solutions for big data on Windows since 2011. Hortonworks is the company that Microsoft relies on for providing the industry’s first and only 100% native Windows distribution of Hadoop, as well as the core Hadoop platform for Microsoft Azure HDInsight and the Hadoop region of the Microsoft Analytics Platform System.

This week we made several announcements that further enable hybrid choice for Microsoft-focused enterprises interested in deploying Apache Hadoop on-premises and in the cloud.

New capabilities for Hadoop on Windows

Hortonworks has announced the newest version of the Hortonworks Data Platform for Windows  – the market’s only Windows native Apache Hadoop-based platform. This brings many new innovations for managing, processing and analyzing big data including:

  • Enterprise SQL at scale
  • New capabilities for data scientists
  • Internet of things with Apache Kafka
  • Management and monitoring improvements
  • Easier maintenance with rolling upgrades

Automated cloud backup for Microsoft Azure

Data architects require Hadoop to act like other systems in the data center, and business continuity through replication across on-premises and cloud-based storage targets is a critical requirement. In HDP 2.2, we extended the capabilities of Apache Falcon to establish an automated policy for cloud backup to Microsoft Azure. This is an important first step in a broader vision to enable seamlessly integrated hybrid deployment models for Hadoop.

Certified Hadoop on Azure Infrastructure as a Service (IaaS)

Increasingly the cloud is an important component for big data deployments. On Wednesday October 15 we announced that the Hortonworks Data Platform (HDP) is the first Hadoop platform to be Azure certified to run on Microsoft Azure Virtual Machines. This gives customers new deployment choices for small and large deployments in the cloud. With this new certification, Hortonworks and Microsoft make Apache Hadoop more widely available and easy to deploy for data processing and analytic workloads enabling the enterprise to expand their modern data architecture.

Maximizing Hadoop Deployment choice for Microsoft Customers

These latest efforts further expand the deployment options for Microsoft customers while providing them with complete interoperability between workloads on-premises and in the cloud. This means that applications built on-premises can be moved to the cloud seamlessly. Complete compatibility between these infrastructures gives customers the freedom to use the infrastructure that best meets their needs. You can back up data where the data resides (geographically) and provide the flexibility and opportunity for others to do Hadoop analytics in the cloud (globally).

We are excited to be the first Hadoop vendor to offer Hadoop on Azure Virtual Machines, and we look forward to continuing our long history of working with Microsoft to engineer and offer the most flexible and easy-to-use deployment options for big data available, further increasing the power of the Modern Data Architecture.


The ins and outs of Apache Storm – real-time processing for Hadoop

Yesterday at Strata + Hadoop World, Microsoft announced the preview of Apache Storm clusters on Azure HDInsight.  This post will give you the ins and outs of Storm.

What is Storm?

Apache Storm is a distributed, fault-tolerant, open source real-time event processing solution. Storm was originally used by Twitter to process massive streams of data from the Twitter firehose. Today, Storm is an incubator project at the Apache Software Foundation. Typically, Storm is integrated with a scalable event queuing system like Apache Kafka or Azure Event Hubs.

What can it do?

Combined with an event queuing system, Storm can process large amounts of real-time data. This enables many different scenarios, like real-time fraud detection, click-stream analysis, financial alerts, telemetry from connected sensors/devices, and more. For information on real-world scenarios, read how companies are using Storm.

How do I get started?

For Microsoft customers, we offer Storm as a preview cluster in Azure HDInsight. This gives you a managed cluster that is easy to set up (a few clicks and a few minutes), highly available (clusters are monitored 24/7 and covered by the Azure SLA for uptime), elastically scalable (more resources can be added depending on need), and integrated with the broad Azure ecosystem (e.g., Event Hubs, HBase, and VNet).

To get started, customers will need to have an Azure subscription or a free trial to Azure. With this in hand, you should be able to get a Storm cluster up and running in minutes by going through this getting started guide.


Microsoft announces real-time analytics in Hadoop and new ML capabilities in Marketplace

This morning at Strata + Hadoop World, Microsoft announced the preview of Apache Storm clusters inside HDInsight as well as new machine learning capabilities in the Azure Marketplace.

Apache Storm is an open source project in the Hadoop ecosystem which gives users access to an event-processing analytics platform that can reliably process millions of events. Now, users of Hadoop can gain insights into events as they happen in real time.

As part of Strata, Microsoft partner Hortonworks announced that the next version of its Hadoop distribution, HDP 2.2, will include capabilities to orchestrate data from on-premises to Azure. This will allow customers to back up their on-premises data or elastically scale out using the power of the cloud.

Finally, Microsoft is offering new machine learning capabilities as part of the Azure Marketplace. Customers can now access ML as web services that enable scenarios like anomaly detection, recommendation engines, and fraud detection, along with a set of R packages.

Read more of Microsoft’s Strata announcements on the Official Microsoft Blog

A story of scale-out on HDInsight

Lately I have been doing a lot of work using Microsoft’s Hadoop offering, HDInsight. I suspect a lot of people who read my blog are unfamiliar with what Hadoop actually is so I thought I’d recount a recent test I did that exhibits the scale-out nature of processing data on Hadoop.

The test used a mapreduce job (written in Java) to process an incoming CSV file and load it into an HBase database on that same cluster (HBase is a NoSQL data store; for the purposes of this blog post that’s all you need to know right now). Mapreduce is a technique for doing data processing on Hadoop’s scale-out architecture, where incoming data is split according to some expression over the incoming rows and then rows with the same result of that expression are combined using some sort of aggregation. The splitting and aggregation can be done on multiple nodes in the Hadoop cluster and hence I like to refer to mapreduce as distributed GroupBy.
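To make the "distributed GroupBy" idea concrete, here is a minimal single-process sketch in Python. It is not the Java job used in the test and it does not use any Hadoop API; the function names and the sample rows are hypothetical, chosen only to show the map, shuffle and reduce steps in order.

```python
from collections import defaultdict

def map_phase(rows, key_fn):
    """Map: tag every incoming row with the key it should be grouped by."""
    return [(key_fn(row), row) for row in rows]

def shuffle(pairs):
    """Shuffle: bring together all rows that share the same key."""
    groups = defaultdict(list)
    for key, row in pairs:
        groups[key].append(row)
    return groups

def reduce_phase(groups, agg_fn):
    """Reduce: aggregate each group of rows into a single result."""
    return {key: agg_fn(rows) for key, rows in groups.items()}

# Hypothetical sample data: (customer_id, amount) rows, as if parsed from a CSV.
rows = [("c1", 10), ("c2", 5), ("c1", 7), ("c3", 1), ("c2", 3)]

pairs = map_phase(rows, key_fn=lambda row: row[0])
totals = reduce_phase(
    shuffle(pairs),
    agg_fn=lambda grouped: sum(amount for _, amount in grouped),
)
print(totals)
```

On a real cluster the map and reduce steps run as many tasks spread over the worker nodes, which is where the scale-out comes from; here everything runs in one process purely for illustration.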

The actual work that our mapreduce job is doing isn’t important, but for your information it’s combining many rows of data pertaining to a single customer into a JSON document and loading that document into HBase. Upon successful completion of the job HBase contained 3125000 JSON documents.


I loaded 1.6GB of raw source data (62.5m rows) into HBase using HDInsight; the quickest I was able to complete this was 34m2s. The key to speeding up our throughput was to (a) use a bigger HDInsight cluster and (b) split our input data into multiple files, thus forcing the processing to be distributed over more map tasks. With more performance tuning (aka scaling out to more HDInsight nodes) I am confident we can get this much lower.

Note that it is possible to specify the number of tasks that the map phase uses rather than letting Hadoop guess how many it should use; for this test I chose not to specify that. In other words, splitting the incoming data over multiple files is not a necessity; it was just a trick I pulled to affect the mapreduce job.

The detail

I generated a 1.6GB (1,677,562,500B) file containing 62 500 000 rows. On the first run I used an HDInsight cluster that had 2 worker nodes. The entire mapreduce job took ~1hr13m50s. The map phase took ~58m, reduce phase took ~1hr6m (so clearly they overlapped – that is because the reduce phase starts as soon as the first map task completes and as you will see below the map tasks completed at wildly different times).
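The size of that overlap can be estimated from the three durations. The sketch below is plain arithmetic in Python with hypothetical helper names, and it assumes the map phase starts when the job starts and the reduce phase ends when the job ends; the inputs are the approximate timings quoted above, so the result is approximate too.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h/m/s duration to whole seconds."""
    return hours * 3600 + minutes * 60 + seconds

def fmt(total_seconds):
    """Format a number of seconds as XmYs."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}m{seconds}s"

def phase_overlap(map_s, reduce_s, job_s):
    """Time during which map and reduce ran concurrently.

    Assumes map starts at job start and reduce finishes at job end,
    so job = map + reduce - overlap.
    """
    return map_s + reduce_s - job_s

first_run = phase_overlap(
    map_s=to_seconds(minutes=58),
    reduce_s=to_seconds(hours=1, minutes=6),
    job_s=to_seconds(hours=1, minutes=13, seconds=50),
)
print(fmt(first_run))
```

With these inputs the two phases overlapped for roughly 50 minutes, which means the reduce phase spent most of its life waiting for map tasks to finish.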

Even though the reduce phase took longer, it’s actually the map phase that caused the long execution time. To try and pinpoint why it took so long I dove into the logs that Hadoop produces. Unless you tell Hadoop otherwise, it determines how many tasks it should spin up in the map phase, and in this case it determined it needed 4 map tasks:


I’ve highlighted the elapsed times for each; note the 4th is much lower. This would explain why the reduce phase took so long: it started as soon as the first map task completed but then had to wait ~52 minutes until all the other map tasks were complete.

Each one of those tasks has its own task log and from those task logs I found the following information:

Processing split: wasb://
Processing split: wasb://
Processing split: wasb://
Processing split: wasb://

The numbers at the end represent the byte ranges that each task is processing (the first one starts from byte 0 as you would expect). Notice the last one (1610612736+66949764). That means it is starting from byte 1610612736 and processing the next 66949764 bytes. Given that task is the 4th task of 4 it shouldn’t surprise you to know that if you add those two numbers together they come to 1677562500 which is exactly the same size as the input file. In other words, the logs tell us exactly what we should expect, that the input data has been split over the 4 tasks that it deemed necessary to process this file.

Notice that the first 3 tasks processed 536 870 912B, the 4th processed only about 12% of that, 66 949 764B. This would explain why the 4th task completed so much quicker than the others. The data has not been split evenly, and clearly that’s a problem because one of the map tasks completed so much quicker than the others which ultimately means the reduce phase has to sit around waiting for all the data – the uneven split of the data has caused inefficient use of our cluster.
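The byte ranges can be reproduced with a few lines of Python. This is a simplified model that just cuts the file into fixed-size chunks; it is not Hadoop's actual input-split logic, and the function name is hypothetical. The split size of 536,870,912 bytes (512MB) is taken from the size of the first three splits in the logs.

```python
def split_ranges(file_size, split_size):
    """Return (offset, length) pairs covering file_size bytes in split_size chunks."""
    ranges = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges

FILE_SIZE = 1_677_562_500        # size of the input file in bytes, from the post
SPLIT_SIZE = 512 * 1024 * 1024   # 536,870,912 bytes, as seen for the first three tasks

splits = split_ranges(FILE_SIZE, SPLIT_SIZE)
for offset, length in splits:
    print(f"{offset}+{length}")

last_share = splits[-1][1] / SPLIT_SIZE
print(f"last split is {last_share:.1%} of a full split")
```

The last range printed is 1610612736+66949764, the same pair quoted from the task log, and the final split works out at 12.5% of a full one.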

We can infer some things from this:

  • The less data a task has to process, the less time that task takes to complete (pretty obvious)
  • If we can increase the number of tasks, the data will be distributed more uniformly over those tasks and they should complete much quicker (and in roughly the same amount of time) due to having less data to process and less competition for resources.

Thus I ran the same test again changing only one variable, the number of nodes in the HDInsight cluster, which I increased from 2 to 20. I hoped that this would increase the number of map tasks. Unfortunately the job failed (my fault: I left some output files lying around from a previous run and that caused a failure). However, it got as far as completing the map phase, which is pretty much all I cared about:


As you can see there were still only 4 tasks and they actually took longer. So, we didn’t achieve more tasks and thus we didn’t speed the job up. That’s not good. The same number of tasks (4) distributed over a greater number of nodes (20) would, you would think, be slightly quicker due to less resource contention. Bit of a weird one, that, and I can’t explain it right now.

I wondered if splitting the input file into lots of smaller files would make a difference so I split that file into 20 equally sized smaller files and ran the job again on the 2-node cluster. This time we got 20 tasks:


Which is great, however the map phase failed due to out-of-memory issues:


So, I uploaded those same 20 files to the 20-node cluster and ran again. We got 20 tasks in the map phase and, thankfully, this time they all completed successfully. The entire job (map + reduce) completed in 34m2s (less than half the time taken on the 2-node cluster when loading the single file); the map phase completed in 10m34s and the reduce phase took 24m46s. The overlap there is only 1m18s and that’s because the durations of the map tasks were more uniformly distributed due to the data being separated over more tasks. Here are the 20 map tasks with durations:


That has been a bit of a braindump but I figured it might be interesting to anyone starting out on the path of doing data processing on Hadoop. Please post questions in the comments.


UPDATE: Thanks to Paul Lambert I’ve found this very useful primer on attempting to artificially set the number of mappers and reducers: HowManyMapsAndReduces

Microsoft Azure Offers HDInsight (Hadoop-as-a-service) to China

Microsoft today announced that Azure HDInsight is now available for all customers in China as a public preview, making it the first global cloud provider to have a publicly available cloud Hadoop offering in China. With this launch, Microsoft is bringing Azure HDInsight’s ability to process big data volumes from unstructured and semi-structured sources to China. Both local Chinese organizations and multi-national corporations with offices in China can spin up a Hadoop cluster within minutes. 

Hadoop is an open-source platform for storing and processing massive amounts of data. By using Hadoop alongside your traditional data architecture, you can gain deep insights into data you never imagined being able to access. As an example, Blackball, a Taiwanese tea and dessert restaurant chain, was able to combine traditional data from its point-of-sale systems with new data from social sentiment and weather feeds to understand why a customer would purchase its products. By combining traditional sources with new “Big Data”, Blackball found out that hot/cold weather was not really a factor in its sales of hot/cold drinks and was able to adjust accordingly to customer demand.

It is these types of results that are causing a wave of demand for big data. We invite you to try HDInsight today either in China or across the other Azure data centers. 


Azure previews fully-managed NoSQL database and search services

I am pleased to announce previews of new NoSQL database and search services and the evolution of our Hadoop-based service. Available as previews today are Azure DocumentDB, a fully-managed transactional NoSQL document database-as-a-service, and Azure Search, which enables developers to easily add search capabilities to mobile and cloud applications. Generally available today, Azure HDInsight, our Hadoop-based solution for the cloud, now supports Apache HBase clusters.

With these new and updated services, we’re continuing to make it easier for customers to work with data of any type and size – using the tools, languages and frameworks they want to — in a trusted cloud environment. From Microsoft products like Azure Machine Learning, Azure SQL Database and Azure HDInsight to data services from our partners, we’re committed to supporting the broadest data platform so our customers get data benefits, in the cloud, on their terms.

Preview of Azure DocumentDB

Applications today must support multiple devices, multiple platforms with rapid iterations from the same data source, and also deliver high-scale and reliable performance. NoSQL has emerged as the leading database technology to address these needs. According to Gartner inquiries, flexible data schemas and application development velocity are cited as primary factors influencing adoption. Secondary factors attracting enterprises are global replication capabilities, high performance and developer interest.*

However, while NoSQL technologies address some document database needs, we’ve been hearing feedback that customers want a way to bridge document database functionality with the transactional capabilities of relational databases. Azure DocumentDB is our answer to that feedback – it’s a NoSQL document database-as-a-service that provides the benefits of a NoSQL document database but also adds the query processing and transaction semantics common to relational database systems.

Built for the cloud, Azure DocumentDB natively supports JSON documents, enabling easy object mapping and iteration of data models for application development. Azure DocumentDB offers programming libraries for several popular languages and platforms, including .NET, Node.js, JavaScript, and Python. We will be contributing the client libraries to the open source community, so they can incorporate improvements into the versions published on

One DocumentDB customer, Additive Labs, builds online services to help their customers move to the cloud. “DocumentDB is the NoSQL database I am expecting today,” said Additive Labs Founder Thomas Weiss. “The ease and power of SQL-like queries had me started in a matter of minutes. And the ability to augment the engine’s behavior with custom JavaScript makes it way easier to adapt to our customers’ new requirements.”

Preview of Azure Search

Search has become a natural way for users to interact with applications that manage volumes of data. However, managing search infrastructure at scale can be difficult and time-consuming and often requires specialized skills and knowledge. Azure Search is a fully-managed search-as-a-service that customers can use to integrate complete search experiences into applications and connect search results to business objectives through fine-tuned ranking profiles. Customers do not have to worry about the complexities of full-text search or deploying, maintaining or managing a search infrastructure.

With Azure Search, developers can easily provision a search service, quickly create and tune one or more indexes, upload data to be indexed and start issuing searches. The service offers a simple API that’s usable from any platform or development environment and makes it easy to integrate search into new or existing applications. With Azure Search, developers can use the Azure portal or management APIs to increase or decrease capacity in terms of queries per second and document count as load changes, delivering a more cost-effective solution for search scenarios.

Retail platform provider Xomni is already using Azure Search to help the company manage its cloud infrastructure. “We have the most technically advanced SaaS solution for delivering product catalogue data to the retail industry in the market today,” said Xomni CTO Daron Yondem. “Integrating Azure Search into our platform will help solidify our leadership as datasets and faceted search requirements evolve over time.”

General availability of Apache HBase for HDInsight

In partnership with Hortonworks, we’ve invested in the Hadoop ecosystem through contributions across projects like Tez, Stinger and Hive. Azure HDInsight, our Hadoop-based service, is another outcome of that partnership.

Azure HDInsight combines the best of Hadoop open source technology with the elasticity and manageability that enterprises require. Today, we’re making generally available HBase as a managed cluster inside HDInsight. HBase clusters are configured to store data directly in Azure Blob storage. For example, customers can use HDInsight to analyze large datasets in Azure Blobs generated from highly-interactive websites or can use it to analyze sensor and telemetry data from millions of end points.

Microsoft data services

Azure data services provide unparalleled choice for businesses, data scientists, developers and IT pros with a variety of managed services from Microsoft and our partners that work together seamlessly and connect to our customers’ data platform investments– from relational data to non-relational data, structured data to unstructured data, constant and evolving data models. I encourage you to try out our new and expanded Azure data services and let us know what you think.

*Gartner, Hype Cycle for Information Infrastructure, 2014, Mark Beyer and Roxane Edjlali, 06 August 2014



POST rows to HBase REST API on HDInsight using PowerShell

I’ve been very quiet on the blogging front of late. There are a few reasons for that but one of the main ones is that I’ve spent the past three months on a new gig immersing myself in Hadoop, primarily in Microsoft’s Hadoop offering called HDInsight. I’ve got a tonne of learnings that I want to share at some point but in this blog post I’ll start with a handy little script that I put together yesterday.

I’m using a tool in the Hadoop ecosystem called HBase which was made available on HDInsight in preview form about a month ago. HBase is a NoSQL solution intended to provide very very fast access to data and my colleagues and I think it might be well suited for a problem we’re currently architecting a solution for. In order to evaluate HBase we wanted to shove lots of meaningless data into it and in the world of HDInsight the means of communicating with your HDInsight cluster is PowerShell. Hence I’ve written a PowerShell script that will use HBase’s REST API to create a table and insert random data into it. Likely if you’ve googled this post then you’re already familiar with Hadoop, HDInsight, REST, PowerShell, HBase, column families, cells, rowkeys and other associated jargon so I won’t cover any of those. What is important is the format of the XML payload that has to get POSTed/PUTted up to the REST API. That payload looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<CellSet>
  <Row key="myrowkey">
    <Cell column="columnfamily:column1">somevalue</Cell>
    <Cell column="columnfamily:column2">anothervalue</Cell>
  </Row>
</CellSet>

The payload can contain as many cells as you like. When the payload gets POSTed/PUTted the values therein need to be base64 encoded but don’t worry, the script I’m sharing herein takes care of all that for you. The script will also create the table for you. The data that gets inserted is totally meaningless and is also identical for each row; modifying the script to insert something meaningful is an exercise for the reader.
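For readers who want to see the encoding step on its own, here is a minimal Python sketch that builds such a payload with the standard library. It is not the PowerShell script mentioned in this post, and the function names are hypothetical. It assumes, as I understand the HBase REST XML format, that rows are wrapped in a CellSet element and that the row key and column names are base64 encoded along with the values. It only builds the XML string; it does not send anything to a cluster.

```python
import base64
import xml.etree.ElementTree as ET

def b64(text):
    """Base64-encode a string and return the result as ASCII text."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def build_cellset(rows):
    """Build a CellSet XML string.

    rows: dict mapping rowkey -> dict of 'columnfamily:column' -> value.
    Assumption: key, column name and value are all base64 encoded.
    """
    cellset = ET.Element("CellSet")
    for rowkey, cells in rows.items():
        row = ET.SubElement(cellset, "Row", key=b64(rowkey))
        for column, value in cells.items():
            cell = ET.SubElement(row, "Cell", column=b64(column))
            cell.text = b64(value)
    return ET.tostring(cellset, encoding="unicode")

payload = build_cellset({
    "myrowkey": {
        "columnfamily:column1": "somevalue",
        "columnfamily:column2": "anothervalue",
    }
})
print(payload)
```

Passing more than one rowkey in the dictionary produces a payload with several Row elements, which is the shape needed for sending multiple rows in one request.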

Another nicety of this script is that it uses Invoke-RestMethod, which is built into PowerShell 4. You don’t need to install other PowerShell modules or anything Azure specific. If you have PowerShell 4 you’re good to go!

Embedding code on this blog site is ugly so I’ve made it available on my OneDrive: CreateHBaseTableAndPopulateWithData.ps1. The screenshot below gives you an idea of what’s going on here.

Hope this helps!


UPDATE. I’ve posted a newer script, CreateHBaseTableAndPopulateWithDataQuickly.ps1, which loads data much more quickly. This one sends multiple rows in each POST and hence I was able to insert 13.97m rows in 3 hours and 37 minutes which, given latency to the datacentre and that this was over a RESTful API, isn’t too bad. The previous version of the script did singleton inserts and hence would have taken weeks to insert that much data.

The number of POSTs and the number of rows in each POST are configurable.
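The batching idea is easy to sketch outside of PowerShell. The Python below is an illustration only, with a hypothetical `batches` helper and made-up row keys; it shows how rows can be grouped into fixed-size batches (one batch per POST) and works out the average throughput implied by the figures quoted above.

```python
from itertools import islice

def batches(rows, batch_size):
    """Yield successive lists of up to batch_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical row keys; each batch would become the body of one POST.
row_keys = (f"row{i}" for i in range(10))
grouped = list(batches(row_keys, batch_size=4))
print([len(batch) for batch in grouped])

# Average throughput for the load described above.
ROWS = 13_970_000
ELAPSED_S = 3 * 3600 + 37 * 60   # 3 hours 37 minutes in seconds
print(f"{ROWS / ELAPSED_S:.0f} rows/second")
```

That works out to a little over a thousand rows per second on average. Larger batches mean fewer round trips, which matters when each request pays the latency to the datacentre.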