Category Archives: Azure

The Best of Both Worlds: Top Benefits and Scenarios for Hybrid Hadoop, from On-Premises to the Cloud

Historically, Hadoop has been a big data platform that you deploy either on-premises on your own hardware or in the cloud, managed by a hosting vendor. Deploying on-premises affords you specific benefits, like control and flexibility over your deployment. The cloud, in turn, provides benefits like elastic scale, fast time to value, and automatic redundancy.

With the recent announcement that Hortonworks Data Platform 2.2 is generally available, Microsoft and Hortonworks have partnered to deliver Hadoop on hybrid infrastructure, both on-premises and in the cloud. This gives customers the best of both worlds: the control and flexibility of on-premises deployments and the elasticity and redundancy of the cloud.

What are some of the top scenarios or use cases for Hybrid Hadoop? And what are the benefits of taking advantage of a hybrid model?

  • Elasticity: Easily scale out during peak demand times by quickly spinning up more Hadoop nodes (with HDInsight)
  • Reliability: Use the cloud as an automated disaster recovery solution that geo-replicates your data.
  • Breadth of Analytics Offerings: If you’re already working with on-prem Hortonworks offerings, you now have access to a suite of turn-key data analytics and management services in Azure, like HDInsight, Machine Learning, Data Factory, and Stream Analytics.

To get started, customers need Hortonworks Data Platform 2.2 with Apache Falcon configured to move data from on-premises into Azure.  Detailed instructions can be found here.

We are excited to be working with Hortonworks to bring Hadoop and big data to users on a hybrid cloud.

Archiving Azure Automation logs to Azure BLOB Storage

For the past few months at work I’ve been diving headlong into a service on Microsoft Azure called Azure Automation which is, to all intents and purposes, a service for hosting and running PowerShell scripts (properly termed “runbooks”). I’ve got a lot to say and a lot to share about Azure Automation, and this is the first such post.

Each time Azure Automation runs a runbook (a running instance of a runbook is termed a job) it stores a log of that job; however, that log isn’t readily available as a log file on a file system as one might be used to. There are two ways to access the log: via the Azure Management Portal or programmatically using PowerShell cmdlets. Both are handy, but I wanted the logs available as files in Azure BLOB Storage so that they could be easily viewed and archived (in the future we’ll be building Hive tables over the top of the logs – remind me to blog about that later once it’s done). I have written a runbook called Move-AzureAutomationRunbookLogs that will archive logs for a given runbook and for a given time period into a given Azure BLOB Storage account, and I’ve made it available on the Azure Automation gallery:


It does pretty much what it says on the tin, so if you’re interested hit the link and check it out. For those that are simply interested in reading the code I’ve provided a screenshot of it below too (suggestions for how to copy-paste PowerShell code into Live Writer so that it keeps its formatting are welcome!!). There’s a screenshot of the output too if that’s the sort of thing that floats your boat.

After this archiving runbook has run you’ll find two folders, “job” and “joboutput”, under the root folder that you specified. Inside each of those is a folder for each runbook whose logs you are collecting (each folder contains a file per job). The “job” folder contains a short summary of each job, whereas the “joboutput” folder contains the good stuff – the actual output from the run.

One last point: if you’re going to use this runbook I highly recommend doing so in conjunction with a technique outlined by Eamon O’Reilly at Monitoring Azure Services and External Systems with Azure Automation. Eamon’s post describes a technique for running operations such as these on a schedule, and it’s something that I’ve found to be very useful indeed.

Hope this is useful!




My MVA course about SQL Server 2014 & Azure

If you are tired today, maybe you could take your time and spend the whole weekend with SQL Server 2014 and Azure?

The new MVA course is ready 🙂 and waiting for you. Just visit the link and enjoy. Watch out – it’s in Polish 🙂

The topics covered:

  • backup encryption
  • manual backup to Azure storage
  • smart backup 🙂
  • buffer pool extension
  • SQL Server files integration with Azure storage

More coming…

Two Presentations This Weekend at #SQLSaturday349 in Salt Lake City

I have the privilege of being selected to fill two slots this weekend at #SQLSaturday349 in Salt Lake City.  The venue is simultaneously hosting the Big Mountain Data event.

My talks are:

To the Cloud, Infinity, & Beyond: Top 10 Lessons Learned at MSIT


Columnstore Indexes in SQL Server 2014: Flipping the DW /faster Bit

The latter is one of my staples.  The former is a new presentation, a preview of my forthcoming delivery for PASS Summit14.

Register, see the schedule, or see the event home page on the SQL Saturday site.  I’ll look forward to seeing you here:

Spencer Fox Eccles Business Building
1768 Campus Center Drive
Salt Lake City, UT 84112

Kudos to Pat Wright (blog | @sqlasylum) & crew for their crazy efforts (pardon the pun) in coordinating the event.

TJay Belt (@tjaybelt), Andrea Alred (@RoyalSQL), & Ben Miller (@DBADuck), keep your peepers peeled—I’m on the way!

Cloudera Selects Azure as a Preferred Cloud Platform

We are working to make Azure the best cloud platform for big data, including Apache Hadoop. To accomplish this, we deliver a comprehensive set of solutions such as our Hadoop-based solution Azure HDInsight and managed data services from partners, including Hortonworks. Last week Hortonworks announced the most recent milestone in our partnership, and yesterday we announced even more data options for our Azure customers through a partnership with Cloudera.

Cloudera is recognized as a leader in the Hadoop community, and that’s why we’re excited Cloudera Enterprise has achieved Azure Certification. As a result of this certification, organizations will be able to launch a Cloudera Enterprise cluster from the Azure Marketplace starting October 28. Initially, this will be an evaluation cluster with access to MapReduce, HDFS and Hive. At the end of this year when Cloudera 5.3 releases, customers will be able to leverage the power of the full Cloudera Enterprise distribution including HBase, Impala, Search, and Spark.

We’re also working with Cloudera to ensure greater integration with Analytics Platform System, SQL Server, Power BI and Azure Machine Learning. This will allow organizations to build big data solutions quickly and easily by using the best of Microsoft and Cloudera, together. For example, Arvato Bertelsmann was able to help clients cut fraud losses in half and speed credit calculations by 1,000x.

Our partnership with Cloudera allows customers to use the Hadoop distribution of their choice while getting the cloud benefits of Azure. It is also a sign of our continued commitment to make Hadoop more accessible to customers by supporting the ability to run big data workloads anywhere – on hosted VMs and managed services in the public cloud, on-premises or in hybrid scenarios.

From Strata in New York to our recent news from San Francisco it’s exciting times ahead for those in the data space.  We hope you join us for this ride!

Eron Kelly
General Manager, Data Platform

Expanding the Availability and Deployment of Hadoop Solutions on Microsoft Azure

Author: Shaun Connolly
VP Corporate Strategy – Hortonworks

Data growth threatens to overwhelm existing systems and bring those systems to their knees. That’s one of the big reasons we’ve been working with Microsoft to enable a Modern Data Architecture for Windows users and Microsoft customers.

A history of delivering Hadoop to Microsoft customers

Hortonworks and Microsoft have been partnering to deliver solutions for big data on Windows since 2011. Hortonworks is the company that Microsoft relies on for providing the industry’s first and only 100% native Windows distribution of Hadoop, as well as the core Hadoop platform for Microsoft Azure HDInsight and the Hadoop region of the Microsoft Analytics Platform System.

This week we made several announcements that further enable hybrid choice for Microsoft-focused enterprises interested in deploying Apache Hadoop on-premises and in the cloud.

New capabilities for Hadoop on Windows

Hortonworks has announced the newest version of the Hortonworks Data Platform for Windows  – the market’s only Windows native Apache Hadoop-based platform. This brings many new innovations for managing, processing and analyzing big data including:

  • Enterprise SQL at scale
  • New capabilities for data scientists
  • Internet of things with Apache Kafka
  • Management and monitoring improvements
  • Easier maintenance with rolling upgrades

Automated cloud backup for Microsoft Azure

Data architects require Hadoop to act like other systems in the data center, and business continuity through replication across on-premises and cloud-based storage targets is a critical requirement. In HDP 2.2, we extended the capabilities of Apache Falcon to establish an automated policy for cloud backup to Microsoft Azure. This is an important first step in a broader vision to enable seamlessly integrated hybrid deployment models for Hadoop.

Certified Hadoop on Azure Infrastructure as a Service (IaaS)

Increasingly the cloud is an important component for big data deployments. On Wednesday October 15 we announced that the Hortonworks Data Platform (HDP) is the first Hadoop platform to be Azure certified to run on Microsoft Azure Virtual Machines. This gives customers new deployment choices for small and large deployments in the cloud. With this new certification, Hortonworks and Microsoft make Apache Hadoop more widely available and easy to deploy for data processing and analytic workloads enabling the enterprise to expand their modern data architecture.

Maximizing Hadoop Deployment choice for Microsoft Customers

These latest efforts further expand the deployment options for Microsoft customers while providing them with complete interoperability between workloads on-premises and in the cloud. This means that applications built on-premises can be moved to the cloud seamlessly. Complete compatibility between these infrastructures gives customers the freedom to use the infrastructure that best meets their needs. You can back up data where the data resides (geographically) and provide the flexibility and opportunity for others to do Hadoop analytics in the cloud (globally).

We are excited to be the first Hadoop vendor to offer Hadoop on Azure Virtual Machines, and we look forward to continuing our long history of working with Microsoft to engineer and offer the most flexible and easy-to-use deployment options available for big data, further increasing the power of the Modern Data Architecture.


Passing credentials between Azure Automation runbooks

Update: Turns out there’s an even easier way of achieving this than the method I’ve described in this blog post. Joe Levy explains all in the comments below. 

I’ve been doing a lot of work lately using Azure Automation to run (what are essentially) PowerShell scripts against Azure. Recently those scripts gained the ability to authenticate against your Azure subscription using a username and password (see Authenticating to Azure using Azure Active Directory). It basically involves a call to Get-AutomationPSCredential; here’s a handy screenshot (stolen from the aforementioned blog post) to illustrate:


That’s all fine and dandy, but you may find that you want to modularise your runbooks so that you have lots of smaller discrete code modules rather than one monolithic script. If you do so, you’re probably not going to want to make a call to Get-AutomationPSCredential each time (for a start, such calls make it hard to unit test your runbooks), so you may prefer to pass the credentials between your runbooks instead.

Here is how I do this. In your calling runbook get the credentials, extract the username and password and pass them to the called runbook:

workflow CallingRunbook
{
    # Fetch the stored credential asset, then split it into plain strings
    $AzureCred = Get-AutomationPSCredential -Name "MyCreds"
    $AzureUserName = $AzureCred.GetNetworkCredential().UserName
    $AzurePassword = $AzureCred.GetNetworkCredential().Password
    CalledRunbook -AzureUserName $AzureUserName -AzurePassword $AzurePassword
}

In the called runbook use the passed-in values to authenticate to Azure:

workflow CalledRunbook
{
    param([string]$AzureUserName, [string]$AzurePassword)
    # Rebuild a PSCredential from the passed-in strings
    $SecurePassword = $AzurePassword | ConvertTo-SecureString -AsPlainText -Force
    $AzureCred = New-Object System.Management.Automation.PSCredential `
            -ArgumentList $AzureUserName, $SecurePassword
    Add-AzureAccount -Credential $AzureCred
}

Job done!
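
To see the extract-and-rebuild round trip in isolation, here is a minimal sketch that runs in a plain PowerShell session with no Azure Automation involved. The sample credential (demo-user and its password) is made up for illustration and stands in for whatever Get-AutomationPSCredential would return; the variable names are placeholders too.

```powershell
# Stand-in for the credential that Get-AutomationPSCredential would return.
# 'demo-user' and the password are made-up sample values.
$samplePassword = ConvertTo-SecureString 'Sample-Passw0rd' -AsPlainText -Force
$originalCred = New-Object System.Management.Automation.PSCredential `
        -ArgumentList 'demo-user', $samplePassword

# Caller side: flatten the credential into two plain strings.
$userName = $originalCred.GetNetworkCredential().UserName
$password = $originalCred.GetNetworkCredential().Password

# Called side: rebuild an equivalent PSCredential from those strings.
$securePassword = $password | ConvertTo-SecureString -AsPlainText -Force
$rebuiltCred = New-Object System.Management.Automation.PSCredential `
        -ArgumentList $userName, $securePassword

# Compare the rebuilt credential's user name with the original's.
$rebuiltCred.UserName -eq $originalCred.UserName
```

Bear in mind that with this approach the password travels between the runbooks as a plain string, so anything that records runbook parameters may expose it.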


A quick and dirty blog post but one which folks should find useful if they’re using Azure Automation. Feedback welcome.


Microsoft announces real-time analytics in Hadoop and new ML capabilities in Marketplace

This morning at Strata + Hadoop World, Microsoft announced the preview of Apache Storm clusters inside HDInsight as well as new machine learning capabilities in the Azure Marketplace.

Apache Storm is an open source project in the Hadoop ecosystem which gives users access to an event-processing analytics platform that can reliably process millions of events. Now, users of Hadoop can gain insight into events as they happen, in real time.

As part of Strata, Microsoft partner Hortonworks announced that the next version of their Hadoop distribution, HDP 2.2, will include capabilities to orchestrate data from on-premises to Azure. This will allow customers to back up their on-premises data or elastically scale out using the power of the cloud.

Finally, Microsoft is offering new machine learning capabilities as part of the Azure Marketplace. Customers can now access ML as web services that enable scenarios like anomaly detection, recommendation engines, and fraud detection, along with a set of R packages.

Read more of Microsoft’s Strata announcements on the Official Microsoft Blog

Using SSDs in Azure VMs to store SQL Server TempDB and Buffer Pool Extensions

A common practice on-premises to improve the performance of SQL Server workloads is storing TempDB and/or Buffer Pool Extensions (SQL Server 2014) on SSDs. The former improves the performance of workloads that use temporary objects heavily (e.g. queries handling large recordsets, index rebuilds, row versioning isolation levels, temp tables, and triggers). The latter improves the performance of read workloads whose working set doesn’t fit in memory.

Now you can do the same in Azure VMs using the new D-Series VM Sizes.

Important: The SSD drive is transient

The SSD drive (D:) is not persistent, so its contents and permissions will be lost if the VM moves to a different host. This can happen in case of a host failure or a VM resize operation.

  • Do not store your data or log files there. Use (persistent) drives from Azure Storage.
  • If using TempDB and/or Buffer Pool Extensions, SQL Server needs access to the directory specified to store them. The following section describes how to store SQL Server TempDB and/or Buffer Pool Extensions on the SSD drive and automatically recreate the directory if the VM moves to a different host.


Configuration Steps

1) Create a folder in the D: drive

This is the folder that you will store TempDB and/or Buffer Pool Extensions in, for example “D:\SQLTEMP”.

2) To move TempDB to the SSD

Using SSMS connect to your SQL Server instance. Execute the following T-SQL commands to change location of the TempDB files:




ALTER DATABASE tempdb MODIFY FILE (name = tempdev, filename = 'D:\SQLTEMP\tempdb.mdf')
ALTER DATABASE tempdb MODIFY FILE (name = templog, filename = 'D:\SQLTEMP\templog.ldf')


2) To configure Buffer Pool Extensions on the SSD

Using SSMS connect to your SQL Server instance. Execute the following T-SQL commands to configure the Buffer Pool Extension, specifying the location and size of its file. The general recommendation is to set the size to 4-6 times the size of the VM memory. For more details read the documentation.



ALTER SERVER CONFIGURATION SET BUFFER POOL EXTENSION ON
( FILENAME = 'D:\SQLTEMP\ExtensionFile.BPE' , SIZE = <size> [ KB | MB | GB ] )
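
As a rough aid for picking the SIZE value, the 4-6x guideline above can be turned into a quick calculation. This is only a sketch: the function name Get-BufferPoolExtensionSizeRange is made up for illustration, and the 14 GB figure in the usage line is just a sample memory size, not a recommendation.

```powershell
# Hypothetical helper: applies the 4-6x guideline to a VM's memory size.
function Get-BufferPoolExtensionSizeRange {
    param(
        [Parameter(Mandatory = $true)]
        [ValidateRange(1, 4096)]
        [int]$MemoryGB
    )
    [pscustomobject]@{
        MemoryGB  = $MemoryGB
        MinSizeGB = 4 * $MemoryGB   # lower end of the guideline
        MaxSizeGB = 6 * $MemoryGB   # upper end of the guideline
    }
}

# Sample usage with an assumed 14 GB of VM memory.
$range = Get-BufferPoolExtensionSizeRange -MemoryGB 14
"Buffer Pool Extension size: {0}-{1} GB" -f $range.MinSizeGB, $range.MaxSizeGB
```

Whatever value you pick also has to fit on the SSD drive alongside anything else you store there, such as TempDB.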


3) Set the Start Mode of SQL Server and SQL Agent to Manual

Using Configuration Manager set the Start Mode of SQL Server and SQL Agent as Manual.


4) Create a PowerShell script to recreate the folder in D: if needed and start SQL Server

Copy and paste the following script and save it as a PowerShell file in the C: drive (OS drive), for example as “C:\SQL-startup.ps1”. If needed, modify the D: folder to the one you specified in Step 1.

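$SQLService="SQL Server (MSSQLSERVER)"
$SQLAgentService="SQL Server Agent (MSSQLSERVER)"
$tempfolder="D:\SQLTEMP"
# Recreate the folder on the transient SSD if it was lost
if (!(Test-Path -Path $tempfolder)) {
    New-Item -ItemType Directory -Path $tempfolder
}
# Start SQL Server first, then the Agent that depends on it
Start-Service $SQLService
Start-Service $SQLAgentService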


5) Change the PowerShell Execution Policy to allow the script to run

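From an elevated PowerShell console, execute Set-ExecutionPolicy RemoteSigned. This policy lets locally created scripts run and requires downloaded scripts to be signed.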

6) Create a scheduled task at system startup to execute the PowerShell script

Using Task Scheduler create a Basic Task that executes when the computer starts and executes the script in the PowerShell file. For this, specify:

  • Program/script: powershell
  • Arguments: -File "C:\SQL-startup.ps1"
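
If you would rather script this step than click through Task Scheduler, something like the following sketch should be equivalent on Windows versions that ship the ScheduledTasks module (Windows Server 2012 and later). The function name Register-SqlStartupTask and the task name 'SQL-startup' are placeholders, and running the task as SYSTEM with highest privileges is an assumption; the steps above don't say which account to use.

```powershell
# Hypothetical wrapper around the ScheduledTasks cmdlets.
function Register-SqlStartupTask {
    param(
        [string]$ScriptPath = 'C:\SQL-startup.ps1',
        [string]$TaskName = 'SQL-startup'   # placeholder task name
    )
    # Same program and arguments as entered in the Task Scheduler dialog.
    $action = New-ScheduledTaskAction -Execute 'powershell' `
            -Argument ('-File "{0}"' -f $ScriptPath)
    # Fire each time the computer starts.
    $trigger = New-ScheduledTaskTrigger -AtStartup
    # Assumption: run as SYSTEM so the task can start services.
    Register-ScheduledTask -TaskName $TaskName -Action $action `
            -Trigger $trigger -User 'SYSTEM' -RunLevel Highest
}
```

Defining the function changes nothing on its own; calling Register-SqlStartupTask from an elevated PowerShell console registers the task.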

7) Test the setup

Restart the VM and verify using Configuration Manager that the SQL Server service is started.

This guarantees that if the VM moves to a different host SQL Server will start successfully.


Try the D-Series VM sizes at the Azure Portal today.

Azure Automation Pain Points

In April of this year Microsoft announced a new offering on their Azure platform called Azure Automation (which I briefly talked about at Azure Automation – the beginning of cloud ETL on Azure?). Azure Automation allows one to run PowerShell scripts on Azure.

I have been using Azure Automation for the past couple of months on my current project and, as the offering is still in preview, have been asked by the product team to share with them some experiences of using it, specifically the pain points. As is my habit I’d like to share those experiences here on my blog too and given that (a) anyone can sign up and use Azure Automation and (b) I’m not under NDA I see no reason not to do that. Hence, here is the transcript of an email I sent to a member of the Azure Automation team yesterday:


Thank you for the invite to email you with my pain points regarding AA so far.

Biggest thing has to be the developer experience. I consider myself a developer rather than a sysadmin-type-person so the develop -> deploy -> test -> fix cycle is important to me. As a developer it’s important that everything is source controlled and so I choose to develop my runbooks locally in my PowerShell editor of choice and then upload to Azure Automation using scripts (which are, of course, written in PowerShell). These same deployment scripts will be the ones that we eventually use to deploy to our production environment (when we reach that stage – we’re not there yet) and are themselves source-controlled. Eventually we will get to the point where each time someone checks a change into the master branch we automatically push to the production environment (aka Continuous Deployment).

The problem I have is that Azure Automation is not well suited to this type of quick development cycle. Firstly, because there are activities in my runbooks that don’t work outside of Azure Automation (i.e. Get/Set-AutomationVariable) I can’t run my runbooks locally. Hence, each time I make a change to a runbook (of which I currently have 6) I run my deployment script so that I can test it. Deploying then publishing a runbook takes in the region of 20 seconds; multiply that by 6 and I’ve got a long wait between making a change and getting to the stage where I can test that change.

Worse than that though is the time it takes to run a runbook. I know Azure Automation is currently in preview (and hence can kind of excuse what I’m about to say) but I still find the length of time that a runbook spends in the “starting” status inordinately long. I want to be able to fire off a runbook and have it start executing immediately – currently I have to wait a long time until the runbook actually starts executing (and I’m talking minutes, not seconds).

Put all that together and you’ve got a situation where testing just a single change to a runbook takes minutes. That is unacceptable to be honest – I need it to be seconds.

You might say that I could use the in-browser editing experience but that means I’m not editing the master copy of my runbook that goes into source control, and it doesn’t solve the problem of the massive start up time.

Hope this is useful to you.

Jamie Thomson



I was asked to describe my pain points, which is why that email comes over as rather derogatory; however, I should point out that I have found the overall capabilities of Azure Automation to be quite impressive. It’s a very capable platform for automating the management of Azure resources and I am looking forward to it becoming generally available very soon.