Category Archives: SQL Server Blogs

Parallel Performance in London: SQLbits XIV, Superheroes Edition!

I’m happy to announce that I’ll be returning this year to the UK’s premier SQL Server conference, SQLbits. This year’s show is in London and has a Superheroes theme; it should be a great time. On Wednesday March 4th, I’ll be delivering “Better Performance Through Parallelism” as a full-day preconference seminar. This seminar teaches you how to properly leverage SQL Server’s parallel processing capabilities to greatly improve the performance of your biggest queries. It includes a large amount of…(read more)

Relational Data Lake

What is a Data Lake?
Pentaho CTO James Dixon is credited with coining the term “Data Lake”. As he describes it in his blog entry, “If you think of a Data Mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

These days, demands for BI data stores are changing. BI data consumers not only require cleansed and nicely modeled data, updated on a daily basis, but also raw, uncleansed, and unmodeled data that is available in near real-time. With new and much more powerful tooling like Power BI, users can shape and cleanse data in a way that fits their personal needs without the help of the IT department. This calls for a different approach when it comes to offering data to these users.

BI data consumers also demand a very short time-to-market for new data: they don’t want to wait a few months until data is made available by a BI team, they want it today. The raw, uncleansed form of data in a Data Lake can be loaded very quickly because it is suitable for generated data loading technologies and replication, which makes this short time-to-market possible. Once users have discovered the data and have acquired enough insights that they want to share with the entire organization in a conformed way, the data can be brought to traditional Data Warehouses and cubes in a predictable manner.

Furthermore, there is a rise in unstructured and semi-structured data, along with the need to have “big data” available for ad hoc analyses. To store and analyze these forms of data, new technologies and data structures are required.

With a Data Lake in place, a lot of data streams from sources into the “lake” without knowing up front whether it will ever be needed to answer business questions. The data can’t be modeled yet, because it’s not clear how it will be used later on. Data consumers get the possibility to discover data and find answers to questions before those questions are even defined. This differs fundamentally from the concept of a Data Warehouse, in which the data is delivered through predefined data structures, based on relevant business cases and questions.

Technology
From a technology point of view, a Data Lake is a repository that offers storage for large quantities and varieties of unstructured, semi-structured, and structured data derived from all possible sources. It can be formed by multiple underlying databases which store these differently structured forms of data in both SQL and NoSQL technologies.

For the semi-structured/unstructured side of data which is used for big data analytics, Data Lakes based on Hadoop and other NoSQL technologies are common. For the semi-structured/structured data, SQL technologies are the way to go.

In this blog post I will describe the semi-structured/structured, relational appearance of the Data Lake in the form of a SQL Server database: The Relational Data Lake.

Extract Load (Transform)
Data in a Data Lake is kept in raw form. Transformations are not performed during loading, and relationships and constraints between tables are not created; this matches the default behavior of transactional replication and keeps the loading process as lean and fast as possible. Because of the lack of transformations, movement of the data follows the Extract-Load-(Transform) (EL(T)) pattern instead of the traditional E-T-L. This pattern makes loading data into the Data Lake easier, faster, and much more suitable for replication technologies or generated SSIS processes, for example with BIML. This creates a very attractive time-to-market for data that is added to the Data Lake. Latency should be as low as possible; preferably data is loaded in near real-time and streams into the lake continuously.
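
As a minimal sketch of this EL(T) idea (the raw schema, the linked server, and the table and column names below are hypothetical, not taken from the original solution), a landing table mirrors the source one-to-one, carries no constraints or foreign keys, and is filled with a straight copy of the source data:

    -- Hypothetical raw landing table: columns mirror the source one-to-one,
    -- no primary key, no foreign keys, no check constraints.
    CREATE TABLE raw.SalesOrderHeader
    (
        SalesOrderID int          NOT NULL,
        OrderDate    datetime2(3) NULL,
        CustomerID   int          NULL,
        TotalDue     money        NULL,
        LoadedAtUtc  datetime2(3) NOT NULL DEFAULT (SYSUTCDATETIME())
    );
    GO

    -- Straight copy from the source (here a hypothetical linked server);
    -- no transformations are applied during the load.
    INSERT INTO raw.SalesOrderHeader (SalesOrderID, OrderDate, CustomerID, TotalDue)
    SELECT soh.SalesOrderID, soh.OrderDate, soh.CustomerID, soh.TotalDue
    FROM   SourceServer.Sales.dbo.SalesOrderHeader AS soh;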

Transformations take place after the data is loaded into the Data Lake, where applicable. Cosmetic transformations, such as translating technical object and column names into meaningful descriptions that end users understand, or other lightweight transformations, can be performed in new structures (like SQL views) that are created inside the Data Lake.
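
To illustrate such a cosmetic transformation (continuing the hypothetical table from the sketch above; the friendly names are invented for the example), a view can simply translate technical column names into descriptions end users understand:

    -- Hypothetical view that only renames technical columns to user-friendly ones;
    -- the data itself is not transformed.
    CREATE VIEW lake.SalesOrders
    AS
    SELECT
        soh.SalesOrderID AS [Sales Order Number],
        soh.OrderDate    AS [Order Date],
        soh.CustomerID   AS [Customer Id],
        soh.TotalDue     AS [Total Order Value]
    FROM raw.SalesOrderHeader AS soh;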

Unlike Data Marts and Data Warehouses, which are optimized for data analysis by storing only the required attributes and sometimes dropping data below the required level of aggregation, a Data Lake always retains all attributes and (if possible) all records. This way it is future-proof for solutions that will require this data at a later moment in time and for users who will discover the data.

Accessing data
Data is made accessible through structures which can either be accessed directly or indirectly by exposing them as OData feeds. These structures are secured and are the only objects end users or other processes have access to. The feeds can be accessed with whatever tool or technology is best suited to the task at any moment in time, for example Power BI tooling such as Excel Power Pivot/Power Query.

We normally create SQL views in which security rules and the required transformations are applied.
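
A minimal sketch of what such a secured view could look like (the security mapping table, the role, and all names are hypothetical assumptions, not part of the original solution): end users get SELECT permission on the view only, and a row filter applies the security rule.

    -- Hypothetical secured view: a row filter limits each login to the customers
    -- it is allowed to see; raw tables are never exposed directly.
    CREATE VIEW lake.SalesOrdersSecured
    AS
    SELECT so.[Sales Order Number], so.[Order Date], so.[Customer Id], so.[Total Order Value]
    FROM   lake.SalesOrders AS so
    JOIN   sec.UserCustomerAccess AS uca
           ON  uca.CustomerID = so.[Customer Id]
           AND uca.LoginName  = SUSER_SNAME();   -- filter on the calling login
    GO

    -- Only the view is granted to end users (or to the account exposing the OData feed).
    CREATE ROLE lake_readers;
    GRANT SELECT ON lake.SalesOrdersSecured TO lake_readers;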

The Data Lake also acts as a hub for other repositories and solutions like Data Warehouses and Operational Cubes.

Master Data
Success of the Data Lake depends on good master data. When end users discover new raw data from the Data Lake, they need to be able to combine it with high-quality master data to get proper insights. Therefore a master data hub is a must-have when a Data Lake is created. This hub can simply be a database with master data structures in it; master data management on this data is preferable but not required. The master data hub should be a standalone solution, independent from the other BI solutions, as master data isn’t part of these solutions but is only used as a data source. It should be sourced independently too, preferably using master data tooling or tools like SSIS. Just like data from the Data Lake, master data should also only be accessed through structures which can also be exposed as OData feeds.

Besides being combined with data from the Data Lake, the master data can also be used as a source for other BI solutions such as Data Warehouses, where the master data structures are often used as Data Warehouse dimensions. To prevent unnecessarily loading master data into the Data Warehouse when it already exists in the master data hub, it can be a good choice to leave the master data out of the Data Warehouse dimensions. Only the business keys are stored, and they can be used to retrieve the data from the master data hub when required. This way the Data Warehouse remains slim and fast to load, and master data is stored in a single centralized data store.
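
A minimal sketch of this pattern, using hypothetical names: the dimension stores only the surrogate key and the business key, and descriptive attributes are joined in from the master data hub when a report needs them.

    -- Hypothetical slim dimension: keys only, no descriptive master data attributes.
    CREATE TABLE dw.DimCustomer
    (
        CustomerKey         int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        CustomerBusinessKey int               NOT NULL  -- business key pointing to the master data hub
    );
    GO

    -- Descriptive attributes are retrieved from the master data hub when required.
    SELECT
        f.SalesAmount,
        c.CustomerName,
        c.CustomerSegment
    FROM dw.FactSales   AS f
    JOIN dw.DimCustomer AS d ON d.CustomerKey = f.CustomerKey
    JOIN mdh.Customer   AS c ON c.CustomerID  = d.CustomerBusinessKey;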

Architecture
The entire Data Lake architecture, with all the described components, fits in the model below. From bottom to top, the highlights are:

  • Extract/Load data from the sources to the Data Lake, preferably in near real-time.
  • The Data Lake can consist of multiple SQL (and NoSQL) databases.
  • Transformations and authorizations are handled in views.
  • The Data Lake acts as hub for other BI solutions like Data Warehouses and Cubes.
  • The master data hub is in the center of the model and in the center of the entire architecture. It’s loaded as a standalone solution and isn’t part of any of the other BI solutions.
  • Traditional BI will continue to exist and continue to be just as important as it has always been. It will be sourced from the Data Warehouses and cubes (and master data hub).
  • The Discovery Platform with its new Power BI tooling is the place where “various users of the lake can come to examine, dive in, or take samples.” These samples can be combined with the data from the master data hub.

[Figure: Data Lake BI architecture]

Data Lake Challenges
Setting up a Data Lake comes with many challenges, especially around data governance. For example, it’s easy to create yet another view in the Data Lake and lose control over who gets access to which data. From a business perspective, it can be very difficult to deliver the master data structures that are so important for the success of the Data Lake. And from a user perspective, users who draw insights from the raw data can reach wrong conclusions; therefore the Data Warehouse should still be offered as a clean, trusted data structure for decision makers and as a data source for conformed reports and dashboards.

Summary
The Data Lake can be a very valuable data store that complements the traditional Data Warehouses and cubes, which will stay as important as they are now for many years to come. But considering the increasing amount and variety of data, the ever more powerful self-service ETL and data modeling tooling that keeps appearing, and the ever shorter required time-to-market of near real-time data from source to user, the Data Lake offers a future-proof data store and hub that enables answering questions that have not yet been defined and gives users personal data discovery and shaping possibilities.

Thanks go to my Macaw colleague Martijn Muilwijk for brainstorming on this subject and reviewing this blog post.

In The Cloud: Enhancements to Azure SQL Database

Last week Microsoft released a preview of the next version of Azure SQL Database. The feature set is nearly complete when compared to the standalone version of SQL Server 2014, and a clustered index on every table is no longer required. To help highlight the differences, I’ve used the SQL Database Migration Wizard available on CodePlex in a side-by-side comparison. Figure 1. SQL Database Migration Wizard start page. AdventureWorks2014 is used in the compatibility…(read more)

Rename table, columns, and measures in #ssas #tabular without breaking existing #dax

A feature that many people require in SSAS Tabular is the ability to refactor existing names, with a corresponding rename in all existing objects in the model. I agree that this is an important feature that should be added to the development environment, but it would only help the development cycle. Once you release a Tabular model, the names you published become part of queries created by the users. For example, if you save a pivot table, the objects selected (table, column, and measure names) are all part of the MDX code that is generated automatically by Excel. If you rename something… at the next refresh, Excel will remove the renamed objects from the pivot table. Put less politically correctly, any renaming operation potentially breaks existing reports.

Some years ago I heard from a student in a course that they were using translations in SSAS Multidimensional to avoid this issue. They were developing in English, but since the users were working in other languages (I was in Northern Europe), they had a decoupling layer between internal model names (in English) and external ones. Any rename operation was completely painless this way.

A few days ago I remembered that, and I wondered whether it was usable for Tabular… and I discovered that you can also use a translation for the same primary language of your model! This is really interesting and deserves to be investigated further. Please read my article Frictionless Renaming in Tabular Models with Analysis Services and give me your feedback if you have time to test this approach. I am curious to see possible issues with this technique. Thanks!

PASS SQL Saturday #356 Slovenia Recapitulation

So the event is over. I think I can say for all three organizers, Mladen Prajdić, Matija Lah, and me, that we are tired now. However, we are extremely satisfied. It was a great event. First, a few numbers and a comparison with SQL Saturday #274, the first SQL Saturday Slovenia event, which took place last year.

                     SQL Saturday #274    SQL Saturday #356
People                      135                  220
Show rate                  ~87%                 ~95%
Proposed sessions            40                   82
Selected sessions            15                   24
Selected speakers            14                   23
Countries                    12                   16

The numbers nearly doubled. We are especially proud of the show rate; at 95%, it is much better than the average for a free event, and probably the highest so far for a SQL Saturday. We asked registered attendees to be fair and to unregister if they knew they could not attend the event, in order to make room for those on the waiting list. An old Slovenian proverb says “A nice word finds a nice place”, and it works: 36 registered attendees unregistered. Therefore, we have to thank both the attendees of the event and those who unregistered.

Of course, as always, we also need to thank all of the speakers, sponsors, and volunteers. All volunteers were very helpful; however, I would like to especially point out Saša Mašič. Her work goes well beyond simple volunteering. I must also mention the FRI, the Faculty of Computer and Information Science, where the event was hosted for free. It is also worth mentioning that we are lucky to live in Ljubljana, such a beautiful city with extremely nice inhabitants who enjoy good food, hanging around and mingling, and long parties. Because of that, we could be sure in advance that speakers and attendees from other countries would also enjoy spending time here outside the event, that they would feel safe, and that they would get help whenever they needed it.

From the organizational perspective, we tried to do our best, and we hope that everything was OK for speakers, sponsors, volunteers, and attendees. Thank you all!

Should there be code differences between Azure SQL Database editions?

I spend a lot of time working with software houses, helping them to make their applications work well with SQL Server. One thing that I’ve heard loud and clear over the years is that most software houses won’t write a single line of code that will only run on the enterprise edition of SQL Server, because they are not prepared to limit their potential pool of customers to those running enterprise edition.

This is completely at odds with the discussions that I’ve had with the SQL Server marketing team members who think that having feature differences will cause people to purchase enterprise edition instead. I’m sure that’s true for customers who write their own applications in-house and is also why at promotional events, the customers that you see mentioned are often those types of customers. However, most SQL Server customers run 3rd party applications written by other companies. The customers will often ask the software houses what software is required to run their applications and they then purchase what they need, unless they have some other pre-existing form of relationship with Microsoft.

So this means that having a difference in features can actually cost Microsoft money, as customers will often purchase standard edition because that’s all that the software they will be running requires.

Worse, when software houses are comparing SQL Server to other database engines, they compare SQL Server standard edition to the other engines, not the enterprise edition. This makes SQL Server compare badly for marketing reasons instead of technical reasons. For example, I saw a software house the other day comparing SQL Server with PostgreSQL. Their contention was that PostgreSQL (a free database engine) had a good high availability story and that SQL Server did not. Their logic was that SQL Server only had mirroring (and log shipping) and Microsoft had announced the deprecation of mirroring. So their contention was that SQL Server did not have a good availability story. The fact that enterprise edition had a really good story was irrelevant as they don’t consider anything in that version.

A further issue appears with coding. There is no developer edition of SQL Server that is limited to standard edition features. Software houses want to write code once and have it work across all target editions.

Another core issue is that this focus on enterprise edition has removed the upgrade reasons for standard edition customers. I think that every edition should have a compelling upgrade story, for every version. As an example, in SQL Server 2014, the reasons to upgrade for standard edition customers are the ability to use 128GB of memory and to have backup encryption. I’ll leave it to the reader to decide if that’s a strong story. I don’t think it is.

The final issue with the existing situation is that the product is moving into areas that need support from software houses. SQL Server 2014 introduced a range of in-memory options. For any customer that can’t change the code (i.e., most customers), this is irrelevant. Again you’ll see the same large customers who write their own apps being mentioned in the launch events. In this case, I think the marketing team really has made a mistake. While new HA features, etc. can be retrofitted by a DBA to an existing database, the new in-memory options really need to be architected into the design of the applications. And that’s where it’s a real problem that they are in enterprise edition only. The software houses are unlikely to use them, and yet they are exactly the people that we need to embrace them.

So what does this have to do with Azure?

Bob Beauchemin wrote a great blog post today about how Azure SQL Database is moving to a SQL Server 2014 code base. That’s a great thing, but one aspect that caught my eye was the mention that this is the first version of Azure SQL Database where features like columnstore indexes will only appear in the premium editions.

While I’ve had concerns about how the licensing has been handled in the on-premises versions of SQL Server, in Azure SQL Database this concerns me even more. I really think that Azure SQL Database should offer the same code surface no matter which edition you are using. It makes sense to have performance and availability (including HA) options differ between Basic, Standard, and Premium but I really don’t like the idea of coding/feature differences. First up, it will again see software houses ignoring useful features. But worse, in the Azure SQL Database arena, customers are much more likely to use a mix of database editions than they currently do on-premises.

For example, if I am offering an application as a service, I want to be able to have different databases for different tenant customers. I really want to be able to choose the performance, reliability, availability options, etc. on a tenant by tenant basis, not across all tenants that are using my application. Having coding differences across the editions would make this a mess, or at least I think so.

I’d love to hear your thoughts.

Office 365 coming to Australian Data Centres

We’ve been so excited having local data centres for Microsoft Azure.

The one disappointment has been that Office 365/Power BI has still been based out of Singapore data centres. That has an effect on people that worry about data sovereignty and it also means that we’ve had higher latency on the connections.

But no more! I was really pleased to read an article in the newspaper yesterday that mentioned that Microsoft is moving Office 365 to the Australian data centres. The article claims this is happening in April next year. I can only hope it’s true, as this will be a really good outcome. Thanks, Microsoft!

Here’s the article: http://www.theaustralian.com.au/business/latest/microsoft-to-deliver-office-365-from-australian-data-centres/story-e6frg90f-1227149621552?nk=41c535e280e2d2c90d15e82eefcb763c

Parallel Foreach loops, one reason to use Powershell Workflow instead of SSIS

Lately I’ve been using Azure Automation, which is a service on Azure for running PowerShell Workflow scripts, and as such I’ve been diving deep into PowerShell Workflow. What’s PowerShell Workflow? I’m glad you asked:

A workflow is a sequence of programmed, connected steps that perform long-running tasks or require the coordination of multiple steps across multiple devices or managed nodes. Windows PowerShell Workflow lets IT pros and developers author sequences of multi-device management activities, or single tasks within a workflow, as workflows. By design, workflows can be long-running, repeatable, frequent, parallelizable, interruptible, stoppable, and restartable. They can be suspended and resumed; they can also continue after an unexpected interruption, such as a network outage or computer restart.
Getting Started with Windows PowerShell Workflow

So PowerShell Workflow does exactly what it says on the tin: it’s a workflow engine. Those of you that, like me, have been knee-deep in SQL Server Integration Services (SSIS) for most of the past decade will know that SSIS too has its own workflow engine; it’s got a different name, Control Flow, but it’s still a workflow engine. One frequent ask of SSIS’s Control Flow is a Parallel ForEach Loop, but there seems little hope that we’re going to get one (raised on Connect in March 2014, closed as Won’t Fix 5 months later) without going to 3rd-party solutions like CozyRoc’s Parallel Loop Task.

As the title of this post has already told you, PowerShell Workflow has a big advantage over SSIS: it includes a Parallel Foreach Loop out of the box. You can read all about it at about_Foreach-Parallel, or type “help about_Foreach-Parallel” into your nearest PowerShell window. From there:

The Parallel parameter of the ForEach keyword runs the commands in a ForEach script block once for each item in a specified collection.
The items in the collection, such as a disk in a collection of disks, are processed in parallel. The commands in the script block run sequentially on each item in the collection.

That’s fairly self-explanatory. If you’re interested to know more, I’ve put together a cool little PowerShell Workflow demo that makes calls to a URI, first serially (using a conventional Foreach Loop) then parallelly (which isn’t, I’m pretty sure, a real word but I’m using it anyway) and hosted it as a GitHub gist: Parallel-vs-Serial.ps1. You can simply copy-paste the script into PowerShell ISE, hit F5 and you’re away. On my machine the serial calls completed in 19 seconds, the parallel calls in 13 seconds. No-brainer!

Take a look if you have 5 minutes. This is good stuff.

@Jamiet 
