Category Archives: Data Warehouse

PASS Summit 2014 Pre-Con Preview: Davide Mauri

If you’re into Data Warehousing, you may be interested in attending the workshop I’ll deliver at PASS Summit 2014 in Seattle on November 4th.

http://www.sqlpass.org/summit/2014/Sessions/PreConferenceSessions.aspx

The workshop is entirely dedicated to explaining why and how a *successful* Data Warehouse can be conceived, designed, architected, built, loaded and tested using the Agile approach that, so far, has mainly been applied to application development and that in the last year has (finally, I would say) gained traction in the BI field as well. Both Gartner and Forrester underline that Agile is a key success factor in the modern BI world, since it has been verified that 50% of requirements change in the first year of a BI project.

If you want to read more about the workshop, you can read the Q&A just published here:

http://www.sqlpass.org/Community/PASSBlog/tabid/1476/entryid/676/PASS-Summit-2014-Pre-Con-Preview-Davide-Mauri.aspx

In addition to that, I’d also like to share the agenda of the workshop, which will give you even more information on what we’ll discuss on that day:

  • Why a Data Warehouse?
  • The Agile Approach
  • Modeling the Data Warehouse
    • Kimball, Inmon & Data Vault
    • Dimensional Modeling
    • Dimension, Fact, Measures
    • Star & Snowflake Schema
    • Transactional, Snapshot and Temporal Fact Tables
    • Slowly Changing Dimensions
  • Engineering the Solution
    • Building the Data Warehouse
      • Solution Architecture
      • Naming conventions, mandatory columns and other rules
      • Views and Stored Procedure usage
    • Loading the Data Warehouse
      • ETL Patterns
      • Best Practices
    • Automating Extraction and Loading
      • Making the solution automatable
      • BIML
  • Unit Testing Data
  • The Complete Picture
    • Where does Big Data come into play?
  • After the Data Warehouse
    • Optimized Hardware & Software
  • Conclusions

As you can see it will be a fully packed day…so bring two cups of coffee and you’ll be good 🙂

See you in Seattle!

On Agile Data Warehousing

In the last month, I’ve been working on the slide deck for my Agile Data Warehousing workshop. In order to give it additional value that goes beyond the purely technical aspects, and since the “Agile” approach is now becoming more and more mainstream in BI too (and finally!), I did some research to check what one can find on the web on this topic. Many things have happened since the very first time I presented at PASS 2010, where I first mentioned the need to be Agile (or “Adaptive”, as I prefer to say when talking about BI & Agility). In 2011 Gartner, at their BI Summit, stated through the voice of Andy Bitterer that

50% of requirements change in the first year of a BI project

and, as a result, the only possible way to succeed in a BI project is to be able to adapt quickly to the new requirements and requests. The doors to Agile BI were opened.

Agile BI has grown from that point on, to the point that Forrester even started to evaluate Agile Business Intelligence Platforms, naming Microsoft as one of the Leaders:

Microsoft named a Leader in Agile Business Intelligence by Forrester

I must say I’m not 100% on board with the definition of Agile BI that Forrester gives, since it lumps together too many things (Data Visualization, Automation, Self-Service BI, just to name a few), but I understand that they see things from the end-user perspective: the end user simply wants to “do everything, immediately, easily and nicely” with their data. There is also a definition on Wikipedia (page created in January 2013) that is better, more complete and less marketing-oriented:

http://en.wikipedia.org/wiki/Agile_Business_Intelligence

Besides those definitions, terms like Agile BI and Lean BI have become quite common. Of course, with them came the ideas of Agile Project Management and Agile Modeling. This latter subject in particular seems to be very hot, and of course it is something close to my interests. Now, I don’t want to go into a deep discussion of the topic, telling you what is good and what is bad. There is already a lot on the web for or against every possible modeling solution. Data Vault, BEAM, Model-Storming…a simple search on the web and you’ll find thousands of articles. Which one is the best? Should we go for Data Vault? Or for an Inmon-style DWH? Or Kimball? Or something else?

Well…I don’t really care. Or, to be honest, I care just a little bit.

Now, since “Ideas without execution are hallucinations”, and models are ideas after all, it’s my strong opinion that you don’t model to be agile: you “engineer” to be agile. Why? It’s simple: all models are agile…since they are models, and nothing more. It’s not a problem to change a model, since it’s “just” a formal definition of a system (of course, I’m bringing the idea to the extreme here). And, since we’re assuming that business requirements will change, you know in advance that no model exists that will satisfy them all (immediately). Yes, you can try to model the “Universal Data Model”, but it’s going to be *very* complex. So, the main point is to be able to bring in changes quickly, with measurable quality, in a controlled and economic way.

We all know that the one and only model that should be presented to the end user is the Dimensional Model. This is how your Data Mart should look. But how you model your Data Warehouse is completely up to you, and it will change over time, for sure. So how you implement the process to extract, transform and load the data is the key point. That implementation must be agile. What lies behind the scenes, following the information hiding principle, should be considered a simple “technical enabler” that can change at any time. So, if one prefers to use Data Vault, or Inmon, or to just store everything on a few hundred Hadoop servers…I don’t see any problem with that, as long as you have defined an engineered approach, with naming conventions, design pattern automation, quality checks, metadata and everything else needed to make sure that when you have to change something, you can make the smallest change possible, measure its impact, and test the result.
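To make the idea of a “technical enabler” concrete, here’s a minimal T-SQL sketch (all object and column names are hypothetical): the Data Mart exposes a dimensional view with a stable contract, while the internal model behind it – a Data Vault style hub/satellite pair in this example, but it could be anything – is free to change as long as the view keeps returning the same shape.

-- dwh.* is the internal Data Warehouse model (free to change over time);
-- dm.DimCustomer is the stable dimensional contract that cubes and reports consume.
CREATE VIEW dm.DimCustomer
AS
SELECT
    h.CustomerKey,
    s.CustomerName,
    s.Country
FROM dwh.HubCustomer     AS h
JOIN dwh.SatCustomerInfo AS s
    ON  s.CustomerKey = h.CustomerKey
    AND s.IsCurrent   = 1;   -- expose only the current version to end users
GO

If tomorrow the hub/satellite pair is replaced by a single denormalized table, only the view body changes; everything built on top of dm.DimCustomer keeps working.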

I’ve been trying to apply Agile principles to BI since 2003…I’ve been through every possible change you can imagine (even a complete change of the ERP that was the main source of data), and the most important thing I’ve learned is that the only model that works is one that is liquid and able to adapt quickly to changing requirements. I usually start modeling in the easiest way possible, and thus I apply the Dimensional Model, and then I make all the changes needed in order to keep

  • All the data at the highest granularity
  • Optimal performance
  • Historical information (which may not be visible to the end user, but may be needed to correctly transform data)

Which, for a complex DWH, means that at the beginning the DWH and the Data Mart overlap, and that they diverge as the project goes on. In one project we even decided to go for a normalized data model, since the DWH became the source not only for reporting and analysis but also for other, more operational, duties.

Now, in order to be really agile, it’s mandatory to have an engineered approach that makes sure the project doesn’t slide from agility into anarchy, because this is the biggest risk. The line that separates the two is very thin, and crossing it is very easy. When you have a team of people, either they work as one or Agile BI is not for you; otherwise chaos will reign. And to make sure this does not happen, you have to have a streamlined building process, tools and methods (design patterns, frameworks and so on), so that everyone can technically do a good job and the technical quality of the outcome is not simply proportional to the experience of the person doing it.

It’s really important that everyone who wants to approach Agile BI understands the “engineering” part. I have always found it underestimated, and in all the posts I’ve found on the web I have never read anyone stressing the importance of that part. That’s why I felt the urge to write this post, and that’s why I’ll go very deep into this topic during my PASS workshop.

Now, before finishing the post, there is still one thing missing that is vital for the success of an Agile BI solution: testing. Agility cannot exist if you don’t have an automated (or semi-automated) testing framework that assures you and your users that no errors will be introduced into the data as a result of a change made to satisfy new or changed requirements. This is mandatory, and I’m quite disappointed to see that almost no one underlines this point enough. Forrester didn’t even take this point into consideration when evaluating the existing “Agile BI Platforms”. That’s a very big mistake in my opinion, since everyone takes data quality for granted, but it’s actually the most difficult thing to obtain and maintain.

Testing frameworks are quite common in development (even Visual Studio has a testing engine integrated), and they should become common in BI too. Something is starting to appear (https://nbi.codeplex.com/), but I wish the big players (Microsoft above all) would start to take this subject more seriously. How cool and useful would a strong integration of testing in SSIS be? After the DWH/DM/Cube load one could launch all the tests (maybe created right from Excel by a power user, or even generated automatically when certain conditions are met…say, the generation of a year balance) and make sure that the freshly produced data is of good quality.
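Just to give an idea of what such a test could look like at the lowest level, here is a minimal sketch in T-SQL, assuming two hypothetical tables (stg.Sales and dwh.FactSales) loaded by the same ETL run; a real framework like NBi would of course wrap this kind of check in proper test metadata and reporting:

DECLARE @SourceAmount money, @TargetAmount money;

-- Reconciliation check: the fact table must account for everything loaded from staging
SELECT @SourceAmount = ISNULL(SUM(Amount), 0) FROM stg.Sales;
SELECT @TargetAmount = ISNULL(SUM(SalesAmount), 0) FROM dwh.FactSales;

IF @SourceAmount <> @TargetAmount
    THROW 50001, 'Data test failed: FactSales total does not match the staging total.', 1;

Run a battery of checks like this one right after each load and a breaking change shows up immediately, instead of in next month’s reports.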

Just like water. Because data IS water. I won’t drink it if not tested.

Updateable Column Store Indexes in SQL Server 2014

Introduction

Column store indexes were released with SQL Server 2012 to optimize data warehouse workloads that have specific patterns: data is loaded through T-SQL scripts or SSIS packages, possibly several times a day (the exact frequency is not important; what matters is that the available data is loaded in a single execution). At the end of the ETL process, data is read with reporting tools. Usually data is written once and then read many times.

In SQL Server 2012 only the non-clustered column store index was available; like a traditional B-Tree non-clustered index, it was a secondary index. However, it differs from a traditional B-Tree index because it is based on a columnar structure, while the base table remains organized by row (in a structure called a row store, saved in 8 KB data pages).

Column store indexes are part of the Microsoft In-Memory Technologies because they use the xVelocity engine for data compression optimization, and their implementation is based on a columnar structure, like PowerPivot and SSAS Tabular. Data in column store indexes is organized by column: each memory page stores data from a single column, so each column can be accessed independently. This means that the SQL Server Storage Engine is able to fetch only the columns it actually needs. In addition, data is highly compressed, so more data fits in memory and I/O operations can decrease greatly.
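As a quick reference, here is what the two flavors look like in T-SQL (the table name is hypothetical, and the two statements are alternatives: in SQL Server 2014 a clustered column store index replaces the row store and cannot coexist with other indexes on the same table):

-- SQL Server 2012: secondary column store index on a row-store table
-- (creating it makes the table read-only)
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_FactSales
ON dbo.FactSales (DateKey, ProductKey, Quantity, SalesAmount);

-- SQL Server 2014: clustered column store index; it becomes the table itself
-- and is updateable (INSERT, UPDATE, DELETE and MERGE are allowed)
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales
ON dbo.FactSales;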

Column store indexes structure

Before talking about the new features of column store indexes in SQL Server 2014, it is important to introduce three keywords: Segment, Row Group and Compression. In a column store index, a segment contains the values of one column for a particular set of rows called a row group. As you can see in the following picture, each red and gray portion is a segment. When you create a column store index, the rows in the table are divided into groups, and each row group contains about 1 million rows (the exact number is 1,048,576, that is 2^20 rows per row group). The column store transforms the internal index organization from a row organization to a columnar organization: there is one segment for each column in each row group. Column store indexes are part of the Microsoft In-Memory technologies, in which compression plays a very important role, so each segment is compressed and stored in a separate LOB.

This article does not detail the algorithms used to compress data in a column store index. At any rate, keep in mind that each segment stores the same type of data, so a segment contains homogeneous data; in this scenario the compression algorithms are more efficient than those used to compress table rows, because a row usually contains different types of data. In general, data compression can be implemented using different techniques, such as:

  • Dictionary Encoding
  • Run-length Encoding
  • Bit Packing
  • Archival Compression (only in SQL Server 2014)
    • It can reduce disk space up to 27%

The techniques used by SQL Server to compress data are undocumented.
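As a small illustration (table and index names are hypothetical), archival compression is applied by rebuilding the index, and the resulting row groups can be inspected through the sys.column_store_row_groups DMV introduced in SQL Server 2014:

-- Archival compression trades extra CPU for additional disk savings
ALTER INDEX cci_FactSales ON dbo.FactSales
REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);

-- Inspect row groups: state, number of rows and size of each one
SELECT OBJECT_NAME(object_id) AS table_name,
       row_group_id,
       state_description,   -- e.g. OPEN, CLOSED, COMPRESSED
       total_rows,
       size_in_bytes
FROM sys.column_store_row_groups
WHERE object_id = OBJECT_ID('dbo.FactSales');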

The following picture shows an example of row groups and segments in a column store index… continue reading the full article here.

 

Thanks to my friend Viviana Viola for the grammar review.

Tune in tomorrow and accelerate your insights

Tomorrow’s the day! Tune in to hear from Microsoft CEO Satya Nadella, COO Kevin Turner, and Data Platform Group CVP Quentin Clark about Microsoft’s approach to data, and how the latest advancements in technology can help you transform data into action.

Who should watch?

Join us tomorrow morning at 10AM PDT if you like data or want to learn more about it. If you store it, you manage it, you explore it, you slice and dice it, you analyze it, you visualize it, you present it, or if you make decisions based on it. If you’re architecting data solutions or deciding on the best data technology for your business. If you’re a DBA, business analyst, data scientist, or even just a data geek on the side, join the live stream.

What will I hear about?

Data infrastructure. Data tools. And ultimately, the power of data. From finding the connections that could cure cancer, to predicting the success of advertising campaigns, data can do incredible things. Join us online and get inspired. You’ll see how your peers are putting their data, big and small, to work.

From a product perspective, we’ll celebrate the latest advancements in SQL Server 2014, Power BI for Office 365, SQL Server Parallel Data Warehouse, and Microsoft Azure HDInsight. And ultimately, we’ll explore how these offerings can help you organize, analyze, and make sense of your data – no matter the size, type, or location.

Where do I sign up?

Mark your calendar now or RSVP on Facebook so you’re ready to go tomorrow. When streaming goes live, you can join us here for all the action live from San Francisco.

When do things get started?

Tomorrow, April 15, at 10AM PDT. Be there.

See you tomorrow!

PASS DW/BI Virtual Chapter Upcoming Sessions (December 2013)

Let me point out the upcoming live events scheduled for December 2013, organized by the PASS Business Intelligence Virtual Chapter.

 

Create and Load a Staging Environment from Scratch in an Hour with Biml

Date: Thursday 12 December Noon PST / 3 PM EST / 8 PM GMT
Speaker: Scott Currie
URL: https://attendee.gotowebinar.com/register/7424713205660411905

Business Intelligence Markup Language (Biml) automates your BI patterns and eliminates the manual repetition that consumes most of your SSIS development time. During this hour-long presentation, Scott Currie from Varigence will use the free BIDSHelper add-in for BIDS and SSDT to introduce Biml and use it to automatically generate large quantities of custom SSIS packages. The session will be largely demonstration driven, and reusable sample code will be distributed for you to use in your own projects. Using a live-typing approach, Scott will start from scratch and, by the end of the session, create a full-blown staging environment. This will include the creation of *hundreds* of target table creation scripts, data load packages, data scrubbing rules, logging, and more. The best part is that you can freely reuse the code in your own environment just by changing the connection strings – or make small changes to implement your own data load patterns.

 

Inferred Dimension Members within MDS and SSIS

Date: Monday 16 December 3 PM PST / 6 PM EST / 11 PM GMT
Speaker: Reza Rad
URL: https://attendee.gotowebinar.com/register/7123625140094491905

Combining Master Data Services with data warehouses causes some challenges in ETL scenarios. In this session we will go through a demo of an Inferred Dimension Members implementation with SSIS, considering the fact that MDS keeps the single version of the truth for the dimension record. You will learn how to write new records back into the MDS entity as inferred members, using the Master Data Services staging structure and batch processing. Then you will learn the best practice for adding the inferred record into the Data Warehouse dimension. Updates to existing dimension members also take the inferred flag into account, applying SCD types only when the member is not an inferred member.
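For readers new to the pattern, a minimal T-SQL sketch of the inferred-member idea on the data warehouse side could look like this (object, column and variable names are hypothetical; in the session the write-back into MDS itself goes through the MDS staging tables and SSIS):

DECLARE @CustomerBK   nvarchar(20)  = N'CUST-042',
        @CustomerName nvarchar(100) = N'Contoso Ltd.',
        @Country      nvarchar(50)  = N'Italy';

-- The fact load meets a business key not yet present in the dimension:
-- insert a placeholder row and flag it as inferred.
INSERT INTO dwh.DimCustomer (CustomerBK, CustomerName, IsInferredMember)
VALUES (@CustomerBK, N'Unknown', 1);

-- Later, when the real attributes arrive from MDS, overwrite the placeholder
-- in place (SCD1-style), regardless of the SCD type configured for the column.
UPDATE dwh.DimCustomer
SET CustomerName     = @CustomerName,
    Country          = @Country,
    IsInferredMember = 0
WHERE CustomerBK = @CustomerBK
  AND IsInferredMember = 1;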

 

Guerrilla MDS/MDM The Road To Data Governance

Date: Thursday 19 December Noon PST / 3 PM EST / 8 PM GMT
Speakers: Ira Whiteside and Victoria Stasiewicz
URL: https://attendee.gotowebinar.com/register/336644427020709122

Ira and Vic’s session “Guerrilla MDS” will be a walk-through of a real-world implementation of a master data model (MDM) and metadata mart utilizing SSIS, MDS and the Power BI Excel add-ins, as well as applying proper data quality techniques. We will walk through in detail the processes necessary to use the complete MDS functionality: creating entities and attributes, relating entities through domain-based attributes, staging leaf tables, updating entity content, applying business rules, creating subscription views and setting up security. Source code for all samples and the PowerPoint will be made available.

The Modern Data Warehouse

In recent surveys by TDWI Research, roughly half of respondents report that they will replace their primary data warehouse (DW) platform and/or analytic tools within three years. Ripping out and replacing a DW or analytics platform is expensive for IT budgets and intrusive for business users. This raises the question: What circumstances would lead so many people down such a dramatic path?

It’s because many organizations need a more modern DW platform to address a number of new and future business and technology requirements. Most of the new requirements relate to big data and advanced analytics, so the data warehouse of the future must support these in multiple ways. Hence, a leading goal of the modern data warehouse is to enable more and bigger data management solutions and analytic applications, which in turn helps the organization automate more business processes, operate closer to real time, and through analytics learn valuable new facts about business operations, customers, products, and so on.

For organizations that need a modern data warehouse that satisfies new and future requirements, TDWI offers a checklist of our top six recommendations. These can guide your selection of vendor products and your solution design.

Users facing new and future requirements for big data, analytics, and real-time operation need to start planning today for the data warehouse of the future. To help them prepare, this TDWI Checklist Report drills into each of the six recommendations, listing and discussing many of the new vendor product types, functionality, and user best practices that will be common in the near future, plus the business case and technology strengths of each.

This checklist was sponsored by Microsoft.

To read the full report, sign up and download the report here

*This post excerpted from the TDWI Checklist Report, November 2013

PASS Business Intelligence Virtual Chapter Upcoming Sessions (November 2013)

Let me point out the upcoming live events, dedicated to Business Intelligence with SQL Server, that the PASS Business Intelligence Virtual Chapter has scheduled for November 2013.

 

The “Accidental Business Intelligence Project Manager”

Date: Thursday 7th November – 8:00 PM GMT / 3:00 PM EST / Noon PST
Speaker: Jen Stirrup
URL: https://attendee.gotowebinar.com/register/5018337449405969666

  • You’ve watched The Apprentice with Donald Trump and Lord Alan Sugar. You know that the Project Manager is usually the one who gets fired
  • You’ve heard that Business Intelligence projects are prone to failure
  • You know that a quick Bing search for “why do Business Intelligence projects fail?” produces a search result of 25 million hits!
  • Despite all this… you’re now a Business Intelligence Project Manager – now what do you do?

In this session, Jen will provide a “sparks from the anvil” series of steps and working practices in Business Intelligence Project Management. What about waterfall vs agile? What is a Gantt chart anyway? Is Microsoft Project your friend or a problematic aspect of being a BI PM? Jen will give you some ideas and insights that will help you set your BI project right: assess priorities, avoid conflict, empower the BI team and generally deliver the Business Intelligence project successfully!

 

Dimensional Modelling Design Patterns: Beyond Basics

Date: Tuesday 12th November – Noon AEDT / 1:00 AM GMT / Monday 11th November 5:00 PM PST
Speaker: Jason Horner, Josh Fennessy and friends
URL: https://attendee.gotowebinar.com/register/852881628115426561

This session will provide a deeper dive into the art of dimensional modeling. We will look at the different types of fact tables and dimension tables, and how and when to use them. We will also cover some approaches to creating rich hierarchies that make reporting a snap. This session promises to be very interactive and engaging, so bring your toughest Dimensional Modeling quandaries.

 

Data Vault Data Warehouse Architecture

Date: Tuesday 19th November – 4:00 PM PST / 7 PM EST / Wednesday 20th November 11:00 PM AEDT
Speaker: Jeff Renz and Leslie Weed
URL: https://attendee.gotowebinar.com/register/1571569707028142849

Data vault is a compelling architecture for an enterprise data warehouse using SQL Server 2012. A well-designed data vault data warehouse facilitates fast, efficient and maintainable data integration across business systems. In this session Leslie and I will review the basics of enterprise data warehouse design, introduce you to the data vault architecture, and discuss how you can leverage new features of SQL Server 2012 to help your data warehouse solution provide maximum value to your users.

SQL Rally Pre-Con: Data Warehouse Modeling – Making the Right Choices

As you may have already learned from my old post, or from Adam’s or Kalen’s posts, there will be two SQL Rally events in Northern Europe.

In the Stockholm SQL Rally, with my friend Thomas Kejser, I’ll be delivering a pre-con on Data Warehouse Modeling:

Data warehouses play a central role in any BI solution. It’s the back end upon which everything in years to come will be created. For this reason, it must be rock solid and yet flexible at the same time. To develop such a data warehouse, you must have a clear idea of its architecture, a thorough understanding of the concepts of Measures and Dimensions, and a proven engineered way to build it so that quality and stability can go hand-in-hand with cost reduction and scalability. In this workshop, Thomas Kejser and Davide Mauri will share all the information they learned since they started working with data warehouses, giving you the guidance and tips you need to start your BI project in the best way possible―avoiding errors, making implementation effective and efficient, paving the way for a winning Agile approach, and helping you define how your team should work so that your BI solution will stand the test of time.

You’ll learn:

  • Data warehouse architecture and justification
  • Agile methodology
  • Dimensional modeling, including Kimball vs. Inmon, SCD1/SCD2/SCD3, Junk and Degenerate Dimensions, and Huge Dimensions
  • Best practices, naming conventions, and lessons learned
  • Loading the data warehouse, including loading Dimensions, loading Facts (Full Load, Incremental Load, Partitioned Load)
  • Data warehouses and Big Data (Hadoop)
  • Unit testing
  • Tracking historical changes and managing large sizes

With all the Self-Service BI hype, the Data Warehouse is becoming more and more central every day: if everyone is able to analyze data using self-service tools, it’s better for them to rely on correct, uniform and coherent data. Already 50 people have registered for the workshop and seats are limited, so don’t miss this opportunity to attend a workshop that is really a unique combination of years and years of experience!

http://www.sqlpass.org/sqlrally/2013/nordic/Agenda/PreconferenceSeminars.aspx

See you there!

Conferences in the next months

The next months look full of interesting conferences, and I’ll be speaking at several of them:

SQL DAYS
17 – 18 September 2013, Germany
http://www.sqldays.net/

SQL Saturday 260 Milano
8 October 2013, Italy
http://www.sqlsaturday.com/260/eventhome.aspx

SQL PASS Nordic
4 – 6 November 2013, Sweden
http://www.sqlpass.org/sqlrally/2013/nordic/

SQL PASS Amsterdam
6 – 8 November 2013, The Netherlands
http://www.sqlpass.org/sqlrally/2013/amsterdam/

SQL Saturday 257 Verona
9 November 2013, Italy
http://www.sqlsaturday.com/257/eventhome.aspx

Advanced Data Warehouse Workshop
20 November 2013, Stockholm

SQL Saturday 264 Ancona
13 December 2013, Italy
http://www.sqlsaturday.com/264/eventhome.aspx

I’ll deliver sessions on Data Quality & Master Data Management, Techniques for Agile Business Intelligence, and Automating ETL using Metadata (and BIML). At SQL PASS Nordic I’ll also deliver, again with my friend Thomas Kejser, a very nice workshop:

Data Warehouse Modeling – Making the Right Choices
http://www.sqlpass.org/sqlrally/2013/nordic/Agenda/PreconferenceSeminars.aspx#ThomasKejserDavideMauri

It’s very likely that I’ll be missing the PASS Summit this year, but at some point I also have to do some work here and now…:)

If you’re going to be at one of those conferences, please come and say “Hi”: it’s always great to meet people from all around the world!