Category Archives: Training

PASS SQL Saturday #356 Slovenia Recapitulation

So the event is over. I think I can speak for all three organizers, Mladen Prajdić, Matija Lah, and me, when I say that we are tired now. However, we are also extremely satisfied. It was a great event. First, a few numbers, compared with SQL Saturday #274, the first SQL Saturday Slovenia event, which took place last year.

                      SQL Saturday #274    SQL Saturday #356
People                135                  220
Show rate             ~87%                 ~95%
Proposed sessions     40                   82
Selected sessions     15                   24
Selected speakers     14                   23
Countries             12                   16

The numbers nearly doubled. We are especially proud of the show rate; at 95%, it is much better than the average for a free event, and probably the highest so far for any SQL Saturday. We asked registered attendees to be fair and to unregister if they knew they could not attend, in order to make room for those on the waiting list. An old Slovenian proverb says “A nice word finds a nice place”, and it works: 36 registered attendees unregistered. Therefore, we have to thank both the attendees of the event and those who unregistered.

Of course, as always, we also need to thank all of the speakers, sponsors, and volunteers. All volunteers were very helpful; however, I would like to especially point out Saša Mašič, whose work went well beyond simple volunteering. I must also mention FRI, the Faculty of Computer and Information Science, which hosted the event for free. It is also worth mentioning that we are lucky to live in Ljubljana, such a beautiful city with extremely nice inhabitants who enjoy good food, hanging around and mingling, and long parties. Because of that, we could be sure in advance that speakers and attendees from other countries would enjoy spending time here outside the event as well, that they would feel safe, and that they would get help whenever they needed it.

From the organizational perspective, we tried to do our best, and we hope that everything was OK for speakers, sponsors, volunteers, and attendees. Thank you all!

Columnstore Indexes in SQL Server 2014: Flipping the DW /faster Bit at #SQLSaturday347 / #SQLSatDC

This weekend, on December 6, 2014, I continue my evangelization of Columnstore—possibly the most exciting unheralded feature in SQL Server 2014—at SQL Saturday 347, at an auspicious venue: the Microsoft Technology Center in Chevy Chase, proximal to the US capital city of Washington DC.

Register, see the schedule, or see the event home page on the SQL Saturday site.  I’ll look forward to seeing you here:

Microsoft MTC
5404 Wisconsin Ave
Chevy Chase, MD 20815

Join me & an All-Star cast of speakers including MVPs such as Reeves Smith, Joey D’Antoni, Allen White, Jessica Moss, Grant Fritchey, Steve Jones, Randy Knight, Jason Brimhall, & Wayne Sheffield.  Other luminaries include shiny new Microsoft PFE Theresa Iserman & Consultant Ayman El-Ghazali.


Execution plans, laundry, and a giveaway

In just a week from now, SQL Saturday #337 will kick off in Portland, Oregon. And I will be there – already the third time for me to speak in Portland. For a European, Portland is not the most logical location. But the organizers over there are a pretty smart bunch of people. They figure that being close to Seattle gives them a great opportunity – so whenever they get the chance, they reserve a Saturday just before the PASS Summit in Seattle for their own event. And then they make sure…(read more)

PASS SQL Saturday #356 Slovenia Pre-Conference Seminars

I am proud and glad to announce two top pre-conference seminars at the PASS SQL Saturday #356 Slovenia conference. The speakers and seminar titles are:

Both seminars will take place on Friday, December 12th, in the classrooms of our sponsor Kompas Xnet. The price for a seminar is €149, with an early bird price of €119, valid until October 31st.

I am also using this opportunity to explain how and why we decided on these two seminars. The decision was made by the conference organizers, Matija Lah, Mladen Prajdič, and Dejan Sarka. There has been a lot of discussion on different social networks about PASS Summit pre-conference seminars lately. If you have any objections to our seminars, please do not start big discussions in public; please raise them with the three of us directly.

First of all, unlike at the PASS Summit seminars, the speakers are not going to earn big money here, so it is not really worth spending much time and energy debating our decision. We think that any of the speakers who sent proposals for our SQL Saturday could present a top-quality seminar. We would like to enable seminars for every speaker who wants to deliver one. However, in a small country, we will already have a hard time filling up the two seminars we currently have. Our intention is to reimburse at least part of the money the speakers spent out of their own pockets on travel and accommodation. In our opinion, it makes sense to do this for the speakers who spent the most on travel. Coming here from the USA is expensive, and it also takes three days of travel in both directions. That is why we decided to organize the seminars for the first two speakers from the USA.

Of course, this is not the last event. If everything goes well with SQL Saturday #356 and with the seminars, we will definitely try to organize more events in the future, and invite more speakers to deliver seminars as well.

Thank you for understanding!

Join me for 3 Days of SSIS Training in Denmark – Advanced Integration Services 18-20 Nov 2014

I am honored to announce that Advanced Integration Services will be delivered in cooperation with  in Aarhus, Denmark, 18-20 Nov 2014. Target Audience The target audience for this course is intermediate SQL Server Integration Services developers (or quick learners) who wish to learn best practices and the more advanced stuff, and those who wish to upgrade their existing SSIS skills to 2012. Since there are almost no changes between SSIS 2012 and 2014, this course will also be suitable for SSIS 2014 users…(read more)

AdventureWorks 2014 Sample Databases Are Now Available


Where in the World is AdventureWorks?

Recently, SQL community feedback from Twitter prompted me to look in vain for SQL Server 2014 versions of the AdventureWorks sample databases we’ve all grown to know & love.

I searched Codeplex, then used the bing & even the google in an effort to locate them, yet all I could find were samples on different sites highlighting specific technologies, an incomplete collection inconsistent with the experience we users had learned to expect.  I began pinging internally & learned that an update to AdventureWorks wasn’t even on the road map.

Fortunately, SQL Marketing manager Luis Daniel Soto Maldonado (t) lent a sympathetic ear & got the update ball rolling; his direct report Darmadi Komo recently announced the release of the shiny new sample databases for OLTP, DW, Tabular, and Multidimensional models to supplement the extant In-Memory OLTP sample DB.

What Success Looks Like

In my correspondence with the team, here’s how I defined success:

1. Sample AdventureWorks DBs hosted on Codeplex showcasing SQL Server 2014’s latest-&-greatest features, including: 

  • In-Memory OLTP (aka Hekaton)
  • Clustered Columnstore
  • Online Operations
  • Resource Governor IO

2. Where it makes sense to do so, consolidate the DBs (e.g., showcasing Columnstore likely involves a separate DW DB)

3. Documentation to support experimenting with these features

As Microsoft Senior SDE Bonnie Feinberg (b) stated, “I think it would be great to see an AdventureWorks for SQL 2014.  It would be super helpful for third-party book authors and trainers.  It also provides a common way to share examples in blog posts and forum discussions, for example.” 

Exactly.  We’ve established a rich & robust tradition of sample databases on Codeplex.  This is what our community & our customers expect.  The prompt response achieves what we all aim to do, i.e., manifests the Service Design Engineering mantra of “delighting the customer”.  Kudos to Luis’s team in SQL Server Marketing & Kevin Liu’s team in SQL Server Engineering for doing so.

Download AdventureWorks 2014

Download your copies of SQL Server 2014 AdventureWorks sample databases here.

Getting Trained on Microsoft’s Expanding Data Platform

With data volumes exploding, having the right technology to find insights in your data is critical to long-term success.  Leading organizations are adjusting their strategies to focus on data management and analytics, and we are seeing a consistent increase in organizations adopting the Microsoft data platform to address their growing needs around data.  The trend is clear: CIOs named business intelligence (BI) and analytics their top technology priority in 2012, and again in 2013, and Gartner expects this focus to continue during 2014 [2].

At Microsoft, we have great momentum in the data platform space, and we are proud of the recognition from industry analysts: IDC reports that Microsoft SQL Server continues to be the unit leader and has become the #2 database vendor by revenue [1]. Microsoft was named a leader in both the Enterprise Data Warehouse and Enterprise Business Intelligence Waves by Forrester [3][4], and is named a leader in Gartner's Magic Quadrant for Operational Database Management Systems [5].

The market is growing and Microsoft has great momentum in this space, so this is a great time to dig in and learn more about the technology that makes up our data platform through these great new courses in the Microsoft Virtual Academy.

Microsoft’s data platform products

Quentin Clark recently outlined our data platform vision. This calendar year we will be delivering an unprecedented lineup of new and updated products and services:

  • SQL Server 2014 delivers mission-critical performance and analytics by bringing to market new in-memory capabilities built into the core database, speeding up OLTP workloads typically by 10x and up to 30x, and data warehousing queries by up to 100x. SQL Server 2014 provides the best platform for hybrid cloud scenarios, like cloud backup and cloud disaster recovery, and significantly simplifies the on-ramp to cloud for our customers with new point-and-click experiences for deploying cloud scenarios in the tools that are already familiar to database administrators (DBAs).
  • Power BI for Office 365 is a new self-service BI solution delivered through Excel and Office 365 which provides users with data analysis and visualization capabilities to identify deeper business insights from their on-premises and cloud data.
  • Windows Azure SQL Database is a fully managed relational database service that offers massive scale-out with global reach, built-in high availability, options for predictable performance, and flexible manageability. Offered in different service tiers to meet basic and high-end needs, SQL Database enables you to rapidly build, extend, and scale relational cloud applications with familiar tools.
  • Windows Azure HDInsight makes Apache Hadoop available as a service in the cloud, and also makes the MapReduce software framework available in a simpler, more scalable, and cost-efficient Windows Azure environment.
  • Parallel Data Warehouse (PDW) is a massively parallel processing data warehousing appliance built for any volume of relational data (with up to 100x performance gains) and provides the simplest way to integrate with Hadoop. With PolyBase, PDW can also seamlessly query relational and non-relational data.

In-depth learning through live online technical events 

To support the availability of these products, we’re offering live online events that will enable in-depth learning of our data platform offerings. These sessions are available now through the Microsoft Virtual Academy (MVA) and are geared towards IT professionals, developers, database administrators and technical decision makers. In each of these events, you’ll hear the latest information from our engineering and product specialists to help you grow your skills and better understand what differentiates Microsoft’s data offerings.

Here is a brief overview of the sessions that you can register for right now:

Business Intelligence

Faster Insights with Power BI Jumpstart | Register for the live virtual event on February 11

Session Overview: Are you a power Excel user? If you're trying to make sense of ever-growing piles of data, and you're into data discovery, visualization, and collaboration, get ready for Power BI. Excel, always great for analyzing data, is now even more powerful with Power BI for Office 365. Join this Jump Start, and learn about the tools you need to provide faster data insights to your organization, including Power Query, Power Map, and natural language querying. This live, demo-rich session provides a full-day drilldown into Power BI features and capabilities, led by the team of Microsoft experts who own them.

Data Management for Modern Business Applications

SQL Server in Windows Azure VM Role Jumpstart | Register for the live virtual event on February 18

Session Overview: If you're wondering how to use Windows Azure as a hosting environment for your SQL Server virtual machines, join the experts as they walk you through it, with practical, real-world demos. SQL Server in Windows Azure VM is an easy and full-featured way to be up and running in 10 minutes with a database server in the cloud. You use it on demand and pay as you go, and you get the full functionality of your own data center. For short-term test environments, it is a popular choice. SQL Server in Azure VM also includes pre-built data warehouse images and business intelligence features. Don't miss this chance to learn more about it.

Here’s a snapshot of the great content available to you now, with more to come later on the MVA data platform page:

Data Management for Modern Business Applications

Modern Data Warehouse

For more courses and training, keep tabs on the MVA data platform page and the TechNet virtual labs as well.

Thanks for digging in.

Eron Kelly
General Manager
Data Platform Marketing

———– 

[1] Market Analysis: Worldwide Relational Database Management Systems 2013–2017 Forecast and 2012 Vendor Shares, IDC report #241292, by Carl W. Olofson, May 2013.

[2] Business Intelligence and Analytics Will Remain CIO's Top Technology Priority, G00258063, by W. Roy Schulte, Neil Chandler, Gareth Herschel, Douglas Laney, Rita L. Sallam, Joao Tapadinhas, and Dan Sommer, 25 November 2013.

[3] The Forrester Wave™: Enterprise Data Warehouse, Q4 2013, Forrester Research, Inc., December 9, 2013.

[4] The Forrester Wave™: Enterprise Business Intelligence Platforms, Q4 2013, Forrester Research, Inc., December 18, 2013.

[5] Gartner, Magic Quadrant for Operational Database Management Systems, by Donald Feinberg, Merv Adrian, and Nick Heudecker, October 21, 2013.

Disclaimer:

Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose. 

Database Design training – for free?

When I started this blog, my plan was to focus on two main subject areas. One of them is SQL Server, T-SQL, and performance. The other is database design. Looking back over my post history, I honestly cannot say I delivered on the second area. Not because I have nothing to say about database design, but because I found it doesn’t lend itself well to a blog. In the SQL Server / T-SQL area, there are subjects that can be isolated and described in a single post, or in a short series. In database design,…(read more)

SQL Saturday #274 Slovenia Recapitulation

Pure success!

I could simply stop here. However, I want to mention again everybody involved in this, and also some who were unfortunately missing.

First of all, PASS is the organization that defined SQL Saturdays. And apparently the idea works!

I have to thank all of the speakers again. Coming here to share your amazing knowledge is something we really appreciate. The presentations were great, from the technical and every other perspective.

Of course, we could not do the event without sponsors. I am not going to list all of them again; I will just mention the host, pixi* labs, the company that hosted the event and whose employees helped with all of the organization. In addition, I need to mention Vina Kukovec. Boštjan Kukovec, a long-time member of the Slovenian SQL Server and Developers users group, organized a free wine tasting after the event. And what a wine it is!

Finally, thanks to all attendees for coming. We had approximately an 85% show-up rate; only 15% of the registered attendees didn’t come. This is an incredible result, worldwide! And from the applause after the raffle, when we closed the event (and started the wine tasting), I conclude that the attendees were very satisfied as well.

I want to mention three people who wanted to come, but ran out of luck this time. Peter Stegnar from pixi* labs was the one who immediately offered the venue, and his enthusiasm spread to the other pixi* labs members as well. Due to family reasons he couldn’t join us to see the results of his help. Tobiasz Janusz Koprowski wanted to speak, organized his trip, and looked forward to joining us; however, just a couple of days before the event he had to cancel because of some urgent work at a customer’s site. And what to say about Kevin Boles? He really tried hard to come. Think of it, he was prepared to come from the USA! He was already at the airport when his flight got cancelled due to technical problems. We were in constant touch on Friday evening. He managed to change the flight and went to the gate, but was not admitted to the plane because there were only seven minutes left until takeoff. Catching later flights would not have made sense anymore, because he would have arrived too late anyway. He really did the best he could; he just didn’t have enough luck this time. Peter, Tobiasz, and Kevin, thank you for your enthusiasm, we seriously missed you, and we certainly hope we will meet at our next event!

Fraud Detection with the SQL Server Suite Part 3

This is the third part of the fraud detection whitepaper. You can find the first part and the second part in my previous blog posts about this topic.

Data Preparation

The problem of credit card fraud detection is not trivial. With every transaction processed, only a limited amount of data is available, making it difficult, if not impossible, to distinguish between a “good” transaction and a (potentially) fraudulent one. In addition, there are literally millions of points of sale and web sites where a single credit card can be used. Even additional properties that could be available in the card owner’s profile, such as demographic data, probably will not make things much clearer. Do we focus on the customer or on the credit card? From experience, it seems that customers who use multiple credit cards typically use each card for a specific purpose. Although this means that we might start by profiling the card, it might also be worth checking the customer’s profile, as this might reveal different habits in different cultures.

It is also quite hard to request specific card properties and/or customer profile properties in advance. Different companies collect, maintain, and have access to different data sets. On the other hand, some common data, like geographical location, time of usage, type of product purchased, type of transaction (purchase, cash advance), and similar, can be available to us. A good overview of data useful for credit card and online banking fraud detection can be found in

Hand D.J., & Blunt G. (2001, October). Prospecting gems in credit card data. IMA Journal of Management Mathematics.

In addition to the source variables, many calculated variables can be extremely handy. If geographic data is not available, it can often be extracted from IP addresses, ZIP codes, and similar source data. Web addresses might also contain the country of origin or, alternatively, the business type. Universal product codes (UPCs) contain the country of origin and the producing company. Many variables can be calculated from the time of the transaction and other data, for example (a T-SQL sketch of the first two flags follows the list):

  • A flag designating whether multiple transactions have been issued from different IPs and the same person in a particular time frame
  • A flag designating whether transactions from multiple persons and the same IP have been issued in a particular time frame
  • Whether there are multiple persons using the same credit card or user account
  • Whether the total amount of a transaction is near the maximum amount allowed for a particular type of transaction, or whether it is nearer the minimum amount
  • The time of the transaction could be significant: was it issued on a holiday, a weekday, or a weekend, or on a particular day during the month
  • The frequency of transactions in a moving time frame
  • The number of distinct transactions in a moving time frame (often, the same kind of a transaction could be repeated regularly)
  • The quantity of deviations from a moving average for the type of the transaction
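
As a minimal T-SQL sketch of the first two flags above, assuming a hypothetical dbo.Transactions table with TransactionId, CustomerId, IpAddress, and TransactionTime columns (table and column names are illustrative only, not taken from any specific system):

    -- Hypothetical schema; adjust names and the one-hour window to your data.
    WITH TxWindow AS
    (
        SELECT t.TransactionId,
               -- Distinct IPs used by the same person in the previous hour
               (SELECT COUNT(DISTINCT t2.IpAddress)
                  FROM dbo.Transactions AS t2
                 WHERE t2.CustomerId = t.CustomerId
                   AND t2.TransactionTime >  DATEADD(HOUR, -1, t.TransactionTime)
                   AND t2.TransactionTime <= t.TransactionTime) AS DistinctIPsLastHour,
               -- Distinct persons seen on the same IP in the previous hour
               (SELECT COUNT(DISTINCT t3.CustomerId)
                  FROM dbo.Transactions AS t3
                 WHERE t3.IpAddress = t.IpAddress
                   AND t3.TransactionTime >  DATEADD(HOUR, -1, t.TransactionTime)
                   AND t3.TransactionTime <= t.TransactionTime) AS DistinctPersonsLastHour
          FROM dbo.Transactions AS t
    )
    SELECT TransactionId,
           CASE WHEN DistinctIPsLastHour     > 1 THEN 1 ELSE 0 END AS MultipleIPsFlag,
           CASE WHEN DistinctPersonsLastHour > 1 THEN 1 ELSE 0 END AS MultiplePersonsFlag
      FROM TxWindow;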

We can address the problem of having too few known frauds among all of the transactions by oversampling, i.e. repeating or copying known fraudulent transactions, or by undersampling, i.e. lowering the number of non-fraudulent transactions in the sample used for model training. From experience, we have learned to prefer the undersampling approach. We select non-fraudulent transactions either with simple random sampling or, when we have clearly determined groups of transactions, with stratified sampling. For example, it might be obvious that there are significant differences in the patterns across different countries; by using countries or regions we can define different strata and then randomly select an appropriate number of cases from each stratum separately. You can read more about sampling at

Wikipedia. (n.d.). Sample size determination. Retrieved from Wikipedia: http://en.wikipedia.org/wiki/Sample_size_determination.
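
Returning to undersampling: the following is a minimal T-SQL sketch of stratified undersampling over the hypothetical dbo.Transactions table used above, assuming an IsFraud flag and a Country column as the stratification variable; it keeps every known fraud and a fixed-size random sample of non-frauds per country:

    WITH NonFraudRanked AS
    (
        SELECT TransactionId, Country, Amount, TransactionTime, IsFraud,
               ROW_NUMBER() OVER (PARTITION BY Country ORDER BY NEWID()) AS rn
          FROM dbo.Transactions
         WHERE IsFraud = 0
    )
    SELECT TransactionId, Country, Amount, TransactionTime, IsFraud
      FROM dbo.Transactions
     WHERE IsFraud = 1                 -- keep every known fraudulent transaction
    UNION ALL
    SELECT TransactionId, Country, Amount, TransactionTime, IsFraud
      FROM NonFraudRanked
     WHERE rn <= 10000;                -- e.g. up to 10,000 random non-frauds per country (stratum)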

Different data mining algorithms are more or less prone to giving erroneous predictions when the target state is present with too low a frequency. With SSAS algorithms, we usually observe the following:

  • The Microsoft Neural Networks algorithm works best when you have about 50% of frauds in the total sample data set
  • The Microsoft Naïve Bayes algorithm already works well with 10% frauds
  • The Microsoft Decision Trees algorithm even works well with only 1% of frauds

During the data preparation, we also have to take care of missing values and outliers. Missing values can have a seriously negative effect on a data mining project. However, if the number of missing values is small enough, they can be handled by using a variety of methods:

  • Do nothing (a simple, but rarely a valid approach)
  • Filter out the rows containing the missing data (note that we could also filter out too many rows and lose a pattern)
  • Ignore the column (note that we could also ignore too many columns)
  • Predict the missing values with data mining algorithms, like Decision Trees (note that we could lose variability)
  • Build separate models, for example, one model for all the data (including missing values if the algorithm can handle them), one model for known data, and one model for rows with missing values (this does represent quite a lot of additional work)
  • Modify the operational systems so that the missing values can be collected later (this represents the best alternative, but unfortunately it is also the most difficult to achieve)
  • Replace the missing data with a mean (this is a very popular technique, although we could lose variability again)

Whenever we make any changes to the data, we are influencing the analysis. Before making any changes, we should determine whether there are any patterns in the missing data. We use data mining for this analysis. For instance, we add a flag with a value of 1 for rows that include missing values and a value of 0 for rows where all of the variable values for the case are known and present. Then we use a predictive algorithm like Decision Trees to explain this new flag variable with the other input variables. The resulting tree should be very shallow, without any strong patterns; otherwise, we have identified a pattern in the missing values. If a pattern has been found, it should be explained, and then we should use an appropriate missing value handling technique that does not alter the data (e.g., build separate models). Otherwise, we prefer to filter out the rows with missing values.
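
A minimal sketch of such a flag in T-SQL, again over the hypothetical dbo.Transactions table, assuming that IpAddress and ZipCode are the columns that can contain missing values:

    SELECT TransactionId, IpAddress, ZipCode, Amount,
           CASE WHEN IpAddress IS NULL OR ZipCode IS NULL
                THEN 1 ELSE 0
           END AS HasMissingValues    -- 1 = at least one input is missing, 0 = complete case
      FROM dbo.Transactions;

This flag then serves as the predictable variable for the Decision Trees model described above.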

Outliers are rare and far out-of-bound values. They are so far out of bound that they can influence the results of the analyses. Similarly to handling missing values, we determine whether there is any pattern in the outliers before handling them. We can do one of the following to address outliers:

  • Check if the outlier is an erroneous value, and if it is, correct it (the best possibility)
  • Do nothing (a simple, but rarely a valid approach)
  • Filter out the rows with the outliers (note that we could end up filtering out too many rows and lose a pattern)
  • Ignore the column (note that we could ignore too many columns)
  • Replace outliers with common (mean) values (note that, of course, we are losing variability)
  • Bin values into equal height ranges (this is a good approach, especially for algorithms that use discrete input variables)
  • Normalize the data values in predefined limited ranges

Some well-known methods of normalization include (a T-SQL sketch of the first two follows the list):

  • Range normalization
  • Z-score normalization
  • The logistic (sigmoid) function normalization
  • The hyperbolic tangent function normalization
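
As an illustration of the first two methods, both can be computed directly with T-SQL window aggregates (a sketch over the hypothetical dbo.Transactions table; Amount is assumed to be a numeric column):

    SELECT TransactionId, Amount,
           -- Range normalization: rescale into the [0, 1] interval
           1.0 * (Amount - MIN(Amount) OVER ())
             / NULLIF(MAX(Amount) OVER () - MIN(Amount) OVER (), 0) AS AmountRange01,
           -- Z-score normalization: distance from the mean in standard deviations
           1.0 * (Amount - AVG(Amount) OVER ())
             / NULLIF(STDEV(Amount) OVER (), 0)                     AS AmountZScore
      FROM dbo.Transactions;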

Discretization (or binning, or categorization, or recoding) is also useful for other purposes, not only for dealing with outliers. For example, some algorithms, for instance the Microsoft Naïve Bayes algorithm, can accept only discrete input variables. Discretization is performed on a single column. Examples of variables that are commonly discretized include:

  • Age
  • Income
  • Transaction amount

Note that with proper discretization we can compare otherwise incomparable data. For example, if income per capita in one country is significantly higher than in another, then it is difficult to compare the amounts of the transactions directly. However, if we discretize the income into three groups, such as low, average, and high, and appropriately tailor the group boundaries per country, we get comparable data. Please refer to

Pyle D. (1999). Data Preparation for Data Mining. Morgan Kaufmann

for further reading about data preparation for data mining.
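
A sketch of the per-country discretization mentioned above in T-SQL, using NTILE to split a hypothetical Income column in a dbo.Customers table into three equal-height groups within each country:

    SELECT CustomerId, Country, Income,
           CASE NTILE(3) OVER (PARTITION BY Country ORDER BY Income)
                WHEN 1 THEN 'Low'
                WHEN 2 THEN 'Average'
                ELSE 'High'
           END AS IncomeGroup    -- group boundaries are effectively tailored per country
      FROM dbo.Customers;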

Data Overview

As already mentioned, data overview activities interleave with the data preparation. In order to find outliers, we must get an idea of the distribution of a variable. We can use Microsoft Office Excel PivotTables and PivotCharts for this task. However, many times it is faster to use statistical computations and interpret the results. With Transact-SQL queries, we can calculate a lot of useful statistical information.

For a quick overview of discrete variables, we can use frequency tables. In a frequency table, we can show values, the absolute frequency of those values, absolute percentages, cumulative frequency, cumulative percent, and a histogram of the absolute percentage. OLAP cubes can be used to establish an overview of the frequency distribution for tens, if not hundreds of variables, very quickly.
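
For example, the following T-SQL sketch builds such a frequency table for a single discrete variable (Country in the hypothetical dbo.Transactions table), including cumulative values and a simple textual histogram:

    WITH Freq AS
    (
        SELECT Country,
               COUNT(*) AS AbsFreq,
               CAST(ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS NUMERIC(5, 2)) AS AbsPerc
          FROM dbo.Transactions
         GROUP BY Country
    )
    SELECT Country, AbsFreq, AbsPerc,
           SUM(AbsFreq) OVER (ORDER BY AbsFreq DESC ROWS UNBOUNDED PRECEDING) AS CumFreq,
           SUM(AbsPerc) OVER (ORDER BY AbsFreq DESC ROWS UNBOUNDED PRECEDING) AS CumPerc,
           REPLICATE('*', CAST(ROUND(AbsPerc, 0) AS INT)) AS Histogram
      FROM Freq
     ORDER BY AbsFreq DESC;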

For continuous variables, we can use descriptive statistics and calculate the first four population moments: Mean, Standard Deviation, Skewness, and Kurtosis. This gives us a quick impression of the distribution of values of those variables.
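
Mean and standard deviation come from built-in aggregates; skewness and kurtosis can be derived from them, as in the following sketch for the Amount column (standard sample formulas, hypothetical table as above):

    WITH Stats AS
    (
        SELECT COUNT(*)          AS n,
               AVG(1.0 * Amount) AS mean,
               STDEV(Amount)     AS sd
          FROM dbo.Transactions
    )
    SELECT s.n, s.mean, s.sd,
           -- Sample skewness
           s.n / ((s.n - 1.0) * (s.n - 2.0))
             * SUM(POWER((t.Amount - s.mean) / s.sd, 3)) AS Skewness,
           -- Sample (excess) kurtosis
           s.n * (s.n + 1.0) / ((s.n - 1.0) * (s.n - 2.0) * (s.n - 3.0))
             * SUM(POWER((t.Amount - s.mean) / s.sd, 4))
             - 3.0 * SQUARE(s.n - 1.0) / ((s.n - 2.0) * (s.n - 3.0)) AS Kurtosis
      FROM dbo.Transactions AS t
     CROSS JOIN Stats AS s
     GROUP BY s.n, s.mean, s.sd;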

It is also worth checking linear dependencies between pairs of variables. Some algorithms, like the Microsoft Decision Trees algorithm, tend to exclude one variable of a dependent pair from the analysis, while other algorithms, like the Microsoft Clustering algorithm, might find spuriously good clusters if they use pairs of dependent variables. There are multiple methods for calculating these dependencies (a T-SQL sketch for the correlation coefficient follows the list):

  • Chi-Squared test for pairs of discrete variables
  • Correlation Coefficient for pairs of continuous variables
  • Analysis of variance (ANOVA) for pairs of one continuous and one discrete variable
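
For instance, the correlation coefficient for a pair of continuous variables can be computed directly in T-SQL; the sketch below assumes two hypothetical numeric columns, Amount and CustomerIncome, available in the same table or view:

    SELECT (AVG(1.0 * Amount * CustomerIncome)
              - AVG(1.0 * Amount) * AVG(1.0 * CustomerIncome))
           / NULLIF(STDEVP(Amount) * STDEVP(CustomerIncome), 0) AS Correlation
      FROM dbo.Transactions;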

Based on our experience in the field of fraud detection, we developed a much faster method to test all possible linear dependencies between multiple variables, not just pairs. We use the Microsoft Naïve Bayes data mining algorithm, where we declare all variables as input and predictable at the same time; this way, we can determine all important dependencies with a single analysis. Of course, the Naïve Bayes algorithm expects discrete inputs only; however, SSAS can discretize variables on the fly, using different discretization methods. We usually use the Equal Heights method, even though it is typically not useful for the final analysis, because it changes the shape of the distribution; however, it is extremely useful for the overview of linear dependencies, because it retains the maximum amount of information in the variables.

Variables with an insignificant amount of information are useless in analysis. We measure the amount of information in a variable by calculating its entropy. As this calculation is quite slow, we do it only for variables which we suspect contain a low amount of information.
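
A minimal sketch of the entropy calculation for a single discrete variable (Country again), measured in bits:

    WITH Probs AS
    (
        SELECT Country,
               1.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS p
          FROM dbo.Transactions
         GROUP BY Country
    )
    SELECT -SUM(p * LOG(p, 2)) AS EntropyBits    -- maximum possible value is LOG(n, 2) for n distinct states
      FROM Probs;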

For more details about the statistics mentioned, please refer to

Wonnacott T.H., & Wonnacott R.J. (1990). Introductory Statistics, 5th Edition . Wiley.

For more information on the information theory, please refer to

Kullback S. (1997). Information Theory and Statistics (Dover Books on Mathematics). Dover Publications.