
Merry Christmas

I just wanted to take a moment to wish a Merry Christmas to you: people in the SQL Community; people I see at clients and the like; and especially those people I am incredibly privileged to have working at possibly the best SQL Consultancy in the world.

To those who I have represent my brand: I love you guys! You’re all passionate about providing the best experience for our customers, developing the SQL community, and doing amazing things to help people improve their data story. I couldn’t be prouder of you all. Sure, there are times when I lose sleep (and hair) over stuff, but I know that we have each other’s backs, and that’s a brilliant thing. I’ve often likened us to that story about the tiger in a cage. The best way to defend such a tiger is to let it out of its cage. If I can help enable you, and remove any obstacles that come between you and your ability to be phenomenal, then that’s what I’ll try to do. We all have our different styles, but together I think we can be an incredible force. It’s been a crazy year in many ways, including starting the LobsterPot story in the US (and Ted – you’ve been incredible!), but we have even more exciting times ahead, I’m sure. The Microsoft data stack is developing quicker than ever, and people are using it in bigger and better ways all the time.

Merry Christmas guys. Let’s continue to spread the SQL cheer… 🙂

@rob_farley

Retrieving N rows per group

Sometimes a forum response should just be a blog post… so here’s something I wrote over at http://dba.stackexchange.com/a/86765/4103.

The question was somewhat staged I think, being from Paul White (@sql_kiwi), who definitely knows this stuff already.

His question:

I often need to select a number of rows from each group in a result set.

For example, I might want to list the ‘n’ highest or lowest recent order values per customer.

In more complex cases, the number of rows to list might vary per group (defined by an attribute of the grouping/parent record). This part is definitely optional/for extra credit and not intended to dissuade people from answering.

What are the main options for solving these types of problems in SQL Server 2005 and later? What are the main advantages and disadvantages of each method?

AdventureWorks examples (for clarity, optional)

  1. List the five most recent transaction dates and IDs from the TransactionHistory table, for each product that starts with a letter from M to R inclusive.
  2. Same again, but with n history lines per product, where n is five times the Product’s DaysToManufacture attribute.
  3. Same, for the special case where exactly one history line per product is required (the single most recent entry by TransactionDate, tie-break on TransactionID).

And my answer:

Let’s start with the basic scenario.

If I want to get some number of rows out of a table, I have two main options: ranking functions; or TOP.

First, let’s consider the whole set from Production.TransactionHistory for a particular ProductID:

SELECT h.TransactionID, h.ProductID, h.TransactionDate
FROM Production.TransactionHistory h
WHERE h.ProductID = 800;

This returns 418 rows, and the plan shows that it checks every row in the table looking for this – an unrestricted Clustered Index Scan, with a Predicate to provide the filter. 797 reads here, which is ugly.

Expensive Scan with 'Residual' Predicate
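
As an aside, the read counts quoted throughout are logical reads, as reported by STATISTICS IO; a minimal way to reproduce the number above:

SET STATISTICS IO ON;

SELECT h.TransactionID, h.ProductID, h.TransactionDate
FROM Production.TransactionHistory h
WHERE h.ProductID = 800;

-- The Messages tab then reports something like:
-- Table 'TransactionHistory'. Scan count 1, logical reads 797, ...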

So let’s be fair to it, and create an index that would be more useful. Our conditions call for an equality match on ProductID, followed by a search for the most recent by TransactionDate. We need the TransactionID returned too, so let’s go with:

CREATE INDEX ix_FindingMostRecent ON Production.TransactionHistory (ProductID, TransactionDate) INCLUDE (TransactionID);

Having done this, our plan changes significantly, and drops the reads down to just 3. So we’re already improving things by over 250x or so…

Improved plan

Now that we’ve levelled the playing field, let’s look at the top options – ranking functions and TOP.

WITH Numbered AS
(
SELECT h.TransactionID, h.ProductID, h.TransactionDate, ROW_NUMBER() OVER (ORDER BY TransactionDate DESC) AS RowNum
FROM Production.TransactionHistory h
WHERE h.ProductID = 800
)
SELECT TransactionID, ProductID, TransactionDate
FROM Numbered
WHERE RowNum <= 5;

SELECT TOP (5) h.TransactionID, h.ProductID, h.TransactionDate
FROM Production.TransactionHistory h
WHERE h.ProductID = 800
ORDER BY TransactionDate DESC;

Two plans – basic TOP and RowNum

You will notice that the second (TOP) query is much simpler than the first, both in query and in plan. But very significantly, they both use TOP to limit the number of rows actually being pulled out of the index. The costs are only estimates and worth ignoring, but you can see a lot of similarity in the two plans, with the ROW_NUMBER() version doing a tiny amount of extra work to assign numbers and filter accordingly, and both queries end up doing just 2 reads to do their work. The Query Optimizer certainly recognises the idea of filtering on a ROW_NUMBER() field, realising that it can use a Top operator to ignore rows that aren’t going to be needed. Both these queries are good enough – TOP isn’t so much better that it’s worth changing code, but it is simpler and probably clearer for beginners.

So this works for a single product. But we need to consider what happens if we need to do this across multiple products.

The iterative programmer is going to consider the idea of looping through the products of interest, and calling this query multiple times, and we can actually get away with writing a query in this form – not using cursors, but using APPLY. I’m using OUTER APPLY, figuring that we might want to return the Product with NULL, if there are no Transactions for it.

SELECT p.Name, p.ProductID, t.TransactionID, t.TransactionDate
FROM 
Production.Product p
OUTER APPLY (
    SELECT TOP (5) h.TransactionID, h.ProductID, h.TransactionDate
    FROM Production.TransactionHistory h
    WHERE h.ProductID = p.ProductID
    ORDER BY TransactionDate DESC
) t
WHERE p.Name >= 'M' AND p.Name < 'S';

The plan for this is the iterative programmer’s method – a Nested Loop, doing a Top operation and Seek (those 2 reads we had before) for each Product. This gives 4 reads against Product, and 360 against TransactionHistory.

APPLY plan

Using ROW_NUMBER(), the method is to use PARTITION BY in the OVER clause, so that we restart the numbering for each Product. This can then be filtered like before. The plan ends up being quite different. The logical reads are about 15% lower on TransactionHistory, with a full Index Scan going on to get the rows out.
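
Here’s a sketch of that query – the same shape as the p.ProductID version shown further down, but partitioned by h.ProductID:

WITH Numbered AS
(
SELECT p.Name, p.ProductID, h.TransactionID, h.TransactionDate, ROW_NUMBER() OVER (PARTITION BY h.ProductID ORDER BY h.TransactionDate DESC) AS RowNum
FROM Production.Product p
LEFT JOIN Production.TransactionHistory h ON h.ProductID = p.ProductID
WHERE p.Name >= 'M' AND p.Name < 'S'
)
SELECT Name, ProductID, TransactionID, TransactionDate
FROM Numbered n
WHERE RowNum <= 5;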

ROW_NUMBER plan

Significantly, though, this plan has an expensive Sort operator. The Merge Join doesn’t seem to maintain the order of rows in TransactionHistory, so the data must be re-sorted to be able to find the row numbers. It’s fewer reads, but this blocking Sort could feel painful. Using APPLY, the Nested Loop will return the first rows very quickly, after just a few reads, but with a Sort, ROW_NUMBER() will only return rows after most of the work has been finished.

Interestingly, if the ROW_NUMBER() query uses INNER JOIN instead of LEFT JOIN, then a different plan comes up.

ROW_NUMBER() with INNER JOIN

This plan uses a Nested Loop, just like with APPLY. But there’s no Top operator, so it pulls all the transactions for each product, and uses a lot more reads than before – 492 reads against TransactionHistory. There isn’t a good reason for it not to choose the Merge Join option here, so I guess the plan was considered ‘Good Enough’. Still – it doesn’t block, which is nice – just not as nice as APPLY.

The PARTITION BY column that I used for ROW_NUMBER() was h.ProductID in both cases, because I had wanted to give the QO the option of producing the RowNum value before joining to the Product table. If I use p.ProductID, we see the same shape plan as with the INNER JOIN variation.

WITH Numbered AS
(
SELECT p.Name, p.ProductID, h.TransactionID, h.TransactionDate, ROW_NUMBER() OVER (PARTITION BY p.ProductID ORDER BY h.TransactionDate DESC) AS RowNum
FROM Production.Product p
LEFT JOIN Production.TransactionHistory h ON h.ProductID = p.ProductID
WHERE p.Name >= 'M' AND p.Name < 'S'
)
SELECT Name, ProductID, TransactionID, TransactionDate
FROM Numbered n
WHERE RowNum <= 5;

But the Join operator says ‘Left Outer Join’ instead of ‘Inner Join’. The number of reads against the TransactionHistory table is still just under 500.

PARTITION BY on p.ProductID instead of h.ProductID

Anyway – back to the question at hand…

We’ve answered question 1, with two options that you could pick and choose from. Personally, I like the APPLY option.

To extend this to use a variable number (question 2), the 5 just needs to be changed accordingly. Oh, and I added another index, so that there was an index on Production.Product.Name that included the DaysToManufacture column.
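
The exact definition isn’t spelled out here, but something along these lines does the job (the index name is just illustrative):

CREATE INDEX ix_ProductNameDays ON Production.Product (Name) INCLUDE (DaysToManufacture);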

WITH Numbered AS
(
SELECT p.Name, p.ProductID, p.DaysToManufacture, h.TransactionID, h.TransactionDate, ROW_NUMBER() OVER (PARTITION BY h.ProductID ORDER BY h.TransactionDate DESC) AS RowNum
FROM Production.Product p
LEFT JOIN Production.TransactionHistory h ON h.ProductID = p.ProductID
WHERE p.Name >= 'M' AND p.Name < 'S'
)
SELECT Name, ProductID, TransactionID, TransactionDate
FROM Numbered n
WHERE RowNum <= 5 * DaysToManufacture;

SELECT p.Name, p.ProductID, t.TransactionID, t.TransactionDate
FROM 
Production.Product p
OUTER APPLY (
    SELECT TOP (5 * p.DaysToManufacture) h.TransactionID, h.ProductID, h.TransactionDate
    FROM Production.TransactionHistory h
    WHERE h.ProductID = p.ProductID
    ORDER BY TransactionDate DESC
) t
WHERE p.Name >= 'M' AND p.Name < 'S';

And both plans are almost identical to what they were before!

Variable rows

Again, ignore the estimated costs – but I still like the TOP scenario, as it is so much simpler, and the plan has no blocking operator. The reads are lower on TransactionHistory because of the high number of zeroes in DaysToManufacture, but in real life, I doubt we’d be picking that column. 😉

One way to avoid the block is to come up with a plan that handles the ROW_NUMBER() bit to the right (in the plan) of the join. We can persuade this to happen by doing the join outside the CTE. (Edited because of a silly typo that meant that I turned my Outer Join into an Inner Join.)

WITH Numbered AS
(
SELECT h.TransactionID, h.ProductID, h.TransactionDate, ROW_NUMBER() OVER (PARTITION BY ProductID ORDER BY TransactionDate DESC) AS RowNum
FROM Production.TransactionHistory h
)
SELECT p.Name, p.ProductID, t.TransactionID, t.TransactionDate
FROM Production.Product p
LEFT JOIN Numbered t ON t.ProductID = p.ProductID
    AND t.RowNum <= 5 * p.DaysToManufacture
WHERE p.Name >= 'M' AND p.Name < 'S';


The plan here looks simpler – it’s not blocking, but there’s a hidden danger.

Notice the Compute Scalar that’s pulling data from the Product table. This is working out the 5 * p.DaysToManufacture value. This value isn’t being passed into the branch that’s pulling data from the TransactionHistory table; it’s being used in the Merge Join. As a Residual.


So the Merge Join is consuming ALL the rows – not just the first however-many-are-needed – and then doing a residual check. This is dangerous as the number of transactions increases. I’m not a fan of this scenario – residual predicates in Merge Joins can quickly escalate. Another reason why I prefer the APPLY/TOP scenario.

In the special case where it’s exactly one row, for question 3, we can obviously use the same queries, but with 1 instead of 5. But then we have an extra option, which is to use regular aggregates.

SELECT ProductID, MAX(TransactionDate)
FROM Production.TransactionHistory
GROUP BY ProductID;

A query like this would be a useful start, and we could easily modify it to pull out the TransactionID as well for tie-break purposes (using a concatenation which would then be broken down), but we either look at the whole index, or we dive in product by product, and we don’t really get a big improvement on what we had before in this scenario.
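
Just to illustrate that concatenation idea – this is a sketch of the approach rather than a query from the original answer – you can pack a zero-padded TransactionID onto the end of a fixed-width date string (which still sorts by TransactionDate), take the MAX, and then split it back apart:

SELECT ProductID,
       CAST(LEFT(Combined, 23) AS datetime) AS TransactionDate,
       CAST(SUBSTRING(Combined, 24, 10) AS int) AS TransactionID
FROM (
    SELECT ProductID,
           -- style 121 gives a fixed 23-character 'yyyy-mm-dd hh:mi:ss.mmm' string
           MAX(CONVERT(char(23), TransactionDate, 121)
               + RIGHT('0000000000' + CAST(TransactionID AS varchar(10)), 10)) AS Combined
    FROM Production.TransactionHistory
    GROUP BY ProductID
) AS g;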

But I should point out that we’re looking at a particular scenario here. With real data, and with an indexing strategy that may not be ideal, mileage may vary considerably. Despite the fact that we’ve seen that APPLY is strong here, it can be slower in some situations. It rarely blocks though, as it has a tendency to use Nested Loops, which many people (myself included) find very appealing.

I haven’t tried to explore parallelism here, or dived very deeply into question 3, which I see as a special case that people rarely want, given the complication of concatenating and splitting. The main thing to consider here is that these two options are both very strong.

I prefer APPLY. It’s clear, it uses the Top operator well, and it rarely causes blocking.

@rob_farley

Data Science Perspectives: Q&A with Microsoft Data Scientists Val Fontama and Wee Hyong Tok

You can’t read the tech press without seeing news of exciting advancements or opportunities in data science and advanced analytics. We sat down with two of our own Microsoft Data Scientists to learn more about their role in the field, some of the real-world successes they’ve seen, and get their perspective on today’s opportunities in these evolving areas of data analytics.

If you want to learn more about predictive analytics in the cloud or hear more from Val and Wee Hyong, check out their new book, Predictive Analytics with Microsoft Azure Machine Learning: Build and Deploy Actionable Solutions in Minutes.

First, tell us about your roles at Microsoft?

 [Val] Principal Data Scientist in the Data and Decision Sciences Group (DDSG) at Microsoft

 [Wee Hyong] Senior Program Manager, Azure Data Factory team at Microsoft

 And how did you get here? What’s your background in data science?

[Val] I started in data science over 20 years ago when I did a PhD in Artificial Intelligence. I used Artificial Neural Networks to solve challenging engineering problems, such as the measurement of fluid velocities and heat transfer. After my PhD, I applied data mining in the environmental science and credit industry: I did a year’s post-doctoral fellowship before joining Equifax as a New Technology Consultant in their London office. There, I pioneered the application of data mining to risk assessment and marketing in the consumer credit industry. I hand coded over ten machine learning algorithms, including neural networks, genetic algorithms, and Bayesian belief networks in C++ and applied them to fraud detection, predicting risk of default, and customer segmentation.    

[Wee Hyong] I’ve worked on database systems for over 10 years, from academia to industry. I joined Microsoft after I completed my PhD in Data Streaming Systems. When I started, I worked on shaping the SSIS server from concept to release in SQL Server 2012. I had been super passionate about data science even before joining Microsoft: I wrote code to integrate association rule mining into a relational database management system, allowing users to combine association rule mining queries with SQL queries. I was also a SQL Server Most Valuable Professional (MVP), running data mining boot camps for IT professionals in Southeast Asia and showing how to transform raw data into insights using the data mining capabilities in Analysis Services.

What are the common challenges you see with people, companies, or other organizations who are building out their data science skills and practices?

[Val] The first challenge is finding the right talent. Many of the executives we talk to are keen to form their own data science teams but may not know where to start. First, they are not clear what skills to hire – should they hire PhDs in math, statistics, computer science or other? Should the data scientist also have strong programming skills? If so, in what programming languages? What domain knowledge is required? We have learned that data science is a team sport, because it spans so many disciplines including math, statistics, computer science, etc. Hence it is hard to find all the requisite skills in a single person. So you need to hire people with complementary skills across these disciplines to build a complete team.

The next challenge arises once there is a data science team in place – what’s the best way to organize this team? Should the team be centralized or decentralized? Where should it sit relative to the BI team? Should data scientists be part of the BI team or separate? In our experience at Microsoft, we recommend having a hybrid model with a centralized team of data scientists, plus additional data scientists embedded in the business units. Through the embedded data scientists, the team can build good domain knowledge in specific lines of business. In addition, the central team allows them to share knowledge and best practices easily. Our experience also shows that it is better to have the data science team separate from the BI team. The BI team can focus on descriptive and diagnostic analysis, while the data science team focuses on predictive and prescriptive analysis. Together they will span the full continuum of analytics.

The last major challenge I often hear about is the actual practice of deploying models in production. Once a model is built, it takes time and effort to deploy it in production. Today many organizations rewrite the models to run on their production environments. We’ve found success using Azure Machine Learning, as it simplifies this process significantly and allows you to deploy models to run as web services that can be invoked from any device.

[Wee Hyong] I also hear about challenges in identifying tools and resources to help build these data science skills. There are a significant number of online and printed resources covering a wide spectrum of data science topics – from the theoretical foundations of machine learning to practical applications. One of the challenges is navigating this sea of resources and selecting the right ones to get started with.

Another common challenge is identifying the right set of tools to model a given predictive analytics scenario. Once the right tools have been chosen, it is equally important for people and companies to be able to easily operationalize the predictive analytics solutions they have built, to create new value for their organization.

What is your favorite data science success story?

[Val] My two favorite projects are the predictive analytics projects for ThyssenKrupp and Pier 1 Imports. I’ll speak today about the Pier 1 project. Last spring my team worked with Pier 1 Imports and their partner, MAX451, to improve cross-selling and upselling with predictive analytics. We built models that predict the next logical product category once a customer makes a purchase. Based on Azure Machine Learning, this solution will lead to a much better experience for Pier 1 customers.

[Wee Hyong] One of my favorite data science success stories is how OSIsoft collaborated with the Carnegie Mellon University (CMU) Center for Building Performance and Diagnostics to build an end-to-end solution that addresses several predictive analytics scenarios. With predictive analytics, they were able to solve many business challenges, ranging from predicting energy consumption in different buildings to fault detection. The team was able to effectively operationalize the machine learning models built using Azure Machine Learning, which led to better energy utilization in the buildings at CMU.

What advice would you give to developers looking to grow their data science skills?

[Val] I would highly recommend learning multiple subjects: statistics, machine learning, and data visualization. Statistics is a critical skill for data scientists that offers a good grounding in correct data analysis and interpretation. With good statistical skills we learn best practices that help us avoid pitfalls and wrong interpretation of data. This is critical because it is too easy to unwittingly draw the wrong conclusions from data. Statistics provides the tools to avoid this. Machine learning is a critical data science skill that offers great techniques and algorithms for data pre-processing and modeling. And last, data visualization is a very important way to share the results of analysis. A good picture is worth a thousand words – the right chart can help to translate the results of complex modeling into your stakeholder’s language. So it is an important skill for a budding data scientist.

[Wee Hyong] Be obsessed with data, and acquire a good understanding of the problems that can be solved by the different algorithms in the data science toolbox. A good way to jumpstart is to model a business problem in your organization where predictive analytics can help create value. You might not get it right on the first try, but that’s OK. Keep iterating and figuring out how you can improve the quality of the model. Over time, you will see that these early experiences help build up your data science skills.

Besides your own book, what else are you reading to help sharpen your data science skills?

[Val] I am reading the following books:

  • Data Mining and Business Analytics with R by Johannes Ledolter
  • Data Mining: Practical Machine Learning Tools and Techniques, Third Edition (The Morgan Kaufmann Series in Data Management Systems) by Ian H. Witten, Eibe Frank, and Mark A. Hall
  • Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie or Die by Eric Siegel

[Wee Hyong] I am reading the following books:

  • Super Crunchers: Why Thinking-By-Numbers Is the New Way to Be Smart by Ian Ayres
  • Competing on Analytics: The New Science of Winning by Thomas H. Davenport and Jeanne G. Harris.

Any closing thoughts?

[Val]  One of the things we share in the book is that, despite the current hype, data science is not new. In fact, the term data science has been around since 1960. That said, I believe we have many lessons and best practices to learn from other quantitative analytics professions, such as actuarial science. These include the value of peer reviews, the role of domain knowledge, etc. More on this later.

[Wee Hyong] One of the reasons that motivated us to write the book is we wanted to contribute back to the data science community, and have a good, concise data science resource that can help fellow data scientists get started with Azure Machine Learning. We hope you find it helpful. 

SQL Server Performance Monitoring Tools

Being able to identify SQL Server performance issues at the drop of a hat is easier said than done. Without a means to collect and analyze the performance data, it is difficult at best to understand and correct the issues in a timely manner. SQL Server ships with a handful of tools, including Profiler, Sysmon (Perfmon), and the Database Engine Tuning Advisor (formerly the Index Tuning Wizard). Much of the time these tools meet the needs for manual collection and analysis, but what if you need to go beyond the tools that are available to resolve a performance issue quickly?

Invalid Quorum Configuration Warnings when failing over SQL Server Availability Group

I was at a client site today, and they asked me about a warning that they got every time they manually failed over their SQL Server availability group.

It said: “The current WSFC cluster quorum vote configuration is not recommended for the availability group.” They were puzzled by this as they had a valid quorum configuration. In their case, they had a two node cluster using MNS (majority node set) and a fileshare witness.

The problem with that message is that it is returned when the node voting weight is not visible.

Windows Server 2008 failover clustering introduced node-based voting but later an option was provided to adjust the voting weight for each node. If the cluster is based on Windows Server 2008 or Windows Server 2008 R2, and KB2494036 has not been applied, even though each node has a vote, the utilities that check voting weight are not supplied a weight value. You can see this by querying:

SELECT * FROM sys.dm_hadr_cluster_members;

This will return a row for each cluster member, but the vote weight will show as missing.
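
For the record, the vote weight in sys.dm_hadr_cluster_members is the number_of_quorum_votes column, so a more targeted check might look like this (assuming the missing weight surfaces as NULL on an unpatched cluster):

SELECT member_name, member_type_desc, member_state_desc, number_of_quorum_votes
FROM sys.dm_hadr_cluster_members
WHERE number_of_quorum_votes IS NULL;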

Applying the KB hotfix will make this DMV return the correct values, and will make this invalid warning disappear.

Online certification exams are now available in Australia

I’ve been hoping this would happen for a while and now it’s here (in beta).

Whenever I take a certification exam, I find it removes my ability to work for most of a day, so I tend to schedule myself for two or three exams in a day, to avoid the overhead. It also means that I tend to limit the number of exams that I would take.

Online proctoring of exams changes all that for me. If I can just schedule an exam for lunchtime, an evening, or a weekend from my own office, I’ll be much more inclined to take more certification exams.

There are some rules:

  • You’ll be recorded—both video and audio—for the duration of the exam.
  • You can’t take notes during the exam.
  • You can’t eat, drink, or chew gum while you take the exam.
  • You can’t take a break—for any reason.

They seem reasonable to me, and if you don’t like them, you can always attend an in-person exam. Better make sure you get to the rest room beforehand, though. 🙂 It’s not available in all countries yet, but fortunately Australia is one of the countries in the list. I’m not sure how the countries are chosen, because I notice that our Kiwi buddies aren’t in the list yet. I’m sure that will change over time.

Regardless, this is a really good initiative. Well done Microsoft Learning.

You’ll find more information here: https://www.microsoft.com/learning/en-us/online-proctored-exams.aspx