Monthly Archives: January 2014

Microsoft to Join the Big Data Conversation as an Elite Sponsor at Strata, Santa Clara, CA

On February 11-13, Microsoft will join 2,700 of the world’s leading big data scientists, CIOs and executives at the Strata conference in Santa Clara, CA. For some, Microsoft does not come immediately to mind when thinking about leadership in data; however, with the power of HDInsight, Microsoft is working to democratize big data – connecting companies to previously untouched data sources and enabling everyone to gain insight through familiar and powerful tools such as Microsoft Office Excel.

Quentin Clark, Corporate Vice President of the Data Platform Group, will share some of that vision in his keynote address, followed by an interview with O’Reilly’s Mac Slocum – streamed live from the Strata conference website. Additionally, Herain Oberoi, Director of Product Marketing at Microsoft, will deliver a joint session with Bryan Hurd, Director of Advanced Analytics at the Microsoft Cybercrime Center, highlighting how they use Microsoft’s portfolio of Big Data solutions, including simple, familiar end-user tools and cutting-edge cloud technologies, to fight organized crime and botnets that commit hundreds of millions of dollars of identity fraud every year (view the Cybercrime Center video on YouTube). Visit the Strata website for the full event agenda.

Microsoft is also hosting a hack-a-thon on February 8-9. This big data hack-a-thon is not tied to any one technology; the aim is to spend 1½ days hacking for a good cause. Extra credit will go to those who develop using HDInsight and Power BI, but you are free to use whichever technology you like. Sign up here.

Join the conversation – attend one of our sessions in person, visit our exhibit booth (#401) and interact with our data experts, or follow us on Facebook.com/sqlserver and Twitter @SQLServer. You will learn how Microsoft’s complete data platform delivers Big Data: enabling you to enrich your data with external data sources; connect any type of data – structured or unstructured; and extend the accessibility of data analysis beyond scientists into the hands of every employee.

Increasing Buffer Pool in SQL Server 2014

Under heavy load, your system may not perform as intended due to a lack of resources. To handle this situation, SQL Server 2014 gives us the ability to use an SSD or RAM device to dynamically increase the size of the Buffer Pool. Keep reading to see how you can take advantage of this wonderful feature.
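As a quick taste of what the full post covers, here is a minimal sketch of enabling the buffer pool extension on a fast drive (the file path and size are placeholders you would adjust for your own system):

ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION ON
    (FILENAME = 'F:\SSDCACHE\ExampleCache.BPE', SIZE = 32 GB);

-- And to turn the extension back off:
ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION OFF;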

Query Performance and Parallelism Seminars in Boston and Chicago

You’re a DBA or database developer working on a big (or even big-ish — doesn’t have to be huge) database. Your end-users are doing all sorts of data analysis, and even though you have a pretty beefy server, your queries just don’t seem to run as fast as they should. The reason (I bet)? Your queries aren’t taking full advantage of the CPU power available in today’s servers. SQL Server has powerful parallel processing capabilities, but left to its own devices (and defaults) you’re never going to…(read more)

In-Memory OLTP Index Troubleshooting, Part II

With the In-Memory OLTP feature in SQL Server 2014 we introduce two new types of indexes for memory-optimized tables: the memory-optimized NONCLUSTERED indexes, and the memory-optimized HASH indexes.

Memory-optimized NONCLUSTERED indexes behave similarly to traditional NONCLUSTERED indexes from an application point of view. The main difference is that memory-optimized indexes are always covering (i.e., all columns are virtually included), while with traditional disk-based NONCLUSTERED indexes you need to specify which columns you want to include alongside the index key columns.

Memory-optimized HASH indexes behave differently from NONCLUSTERED indexes. They are optimized for point-lookup operations, and do not support ordered scans or inequality seek operations. In addition, you need to specify a BUCKET_COUNT when creating the index, and you need to pick the right value for the index to perform optimally. In earlier posts we covered the following three index troubleshooting aspects:

  • Bucket_count troubleshooting for HASH indexes: how to find out whether a bucket_count is too low, and how to determine the correct bucket_count. You can also find more information about determining the right bucket_count in Books Online.
  • Searching on a subset of the index key: HASH indexes do not support seeking on just the leading columns of the index key; a seek must supply all index key columns.
  • Searching on inequality predicates and ordered scans: neither operation is supported with HASH indexes, but both are supported with NONCLUSTERED indexes. Note, however, that NONCLUSTERED indexes support ordered scans only in the direction specified by the index key: for example, if the index key is (c1 DESC), the index supports retrieving the values for c1 in descending order, but not in ascending order.

All that said, it seems that HASH indexes have quite a few limitations compared with NONCLUSTERED indexes. For that reason, it is usually a safer bet to start with NONCLUSTERED indexes, both for new applications and when migrating existing disk-based tables to memory-optimized tables. You can then use HASH indexes to further optimize the workload. Indeed, HASH indexes are the most efficient index type for equality search (point lookup) and full table scan operations.
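To make the two index types concrete, here is a minimal sketch of a memory-optimized table declaring one of each (the table, columns, and bucket count are made-up illustrations; note that in SQL Server 2014 all indexes on a memory-optimized table must be declared inline at CREATE TABLE time):

CREATE TABLE dbo.Customers (
    -- HASH index: best for point lookups on CustomerId.
    CustomerId INT NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    Name NVARCHAR(100) NOT NULL,
    Region NVARCHAR(20) NOT NULL,
    -- Memory-optimized NONCLUSTERED index: also supports ordered scans
    -- and inequality predicates on Region.
    INDEX ix_Region NONCLUSTERED (Region)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);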

In the remainder of this post we:

  • Describe a way to troubleshoot seeks vs. scans on memory-optimized indexes using the new DMV sys.dm_db_xtp_index_stats: if there are more full index scans than expected, one of the above-mentioned issues may be the cause: trying to search on a subset of a HASH index key, or trying to search on inequality predicates with a HASH index.
  • Troubleshoot duplicates: you may run into problems if the index keys of a memory-optimized index contain a lot of duplicate values, particularly when using a HASH index. We describe how to detect this situation and how to work around it.

Troubleshooting seek vs. scan using XTP index DMVs

The DMV sys.dm_db_xtp_index_stats shows statistics for index operations performed by the in-memory storage engine. It contains stats about the usage of each index since its creation in memory – note that memory-optimized indexes are always recreated on database restart. You can use the following query to retrieve key statistics about the usage of indexes on your table:

SELECT ix.index_id, ix.name, scans_started, rows_returned
FROM sys.dm_db_xtp_index_stats ixs JOIN sys.indexes ix ON
    ix.object_id=ixs.object_id AND ix.index_id=ixs.index_id
WHERE ix.object_id=object_id('<table name>')

For troubleshooting indexes, the columns ‘scans_started’ and ‘rows_returned’ contain key information:

  • scans_started – this is the number of scan operations the in-memory storage engine has started. Note that from the storage engine’s point of view, all operations to locate a row or the location to insert a new row are scans: for example, a full index scan, a point lookup, and a row insert all require a single scan operation.
  • rows_returned – the cumulative number of rows returned by all scan operations in the storage engine. Note that this number reflects the rows returned by the storage engine, not the number of rows returned to the client. For example, the query plan may call for a filter condition or aggregation that reduces the number of rows before the results are returned to the client.
    • Insert operations do not result in rows being returned. Update and delete operations consist of a scan, to locate the row to be updated, followed by the actual row update/delete.

If rows_returned is significantly larger than scans_started, this is an indication that, on average, index operations scan a large part of the index. If all index operations are expected to be point lookups, this could indicate one of the problems mentioned earlier, where the query calls for an operation that is not supported by the index and therefore falls back to a full index scan: searching on a subset of the HASH index key columns, or searching on inequality predicates with a HASH index.

If scans_started is larger than rows_returned, this is an indication that the workload is insert-heavy, or that many point lookups failed to locate a row.
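As a hedged sketch of this check (building on the query above; treat the ratio as an indicator to eyeball, not a documented threshold):

SELECT ix.index_id, ix.name, ixs.scans_started, ixs.rows_returned,
    -- Average rows returned per scan; for a pure point-lookup workload
    -- this should stay close to 1.
    ixs.rows_returned / NULLIF(ixs.scans_started, 0) AS avg_rows_per_scan
FROM sys.dm_db_xtp_index_stats ixs JOIN sys.indexes ix ON
    ix.object_id=ixs.object_id AND ix.index_id=ixs.index_id
WHERE ix.object_id=object_id('<table name>')
ORDER BY avg_rows_per_scan DESC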

Index keys with many duplicate values

Issue: Index keys with many duplicate values can cause performance problems. If each index key has five duplicates, this is usually not a problem, but if the discrepancy between the number of unique index keys and the number of rows in the table becomes very large – more than 10X – it can become problematic.

All rows with the same index key end up in the same duplicate chain. For hash indexes this can create a lot of overhead in the case of hash collisions: if multiple index keys end up in the same bucket due to a hash collision, index scanners always need to scan the full duplicate chain for the first value before they can locate the first row corresponding to the second value. For nonclustered indexes, duplicates cause additional overhead for garbage collection.

Symptom: For hash indexes, the performance of DML operations degrades and CPU utilization increases. In addition, there is an increase in CPU utilization during database startup, and a potential increase in recovery time. This becomes especially noticeable when inserting a large number of rows.

For nonclustered indexes garbage collection will start to consume more resources, resulting in an overall increased CPU utilization in the system. The problem does not affect DML operations directly,[1] but it does put more overall pressure on system resources.

How to troubleshoot: The average number of duplicate rows for a given index key can be obtained using T-SQL queries. First determine the row count of the table, then determine the number of unique index key values. Divide the row count by the number of unique index keys to obtain the average number of duplicates.

To determine the row count for the table use the following query:

select count(*) as 'Row count' from <tableName>

To determine the number of unique index key values use the following query:

select count(*) as 'Distinct index key values' from (select distinct <indexKeyColumns> from <tableName>) a
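Putting the two together, here is a small sketch that computes the average number of duplicates per index key (dbo.Orders and CustomerId are hypothetical placeholders for your own table and index key columns):

DECLARE @row_count FLOAT = (SELECT COUNT(*) FROM dbo.Orders);
DECLARE @key_count FLOAT = (SELECT COUNT(*) FROM (SELECT DISTINCT CustomerId FROM dbo.Orders) a);
-- More than roughly 10X duplicates per key is where problems typically start.
SELECT @row_count / @key_count AS [Average duplicates per index key];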

For hash indexes specifically, you can also troubleshoot using the hash index stats DMV. Use the following query:

SELECT hs.object_id, object_name(hs.object_id) AS 'object name', i.name AS 'index name', hs.*
FROM sys.dm_db_xtp_hash_index_stats AS hs
JOIN sys.indexes AS i ON hs.object_id=i.object_id AND hs.index_id=i.index_id;

If the average chain length is high and the number of empty buckets is high, it is likely that there are many rows with duplicate index key values or there is a skew in the key values.
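As an illustrative filter on that DMV (the thresholds here are rules of thumb for this kind of investigation, not documented limits), you could flag only the suspicious hash indexes:

SELECT object_name(hs.object_id) AS 'object name', i.name AS 'index name',
    hs.avg_chain_length, hs.empty_bucket_count, hs.total_bucket_count
FROM sys.dm_db_xtp_hash_index_stats AS hs
JOIN sys.indexes AS i ON hs.object_id=i.object_id AND hs.index_id=i.index_id
WHERE hs.avg_chain_length > 10
    AND hs.empty_bucket_count > 0.1 * hs.total_bucket_count;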

Workaround: First, evaluate if the index is truly needed. The index may have been added in the past to support queries that are no longer part of the workload.

For HASH indexes there are two ways to work around the issue of many duplicates:

  • In most cases you will want to use a NONCLUSTERED index instead, as NONCLUSTERED indexes generally perform better in case of duplicates. If you go for this option, consider uniquifying the index key, as indicated below.
  • Alternatively, you can over-size the index by using a very high bucket count; for example, 20–100 times the number of unique index key values. This will make hash collisions unlikely.

For NONCLUSTERED indexes with a lot of duplicates, consider adding additional columns to the index key. For example, you can add the primary key columns to the index key to make it unique – in other words, to uniquify the index.
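As a sketch of what a uniquified index looks like (table and column names are hypothetical), here the primary key column OrderId is appended to the index key on CustomerId:

CREATE TABLE dbo.Orders (
    OrderId INT NOT NULL PRIMARY KEY NONCLUSTERED,
    CustomerId INT NOT NULL,
    OrderDate DATETIME2 NOT NULL,
    -- Appending the primary key column makes every index key unique,
    -- which keeps duplicate chains short.
    INDEX ix_CustomerId NONCLUSTERED (CustomerId, OrderId)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);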

Adding columns to the index key does come with a performance penalty for DML operations. Therefore, it is important to test the new uniquified index under production load, and compare the performance with the old index. Consider the overall throughput of the system, query performance, and also the overall resource utilization, especially CPU.



[1] In SQL 2014 CTP2 user transactions could be affected, due to garbage collection performed at commit time. This issue will be resolved in SQL 2014 RTM: garbage collection no longer directly affects user transactions.

Recent thoughts on learning Chinese

Over the last few days, I’ve been getting a lot of questions again about what I’ve found works and doesn’t work with learning Chinese.


Here is a summary of the things I’ve tried and my thoughts on them:


1. Local classes


My wife and I attended some local classes in the first year I tried to learn. That was only about 2 hours per week and while it was interesting, it’s way too little to be very meaningful.


2. ChinesePod


This is a great podcast. They concentrate on spoken Chinese and much of it is good fun. I really liked the older material with Ken and Jenny but the new hosts are getting there. They have different levels. Newcomers might find much of the Newbies level quite challenging at first, but it’s important to persist with it. The step up to Elementary is substantial but not too scary. The step up to Intermediate though, is really too big a jump. I now find Elementary too easy but Intermediate is often too hard. I can’t comment on higher levels.


The accents on ChinesePod are pretty clean but some people complain that it’s a bit too Shanghai-ish. It seems pretty good but it’s not as clean as material that I’ve used from Beijing. You can really tell the difference.


ChinesePod also have lessons that you can do daily or weekly with teachers. These are not inexpensive, but I think they’d be good value for those who want to progress quickly. Having someone to talk to for 20 mins a day, 5 days a week should help enormously, and it allows the teacher to provide one-on-one help in your weaker areas. I’ve done a demo session and I might take this up soon.


3. Rosetta Stone


This is probably the best known of the traditional language learning systems. For most languages, there are 3 levels. For Chinese, there are 5 levels. Each level consists of 4 units. I’m currently half a unit from completing the 5 levels. My in-laws (Chinese) say I’ve learned an enormous amount by doing this material. I’m sure that’s true, but I have to say I’ve found it very, very frustrating at times.


A surprise for most people is that all spoken and written words in the course are in your target language. Read that last statement carefully. They argue that children learn by seeing. While this concept mostly works, I could not have learned using Rosetta Stone without having Google Translate and Pleco open the whole time. There were way too many subtleties that I could not have picked up from the pictures alone. I have had many moments of “how the @#$@#$@ could I have known that???” when working with Rosetta Stone.


When I signed up, they also had a Studio product that gave you online access for about $60 every six months. That was amazing value and one of the main reasons that I used it. It let you book into online classes that were conducted directly from Beijing. There was no limit to the number of classes you could book into. Clearly this was a problem for them, so about a year ago they dramatically changed the rules, to the point where the Studio product was no longer worthwhile, in my opinion, for many, many reasons. That was a real pity, as it was the best part of the product and they broke it.


4. Pleco


This is by far the best app I’ve found for using on my iPad. It helps me translate Chinese to/from English in many useful ways. I use it every day. One of the reasons that I can’t move to a Windows Phone is that it’s not available on that platform and I wouldn’t want to live without Pleco.


5. Serge Melnyk’s Podcast


This is an odd one. Serge is a Russian (yes, you read that correctly) who has a podcast on learning Chinese. I’ve come to really like Serge and have signed up to his site so that I could download all the content. If you have travelling time when you can listen to podcasts, it’s well worthwhile, although the accent is quite bizarre.


6. HSK Review


This is a great app for learning the Chinese characters that are required for the HSK exams. I hope to do at least HSK 1 and HSK 2 exams this year, perhaps HSK 3 if I’m lucky.


7. AnkiApp


This is in many ways an even better app for learning HSK based characters. It keeps track of what you are getting right and wrong, and how hard you find each character. It works out what you need to work on and provides great review drills for you based on that feedback. It also then assesses what level it thinks you are at for each of the levels.


8. Google Translate


This is invaluable and much superior, translation-wise, to Bing Translator (sorry Microsoft). The pinyin input editor doesn’t work in IE11 (sorry again Microsoft) but, no surprise, works like a charm in Chrome. Many of the translations are “interesting”, so I’m glad I have a bunch of patient Chinese-speaking buddies on Facebook who help me out when I just can’t work out what something really means.


9. PeraPera Chinese Popup Dictionary


This is an add-on for Chrome (and for Firefox) that has quickly become one of my favourite things. You just click a button in the top right of the browser when you want to use it, and you just point at words or phrases in Chinese that you don’t understand and it tells you about them. Awesome idea.


10. FluentU


This is another excellent service that I’ve signed up to. It was about $19 a month when I signed up, and it produces a bunch of content each month. I particularly like the way that different levels are supported, and the way that the content is often quite current and is presented in a natural way, with characters, pinyin, and English all shown.


GENERAL ADVICE


1. PinYin


I made the mistake of spending too long with pinyin visible when I was first learning. It can become a crutch for you, and eventually you need to learn to read Chinese characters. In Rosetta Stone, you can choose what’s displayed. For the last year or so, I changed to only having Chinese characters displayed. It was VERY painful at first but I’m so glad I persevered with it. I now only turn on the pinyin display (in addition to the characters) when I don’t understand something. It’s important to force yourself to read.


Learning to pronounce pinyin correctly is also a challenge. There is no simple story here. You just need to learn the basic rules on how words sound, and after a while you will see “xin” but “shin” will come out of your mouth.


Focus a lot on listening. Some sounds take a while to get used to. Good examples are the difference between the “u” in “qu” and “zhu” or the sounds that we really don’t have in English like “nv” (The v is really a u with an umlaut). Other challenges are getting used to “c” sounding more like “ts” in words like “cai”, etc. So, listen a lot.


2. Movies and Video


Watching movies is obviously good, but it’s a real challenge. At first, I struggled to find many Chinese movies that weren’t endless martial arts or medieval warrior epics.


My Spanish friends told me they learned English by watching movies in English but with English subtitles. This is important because you don’t just want to know what something means; when learning, you also want to know what was actually said. The problem with Chinese is that you have the added difficulty of not being able to read the characters, and certainly not at the speed at which they are displayed. (And I might be getting old, but they are really small too.)


I’m told that Karaoke is good for learning. I haven’t tried it. Many people swear by learning languages by singing songs in the target language.


There is a world of good content on YouTube. I really like many of the game shows. You might as well find something entertaining while you’re learning.


3. Growing up with Chinese


CCTV has a series called “Growing up with Chinese” that is targeted at teenagers wanting to learn. There are 100 episodes that are available for download. I watch those while trying to get some exercise on a treadmill. They start out very basic but I think they move a little too fast into more complex topics. However, the host Charlotte is adorable, and the more of these things you learn from the better.


BIGGEST HINT


The biggest hint though is just to get out there and try to talk to native speakers. They will correct you much more quickly than you’ll ever work it out yourself. I see this every day. For example, we were at my in-laws’ for lunch. I said to my mother-in-law “我很高兴我不需要药”, and she looked puzzled, then said “我很高兴我不需要吃药”. Note that I’d left out a word that helped a lot with what I meant. Only natives can do this for you.


With all the formal types of courses too, you’ll quickly find that what they teach really is quite formal, and you’ll come to a sad realisation (at some point) that that’s just not what people say. That’s why it’s so important to get to common spoken language pretty quickly. For example, people often learn “ni hao” as the first phrase for “hello”. After a while, you’ll realise that it’s not common for people to say that. “nin hao”, which is more formal, is pretty common in some situations though. A better example is to ask yourself how often you’d say a phrase like “good morning” or “good day”. Most Australians would be more likely to say just “morning” rather than “good morning”. Similarly, while you’ll learn “zao shang hao”, you’ll often hear people just say “zao”. For “bus” the courses will teach “gong gong qi che” but, as an example, my brother-in-law says “gong che”. There is no answer to this apart from talking to natives.


One source of natives that I’ve found useful is local meetup groups. There are lots of Chinese people who want to learn English and are happy to chat with you back and forth in both languages, plus native speakers who are just happy to help because they get to meet new people.


Anyway, that’s it for now. If I think of more, I’ll add them later. Good luck!

Database Design training – for free?

When I started this blog, my plan was to focus on two main subject areas. One of them is SQL Server, T-SQL, and performance. The other is database design. Looking back over my post history, I honestly cannot say I delivered on the second area. Not because I have nothing to say about database design, but because I found it doesn’t lend itself to a blog. In the SQL Server / T-SQL area, there are subjects that can be isolated and described in a single post, or in a short series. In database design,…(read more)

What Drives Microsoft’s Data Platform Vision?

FEATURED POST BY: Quentin Clark, Corporate Vice President, The Data Platform Group, Microsoft Corporation


If you follow Microsoft’s data platform work, you have probably observed some changes over the last year or so in our product approach and in how we talk about our products.  After the delivery of Microsoft SQL Server 2012 and Office 2013, we ramped-up our energy and sharpened our focus on the opportunities of cloud computing.  These opportunities stem from technical innovation, the nature of cloud computing, and from an understanding of our customers.

In my role at Microsoft, I lead the team that is responsible for the engineering direction of our data platform technologies.  These technologies help our customers derive important insights from their data and make critical business decisions.  I meet with customers regularly to talk about their businesses and about what’s possible with modern data-intensive applications.  Here and in later posts, I will share some key points from those discussions to provide you with insight into our data platform approach, roadmap, and key technology releases.

Microsoft has made significant investments in the opportunities of cloud computing.  In today’s IT landscape, it’s clear that the enterprise platform business is shifting to embrace the benefits of cloud computing—access to scale, increased agility, diversity of data, lowered TCO and more. This shift will be as significant as the move from the mainframe/mini era to the microprocessor era.  And, due to this shift, the shape and role of data in the enterprise will change as applications evolve to new environments.

Today’s economy is built on the data platform that emerged with the microprocessor era—effectively, transactional SQL databases, relational data warehousing and operational BI.  An entire cycle of business growth was led by the emergence of patterns around Systems of Record, everything from ERP applications to Point of Sale systems.  The shift to cloud computing is bringing with it a new set of application patterns, which I sometimes refer to as Systems of Observation (SoO).  There are several forms of these new application patterns: the Internet of Things (IoT), generally; solutions being built around application and customer analytics; and, consumer personalization scenarios.  And, we are just beginning this journey! 

These new application patterns stem from the power of cloud computing—nearly infinite scale, more powerful data analytics and machine learning, new techniques on more kinds of data, a whole host of new information that impacts modern business, and ubiquitous infrastructure that allows the flow of information like never before.  What is being done today by a small number of large-scale Internet companies to harness the power of available information will become possible to apply to any business problem. 

To provide a framework for how we think applications and the information they generate or manage will change—and how that might affect those of us who develop and use those applications—consider these characteristics:

Data types are diverse.  Applications will generate, consume and manipulate data in many forms: transactional records, structured streamed data, truly unstructured data, etc.  Examples include the rise of JSON, the embracing of Hadoop by enterprises, and the new kinds of information generated by a wide variety of newly connected devices (IoT).

Relevant data is not just from inside the enterprise.  Cross-enterprise data, data from other industries and institutions, and information from the Web are all starting to factor into how businesses and the economy function in a big way.  Consider the small business loan extension that accounts for package shipping information as a criterion; or companies that now embrace the use of social media signals.

Analytics usage is broadening.  Customer behavior, application telemetry, and business trends are just a few examples of the kinds of data that are being analyzed differently than before.  Deep analytics and automated techniques, like machine learning, are being used more often. And, modern architectures (cloud-scale, in-memory) are enabling new value in real-time, highly-interactive data analysis.

Data by-products are being turned into value.  Data that were once considered by-products of a core business are now valuable across (and outside of) the industries that generate them; for example, consider the expanding uses of search term data.  Perhaps uniquely, Microsoft has very promising data sets that could impact many different businesses.

With these characteristics in mind, our vision is to provide a great platform and solutions for our customers to realize the new value of information and to empower new experiences with data.  This platform needs to span across the cloud and the enterprise – where so much key information and business processes exist.  We want to deliver Big Data solutions to the masses through the power of SQL Server and related products, Windows Azure data services, and the BI capabilities of Microsoft Office. To do this, we are taking steps to ensure our data platform meets the demands of today’s modern business.

Modern Transaction Processing—The data services that modern applications need are broader now than traditional RDBMS.  Yes, this too needs to become a cloud asset, and our investments in Windows Azure SQL Database reflect that effort.  We recognize that other forms of data storage are essential, including Windows Azure Storage and Tables, and we need to think about new capabilities as we develop applications in cloud-first patterns.  These cloud platform services need to be low friction, easy to incorporate, and operate seamlessly at scale—and have built-in fundamental features like high availability and regulatory compliance.  We also need to incorporate technical shifts like large memory and high-speed low latency networking—in our on-premises and cloud products. 

Modern Data Warehousing—Hadoop brought flexibility to what is typically done with data warehousing: storing and performing operational and ad-hoc analysis across large datasets.  Traditional data warehousing products are scaling up, and the worlds of Hadoop and relational data models are coming together.  Importantly, enterprise data needs broad availability so that business can find and leverage information from everywhere and for every purpose—and this data will live both in the cloud and in the enterprise datacenter.  We are hearing about customers who now compose meaningful insights from data across Windows Azure SQL Database and Windows Azure Storage processed with Windows Azure HDInsight, our Hadoop-based big data solution. Customers are leveraging the same pattern of relational + Hadoop in our Parallel Data Warehouse appliance product in the enterprise. 

Modern Business Intelligence—Making sense of data signals to gain strategic insight for business will become commonplace.  Information will be more discoverable; not just raw datasets, but those facets of the data that can be most relevant—and the kinds of analytics, including machine learning, that can be applied—will be more readily available.  Power BI for Office 365, our new BI solution, enables balance between self-service BI and IT operations—which is a key accelerant for adoption. With Power BI for Office 365, data from Windows Azure, Office, and on-premises data sources comes together in modern, accessible BI experiences. 

Over the coming months, we are going to publish regular posts to encourage discussions about data and insights and the world of modernized data. We will talk more about the trends, the patterns, the technology, and our products, and we’ll explore together how the new world of data is taking shape. I hope you will engage in this conversation with us; tell us what you think; tell us whether you agree with the trends we think we see—and with the implications of those trends for the modern data platform.

If you’d like more information about our data platform technologies, visit www.microsoft.com/bigdata and follow @SQLServer on Twitter for the latest updates.