I am very pleased to announce that I will be speaking at the SQL Intersection conference in Las Vegas again this year. This time around, I am giving a full-day workshop, "Mastering SQL Server Execution Plan Analysis," as well as a two-part session, "Parallel Query Execution," during the main conference. The workshop is a pre-conference event, held on Sunday 9 November (straight after this year's PASS Summit), so you have the whole Monday off to recover before the main conference begins.
PASS VP of Marketing Denise McInerney – a SQL Server MVP and Data Engineer at Intuit – began her career as a SQL Server DBA in 1998 and attended her first PASS Summit in 2002. The SQL Server Team caught up with her ahead of this year’s event, returning to Seattle, WA, Nov. 4-7, to see what she’s looking forward to at the world’s largest conference for SQL Server and BI professionals.
For those who’ve never attended or who’ve been away for a while, what is PASS Summit?
PASS Summit is the world’s largest gathering of Microsoft SQL Server and BI professionals. Organized by and for the community, PASS Summit delivers the most technical sessions, the largest number of attendees, the best networking, and the highest-rated sessions and speakers of any SQL Server event.
We like to think of PASS Summit as the annual reunion for the #sqlfamily. With over 200 technical sessions and 70+ hours of networking opportunities with MVPs, experts and peers, it’s 3 focused days of SQL Server. You can take hands-on workshops, attend Chalk Talks with the experts, and get the answers you need right away at the SQL Server Clinic, staffed by the Microsoft CSS and SQLCAT experts who build and support the features you use every day. Plus, you can join us early for 2 days of pre-conference sessions with top industry experts and explore the whole range of SQL Server solutions and services under one roof in the PASS Summit Exhibit Hall.
Nowhere else will you find over 5,000 passionate SQL Server and BI professionals from 50+ countries and 2,000 different companies connecting, sharing, and learning how to take their SQL Server skills to the next level.
What’s on tap this year as far as sessions?
We’ve announced a record 160+ incredible community sessions across 5 topic tracks: Application and Database Development; BI Information Delivery; BI Platform Architecture, Development and Administration; Enterprise Database Administration and Deployment; and Professional Development. And watch for over 60 sessions from Microsoft’s top experts to be added to the lineup in early September.
You can search by speaker, track, session skill level, or session type – from 10-minute Lightning Talks, to 75-minute General Sessions, to 3-hour Half-Day Sessions and our full-day pre-conference workshops.
And with this year’s new Learning Paths, we’ve made it even easier to find the sessions you’re most interested in. Just use our 9 Learning Path filters to slice and dice the lineup by everything from Beginner sessions to Big Data, Cloud, Hardware Virtualization, and Power BI sessions to SQL Server 2014, High Availability/Disaster Recovery, Performance, and Security sessions.
Networking is at the heart of PASS Summit – what opportunities do you have for attendees to connect with each other?
PASS Summit is all about meeting and talking with people, sharing issues and solutions, and gaining knowledge that will make you a better SQL Server professional. Breakfasts, lunches, and evening receptions are all included and are designed to offer dedicated networking opportunities. And don't underestimate the value of hallway chats and the ability to talk to speakers after their sessions, during lunches and breaks, and at the networking events.
We have special networking activities for first-time attendees, for people interested in the same technical topics at our Birds of a Feather luncheon, and at our popular annual Women in Technology luncheon, which connects 600+ attendees interested in advancing the role of women in STEM fields. Plus, our Community Zone is THE place to hang out with fellow attendees and community leaders and learn how to stay involved year-round.
You mentioned the networking events for first-time attendees. With everything going on at Summit, how can new attendees get the most out of their experience?
Our First-Timers Program takes the hard work out of conference prep and is designed specifically to help new attendees make the most of their time at Summit. We connect first-timers with conference alumni, take them inside the week with community webinars, help them sharpen their networking skills through fun onsite workshops, and share inside advice during our First Timers orientation meeting.
In addition, in our “Get to Know Your Community Sessions,” longtime PASS members share how to get involved with PASS and the worldwide #sqlfamily, including encouraging those new to PASS to connect with their local SQL Server communities through PASS Chapters and continue their learning through Virtual Chapters, SQLSaturdays, and other free channels.
How can you learn more about sessions and the overall PASS Summit experience?
A great way to get a taste of Summit is by watching PASS Summit 2013 sessions, interviews, and more on PASStv. You can also check out the best of last year’s Community blogs.
Plus, stay tuned for 24 Hours of PASS: Summit Preview Edition on September 9 to get a free sneak peek at some of the top sessions and speakers coming to PASS Summit this year. Make sure you follow us on Twitter at @PASS24HOP / #pass24hop for the latest updates on these 24 back-to-back webinars.
Where can you register for PASS Summit?
To register, just go to Register Now – and remember to take advantage of the $150 discount code from your local or Virtual PASS Chapter. We also have a great group discount for companies sending 5 or more employees. And don’t forget to purchase the session recordings for year-round learning on all aspects of SQL Server.
Once you get a taste for the learning and networking waiting for you at PASS Summit, we invite you to join the conversation by following us on Twitter (watch the #sqlpass and #summit14 hashtags) and joining our Facebook and LinkedIn groups. We’re looking forward to an amazing, record-breaking event, and can’t wait to see everyone there!
Please stay tuned for regular updates and highlights on Microsoft and PASS activities planned for this year’s conference.
It is often generalized that seeks are better than scans in terms of retrieving data from SQL Server. The index hint FORCESCAN was recently introduced so that you could coerce the optimizer to perform a scan instead of a seek. Which might lead you to wonder: Why would I ever want a scan instead of a seek?
This blog post was authored by: Sahaj Saini, PM on the Microsoft Analytics Platform System (APS) team.
In this blog post, we’ll provide a quick overview of Symmetric Multi-Processing (SMP) vs. Massively Parallel Processing (MPP) systems, how to identify triggers for migrating from SMP to MPP, key considerations when moving to Microsoft Analytics Platform System (APS), and a discussion about how to take advantage of the power of an MPP solution such as APS.
Let us begin with a scenario. Emma is the Database Administrator at Adventure Works Cycles, a bicycle manufacturing company. At Adventure Works, Emma and her team are using traditional SQL Server SMP as their data warehousing solution. The company has been growing rapidly, and with growing competition in the bicycle industry, the business analysts at Adventure Works Cycles would like quicker insight into their data. Emma is now facing the following challenges with the SMP deployment:
- High Data Volume and Data Growth: With increasing sales and a growing customer base, the data volume has grown rapidly to cross 10 TB.
- Longer Data Loading/ETL times: With the need to produce daily reports to management, Emma finds the current ETL speed inadequate to intake and process the increasing quantity of data flowing from other OLTP and non-relational systems.
- Slow Query Execution: Query execution times are slowing down due to the increase of data and it is becoming increasingly difficult to generate insights for daily reporting in a timely manner.
- Long Cube Processing Time: With the current cube processing time, it is difficult to meet the real-time reporting needs of the company.
In order to overcome these challenges, Emma and her team evaluate adding a larger, more powerful (and more expensive) set of server and storage hardware to their datacenter. This approach would solve their problem, but only in the short term, as data growth is expected to explode in the next 12 months. With the growth that Adventure Works is expecting, even the bigger and more powerful SMP solutions would hit a wall very quickly. Emma would like a solution that scales as their data needs grow.
What’s the difference between SMP and MPP?
Before we jump into solving Emma’s problems, let’s quickly define what SMP and MPP are. Symmetric Multi-Processing (SMP) is a tightly coupled multiprocessor system in which processors share resources – a single instance of the operating system (OS), memory, and I/O devices – and are connected using a common bus. SMP is the primary parallel architecture employed in servers and is depicted in the following image.
Massively Parallel Processing (MPP) is the coordinated processing of a single task by multiple processors, each processor using its own OS and memory and communicating with the others using some form of messaging interface. MPP can be set up with either a shared-nothing or a shared-disk architecture.
In a shared nothing architecture, there is no single point of contention across the system and nodes do not share memory or disk storage. Data is horizontally partitioned across nodes, such that each node has a subset of rows from each table in the database. Each node then processes only the rows on its own disks. Systems based on this architecture can achieve massive scale as there is no single bottleneck to slow down the system. This is what Emma is looking for.
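The scatter-gather pattern described above can be sketched in a few lines of plain Python. This is a conceptual model only, not APS code: rows are horizontally partitioned across nodes, each node aggregates only its local partition, and the control node combines the partial results.

```python
# Conceptual sketch of a shared-nothing MPP query (illustrative, not APS code):
# rows are horizontally partitioned, each "node" aggregates only its own
# partition, and a control step combines the partial results.

NODES = 4

def partition(rows, nodes=NODES):
    """Horizontally partition rows so each node owns a disjoint subset."""
    shards = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        shards[i % nodes].append(row)
    return shards

def node_local_sum(shard):
    """Each node processes only the rows on its own disks."""
    return sum(amount for _key, amount in shard)

def mpp_total(rows):
    shards = partition(rows)
    partials = [node_local_sum(s) for s in shards]  # parallel on a real MPP system
    return sum(partials)  # the control node aggregates the partials

sales = [(order_id, 10) for order_id in range(1_000)]
assert mpp_total(sales) == 10_000
```

Because no node ever needs another node's memory or disk, adding nodes shrinks each node's share of the work, which is exactly the scalability property Emma is after.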
MPP with shared-nothing architecture is depicted in the following image.
Microsoft Parallel Data Warehouse (PDW), running on a Microsoft Analytics Platform System appliance, is implemented as an MPP shared-nothing architecture. It consists of one control node and storage-attached compute nodes interconnected by Ethernet and InfiniBand. The control node hosts the PDW engine – the brains of the MPP system – which creates parallel query plans, coordinates query execution on the compute nodes, and aggregates data across the entire appliance. All nodes, including control and compute, host a Data Movement Service (DMS) to transfer data between nodes.
For more details on PDW architecture, you can read the Architecture of the Microsoft Analytics Platform System post.
Transitioning to MPP
To realize the value offered by MPP, Emma and her team purchase a Microsoft APS appliance and begin transitioning to MPP. Let’s take a look at how they adapt their solution to take full advantage of APS’s shared nothing MPP architecture.
As previously mentioned, APS is based on a shared-nothing MPP architecture, which means that nodes are self-sufficient and do not share memory or disks. The architecture therefore requires you to distribute your large tables across nodes to get the benefits of massively parallel processing. APS allows a table to be defined as either distributed or replicated. The decision to choose one over the other depends on the volume of data and the need for access to all of the data on a single node.
A distributed table is one whose rows are spread across the nodes within the appliance to allow for massive scale. Each row ends up in exactly one distribution on one compute node, as depicted in the image below.
To take advantage of the distributed nature of APS, Emma modifies the large tables, typically Fact and large dimension tables, to be distributed in APS as follows:
CREATE TABLE [dbo].[FactInternetSales]
(
    [ProductKey] [int] NOT NULL,
    [OrderDateKey] [int] NOT NULL,
    . .
    [ShipDate] [datetime] NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX
);
As you can see, this is a typical DDL statement for table creation, with a minor addition for distributed tables. Tables are distributed by a deterministic hash function applied to the distribution column chosen for that table. Emma chooses ProductKey as the distribution column in the FactInternetSales table because of its high cardinality and absence of skew, which distributes the table evenly across nodes.
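Why cardinality and skew matter can be shown with a small Python sketch. The hash function below is purely illustrative (APS's actual distribution function is internal to the engine): a high-cardinality, unskewed column spreads rows evenly across distributions, while a skewed, low-cardinality column piles most rows onto a few nodes.

```python
# Illustrative only: APS's real distribution hash is internal to the engine.
from collections import Counter

NODES = 8

def distribution_for(key, nodes=NODES):
    """A deterministic hash of the distribution column picks the node."""
    return hash(key) % nodes

def distribute(keys, nodes=NODES):
    """Count how many rows land on each node for a given column's values."""
    counts = Counter(distribution_for(k, nodes) for k in keys)
    return [counts.get(n, 0) for n in range(nodes)]

# High-cardinality column (e.g. a surrogate key): rows spread evenly.
even = distribute(range(80_000))
# Low-cardinality, skewed column: most rows pile onto one node.
skewed = distribute([1] * 70_000 + [2] * 10_000)

assert max(even) - min(even) < max(skewed) - min(skewed)
```

With the skewed column, one node ends up doing almost all the work, which defeats the purpose of the parallel architecture; picking an evenly distributed key keeps all compute nodes equally busy.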
If all tables were distributed, however, joins would require a great deal of data movement between nodes. Therefore, for smaller dimension tables such as language or country, it makes sense to replicate the entire table on each compute node: the benefit of enabling local join operations with these tables outweighs the cost of the extra storage consumed. A replicated table is one that is copied in full to every compute node, as depicted below.
Emma designs the small tables, typically dimension tables, to be replicated as follows:
CREATE TABLE [dbo].[DimDate]
(
    [DateKey] [int] NOT NULL,
    . .
    [SpanishDayNameOfWeek] [nvarchar](10) NOT NULL
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX
);
By appropriately designing distributed and replicated tables, Emma aligns her solution with common MPP design best practices and enables efficient processing of high volumes of data. For example, a query against 100 billion rows in a SQL Server SMP environment would require processing all of the data in a single execution space. With MPP, the work is spread across many nodes, breaking the problem into smaller, more manageable tasks. In a four-node appliance (see the picture above), each node processes only roughly 25 billion rows – a much quicker task. As a result, Emma observes significant improvements in query execution time, and her business can now make better decisions, faster. Additionally, Emma can grow the data warehouse from a few terabytes to over 6 petabytes of data by adding “scale units” to APS.
With SQL Server SMP, Emma and her team were using ETL processes, via a set of SSIS packages, to load data into the data warehouse: (1) Extracting data from the OLTP and other systems; (2) Transforming the data into dimensional format; and (3) Loading the data into target dimension or fact tables in the data warehouse. With increasing volumes of data, the SSIS server in the middle becomes a bottleneck while performing transformations, resulting in slow data loading.
With APS, Emma and her team can use ELT instead: Extract the data from the OLTP and other systems and Load it to a staging location on APS. The data can then be Transformed into dimensional format, not with SSIS but with the APS engine, utilizing the distributed nature of the appliance and the power of parallel processing. In a four-node appliance, four servers perform the transformations on subsets of the data, versus the single SSIS server.
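The per-node transform step can be sketched in Python. This is a toy model of the idea, not APS or SSIS code: the staging rows are split into per-node subsets, and each "node" (here a worker thread) reshapes only its own subset in parallel, rather than funnelling every row through one transformation server.

```python
# Conceptual ELT sketch (illustrative, not APS code): after loading raw rows
# to staging, each compute node transforms only its own subset in parallel.
from concurrent.futures import ThreadPoolExecutor

NODES = 4

def to_dimensional(row):
    """Toy 'transform': reshape a raw staging row into dimensional format."""
    raw_id, raw_amount = row
    return {"CustomerKey": raw_id, "SalesAmount": round(raw_amount, 2)}

def transform_shard(shard):
    return [to_dimensional(r) for r in shard]

def parallel_transform(staging_rows, nodes=NODES):
    shards = [staging_rows[n::nodes] for n in range(nodes)]  # one subset per node
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        results = list(pool.map(transform_shard, shards))    # transforms run in parallel
    return [row for shard in results for row in shard]

staging = [(i, i * 1.5) for i in range(1_000)]
assert len(parallel_transform(staging)) == len(staging)
```

On the appliance the parallelism comes from physically separate compute nodes rather than threads, but the shape of the work is the same: no single transformation server sits in the middle.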
This parallel processing results in a significant boost in data loading performance. Emma can then use the Create Table As Select (CTAS) statement to create the table from the staging table as follows.
CREATE TABLE [dbo].[DimCustomer]
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH (CustomerKey)
)
AS SELECT * FROM [staging].[DimCustomer];
By switching to an ELT process, Emma utilizes the parallel processing power of APS to see performance gains in data loading.
In conclusion, Emma and her team have found answers to their SMP woes with MPP. They can now feel confident handling the data volume and growth at Adventure Works with the ability to scale the data warehouse as needed. With ELT and the power of parallel processing in APS, they can load data into APS faster and within the expected time-window. And by aligning with APS’s MPP design, they can achieve breakthrough query performance, allowing for real-time reporting and insight into their data.
Visit the Analytics Platform System page to access more resources, including the datasheet, video, and solution brief, and to learn more about migrating from SQL Server to the Analytics Platform System.
And we are alive!
I am proud to announce the PASS SQL Saturday #356 Slovenia event. The date is December 13th. In December, Ljubljana is even more beautiful than usual, and it is definitely a party city. So, together with great sessions, there are plenty of reasons to come. Welcome!
Unprotected sensitive data in your SQL Server databases is a significant risk. Sensitive data in the wrong hands can harm your customers, organization, and career. Knowing all of your data needs is challenging for current databases and future development, but it is imperative to protect sensitive data. Join us for this session to learn how to uncover the sensitive data lurking in your SQL Server databases.
You want to reduce the load time of your SSIS packages, but you have found that the source of the problem is that you are doing full comparisons between your source system and your destination. Perhaps you have also found it difficult to detect when data changes have occurred in your source system.
Alberto Ferrari and I started delivering training on Power Pivot in 2010, initially in classrooms and then also online. We also recorded videos for Project Botticelli, where you can find content about Microsoft tools and services for Business Intelligence. In recent months, we produced a recorded video course for people who want to learn Power Pivot without attending a scheduled course.
We split the entire Power Pivot training course into three editions, offering the more introductory modules at a lower price:
- Beginner: introduces Power Pivot to any user who knows Excel and wants to create reports with larger and more complex data structures than a single table.
- Intermediate: improves skills on Power Pivot for Excel, introducing the DAX language and important features such as CALCULATE and Time Intelligence functions.
- Advanced: includes in-depth coverage of the DAX language, which is required for writing complex calculations, and of other advanced features of both Excel and Power Pivot.
There are also two bundles that include two or three editions at a lower price.
Most important, we have a special 40% launch discount on all published video courses using the coupon SQLBI-FRNDS-14, valid until August 31, 2014. Just follow the link to see a more complete description of the available editions and their discounted prices. Regular prices start at $29, which means that you can start training for less than $18 with the special promotion.
P.S.: we recently launched a new responsive version of the SQLBI web site, and now we also have a page dedicated to all videos available about our sessions in conferences around the world. You can find more than 30 hours of free videos here: http://www.sqlbi.com/tv.
One issue with importing data into SQL Server is that the import file has rows that have too many or too few columns and therefore the imported data doesn’t insert correctly or doesn’t import at all. In this tip we look at how this can be handled using PowerShell.
Unless you have been living under a rock, you’ve heard about the concept of contained users within a SQL Server database (hit the link if you have not). In this article I’d like to show you that you can, indeed, script out contained database users and recreate them in another database, either as contained users or as good old-fashioned logins/server principals.
Why would this be useful? Well, because you would not need to know the password for the user in order to recreate it on another instance. I know there is a limited number of scenarios where this would be necessary, but nonetheless I figured I’d put together this blog post to show how it can be done.
A more obscure use case: with the password hash (which I’m about to show you how to obtain) you could also crack the password using a utility like hashcat, as highlighted on this SQLServerCentral article.
SQL Server uses System Base Tables to save the password hashes of logins and contained database users. For logins it uses sys.sysxlgns, whereas for contained database users it leverages sys.sysowners.
I’ll show you what I do to figure this stuff out: I create a login/contained user, and then I immediately browse the transaction log with, for example, fn_dblog.
It’s pretty obvious that the only two base tables touched by the operation are sys.sysxlgns and sys.sysprivs – the latter is used to track permissions. If I connect to the DAC on my instance, I can query for the password hash of the login I’ve just created.
A few interesting things about this hash: it was taken on my laptop, which happens to be running SQL Server 2014 RTM CU2, the latest public build of SQL Server 2014 as of the time of writing. In SQL Server 2008 R2 and prior versions (back to 2000), password hashes started with 0x0100. This changed because, starting with SQL Server 2012, password hashes are computed with the SHA-512 algorithm, as opposed to SHA-1 (used since 2000) or Snefru (used in 6.5 and 7.0). SHA-1 is nowadays deemed unsafe and is very easy to crack.
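The versioned layout of these hashes can be mimicked in a short Python sketch. Be aware that the details here – the 0x0200 version prefix for 2012+, a 4-byte random salt, and the password being encoded as UTF-16-LE before hashing – are assumptions drawn from widely published analyses of SQL Server login hashes, not from official documentation.

```python
# Sketch of the commonly documented SQL Server 2012+ login hash layout:
#   0x0200 (version) + 4-byte random salt + SHA-512(UTF-16-LE(password) + salt)
# The exact layout is an assumption from published analyses, not official docs.
import hashlib
import os

def mock_sql2012_password_hash(password, salt=None):
    """Build a hash blob shaped like a modern SQL Server login hash."""
    salt = salt if salt is not None else os.urandom(4)
    digest = hashlib.sha512(password.encode("utf-16-le") + salt).digest()
    return b"\x02\x00" + salt + digest

h = mock_sql2012_password_hash("P@ssw0rd!")
assert h[:2] == b"\x02\x00"  # version prefix (pre-2012 hashes began 0x0100)
assert len(h) == 2 + 4 + 64  # version + salt + SHA-512 digest
```

The version prefix is what lets you tell at a glance which algorithm produced a given hash, and the per-login salt is why two logins with the same password still get different hashes.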
For regular SQL logins, this information is exposed through the sys.sql_logins catalog view, so there is really no need to connect to the DAC to grab an SID/password hash pair. For contained database users, there is (currently) no method of obtaining SID or password hashes without connecting to the DAC.
If we create a contained database user, this is what we get from the transaction log:
Note that the System Base Table used in this case is sys.sysowners. sys.sysprivs is used as well, and again this is to track permissions.
To query sys.sysowners, you would have to connect to the DAC, as I mentioned previously. And this is what you would get:
There are other ways to figure out what SQL Server uses under the hood to store contained database user password hashes, like looking at the execution plan for a query against sys.dm_db_uncontained_entities (thanks, Robert Davis!).
SIDs, Logins, Contained Users, and Why You Care…Or Not.
One of the reasons behind the existence of contained users was database portability: it is really painful to keep server principals (logins) synced across most shared-nothing SQL Server HA/DR technologies (mirroring, Availability Groups, and log shipping). Oftentimes you need the security identifier (SID) of these logins to match across instances, which means fetching whatever SID was assigned to the login on the principal instance so you can recreate it on a secondary. With contained users you normally wouldn’t care about SIDs, as the users are always available (and synced, as long as synchronization takes place) across instances.
Now you might be presented with a particular requirement that SIDs be synced between logins on certain instances and contained database users in other databases. How would you go about creating a contained database user with a specific SID?
The answer is that you can’t do it directly, but there’s a little trick that would allow you to do it.
Create a login with the specified SID and password hash, create a user for that server principal in a partially contained database, migrate that user to contained using the system stored procedure sp_migrate_user_to_contained, and then drop the login.
CREATE LOGIN <login_name> WITH PASSWORD = <password_hash> HASHED, SID = <sid> ;
CREATE USER <user_name> FROM LOGIN <login_name>;
EXEC sp_migrate_user_to_contained @username = <user_name>, @rename = N'keep_name', @disablelogin = N'disable_login';
DROP LOGIN <login_name>;
Here’s how this skeleton looks in action:
And now I have a contained user with a specified SID and password hash. In my example above, I renamed the user after migrating it to contained so that it is, hopefully, easier to understand.