Whether you are an experienced DBA or a brand new one, the first week on a new job is stressful.
Yesterday, 28 November, Microsoft's facilities manager Lotta Bergius had the great pleasure of accepting the award for Sveriges Snyggaste Kontor (Sweden's Best-Looking Office) 2013 at an award ceremony at Oscarsteatern in Stockholm. Besides the honor and the attention, the prize consists of a sculpture by Ernst Billgren, which now stands in Microsoft's reception.
Earlier this year, Microsoft Sweden was named Sweden's best workplace by Great Place To Work. Together, these two awards confirm that we are thinking along the right lines, not only about the technology, but also about the other two elements, the place and the people, that are required for successful change projects.
"It is fantastic that we are being recognized for our long-term work on shaping the workplace for Det nya arbetslivet (The New World of Work). The interest from other organizations is enormous; we have already had thousands of visitors coming for inspiration since we opened the new office in April," says Heléne Lidström, project manager for Det nya arbetslivet.
The Sveriges Snyggaste Kontor competition is organized by Lokalnytt, and this is the third year in a row that it has recognized the office judged best against the criteria of style, working environment, and innovation.
During the autumn, companies across Sweden have been able to nominate their offices for the competition. Three finalists were selected by vote, and a qualified jury, led by architect Magnus Månsson, chose the winner.
The prize was won in tough competition: the property company Vasakronan in Stockholm came second, and AMF Fastigheter, also in Stockholm, came third.
Because statistics on memory-optimized tables are not updated automatically, we must update them manually in order to take advantage of fresh statistics for In-Memory OLTP. In this tip we cover how this can be done.
Software changes: new versions not only add new features, they also change the internal behavior of old features, and not all of the changes are documented! And how often, after a software upgrade, do you go through and test everything you knew was true, to make sure it is STILL true after the upgrade? I write books about the SQL Server software, and although I do verify all the code in my books against the new version, I admit that I don’t always check every single fact mentioned in the entire book to…(read more)
Today we released Security Advisory 2914486 regarding a local elevation of privilege (EoP) issue that affects customers using Microsoft Windows XP and Server 2003. Windows Vista and later are not affected by this local EoP issue. A member of the Microsoft Active Protections Program (MAPP) found this issue being used on systems compromised by a third-party remote code execution vulnerability. These limited, targeted attacks require users to open a malicious PDF file. The issues described by the advisory cannot be used to gain access to a remote system alone.
While we are actively working to develop a security update to address this issue, we encourage customers running Windows XP and Server 2003 to deploy the following workarounds as described in the advisory:
Reroute the NDProxy service to Null.sys
For environments with non-default, limited user privileges, Microsoft has verified that the following workaround effectively blocks the attacks that have been observed in the wild.
We also always encourage people to follow the “Protect Your Computer” guidance of enabling a firewall, applying all software updates and installing anti-virus and anti-spyware software. We also encourage folks to exercise caution when visiting websites and avoid clicking suspicious links, or opening email messages from unfamiliar senders. Additional information can be found at www.microsoft.com/protect.
We hope this doesn’t disrupt any holiday plans you may have, but we did want to provide you with actionable information to help protect your systems. We continue to monitor the threat landscape closely and will take appropriate action to help protect customers.
Group Manager, Response Communications
Because of multi-version concurrency control (MVCC), there are many differences in query behavior at different isolation levels. Before running examples, let us set up test data. Setting up test data The following script sets up the data we shall be playing with: DROP TABLE test ; CREATE TABLE test ( ID INT NOT NULL, CONSTRAINT PK_test PRIMARY KEY ( ID ), col2 INT NOT NULL, col3 FLOAT ); INSERT INTO test ( ID , col2 , col3 ) VALUES ( 1 , 0 , 0 ),( 8 , 1 , 1 ); There are no dirty reads. In one…(read more)
The problem of credit card fraud detection is not trivial. With every transaction processed, only a limited amount of data is available, making it difficult if not impossible to distinguish between a “good” transaction and a (potentially) fraudulent one. In addition, there are literally millions of points of sales and web sites where a single credit card can be used. Even additional properties that could be available in the card owner’s profile, such as demographical data, probably will not make things much clearer. Do we focus on the customer or on the credit card? From experience, it seems that the customers who use multiple credit cards typically use each card for a specific purpose. Although this means that we might start by profiling the card, it might also be worth checking the customer’s profile, as this might reveal different habits in different cultures.
It is also quite hard to request specific card properties and/or customer profile properties in advance. Different companies collect, maintain, and have access to different data sets. On the other hand, some common data, like geographical location, time of usage, type of a product purchased, type of a transaction (purchase, cash advance), and similar, can be available to us. A good overview of data, useful for credit card and online banking fraud detection, can be found in
Hand D.J., & Blunt G. (2001, October). Prospecting gems in credit card data. IMA Journal of Management Mathematics.
In addition to the source variables, many calculated variables can be extremely handy. If geographic data is not available, it can often be extracted from IP addresses, ZIP codes, and similar source data. Web addresses might also contain the country of origin or, alternatively, the business type. Universal product codes (UPCs) contain the country of origin and the producing company. Many variables can be calculated from the time of the transaction and other data, for example:
- A flag designating whether multiple transactions have been issued from different IPs and the same person in a particular time frame
- A flag designating whether transactions from multiple persons and the same IP have been issued in a particular time frame
- Whether there are multiple persons using the same credit card or user account
- Whether the total amount of a transaction is near the maximum amount allowed for a particular type of transaction, or whether it is nearer the minimum amount
- The date and time could be significant: whether the day is a holiday or a weekday, whether the transaction was issued on a weekend, or on a particular day of the month
- The frequency of transactions in a moving time frame
- The number of distinct transactions in a moving time frame (often, the same kind of a transaction could be repeated regularly)
- The size of the deviation from a moving average for the type of the transaction
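As an illustration of the first two flags above, here is a minimal Python sketch. It assumes a hypothetical record layout of (person, IP, timestamp) tuples; the field names, the 30-minute window, and the sample data are all made up for the example.

```python
from datetime import datetime, timedelta

# Hypothetical transaction records: (person_id, ip_address, timestamp)
transactions = [
    ("p1", "10.0.0.1", datetime(2013, 12, 1, 10, 0)),
    ("p1", "10.0.0.2", datetime(2013, 12, 1, 10, 5)),
    ("p2", "10.0.0.3", datetime(2013, 12, 1, 11, 0)),
]

def multi_ip_flag(txns, person, window=timedelta(minutes=30)):
    """1 if `person` issued transactions from more than one IP
    within any `window`-sized time frame, else 0."""
    times_ips = sorted((t, ip) for p, ip, t in txns if p == person)
    for t_start, _ in times_ips:
        ips = {ip for t, ip in times_ips if t_start <= t <= t_start + window}
        if len(ips) > 1:
            return 1
    return 0

print(multi_ip_flag(transactions, "p1"))  # 1: two IPs within 30 minutes
print(multi_ip_flag(transactions, "p2"))  # 0: only one IP observed
```

The second flag (multiple persons from the same IP) is the mirror image: group by IP and count distinct persons in the window.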
We can address the problem that fraudulent transactions make up only a tiny share of all transactions by oversampling, i.e. repeating (copying) known fraudulent transactions, or by undersampling, i.e. lowering the number of non-fraudulent transactions in the sample used for model training. From experience we have learned to prefer the undersampling approach. We select non-fraudulent transactions either with simple random sampling or, when we have clearly determined groups of transactions, with stratified sampling. For example, it might be obvious that there are significant differences in the patterns across different countries; by using countries or regions we can define different strata and then randomly select appropriate numbers of cases from each stratum separately. You can read more about sampling at
Wikipedia. (n.d.). Sample size determination. Retrieved from Wikipedia: http://en.wikipedia.org/wiki/Sample_size_determination.
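The stratified undersampling described above can be sketched as follows. This is an illustrative Python implementation under assumed conventions: transactions are dictionaries, the stratum is derived by a caller-supplied key function (country in the usage example), and the rule of keeping `ratio` non-frauds per observed fraud in each stratum is a made-up heuristic, not a prescription.

```python
import random

def stratified_undersample(non_frauds, frauds, strata_key, ratio=1.0, seed=42):
    """Keep all frauds; per stratum, randomly keep about `ratio`
    non-fraudulent transactions for every fraudulent one."""
    random.seed(seed)
    by_stratum = {}
    for t in non_frauds:
        by_stratum.setdefault(strata_key(t), []).append(t)
    fraud_counts = {}
    for f in frauds:
        k = strata_key(f)
        fraud_counts[k] = fraud_counts.get(k, 0) + 1
    sample = list(frauds)
    for k, txns in by_stratum.items():
        n = int(ratio * fraud_counts.get(k, 0))
        sample.extend(random.sample(txns, min(n, len(txns))))
    return sample

# Usage: strata by country; 5 SE frauds keep 10 SE non-frauds (ratio=2.0),
# and the NO stratum, with no observed frauds, contributes none.
non_frauds = [{"country": "SE"} for _ in range(100)] + \
             [{"country": "NO"} for _ in range(100)]
frauds = [{"country": "SE"} for _ in range(5)]
sample = stratified_undersample(non_frauds, frauds,
                                lambda t: t["country"], ratio=2.0)
print(len(sample))  # 15
```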
Different data mining algorithms are more or less prone to giving erroneous predictions when the target state is presented with a low enough frequency. With SSAS algorithms, we usually observe the following:
- The Microsoft Neural Networks algorithm works best when you have about 50% of frauds in the total sample data set
- The Microsoft Naïve Bayes algorithm already works well with 10% frauds
- The Microsoft Decision Trees algorithm works well even with only 1% of frauds
During the data preparation, we also have to take care of missing values and outliers. Missing values can have a seriously negative effect on a data mining project. However, if the number of missing values is small enough, they can be handled by using a variety of methods:
- Do nothing (a simple, but rarely a valid approach)
- Filter out the rows containing the missing data (note that we could also filter out too many rows and lose a pattern)
- Ignore the column (note that we could also ignore too many columns)
- Predict the missing values with data mining algorithms, like Decision Trees (note that we could lose variability)
- Build separate models, for example, one model for all the data (including missing values if the algorithm can handle them), one model for known data, and one model for rows with missing values (this does represent quite a lot of additional work)
- Modify the operational systems so that the missing values can be collected later (this represents the best alternative, but unfortunately it is also the most difficult to achieve)
- Replace the missing data with a mean (this is a very popular technique, although we could lose variability again)
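Two of the options above, mean replacement and keeping the imputation visible via a flag, can be combined in a short sketch. The Python below is illustrative; `None` stands in for a missing value, and the column and flag names are made up.

```python
def impute_mean(rows, col):
    """Replace missing (None) values in `col` with the column mean,
    and add an indicator flag so the imputation itself stays visible
    (the flag also supports the missing-value pattern check below)."""
    known = [r[col] for r in rows if r[col] is not None]
    mean = sum(known) / len(known)
    for r in rows:
        r[col + "_missing"] = 1 if r[col] is None else 0
        if r[col] is None:
            r[col] = mean
    return rows

rows = [{"amount": 10.0}, {"amount": None}, {"amount": 20.0}]
impute_mean(rows, "amount")
print(rows[1])  # {'amount': 15.0, 'amount_missing': 1}
```

Note the loss of variability: every imputed row receives exactly the mean, which narrows the distribution.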
Whenever we make any changes to the data, we are influencing the analysis. Before making any changes, we should determine whether there are any patterns in the missing data. We use data mining for this analysis. For instance, we add a flag with a value of 1 for a row that includes missing values or the value of 0 for rows where all of the variable values for the case are known and present. Then we use a predictive algorithm like Decision Trees to explain this new flag variable with other input variables. The resulting tree should be very shallow, without any strong patterns; otherwise, we have identified a pattern in the missing values. If a pattern has been found, it should be explained, and then we should use an appropriate missing value handling technique that does not alter the data (e.g., build separate models). Otherwise, we prefer to filter the rows with missing values.
Outliers are rare and far out-of-bound values. They are so far out of bound that they can influence the results of the analyses. Similarly to handling missing values, we determine whether there is any pattern in the outliers before handling them. We can do one of the following to address outliers:
- Check if the outlier is an erroneous value, and if it is, correct it (the best possibility)
- Do nothing (a simple, but rarely a valid approach)
- Filter out the rows with the outliers (note that we could end up filtering out too many rows and lose a pattern)
- Ignore the column (note that we could ignore too many columns)
- Replace outliers with common (mean) values (note that, of course, we are losing variability)
- Bin values into equal height ranges (this is a good approach, especially for algorithms that use discrete input variables)
- Normalize the data values in predefined limited ranges
Some well-known methods of normalization include:
- Range normalization
- Z-score normalization
- The logistic (sigmoid) function normalization
- The hyperbolic tangent function normalization
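The four normalization methods listed above are standard formulas; a minimal Python sketch of each (the parameters `lo`, `hi`, `mean`, and `std` are assumed to come from the data being normalized):

```python
import math

def range_norm(x, lo, hi):
    """Range normalization: maps [lo, hi] linearly onto [0, 1]."""
    return (x - lo) / (hi - lo)

def z_score(x, mean, std):
    """Z-score normalization: centers at 0 with unit standard deviation."""
    return (x - mean) / std

def logistic_norm(x):
    """Logistic (sigmoid) normalization: squashes any value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_norm(x):
    """Hyperbolic tangent normalization: squashes any value into (-1, 1)."""
    return math.tanh(x)

print(range_norm(5, 0, 10))   # 0.5
print(z_score(10, 10, 2))     # 0.0
print(logistic_norm(0))       # 0.5
print(tanh_norm(0))           # 0.0
```

The sigmoid and tanh variants handle extreme outliers gracefully, since arbitrarily large values are still mapped inside the bounded range.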
Discretization (or binning, or categorization, or recoding) is also useful for other purposes, not only for dealing with outliers. For example, some algorithms, for instance the Microsoft Naïve Bayes algorithm, can accept only discrete input variables. Discretization is performed on a single column. Examples of discretization include:
- Transaction amount
Note that with proper discretization we can compare otherwise incomparable data. For example, if income per capita in one country is significantly higher than in another, it is difficult to compare transaction amounts directly. However, if we discretize income into three groups, such as low, average, and high, and appropriately tailor the group boundaries per country, we get comparable data. Please refer to
Pyle D. (1999). Data Preparation for Data Mining. Morgan Kaufmann
for further reading about data preparation for data mining.
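The equal-height (equal-frequency) binning used for discretization, for example of a transaction amount, can be sketched in a few lines of Python. This is an illustrative implementation, not SSAS's internal one; the cut points are taken at the relevant quantiles of the sorted data.

```python
def equal_height_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 so that each bin
    holds roughly the same number of cases."""
    ordered = sorted(values)
    n = len(ordered)
    # Cut points at the quantile positions between bins
    cuts = [ordered[(i * n) // n_bins] for i in range(1, n_bins)]
    def bin_of(v):
        b = 0
        for c in cuts:
            if v >= c:
                b += 1
        return b
    return [bin_of(v) for v in values]

amounts = [5, 12, 7, 300, 45, 80, 9, 1500, 60]
print(equal_height_bins(amounts, 3))  # [0, 1, 0, 2, 1, 2, 0, 2, 1]
```

Each of the three bins receives exactly three of the nine values, regardless of how skewed the amounts are; this is why equal-height binning retains so much information about the variable.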
As already mentioned, data overview activities interleave with data preparation. In order to find outliers, we must get an idea of the distribution of a variable. We can use Microsoft Office Excel PivotTables and PivotCharts for this task. However, it is often faster to use statistical computations and interpret the results. With Transact-SQL queries, we can calculate a lot of useful statistical information.
For a quick overview of discrete variables, we can use frequency tables. In a frequency table, we can show values, the absolute frequency of those values, absolute percentages, cumulative frequency, cumulative percent, and a histogram of the absolute percentage. OLAP cubes can be used to establish an overview of the frequency distribution for tens, if not hundreds of variables, very quickly.
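A minimal sketch of such a frequency table, here in Python over an in-memory list rather than an OLAP cube or T-SQL query (the country codes are made-up sample data):

```python
from collections import Counter

def frequency_table(values):
    """Rows of (value, absolute frequency, percent,
    cumulative frequency, cumulative percent), most frequent first."""
    n = len(values)
    rows, cum = [], 0
    for value, freq in Counter(values).most_common():
        cum += freq
        rows.append((value, freq, 100.0 * freq / n, cum, 100.0 * cum / n))
    return rows

for row in frequency_table(["SE", "SE", "NO", "SE", "DK"]):
    print("%-4s freq=%d pct=%.1f cum=%d cum_pct=%.1f" % row)
```

The histogram column from the description above can be added by printing, say, one `*` per percentage point next to each row.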
For continuous variables, we can use descriptive statistics and calculate the first four population moments: Mean, Standard Deviation, Skewness, and Kurtosis. This gives us a quick impression of the distribution of values of those variables.
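The four population moments can be computed directly from their definitions; a Python sketch (the same logic is straightforward to express as a Transact-SQL aggregate query):

```python
import math

def moments(xs):
    """Population mean, standard deviation, skewness,
    and (excess) kurtosis of a continuous variable."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    std = math.sqrt(var)
    skew = sum((x - mean) ** 3 for x in xs) / (n * std ** 3)
    kurt = sum((x - mean) ** 4 for x in xs) / (n * std ** 4) - 3
    return mean, std, skew, kurt

mean, std, skew, kurt = moments([1, 2, 3, 4, 5])
print(mean, skew)  # symmetric data: mean 3.0, skewness 0.0
```

Strong positive skewness or high kurtosis in a transaction-amount variable is a quick hint that outliers are present and the distribution has a long tail.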
It is also worth checking linear dependencies between pairs of variables. Some algorithms, like the Microsoft Decision Trees algorithm, tend to exclude one variable of a dependent pair from the analysis, while other algorithms, like the Microsoft Clustering algorithm, might find artificially good clusters if they use pairs of dependent variables. There are multiple methods for calculating these dependencies:
- Chi-Squared test for pairs of discrete variables
- Correlation Coefficient for pairs of continuous variables
- Analysis of variance (ANOVA) for pairs of one continuous and one discrete variable
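As a sketch of the second method, the correlation coefficient for a pair of continuous variables can be computed from first principles (the chi-squared test and ANOVA follow the same pattern with their own formulas):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two continuous variables:
    covariance divided by the product of the standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3], [2, 4, 6]))  # 1.0: perfect positive dependency
print(pearson([1, 2, 3], [3, 2, 1]))  # -1.0: perfect negative dependency
```

A coefficient near +1 or -1 flags a pair of variables from which one candidate for exclusion should be considered before modeling.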
Based on our experience in the field of fraud detection, we developed a much faster method to test all possible linear dependencies between multiple variables, not just pairs. We use the Microsoft Naïve Bayes data mining algorithm, where we declare all variables as input and predictable at the same time; this way, we can determine all important dependencies with a single analysis. Of course, the Naïve Bayes algorithm expects discrete inputs only; however, SSAS can discretize variables on the fly, using different discretization methods. We usually use the Equal Heights method, even though it is typically not useful for the final analysis, because it changes the shape of the distribution; however, it is extremely useful for the overview of linear dependencies, because it retains the maximum amount of information in the variables.
Variables with an insignificant amount of information are useless in analysis. We measure the amount of information in a variable by calculating its entropy. As this calculation is quite slow, we do it only for the variables that we suspect contain a low amount of information.
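The entropy calculation for a discrete variable is short; a Python sketch:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a discrete variable. A value near
    zero means the variable carries almost no information."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(values).values())

print(entropy(["a"] * 10))   # 0.0: a constant column is useless
print(entropy(["a", "b"]))   # 1.0: a fair 50/50 split, one full bit
```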
For more details about the statistics mentioned, please refer to
Wonnacott T.H., & Wonnacott R.J. (1990). Introductory Statistics, 5th Edition. Wiley.
For more information on information theory, please refer to
Kullback S. (1997). Information Theory and Statistics (Dover Books on Mathematics). Dover Publications.
In the webinar I presented entitled Taking on your First SQL Server Business Intelligence Project, a couple of questions from the audience asked about defining and reusing dimensions, particularly the date dimension. These reusable dimensions are also known as role-playing dimensions because the single dimension can play multiple roles in the Analysis Services cube.
As the proliferation of devices continues to capture the imagination of consumers and has ignited what is referred to as the bring your own device (BYOD) revolution, many IT departments across the globe are now facing increased security considerations. While organizations encourage BYOD for cost savings and productivity, it is also important to have robust security policies supporting BYOD.
Last week, several media reports surfaced of an attack on the European Parliament in which some members allegedly had their email unlawfully accessed. Initial media speculations inaccurately implied that the attack used a vulnerability in Microsoft’s Exchange ActiveSync. While details and specifics of this attack unfold, based on our initial assessment, we have determined this is not a vulnerability in the ActiveSync protocol; the issue is how third party devices handle authentication of certificates.
This type of attack has been previously discussed at the Black Hat 2012 Conference. Enhancements to newer versions of Windows Phone block this type of attack automatically. In fact, Microsoft’s implementation of Exchange ActiveSync on Windows Phone regularly protects customers from this type of attack, as it does not allow a malicious certificate to be trusted by the device.
Third party software developers license, and can modify, Exchange ActiveSync from Microsoft to ensure that customers can receive their email on any device. Third party developers are responsible for ensuring that their implementation of the Exchange ActiveSync protocol is secure. That said, there are also ways in which customers can help protect themselves from similar types of attacks:
- Become familiar with “Understanding security for Exchange ActiveSync”
- Configure Exchange ActiveSync to use a trusted certificate
- Set restrictions based on device model and device type to only allow well-implemented clients
- Clearly define policy to ensure devices support the security functionality required and only use devices that do not accept automatic or prompted certificate renewal
We strongly encourage all customers evaluating a BYOD business strategy to ensure they fully understand the various security features and capabilities of the devices that are brought into their organization.
Trustworthy Computing Security
PostgreSql has simple and consistent error handling, which can be roughly explained in Sql Server terms as follows: XACT_ABORT is always ON. In other words, error handling in PostgreSql has substantially fewer features, but it does have all the features we actually use in the project being migrated. The simplicity of error handling in PostgreSql is very good news: we need to learn much less before we are productive, and there are fewer chances to make a mistake. Let us consider some examples. On…(read more)