This is the third part of the fraud detection whitepaper. You can find the first part and the second part in my previous blog posts about this topic.
The problem of credit card fraud detection is not trivial. With every transaction processed, only a limited amount of data is available, making it difficult if not impossible to distinguish between a “good” transaction and a (potentially) fraudulent one. In addition, there are literally millions of points of sales and web sites where a single credit card can be used. Even additional properties that could be available in the card owner’s profile, such as demographical data, probably will not make things much clearer. Do we focus on the customer or on the credit card? From experience, it seems that the customers who use multiple credit cards typically use each card for a specific purpose. Although this means that we might start by profiling the card, it might also be worth checking the customer’s profile, as this might reveal different habits in different cultures.
It is also quite hard to request specific card properties and/or customer profile properties in advance. Different companies collect, maintain, and have access to different data sets. On the other hand, some common data, like geographical location, time of usage, type of a product purchased, type of a transaction (purchase, cash advance), and similar, can be available to us. A good overview of data, useful for credit card and online banking fraud detection, can be found in
Hand D.J., & Blunt G. (2001, October). Prospecting gems in credit card data. IMA Journal of Management Mathematics.
In addition to the source variables, many calculated variables can be extremely handy. If geographic data is not available, it can often be extracted from IP addresses, ZIP codes, and similar source data. Web addresses also might contain country of origin, or, alternatively, business type. Universal product codes (UPCs) contain country of origin and the company that produces it. Many variables can be calculated from the time of the transaction and other data, for example:
- A flag designating whether multiple transactions have been issued from different IPs and the same person in a particular time frame
- A flag designating whether transactions from multiple persons and the same IP have been issued in a particular time frame
- Whether there are multiple persons using the same credit card or user account
- Whether the total amount of a transaction is near the maximum amount allowed for a particular type of transaction, or whether it is nearer the minimum amount
- The time of day could be significant: is the day a holiday, a weekday, or was the transaction issued on a weekend, or a particular day during the month
- The frequency of transactions in a moving time frame
- The number of distinct transactions in a moving time frame (often, the same kind of a transaction could be repeated regularly)
- The quantity of deviations from a moving average for the type of the transaction
We could address the problem with number of detected frauds that is too low in all of the transactions by oversampling, by repeating, or copying, known fraudulent transactions, or by undersampling, by lowering the number of non-fraudulent transactions in the sample used for model training. From experience we have learned to prefer the approach of undersampling. We select non-fraudulent transactions either with simple random sampling, or when we have clearly determined groups of transactions, with stratified sampling. For example, it might be obvious that there are significant differences in the patterns across different countries; by using countries or regions we can define different strata and then randomly select appropriate numbers of cases from each stratum separately. You can read more about sampling at
Wikipedia. (n.d.). Sample size determination. Retrieved from Wikipedia: http://en.wikipedia.org/wiki/Sample_size_determination.
Different data mining algorithms are more or less prone to giving erroneous predictions when the target state is presented with a low enough frequency. With SSAS algorithms, we usually observe the following:
- The Microsoft Neural Networks algorithm works best when you have about 50% of frauds in the total sample data set
- The Microsoft Naïve Bayes algorithm already works well with 10% frauds
- The Microsoft Decision Trees algorithm even works well with only 1% of frauds
During the data preparation, we also have to take care of missing values and outliers. Missing values can have a seriously negative effect on a data mining project. However, if the number of missing values is small enough, they can be handled by using a variety of methods:
- Do nothing (a simple, but rarely a valid approach)
- Filter out the rows containing the missing data (note that we could also filter out too many rows and lose a pattern)
- Ignore the column (note that we could also ignore too many columns)
- Predict the missing values with data mining algorithms, like Decision Trees (note that we could lose variability)
- Build separate models, for example, one model for all the data (including missing values if the algorithm can handle them), one model for known data, and one model for rows with missing values (this does represent quite a lot of additional work)
- Modify the operational systems so that the missing values can be collected later (this represents the best alternative, but unfortunately it is also the most difficult to achieve)
- Replace the missing data with a mean (this is a very popular technique, although we could lose variability again)
Whenever we make any changes to the data, we are influencing the analysis. Before making any changes, we should determine whether there are any patterns in the missing data. We use data mining for this analysis. For instance, we add a flag with a value of 1 for a row that includes missing values or the value of 0 for rows where all of the variable values for the case are known and present. Then we use a predictive algorithm like Decision Trees to explain this new flag variable with other input variables. The resulting tree should be very shallow, without any strong patterns; otherwise, we have identified a pattern in the missing values. If a pattern has been found, it should be explained, and then we should use an appropriate missing value handling technique that does not alter the data (e.g., build separate models). Otherwise, we prefer to filter the rows with missing values.
Outliers are rare and far out-of-bound values. They are so far out of bound that they can influence the results of the analyses. Similarly to handling missing values, we determine whether there is any pattern in the outliers before handling them. We can do one of the following to address outliers:
- Check if the outlier is an erroneous value, and if is, correct it (the best possibility)
- Do nothing (a simple, but rarely a valid approach)
- Filter out the rows with the outliers (note that we could end up filtering out too many rows and lose a pattern)
- Ignore the column (note that we could ignore too many columns)
- Replace outliers with common (mean) values (note that, of course, we are losing variability)
- Bin values into equal height ranges (this is a good approach, especially for algorithms that use discrete input variables)
- Normalize the data values in predefined limited ranges
Some well-known methods of normalization include:
- Range normalization.
- Z-score normalization
- The logistic (sigmoid) function normalization
- The hyperbolic tangent function normalization
Discretization (or binning, or categorization, or recoding) is also useful for other purposes, not only for dealing with outliers. For example, some algorithms, for instance the Microsoft Naïve Bayes algorithm, can accept only discrete input variables. Discretization is performed on a single column. Examples of discretization include:
- Transaction amount
Note that with proper discretization we can compare otherwise incomparable data. For example, if one country income per capita is significantly higher than in another country, then it is difficult to compare the amounts of the transactions directly. However, if we discretize the income into three groups, like low, average and high, and appropriately tailor the group boundaries per country, we get comparable data. Please refer to
Pyle D. (1999). Data Preparation for Data Mining. Morgan Kaufmann
for further reading about data preparation for data mining.
As already mentioned, data overview activities interleave with the data preparation. In order to find outliers, we must get the idea of the distribution of a variable. We can use Microsoft Office Excel Pivot Tables and Pivot Graphs for this task. However, many times it is faster to use statistical computations and interpret the results. With Transact-SQL queries, we can calculate a lot of useful statistical information.
For a quick overview of discrete variables, we can use frequency tables. In a frequency table, we can show values, the absolute frequency of those values, absolute percentages, cumulative frequency, cumulative percent, and a histogram of the absolute percentage. OLAP cubes can be used to establish an overview of the frequency distribution for tens, if not hundreds of variables, very quickly.
For continuous variables, we can use descriptive statistics and calculate the first four population moments: Mean, Standard Deviation, Skewness, and Kurtosis. This gives us a quick impression of the distribution of values of those variables.
It is also worth checking linear dependencies between pairs of variables. Some algorithms, like the Microsoft Decision Trees algorithm, tend to exclude one variable from the dependent pair in the analysis, while other algorithms, like the Microsoft Clustering algorithm, might find too good a clusters if they use pairs of dependent variables. There are multiple methods for calculating these dependencies:
- Chi-Squared test for pairs of discrete variables
- Correlation Coefficient for pairs of continuous variables
- Analysis of variance (ANOVA) for pairs where one variable is continuous per one discrete variable
Based on our experience in the field of fraud detection, we developed a much faster method to test all possible linear dependencies between multiple variables, not just pairs. We use the Microsoft Naïve Bayes data mining algorithm, where we declare all variables as input and predictable at the same time; this way, we can determine all important dependencies with a single analysis. Of course, the Naïve Bayes algorithm expects discrete inputs only; however, SSAS can discretize variables on the fly, using different discretization methods. We usually use the Equal Heights method, even though it is typically not useful for the final analysis, because it changes the shape of the distribution; however, it is extremely useful for the overview of linear dependencies, because it retains the maximum amount of information in the variables.
Variables with an insignificant amount of information are useless in analysis. We measure the amount of information in a variable by calculating the Entropy. As this calculation is quite slow, we do it only for variables, which we suspect contain a low amount of information.
For more details about the statistics mentioned, please refer to
Wonnacott T.H., & Wonnacott R.J. (1990). Introductory Statistics, 5th Edition . Wiley.
For more information on the information theory, please refer to
Kullback S. (1997). Information Theory and Statistics (Dover Books on Mathematics). Dover Publications.