Cover Page

Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques

A Guide to Data Science for Fraud Detection

Bart Baesens

Véronique Van Vlasselaer

Wouter Verbeke

Title Page

Wiley & SAS Business Series

The Wiley & SAS Business Series presents books that help senior-level managers with their critical management decisions.

Titles in the Wiley & SAS Business Series include:

  1. Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications by Bart Baesens
  2. Bank Fraud: Using Technology to Combat Losses by Revathi Subramanian
  3. Big Data Analytics: Turning Big Data into Big Money by Frank Ohlhorst
  4. Big Data, Big Innovation: Enabling Competitive Differentiation through Business Analytics by Evan Stubbs
  5. Business Analytics for Customer Intelligence by Gert Laursen
  6. Business Intelligence Applied: Implementing an Effective Information and Communications Technology Infrastructure by Michael Gendron
  7. Business Intelligence and the Cloud: Strategic Implementation Guide by Michael S. Gendron
  8. Business Transformation: A Roadmap for Maximizing Organizational Insights by Aiman Zeid
  9. Connecting Organizational Silos: Taking Knowledge Flow Management to the Next Level with Social Media by Frank Leistner
  10. Data-Driven Healthcare: How Analytics and BI Are Transforming the Industry by Laura Madsen
  11. Delivering Business Analytics: Practical Guidelines for Best Practice by Evan Stubbs
  12. Demand-Driven Forecasting: A Structured Approach to Forecasting, Second Edition by Charles Chase
  13. Demand-Driven Inventory Optimization and Replenishment: Creating a More Efficient Supply Chain by Robert A. Davis
  14. Developing Human Capital: Using Analytics to Plan and Optimize Your Learning and Development Investments by Gene Pease, Barbara Beresford, and Lew Walker
  15. The Executive's Guide to Enterprise Social Media Strategy: How Social Networks Are Radically Transforming Your Business by David Thomas and Mike Barlow
  16. Economic and Business Forecasting: Analyzing and Interpreting Econometric Results by John Silvia, Azhar Iqbal, Kaylyn Swankoski, Sarah Watt, and Sam Bullard
  17. Financial Institution Advantage and The Optimization of Information Processing by Sean C. Keenan
  18. Foreign Currency Financial Reporting from Euros to Yen to Yuan: A Guide to Fundamental Concepts and Practical Applications by Robert Rowan
  19. Harness Oil and Gas Big Data with Analytics: Optimize Exploration and Production with Data Driven Models by Keith Holdaway
  20. Health Analytics: Gaining the Insights to Transform Health Care by Jason Burke
  21. Heuristics in Analytics: A Practical Perspective of What Influences Our Analytical World by Carlos Andre Reis Pinheiro and Fiona McNeill
  22. Human Capital Analytics: How to Harness the Potential of Your Organization's Greatest Asset by Gene Pease, Boyce Byerly, and Jac Fitz-enz
  23. Implement, Improve and Expand Your Statewide Longitudinal Data System: Creating a Culture of Data in Education by Jamie McQuiggan and Armistead Sapp
  24. Killer Analytics: Top 20 Metrics Missing from Your Balance Sheet by Mark Brown
  25. Predictive Analytics for Human Resources by Jac Fitz-enz and John Mattox II
  26. Predictive Business Analytics: Forward-Looking Capabilities to Improve Business Performance by Lawrence Maisel and Gary Cokins
  27. Retail Analytics: The Secret Weapon by Emmett Cox
  28. Social Network Analysis in Telecommunications by Carlos Andre Reis Pinheiro
  29. Statistical Thinking: Improving Business Performance, Second Edition by Roger W. Hoerl and Ronald D. Snee
  30. Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics by Bill Franks
  31. Too Big to Ignore: The Business Case for Big Data by Phil Simon
  32. The Value of Business Analytics: Identifying the Path to Profitability by Evan Stubbs
  33. The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions by Phil Simon
  34. Understanding the Predictive Analytics Lifecycle by Al Cordoba
  35. Unleashing Your Inner Leader: An Executive Coach Tells All by Vickie Bevenour
  36. Using Big Data Analytics: Turning Big Data into Big Money by Jared Dean
  37. Win with Advanced Business Analytics: Creating Business Value from Your Data by Jean Paul Isson and Jesse Harriott

For more information on any of the above titles, please visit www.wiley.com.

To my wonderful wife, Katrien, and kids, Ann-Sophie, Victor, and Hannelore.

To my parents and parents-in-law.

To my husband and soul mate, Niels, for his never-ending support.

To my parents, parents-in-law, and siblings-in-law.

To Luit and Titus.

List of Figures

  1. Figure 1.1 Fraud Triangle
  2. Figure 1.2 Fire Incident Claim-Handling Process
  3. Figure 1.3 The Fraud Cycle
  4. Figure 1.4 Outlier Detection at the Data Item Level
  5. Figure 1.5 Outlier Detection at the Data Set Level
  6. Figure 1.6 The Fraud Analytics Process Model
  7. Figure 1.7 Profile of a Fraud Data Scientist
  8. Figure 1.8 Screenshot of Web of Science Statistics for Scientific Publications on Fraud between 1996 and 2014
  9. Figure 2.1 Aggregating Normalized Data Tables into a Non-Normalized Data Table
  10. Figure 2.2 Pie Charts for Exploratory Data Analysis
  11. Figure 2.3 Benford's Law Describing the Frequency Distribution of the First Digit
  12. Figure 2.4 Multivariate Outliers
  13. Figure 2.5 Histogram for Outlier Detection
  14. Figure 2.6 Box Plots for Outlier Detection
  15. Figure 2.7 Using the z-Scores for Truncation
  16. Figure 2.8 Default Risk Versus Age
  17. Figure 2.9 Illustration of Principal Component Analysis in a Two-Dimensional Data Set
  18. Figure 3.1 3D Scatter Plot for Detecting Outliers
  19. Figure 3.2 OLAP Cube for Fraud Detection
  20. Figure 3.3 Example Pivot Table for Credit Card Fraud Detection
  21. Figure 3.4 Break-Point Analysis
  22. Figure 3.5 Peer-Group Analysis
  23. Figure 3.6 Cluster Analysis for Fraud Detection
  24. Figure 3.7 Hierarchical Versus Nonhierarchical Clustering Techniques
  25. Figure 3.8 Euclidean Versus Manhattan Distance
  26. Figure 3.9 Divisive Versus Agglomerative Hierarchical Clustering
  27. Figure 3.10 Calculating Distances between Clusters
  28. Figure 3.11 Example for Clustering Birds. The Numbers Indicate the Clustering Steps
  29. Figure 3.12 Dendrogram for Birds Example. The Thick Black Line Indicates the Optimal Clustering
  30. Figure 3.13 Scree Plot for Clustering
  31. Figure 3.14 Scatter Plot of Hierarchical Clustering Data
  32. Figure 3.15 Output of Hierarchical Clustering Procedures
  33. Figure 3.16 k-Means Clustering: Start from Original Data
  34. Figure 3.17 k-Means Clustering Iteration 1: Randomly Select Initial Cluster Centroids
  35. Figure 3.18 k-Means Clustering Iteration 1: Assign Remaining Observations
  36. Figure 3.19 k-Means Iteration Step 2: Recalculate Cluster Centroids
  37. Figure 3.20 k-Means Clustering Iteration 2: Reassign Observations
  38. Figure 3.21 k-Means Clustering Iteration 3: Recalculate Cluster Centroids
  39. Figure 3.22 k-Means Clustering Iteration 3: Reassign Observations
  40. Figure 3.23 Rectangular Versus Hexagonal SOM Grid
  41. Figure 3.24 Clustering Countries Using SOMs
  42. Figure 3.25 Component Plane for Literacy
  43. Figure 3.26 Component Plane for Political Rights
  44. Figure 3.27 Must-Link and Cannot-Link Constraints in Semi-Supervised Clustering
  45. Figure 3.28 δ-Constraints in Semi-Supervised Clustering
  46. Figure 3.29 ε-Constraints in Semi-Supervised Clustering
  47. Figure 3.30 Cluster Profiling Using Histograms
  48. Figure 3.31 Using Decision Trees for Clustering Interpretation
  49. Figure 3.32 One-Class Support Vector Machines
  50. Figure 4.1 A Spider Construction in Tax Evasion Fraud
  51. Figure 4.2 Regular Versus Fraudulent Bankruptcy
  52. Figure 4.3 OLS Regression
  53. Figure 4.4 Bounding Function for Logistic Regression
  54. Figure 4.5 Linear Decision Boundary of Logistic Regression
  55. Figure 4.6 Other Transformations
  56. Figure 4.7 Fraud Detection Scorecard
  57. Figure 4.8 Calculating the p-Value with a Student's t-Distribution
  58. Figure 4.9 Variable Subsets for Four Variables V1, V2, V3, and V4
  59. Figure 4.10 Example Decision Tree
  60. Figure 4.11 Example Data Sets for Calculating Impurity
  61. Figure 4.12 Entropy Versus Gini
  62. Figure 4.13 Calculating the Entropy for Age Split
  63. Figure 4.14 Using a Validation Set to Stop Growing a Decision Tree
  64. Figure 4.15 Decision Boundary of a Decision Tree
  65. Figure 4.16 Example Regression Tree for Predicting the Fraud Percentage
  66. Figure 4.17 Neural Network Representation of Logistic Regression
  67. Figure 4.18 A Multilayer Perceptron (MLP) Neural Network
  68. Figure 4.19 Local Versus Global Minima
  69. Figure 4.20 Using a Validation Set for Stopping Neural Network Training
  70. Figure 4.21 Example Hinton Diagram
  71. Figure 4.22 Backward Variable Selection
  72. Figure 4.23 Decompositional Approach for Neural Network Rule Extraction
  73. Figure 4.24 Pedagogical Approach for Rule Extraction
  74. Figure 4.25 Two-Stage Models
  75. Figure 4.26 Multiple Separating Hyperplanes
  76. Figure 4.27 SVM Classifier for the Perfectly Linearly Separable Case
  77. Figure 4.28 SVM Classifier in Case of Overlapping Distributions
  78. Figure 4.29 The Feature Space Mapping
  79. Figure 4.30 SVMs for Regression
  80. Figure 4.31 Representing an SVM Classifier as a Neural Network
  81. Figure 4.32 One-Versus-One Coding for Multiclass Problems
  82. Figure 4.33 One-Versus-All Coding for Multiclass Problems
  83. Figure 4.34 Training Versus Test Sample Set Up for Performance Estimation
  84. Figure 4.35 Cross-Validation for Performance Measurement
  85. Figure 4.36 Bootstrapping
  86. Figure 4.37 Calculating Predictions Using a Cut-Off
  87. Figure 4.38 The Receiver Operating Characteristic Curve
  88. Figure 4.39 Lift Curve
  89. Figure 4.40 Cumulative Accuracy Profile
  90. Figure 4.41 Calculating the Accuracy Ratio
  91. Figure 4.42 The Kolmogorov-Smirnov Statistic
  92. Figure 4.43 A Cumulative Notch Difference Graph
  93. Figure 4.44 Scatter Plot: Predicted Fraud Versus Actual Fraud
  94. Figure 4.45 CAP Curve for Continuous Targets
  95. Figure 4.46 Regression Error Characteristic (REC) Curve
  96. Figure 4.47 Varying the Time Window to Deal with Skewed Data Sets
  97. Figure 4.48 Oversampling the Fraudsters
  98. Figure 4.49 Undersampling the Nonfraudsters
  99. Figure 4.50 Synthetic Minority Oversampling Technique (SMOTE)
  100. Figure 5.1a Königsberg Bridges
  101. Figure 5.1b Schematic Representation of the Königsberg Bridges
  102. Figure 5.2 Identity Theft. The Frequent Contact List of a Person is Suddenly Extended with Other Contacts (Light Gray Nodes). This Might Indicate that a Fraudster (Dark Gray Node) Took Over that Customer's Account and “shares” his/her Contacts
  103. Figure 5.3 Network Representation
  104. Figure 5.4 Example of a (Un)Directed Graph
  105. Figure 5.5 Follower–Followee Relationships in a Twitter Network
  106. Figure 5.6 Edge Representation
  107. Figure 5.7 Example of a Fraudulent Network
  108. Figure 5.8 An Egonet. The Ego is Surrounded by Six Alters, of Whom Two are Legitimate (White Nodes) and Four are Fraudulent (Gray Nodes)
  109. Figure 5.9 Toy Example of Credit Card Fraud
  110. Figure 5.10 Mathematical Representation of (a) a Sample Network; (b) the Adjacency or Connectivity Matrix; (c) the Weight Matrix; (d) the Adjacency List; and (e) the Weight List
  111. Figure 5.11 A Real-Life Example of a Homophilic Network
  112. Figure 5.12 A Homophilic Network
  113. Figure 5.13 Sample Network
  114. Figure 5.14a Degree Distribution
  115. Figure 5.14b Illustration of the Degree Distribution for a Real-Life Network of Social Security Fraud. The Degree Distribution Follows a Power Law (log-log axes)
  116. Figure 5.15 A 4-regular Graph
  117. Figure 5.16 Example Social Network for a Relational Neighbor Classifier
  118. Figure 5.17 Example Social Network for a Probabilistic Relational Neighbor Classifier
  119. Figure 5.18 Example of Social Network Features for a Relational Logistic Regression Classifier
  120. Figure 5.19 Example of Featurization with Features Describing Intrinsic Behavior and Behavior of the Neighborhood
  121. Figure 5.20 Illustration of Dijkstra's Algorithm
  122. Figure 5.21 Illustration of the Number of Connecting Paths Between Two Nodes
  123. Figure 5.22 Illustration of Betweenness Between Communities of Nodes
  124. Figure 5.23 PageRank Algorithm
  125. Figure 5.24 Illustration of Iterative Process of the PageRank Algorithm
  126. Figure 5.25 Sample Network
  127. Figure 5.26 Community Detection for Credit Card Fraud
  128. Figure 5.27 Iterative Bisection
  129. Figure 5.28 Dendrogram of the Clustering of Figure 5.27 by the Girvan-Newman Algorithm. The Modularity Q is Maximized When Splitting the Network into Two Communities ABC–DEFG
  130. Figure 5.29 Complete (a) and Partial (b) Communities
  131. Figure 5.30 Overlapping Communities
  132. Figure 5.31 Unipartite Graph
  133. Figure 5.32 Bipartite Graph
  134. Figure 5.33 Connectivity Matrix of a Bipartite Graph
  135. Figure 5.34 A Multipartite Graph
  136. Figure 5.35 Sample Network of Gotcha!
  137. Figure 5.36 Exposure Score of the Resources Derived by a Propagation Algorithm. The Results are Based on a Real-life Data Set in Social Security Fraud
  138. Figure 5.37 Egonet in Social Security Fraud. A Company Is Associated with its Resources
  139. Figure 5.38 ROC Curve of the Gotcha! Model, which Combines both Intrinsic and Relational Features
  140. Figure 6.1 The Analytical Model Life Cycle
  141. Figure 6.2 Traffic Light Indicator Approach
  142. Figure 6.3 SAS Social Network Analysis Dashboard
  143. Figure 6.4 SAS Social Network Analysis Claim Detail Investigation
  144. Figure 6.5 SAS Social Network Analysis Link Detection
  145. Figure 6.6 Distribution of Claim Amounts and Average Claim Value
  146. Figure 6.7 Geographical Distribution of Claims
  147. Figure 6.8 Zooming into the Geographical Distribution of Claims
  148. Figure 6.9 Measuring the Efficiency of the Fraud-Detection Process
  149. Figure 6.10 Evaluating the Efficiency of Fraud Investigators
  150. Figure 7.1 RACI Matrix
  151. Figure 7.2 Anonymizing a Database
  152. Figure 7.3 Different SQL Views Defined for a Database
  153. Figure 7.4 Aggregate Loss Distribution with Indication of Expected Loss, Value at Risk (VaR) at 99.9 Percent Confidence Level and Unexpected Loss
  154. Figure 7.5 Snapshot of a Credit Card Fraud Time Series Data Set and Associated Histogram of the Fraud Amounts
  155. Figure 7.6 Aggregate Loss Distribution Resulting from a Monte Carlo Simulation with Poisson Distributed Monthly Fraud Frequency and Associated Pareto Distributed Fraud Loss

Foreword

Fraud will always be with us. It is linked both to organized crime and to terrorism, and it inflicts substantial economic damage. The perpetrators of fraud play a dynamic cat-and-mouse game with those trying to stop them. Preventing a particular kind of fraud does not mean the fraudsters give up, but merely that they change their tactics: they are constantly on the lookout for new avenues for fraud, for new weaknesses in the system. And given that our social and financial systems are forever developing, there are always new opportunities to be exploited.

This book is a clear and comprehensive outline of the current state of the art in fraud detection and prevention methodology. It describes the data necessary to detect fraud, and then takes the reader from the basics of fraud-detection data analytics, through advanced pattern recognition methodology, to cutting-edge social network analysis and fraud ring detection.

If we cannot stop fraud altogether, an awareness of the contents of this book will at least enable readers to reduce the extent of fraud, and make it harder for criminals to take advantage of the honest. The readers' organizations, be they public or private, will be better protected if they implement the strategies described in this book. In short, this book is a valuable contribution to the well-being of society and of the people within it.

Professor David J. Hand
Imperial College, London

Preface

It is estimated that a typical organization loses about 5 percent of its revenues to fraud each year. In this book, we will discuss how state-of-the-art descriptive, predictive, and social network analytics can be used to fight fraud by learning fraud patterns from historical data.

The focus of this book is not on the mathematics or theory, but on the practical applications. Formulas and equations will only be included when absolutely needed from a practitioner's perspective. It is also not our aim to provide exhaustive coverage of all analytical techniques previously developed, but rather to cover those that really provide added value in a practical fraud detection setting.

Because it is targeted primarily at the business professional, the book is written in a condensed, focused way. Prerequisite knowledge consists of some basic exposure to descriptive statistics (e.g., mean, standard deviation, correlation, confidence intervals, hypothesis testing), data handling (using, for example, Microsoft Excel or SQL), and data visualization (e.g., bar plots, pie charts, histograms, scatter plots). Throughout the discussion, many examples of real-life fraud applications will be included, drawn from areas such as insurance fraud, tax evasion fraud, and credit card fraud. The authors will also integrate both their research and consulting experience throughout the various chapters. The book is aimed at (senior) data analysts, (aspiring) data scientists, consultants, analytics practitioners, and researchers (e.g., PhD candidates) starting to explore the field.

Chapter 1 sets the stage on fraud detection, prevention, and analytics. It starts by defining fraud and then zooms into fraud detection and prevention. The impact of big data for fraud detection and the fraud analytics process model are reviewed next. The chapter concludes by summarizing the key skills of a fraud data scientist.

Chapter 2 provides an extensive discussion of the basic ingredient of any fraud analytical model: data! It introduces various types of data sources and discusses how to merge and sample them. The next sections discuss the different types of data elements, visual exploration, Benford's law, and descriptive statistics. These are all essential tools to start understanding the characteristics and limitations of the data available. Data preprocessing activities are also extensively covered: handling missing values, detecting and treating outliers, defining red flags, standardizing data, categorizing variables, weights of evidence coding, and variable selection. Principal component analysis is outlined as a technique to reduce the dimensionality of the input data. This is then further illustrated with RIDIT and PRIDIT analysis. The chapter ends by reviewing segmentation and the risks thereof.
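As a rough illustration of one of the tools named above, Benford's law states that the leading digit d of many naturally occurring numbers appears with probability log10(1 + 1/d). The sketch below is not taken from the book: the function names and the sample amounts are hypothetical, and it only shows how observed first-digit frequencies might be set side by side with the expected ones.

```python
import math
from collections import Counter


def benford_expected():
    """Expected first-digit probabilities under Benford's law: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """Leading non-zero digit of a non-zero number."""
    # Scientific notation always puts the leading digit first, e.g. '4.25...e+02'.
    return int(f"{abs(x):.15e}"[0])


def observed_first_digit_freq(amounts):
    """Relative frequency of each leading digit 1..9, ignoring zero amounts."""
    digits = [first_digit(a) for a in amounts if a != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


if __name__ == "__main__":
    # Hypothetical claim amounts, for illustration only.
    amounts = [120.5, 187.0, 1043.2, 23.9, 310.0, 1999.99, 14.0, 950.0]
    expected = benford_expected()
    observed = observed_first_digit_freq(amounts)
    for d in range(1, 10):
        print(d, round(expected[d], 3), round(observed[d], 3))
```

A large gap between the two columns is at most a reason to look closer, not proof of fraud, and a sample as small as the one above is far too small to draw any conclusion from.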

Chapter 3 continues by exploring the use of descriptive analytics for fraud detection. The idea here is to look for unusual patterns or outliers in a fraud data set. Both graphical and statistical outlier detection procedures are reviewed first. This is followed by an overview of break-point analysis, peer group analysis, association rules, clustering, and one-class SVMs.

Chapter 4 zooms into predictive analytics for fraud detection. We start from a labeled data set of transactions whereby each transaction has a target of interest that can either be binary (e.g., fraudulent or not) or continuous (e.g., amount of fraud). We then discuss various analytical techniques to build predictive models: linear regression, logistic regression, decision trees, neural networks, support vector machines, ensemble methods, and multiclass classification techniques. The next section reviews how to measure the performance of a predictive analytical model by first deciding how to split the data set and then choosing the performance metric. The class imbalance problem is also discussed at length. The chapter concludes by giving some performance benchmarks.

Chapter 5 introduces the reader to social network analysis and its use for fraud detection. Stating that the propensity to commit fraud is often influenced by the social neighborhood, we describe the main components of a network and illustrate how transactional data sources can be transformed into networks. In the next section, we elaborate on featurization, the process of extracting a set of meaningful features from the network. We distinguish between three main types of features: neighborhood metrics, centrality metrics, and collective inference algorithms. We then zoom into community mining, where we aim at finding groups of fraudsters closely connected in the network. By introducing multipartite graphs, we address the fact that fraud often depends on a multitude of different factors and that the inclusion of all these factors in a network representation contributes to a better understanding and analysis of the detection problem at hand. The chapter concludes with a real-life example of social security fraud.

Chapter 6 deals with the postprocessing of fraud analytical models. It starts by giving an overview of the analytical fraud model lifecycle. It then discusses the traffic light indicator approach and decision tables as two popular model representations. This is followed by a set of guidelines to appropriately select the fraud sample to investigate. Fraud alert and case management are covered next. We also illustrate how visual analytics can contribute to the postprocessing activities. We describe how to backtest analytical fraud models by considering data stability, model stability, and model calibration. The chapter concludes by giving some guidelines about model design and documentation.

Chapter 7 provides a broader perspective on fraud analytics. We provide some guidelines for setting up and managing data quality programs. We zoom into privacy and discuss various ways to ensure appropriate access to both internal and external data. We discuss how analytical fraud estimates can be used to calculate both expected and unexpected losses, which can then help to determine provisioning and capital buffers. A discussion of total cost of ownership and return on investment provides an economic perspective on fraud analytics. This is followed by a discussion of insourcing versus outsourcing of analytical model development. We briefly zoom into some interesting modeling extensions, such as forecasting and text analytics. The potential and danger of the Internet of Things for fraud analytics are also covered. The chapter concludes by giving some recommendations for corporate fraud governance.
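To make the idea of expected and unexpected losses slightly more concrete, the following is a minimal sketch, not the book's own procedure. It assumes, in line with the captions of Figures 7.4 and 7.6, a Poisson-distributed monthly fraud frequency, Pareto-distributed fraud amounts, and a value at risk (VaR) at the 99.9 percent level, with the unexpected loss taken as VaR minus expected loss. All parameter values and function names are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_aggregate_losses(lam, alpha, scale, n_sims, seed=0):
    """Simulate n_sims monthly aggregate fraud losses.

    lam   : assumed average number of frauds per month (Poisson)
    alpha : assumed Pareto shape of a single fraud amount
    scale : assumed minimum fraud amount (Pareto scale)
    """
    rng = random.Random(seed)
    losses = []
    for _ in range(n_sims):
        n_frauds = sample_poisson(lam, rng)
        losses.append(sum(scale * rng.paretovariate(alpha) for _ in range(n_frauds)))
    return losses


def loss_summary(losses, confidence=0.999):
    """Return (expected loss, VaR at the given confidence, unexpected loss)."""
    ordered = sorted(losses)
    expected = sum(ordered) / len(ordered)
    idx = min(len(ordered) - 1, math.ceil(confidence * len(ordered)) - 1)
    var = ordered[idx]
    return expected, var, var - expected


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to make the sketch run.
    losses = simulate_aggregate_losses(lam=20, alpha=3.0, scale=500.0, n_sims=5000, seed=42)
    expected, var, unexpected = loss_summary(losses)
    print(f"expected loss: {expected:.0f}  VaR(99.9%): {var:.0f}  unexpected loss: {unexpected:.0f}")
```

In such a setup the expected loss would typically inform provisioning, while the unexpected loss would inform the capital buffer. The estimates depend heavily on the assumed distributions and on the number of simulation runs.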

Acknowledgments

It is a great pleasure to acknowledge the contributions and assistance of various colleagues, friends, and fellow analytics lovers to the writing of this book. This book is the result of many years of research and teaching in analytics, risk management, and fraud. We first would like to thank our publisher, John Wiley & Sons, for accepting our book proposal less than one year ago.

We are grateful to the active and lively analytics and fraud detection community for providing various user fora, blogs, online lectures, and tutorials, which proved very helpful.

We would also like to acknowledge the direct and indirect contributions of the many colleagues, fellow professors, students, researchers, and friends with whom we collaborated during the past years.

Last but not least, we are grateful to our partners, parents, and families for their love, support, and encouragement.

We have tried to make this book as complete, accurate, and enjoyable as possible. Of course, what really matters is what you, the reader, think of it. The authors welcome all feedback and comments, so do not hesitate to get in touch and let us know your thoughts!

Bart Baesens
Véronique Van Vlasselaer
Wouter Verbeke
August 2015