Wiley & SAS Business Series

The Wiley & SAS Business Series presents books that help senior-level managers with their critical management decisions.

Titles in the Wiley & SAS Business Series include:
Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications by Bart Baesens
Bank Fraud: Using Technology to Combat Losses by Revathi Subramanian
Big Data Analytics: Turning Big Data into Big Money by Frank Ohlhorst
Big Data, Big Innovation: Enabling Competitive Differentiation through Business Analytics by Evan Stubbs
Business Analytics for Customer Intelligence by Gert Laursen
Business Intelligence Applied: Implementing an Effective Information and Communications Technology Infrastructure by Michael Gendron
Business Intelligence and the Cloud: Strategic Implementation Guide by Michael S. Gendron
Business Transformation: A Roadmap for Maximizing Organizational Insights by Aiman Zeid
Connecting Organizational Silos: Taking Knowledge Flow Management to the Next Level with Social Media by Frank Leistner
Data-Driven Healthcare: How Analytics and BI Are Transforming the Industry by Laura Madsen
Delivering Business Analytics: Practical Guidelines for Best Practice by Evan Stubbs
Demand-Driven Forecasting: A Structured Approach to Forecasting, Second Edition by Charles Chase
Demand-Driven Inventory Optimization and Replenishment: Creating a More Efficient Supply Chain by Robert A. Davis
Developing Human Capital: Using Analytics to Plan and Optimize Your Learning and Development Investments by Gene Pease, Barbara Beresford, and Lew Walker
The Executive's Guide to Enterprise Social Media Strategy: How Social Networks Are Radically Transforming Your Business by David Thomas and Mike Barlow
Economic and Business Forecasting: Analyzing and Interpreting Econometric Results by John Silvia, Azhar Iqbal, Kaylyn Swankoski, Sarah Watt, and Sam Bullard
Financial Institution Advantage and the Optimization of Information Processing by Sean C. Keenan
Foreign Currency Financial Reporting from Euros to Yen to Yuan: A Guide to Fundamental Concepts and Practical Applications by Robert Rowan
Harness Oil and Gas Big Data with Analytics: Optimize Exploration and Production with Data Driven Models by Keith Holdaway
Health Analytics: Gaining the Insights to Transform Health Care by Jason Burke
Heuristics in Analytics: A Practical Perspective of What Influences Our Analytical World by Carlos Andre Reis Pinheiro and Fiona McNeill
Human Capital Analytics: How to Harness the Potential of Your Organization's Greatest Asset by Gene Pease, Boyce Byerly, and Jac Fitz-enz
Implement, Improve and Expand Your Statewide Longitudinal Data System: Creating a Culture of Data in Education by Jamie McQuiggan and Armistead Sapp
Killer Analytics: Top 20 Metrics Missing from Your Balance Sheet by Mark Brown
Predictive Analytics for Human Resources by Jac Fitz-enz and John Mattox II
Predictive Business Analytics: Forward-Looking Capabilities to Improve Business Performance by Lawrence Maisel and Gary Cokins
Retail Analytics: The Secret Weapon by Emmett Cox
Social Network Analysis in Telecommunications by Carlos Andre Reis Pinheiro
Statistical Thinking: Improving Business Performance, Second Edition by Roger W. Hoerl and Ronald D. Snee
Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics by Bill Franks
Too Big to Ignore: The Business Case for Big Data by Phil Simon
The Value of Business Analytics: Identifying the Path to Profitability by Evan Stubbs
The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions by Phil Simon
Understanding the Predictive Analytics Lifecycle by Al Cordoba
Unleashing Your Inner Leader: An Executive Coach Tells All by Vickie Bevenour
Using Big Data Analytics: Turning Big Data into Big Money by Jared Dean
Win with Advanced Business Analytics: Creating Business Value from Your Data by Jean Paul Isson and Jesse Harriott

For more information on any of the above titles, please visit www.wiley.com.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Baesens, Bart.

Fraud analytics using descriptive, predictive, and social network techniques : a guide to data science for fraud detection / Bart Baesens, Veronique Van Vlasselaer, Wouter Verbeke.

pages cm. — (Wiley & SAS business series)

Includes bibliographical references and index.

ISBN 978-1-119-13312-4 (cloth) — ISBN 978-1-119-14682-7 (epdf) — ISBN 978-1-119-14683-4 (epub)

1. Fraud— Statistical methods. 2. Fraud— Prevention. 3. Commercial crimes— Prevention. I. Title.

HV6691.B34 2015

364.16′3015195—dc23

2015017861

Cover Design: Wiley

Cover Image: ©iStock.com/aleksandarvelasevic

List of Figures

Figure 1.1 Fraud Triangle
Figure 1.2 Fire Incident Claim-Handling Process
Figure 1.3 The Fraud Cycle
Figure 1.4 Outlier Detection at the Data Item Level
Figure 1.5 Outlier Detection at the Data Set Level
Figure 1.6 The Fraud Analytics Process Model
Figure 1.7 Profile of a Fraud Data Scientist
Figure 1.8 Screenshot of Web of Science Statistics for Scientific Publications on Fraud between 1996 and 2014
Figure 2.1 Aggregating Normalized Data Tables into a Non-Normalized Data Table
Figure 2.2 Pie Charts for Exploratory Data Analysis
Figure 2.3 Benford's Law Describing the Frequency Distribution of the First Digit
Figure 2.4 Multivariate Outliers
Figure 2.5 Histogram for Outlier Detection
Figure 2.6 Box Plots for Outlier Detection
Figure 2.7 Using the z-Scores for Truncation
Figure 2.8 Default Risk Versus Age
Figure 2.9 Illustration of Principal Component Analysis in a Two-Dimensional Data Set
Figure 3.1 3D Scatter Plot for Detecting Outliers
Figure 3.2 OLAP Cube for Fraud Detection
Figure 3.3 Example Pivot Table for Credit Card Fraud Detection
Figure 3.4 Break-Point Analysis
Figure 3.5 Peer-Group Analysis
Figure 3.6 Cluster Analysis for Fraud Detection
Figure 3.7 Hierarchical Versus Nonhierarchical Clustering Techniques
Figure 3.8 Euclidean Versus Manhattan Distance
Figure 3.9 Divisive Versus Agglomerative Hierarchical Clustering
Figure 3.10 Calculating Distances between Clusters
Figure 3.11 Example for Clustering Birds. The Numbers Indicate the Clustering Steps
Figure 3.12 Dendrogram for Birds Example. The Thick Black Line Indicates the Optimal Clustering
Figure 3.13 Scree Plot for Clustering
Figure 3.14 Scatter Plot of Hierarchical Clustering Data
Figure 3.15 Output of Hierarchical Clustering Procedures
Figure 3.16 k-Means Clustering: Start from Original Data
Figure 3.17 k-Means Clustering Iteration 1: Randomly Select Initial Cluster Centroids
Figure 3.18 k-Means Clustering Iteration 1: Assign Remaining Observations
Figure 3.19 k-Means Clustering Iteration 2: Recalculate Cluster Centroids
Figure 3.20 k-Means Clustering Iteration 2: Reassign Observations
Figure 3.21 k-Means Clustering Iteration 3: Recalculate Cluster Centroids
Figure 3.22 k-Means Clustering Iteration 3: Reassign Observations
Figure 3.23 Rectangular Versus Hexagonal SOM Grid
Figure 3.24 Clustering Countries Using SOMs
Figure 3.25 Component Plane for Literacy
Figure 3.26 Component Plane for Political Rights
Figure 3.27 Must-Link and Cannot-Link Constraints in Semi-Supervised Clustering
Figure 3.28 δ-Constraints in Semi-Supervised Clustering
Figure 3.29 ε-Constraints in Semi-Supervised Clustering
Figure 3.30 Cluster Profiling Using Histograms
Figure 3.31 Using Decision Trees for Clustering Interpretation
Figure 3.32 One-Class Support Vector Machines
Figure 4.1 A Spider Construction in Tax Evasion Fraud
Figure 4.2 Regular Versus Fraudulent Bankruptcy
Figure 4.3 OLS Regression
Figure 4.4 Bounding Function for Logistic Regression
Figure 4.5 Linear Decision Boundary of Logistic Regression
Figure 4.6 Other Transformations
Figure 4.7 Fraud Detection Scorecard
Figure 4.8 Calculating the p-Value with a Student's t-Distribution
Figure 4.9 Variable Subsets for Four Variables V₁, V₂, V₃, and V₄
Figure 4.10 Example Decision Tree
Figure 4.11 Example Data Sets for Calculating Impurity
Figure 4.12 Entropy Versus Gini
Figure 4.13 Calculating the Entropy for Age Split
Figure 4.14 Using a Validation Set to Stop Growing a Decision Tree
Figure 4.15 Decision Boundary of a Decision Tree
Figure 4.16 Example Regression Tree for Predicting the Fraud Percentage
Figure 4.17 Neural Network Representation of Logistic Regression
Figure 4.18 A Multilayer Perceptron (MLP) Neural Network
Figure 4.19 Local Versus Global Minima
Figure 4.20 Using a Validation Set for Stopping Neural Network Training
Figure 4.21 Example Hinton Diagram
Figure 4.22 Backward Variable Selection
Figure 4.23 Decompositional Approach for Neural Network Rule Extraction
Figure 4.24 Pedagogical Approach for Rule Extraction
Figure 4.25 Two-Stage Models
Figure 4.26 Multiple Separating Hyperplanes
Figure 4.27 SVM Classifier for the Perfectly Linearly Separable Case
Figure 4.28 SVM Classifier in Case of Overlapping Distributions
Figure 4.29 The Feature Space Mapping
Figure 4.30 SVMs for Regression
Figure 4.31 Representing an SVM Classifier as a Neural Network
Figure 4.32 One-Versus-One Coding for Multiclass Problems
Figure 4.33 One-Versus-All Coding for Multiclass Problems
Figure 4.34 Training Versus Test Sample Set Up for Performance Estimation
Figure 4.35 Cross-Validation for Performance Measurement
Figure 4.36 Bootstrapping
Figure 4.37 Calculating Predictions Using a Cut-Off
Figure 4.38 The Receiver Operating Characteristic Curve
Figure 4.39 Lift Curve
Figure 4.40 Cumulative Accuracy Profile
Figure 4.41 Calculating the Accuracy Ratio
Figure 4.42 The Kolmogorov-Smirnov Statistic
Figure 4.43 A Cumulative Notch Difference Graph
Figure 4.44 Scatter Plot: Predicted Fraud Versus Actual Fraud
Figure 4.45 CAP Curve for Continuous Targets
Figure 4.46 Regression Error Characteristic (REC) Curve
Figure 4.47 Varying the Time Window to Deal with Skewed Data Sets
Figure 4.48 Oversampling the Fraudsters
Figure 4.49 Undersampling the Nonfraudsters
Figure 4.50 Synthetic Minority Oversampling Technique (SMOTE)
Figure 5.1a Königsberg Bridges
Figure 5.1b Schematic Representation of the Königsberg Bridges
Figure 5.2 Identity Theft. The Frequent Contact List of a Person is Suddenly Extended with Other Contacts (Light Gray Nodes). This Might Indicate that a Fraudster (Dark Gray Node) Took Over that Customer's Account and “shares” his/her Contacts
Figure 5.3 Network Representation
Figure 5.4 Example of a (Un)Directed Graph
Figure 5.5 Follower–Followee Relationships in a Twitter Network
Figure 5.6 Edge Representation
Figure 5.7 Example of a Fraudulent Network
Figure 5.8 An Egonet. The Ego is Surrounded by Six Alters, of Whom Two are Legitimate (White Nodes) and Four are Fraudulent (Gray Nodes)
Figure 5.9 Toy Example of Credit Card Fraud
Figure 5.10 Mathematical Representation of (a) a Sample Network; (b) the Adjacency or Connectivity Matrix; (c) the Weight Matrix; (d) the Adjacency List; and (e) the Weight List
Figure 5.11 A Real-Life Example of a Homophilic Network
Figure 5.12 A Homophilic Network
Figure 5.13 Sample Network
Figure 5.14a Degree Distribution
Figure 5.14b Illustration of the Degree Distribution for a Real-Life Network of Social Security Fraud. The Degree Distribution Follows a Power Law (log-log axes)
Figure 5.15 A 4-regular Graph
Figure 5.16 Example Social Network for a Relational Neighbor Classifier
Figure 5.17 Example Social Network for a Probabilistic Relational Neighbor Classifier
Figure 5.18 Example of Social Network Features for a Relational Logistic Regression Classifier
Figure 5.19 Example of Featurization with Features Describing Intrinsic Behavior and Behavior of the Neighborhood
Figure 5.20 Illustration of Dijkstra's Algorithm
Figure 5.21 Illustration of the Number of Connecting Paths Between Two Nodes
Figure 5.22 Illustration of Betweenness Between Communities of Nodes
Figure 5.23 PageRank Algorithm
Figure 5.24 Illustration of Iterative Process of the PageRank Algorithm
Figure 5.25 Sample Network
Figure 5.26 Community Detection for Credit Card Fraud
Figure 5.27 Iterative Bisection
Figure 5.28 Dendrogram of the Clustering of Figure 5.27 by the Girvan-Newman Algorithm. The Modularity Q is Maximized When Splitting the Network into Two Communities ABC–DEFG
Figure 5.29 Complete (a) and Partial (b) Communities
Figure 5.30 Overlapping Communities
Figure 5.31 Unipartite Graph
Figure 5.32 Bipartite Graph
Figure 5.33 Connectivity Matrix of a Bipartite Graph
Figure 5.34 A Multipartite Graph
Figure 5.35 Sample Network of Gotcha!
Figure 5.36 Exposure Score of the Resources Derived by a Propagation Algorithm. The Results are Based on a Real-life Data Set in Social Security Fraud
Figure 5.37 Egonet in Social Security Fraud. A Company Is Associated with its Resources
Figure 5.38 ROC Curve of the Gotcha! Model, which Combines both Intrinsic and Relational Features
Figure 6.1 The Analytical Model Life Cycle
Figure 6.2 Traffic Light Indicator Approach
Figure 6.3 SAS Social Network Analysis Dashboard
Figure 6.4 SAS Social Network Analysis Claim Detail Investigation
Figure 6.5 SAS Social Network Analysis Link Detection
Figure 6.6 Distribution of Claim Amounts and Average Claim Value
Figure 6.7 Geographical Distribution of Claims
Figure 6.8 Zooming into the Geographical Distribution of Claims
Figure 6.9 Measuring the Efficiency of the Fraud-Detection Process
Figure 6.10 Evaluating the Efficiency of Fraud Investigators
Figure 7.1 RACI Matrix
Figure 7.2 Anonymizing a Database
Figure 7.3 Different SQL Views Defined for a Database
Figure 7.4 Aggregate Loss Distribution with Indication of Expected Loss, Value at Risk (VaR) at 99.9 Percent Confidence Level and Unexpected Loss
Figure 7.5 Snapshot of a Credit Card Fraud Time Series Data Set and Associated Histogram of the Fraud Amounts
Figure 7.6 Aggregate Loss Distribution Resulting from a Monte Carlo Simulation with Poisson Distributed Monthly Fraud Frequency and Associated Pareto Distributed Fraud Loss

Preface

It is estimated that a typical organization loses about 5 percent of its revenues to fraud each year. In this book, we will discuss how state-of-the-art descriptive, predictive, and social network analytics can be used to fight fraud by learning fraud patterns from historical data.

The focus of this book is not on the mathematics or theory, but on the practical applications. Formulas and equations will only be included when absolutely needed from a practitioner's perspective. It is also not our aim to provide exhaustive coverage of all analytical techniques previously developed but, rather, to cover the ones that add real value in a practical fraud detection setting.

Because it is targeted primarily at business professionals, the book is written in a condensed, focused way. Prerequisite knowledge consists of some basic exposure to descriptive statistics (e.g., mean, standard deviation, correlation, confidence intervals, hypothesis testing), data handling (using, for example, Microsoft Excel or SQL), and data visualization (e.g., bar plots, pie charts, histograms, scatter plots). Throughout the discussion, many examples of real-life fraud applications will be included from, for example, insurance fraud, tax evasion fraud, and credit card fraud. The authors will also integrate both their research and consulting experience throughout the various chapters. The book is aimed at (senior) data analysts, (aspiring) data scientists, consultants, analytics practitioners, and researchers (e.g., PhD candidates) starting to explore the field.

Chapter 1 sets the stage for fraud detection, prevention, and analytics. It starts by defining fraud and then zooms in on fraud detection and prevention. The impact of big data on fraud detection and the fraud analytics process model are reviewed next. The chapter concludes by summarizing the key skills of a fraud data scientist.

Chapter 2 provides an extensive discussion of the basic ingredient of any fraud analytical model: data! It introduces various types of data sources and discusses how to merge and sample them. The next sections discuss the different types of data elements, visual exploration, Benford's law, and descriptive statistics. These are all essential tools to start understanding the characteristics and limitations of the data available. Data preprocessing activities are also extensively covered: handling missing values, detecting and treating outliers, defining red flags, standardizing data, categorizing variables, weights of evidence coding, and variable selection. Principal component analysis is outlined as a technique to reduce the dimensionality of the input data. This is then further illustrated with RIDIT and PRIDIT analysis. The chapter ends by reviewing segmentation and the risks thereof.

Chapter 3 continues by exploring the use of descriptive analytics for fraud detection. The idea here is to look for unusual patterns or outliers in a fraud data set. Both graphical and statistical outlier detection procedures are reviewed first. This is followed by an overview of break-point analysis, peer group analysis, association rules, clustering, and one-class SVMs.

Chapter 4 zooms in on predictive analytics for fraud detection. We start from a labeled data set of transactions whereby each transaction has a target of interest that can either be binary (e.g., fraudulent or not) or continuous (e.g., amount of fraud). We then discuss various analytical techniques to build predictive models: linear regression, logistic regression, decision trees, neural networks, support vector machines, ensemble methods, and multiclass classification techniques. The next section reviews how to measure the performance of a predictive analytical model by first deciding on how to split the data set and then on the performance metric. The class imbalance problem is also discussed extensively. The chapter concludes by giving some performance benchmarks.

Chapter 5 introduces the reader to social network analysis and its use for fraud detection. Stating that the propensity to commit fraud is often influenced by the social neighborhood, we describe the main components of a network and illustrate how transactional data sources can be transformed into networks. In the next section, we elaborate on featurization, the process of extracting a set of meaningful features from the network. We distinguish between three main types of features: neighborhood metrics, centrality metrics, and collective inference algorithms. We then zoom in on community mining, where we aim at finding groups of fraudsters closely connected in the network. By introducing multipartite graphs, we address the fact that fraud often depends on a multitude of different factors and that the inclusion of all these factors in a network representation contributes to a better understanding and analysis of the detection problem at hand. The chapter concludes with a real-life example of social security fraud.

Chapter 6 deals with the postprocessing of fraud analytical models. It starts by giving an overview of the analytical fraud model lifecycle. It then discusses the traffic light indicator approach and decision tables as two popular model representations. This is followed by a set of guidelines to appropriately select the fraud sample to investigate. Fraud alert and case management are covered next. We also illustrate how visual analytics can contribute to the postprocessing activities. We describe how to backtest analytical fraud models by considering data stability, model stability, and model calibration. The chapter concludes by giving some guidelines about model design and documentation.

Chapter 7 provides a broader perspective on fraud analytics. We provide some guidelines for setting up and managing data quality programs. We zoom in on privacy and discuss various ways to ensure appropriate access to both internal and external data. We discuss how analytical fraud estimates can be used to calculate both expected and unexpected losses, which can then help to determine provisioning and capital buffers. A discussion of total cost of ownership and return on investment provides an economic perspective on fraud analytics. This is followed by a discussion of insourcing versus outsourcing of analytical model development. We briefly zoom in on some interesting modeling extensions, such as forecasting and text analytics. The potential and danger of the Internet of Things for fraud analytics are also covered. The chapter concludes by giving some recommendations for corporate fraud governance.

Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques

A Guide to Data Science for Fraud Detection
