Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques
A Guide to Data Science for Fraud Detection
Bart Baesens
Véronique Van Vlasselaer
Wouter Verbeke
Wiley & SAS Business Series
The Wiley & SAS Business Series presents books that help senior-level managers with their critical management decisions.
Titles in the Wiley & SAS Business Series include:
Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications by Bart Baesens
Bank Fraud: Using Technology to Combat Losses by Revathi Subramanian
Big Data Analytics: Turning Big Data into Big Money by Frank Ohlhorst
Big Data, Big Innovation: Enabling Competitive Differentiation through Business Analytics by Evan Stubbs
Business Analytics for Customer Intelligence by Gert Laursen
Business Intelligence Applied: Implementing an Effective Information and Communications Technology Infrastructure by Michael Gendron
Business Intelligence and the Cloud: Strategic Implementation Guide by Michael S. Gendron
Business Transformation: A Roadmap for Maximizing Organizational Insights by Aiman Zeid
Connecting Organizational Silos: Taking Knowledge Flow Management to the Next Level with Social Media by Frank Leistner
Data-Driven Healthcare: How Analytics and BI Are Transforming the Industry by Laura Madsen
Delivering Business Analytics: Practical Guidelines for Best Practice by Evan Stubbs
Demand-Driven Forecasting: A Structured Approach to Forecasting, Second Edition by Charles Chase
Demand-Driven Inventory Optimization and Replenishment: Creating a More Efficient Supply Chain by Robert A. Davis
Developing Human Capital: Using Analytics to Plan and Optimize Your Learning and Development Investments by Gene Pease, Barbara Beresford, and Lew Walker
The Executive's Guide to Enterprise Social Media Strategy: How Social Networks Are Radically Transforming Your Business by David Thomas and Mike Barlow
Economic and Business Forecasting: Analyzing and Interpreting Econometric Results by John Silvia, Azhar Iqbal, Kaylyn Swankoski, Sarah Watt, and Sam Bullard
Financial Institution Advantage and The Optimization of Information Processing by Sean C. Keenan
Foreign Currency Financial Reporting from Euros to Yen to Yuan: A Guide to Fundamental Concepts and Practical Applications by Robert Rowan
Harness Oil and Gas Big Data with Analytics: Optimize Exploration and Production with Data Driven Models by Keith Holdaway
Health Analytics: Gaining the Insights to Transform Health Care by Jason Burke
Heuristics in Analytics: A Practical Perspective of What Influences Our Analytical World by Carlos Andre Reis Pinheiro and Fiona McNeill
Human Capital Analytics: How to Harness the Potential of Your Organization's Greatest Asset by Gene Pease, Boyce Byerly, and Jac Fitz-enz
Implement, Improve and Expand Your Statewide Longitudinal Data System: Creating a Culture of Data in Education by Jamie McQuiggan and Armistead Sapp
Killer Analytics: Top 20 Metrics Missing from Your Balance Sheet by Mark Brown
Predictive Analytics for Human Resources by Jac Fitz-enz and John Mattox II
Predictive Business Analytics: Forward-Looking Capabilities to Improve Business Performance by Lawrence Maisel and Gary Cokins
Retail Analytics: The Secret Weapon by Emmett Cox
Social Network Analysis in Telecommunications by Carlos Andre Reis Pinheiro
Statistical Thinking: Improving Business Performance, Second Edition by Roger W. Hoerl and Ronald D. Snee
Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics by Bill Franks
Too Big to Ignore: The Business Case for Big Data by Phil Simon
The Value of Business Analytics: Identifying the Path to Profitability by Evan Stubbs
The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions by Phil Simon
Understanding the Predictive Analytics Lifecycle by Al Cordoba
Unleashing Your Inner Leader: An Executive Coach Tells All by Vickie Bevenour
Using Big Data Analytics: Turning Big Data into Big Money by Jared Dean
Win with Advanced Business Analytics: Creating Business Value from Your Data by Jean Paul Isson and Jesse Harriott
For more information on any of the above titles, please visit www.wiley.com.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Baesens, Bart.
Fraud analytics using descriptive, predictive, and social network techniques : a guide to data science for fraud detection / Bart Baesens, Veronique Van Vlasselaer, Wouter Verbeke.
pages cm. — (Wiley & SAS business series)
Includes bibliographical references and index.
ISBN 978-1-119-13312-4 (cloth) — ISBN 978-1-119-14682-7 (epdf) — ISBN 978-1-119-14683-4 (epub)
Figure 5.1b Schematic Representation of the Königsberg Bridges
Figure 5.2 Identity Theft. The Frequent Contact List of a Person is Suddenly Extended with Other Contacts (Light Gray Nodes). This Might Indicate that a Fraudster (Dark Gray Node) Took Over that Customer's Account and “Shares” His/Her Contacts
Figure 5.3 Network Representation
Figure 5.4 Example of a (Un)Directed Graph
Figure 5.5 Follower–Followee Relationships in a Twitter Network
Figure 5.6 Edge Representation
Figure 5.7 Example of a Fraudulent Network
Figure 5.8 An Egonet. The Ego is Surrounded by Six Alters, of Whom Two are Legitimate (White Nodes) and Four are Fraudulent (Gray Nodes)
Figure 5.9 Toy Example of Credit Card Fraud
Figure 5.10 Mathematical Representation of (a) a Sample Network; (b) the Adjacency or Connectivity Matrix; (c) the Weight Matrix; (d) the Adjacency List; and (e) the Weight List
Figure 5.11 A Real-Life Example of a Homophilic Network
Figure 5.12 A Homophilic Network
Figure 5.13 Sample Network
Figure 5.14a Degree Distribution
Figure 5.14b Illustration of the Degree Distribution for a Real-Life Network of Social Security Fraud. The Degree Distribution Follows a Power Law (log-log axes)
Figure 5.15 A 4-regular Graph
Figure 5.16 Example Social Network for a Relational Neighbor Classifier
Figure 5.17 Example Social Network for a Probabilistic Relational Neighbor Classifier
Figure 5.18 Example of Social Network Features for a Relational Logistic Regression Classifier
Figure 5.19 Example of Featurization with Features Describing Intrinsic Behavior and Behavior of the Neighborhood
Figure 5.20 Illustration of Dijkstra's Algorithm
Figure 5.21 Illustration of the Number of Connecting Paths Between Two Nodes
Figure 5.22 Illustration of Betweenness Between Communities of Nodes
Figure 5.23 PageRank Algorithm
Figure 5.24 Illustration of Iterative Process of the PageRank Algorithm
Figure 5.25 Sample Network
Figure 5.26 Community Detection for Credit Card Fraud
Figure 5.27 Iterative Bisection
Figure 5.28 Dendrogram of the Clustering of Figure 5.27 by the Girvan-Newman Algorithm. The Modularity Q is Maximized When Splitting the Network into Two Communities ABC–DEFG
Figure 5.29 Complete (a) and Partial (b) Communities
Figure 5.30 Overlapping Communities
Figure 5.31 Unipartite Graph
Figure 5.32 Bipartite Graph
Figure 5.33 Connectivity Matrix of a Bipartite Graph
Figure 5.34 A Multipartite Graph
Figure 5.35 Sample Network of Gotcha!
Figure 5.36 Exposure Score of the Resources Derived by a Propagation Algorithm. The Results are Based on a Real-Life Data Set in Social Security Fraud
Figure 5.37 Egonet in Social Security Fraud. A Company Is Associated with its Resources
Figure 5.38 ROC Curve of the Gotcha! Model, which Combines both Intrinsic and Relational Features
Figure 6.1 The Analytical Model Life Cycle
Figure 6.2 Traffic Light Indicator Approach
Figure 6.3 SAS Social Network Analysis Dashboard
Figure 6.4 SAS Social Network Analysis Claim Detail Investigation
Figure 6.5 SAS Social Network Analysis Link Detection
Figure 6.6 Distribution of Claim Amounts and Average Claim Value
Figure 6.7 Geographical Distribution of Claims
Figure 6.8 Zooming into the Geographical Distribution of Claims
Figure 6.9 Measuring the Efficiency of the Fraud-Detection Process
Figure 6.10 Evaluating the Efficiency of Fraud Investigators
Figure 7.1 RACI Matrix
Figure 7.2 Anonymizing a Database
Figure 7.3 Different SQL Views Defined for a Database
Figure 7.4 Aggregate Loss Distribution with Indication of Expected Loss, Value at Risk (VaR) at 99.9 Percent Confidence Level and Unexpected Loss
Figure 7.5 Snapshot of a Credit Card Fraud Time Series Data Set and Associated Histogram of the Fraud Amounts
Figure 7.6 Aggregate Loss Distribution Resulting from a Monte Carlo Simulation with Poisson Distributed Monthly Fraud Frequency and Associated Pareto Distributed Fraud Loss
Foreword
Fraud will always be with us. It is linked both to organized crime and to terrorism, and it inflicts substantial economic damage. The perpetrators of fraud play a dynamic cat-and-mouse game with those trying to stop them. Preventing a particular kind of fraud does not mean the fraudsters give up, but merely that they change their tactics: they are constantly on the lookout for new avenues for fraud, for new weaknesses in the system. And given that our social and financial systems are forever developing, there are always new opportunities to be exploited.
This book is a clear and comprehensive outline of the current state of the art in fraud detection and prevention methodology. It describes the data necessary to detect fraud, and then takes the reader from the basics of fraud-detection data analytics, through advanced pattern recognition methodology, to cutting-edge social network analysis and fraud ring detection.
If we cannot stop fraud altogether, an awareness of the contents of this book will at least enable readers to reduce the extent of fraud, and make it harder for criminals to take advantage of the honest. The readers' organizations, be they public or private, will be better protected if they implement the strategies described in this book. In short, this book is a valuable contribution to the well-being of society and of the people within it.
Professor David J. Hand
Imperial College, London
Preface
It is estimated that a typical organization loses about 5 percent of its revenues due to fraud each year. In this book, we will discuss how state-of-the-art descriptive, predictive, and social network analytics can be used to fight fraud by learning fraud patterns from historical data.
The focus of this book is not on the mathematics or theory, but on the practical applications. Formulas and equations will only be included when absolutely needed from a practitioner's perspective. It is also not our aim to provide exhaustive coverage of all analytical techniques previously developed but, rather, to cover the ones that really add value in a practical fraud detection setting.
Because it is targeted primarily at the business professional, the book is written in a condensed, focused way. Prerequisite knowledge consists of some basic exposure to descriptive statistics (e.g., mean, standard deviation, correlation, confidence intervals, hypothesis testing), data handling (using, for example, Microsoft Excel or SQL), and data visualization (e.g., bar plots, pie charts, histograms, scatter plots). Throughout the discussion, many examples of real-life fraud applications will be included, for example, in insurance fraud, tax evasion fraud, and credit card fraud. The authors will also integrate both their research and consulting experience throughout the various chapters. The book is aimed at (senior) data analysts, (aspiring) data scientists, consultants, analytics practitioners, and researchers (e.g., PhD candidates) starting to explore the field.
Chapter 1 sets the stage on fraud detection, prevention, and analytics. It starts by defining fraud and then zooms in on fraud detection and prevention. The impact of big data on fraud detection and the fraud analytics process model are reviewed next. The chapter concludes by summarizing the key skills of a fraud data scientist.
Chapter 2 provides an extensive discussion of the basic ingredient of any fraud analytical model: data! It introduces various types of data sources and discusses how to merge and sample them. The next sections discuss the different types of data elements, visual exploration, Benford's law, and descriptive statistics. These are all essential tools to start understanding the characteristics and limitations of the data available. Data preprocessing activities are also extensively covered: handling missing values, detecting and treating outliers, defining red flags, standardizing data, categorizing variables, weights of evidence coding, and variable selection. Principal component analysis is outlined as a technique to reduce the dimensionality of the input data. This is then further illustrated with RIDIT and PRIDIT analysis. The chapter ends by reviewing segmentation and the risks thereof.
Chapter 3 continues by exploring the use of descriptive analytics for fraud detection. The idea here is to look for unusual patterns or outliers in a fraud data set. Both graphical and statistical outlier detection procedures are reviewed first. This is followed by an overview of break-point analysis, peer group analysis, association rules, clustering, and one-class SVMs.
Chapter 4 zooms in on predictive analytics for fraud detection. We start from a labeled data set of transactions in which each transaction has a target of interest that can be either binary (e.g., fraudulent or not) or continuous (e.g., amount of fraud). We then discuss various analytical techniques to build predictive models: linear regression, logistic regression, decision trees, neural networks, support vector machines, ensemble methods, and multiclass classification techniques. The next section reviews how to measure the performance of a predictive analytical model by first deciding on the data set split and then on the performance metric. The class imbalance problem is also discussed at length. The chapter concludes by giving some performance benchmarks.
Chapter 5 introduces the reader to social network analysis and its use for fraud detection. Starting from the observation that the propensity to commit fraud is often influenced by the social neighborhood, we describe the main components of a network and illustrate how transactional data sources can be transformed into networks. In the next section, we elaborate on featurization, the process of extracting a set of meaningful features from the network. We distinguish between three main types of features: neighborhood metrics, centrality metrics, and collective inference algorithms. We then zoom in on community mining, where we aim to find groups of fraudsters closely connected in the network. By introducing multipartite graphs, we address the fact that fraud often depends on a multitude of different factors and that including all these factors in a network representation contributes to a better understanding and analysis of the detection problem at hand. The chapter concludes with a real-life example of social security fraud.
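To give a first feel for what turning transactional data into a network and extracting features from it can look like, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the method developed in Chapter 5: the card and shop identifiers, the fraud labels, and the helper names `build_network` and `neighborhood_features` are all hypothetical. It computes two simple neighborhood metrics per node, the degree and the share of neighbors already labeled as fraudulent.

```python
from collections import defaultdict


def build_network(edges):
    """Build an undirected adjacency list from (node_a, node_b) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def neighborhood_features(adjacency, known_fraud):
    """Return, per node, its degree and the share of neighbors labeled fraudulent."""
    features = {}
    for node, neighbors in adjacency.items():
        degree = len(neighbors)  # at least 1: nodes only exist via an edge
        fraud_neighbors = len(neighbors & known_fraud)
        features[node] = {
            "degree": degree,
            "fraud_neighbors": fraud_neighbors,
            "fraud_ratio": fraud_neighbors / degree,
        }
    return features


# Hypothetical toy data: each pair links a credit card to a shop it paid at.
edges = [
    ("card_1", "shop_A"),
    ("card_1", "shop_B"),
    ("card_2", "shop_B"),
    ("card_3", "shop_B"),
    ("card_3", "shop_C"),
]
known_fraud = {"card_2", "card_3"}  # assumed labels from past investigations

features = neighborhood_features(build_network(edges), known_fraud)
for node in sorted(features):
    print(node, features[node])
```

In this toy network, `shop_B` is linked to three cards, two of which are labeled fraudulent, so it stands out on the fraud ratio. Features of this kind can be appended to the intrinsic features of a node before a predictive model is trained.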
Chapter 6 deals with the postprocessing of fraud analytical models. It starts by giving an overview of the analytical fraud model lifecycle. It then discusses the traffic light indicator approach and decision tables as two popular model representations. This is followed by a set of guidelines to appropriately select the fraud sample to investigate. Fraud alert and case management are covered next. We also illustrate how visual analytics can contribute to the postprocessing activities. We describe how to backtest analytical fraud models by considering data stability, model stability, and model calibration. The chapter concludes by giving some guidelines about model design and documentation.
Chapter 7 provides a broader perspective on fraud analytics. We provide some guidelines for setting up and managing data quality programs. We zoom in on privacy and discuss various ways to ensure appropriate access to both internal and external data. We discuss how analytical fraud estimates can be used to calculate both expected and unexpected losses, which can then help to determine provisioning and capital buffers. A discussion of total cost of ownership and return on investment provides an economic perspective on fraud analytics. This is followed by a discussion of insourcing versus outsourcing of analytical model development. We briefly zoom in on some interesting modeling extensions, such as forecasting and text analytics. The potential and danger of the Internet of Things for fraud analytics are also covered. The chapter concludes by giving some recommendations for corporate fraud governance.
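As a rough illustration of the expected and unexpected loss calculation mentioned above, the following Python sketch simulates an aggregate loss distribution with a Poisson distributed monthly fraud frequency and Pareto distributed fraud amounts, in the spirit of the captions of Figures 7.4 and 7.6. All parameter values (`lam`, `alpha`, `scale`), the helper names, and the choice to take unexpected loss as Value at Risk minus expected loss are assumptions made for this sketch; they are not taken from the book's data.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw a Poisson(lam) count with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_aggregate_losses(n_periods, lam, alpha, scale, seed=0):
    """Simulate total fraud loss per period: Poisson frequency, Pareto severity."""
    rng = random.Random(seed)
    losses = []
    for _ in range(n_periods):
        n_frauds = sample_poisson(rng, lam)
        # random.paretovariate has minimum 1, so scale sets the smallest loss.
        total = sum(scale * rng.paretovariate(alpha) for _ in range(n_frauds))
        losses.append(total)
    return losses


def loss_summary(losses, confidence=0.999):
    """Return (expected loss, Value at Risk, unexpected loss = VaR - expected loss)."""
    ordered = sorted(losses)
    expected_loss = sum(ordered) / len(ordered)
    index = min(len(ordered) - 1, math.ceil(confidence * len(ordered)) - 1)
    value_at_risk = ordered[index]  # empirical quantile of the simulated losses
    return expected_loss, value_at_risk, value_at_risk - expected_loss


# Hypothetical parameters: 5 frauds per month on average, minimum loss of 100.
losses = simulate_aggregate_losses(n_periods=20000, lam=5.0, alpha=3.0, scale=100.0)
expected_loss, value_at_risk, unexpected_loss = loss_summary(losses)
print(f"Expected loss:   {expected_loss:,.0f}")
print(f"VaR (99.9%):     {value_at_risk:,.0f}")
print(f"Unexpected loss: {unexpected_loss:,.0f}")
```

In a sketch like this, the expected loss would typically inform provisioning, while the gap between the high quantile and the expected loss would inform the capital buffer. The heavy tail of the Pareto distribution is what pushes the 99.9 percent quantile far above the mean.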
Acknowledgments
It is a great pleasure to acknowledge the contributions and assistance of various colleagues, friends, and fellow analytics lovers to the writing of this book. This book is the result of many years of research and teaching in analytics, risk management, and fraud. We first would like to thank our publisher, John Wiley & Sons, for accepting our book proposal less than one year ago.
We are grateful to the active and lively analytics and fraud detection community for providing various user fora, blogs, online lectures, and tutorials, which proved very helpful.
We would also like to acknowledge the direct and indirect contributions of the many colleagues, fellow professors, students, researchers, and friends with whom we collaborated during the past years.
Last but not least, we are grateful to our partners, parents, and families for their love, support, and encouragement.
We have tried to make this book as complete, accurate, and enjoyable as possible. Of course, what really matters is what you, the reader, think of it. Please let us know your views by getting in touch. The authors welcome all feedback and comments—so do not hesitate to let us know your thoughts!
Bart Baesens
Véronique Van Vlasselaer
Wouter Verbeke
August 2015