Data Driven Transaction Fraud Detection

jean-baptiste charraud
May 3, 2021

Combining a Gaussian Mixture Model and Gradient Boosting to evaluate transaction fraud risk

I. Challenge

Credit cards are nowadays essential tools for everyday transactions, and their use keeps growing. In 2019, they represented 48 percent of non-cash transactions in Europe by volume (see [1]). This growth has been further accelerated by the COVID-19 pandemic and the resulting shift towards internet purchases and contactless payments. Credit cards are thus becoming the principal payment solution for daily purchases, and controlling the risk of credit card fraud becomes a real concern. In this work, we propose a model that evaluates the fraud risk of a transaction. It combines a Gaussian Mixture Model (GMM) with a Gradient Boosting Tree Classifier (GBTC), the latter being used to confirm "unsure" GMM predictions. The following approach is followed:

  1. Build training and testing databases having balanced label distributions.
  2. Build each model independently and select the most meaningful features.
  3. Combine the GMM and GBTC algorithms to obtain a better model.

II. Data Preprocessing

In this work, we analyze a database of 284,807 transactions made by European cardholders in September 2013 [2]. This database is related to the works cited in [3] to [10]. For confidentiality reasons, the original features of each transaction are not given directly: they have been transformed by a Principal Component Analysis (PCA). A given transaction is thus characterized by 28 PCA components, in addition to the transaction amount and the time. A challenging aspect of this database is its unbalanced nature: among the 284,807 transactions, only 492 correspond to frauds.

In this section, we first recall what PCA is. We then present how we deal with the unbalanced database to build the machine learning models.

II.1 Principal Component Analysis

Principal Component Analysis (PCA) aims at identifying the directions that best explain the data variability (i.e. the directions capturing most of the data variance). These directions, called the "components", are linear combinations of the initial features. Once identified, they can be used to reduce the dimensionality of the problem: the data are projected onto the subspace spanned by the most meaningful directions. FIG 1 shows a 2D illustration. The first PCA component of the data is drawn in red; it is indeed the direction along which the point cloud extends the most. Dimensionality reduction here corresponds to projecting the points onto the red line.

FIG 1: Example of a PCA in a 2D space.

More precisely, let X be the feature matrix gathering all the data, where each column represents a feature (optionally, the data are first normalized by subtracting each feature's mean and dividing by its standard deviation). PCA consists in diagonalizing the empirical covariance matrix:

C = \frac{1}{n-1} X^{T} X

where n is the number of data points. The eigenvectors are ranked in descending order of their associated eigenvalues. The first vectors correspond to the directions explaining most of the variance and thus determine the best subspace on which to project the data while preserving as much information as possible about their variability. The projected data P are obtained by the operation:

P = X T

where T is the matrix gathering in its columns the selected eigenvectors (i.e. the PCA directions kept for the projection).

These are the operations that were applied to the transactions' initial features to obtain the database considered in this work.
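As a minimal sketch of these operations (assuming a generic numpy feature matrix X; this is not the dataset's actual preprocessing, whose exact components are not published), the diagonalization and projection can be written as:

import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top n_components principal directions."""
    # Center and scale the features
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    # Covariance matrix C = X^T X / (n-1)
    C = np.cov(Xc, rowvar=False)
    # Diagonalize and rank the eigenvectors by decreasing eigenvalue
    eigval, eigvec = np.linalg.eigh(C)
    order = np.argsort(eigval)[::-1]
    T = eigvec[:, order[:n_components]]
    # Projected data P = X T
    return Xc @ T

# Example on random data
P = pca_project(np.random.rand(100, 5), n_components=2)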

II.2 Build Balanced Datasets

A first glimpse at the data reveals how deeply unbalanced the labels are. Normal transactions correspond to label zero and frauds to label one. As FIG 2 shows, only a few hundred transactions are associated with frauds.

import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
from scipy.cluster import hierarchy
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.metrics import confusion_matrix
from sklearn.mixture import GaussianMixture
from sklearn.metrics import accuracy_score

# Load and shuffle the data
data_df=pd.read_csv("creditcard.csv").sample(frac=1)
# Plot labels' distribution
label_fig,ax=plt.subplots(1,figsize=(10,8))
seafig=sn.histplot(data_df,x="Class")
ax.set_xlabel("Class",fontsize=15)
ax.set_ylabel("Count",fontsize=15)
label_fig.savefig("label.png")
FIG 2: Labels’ distribution of the initial dataset.

If we don’t build more balanced datasets, we risk to develop machine learning models which will learn essentially data corresponding to normal transactions and thus unable to recognize a fraud. In other words the model might give a lot of false negative predictions. We thus decide to construct more balanced subdatabases for both training and testing sets. To realize it the following process in enforced:

  1. Select the data corresponding to frauds.
  2. Optionally duplicate these data.
  3. Randomly select data corresponding to normal transactions; the number of selected data equals the number of frauds times a factor above one.
  4. Split the fraud and non fraud data between training and testing sets.
  5. Merge the training (resp. testing) fraud and training (resp. testing) non-fraud data to obtain the final training and testing sets.

The following function makes this balanced database construction possible.

def buildeqdtb(data_df,nrepeat=1,trainrate1=0.8,
               rate0labeltrain=1.2,rate0labeltest=2):
    """
    Build balanced training and testing data sets
    Input:
        data_df [DataFrame]: All the data
        nrepeat [Int]: Number of duplications of the one-labeled
                       data in the training set
        trainrate1 [Float]: Ratio of one-labeled data assigned
                            to the training set
        rate0labeltrain [Float]: Ratio of zero-labeled over
                                 one-labeled data in the training set
        rate0labeltest [Float]: Ratio of zero-labeled over
                                one-labeled data in the testing set
    Output:
        data_df_train [DataFrame]: Training data set
        data_df_test [DataFrame]: Testing data set
    """
    # Split between one and zero labels
    data0_df=data_df[data_df["Class"]==0]
    data1_df=data_df[data_df["Class"]==1]
    # Determine dataset sizes
    trainsize1=int(trainrate1*len(data1_df.index))
    testsize1=int((1-trainrate1)*len(data1_df.index))
    trainsize2=int(rate0labeltrain*trainsize1)
    testsize2=int(rate0labeltest*testsize1)
    # Select the data
    data1_df_train=data1_df.iloc[:trainsize1]
    # Possibly duplicate the one-labeled data in the training set
    data1_df_train=pd.concat([data1_df_train]*nrepeat,axis=0)
    data0_df_train=data0_df.iloc[:trainsize2]
    data1_df_test=data1_df.iloc[trainsize1:]
    data0_df_test=data0_df.iloc[trainsize2:trainsize2+testsize2]
    # Concatenate the zero- and one-labeled databases for training
    # and testing
    data_df_train=pd.concat([data1_df_train,data0_df_train],
                            axis=0).sample(frac=1)
    data_df_test=pd.concat([data1_df_test,data0_df_test],
                           axis=0).sample(frac=1)
    # Plot the labels' distributions
    label_train,ax=plt.subplots(1,figsize=(10,8))
    seafig=sn.histplot(data_df_train,x="Class")
    ax.set_xlabel("Class",fontsize=15)
    ax.set_ylabel("Count",fontsize=15)
    label_train.savefig("label_train.png")
    label_test,ax=plt.subplots(1,figsize=(10,8))
    seafig=sn.histplot(data_df_test,x="Class")
    ax.set_xlabel("Class",fontsize=15)
    ax.set_ylabel("Count",fontsize=15)
    label_test.savefig("label_test.png")
    return data_df_train,data_df_test

Using a ratio of zero-labeled to one-labeled data of 1.2 for training and 2.0 for testing, FIG 3 and FIG 4 show the labels' distributions of the resulting training and testing sets.

data_df_train,data_df_test=buildeqdtb(data_df,nrepeat=1,
    trainrate1=0.8,rate0labeltrain=1.2,rate0labeltest=2)
# Separate features and labels
label_df_train=data_df_train["Class"]
feat_df_train=data_df_train.drop("Class",axis=1)
label_df_test=data_df_test["Class"]
feat_df_test=data_df_test.drop("Class",axis=1)
FIG 3: Labels’ distribution of the training dataset.
FIG 4: Labels' distribution of the testing dataset.

These databases will thus be used to fit the different machine learning models.

III. A First Try with Gaussian Mixture Models

In this section, we make a first attempt at classifying the data between frauds and normal transactions, using all the problem's dimensions and a Gaussian Mixture Model (GMM). We first summarize the theory behind GMMs and then present how the model is implemented to classify the transaction data.

III.1 Gaussian Mixture Model, Theory

A Gaussian Mixture Model (GMM) is an unsupervised machine learning algorithm. "Unsupervised" means that no label is used to fit the model, which is asked to infer them by itself. The main objective of a GMM is to approximate the true data probability distribution by a combination of Gaussians. The number N of these Gaussians is a hyperparameter of the model and needs to be fixed beforehand. The optimal number can be estimated with different criteria such as the silhouette score, the AIC/BIC criteria, the Davies-Bouldin score or the Calinski-Harabasz score.
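As an illustration of this hyperparameter selection, here is a minimal sketch scanning the BIC criterion exposed by scikit-learn's GaussianMixture over candidate values of N (the variable X stands for any feature matrix; in the rest of this work N is simply fixed to two):

# Choose the number of Gaussians with the BIC criterion (sketch)
import numpy as np
from sklearn.mixture import GaussianMixture

def select_n_components(X, n_max=10):
    # Fit one GMM per candidate number of components and keep the
    # one minimizing the Bayesian Information Criterion (BIC)
    bics = [GaussianMixture(n_components=n, random_state=0).fit(X).bic(X)
            for n in range(1, n_max + 1)]
    return int(np.argmin(bics)) + 1

# Example: best_n = select_n_components(feat_df_train[feat_list].values)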

Consider now a converged GMM model, which gives an estimated probability distribution f(x). It is expressed as:

f(x) = \sum_{i=1}^{N} \alpha_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i)

From this expression of f(x), it can be deduced that each Gaussian is associated with a label y_i inferred by the model, and can be identified as the probability of obtaining the data x given that it is associated with label y_i. In other words:

\mathcal{N}(x \mid \mu_i, \Sigma_i) = P(x \mid y_i)

Thanks to this identification, the weights \alpha_i can be interpreted as the labels' probabilities, and as a consequence:

\alpha_i = P(y_i), \qquad \sum_{i=1}^{N} \alpha_i = 1

For a given data point x, the Gaussian having the highest contribution in f(x) gives the inferred label of x. In other words, the label y of a fixed data point x is obtained by:

y = \underset{y_i}{\mathrm{argmax}} \; \alpha_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i)

From a geometrical point of view, a converged GMM leads to different ellipses whose centers represent the Gaussians' means and whose axes reflect the Gaussians' covariance matrices. Each ellipse corresponds to a cluster of data. FIG 5 shows an example of GMM clustering: four groups are visually present in the initial data, and the GMM converges to Gaussians centered on each of them. Each color is associated with one Gaussian, and the model thus succeeds in recognizing the four distinct groups of data.

FIG 5: An example of a 2D GMM clustering.

Training a GMM consists of successively updating the labels' probabilities as well as the parameters of each Gaussian so as to reach the "best" f(x). "Best" here means maximizing the log-likelihood, defined over the training data x_1, ..., x_n as:

\mathcal{L} = \sum_{k=1}^{n} \log f(x_k) = \sum_{k=1}^{n} \log \left( \sum_{i=1}^{N} \alpha_i \, \mathcal{N}(x_k \mid \mu_i, \Sigma_i) \right)

This optimization is performed with the Expectation-Maximization algorithm, which alternates between two steps:

  1. Expectation: From the current estimated distribution, determine for each data point its label (i.e. the Gaussian contributing the most to the estimated probability f(x)). This step consists of applying, for each data point x, the operation:

y(x) = \underset{y_i}{\mathrm{argmax}} \; \alpha_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i)

  2. Maximization: From the newly assigned labels, update the parameters of the Gaussians as well as the labels' probabilities. More precisely, for each label y_i, gather all the data points assigned to it in step 1 (their number is noted N_{y_i}). From these data, the mean vector and covariance matrix are estimated by:

\mu_{y_i} = \frac{1}{N_{y_i}} \sum_{x : y(x) = y_i} x, \qquad \Sigma_{y_i} = \frac{1}{N_{y_i}} \sum_{x : y(x) = y_i} (x - \mu_{y_i})(x - \mu_{y_i})^{T}

Finally, the label's probability is estimated by:

\alpha_{y_i} = \frac{N_{y_i}}{n}
To sum up, a GMM model enables two things:

  1. Deduce by itself clusters (i.e. labels) of close or similar data.
  2. Approximate the true data probability distribution. This estimation makes it possible to associate a probability with the GMM inferred labels. Another application is to generate new data from the fitted GMM distribution, which exploits the inherent structure of the database learned by the model (see the sketch below).

You may find more details about GMM in [11] or [12].
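As a small illustration of this second point, scikit-learn's GaussianMixture exposes both the label probabilities and a sampler for synthetic data. The sketch below assumes some already available feature matrix X (a random placeholder here), not the fraud data themselves:

# Sketch: probabilities and synthetic data from a fitted GMM
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(500, 2)            # placeholder for real features
gmm = GaussianMixture(n_components=2).fit(X)

proba = gmm.predict_proba(X)          # probability of each inferred label
X_new, y_new = gmm.sample(100)        # 100 synthetic points drawn from f(x)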

As a first try, we apply this model using all 30 dimensions of the problem.

III.2 GMM’s Fit and Predictions

The aim here is to use a GMM to identify two clusters of data: a first one associated with the normal transactions, the second one with the frauds. As we want only two clusters, the number of Gaussians (n_components in the code) is fixed to two. Since the GMM deduces its labels by itself, the GMM-inferred label 0 may actually correspond to the frauds, which carry label 1 in the database. That is why the code checks whether such a permutation occurs, through the boolean "permutelabel". The following functions train a GMM model and realize label predictions.

def trainGmm(feat_df_train,label_df_train,feat_list,
             normalize=False):
    """
    Train a Gaussian Mixture Model (GMM)
    Input:
        feat_df_train [DataFrame]: All the training data features
        label_df_train [DataFrame]: Training labels
        feat_list [List]: List of the features' names considered
        normalize [Bool]: True if the features are normalized
                          before the GMM's training
    Output:
        Gmm [Sklearn Object]: The trained GMM model
        permutelabel [Bool]: False if the labels' indices deduced
                             from the GMM correspond to those of
                             label_df_train, True if they are permuted
        StdScale [Sklearn Object]: (if normalize is True)
                                   The means and standard deviations
                                   associated to all the features
    """
    permutelabel=False
    Gmm=GaussianMixture(n_components=2)
    # If the data need to be normalized beforehand
    if normalize:
        StdScale=StandardScaler()
        feat_df_trainSN=StdScale.fit_transform(feat_df_train[feat_list])
        labelgmm_list=Gmm.fit_predict(feat_df_trainSN)
    else:
        labelgmm_list=Gmm.fit_predict(feat_df_train[feat_list])
    # Compute accuracy
    acc=accuracy_score(label_df_train.values,labelgmm_list)
    # Compute the accuracy with the GMM's labels permuted
    labelgmmperm_list=[0 if label==1 else 1 for label in labelgmm_list]
    accperm=accuracy_score(label_df_train.values,labelgmmperm_list)
    # If the accuracy with the permuted GMM's labels is ten times
    # that of the non permuted ones, it means the GMM's labels and
    # those of label_df_train are inverted
    if accperm>10*acc:
        permutelabel=True
    if normalize:
        return Gmm, permutelabel, StdScale
    else:
        return Gmm, permutelabel
def predGmm(Gmm,feat_df,feat_list,permutelabel,
            distribproba=False,label_df=None,normalize=False,StdScale=False):
    """
    Realize predictions with a trained GMM model
    Input:
        Gmm [Sklearn Object]: A trained GMM model
        feat_df [DataFrame]: All the features of the input data
        feat_list [List]: The features' names considered for predictions
        permutelabel [Bool]: False if the labels' indices deduced from
                             the GMM correspond to those of label_df_train,
                             True if they are permuted
        distribproba [Bool]: True to plot the probability distribution
                             associated to the predictions
                             (needs the true labels)
        label_df [DataFrame]: The true labels
        normalize [Bool]: True if the features are normalized before
                          the GMM's predictions
        StdScale [Sklearn Object]: The means and standard deviations
                                   associated to all the features
    Output:
        labelpred_list [List]: List of the predictions
        probapred_list [List]: List of the probabilities associated
                               to each prediction
    """
    # If normalization needs to be enforced beforehand
    if normalize:
        feat_dfN=StdScale.transform(feat_df[feat_list])
        labelpred_list=Gmm.predict(feat_dfN)
        probapred_list=Gmm.predict_proba(feat_dfN)
    else:
        labelpred_list=Gmm.predict(feat_df[feat_list])
        probapred_list=Gmm.predict_proba(feat_df[feat_list])
    # If the GMM's labels have to be permuted
    if permutelabel:
        labelpred_list=[0 if label==1 else 1 for label in labelpred_list]
    # In the case of known true labels, plot the probability
    # distribution associated to the predictions, for both
    # correct and wrong predictions
    if distribproba:
        labelpred_arr=np.array(labelpred_list)
        falseproba_list=probapred_list[np.where(labelpred_arr!=label_df.values)]
        correctproba_list=probapred_list[np.where(labelpred_arr==label_df.values)]
        falsepred_fig,ax=plt.subplots(1,figsize=(10,8))
        plt.hist(falseproba_list[:,0],bins=np.arange(0,1,0.01),
                 density=True,color="red",label="False")
        plt.hist(correctproba_list[:,0],bins=np.arange(0,1,0.01),
                 density=True,color="navy",label="True")
        plt.yticks(np.arange(0,100,10),fontsize=15)
        plt.xticks(np.arange(0,1.01,0.1),fontsize=15)
        plt.xlabel("Predictions' Probability (Label 0)",fontsize=15)
        plt.ylabel("Number of Predictions",fontsize=15)
        plt.legend()
        falsepred_fig.savefig("falsepred.pdf")
    return labelpred_list,probapred_list

We now train and test a GMM model using all 30 dimensions of the problem. After checking whether the GMM inferred labels are permuted, we evaluate the testing accuracy.

# Realize GMM clustering using all the features
feat_list=list(feat_df_train.columns)
# Train GMM model
GmmAll, permutelabel, StdScaleAll=\
trainGmm(feat_df_train,label_df_train,feat_list,normalize=True)
# Make predictions
labelpred_list,probapred_list=predGmm(GmmAll,\
feat_df_test,feat_list,permutelabel,distribproba=False,\
label_df=None,normalize=True,StdScale=StdScaleAll)
# Compute accuracy
accgmmall=accuracy_score(label_df_test,labelpred_list)
print("Testing GMM-30D Accuracy")
print(accgmmall)

Testing GMM-30D Accuracy
0.7791864406779661

In addition, we compute the confusion matrix on the testing set. Despite the relatively good accuracy reached, the model is bad at predicting frauds: only 34 percent of the fraud data are correctly classified (see FIG 6). Even though the training database was built with a balanced proportion of fraud and non-fraud data, the model tends to produce a lot of false negative results. This behavior is probably due to the large number of dimensions considered (30), which leads to a strong dispersion of the fraud data. As a result, the model is not able to properly determine a cluster associated with the fraud data.

def plotconfmat(truelabel_df,pred_list,figname="confmat.png"):
    """
    Plot a confusion matrix
    Input:
        truelabel_df [DataFrame]: True labels
        pred_list [List]: List of the predictions
        figname [String]: Path of the saved figure
    """
    fig, ax = plt.subplots(1,figsize=(10,8))
    confmat=confusion_matrix(truelabel_df,pred_list,normalize="true")
    im=ax.matshow(confmat)
    plt.colorbar(im,label="Accuracy")
    for (i, j), z in np.ndenumerate(confmat):
        ax.text(j, i, '{:0.4f}'.format(z), ha='center',
                va='center',c="red",fontsize=20)
    plt.xlabel("Predicted Label",fontsize=20)
    plt.ylabel("True Label",fontsize=20)
    plt.xticks(fontsize=20)
    plt.yticks(fontsize=20)
    fig.savefig(figname)
    plt.show()

plotconfmat(label_df_test,labelpred_list,
            figname="gmm30dconfmat.png")
FIG 6: Confusion matrix of the GMM model using all 30 dimensions.

We propose in what follows to improve the model by reducing the problem’s dimensionality thanks to feature selection.

IV. Gradient Boosting Tree and Feature Selection

We now identify the most influential features by combining feature importances deduced from a Gradient Boosting Tree Classifier (GBTC) with a Hierarchical Clustering (HC) algorithm. We first determine an optimal number of estimators for the GBTC. We then build clusters of correlated features with HC and use the GBTC feature importances to select the two most meaningful features.

The principles of the GBTC and HC algorithms have already been described in a former article (see [13]).

IV.1 Set the Best Number of Estimators in GBTC

We first introduce a function that trains a GBTC model and evaluates the best number of estimators.

def trainGbc(param_dict,feat_df_train,
             label_df_train,fixnestim=False,feat_df_test=None,
             label_df_test=None,crossvalidate=False,plotfeatimp=False):
    """
    Train a Gradient Boosting Tree Classifier
    Input:
        param_dict [Dict]: Parameters' dictionary for the GBTC estimator
        feat_df_train [DataFrame]: The training data (features)
        label_df_train [DataFrame]: Labels of the training data
        fixnestim [Bool]: True if the best number of estimators
                          has to be determined
        feat_df_test [DataFrame]: (if fixnestim is True)
                                  The testing data (features)
        label_df_test [DataFrame]: (if fixnestim is True)
                                   Labels of the testing data
        crossvalidate [Bool]: True if a cross validation is performed
        plotfeatimp [Bool]: True to plot the feature importances
    Output:
        Gbc [Sklearn Object]: A trained GBTC estimator
        featimp_list [List]: Features' list rearranged following
                             their importances
    """
    # Train a GBTC model
    Gbc=GradientBoostingClassifier(**param_dict)
    Gbc.fit(feat_df_train,label_df_train)
    featimp_list=feat_df_train.columns[np.argsort(Gbc.feature_importances_)]
    # If a cross-validation is enforced
    if crossvalidate:
        score=cross_validate(Gbc,feat_df_train,
                             label_df_train,cv=10,scoring="accuracy")
        print("Mean CV Accuracy, Testing")
        print(np.mean(score["test_score"]))
    # If the best number of estimators has to be determined
    if fixnestim:
        acctrain_list=[]
        acctest_list=[]
        # Compute the training accuracies
        for predtrain in Gbc.staged_predict(feat_df_train):
            acctrain=accuracy_score(label_df_train,predtrain)
            acctrain_list.append(acctrain)
        # Compute the testing accuracies
        for predtest in Gbc.staged_predict(feat_df_test):
            acctest=accuracy_score(label_df_test,predtest)
            acctest_list.append(acctest)
        nestim_range=np.arange(len(acctest_list))
        acc_fig,ax=plt.subplots(1,figsize=(10,8))
        plt.plot(nestim_range,acctrain_list,label="Training Accuracy",color="red")
        plt.plot(nestim_range,acctest_list,label="Testing Accuracy",color="blue")
        plt.xlabel("n_estim",fontsize=20)
        plt.ylabel("Accuracy",fontsize=20)
        plt.xticks(fontsize=15)
        plt.yticks(fontsize=15)
        plt.legend()
        acc_fig.savefig("accgbc.png")
    # Plot feature importances
    if plotfeatimp:
        impfeat_fig,ax=plt.subplots(1,figsize=(10,8))
        pos=np.arange(len(featimp_list))+0.5
        plt.barh(pos,np.sort(Gbc.feature_importances_))
        plt.yticks(pos,featimp_list,fontsize=14)
        plt.xticks(fontsize=15)
        plt.xlabel("Importance",fontsize=15)
        impfeat_fig.savefig("impfeat.pdf")
    return Gbc,featimp_list

We plot the training and testing accuracies as a function of the number of estimators. It can be seen that fixing the number of estimators to 50 is enough to reach training and testing accuracies around 0.95.

param_dict={"n_estimators":300,"min_samples_split":4,
            "max_depth":3,"learning_rate":0.01}
Gbc,featimp_list=trainGbc(param_dict,feat_df_train,label_df_train,fixnestim=True,
                          feat_df_test=feat_df_test,label_df_test=label_df_test)
FIG 7: Evolution of the training and testing accuracies of the GBTC algorithm as a function of the number of estimators.

We check that, with n_estimators fixed to 50, the cross-validation testing accuracy is 0.935 (see FIG 7).

param_dict={"n_estimators":50,"min_samples_split":4,
            "max_depth":3,"learning_rate":0.01}
Gbc,featimp_list=trainGbc(param_dict,feat_df_train,
                          label_df_train,crossvalidate=True,plotfeatimp=False)
Mean CV Accuracy, Testing
0.9351911253675487

We also compute the accuracy on the testing set, which reaches 0.942.

predGbc_test=Gbc.predict(feat_df_test)
accgbctest=accuracy_score(label_df_test,predGbc_test)
print("Testing GBTC Accuracy")
print(accgbctest)

Testing GBTC Accuracy
0.9427294840951618

As with GMM, we plot the confusion matrix.

gbcpred_list=Gbc.predict(feat_df_test)
plotconfmat(label_df_test,gbcpred_list,figname="gbcconfmat.png")
FIG 8: Confusion matrix obtained on the testing set with the GBTC model.

The GBTC classification is better than the one previously obtained with GMM: the accuracy goes from 0.779 to 0.942. Moreover, the confusion matrix shows the GBTC classifies 89 percent of the fraud data correctly (see FIG 8). We thus do not have the problem of numerous false negative predictions encountered with GMM.

IV.2 Feature Selection

We now perform the feature selection. A first step is to group the features into clusters by evaluating the correlations between them. This feature clustering is done with a HC algorithm (see [13] for more details about this methodology).

def correlanalyze(feat_df):
    """
    HC clustering based on features' correlations
    Input:
        feat_df [DataFrame]: All the features of the input data
    """
    hctree_fig,ax1=plt.subplots(1,1,figsize=(10,8))
    correl_df=feat_df.corr()
    Links=hierarchy.ward(correl_df)
    Tree=hierarchy.dendrogram(Links,labels=feat_df.columns,ax=ax1)
    hctree_fig.tight_layout()
    hctree_fig.savefig("hctree.png")
    heatmap_fig,ax2=plt.subplots(1,1,figsize=(10,8))
    dendro_index=np.arange(0,len(Tree["ivl"]))
    sn.heatmap(correl_df,cmap="viridis")
    ax2.set_xticklabels(Tree["ivl"])
    ax2.set_yticklabels(Tree["ivl"])
    ax2.set_xticks(dendro_index)
    ax2.set_yticks(dendro_index)
    heatmap_fig.savefig("heatmap.png")

correlanalyze(feat_df_train)
FIG 9: Top: the HC tree deduced from the correlations between the features. Bottom: the features' correlation heat map.

Cutting the HC tree at a level of 4 (see FIG 9, top) identifies three feature clusters. In addition, we evaluate the feature importances with the GBTC algorithm.

param_dict={"n_estimators":50,"min_samples_split":4,
            "max_depth":3,"learning_rate":0.01}
Gbc,featimp_list=trainGbc(param_dict,feat_df_train,
                          label_df_train,crossvalidate=False,plotfeatimp=True)
FIG 10: Feature importances deduced from GBTC algorithm.

Similarly to what was done in [13], it can be seen that V14 and V17 are the two most important features according to the GBTC. However, the HC algorithm shows that V14 and V17 belong to the same feature cluster (see FIG 9), whereas V14 and V4 are in different clusters. That is why V14 and V4 are the two features selected; a programmatic sketch of this selection rule is given below.
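The same rule (keep the most important features while avoiding features from the same correlation cluster) can be sketched with scipy's fcluster. The cut level of 4 and the variable names reuse those introduced above; this is only an illustration of the selection logic, not code from the original analysis:

# Sketch: pick the top features while avoiding correlated clusters
import numpy as np
from scipy.cluster import hierarchy

def select_features(feat_df, importances, n_select=2, cut_level=4):
    # Cluster the features from their correlation matrix, as above
    links = hierarchy.ward(feat_df.corr())
    clusters = hierarchy.fcluster(links, t=cut_level, criterion="distance")
    cluster_of = dict(zip(feat_df.columns, clusters))
    # Walk the features by decreasing importance, keeping at most
    # one feature per cluster
    ranked = feat_df.columns[np.argsort(importances)[::-1]]
    selected, used_clusters = [], set()
    for feat in ranked:
        if cluster_of[feat] not in used_clusters:
            selected.append(feat)
            used_clusters.add(cluster_of[feat])
        if len(selected) == n_select:
            break
    return selected

# Example: select_features(feat_df_train, Gbc.feature_importances_)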

V. GMM with the Selected Features

We now apply the GMM again, this time not on all the features but only on the two selected ones (V14 and V4). As mentioned in III.1, a GMM fits a probability density. We thus represent in the (V4, V14) plane the estimated fraud probability.

def plotgmmpred(Gmm,data_df_train,data_df_test,
                feat_list,nbpoints):
    """
    Plot the GMM's predictions in the 2D space formed by
    the two most important features
    Input:
        Gmm [Sklearn Object]: A trained GMM model
        data_df_train [DataFrame]: Training data (features and labels)
        data_df_test [DataFrame]: Testing data (features and labels)
        feat_list [List]: List of the features' names considered
                          for the GMM's predictions
        nbpoints [Int]: Number of sampling points in each feature dimension
    """
    # Determine maximal and minimal features' values
    feat1max=np.max(data_df_train[feat_list[0]])
    feat1min=2*np.min(data_df_train[feat_list[0]])
    feat2max=np.max(data_df_train[feat_list[1]])
    feat2min=np.min(data_df_train[feat_list[1]])
    featnorm1_arr=np.arange(feat1min,feat1max,(feat1max-feat1min)/nbpoints)
    featnorm2_arr=np.arange(feat2min,feat2max,(feat2max-feat2min)/nbpoints)
    # The points in the 2D space where the GMM will be evaluated
    pointstopred_arr=np.array([[feat1,feat2] for feat1 in featnorm1_arr
                               for feat2 in featnorm2_arr])
    pointstopred_df=pd.DataFrame(pointstopred_arr,columns=feat_list)
    # Realize GMM predictions
    finalpred_list,finalproba_list=predGmm(Gmm,pointstopred_df,
        feat_list,permutelabel,normalize=True,StdScale=StdScale)
    # Plot the GMM probability predictions in the 2D space
    gmmcontpred_fig,ax=plt.subplots(1,figsize=(10,8))
    # If the GMM inferred labels are permuted, take the
    # probabilities associated to label 0 instead
    if permutelabel:
        im=plt.scatter(pointstopred_arr[:,0],pointstopred_arr[:,1],
                       c=np.array(finalproba_list)[:,0],cmap=cm.coolwarm,alpha=1.0)
    else:
        im=plt.scatter(pointstopred_arr[:,0],pointstopred_arr[:,1],
                       c=np.array(finalproba_list)[:,1],cmap=cm.coolwarm,alpha=1.0)
    plt.colorbar(im,label="Fraud Probability")
    plt.scatter(data_df_train[data_df_train["Class"]==0][feat_list[0]],
                data_df_train[data_df_train["Class"]==0][feat_list[1]],
                c="navy",s=12,label="Train No Fraud")
    plt.scatter(data_df_train[data_df_train["Class"]==1][feat_list[0]],
                data_df_train[data_df_train["Class"]==1][feat_list[1]],
                c="red",s=12,label="Train Fraud")
    plt.scatter(data_df_test[data_df_test["Class"]==1][feat_list[0]],
                data_df_test[data_df_test["Class"]==1][feat_list[1]],
                c="orange",s=12,marker=">",label="Test Fraud")
    plt.scatter(data_df_test[data_df_test["Class"]==0][feat_list[0]],
                data_df_test[data_df_test["Class"]==0][feat_list[1]],
                c="blue",s=12,marker=">",label="Test No Fraud")
    plt.xlim(feat1min,feat1max)
    plt.ylim(feat2min,feat2max)
    plt.xlabel(feat_list[0],fontsize=12)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    plt.ylabel(feat_list[1],fontsize=12)
    plt.legend()
    gmmcontpred_fig.savefig("gmmcontpred.png")


nbpoints=250
feat_list=["V4","V14"]
Gmm, permutelabel,StdScale=trainGmm(feat_df_train,label_df_train,
                                    feat_list,normalize=True)
gmmpred_test,probapred_list=predGmm(Gmm,feat_df_test,feat_list,permutelabel,
    distribproba=False,label_df=None,normalize=True,StdScale=StdScale)
plotgmmpred(Gmm,data_df_train,data_df_test,feat_list,nbpoints)
FIG 11: Fraud probability predicted by the GMM in the (V4,V14) plane. The points of both training and testing sets are also shown with their true labels.

FIG 11 presents the fraud probability map predicted by the GMM, together with the points of the training and testing sets and their true labels. We can see that one cluster found by the GMM concentrates the majority of the normal transactions; this same cluster is associated with the lowest predicted fraud probability. As hypothesized in III.2, the cluster gathering mainly the frauds is made of sparse data. We compute the accuracy on the testing set:

accgmm2dtest=accuracy_score(label_df_test,gmmpred_test)
print("Testing GMM-2D Accuracy")
print(accgmm2dtest)

Testing GMM-2D Accuracy
0.9352365846792345

We plot the confusion matrix.

plotconfmat(label_df_test,gmmpred_test,figname="gmm2dconfmat.png")
FIG 12: Confusion matrix on the testing set with the GMM model in the 2D space.

We can see that the new GMM model (called GMM-2D in what follows), built in the 2D space formed by V14 and V4, is better than the first one using all 30 dimensions. The accuracy goes up from 0.779 to 0.935, and 89 percent of the fraud data are now correctly classified (see FIG 12).

VI. Combine GMM and GBTC

In this part, we describe how the GMM and GBTC algorithms are combined. We then compare the different models involved in this work.

VI.1 Improvement Strategy

Looking at the testing accuracies, the GBTC algorithm remains better than the GMM using the two selected features. However, it uses all 30 features and cannot evaluate a probability associated with its predictions. An idea is thus to combine the complementary advantages of these models (better accuracy for the GBTC, simpler model and probability estimation for the GMM). For this purpose, the probabilities given by the GMM model can be employed. The following strategy is considered when predicting the label of a data point:

  1. Make a prediction with the GMM model.
  2. Assess the GMM estimated probability that this data point is a normal transaction (label 0).
  3. If this probability is sufficiently low or high, the GMM model is "sure" of its prediction and does not need any confirmation.
  4. Otherwise, the GMM model is not so "sure" of its prediction, and the GBTC model is employed to confirm it. If the GBTC predicts the opposite label, its prediction is the one finally kept.

To clarify what "sufficiently low" or "sufficiently high" means, the predictions made by the GMM-2D model on the training set are represented in FIG 14. It shows the distribution of the GMM estimated probability that a data point is labeled 0. The blue color corresponds to correct predictions, the red to wrong ones.

It can be observed that the wrong predictions are mainly concentrated in a probability range between 0.02 and 0.5. It means that when the model associates a probability in this range to a prediction, this prediction needs to be confirmed by the GBTC algorithm. On the contrary, when the probability is close to zero, the model seems to correctly predict the 1-label; similarly, when the probability is close to one, the 0-label seems to be correctly given. This study helps us determine the minimum (0.02) and maximum (0.5) probability thresholds between which the GBTC algorithm is used for confirmation. FIG 13 sums up the model combination considered.
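For reference, here is a hedged sketch of how such thresholds could be extracted programmatically from the probabilities of the wrongly classified training points (simply via their extreme quantiles; in this article the thresholds are actually read off FIG 14):

# Sketch: derive confirmation thresholds from the wrong predictions'
# probabilities (an alternative to reading them off the histogram)
import numpy as np

def estimate_thresholds(proba_label0, pred_labels, true_labels,
                        qlow=0.01, qhigh=0.99):
    wrong = np.asarray(pred_labels) != np.asarray(true_labels)
    wrong_proba = np.asarray(proba_label0)[wrong]
    # Keep the range containing (almost) all the wrong predictions
    return np.quantile(wrong_proba, qlow), np.quantile(wrong_proba, qhigh)

# Example with the GMM-2D training predictions computed below:
# low, high = estimate_thresholds(np.array(probapred_list)[:,0],
#                                 labelpred_list, label_df_train.values)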

FIG 13: Combination of the GMM-2D and GBTC models.
feat_list=["V4","V14"]
# Train GMM model
Gmm, permutelabel,StdScale=trainGmm(feat_df_train,
    label_df_train,feat_list,normalize=True)
# Realize predictions and plot the probability distribution (FIG 14)
labelpred_list,probapred_list=predGmm(Gmm,feat_df_train,feat_list,permutelabel,
    distribproba=True,label_df=label_df_train,normalize=True,StdScale=StdScale)
FIG 14: Distribution of the probability associated to the 0 label by the GMM-2D model on the training set. The blue bars correspond to the data correctly classified, the red to the wrong predictions.

We now implement the strategy presented above. The code below realizes this model combination.

def finalpred(Gmm,Gbc,permutelabel,feat_df_topred,feat_list,
              normalize=False,StdScale=False,evalacc=False,
              label_df_topred=None):
    """
    Final predictions combining the GMM and GBC models
    Input:
        Gmm [Sklearn Object]: A trained GMM model
        Gbc [Sklearn Object]: A trained GBC model
        permutelabel [Bool]: True if the GMM's labels are permuted
        feat_df_topred [DataFrame]: All the features' data to predict
        feat_list [List]: List of the features' names considered
                          for the GMM's predictions
        normalize [Bool]: True if the features have to be normalized
                          before the GMM's predictions
        StdScale [Sklearn Object]: (if normalize is True)
                                   Means and standard deviations of the
                                   features used for normalization
        evalacc [Bool]: True if the accuracies are evaluated
        label_df_topred [DataFrame]: (if evalacc is True) the true labels
                                     associated to the data in feat_df_topred
    Output:
        finalpred_list [List]: List of the predictions of the model
                               combining GMM and GBC
        finalproba_list [List]: List of the probabilities associated
                                to the fraud label
        accgmm [Float]: (if evalacc is True) accuracy of the GMM model
        accgbc [Float]: (if evalacc is True) accuracy of the GBC model
        accfinal [Float]: (if evalacc is True) accuracy of the combined model
    """
    # Realize independent predictions from both the GMM and GBC models
    predgmm_list,probapred_list=predGmm(Gmm,feat_df_topred,
        feat_list,permutelabel,normalize=normalize,StdScale=StdScale)
    predgbc_list=Gbc.predict(feat_df_topred)
    # If accuracies have to be evaluated
    if evalacc:
        accgmm=accuracy_score(label_df_topred.values,predgmm_list)
        accgbc=accuracy_score(label_df_topred.values,predgbc_list)
    # Else fix the GBC accuracy to the value reached on the testing set
    else:
        accgbc=0.95
    finalpred_list=[]
    finalproba_list=[]
    for ipred,pred in enumerate(predgmm_list):
        finalpred=pred
        probapred=probapred_list[ipred][0]
        # If the probability determined by the GMM is between
        # 0.05 and 0.5, confirm the prediction with the GBC model
        # and, if it disagrees, replace the probability by the
        # GBC's accuracy value
        if probapred<0.5:
            if probapred>0.05:
                finalpred=predgbc_list[ipred]
                if finalpred!=pred:
                    probapred=accgbc
        finalproba_list.append(probapred)
        finalpred_list.append(finalpred)
    # Determine final accuracies
    if evalacc:
        accfinal=accuracy_score(label_df_topred.values,finalpred_list)
        print("Accuracy GMM")
        print(accgmm)
        print("Accuracy GBC")
        print(accgbc)
        print("Testing Accuracy GMM+GBC")
        print(accfinal)
        return finalpred_list,finalproba_list,accgmm,accgbc,accfinal
    else:
        return finalpred_list,finalproba_list

feat_list=["V4","V14"]
# Train a GMM-2D model
Gmm, permutelabel,StdScale=trainGmm(feat_df_train,label_df_train,
                                    feat_list,normalize=True)
# Train a GBTC model
param_dict={"n_estimators":50,"min_samples_split":4,
            "max_depth":3,"learning_rate":0.01}
Gbc,featimp_list=trainGbc(param_dict,feat_df_train,
                          label_df_train,crossvalidate=False,plotfeatimp=False)
# Make final predictions combining GMM-2D and GBTC
finalpred_test,finalproba_test,accgmm,accgbc,accfinal=finalpred(
    Gmm,Gbc,permutelabel,feat_df_test,feat_list,normalize=True,
    StdScale=StdScale,evalacc=True,label_df_topred=label_df_test)

The confusion matrix on the testing set is computed using the final model combining GMM and GBTC.

plotconfmat(label_df_test,finalpred_test,figname="finalconfmat.png")
FIG 15: Confusion matrix on the testing set with the final model combining GMM-2D and GBTC.

VI.2 Models’ Comparisons

We now compare the different models introduced above:

  1. The GMM model using all 30 features (GMM-30D)
  2. The GBTC algorithm
  3. The GMM model using the two selected features (GMM-2D)
  4. The model combining GMM-2D and GBTC

FIG 16 presents the labels predicted by each model, for both training and testing sets, in the plane of the two selected features, V4 and V14.

def plot_labelpred(feat_df_train,feat_df_test,feat_list,
                   label_df_train,label_df_test,pred_train,pred_test,
                   modelname,figname):
    """
    Plot the true labels and the predicted ones in the 2D plane
    formed by feat_list
    Input:
        feat_df_train [DataFrame]: Training data features
        feat_df_test [DataFrame]: Testing data features
        feat_list [List]: List of the features' names considered
                          for the predictions
        label_df_train [DataFrame]: Training labels
        label_df_test [DataFrame]: Testing labels
        pred_train [List]: Training predictions by modelname
        pred_test [List]: Testing predictions by modelname
        modelname [String]: Name of the model realizing the predictions
        figname [String]: Path of the figure
    """
    fig,ax=plt.subplots(nrows=1,ncols=2,figsize=(10,8))
    # True labels (left panel)
    ax[0].scatter(feat_df_train[feat_list[0]],feat_df_train[feat_list[1]],
                  c=label_df_train,
                  cmap=matplotlib.colors.ListedColormap(["navy","red"]))
    ax[0].scatter(feat_df_test[feat_list[0]],feat_df_test[feat_list[1]],
                  c=label_df_test,
                  cmap=matplotlib.colors.ListedColormap(["navy","red"]),
                  marker=">")
    ax[0].set_title("True Labels",fontsize=18)
    # Predicted labels (right panel)
    ax[1].scatter(feat_df_train[feat_list[0]],feat_df_train[feat_list[1]],
                  c=pred_train,
                  cmap=matplotlib.colors.ListedColormap(["navy","red"]))
    ax[1].scatter(feat_df_test[feat_list[0]],feat_df_test[feat_list[1]],
                  c=pred_test,
                  cmap=matplotlib.colors.ListedColormap(["navy","red"]),
                  marker=">")
    ax[1].set_title(modelname+" Predictions",fontsize=18)
    # Plot legends
    ax[0].scatter(0,0,c="navy",label="No Fraud (Training)")
    ax[0].scatter(4,-5,c="red",label="Fraud (Training)")
    ax[0].scatter(0,0,c="navy",label="No Fraud (Testing)",marker=">")
    ax[0].scatter(4,-5,c="red",label="Fraud (Testing)",marker=">")
    ax[0].legend()
    ax[1].scatter(0,0,c="navy",label="No Fraud (Training)")
    ax[1].scatter(4,-5,c="red",label="Fraud (Training)")
    ax[1].scatter(0,0,c="navy",label="No Fraud (Testing)",marker=">")
    ax[1].scatter(4,-5,c="red",label="Fraud (Testing)",marker=">")
    ax[1].legend()
    fig.tight_layout()
    fig.savefig(figname)
# Prediction with GMM-30D
featall_list=list(feat_df_test.columns)
feat_list=["V4","V14"]
# Testing predictions
gmmallpred_test,probapred_list=predGmm(GmmAll,feat_df_test,
    featall_list,permutelabel,distribproba=False,label_df=None,
    normalize=True,StdScale=StdScaleAll)
# Training predictions
gmmallpred_train,probapred_list=predGmm(GmmAll,feat_df_train,
    featall_list,permutelabel,distribproba=False,
    label_df=None,normalize=True,StdScale=StdScaleAll)
# Prediction with GMM-2D
# Testing predictions
gmmpred_test,probapred_list=predGmm(Gmm,feat_df_test,
    feat_list,permutelabel,distribproba=False,
    label_df=None,normalize=True,StdScale=StdScale)
# Training predictions
gmmpred_train,probapred_list=predGmm(Gmm,feat_df_train,
    feat_list,permutelabel,distribproba=False,label_df=None,
    normalize=True,StdScale=StdScale)
# Prediction with GBTC
gbcpred_test=Gbc.predict(feat_df_test)
gbcpred_train=Gbc.predict(feat_df_train)
# Prediction with GMM-2D+GBTC
finalpred_test,finalproba_list=finalpred(Gmm,Gbc,permutelabel,
    feat_df_test,feat_list,normalize=True,StdScale=StdScale)
finalpred_train,finalproba_train=finalpred(Gmm,Gbc,
    permutelabel,feat_df_train,feat_list,normalize=True,
    StdScale=StdScale)

# Plot the different models' predictions
figname="gmmallpred.png"
modelname="GMM-30D"
plot_labelpred(feat_df_train,feat_df_test,feat_list,label_df_train,
               label_df_test,gmmallpred_train,gmmallpred_test,modelname,figname)
figname="gmmpred.png"
modelname="GMM-2D"
plot_labelpred(feat_df_train,feat_df_test,feat_list,label_df_train,
               label_df_test,gmmpred_train,gmmpred_test,modelname,figname)
figname="gbcpred.png"
modelname="GBC"
plot_labelpred(feat_df_train,feat_df_test,feat_list,label_df_train,
               label_df_test,gbcpred_train,gbcpred_test,modelname,figname)
figname="finalpred.png"
modelname="GMM-2D+GBC"
plot_labelpred(feat_df_train,feat_df_test,feat_list,label_df_train,
               label_df_test,finalpred_train,finalpred_test,modelname,figname)
FIG 16: Models' predictions for the training and testing sets in the (V4,V14) plane.

It can be observed that the GMM-2D+GBTC combination captures more subtle details for label determination than GMM-2D or GBTC alone. This observation is confirmed when comparing the testing accuracies of all the models (see FIG 17).

histacc_fig,ax=plt.subplots(1,figsize=(10,8))
pos=np.arange(4)+0.5
acc_list=[accgmmall,accgmm,accgbc,accfinal]
plt.bar(pos,acc_list)
plt.xticks(pos,["GMM-30D","GMM-2D","GBC","GMM-2D+GBC"],fontsize=20)
plt.ylabel("Testing Accuracy",fontsize=20)
plt.yticks(fontsize=20)
plt.ylim(0.5,1)
for ival,val in enumerate(acc_list):
    ax.text(ival+0.3,val+0.01, str(val)[:5], color='blue',
            fontweight='bold',fontsize=15)
histacc_fig.savefig("histacc.png")
FIG 17: Comparison of the testing accuracies of the different models.

Conclusion

In this study, we proposed to quantify the risk of transaction fraud through the combination of unsupervised and supervised models. With the help of feature selection, we were able to build a reliable Gaussian Mixture Model, which is assisted by a Gradient Boosting Tree Classifier when its predictions are "unsure". This combination gives rise to a more accurate model, able to predict nearly 90 percent of the frauds while gathering the advantages of both the GMM and GBTC algorithms.

Sources

[1] https://www.ecb.europa.eu/press/pr/stats/paysec/html/ecb.pis2019~71119b94d1.en.html

[2] https://www.kaggle.com/mlg-ulb/creditcardfraud

[3] Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

[4] Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915–4928,2014, Pergamon

[5] Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784–3797,2018,IEEE

[6] Dal Pozzolo, Andrea Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi)

[7] Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182–194,2018,Elsevier

[8] Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285–300,2018,Springer International Publishing

[9] Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78–88, 2019

[10] Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection Information Sciences, 2019

[11] https://jakevdp.github.io/PythonDataScienceHandbook/05.12-gaussian-mixtures.html

[12] https://scikit-learn.org/stable/modules/mixture.html

[13] https://towardsdatascience.com/ai-computed-breast-cancer-risk-map-b29195b477a


jean-baptiste charraud

Centrale Paris Engineer, PhD Candidate in AI and Quantum Simulation