Comparison of RandomForest with SMOTE vs Augmented Data
In this blog post I'd like to show the difference deep tabular augmentation can make when training a Random Forest on a highly imbalanced dataset. In this case we look at credit card fraud, where fraud is far less common than non-fraud. I want to compare the popular SMOTE technique with deep learning augmentation.
import pandas as pd
import numpy as np
import torch
from torch import nn
from torch import optim
from sklearn.preprocessing import StandardScaler
from functools import partial
import mlprepare as mlp
import deep_tabular_augmentation as dta
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
DATA_PATH = 'data/creditcard.csv'
df = pd.read_csv(DATA_PATH)
Let's have a quick look at the data:
df.head()
Also, let's have a look at how many more non-fraud cases we have compared to fraud cases:
difference_in_class_occurences = df['Class'].value_counts()[0]-df['Class'].value_counts()[1]
difference_in_class_occurences
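For reference, the absolute counts per class (0 = non-fraud, 1 = fraud) make the imbalance even more obvious:
# absolute number of cases per class
df['Class'].value_counts()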
In order to make use of deep tabular augmentation we need to scale the data and then keep only the cases of the class we are interested in, here the rows where "Class" equals 1.
X_train, X_test, y_train, y_test = mlp.split_df(df, dep_var='Class', test_size=0.3, split_mode='random')
x_scaler = StandardScaler()
X_train_scaled = x_scaler.fit_transform(X_train)
X_test_scaled = x_scaler.transform(X_test)
X_train_fraud = X_train_scaled[np.where(y_train==1)[0]]
X_test_fraud = X_test_scaled[np.where(y_test==1)[0]]
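A quick sanity check: the fraud-only subsets should be tiny compared to the full training set.
# shapes of the full scaled training set vs the fraud-only subsets
X_train_scaled.shape, X_train_fraud.shape, X_test_fraud.shape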
For our model to work we need to put our data into a DataLoader (here I use the DataBunch class from deep_tabular_augmentation).
datasets = dta.create_datasets(X_train_fraud, y_train.values[np.where(y_train==1)], X_test_fraud, y_test.values[np.where(y_test==1)])
data = dta.DataBunch(*dta.create_loaders(datasets, bs=1024))
Now we're good to go. We can define our variational autoencoder architecture (here: 50->12->12->5->12->12->50) and then use the learning rate finder to tell us the best learning rate:
D_in = X_train_fraud.shape[1]
VAE_arch = [50, 12, 12]
target_name = 'Class'
target_class = 1
df_cols = list(df.columns)
model = dta.Autoencoder(D_in, VAE_arch, latent_dim=5).to(device)
opt = optim.Adam(model.parameters(), lr=0.01)
loss_func = dta.customLoss()
learn = dta.Learner(model, opt, loss_func, data, target_name, target_class, df_cols)
run = dta.Runner(cb_funcs=[dta.LR_Find, dta.Recorder])
run.fit(100, learn)
run.recorder.plot(skip_last=5)
Based on this, we pick a suitable learning rate and set up a schedule for it:
sched = dta.combine_scheds([0.3, 0.7], [dta.sched_cos(0.01, 0.1), dta.sched_cos(0.1, 0.01)])
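If you want to see what this schedule looks like, a quick plot does the job. This is purely illustrative and assumes, as in the fastai-style schedulers this library mirrors, that the combined schedule is a plain callable mapping training progress in [0, 1] to a learning rate:
import matplotlib.pyplot as plt
# illustrative: evaluate the combined schedule over the whole training run
positions = np.linspace(0, 1, 100)
plt.plot(positions, [sched(pos) for pos in positions])
plt.xlabel('training progress')
plt.ylabel('learning rate')
plt.show()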
Now, let's train the model:
cbfs = [partial(dta.LossTracker, show_every=50), dta.Recorder, partial(dta.ParamScheduler, 'lr', sched)]
model = dta.Autoencoder(D_in, VAE_arch, latent_dim=20).to(device)
opt = optim.Adam(model.parameters(), lr=0.01)
learn = dta.Learner(model, opt, loss_func, data, target_name, target_class, df_cols)
run = dta.Runner(cb_funcs=cbfs)
run.fit(400, learn)
Let's see what the created data looks like:
# take 25% of real standard deviation
sigma = list(df[df['Class']==1][df_cols].std()*0.25)
df_fake_with_noise = run.predict_with_noise_df(learn, no_samples=1000, mu=0, sigma=sigma, scaler=x_scaler)
df_fake_with_noise.describe().loc[['mean']]
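To get a feel for whether the synthetic fraud cases resemble the real ones, we can put the column means side by side. This is an illustrative check and assumes that predict_with_noise_df returns the data on the original scale with the same columns as df:
# illustrative check: column means of real vs synthetic fraud cases
real_vs_fake = pd.DataFrame({
    'real_fraud_mean': df[df['Class']==1].mean(),
    'synthetic_mean': df_fake_with_noise.mean(),
})
real_vs_fake.head(10)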
Now we use SMOTE to create data:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)
y_train.value_counts()
y_res.value_counts()
So SMOTE essentially creates minority-class entries until both classes contain the same number of cases (a minimal sketch of the interpolation idea behind SMOTE follows after the next code block). We can do the same with the deep learning augmenter:
difference_in_trainset = y_train.value_counts()[0]-y_train.value_counts()[1]
df_fake_with_noise = run.predict_with_noise_df(learn, no_samples=difference_in_trainset, mu=0, sigma=sigma, scaler=x_scaler)
df_fake_with_noise.shape
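For comparison, here is a minimal sketch of the core idea behind SMOTE, not the actual imblearn implementation: each synthetic point is a random interpolation between a minority sample and one of its nearest minority-class neighbours.
# minimal sketch of SMOTE's core idea (illustrative, not the imblearn implementation)
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_minority, n_new, k=5, seed=42):
    rng = np.random.default_rng(seed)
    # nearest neighbours within the minority class (first neighbour is the point itself)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)
    samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        j = rng.choice(idx[i][1:])          # pick a random minority neighbour
        u = rng.random()                    # interpolation factor in [0, 1]
        samples.append(X_minority[i] + u * (X_minority[j] - X_minority[i]))
    return np.array(samples)

# e.g. synthetic fraud rows from the scaled fraud-only training data
X_fraud_synthetic = smote_sketch(X_train_fraud, n_new=1000)
This also makes clear why SMOTE can only produce points that lie between existing fraud cases, whereas the autoencoder samples from a learned latent distribution.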
While SMOTE creates the synthetic data in almost no time, the deep learning augmentation takes about a minute.
We want to compare how the built-in class_weight functionality performs against SMOTE and against the new approach. So we build three training sets: the original one, the one with additional data from SMOTE, and the one with additional data from deep learning augmentation.
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)
train_df_fake_with_noise = pd.concat([train_df, df_fake_with_noise])
To make things easier to understand, let's define the datasets on which to train and on which to assess the results:
X_train, X_test, X_train_aug = train_df.iloc[:,:30].values, test_df.iloc[:,:30].values, train_df_fake_with_noise.iloc[:,:30].values
y_train, y_test, y_train_aug = train_df.iloc[:,30].values, test_df.iloc[:,30].values, train_df_fake_with_noise.iloc[:,30].values
X_train.shape, X_train_aug.shape, X_res.shape
y_train.shape, y_train_aug.shape, y_res.shape
First, let's train a model on the original data while using the difference in class occurrences as the weight for the fraud class.
def rf(xs, y, n_estimators=40, max_samples=500,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestClassifier(n_jobs=-1, n_estimators=n_estimators,
                                  max_samples=max_samples, max_features=max_features,
                                  min_samples_leaf=min_samples_leaf, oob_score=True,
                                  class_weight={0: 1, 1: difference_in_class_occurences}).fit(xs, y)
m = rf(X_train, y_train)
confusion_matrix(y_test, np.round(m.predict(X_test)))
Then we use the SMOTE data:
def rf_aug(xs, y, n_estimators=40, max_samples=500,
           max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestClassifier(n_jobs=-1, n_estimators=n_estimators,
                                  max_samples=max_samples, max_features=max_features,
                                  min_samples_leaf=min_samples_leaf, oob_score=True).fit(xs, y)
m_smote = rf_aug(X_res.values, y_res)
confusion_matrix(y_test, np.round(m_smote.predict(X_test)))
Finally, we use the augmented dataframe:
m_aug = rf_aug(X_train_aug, y_train_aug)
confusion_matrix(y_test, np.round(m_aug.predict(X_test)))
Let's have a look at the Classification Reports:
from sklearn.metrics import classification_report
target_names = ['no-fraud', 'fraud']
print(classification_report(y_test, np.round(m.predict(X_test)), target_names=target_names))
print(classification_report(y_test, np.round(m_smote.predict(X_test)), target_names=target_names))
print(classification_report(y_test, np.round(m_aug.predict(X_test)), target_names=target_names))
We see quite large differences between the three approaches. Simply attaching a higher weight to the fraud class didn't help at all; we were only able to identify 4 fraud cases correctly. The SMOTE approach led to far more identified fraud cases: it found 125 out of 136, which corresponds to a recall of 0.92. However, this comes at a cost: an astonishing 1283 non-fraud cases were flagged as fraud, leading to a precision of 0.09 for the fraud class. The deep learning augmentation correctly predicted 107 of the 136 fraud cases while flagging only 58 non-fraud cases as fraud, which leads to a precision of 0.65 for the fraud class.
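As a quick sanity check, the precision and recall quoted above can be recomputed by hand from the confusion-matrix figures (107 true positives, 58 false positives, and 136 actual fraud cases for the augmented model):
# recompute the quoted metrics for the augmented model by hand
tp, fp, actual_fraud = 107, 58, 136
fn = actual_fraud - tp              # 29 missed fraud cases
precision = tp / (tp + fp)          # ~0.65
recall = tp / (tp + fn)             # ~0.79
print(round(precision, 2), round(recall, 2))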
To conclude, I think this post was able to show the merits of deep learning augmentation. While increasing the number of correctly identified fraud cases, we were also able to keep precision high, meaning we only produced a few false alarms. SMOTE was able to correctly identify a few more fraud cases, but it also produced a huge number of false positives, which can be costly when it comes to resource allocation.
Lasse