This blog post gives an overview of how the number of latent factors affects the data created by the Deep Tabular Augmenter.

This blog post is based on version 0.4.0 of deep_tabular_augmentation. Again, we use the credit card fraud dataset from Kaggle.

#!pip install deep_tabular_augmentation==0.4.0
from config import *

import pandas as pd
import numpy as np
import torch
from torch import nn
from torch import optim
from sklearn.preprocessing import StandardScaler
from functools import partial
import mlprepare as mlp
import deep_tabular_augmentation as dta
import matplotlib.pyplot as plt
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

DATA_PATH = 'data/creditcard.csv'

df = pd.read_csv(DATA_PATH, sep=',')

df.head()
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0

5 rows × 31 columns

deep_tabular_augmentation is built on a simple idea: the data is kept together with the model in a dedicated class, which we call the Learner. The data has to come as dataloader objects, which I store in the DataBunch class; it holds the dataloaders for the training and test data. The Runner class then defines the flow.

We first scale the data and only keep the data of the class we want to augment:

X_train, X_test, y_train, y_test = mlp.split_df(df, dep_var='Class', test_size=0.3, split_mode='random')

x_scaler = StandardScaler()

X_train_scaled = x_scaler.fit_transform(X_train)

X_test_scaled = x_scaler.transform(X_test)

X_train_fraud = X_train_scaled[np.where(y_train==1)[0]]
X_test_fraud = X_test_scaled[np.where(y_test==1)[0]]
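
One detail of the step above is easy to miss: the scaler is fitted on all training rows, so the fraud subset on its own is generally not centred at zero afterwards. The following is a minimal sketch on synthetic data (the arrays and the 5% positive rate are made up for illustration, and the standardisation is written out by hand instead of using StandardScaler):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data: 3 features, roughly 5% positives.
X_train = rng.normal(size=(1000, 3))
y_train = (rng.random(1000) < 0.05).astype(int)
X_train[y_train == 1] += 3.0  # shift the minority class away from the majority

# Standardise with the statistics of the full training set.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train_scaled = (X_train - mean) / std

# Two equivalent ways of selecting the minority rows.
by_index = X_train_scaled[np.where(y_train == 1)[0]]
by_mask = X_train_scaled[y_train == 1]
assert np.array_equal(by_index, by_mask)

# The subset mean is clearly different from zero in every column.
print(by_mask.mean(axis=0))
```

This is not a problem for the augmenter, but it is worth keeping in mind when reading plots or statistics of the scaled fraud rows.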

As mentioned, I then put the train and test loaders into a class called DataBunch, which is just a container for the data. You can easily create your own dataloaders and put them in a DataBunch.

device = DEVICE
datasets = dta.create_datasets(X_train_fraud, y_train.values[np.where(y_train==1)], X_test_fraud, y_test.values[np.where(y_test==1)])
data = dta.DataBunch(*dta.create_loaders(datasets, bs=1024))

To make use of deep_tabular_augmentation, we need to specify the input shape (basically, how many variables are in the dataset), the column name of the target class we want to augment together with its value, and lastly the column names of the input variables.

Then we can define whatever model architecture we would like. We just pass it as a list into the model. We can also define how many latent dimensions the autoencoder should have. As I want to compare the results of different latent factors, I will build six models that differ only in the number of latent factors.

The model also lets you control how much variance is added to each column. If desired, you can add enough variance that the variance of the generated data matches that of the real data. Most of the time this is not desirable, however, since it adds so much noise that the generated data becomes less useful for predictions. I recommend using around 10%-25% of the real standard deviation of the data.
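
To get a feeling for what such a fraction does, here is a small sketch on synthetic data. It assumes the noise is zero-mean Gaussian and simply added per column; `add_column_noise` is a hypothetical helper for illustration and not part of deep_tabular_augmentation, whose internals may differ. The "generated" array is a toy stand-in for decoder output with half the spread of the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the real minority-class rows: 2 columns with different spread.
real = rng.normal(loc=[0.0, 5.0], scale=[2.0, 10.0], size=(5000, 2))

# Toy stand-in for decoder output: same centre, but only half the spread.
centre = real.mean(axis=0)
generated = centre + 0.5 * (real - centre)

def add_column_noise(samples, real_std, fraction, rng):
    """Add zero-mean Gaussian noise per column, scaled to a fraction of the real std."""
    sigma = fraction * real_std  # one sigma per column
    return samples + rng.normal(0.0, sigma, size=samples.shape)

real_std = real.std(axis=0)
for fraction in (0.0, 0.25, 1.0):
    noisy = add_column_noise(generated, real_std, fraction, rng)
    # Spread of the noisy data relative to the real data, per column.
    print(fraction, np.round(noisy.std(axis=0) / real_std, 2))
```

Because independent variances add, a fraction of 1.0 makes the noise larger than the signal in this toy setup, while 0.25 only widens the distribution slightly.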

D_in = X_train_fraud.shape[1]
VAE_arch = [50, 12, 12]
target_name = 'Class'
target_class = 1
df_cols = list(df.columns)

# take 25% of the real standard deviation
sigma = list(df[df['Class']==1][df_cols].std()*0.25)

sched = dta.combine_scheds([0.3, 0.7], [dta.sched_cos(0.01, 0.1), dta.sched_cos(0.1, 0.01)])
cbfs = [partial(dta.LossTracker, show_every=500), dta.Recorder, partial(dta.ParamScheduler, 'lr', sched)]
latent_factors = [1, 3, 5, 10, 20, 30]
loss_func = dta.customLoss()
models = [dta.Autoencoder(D_in, VAE_arch, latent_dim=i).to(device) for i in latent_factors]

opts = [optim.Adam(models[i].parameters(), lr=0.01) for i in range(len(models))]
learners = [dta.Learner(models[i], opts[i], loss_func, data, target_name, target_class, df_cols) for i in range(len(models))]
runners = [dta.Runner(cb_funcs=cbfs) for i in range(len(models))]
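
The `sched` line above chains two cosine schedules: the learning rate warms up from 0.01 to 0.1 during the first 30% of training and then anneals back to 0.01 over the remaining 70%. The sketch below re-implements that idea in plain Python. The function names mirror the ones called above, but the bodies are my own reconstruction in the style of the fast.ai course helpers, under the assumption that the scheduler is called with the fraction of training completed; the package's actual implementation may differ.

```python
import math

def sched_cos(start, end):
    """Return a function pos -> value that moves from start to end along a half cosine."""
    def _inner(pos):
        return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2
    return _inner

def combine_scheds(pcts, scheds):
    """Chain schedules; pcts gives the share of training assigned to each one."""
    assert math.isclose(sum(pcts), 1.0)
    bounds = [0.0]
    for p in pcts:
        bounds.append(bounds[-1] + p)

    def _inner(pos):
        # Pick the last phase whose start is <= pos, then rescale pos to [0, 1].
        idx = max(i for i in range(len(scheds)) if pos >= bounds[i])
        local = (pos - bounds[idx]) / (bounds[idx + 1] - bounds[idx])
        return scheds[idx](local)
    return _inner

sched = combine_scheds([0.3, 0.7], [sched_cos(0.01, 0.1), sched_cos(0.1, 0.01)])
for pos in (0.0, 0.15, 0.3, 0.65, 1.0):
    print(f"{pos:.2f} -> lr {sched(pos):.4f}")
```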

Let's train all six models and generate fake data from each of them:

runners[0].fit(100, learners[0])
df_fake_with_noise_1l = runners[0].predict_with_noise_df(learners[0], no_samples=1000, mu=0, sigma=sigma, scaler=x_scaler)
epoch: 500
train loss is: 242759.03125
validation loss is: 98488.9609375
runners[1].fit(100, learners[1])
df_fake_with_noise_3l = runners[1].predict_with_noise_df(learners[1], no_samples=1000, mu=0, sigma=sigma, scaler=x_scaler)
epoch: 500
train loss is: 242125.296875
validation loss is: 99636.6484375
runners[2].fit(100, learners[2])
df_fake_with_noise_5l = runners[2].predict_with_noise_df(learners[2], no_samples=1000, mu=0, sigma=sigma, scaler=x_scaler)
epoch: 500
train loss is: 240940.15625
validation loss is: 98626.25
runners[3].fit(100, learners[3])
df_fake_with_noise_10l = runners[3].predict_with_noise_df(learners[3], no_samples=1000, mu=0, sigma=sigma, scaler=x_scaler)
epoch: 500
train loss is: 242187.625
validation loss is: 97300.4609375
runners[4].fit(100, learners[4])
df_fake_with_noise_20l = runners[4].predict_with_noise_df(learners[4], no_samples=1000, mu=0, sigma=sigma, scaler=x_scaler)
epoch: 500
train loss is: 242573.765625
validation loss is: 98810.9453125
runners[5].fit(100, learners[5])
df_fake_with_noise_30l = runners[5].predict_with_noise_df(learners[5], no_samples=1000, mu=0, sigma=sigma, scaler=x_scaler)
epoch: 500
train loss is: 245696.984375
validation loss is: 99198.0546875
fig, axs = plt.subplots(4, 2)
# Defining custom 'xlim' and 'ylim' values.
custom_xlim = (-30, 5)
custom_ylim = (-40, 22)
# Setting the values for all axes.
plt.setp(axs, xlim=custom_xlim, ylim=custom_ylim)
fig.set_size_inches(25.5, 17.5)
axs[0, 0].scatter(df[df['Class']==0]['V1'].values, df[df['Class']==0]['V2'].values, alpha=0.5)
axs[0, 0].set_title('V1 vs V2 for Class==0')
axs[0, 1].scatter(df[df['Class']==1]['V1'].values, df[df['Class']==1]['V2'].values, alpha=0.5)
axs[0, 1].set_title('V1 vs V2 for Class==1')
axs[1, 0].scatter(df_fake_with_noise_1l['V1'].values, df_fake_with_noise_1l['V2'].values, alpha=0.5)
axs[1, 0].set_title('Fake Data with latent_factors=1')
axs[1, 1].scatter(df_fake_with_noise_3l['V1'].values, df_fake_with_noise_3l['V2'].values, alpha=0.5)
axs[1, 1].set_title('Fake Data with latent_factors=3')
axs[2, 0].scatter(df_fake_with_noise_5l['V1'].values, df_fake_with_noise_5l['V2'].values, alpha=0.5)
axs[2, 0].set_title('Fake Data with latent_factors=5')
axs[2, 1].scatter(df_fake_with_noise_10l['V1'].values, df_fake_with_noise_10l['V2'].values, alpha=0.5)
axs[2, 1].set_title('Fake Data with latent_factors=10')
axs[3, 0].scatter(df_fake_with_noise_20l['V1'].values, df_fake_with_noise_20l['V2'].values, alpha=0.5)
axs[3, 0].set_title('Fake Data with latent_factors=20')
axs[3, 1].scatter(df_fake_with_noise_30l['V1'].values, df_fake_with_noise_30l['V2'].values, alpha=0.5)
axs[3, 1].set_title('Fake Data with latent_factors=30')
plt.show()
fig, axs = plt.subplots(4, 2)
# Defining custom 'xlim' and 'ylim' values.
custom_xlim = (-60, 20)
custom_ylim = (-40, 10)
# Setting the values for all axes.
plt.setp(axs, xlim=custom_xlim, ylim=custom_ylim)
fig.set_size_inches(25.5, 17.5)
axs[0, 0].scatter(df[df['Class']==0]['V2'].values, df[df['Class']==0]['V3'].values, alpha=0.5)
axs[0, 0].set_title('V2 vs V3 for Class==0')
axs[0, 1].scatter(df[df['Class']==1]['V2'].values, df[df['Class']==1]['V3'].values, alpha=0.5)
axs[0, 1].set_title('V2 vs V3 for Class==1')
axs[1, 0].scatter(df_fake_with_noise_1l['V2'].values, df_fake_with_noise_1l['V3'].values, alpha=0.5)
axs[1, 0].set_title('Fake Data with latent_factors=1')
axs[1, 1].scatter(df_fake_with_noise_3l['V2'].values, df_fake_with_noise_3l['V3'].values, alpha=0.5)
axs[1, 1].set_title('Fake Data with latent_factors=3')
axs[2, 0].scatter(df_fake_with_noise_5l['V2'].values, df_fake_with_noise_5l['V3'].values, alpha=0.5)
axs[2, 0].set_title('Fake Data with latent_factors=5')
axs[2, 1].scatter(df_fake_with_noise_10l['V2'].values, df_fake_with_noise_10l['V3'].values, alpha=0.5)
axs[2, 1].set_title('Fake Data with latent_factors=10')
axs[3, 0].scatter(df_fake_with_noise_20l['V2'].values, df_fake_with_noise_20l['V3'].values, alpha=0.5)
axs[3, 0].set_title('Fake Data with latent_factors=20')
axs[3, 1].scatter(df_fake_with_noise_30l['V2'].values, df_fake_with_noise_30l['V3'].values, alpha=0.5)
axs[3, 1].set_title('Fake Data with latent_factors=30')
plt.show()
fig, axs = plt.subplots(4, 2)
# Defining custom 'xlim' and 'ylim' values.
custom_xlim = (-15, 25)
custom_ylim = (-20, 40)
# Setting the values for all axes.
plt.setp(axs, xlim=custom_xlim, ylim=custom_ylim)
fig.set_size_inches(25.5, 17.5)
axs[0, 0].scatter(df[df['Class']==0]['V10'].values, df[df['Class']==0]['V6'].values, alpha=0.5)
axs[0, 0].set_title('V10 vs V6 for Class==0')
axs[0, 1].scatter(df[df['Class']==1]['V10'].values, df[df['Class']==1]['V6'].values, alpha=0.5)
axs[0, 1].set_title('V10 vs V6 for Class==1')
axs[1, 0].scatter(df_fake_with_noise_1l['V10'].values, df_fake_with_noise_1l['V6'].values, alpha=0.5)
axs[1, 0].set_title('Fake Data with latent_factors=1')
axs[1, 1].scatter(df_fake_with_noise_3l['V10'].values, df_fake_with_noise_3l['V6'].values, alpha=0.5)
axs[1, 1].set_title('Fake Data with latent_factors=3')
axs[2, 0].scatter(df_fake_with_noise_5l['V10'].values, df_fake_with_noise_5l['V6'].values, alpha=0.5)
axs[2, 0].set_title('Fake Data with latent_factors=5')
axs[2, 1].scatter(df_fake_with_noise_10l['V10'].values, df_fake_with_noise_10l['V6'].values, alpha=0.5)
axs[2, 1].set_title('Fake Data with latent_factors=10')
axs[3, 0].scatter(df_fake_with_noise_20l['V10'].values, df_fake_with_noise_20l['V6'].values, alpha=0.5)
axs[3, 0].set_title('Fake Data with latent_factors=20')
axs[3, 1].scatter(df_fake_with_noise_30l['V10'].values, df_fake_with_noise_30l['V6'].values, alpha=0.5)
axs[3, 1].set_title('Fake Data with latent_factors=30')
plt.show()

These results look really promising. When comparing two variables, the number of latent factors does not seem to matter much, although 30 latent factors seem to provide the most compelling results. Let's have a look at three variables at once:

fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(projection='3d')
ax.scatter(df[df['Class']==0]['V1'].values, df[df['Class']==0]['V2'].values, df[df['Class']==0]['V3'].values)
ax.set_title('V1 vs V2 vs V3 for Class==0')
plt.show()
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(projection='3d')
ax.scatter(df[df['Class']==1]['V1'].values, df[df['Class']==1]['V2'].values, df[df['Class']==1]['V3'].values)
ax.set_title('V1 vs V2 vs V3 for Class==1')
plt.show()
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(projection='3d')
ax.scatter(df_fake_with_noise_1l['V1'].values, df_fake_with_noise_1l['V2'].values, df_fake_with_noise_1l['V3'].values)
ax.set_title('V1 vs V2 vs V3 for latent factors=1')
plt.show()
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(projection='3d')
ax.scatter(df_fake_with_noise_10l['V1'].values, df_fake_with_noise_10l['V2'].values, df_fake_with_noise_10l['V3'].values)
ax.set_title('V1 vs V2 vs V3 for latent factors=10')
plt.show()
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(projection='3d')
ax.scatter(df_fake_with_noise_30l['V1'].values, df_fake_with_noise_30l['V2'].values, df_fake_with_noise_30l['V3'].values)
ax.set_title('V1 vs V2 vs V3 for latent factors=30')
plt.show()

When comparing the 3D plots, it becomes clearer that fewer latent factors tend to split the data further apart. The more latent factors there are, the better the relations between the variables seem to be captured.
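
A visual impression like this can be backed up with a number. One simple option, sketched below on synthetic data, is to compare the correlation matrices of the real and the generated rows. `correlation_gap` is a hypothetical helper I made up for this illustration, not a function of the package, and the toy arrays only stand in for the real and fake fraud data.

```python
import numpy as np

def correlation_gap(real, fake):
    """Mean absolute difference between the off-diagonal correlations (0 = identical)."""
    real_corr = np.corrcoef(real, rowvar=False)
    fake_corr = np.corrcoef(fake, rowvar=False)
    off_diag = ~np.eye(real_corr.shape[0], dtype=bool)
    return np.abs(real_corr - fake_corr)[off_diag].mean()

rng = np.random.default_rng(0)
cov = [[1.0, 0.8, 0.5],
       [0.8, 1.0, 0.3],
       [0.5, 0.3, 1.0]]
real = rng.multivariate_normal(np.zeros(3), cov, size=2000)
good_fake = rng.multivariate_normal(np.zeros(3), cov, size=2000)  # same relations
poor_fake = rng.normal(size=(2000, 3))                            # ignores the relations

print("good fake:", correlation_gap(real, good_fake))
print("poor fake:", correlation_gap(real, poor_fake))
```

With the DataFrames from this post, the same function could be applied to, for example, the `['V1', 'V2', 'V3']` columns of the real fraud rows and of each `df_fake_with_noise_*` frame; a smaller gap would suggest that the relations between the variables are captured better.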

If you have any questions or want anything added to the package, just ask me.

Lasse