How to use a Variational Autoencoder to augment tabular data

When it comes to Deep Learning, the more data we have, the better our chances of getting a great-performing model. In fields like image recognition, research has already come up with quite a few clever ideas for using the existing data to create more data out of it. This is called data augmentation.

However, when we look at Deep Learning in the tabular data context, many of these concepts are still missing. In this blogpost I would like to show a way to augment tabular data, which we can use to train a Deep Learning model on more data, or to create additional samples of underrepresented classes.

I also want to show graphically how this newly created data is sampled from the distribution of the underlying data, and hence how it can help to build better Deep Learning models.

I've already created a small library called deep_tabular_augmentation. It contains a class which handles all of the tabular data augmentation.

import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import deep_tabular_augmentation as dta
import warnings; warnings.simplefilter('ignore')

So first, we need to get some data. Here, I've got the famous wine dataset.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

DATA_PATH = 'data/wine.csv'

df = pd.read_csv(DATA_PATH, sep=',')

df.head()
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
cols = df.columns

We then build a DataLoader, within which we also standardize our data. We save the scaler in our dataset so we can use it later to invert the scaling.

def load_and_standardize_data(path):
    # read in from csv
    df = pd.read_csv(path, sep=',')
    # replace nan with -99
    df = df.fillna(-99)
    df = df.values.reshape(-1, df.shape[1]).astype('float32')
    # randomly split
    X_train, X_test = train_test_split(df, test_size=0.3, random_state=42)
    # standardize values
    scaler = preprocessing.StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)   
    return X_train, X_test, scaler
from torch.utils.data import Dataset, DataLoader
class DataBuilder(Dataset):
    def __init__(self, path, train=True):
        self.X_train, self.X_test, self.standardizer = load_and_standardize_data(path)
        if train:
            self.x = torch.from_numpy(self.X_train)
            self.len=self.x.shape[0]
        else:
            self.x = torch.from_numpy(self.X_test)
            self.len=self.x.shape[0]
        del self.X_train
        del self.X_test 
    def __getitem__(self,index):      
        return self.x[index]
    def __len__(self):
        return self.len
traindata_set=DataBuilder(DATA_PATH, train=True)
testdata_set=DataBuilder(DATA_PATH, train=False)

trainloader=DataLoader(dataset=traindata_set,batch_size=1024)
testloader=DataLoader(dataset=testdata_set,batch_size=1024)
trainloader.dataset.x.shape, testloader.dataset.x.shape
(torch.Size([124, 14]), torch.Size([54, 14]))

We've built our train and test datasets, and with the help of DataLoaders we've also turned them into tensors. So, let's use deep_tabular_augmentation now. The class needs seven inputs: the trainloader, the testloader, the device on which to run the training, the input dimension (in this case: 14), and how many nodes the first and second hidden layers should have. Finally, we can also specify the number of latent factors. These latent factors will contain all of the condensed information, meaning that we can use them to recreate the original 14 input dimensions (i.e. our data).
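
For intuition, the encoder/decoder pair that these inputs describe looks roughly like the following sketch. This is my own simplified version for illustration, not necessarily the library's exact implementation (details such as activations may differ):

import torch
import torch.nn as nn

class VAESketch(nn.Module):
    # Simplified sketch of the architecture the inputs describe.
    def __init__(self, D_in, H, H2, latent_dim):
        super().__init__()
        # encoder: input -> H -> H2
        self.encoder = nn.Sequential(nn.Linear(D_in, H), nn.ReLU(),
                                     nn.Linear(H, H2), nn.ReLU())
        # two heads: mean and log-variance of the latent factors
        self.mu = nn.Linear(H2, latent_dim)
        self.logvar = nn.Linear(H2, latent_dim)
        # decoder: latent factors -> H2 -> H -> reconstructed input
        self.decoder = nn.Sequential(nn.Linear(latent_dim, H2), nn.ReLU(),
                                     nn.Linear(H2, H), nn.ReLU(),
                                     nn.Linear(H, D_in))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # reparameterization trick: sample the latent factors
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

With that picture in mind, let's set the parameters and initialize the library's model: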

D_in = traindata_set.x.shape[1]
H = 50
H2 = 12

autoenc_model = dta.AutoencoderModel(trainloader, testloader, device, D_in, H, H2, latent_dim=3)

After we've successfully initialized our model, let's train it and call the trained model "autoenc_model_fit".

autoenc_model_fit = autoenc_model.fit(epochs=600)
====> Epoch: 200 Average training loss: 11.3281
====> Epoch: 200 Average test loss: 11.4239
====> Epoch: 400 Average training loss: 9.7651
====> Epoch: 400 Average test loss: 10.3157
====> Epoch: 600 Average training loss: 9.1283
====> Epoch: 600 Average test loss: 10.5291

Now, all we need to do is create some fake data based on the trained model. This works as follows: we know the learned parameters for the mean and the variance of our latent factors. We then use a normal distribution with the mean and variance of each latent factor to sample a value for latent factors 1, 2 and 3 (because we've got three latent factors in this case). These sampled starting points for our latent factors are then inflated back up to the 14 real input variables by the decoder. Let's see how it's done:

scaler = trainloader.dataset.standardizer

df_fake = autoenc_model_fit.predict_df(no_samples=500, cols=cols, scaler=scaler)
df_fake['Wine'] = np.round(df_fake['Wine']).astype(int)
df_fake['Wine'] = np.where(df_fake['Wine']<1, 1, df_fake['Wine'])
df_fake['Wine'] = np.where(df_fake['Wine']>3, 3, df_fake['Wine'])
df_fake.head()
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
0 2 12.493084 2.045130 2.237029 18.645376 101.227913 2.366858 2.039845 0.310666 1.780123 3.824865 1.053821 2.959614 614.468567
1 2 12.388008 2.028943 2.237992 19.671783 93.292801 2.290553 2.102422 0.337810 1.582847 3.545685 1.047288 2.853665 574.657776
2 2 12.863456 2.061298 2.315192 18.529932 104.701004 2.480082 2.273354 0.298898 1.788156 4.134489 1.071253 2.954578 788.806580
3 2 12.315710 2.164225 2.261593 20.433725 91.778603 2.019248 1.704518 0.383477 1.496545 3.532443 0.991632 2.591641 531.966309
4 2 12.562940 2.135798 2.234170 19.034275 101.889763 2.421708 2.133960 0.296481 1.818705 3.890027 1.052557 2.903332 662.883545
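
Under the hood, this sampling-and-decoding step boils down to roughly the following conceptual sketch. It is not the library's actual code: decoder, latent_mu and latent_logvar stand for hypothetical handles to the trained decoder and the learned latent statistics, which predict_df takes care of internally.

import torch

# Conceptual sketch only: `decoder`, `latent_mu` and `latent_logvar` are hypothetical
# handles to the trained model; the library hides all of this behind predict_df.
def sample_fake_data(decoder, latent_mu, latent_logvar, no_samples, scaler):
    sigma = torch.exp(0.5 * latent_logvar)                                # std of each latent factor
    z = latent_mu + sigma * torch.randn(no_samples, latent_mu.shape[0])   # sample latent vectors
    with torch.no_grad():
        fake_scaled = decoder(z)                                          # decode back to the 14 input columns
    return scaler.inverse_transform(fake_scaled.cpu().numpy())            # undo the standardization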

The deep_tabular_augmentation library has another method up its sleeve: predict_with_noise_df. What it does is the following: each element (independently of every other element) is multiplied by 1 plus a value sampled from a normal distribution. Why should we do this? The answer is that the Variational Autoencoder works similarly to a PCA, resulting in more sharply defined relations between variables. So the Variational Autoencoder keeps the mean and standard deviation within each variable, but the trained parameters of the model already pick up "hidden" relations between variables. When these relations are linear, the Variational Autoencoder de facto performs a PCA. We'll have a look at this in a second.
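
Conceptually, the noise step boils down to something like this rough sketch (my own illustration, not the library's exact implementation; here I simply leave the 'Wine' column untouched):

# Sketch of the idea: every feature cell is scaled by (1 + eps), eps ~ N(mu, sigma).
feature_cols = [c for c in df_fake.columns if c != 'Wine']
noise = np.random.normal(loc=0, scale=0.05, size=df_fake[feature_cols].shape)
df_fake_noisy = df_fake.copy()
df_fake_noisy[feature_cols] = df_fake[feature_cols] * (1 + noise)

The library method does this for us and returns a ready-made DataFrame: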

df_fake_with_noise = autoenc_model_fit.predict_with_noise_df(no_samples=500, scaler=scaler, cols=cols, mu=0, sigma=0.05, group_var='Wine')
df_fake_with_noise['Wine'] = np.round(df_fake_with_noise['Wine']).astype(int)
df_fake_with_noise['Wine'] = np.where(df_fake_with_noise['Wine']<1, 1, df_fake_with_noise['Wine'])
df_fake_with_noise['Wine'] = np.where(df_fake_with_noise['Wine']>3, 3, df_fake_with_noise['Wine'])
df_fake_with_noise.head()
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
0 2 12.106316 2.142077 2.127616 19.408381 92.179733 2.418110 2.124110 0.353460 1.544483 3.348948 1.057455 2.828610 560.824158
1 2 12.147018 2.009001 2.188961 22.396885 101.379967 2.318827 2.177448 0.336155 1.584168 3.603321 0.984947 2.718268 604.860901
2 2 13.262069 1.864855 2.277317 19.010958 95.259872 2.493481 2.209520 0.347245 1.750477 3.704191 0.989856 2.743370 631.236572
3 2 13.586826 2.270907 2.191374 22.129240 91.323662 1.793574 1.240703 0.437917 1.393128 3.973017 0.862660 2.276499 543.781311
4 2 13.825186 1.984307 2.333792 18.083511 102.420219 2.705270 2.401335 0.312623 1.910688 4.119694 1.008278 3.032777 685.976562

Let's have a look at the descriptive statistics, especially the means. Can you spot a difference between the real and the fake data?

df.groupby('Wine').describe().loc[:,(slice(None),['mean'])]
(mean of each variable per Wine class, real data)
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
1 13.744746 2.010678 2.455593 17.037288 106.338983 2.840169 2.982373 0.290000 1.899322 5.528305 1.062034 3.157797 1115.711864
2 12.278732 1.932676 2.244789 20.238028 94.549296 2.258873 2.080845 0.363662 1.630282 3.086620 1.056282 2.785352 519.507042
3 13.153750 3.333750 2.437083 21.416667 99.312500 1.678750 0.781458 0.447500 1.153542 7.396250 0.682708 1.683542 629.895833
df_fake.groupby('Wine').describe().loc[:,(slice(None),['mean'])]
(mean of each variable per Wine class, fake data)
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
1 13.756567 1.972927 2.439609 16.679155 110.221558 2.888496 2.933521 0.296994 1.980520 5.707772 1.065832 3.040885 1103.145508
2 12.550643 2.169043 2.292549 19.789505 96.663307 2.250596 2.012712 0.350111 1.627728 3.967775 1.007894 2.744052 616.919006
3 13.225099 3.809655 2.512825 22.554857 101.781906 1.470288 0.664501 0.507955 0.921151 7.329511 0.642636 1.483876 621.194458
df_fake_with_noise.groupby('Wine').describe().loc[:,(slice(None),['mean'])]
(mean of each variable per Wine class, fake data with noise)
Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.phenols Proanth Color.int Hue OD Proline
1 13.816388 1.980611 2.445411 16.888805 109.618469 2.890874 2.916224 0.297376 1.960559 5.643065 1.068307 3.012105 1106.231079
2 12.554925 2.182181 2.296422 19.779770 96.850357 2.270568 2.013577 0.349725 1.625649 4.029018 1.005889 2.740945 626.561035
3 13.099731 3.869962 2.526824 22.768381 102.265625 1.440060 0.597155 0.517642 0.879244 7.493037 0.621761 1.406803 605.561829

Now let's take a graphical look at how the fake data compares to the real data.
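
A comparison plot like the ones below can be produced with a few lines of matplotlib. This is just a sketch; the exact styling of the original figures may differ:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(df['Alcohol'], df['Hue'], alpha=0.5, label='real')
ax.scatter(df_fake['Alcohol'], df_fake['Hue'], alpha=0.5, label='fake')
ax.scatter(df_fake_with_noise['Alcohol'], df_fake_with_noise['Hue'], alpha=0.5, label='fake with noise')
ax.set_xlabel('Alcohol')
ax.set_ylabel('Hue')
ax.legend()
plt.show()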

[Figure: Alcohol vs. Hue for real, fake and fake-with-noise data, 3 latent factors]

This is what I meant by "performing a PCA". One can clearly see how the Variational Autoencoder gave structure to the relation between Alcohol and Hue. If we add noise, this relation vanishes. But what happens if we use more than 3 latent factors? This is the result with 14 latent factors (one per input variable):

[Figure: Alcohol vs. Hue for real, fake and fake-with-noise data, 14 latent factors]

The same pattern emerges. However, once we apply random noise, the resulting data looks pretty much like the real data.

Now let's have a look at some distributions. The first image always represents the results with 3 latent factors, the second one with 14 latent factors.
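
Distribution plots like the ones summarized below can be sketched in a similar way, reusing the matplotlib import from the sketch above ('Flavanoids' simply stands in for whichever column you want to inspect):

fig, ax = plt.subplots(figsize=(7, 5))
ax.hist(df['Flavanoids'], bins=30, density=True, alpha=0.5, label='real')
ax.hist(df_fake['Flavanoids'], bins=30, density=True, alpha=0.5, label='fake')
ax.hist(df_fake_with_noise['Flavanoids'], bins=30, density=True, alpha=0.5, label='fake with noise')
ax.set_xlabel('Flavanoids')
ax.legend()
plt.show()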

[Figures: distributions of three example variables for real, fake and fake-with-noise data; each variable is shown once for the model with 3 latent factors and once for the model with 14 latent factors]

We see that when we use a Variational Autoencoder for data augmentation on tabular data, it already finds relations between variables. If we want to get rid of this effect, we can add random noise to the data, and the resulting distributions then look pretty much like the original, real data points. How we can use these insights to improve machine learning/deep learning models is something I will cover in an upcoming blogpost.

Until then, stay tuned for more! Lasse