So far, I have shown how to use Deep Learning for Data Augmentation with the idea of creating data for an underrepresented class in your dataset. However, the Deep Tabular Augmenter can be used for more than just that. In this blog post I want to show you how you can use the technique to create data for regression tasks.

First, we need a dataset. Here, I use the infamous Boston Housing Dataset.

import pandas as pd
import numpy as np
import torch
from torch import nn
from torch import optim
from sklearn.preprocessing import StandardScaler
from functools import partial
import deep_tabular_augmentation as dta
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

DATA_PATH = 'data/housing.csv'

header_list = ['CRIM', 'ZN','INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

df = pd.read_csv(DATA_PATH, delim_whitespace=True, names=header_list)
df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
df.shape
(506, 14)

As we only have 506 observations to work with, this example is a pretty good use case for the Deep Tabular Augmenter. Usually, people use this dataset to predict MEDV, the median value of owner-occupied homes in $1000's. Let's first create train and test datasets:

X_train, X_test = train_test_split(df, test_size=0.1, random_state=42)

x_scaler = StandardScaler()

# fit the scaler on the training split only; note that all 14 columns,
# including MEDV, get scaled, since we want to generate MEDV as well
X_train_scaled = x_scaler.fit_transform(X_train)

X_test_scaled = x_scaler.transform(X_test)
pd.DataFrame(X_train_scaled, columns=list(df.columns)).head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 -0.416942 0.344513 -1.117966 -0.270395 -0.960137 0.943640 -1.102673 0.654891 -0.523106 -1.144555 -1.601746 0.398294 -1.108176 1.365122
1 -0.280002 -0.499723 -0.421068 -0.270395 -0.145806 -0.222195 0.832605 0.069475 -0.638367 -0.601866 1.175568 0.448420 0.863237 -0.805235
2 -0.408091 -0.499723 -0.360216 -0.270395 -0.299938 0.679704 0.108207 -0.448063 -0.523106 -0.142668 1.130038 0.434251 -0.678455 0.408875
3 -0.359270 0.344513 -1.025240 -0.270395 0.171021 1.652175 -0.555824 -0.440721 -0.523106 -0.858301 -2.466811 0.377578 -1.307689 2.235413
4 -0.000352 -0.499723 1.021988 -0.270395 0.239524 0.017747 -0.580681 0.076309 1.666847 1.539070 0.811330 0.359545 -0.272453 -0.160575

Then, we'll put the data into dataloaders so our Deep Tabular Augmenter can run its calculations on it. This time, we use a slightly different dataloader function: dta.create_datasets_no_target_var. This is a convenience function for when we do not have a target variable. Usually, for example in the fraud-detection case, we have a specific target variable at hand for which we do not want to create fake data. This case is different, as we also want to create fake data for our MEDV variable (which we will use as the target in later tasks).

Also, make sure you have at least version 0.5.0 of deep_tabular_augmentation, otherwise this function won't be available. If not, just run: pip install deep_tabular_augmentation --upgrade

datasets = dta.create_datasets_no_target_var(X_train_scaled, X_test_scaled)
data = dta.DataBunch(*dta.create_loaders(datasets, bs=1024))
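
Conceptually, all this does is wrap the scaled arrays (including the MEDV column) into datasets and dataloaders, so MEDV is treated like any other feature. A rough plain-PyTorch sketch of the same idea, purely for illustration and not the library's actual implementation:

from torch.utils.data import TensorDataset, DataLoader

# every column (including MEDV) becomes part of the model input; there is no separate target tensor
train_ds = TensorDataset(torch.tensor(X_train_scaled, dtype=torch.float32))
valid_ds = TensorDataset(torch.tensor(X_test_scaled, dtype=torch.float32))
train_dl = DataLoader(train_ds, batch_size=1024, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=1024)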

From now on, everything should be very familiar. Let's start with the basic architecture of our model:

D_in = X_train_scaled.shape[1]   # number of input features (here: 14)
VAE_arch = [50, 12, 12]          # hidden layer sizes of the encoder/decoder
df_cols = list(df.columns)

model = dta.Autoencoder(D_in, VAE_arch, latent_dim=5).to(device)
opt = optim.Adam(model.parameters(), lr=0.01)
loss_func = dta.customLoss()
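
The custom loss is essentially the usual variational-autoencoder loss: a reconstruction term plus a KL-divergence term that keeps the learned latent distribution close to a standard normal. A generic sketch of such a loss, not necessarily how dta.customLoss is implemented:

def vae_loss_sketch(recon_x, x, mu, logvar):
    # how well the decoder reconstructs the original rows
    recon = nn.functional.mse_loss(recon_x, x, reduction='sum')
    # how far the learned latent distribution drifts from a standard normal
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kld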

Put everything data-related into the learner class:

learn = dta.Learner(model, opt, loss_func, data, cols=df_cols)
run = dta.Runner(cb_funcs=[dta.LR_Find, dta.Recorder])

# run the learning rate finder and plot the loss against the learning rate
run.fit(100, learn)
run.recorder.plot(skip_last=5)

Pick a learning rate from the region where the loss is still falling steeply, then set up a learning rate schedule of your choice and train the model:

# learning rate schedule: cosine ramp-up from 0.01 to 0.1 over the first 30% of training, then cosine decay back to 0.01
sched = dta.combine_scheds([0.3, 0.7], [dta.sched_cos(0.01, 0.1), dta.sched_cos(0.1, 0.01)])
cbfs = [partial(dta.LossTracker, show_every=50), dta.Recorder, partial(dta.ParamScheduler, 'lr', sched)]
# this time with a larger latent space
model = dta.Autoencoder(D_in, VAE_arch, latent_dim=20).to(device)
opt = optim.Adam(model.parameters(), lr=0.01)
learn = dta.Learner(model, opt, loss_func, data, cols=df_cols)
run = dta.Runner(cb_funcs=cbfs)
run.fit(400, learn)
epoch: 50
train loss is: 13623.060546875
validation loss is: 612.1049194335938
epoch: 100
train loss is: 5125.99072265625
validation loss is: 439.03076171875
epoch: 150
train loss is: 4169.7734375
validation loss is: 519.6369018554688
epoch: 200
train loss is: 3732.431396484375
validation loss is: 511.1471862792969
epoch: 250
train loss is: 3492.6806640625
validation loss is: 468.2626953125
epoch: 300
train loss is: 3326.837646484375
validation loss is: 430.8256530761719
epoch: 350
train loss is: 3203.124267578125
validation loss is: 402.46844482421875
epoch: 400
train loss is: 3105.912109375
validation loss is: 380.48358154296875

Let's have a look at the created data:

new_data_points = 1000
df_fake = run.predict_df(learn, no_samples=new_data_points, scaler=x_scaler)
# per-column noise level: one tenth of each column's standard deviation in the real data
std_list = list(df[df_cols].std()/10)
df_fake_with_noise = run.predict_with_noise_df(learn, no_samples=new_data_points, mu=0, sigma=std_list, scaler=x_scaler)
df_fake_with_noise.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 -0.061319 7.058959 8.945721 0.013245 0.488697 6.139548 58.205878 3.804640 4.543752 320.783396 18.403781 398.551691 11.238506 22.612332
1 1.003866 6.856581 8.805585 0.025567 0.516702 6.069803 67.046331 3.949096 5.485925 326.843934 18.556644 381.434741 10.764693 20.467331
2 0.976743 4.248924 8.800449 -0.009408 0.509263 5.971862 54.597878 4.920510 4.968876 295.964594 18.648691 383.313037 12.274177 22.356870
3 0.329469 -2.955746 9.026785 0.000997 0.520971 6.012886 77.822414 3.400364 5.377612 335.715743 19.341018 374.741465 13.768394 19.288477
4 -0.808149 1.374236 8.696051 -0.025238 0.544275 6.043268 77.303222 3.288870 3.969855 363.377079 19.962985 370.265762 15.021694 17.637887
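
The idea behind predict_with_noise_df is to add zero-mean Gaussian noise on top of the generated samples, with a per-column standard deviation (here one tenth of each column's standard deviation in the real data). If you wanted to add comparable noise by hand, a minimal sketch on the original, unscaled data could look like this (not necessarily what the library does internally):

rng = np.random.default_rng(42)
# one noise value per cell; each column gets its own standard deviation
noise = rng.normal(loc=0, scale=std_list, size=df_fake.shape)
df_fake_noisy_sketch = df_fake + noise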

As RAD and CHAS are not continuous, let's just round them to the nearest integer.

df_fake_with_noise['RAD'] = df_fake_with_noise['RAD'].round()
df_fake_with_noise['CHAS'] = df_fake_with_noise['CHAS'].round()
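
Optionally, and beyond what the workflow above does, you could also clip the rounded columns to the ranges that actually occur in the real data (CHAS is a 0/1 dummy, RAD lies between 1 and 24):

# keep CHAS binary and RAD inside its observed range
df_fake_with_noise['CHAS'] = df_fake_with_noise['CHAS'].clip(lower=0, upper=1)
df_fake_with_noise['RAD'] = df_fake_with_noise['RAD'].clip(lower=df['RAD'].min(), upper=df['RAD'].max())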

Now that we have that, let's come to the evaluation part. Training is all well and good, but we want to see some results. So let's plot the real data against the generated data:

fig, ax = plt.subplots()
fig.set_size_inches(15.5, 8.5)
ax.scatter(df['MEDV'], df['AGE'], alpha=0.5)

ax.set_xlabel('MEDV', fontsize=15)
ax.set_ylabel('AGE', fontsize=15)
ax.set_title('Real Data: MEDV vs AGE', fontsize=15)

ax.grid(True)
fig.tight_layout()

plt.show()
fig, ax = plt.subplots()
fig.set_size_inches(15.5, 8.5)
ax.scatter(df_fake_with_noise['MEDV'], df_fake_with_noise['AGE'], alpha=0.5)

ax.set_xlabel('MEDV', fontsize=15)
ax.set_ylabel('AGE', fontsize=15)
ax.set_title('Fake Data: MEDV vs AGE', fontsize=15)

ax.grid(True)
fig.tight_layout()

plt.show()
fig, ax = plt.subplots()
fig.set_size_inches(15.5, 8.5)
ax.scatter(df['MEDV'], df['RM'], c=df['LSTAT'], alpha=0.5)

ax.set_xlabel('MEDV', fontsize=15)
ax.set_ylabel('RM', fontsize=15)
ax.set_title('Real Data: MEDV vs RM (colored by LSTAT, % lower status of the population)', fontsize=15)

ax.grid(True)
fig.tight_layout()

plt.show()
fig, ax = plt.subplots()
fig.set_size_inches(15.5, 8.5)
ax.scatter(df_fake_with_noise['MEDV'], df_fake_with_noise['RM'], c=df_fake_with_noise['LSTAT'], alpha=0.5)

ax.set_xlabel('MEDV', fontsize=15)
ax.set_ylabel('RM', fontsize=15)
ax.set_title('Fake Data: MEDV vs RM (colored by LSTAT, % lower status of the population)', fontsize=15)

ax.grid(True)
fig.tight_layout()

plt.show()
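
As a purely hypothetical extra that is not part of the workflow above: one way to use the augmented data downstream is to train a simple scikit-learn regressor on the real training rows plus the generated rows and evaluate it on the untouched test split.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# combine the real training rows with the generated ones
df_train_aug = pd.concat([X_train, df_fake_with_noise], ignore_index=True)

reg = RandomForestRegressor(random_state=42)
reg.fit(df_train_aug.drop(columns='MEDV'), df_train_aug['MEDV'])

preds = reg.predict(X_test.drop(columns='MEDV'))
print('test MSE:', mean_squared_error(X_test['MEDV'], preds))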

I think those plots look pretty awesome! So remember, whenever you just don't have enough data for your model to train properly, you can try the Deep Tabular Augmenter to create additional fake data, so your model is actually able to learn something. I hope this helps and stay tuned for more!

Lasse