So far, I have shown how to use Deep Learning for Data Augmentation with the idea of creating data for an underrepresented class in your dataset. However, the Deep Tabular Augmenter can be used for more than just that. In this blog post I want to show you how you can use the technique to create data for regression tasks.

First, we need a dataset. Here, I use the infamous Boston Housing Dataset.

import pandas as pd
import numpy as np
import torch
from torch import nn
from torch import optim
from sklearn.preprocessing import StandardScaler
from functools import partial
import deep_tabular_augmentation as dta
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

DATA_PATH = 'data/housing.csv'

header_list = ['CRIM', 'ZN','INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

df = pd.read_csv(DATA_PATH, delim_whitespace=True, names=header_list)
df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
df.shape
(506, 14)

As we only have 506 observations to work with, this example is a pretty good use case for the Deep Tabular Augmenter. Usually, people use this dataset to predict MEDV, the median value of owner-occupied homes in $1000's. Let's first create train and test datasets:

X_train, X_test = train_test_split(df, test_size=0.1, random_state=42)

x_scaler = StandardScaler()

# fit the scaler on the training split only; note that all 14 columns,
# including MEDV, get scaled, since we want to generate MEDV as well
X_train_scaled = x_scaler.fit_transform(X_train)

X_test_scaled = x_scaler.transform(X_test)
pd.DataFrame(X_train_scaled, columns=list(df.columns)).head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 -0.416942 0.344513 -1.117966 -0.270395 -0.960137 0.943640 -1.102673 0.654891 -0.523106 -1.144555 -1.601746 0.398294 -1.108176 1.365122
1 -0.280002 -0.499723 -0.421068 -0.270395 -0.145806 -0.222195 0.832605 0.069475 -0.638367 -0.601866 1.175568 0.448420 0.863237 -0.805235
2 -0.408091 -0.499723 -0.360216 -0.270395 -0.299938 0.679704 0.108207 -0.448063 -0.523106 -0.142668 1.130038 0.434251 -0.678455 0.408875
3 -0.359270 0.344513 -1.025240 -0.270395 0.171021 1.652175 -0.555824 -0.440721 -0.523106 -0.858301 -2.466811 0.377578 -1.307689 2.235413
4 -0.000352 -0.499723 1.021988 -0.270395 0.239524 0.017747 -0.580681 0.076309 1.666847 1.539070 0.811330 0.359545 -0.272453 -0.160575

Then, we'll put the data into dataloaders so our Deep Tabular Augmenter can run its calculations on it. This time, we use a slightly different dataloader function: dta.create_datasets_no_target_var. This is a convenience function for when we do not have a target variable. Usually, for example in the fraud-detection case, we have a specific target variable at hand for which we do not want to create fake data. This case is different, as we also want to create fake data for our MEDV variable (which we will use as the target in later tasks).

Also, make sure you have at least version 0.5.0 of deep_tabular_augmentation, otherwise this function won't be available. If not, just run: pip install deep_tabular_augmentation --upgrade

datasets = dta.create_datasets_no_target_var(X_train_scaled, X_test_scaled)
data = dta.DataBunch(*dta.create_loaders(datasets, bs=1024))
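
Conceptually, all this does is wrap the scaled arrays (including the MEDV column) into datasets and dataloaders, so MEDV is treated like any other feature. A rough plain-PyTorch sketch of the same idea, purely for illustration and not the library's actual implementation:

from torch.utils.data import TensorDataset, DataLoader

# every column (including MEDV) becomes part of the model input; there is no separate target tensor
train_ds = TensorDataset(torch.tensor(X_train_scaled, dtype=torch.float32))
valid_ds = TensorDataset(torch.tensor(X_test_scaled, dtype=torch.float32))
train_dl = DataLoader(train_ds, batch_size=1024, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=1024)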

From now on, everything should be very familiar. Let's start with the basic architecture of our model:

D_in = X_train_scaled.shape[1]   # number of input features (here: 14)
VAE_arch = [50, 12, 12]          # hidden layer sizes of the encoder/decoder
df_cols = list(df.columns)

model = dta.Autoencoder(D_in, VAE_arch, latent_dim=5).to(device)
opt = optim.Adam(model.parameters(), lr=0.01)
loss_func = dta.customLoss()
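
The custom loss is essentially the usual variational-autoencoder loss: a reconstruction term plus a KL-divergence term that keeps the learned latent distribution close to a standard normal. A generic sketch of such a loss, not necessarily how dta.customLoss is implemented:

def vae_loss_sketch(recon_x, x, mu, logvar):
    # how well the decoder reconstructs the original rows
    recon = nn.functional.mse_loss(recon_x, x, reduction='sum')
    # how far the learned latent distribution drifts from a standard normal
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kld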

Put everything data-related into the learner class:

learn = dta.Learner(model, opt, loss_func, data, cols=df_cols)
run = dta.Runner(cb_funcs=[dta.LR_Find, dta.Recorder])

# run the learning rate finder and plot the loss against the learning rate
run.fit(100, learn)
run.recorder.plot(skip_last=5)

Pick a learning rate from the region where the loss is still falling steeply, then set up a learning rate schedule of your choice and train the model:

# learning rate schedule: cosine ramp-up from 0.01 to 0.1 over the first 30% of training, then cosine decay back to 0.01
sched = dta.combine_scheds([0.3, 0.7], [dta.sched_cos(0.01, 0.1), dta.sched_cos(0.1, 0.01)])
cbfs = [partial(dta.LossTracker, show_every=50), dta.Recorder, partial(dta.ParamScheduler, 'lr', sched)]
# this time with a larger latent space
model = dta.Autoencoder(D_in, VAE_arch, latent_dim=20).to(device)
opt = optim.Adam(model.parameters(), lr=0.01)
learn = dta.Learner(model, opt, loss_func, data, cols=df_cols)
run = dta.Runner(cb_funcs=cbfs)
run.fit(400, learn)
epoch: 50
train loss is: 13623.060546875
validation loss is: 612.1049194335938
epoch: 100
train loss is: 5125.99072265625
validation loss is: 439.03076171875
epoch: 150
train loss is: 4169.7734375
validation loss is: 519.6369018554688
epoch: 200
train loss is: 3732.431396484375
validation loss is: 511.1471862792969
epoch: 250
train loss is: 3492.6806640625
validation loss is: 468.2626953125
epoch: 300
train loss is: 3326.837646484375
validation loss is: 430.8256530761719
epoch: 350
train loss is: 3203.124267578125
validation loss is: 402.46844482421875
epoch: 400
train loss is: 3105.912109375
validation loss is: 380.48358154296875

Let's have a look at the created data:

new_data_points = 1000
df_fake = run.predict_df(learn, no_samples=new_data_points, scaler=x_scaler)
# per-column noise level: one tenth of each column's standard deviation in the real data
std_list = list(df[df_cols].std()/10)
df_fake_with_noise = run.predict_with_noise_df(learn, no_samples=new_data_points, mu=0, sigma=std_list, scaler=x_scaler)
df_fake_with_noise.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 -0.061319 7.058959 8.945721 0.013245 0.488697 6.139548 58.205878 3.804640 4.543752 320.783396 18.403781 398.551691 11.238506 22.612332
1 1.003866 6.856581 8.805585 0.025567 0.516702 6.069803 67.046331 3.949096 5.485925 326.843934 18.556644 381.434741 10.764693 20.467331
2 0.976743 4.248924 8.800449 -0.009408 0.509263 5.971862 54.597878 4.920510 4.968876 295.964594 18.648691 383.313037 12.274177 22.356870
3 0.329469 -2.955746 9.026785 0.000997 0.520971 6.012886 77.822414 3.400364 5.377612 335.715743 19.341018 374.741465 13.768394 19.288477
4 -0.808149 1.374236 8.696051 -0.025238 0.544275 6.043268 77.303222 3.288870 3.969855 363.377079 19.962985 370.265762 15.021694 17.637887
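
The idea behind predict_with_noise_df is to add zero-mean Gaussian noise on top of the generated samples, with a per-column standard deviation (here one tenth of each column's standard deviation in the real data). If you wanted to add comparable noise by hand, a minimal sketch on the original, unscaled data could look like this (not necessarily what the library does internally):

rng = np.random.default_rng(42)
# one noise value per cell; each column gets its own standard deviation
noise = rng.normal(loc=0, scale=std_list, size=df_fake.shape)
df_fake_noisy_sketch = df_fake + noise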

As RAD and CHAS are not continuous, let's just round them to the nearest integer.

df_fake_with_noise['RAD'] = df_fake_with_noise['RAD'].round()
df_fake_with_noise['CHAS'] = df_fake_with_noise['CHAS'].round()
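
Optionally, and beyond what the workflow above does, you could also clip the rounded columns to the ranges that actually occur in the real data (CHAS is a 0/1 dummy, RAD lies between 1 and 24):

# keep CHAS binary and RAD inside its observed range
df_fake_with_noise['CHAS'] = df_fake_with_noise['CHAS'].clip(lower=0, upper=1)
df_fake_with_noise['RAD'] = df_fake_with_noise['RAD'].clip(lower=df['RAD'].min(), upper=df['RAD'].max())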

Now that we have that, let's come to the evaluation part. Training is all well and good, but we want to see some results. So let's plot the real data against the generated data:

fig, ax = plt.subplots()
fig.set_size_inches(15.5, 8.5)
ax.scatter(df['MEDV'], df['AGE'], alpha=0.5)

ax.set_xlabel('MEDV', fontsize=15)
ax.set_ylabel('AGE', fontsize=15)
ax.set_title('Real Data: MEDV vs AGE', fontsize=15)

ax.grid(True)
fig.tight_layout()

plt.show()
fig, ax = plt.subplots()
fig.set_size_inches(15.5, 8.5)
ax.scatter(df_fake_with_noise['MEDV'], df_fake_with_noise['AGE'], alpha=0.5)

ax.set_xlabel('MEDV', fontsize=15)
ax.set_ylabel('AGE', fontsize=15)
ax.set_title('Fake Data: MEDV vs AGE', fontsize=15)

ax.grid(True)
fig.tight_layout()

plt.show()
fig, ax = plt.subplots()
fig.set_size_inches(15.5, 8.5)
ax.scatter(df['MEDV'], df['RM'], c=df['LSTAT'], alpha=0.5)

ax.set_xlabel('MEDV', fontsize=15)
ax.set_ylabel('RM', fontsize=15)
ax.set_title('Real Data: MEDV vs RM (colored by LSTAT, % lower status of the population)', fontsize=15)

ax.grid(True)
fig.tight_layout()

plt.show()
fig, ax = plt.subplots()
fig.set_size_inches(15.5, 8.5)
ax.scatter(df_fake_with_noise['MEDV'], df_fake_with_noise['RM'], c=df_fake_with_noise['LSTAT'], alpha=0.5)

ax.set_xlabel('MEDV', fontsize=15)
ax.set_ylabel('RM', fontsize=15)
ax.set_title('Fake Data: MEDV vs RM (colored by LSTAT, % lower status of the population)', fontsize=15)

ax.grid(True)
fig.tight_layout()

plt.show()
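
As a purely hypothetical extra that is not part of the workflow above: one way to use the augmented data downstream is to train a simple scikit-learn regressor on the real training rows plus the generated rows and evaluate it on the untouched test split.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# combine the real training rows with the generated ones
df_train_aug = pd.concat([X_train, df_fake_with_noise], ignore_index=True)

reg = RandomForestRegressor(random_state=42)
reg.fit(df_train_aug.drop(columns='MEDV'), df_train_aug['MEDV'])

preds = reg.predict(X_test.drop(columns='MEDV'))
print('test MSE:', mean_squared_error(X_test['MEDV'], preds))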

I think those plots look pretty awesome! So remember, whenever you just don't have enough data for your model to train properly, you can try the Deep Tabular Augmenter to create additional fake data, so your model is actually able to learn something. I hope this helps and stay tuned for more!

Lasse