Deep Tabular Augmentation for Regression Tasks
So far, I have shown how to use Deep Learning for Data Augmentation with the goal of creating data for an underrepresented class in your dataset. However, the Deep Tabular Augmenter can be used for more than just that. In this blog post, I want to show you how to use the technique to create data for regression tasks.
First, we need a dataset. Here, I use the infamous Boston Housing Dataset.
import pandas as pd
import numpy as np
import torch
from torch import nn
from torch import optim
from sklearn.preprocessing import StandardScaler
from functools import partial
import deep_tabular_augmentation as dta
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
DATA_PATH = 'data/housing.csv'
header_list = ['CRIM', 'ZN','INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
df = pd.read_csv(DATA_PATH, delim_whitespace=True, names=header_list)
df.head()
df.shape
With only 506 observations to work with, this dataset is a pretty good use case for the Deep Tabular Augmenter. Usually, people use it to predict MEDV, the median value of owner-occupied homes in $1000's. Let's first create train and test datasets:
X_train, X_test = train_test_split(df, test_size=0.1, random_state=42)
x_scaler = StandardScaler()
X_train_scaled = x_scaler.fit_transform(X_train)
X_test_scaled = x_scaler.transform(X_test)
pd.DataFrame(X_train_scaled, columns=list(df.columns)).head()
Then, we'll put the data into dataloaders so our Deep Tabular Augmenter can work on it. This time, we use a slightly different dataloader function: dta.create_datasets_no_target_var. This is a convenience function for when we do not have a target variable. Usually, for example in the fraud-detection case, we have a specific target variable at hand for which we do not want to create fake data. Here it is different: we also want to create fake data for the MEDV variable, which we will use as the target in later tasks.
Also, make sure you have at least version 0.5.0 of deep_tabular_augmentation, otherwise you won't have this function. If not, just run: pip install deep_tabular_augmentation --upgrade
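If you're unsure which version you have installed, here is a quick check (my own addition, using only the Python standard library):
import importlib.metadata
# should print at least 0.5.0
print(importlib.metadata.version('deep_tabular_augmentation'))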
datasets = dta.create_datasets_no_target_var(X_train_scaled, X_test_scaled)
data = dta.DataBunch(*dta.create_loaders(datasets, bs=1024))
From now on, everything should be very familiar. Let's start with the basic architecture of our model:
D_in = X_train_scaled.shape[1]
VAE_arch = [50, 12, 12]
df_cols = list(df.columns)
model = dta.Autoencoder(D_in, VAE_arch, latent_dim=5).to(device)
opt = optim.Adam(model.parameters(), lr=0.01)
loss_func = dta.customLoss()
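dta.customLoss is provided by the library. Conceptually, a VAE loss combines a reconstruction term with a KL-divergence term that keeps the latent space close to a standard normal. A generic sketch of such a loss (for illustration only, not the library's exact implementation) could look like this:
def vae_loss(x_recon, x, mu, logvar):
    # how well the decoder reconstructs the original input
    recon_loss = nn.functional.mse_loss(x_recon, x, reduction='sum')
    # KL divergence between the learned latent distribution and N(0, 1)
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kld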
Put everything data-related into the Learner class:
learn = dta.Learner(model, opt, loss_func, data, cols=df_cols)
run = dta.Runner(cb_funcs=[dta.LR_Find, dta.Recorder])
run.fit(100, learn)
run.recorder.plot(skip_last=5)
Set up a learning rate schedule of your choice and train the model:
sched = dta.combine_scheds([0.3, 0.7], [dta.sched_cos(0.01, 0.1), dta.sched_cos(0.1, 0.01)])
cbfs = [partial(dta.LossTracker, show_every=50), dta.Recorder, partial(dta.ParamScheduler, 'lr', sched)]
model = dta.Autoencoder(D_in, VAE_arch, latent_dim=20).to(device)
opt = optim.Adam(model.parameters(), lr=0.01)
learn = dta.Learner(model, opt, loss_func, data, cols=df_cols)
run = dta.Runner(cb_funcs=cbfs)
run.fit(400, learn)
Let's have a look at the created data:
new_data_points = 1000
df_fake = run.predict_df(learn, no_samples=new_data_points, scaler=x_scaler)
std_list = list(df[df_cols].std()/10)
df_fake_with_noise = run.predict_with_noise_df(learn, no_samples=new_data_points, mu=0, sigma=std_list, scaler=x_scaler)
df_fake_with_noise.head()
As RAD and CHAS are not continuous, let's just round them to the nearest integer.
df_fake_with_noise['RAD'] = df_fake_with_noise['RAD'].round()
df_fake_with_noise['CHAS'] = df_fake_with_noise['CHAS'].round()
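As a quick sanity check (my own addition, not part of the original workflow), we can compare the summary statistics of the real and the synthetic data, assuming the synthetic DataFrame carries the same columns as the original:
# compare means and standard deviations of real vs. synthetic data
stats_comparison = pd.DataFrame({
    'real_mean': df[df_cols].mean(),
    'fake_mean': df_fake_with_noise[df_cols].mean(),
    'real_std': df[df_cols].std(),
    'fake_std': df_fake_with_noise[df_cols].std(),
})
stats_comparison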
Now that we have that, let's move on to the evaluation part. Training is all well and good, but we want to see some results. So let's plot the data:
fig, ax = plt.subplots()
fig.set_size_inches(15.5, 8.5)
ax.scatter(df['MEDV'], df['AGE'], alpha=0.5)
ax.set_xlabel('MEDV', fontsize=15)
ax.set_ylabel('AGE', fontsize=15)
ax.set_title('Real Data: MEDV vs AGE', fontsize=15)
ax.grid(True)
fig.tight_layout()
plt.show()
fig, ax = plt.subplots()
fig.set_size_inches(15.5, 8.5)
ax.scatter(df_fake_with_noise['MEDV'], df_fake_with_noise['AGE'], alpha=0.5)
ax.set_xlabel('MEDV', fontsize=15)
ax.set_ylabel('AGE', fontsize=15)
ax.set_title('Fake Data: MEDV vs AGE', fontsize=15)
ax.grid(True)
fig.tight_layout()
plt.show()
fig, ax = plt.subplots()
fig.set_size_inches(15.5, 8.5)
ax.scatter(df['MEDV'], df['RM'], c=df['LSTAT'], alpha=0.5)
ax.set_xlabel('MEDV', fontsize=15)
ax.set_ylabel('RM', fontsize=15)
ax.set_title('Real Data: MEDV vs avg no of rooms by percentage of lower status of population', fontsize=15)
ax.grid(True)
fig.tight_layout()
plt.show()
fig, ax = plt.subplots()
fig.set_size_inches(15.5, 8.5)
ax.scatter(df_fake_with_noise['MEDV'], df_fake_with_noise['RM'], c=df_fake_with_noise['LSTAT'], alpha=0.5)
ax.set_xlabel('MEDV', fontsize=15)
ax.set_ylabel('RM', fontsize=15)
ax.set_title('Fake Data: MEDV vs avg no of rooms by percentage of lower status of population', fontsize=15)
ax.grid(True)
fig.tight_layout()
plt.show()
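Of course, the real test is whether a downstream model benefits from the extra rows. Here is a minimal sketch (my own addition, not part of the original post) that fits a simple scikit-learn regressor on the real training rows plus the synthetic rows and evaluates it on the held-out test set; it assumes the synthetic data is on the original scale, and any regressor would do:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# combine real training rows with the synthetic rows (both on the original scale)
train_aug = pd.concat([X_train, df_fake_with_noise], ignore_index=True)

feature_cols = [col for col in df_cols if col != 'MEDV']
reg = RandomForestRegressor(n_estimators=200, random_state=42)
reg.fit(train_aug[feature_cols], train_aug['MEDV'])

preds = reg.predict(X_test[feature_cols])
print(f"Test MSE with augmented training data: {mean_squared_error(X_test['MEDV'], preds):.2f}")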
I think those plots look pretty awesome! So remember: whenever you just don't have enough data for your model to train properly, you can try the Deep Tabular Augmenter to create enough fake data so that your model is actually able to learn something. I hope this helps, and stay tuned for more!
Lasse