How to pre-process data for machine learning
A common task for almost all data science projects is to preprocess your data. This is especially the case for tabular data. In this blog I want to build a pre-process function which can handle will handle all these pre-processing steps for us. On top of that, I will add enough flexibility to that function to make it useful not just in this example.
Let's start by loading the data.
import numpy as np
import pandas as pd
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
We later want to save and load data with pickle, so let's define these functions:
def save_obj(obj, name ):
    with open(f'{name}.pkl', 'wb') as f:
        pickle.dump(obj, f)
def load_obj(name ):
    with open(f'{name}.pkl', 'rb') as f:
        return pickle.load(f)
df = pd.read_csv('train.csv')
df.head()
df.shape
We've got 81 columns in our data. That's a lot of data, for this blog we only need a couple of these variables. The function will then work for any length of datasets. We define which columns to keep and then check the data types.
to_keep = ['Id', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'OverallQual', 'YearBuilt', 'CentralAir', 'SaleType', 'SalePrice']
df = df[to_keep]
df.head()
pd.set_option('display.max_rows', 20)
df.dtypes
In tabular data, there are usually four different data types: dates, continuous and categorical variables (of which boolean are a special type). We need to define which column is of which type. This dataset does not contain a datetime column, so I will define some random datetime data to show you how to cope with datetime data too.
df["Fake_date"] = np.random.choice(pd.date_range('1980-01-01', '2000-01-01'), len(df)).astype('str')
df["Fake_date_2"] = np.random.choice(pd.date_range('1980-01-01', '2000-01-01'), len(df)).astype('str')
df.head()
df.dtypes
We now need to define which column should be of which type. We use four lists for this:
date_type = ['Fake_date', 'Fake_date_2']
continuous_type = ['LotFrontage', 'LotArea', 'YearBuilt', 'SalePrice']
categorical_type = ['MSZoning', 'Street', 'OverallQual', 'SaleType','CentralAir']
I defined 'OverallQual' as a category, because even though this variable is in ascending order, we do not know whether the difference from OverallQual of 6 and 7 is the same magnitude than OverallQual from 0 to 1. However, you can define this variable as continous as well.
We then define a small function, which will take our lists as input and transform our data to the correct data types.
def df_to_type(df, cont_type, cat_type):
    if cat_type is not None:
        df[cat_type] = df[cat_type].astype('category')
    for i in date_type:
        df[i] = pd.to_datetime(df[i])
    return df    
Did it work?
df_to_type(df, continuous_type, categorical_type)
df.dtypes
That looks good! We can now make use of pandas various datetime functions to add more informations to our data. I therefore make use of fastai's add_datepart function. However, I do not want to import the whole library, so I simple get the parts for this specific function.
import re
def ifnone(a:any,b:any)->any:
    "`a` if `a` is not None, otherwise `b`."
    return b if a is None else a
    
def make_date(df, date_field):
    "Make sure `df[date_field]` is of the right date type."
    field_dtype = df[date_field].dtype
    if isinstance(field_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
        field_dtype = np.datetime64
    if not np.issubdtype(field_dtype, np.datetime64):
        df[date_field] = pd.to_datetime(df[date_field], infer_datetime_format=True)
    
def add_datepart(df, field_name, prefix=None, drop=True, time=False):
    "Helper function that adds columns relevant to a date in the column `field_name` of `df`."
    make_date(df, field_name)
    field = df[field_name]
    prefix = ifnone(prefix, re.sub('[Dd]ate$', '', field_name))
    attr = ['Year', 'Month', 'Day', 'Dayofweek', 'Dayofyear', 'Is_month_end', 'Is_month_start',
            'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start']
    if time: attr = attr + ['Hour', 'Minute', 'Second']
    for n in attr: df[prefix + n] = getattr(field.dt, n.lower())
    # Pandas removed `dt.week` in v1.1.10
    week = field.dt.isocalendar().week if hasattr(field.dt, 'isocalendar') else field.dt.week
    df.insert(3, prefix+'Week', week)
    mask = ~field.isna()
    df[prefix + 'Elapsed'] = np.where(mask,field.values.astype(np.int64) // 10 ** 9,None)
    if drop: df.drop(field_name, axis=1, inplace=True)
    return df
Let's add to the function, so that after defining the correct data type it takes all of our date-variables and transforms them according to add_datepart.
def df_to_type(df, date_type=None, cont_type=None, cat_type=None):
    if cat_type is not None:
        df[cat_type] = df[cat_type].astype('category')
    if date_type is not None:
        for i in date_type:
            df[i] = pd.to_datetime(df[i])
            df = add_datepart(df, i)
    return df  
df_1 = df_to_type(df, date_type, continuous_type, categorical_type)
df_1.head()
Pretty cool right? The next step is to split the data into train and validation set. There are usually three ways to do this: split randomly, split based on a identifier column (so that the same person is either in train or valid regardless of how many rows she represents in the data) or by date (usually in time series).
I give three examples how to do this given this specific data. I start by randomly splitting the data:
First, we need to define our dependend variable and the rest:
dep_var = 'SalePrice'
cols = list(df_1.columns)
cols.remove(dep_var)
Here's how we can make a random split using the train_test_split:
X_train, X_test, y_train, y_test = train_test_split(df_1[cols], df_1[dep_var], test_size=0.33, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Another common way how to split the data is based on an ID, so that an ID can only be in one group. Next to the dependend variable we need to define the variable on which to split.
split_var = 'Id'
# list of unique_id
unique_id_array = list(df_1[split_var].unique())
# split into train and test data based on uid
test_size=0.33
cnt_uid = len(unique_id_array)
len_test = np.round(cnt_uid*test_size).astype(int)
len_train = cnt_uid - len_test
test_idx = list(np.random.choice(unique_id_array, len_test, replace=False))
train_idx = list(set(unique_id_array) - set(test_idx))
X_train = df_1[df_1[split_var].isin(train_idx)].copy()
y_train = X_train[dep_var]
X_train = X_train[cols]
X_test = df_1[df_1[split_var].isin(test_idx)].copy()
y_test = X_test[dep_var]
X_test = X_test[cols]
X_train.shape, X_test.shape, y_train.shape, y_test.shape
In this case we end up with the exact same shape of all train and test data. This is due to the fact, that in this case the Id is unique. To see that this will change when there is more than one row per Id I create a fake_ID and show you the results. Let's assume we have 10 IDs:
IDS = np.array([1,2,3,4,5,6,7,8,9,10])
n = df_1.shape[0]
df_1['Fake_ID'] = np.resize(IDS, n)
df_1.Fake_ID.head(12)
We now do the same thing as before, only using this Fake_ID as split_ID:
split_var = 'Fake_ID'
# list of unique_id
unique_id_array = list(df_1[split_var].unique())
# split into train and test data based on uid
test_size=0.33
cnt_uid = len(unique_id_array)
len_test = np.round(cnt_uid*test_size).astype(int)
len_train = cnt_uid - len_test
test_idx = list(np.random.choice(unique_id_array, len_test, replace=False))
train_idx = list(set(unique_id_array) - set(test_idx))
X_train = df_1[df_1[split_var].isin(train_idx)].copy()
y_train = X_train[dep_var]
X_train = X_train[cols]
X_test = df_1[df_1[split_var].isin(test_idx)].copy()
y_test = X_test[dep_var]
X_test = X_test[cols]
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Beatiful! So let's move on to our last way to split data, defined by the date. We have created the Fake_date variable, resulting in random dates from 01.01.2980 - 01.01.2000. Let's say, we want the last 6 month to be our testset.
cond = (df_1.Fake_Year<1999) | (df_1.Fake_Month<6)
train_idx = np.where( cond)[0]
test_idx = np.where(~cond)[0]
X_train = df_1.iloc[train_idx]
y_train = X_train[dep_var]
X_train = X_train[cols]
X_test = df_1.iloc[test_idx]
y_test = X_test[dep_var]
X_test = X_test[cols]
X_train.shape, X_test.shape, y_train.shape, y_test.shape
We now have the three most common cases for how to split your data. Let's put that into a function.
def split_df(df, x_cols, dep_var, test_size, split_mode='random', split_var=None, cond=None):
    '''
    split_mode can take three values: random, on_split_id, on_condition
    '''
    if split_mode == 'random':
        from sklearn.model_selection import train_test_split
        X_train, X_test, y_train, y_test = train_test_split(df[x_cols], df_1[dep_var], test_size=test_size)
    elif split_mode == 'on_split_id':
        if split_var is None:
            print('Give name of split_var')
        else:
            # list of unique_id
            unique_id_array = list(df[split_var].unique())
            # split into train and test data based on uid
            test_size=0.33
            cnt_uid = len(unique_id_array)
            len_test = np.round(cnt_uid*test_size).astype(int)
            len_train = cnt_uid - len_test
            test_idx = list(np.random.choice(unique_id_array, len_test, replace=False))
            train_idx = list(set(unique_id_array) - set(test_idx))
            X_train = df[df[split_var].isin(train_idx)].copy()
            y_train = X_train[dep_var]
            X_train = X_train[x_cols]
            X_test = df[df[split_var].isin(test_idx)].copy()
            y_test = X_test[dep_var]
            X_test = X_test[x_cols]
    elif split_mode == 'on_condition':
        if cond is None:
            print('You have to specify cond, for example like so: cond = (df_1.Fake_Year<1999) | (df_1.Fake_Month<6)')
        else:
            train_idx = np.where( cond)[0]
            test_idx = np.where(~cond)[0]
            X_train = df_1.iloc[train_idx]
            y_train = X_train[dep_var]
            X_train = X_train[cols]
            X_test = df_1.iloc[test_idx]
            y_test = X_test[dep_var]
            X_test = X_test[cols]
    else:
        print('Something is not working right, did you specify the split_mode?')
    
    return X_train, X_test, y_train, y_test        
Let's check our function for all three split_modes:
X_train, X_test, y_train, y_test   = split_df(df=df_1, x_cols=cols, dep_var=dep_var, test_size=0.33, split_mode='random')
X_train.shape, X_test.shape, y_train.shape, y_test.shape
X_train, X_test, y_train, y_test   = split_df(df=df_1, x_cols=cols, dep_var=dep_var, test_size=0.33, split_mode='on_split_id', split_var=split_var)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
X_train, X_test, y_train, y_test   = split_df(df=df_1, x_cols=cols, dep_var=dep_var, test_size=0.33, split_mode='on_condition', cond=cond)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Absolutely brilliant. And we're almost done now. The last piece of the puzzle is the transformation and standardization of our data. We convert the categorical variables into numbers and save the mapping into a dictionary. We standardize the continuous variables with a standardizer of choice and save it. IMPORTANT: we build the dictionaries and the standardizer on the trainset only and use it on the testset to avoid any spillover effects.
First, we build a function which will transform all of our categorical variables and save it to path. We therefore need a function to save our dicts (and later load them).
def cat_transform(X_train, X_test, cat_type, path=''):
    dict_list = []
    dict_inv_list = []
    for i in cat_type:
        dict_ = dict( enumerate(X_train[i].cat.categories ) )
        dict_inv_ = {v: k for k, v in dict_.items()}
        X_train[i] = X_train[i].map(dict_inv_)
        X_test[i] = X_test[i].map(dict_inv_)
        dict_list.append(dict_)
        dict_inv_list.append(dict_inv_list)
    dict_name = f'{path}dict_list_cat'
    save_obj(dict_list, dict_name)
    dict_inv_name = f'{path}dict_inv_list_cat'
    save_obj(dict_inv_list, dict_inv_name)
    return X_train, X_test, dict_list, dict_inv_list
X_train, X_test, y_train, y_test = split_df(df=df_1, x_cols=cols, dep_var=dep_var, test_size=0.33, split_mode='random')
X_train, X_test, dict_list, dict_inv_list = cat_transform(X_train, X_test, categorical_type)
X_train[categorical_type].head()
With this function we automatically saved our dictionaries and mappings to the corresponding path. Next, we take the continuous variables, use a standardizer of our choice and save the standardizer to path. We have to indicate the ID column, because we do not want to standardize this column. We also have to define whether we want to standardize our target variable.
def cont_standardize(X_train, X_test, y_train, y_test, cat_type=None, id_type=None, transform_y=True, path='', standardizer='StandardScaler'):
    if standardizer =='StandardScaler':
        scaler = StandardScaler()
        if cat_type==None:
            cont_type = list(X_train.columns)
            cont_type.remove(id_type)
        elif id_type==None:
            list(set(X_train.columns) - set(id_type))
        elif cat_type==None and id_type==None:
            cont_type = list(X_train.columns)
        else:
            cont_type = list(set(X_train.columns) - set(cat_type))
            cont_type.remove(id_type)
                                 
        X_train[cont_type] = scaler.fit_transform(X_train[cont_type])
        X_test[cont_type] = scaler.transform(X_test[cont_type])
        scaler_name = f'{path}StandardScaler'
        save_obj(scaler, scaler_name)
        if transform_y:
            scaler_y = StandardScaler()
            y_train = scaler_y.fit_transform(y_train.values.reshape(-1, 1))
            y_test = scaler_y.transform(y_test.values.reshape(-1, 1))
            scaler_y_name = f'{path}StandardScaler_y'
            save_obj(scaler_y, scaler_name)
        else:
            pass
        if transform_y:
            return X_train, X_test, y_train, y_test, scaler, scaler_y
        else:
            return X_train, X_test, y_train, y_test, scaler
    
    elif standardizer =='MinMaxScaler':
        scaler = MinMaxScaler()
        if cat_type==None:
            cont_type = list(X_train.columns)
            cont_type.remove(id_type)
        elif id_type==None:
            list(set(X_train.columns) - set(id_type))
        elif cat_type==None and id_type==None:
            cont_type = list(X_train.columns)
        else:
            cont_type = list(set(X_train.columns) - set(cat_type))
            cont_type.remove(id_type)
                                 
        X_train[cont_type] = scaler.fit_transform(X_train[cont_type])
        X_test[cont_type] = scaler.transform(X_test[cont_type])
        scaler_name = f'{path}MinMaxScaler'
        save_obj(scaler, scaler_name)
        if transform_y:
            scaler_y = MinMaxScaler()
            y_train = scaler_y.fit_transform(y_train.values.reshape(-1, 1))
            y_test = scaler_y.transform(y_test.values.reshape(-1, 1))
            scaler_y_name = f'{path}MinMaxScaler_y'
            save_obj(scaler_y, scaler_name)
        else:
            pass
        if transform_y:
            return X_train, X_test, y_train, y_test, scaler, scaler_y
        else:
            return X_train, X_test, y_train, y_test, scaler
        
    else:
        print('standardizer can either be StandardScaler or MinMaxScaler')    
id_type='Id'
X_train, X_test, y_train, y_test, scaler, scaler_y = cont_standardize(X_train, X_test, y_train, y_test, cat_type=categorical_type, id_type='Id', transform_y=True)
X_train.head()
y_train[:5]
We have now saved our StandardScaler to path and transformed the data the way we wanted it. To see the "real" values of our y_train we simply invert our standardizer:
scaler_y.inverse_transform(y_train[:5])
Beautiful! And now I want to show you how incredibly quickly we can now setup our data. I deleted everything in memory and started from here.
df = pd.read_csv('train.csv')
to_keep = ['Id', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'OverallQual', 'YearBuilt', 'CentralAir', 'SaleType', 'SalePrice']
df = df[to_keep]
df["Fake_date"] = np.random.choice(pd.date_range('1980-01-01', '2000-01-01'), len(df)).astype('str')
df.head()
date_type = ['Fake_date']
continuous_type = ['LotFrontage', 'LotArea', 'YearBuilt', 'SalePrice']
categorical_type = ['MSZoning', 'Street', 'OverallQual', 'SaleType','CentralAir']
df_1 = df_to_type(df, date_type, continuous_type, categorical_type)
dep_var = 'SalePrice'
cols = list(df_1.columns)
cols.remove(dep_var)
X_train, X_test, y_train, y_test = split_df(df=df_1, x_cols=cols, dep_var=dep_var, test_size=0.33, split_mode='random')
X_train, X_test, dict_list, dict_inv_list = cat_transform(X_train, X_test, categorical_type)
X_train, X_test, y_train, y_test, scaler, scaler_y = cont_standardize(X_train, X_test, y_train, y_test, cat_type=categorical_type, id_type='Id', transform_y=True)
X_train.head()
y_train[:5]
Absolutely brilliant. Look how easy and fast we pre-processed our data. We now can start any Machine Learning Algorithm we want. It's super easy to replicate the results, because we saved the mapping and the scaler. We automatically created train and testset, we are flexible to use this in any way we want to split our data. I hope this will come in handy for your next machine learning project.
Stay tuned for the next blogpost!
Lasse