A common task in almost every data science project is preprocessing your data. This is especially true for tabular data. In this blog post I want to build a preprocessing function which will handle all of these preprocessing steps for us. On top of that, I will add enough flexibility to the function to make it useful beyond this example.

Let's start by loading the data.

Imports

import numpy as np
import pandas as pd
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

We later want to save and load data with pickle, so let's define these functions:

def save_obj(obj, name):
    with open(f'{name}.pkl', 'wb') as f:
        pickle.dump(obj, f)

def load_obj(name):
    with open(f'{name}.pkl', 'rb') as f:
        return pickle.load(f)
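
As a quick sanity check, these helpers round-trip any Python object (the file name 'example' below is just for illustration):

# round-trip a small dictionary through pickle
save_obj({'a': 1, 'b': 2}, 'example')   # writes example.pkl to the working directory
load_obj('example')
{'a': 1, 'b': 2}
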
df = pd.read_csv('train.csv')
df.head()
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

df.shape
(1460, 81)

We have 81 columns in our data. That's a lot; for this blog we only need a handful of these variables, but the functions we build will work for datasets of any size. We define which columns to keep and then check the data types.

to_keep = ['Id', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'OverallQual', 'YearBuilt', 'CentralAir', 'SaleType', 'SalePrice']
df = df[to_keep]
df.head()
Id MSZoning LotFrontage LotArea Street OverallQual YearBuilt CentralAir SaleType SalePrice
0 1 RL 65.0 8450 Pave 7 2003 Y WD 208500
1 2 RL 80.0 9600 Pave 6 1976 Y WD 181500
2 3 RL 68.0 11250 Pave 7 2001 Y WD 223500
3 4 RL 60.0 9550 Pave 7 1915 Y WD 140000
4 5 RL 84.0 14260 Pave 8 2000 Y WD 250000
pd.set_option('display.max_rows', 20)
df.dtypes
Id               int64
MSZoning        object
LotFrontage    float64
LotArea          int64
Street          object
OverallQual      int64
YearBuilt        int64
CentralAir      object
SaleType        object
SalePrice        int64
dtype: object

Datetime vs Continuous vs Categorical

In tabular data, we usually encounter three kinds of variables: dates, continuous variables and categorical variables (of which booleans are a special case). We need to define which column is of which type. This dataset does not contain a datetime column, so I will generate some random datetime data to show you how to cope with datetime data too.

df["Fake_date"] = np.random.choice(pd.date_range('1980-01-01', '2000-01-01'), len(df)).astype('str')
df["Fake_date_2"] = np.random.choice(pd.date_range('1980-01-01', '2000-01-01'), len(df)).astype('str')
df.head()
Id MSZoning LotFrontage LotArea Street OverallQual YearBuilt CentralAir SaleType SalePrice Fake_date Fake_date_2
0 1 RL 65.0 8450 Pave 7 2003 Y WD 208500 1985-12-22T00:00:00.000000000 1988-05-22T00:00:00.000000000
1 2 RL 80.0 9600 Pave 6 1976 Y WD 181500 1999-11-12T00:00:00.000000000 1983-08-11T00:00:00.000000000
2 3 RL 68.0 11250 Pave 7 2001 Y WD 223500 1998-03-07T00:00:00.000000000 1991-12-15T00:00:00.000000000
3 4 RL 60.0 9550 Pave 7 1915 Y WD 140000 1985-04-11T00:00:00.000000000 1980-11-27T00:00:00.000000000
4 5 RL 84.0 14260 Pave 8 2000 Y WD 250000 1980-12-04T00:00:00.000000000 1982-10-01T00:00:00.000000000
df.dtypes
Id               int64
MSZoning        object
LotFrontage    float64
LotArea          int64
Street          object
OverallQual      int64
YearBuilt        int64
CentralAir      object
SaleType        object
SalePrice        int64
Fake_date       object
Fake_date_2     object
dtype: object

We now need to define which column should be of which type. We use three lists for this:

date_type = ['Fake_date', 'Fake_date_2']
continuous_type = ['LotFrontage', 'LotArea', 'YearBuilt', 'SalePrice']
categorical_type = ['MSZoning', 'Street', 'OverallQual', 'SaleType','CentralAir']

I defined 'OverallQual' as a category because, even though this variable is in ascending order, we do not know whether the difference between an OverallQual of 6 and 7 has the same magnitude as the difference between 0 and 1. However, you could define this variable as continuous as well.
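
If you would rather keep the ordering information explicit, pandas also offers ordered categoricals. A minimal sketch (optional, not used in the rest of this post):

# optional alternative: declare OverallQual as an ordered categorical
overall_qual_ordered = pd.Categorical(df['OverallQual'],
                                      categories=sorted(df['OverallQual'].unique()),
                                      ordered=True)
overall_qual_ordered[:5]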

We then define a small function which takes our lists as input and converts the columns to the correct data types. (This first version still picks up the date_type list from the surrounding scope; we will refine it below.)

def df_to_type(df, cont_type, cat_type):
    # cast the categorical columns to pandas' category dtype
    if cat_type is not None:
        df[cat_type] = df[cat_type].astype('category')
    # note: this first version still reads the date_type list from the surrounding scope
    for i in date_type:
        df[i] = pd.to_datetime(df[i])
    return df

Did it work?

df_to_type(df, continuous_type, categorical_type)
df.dtypes
Id                      int64
MSZoning             category
LotFrontage           float64
LotArea                 int64
Street               category
OverallQual          category
YearBuilt               int64
CentralAir           category
SaleType             category
SalePrice               int64
Fake_date      datetime64[ns]
Fake_date_2    datetime64[ns]
dtype: object

That looks good! We can now make use of pandas' various datetime functions to add more information to our data. For this I use fastai's add_datepart function; however, I do not want to import the whole library, so I simply copy the parts needed for this specific function.

import re

def ifnone(a, b):
    "`a` if `a` is not None, otherwise `b`."
    return b if a is None else a
    
def make_date(df, date_field):
    "Make sure `df[date_field]` is of the right date type."
    field_dtype = df[date_field].dtype
    if isinstance(field_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
        field_dtype = np.datetime64
    if not np.issubdtype(field_dtype, np.datetime64):
        df[date_field] = pd.to_datetime(df[date_field], infer_datetime_format=True)
    
def add_datepart(df, field_name, prefix=None, drop=True, time=False):
    "Helper function that adds columns relevant to a date in the column `field_name` of `df`."
    make_date(df, field_name)
    field = df[field_name]
    prefix = ifnone(prefix, re.sub('[Dd]ate$', '', field_name))
    attr = ['Year', 'Month', 'Day', 'Dayofweek', 'Dayofyear', 'Is_month_end', 'Is_month_start',
            'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start']
    if time: attr = attr + ['Hour', 'Minute', 'Second']
    for n in attr: df[prefix + n] = getattr(field.dt, n.lower())
    # newer pandas versions deprecate `dt.week` in favour of `dt.isocalendar().week`
    week = field.dt.isocalendar().week if hasattr(field.dt, 'isocalendar') else field.dt.week
    df.insert(3, prefix+'Week', week)
    mask = ~field.isna()
    df[prefix + 'Elapsed'] = np.where(mask,field.values.astype(np.int64) // 10 ** 9,None)
    if drop: df.drop(field_name, axis=1, inplace=True)
    return df

Let's extend the function so that, after setting the correct data types, it takes all of our date variables and transforms them with add_datepart.

def df_to_type(df, date_type=None, cont_type=None, cat_type=None):
    if cat_type is not None:
        df[cat_type] = df[cat_type].astype('category')
    if date_type is not None:
        for i in date_type:
            df[i] = pd.to_datetime(df[i])
            df = add_datepart(df, i)
    return df  
df_1 = df_to_type(df, date_type, continuous_type, categorical_type)
df_1.head()
Id MSZoning LotFrontage Fake_date_2Week Fake_Week LotArea Street OverallQual YearBuilt CentralAir ... Fake_date_2Day Fake_date_2Dayofweek Fake_date_2Dayofyear Fake_date_2Is_month_end Fake_date_2Is_month_start Fake_date_2Is_quarter_end Fake_date_2Is_quarter_start Fake_date_2Is_year_end Fake_date_2Is_year_start Fake_date_2Elapsed
0 1 RL 65.0 20 51 8450 Pave 7 2003 Y ... 22 6 143 False False False False False False 580262400
1 2 RL 80.0 32 45 9600 Pave 6 1976 Y ... 11 3 223 False False False False False False 429408000
2 3 RL 68.0 50 10 11250 Pave 7 2001 Y ... 15 6 349 False False False False False False 692755200
3 4 RL 60.0 48 15 9550 Pave 7 1915 Y ... 27 3 332 False False False False False False 344131200
4 5 RL 84.0 39 49 14260 Pave 8 2000 Y ... 1 4 274 False True False True False False 402278400

5 rows × 36 columns

Pretty cool, right? The next step is to split the data into a train and a validation set. There are usually three ways to do this: split randomly, split based on an identifier column (so that the same person is either in train or valid regardless of how many rows she represents in the data), or split by date (usually for time series).

I give three examples of how to do this with this specific data. I start by randomly splitting the data.

First, we need to define our dependent variable and the remaining columns:

dep_var = 'SalePrice'
cols = list(df_1.columns)
cols.remove(dep_var)

Here's how we can make a random split using train_test_split:

X_train, X_test, y_train, y_test = train_test_split(df_1[cols], df_1[dep_var], test_size=0.33, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((978, 35), (482, 35), (978,), (482,))

Another common way to split the data is based on an ID, so that each ID can only be in one group. Next to the dependent variable we need to define the variable on which to split.

split_var = 'Id'
# list of unique_id
unique_id_array = list(df_1[split_var].unique())

# split into train and test data based on uid
test_size=0.33
cnt_uid = len(unique_id_array)
len_test = np.round(cnt_uid*test_size).astype(int)
len_train = cnt_uid - len_test

test_idx = list(np.random.choice(unique_id_array, len_test, replace=False))
train_idx = list(set(unique_id_array) - set(test_idx))

X_train = df_1[df_1[split_var].isin(train_idx)].copy()
y_train = X_train[dep_var]
X_train = X_train[cols]
X_test = df_1[df_1[split_var].isin(test_idx)].copy()
y_test = X_test[dep_var]
X_test = X_test[cols]
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((978, 35), (482, 35), (978,), (482,))

In this case we end up with exactly the same shapes for train and test as before. That's because the Id column is unique here. To show that this changes when there is more than one row per Id, I create a Fake_ID column. Let's assume we have 10 IDs:

IDS = np.array([1,2,3,4,5,6,7,8,9,10])
n = df_1.shape[0]
df_1['Fake_ID'] = np.resize(IDS, n)
df_1.Fake_ID.head(12)
0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10     1
11     2
Name: Fake_ID, dtype: int64

We now do the same thing as before, only this time using Fake_ID as the split variable:

split_var = 'Fake_ID'

# list of unique_id
unique_id_array = list(df_1[split_var].unique())

# split into train and test data based on uid
test_size=0.33
cnt_uid = len(unique_id_array)
len_test = np.round(cnt_uid*test_size).astype(int)
len_train = cnt_uid - len_test

test_idx = list(np.random.choice(unique_id_array, len_test, replace=False))
train_idx = list(set(unique_id_array) - set(test_idx))

X_train = df_1[df_1[split_var].isin(train_idx)].copy()
y_train = X_train[dep_var]
X_train = X_train[cols]
X_test = df_1[df_1[split_var].isin(test_idx)].copy()
y_test = X_test[dep_var]
X_test = X_test[cols]
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((1022, 35), (438, 35), (1022,), (438,))
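
As an aside: scikit-learn ships GroupShuffleSplit, which implements the same idea of keeping every ID entirely on one side of the split. A minimal sketch of the equivalent split (not used further in this post):

from sklearn.model_selection import GroupShuffleSplit

# one random group-wise split: every Fake_ID ends up entirely in train or entirely in test
# (test_size here is the fraction of groups placed in the test set)
gss = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
train_idx, test_idx = next(gss.split(df_1[cols], df_1[dep_var], groups=df_1['Fake_ID']))
X_train_alt, X_test_alt = df_1.iloc[train_idx][cols], df_1.iloc[test_idx][cols]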

Beautiful! Let's move on to our last way to split data: by date. We created the Fake_date variable with random dates between 1980-01-01 and 2000-01-01. Let's say we want roughly the last six months to be our test set.

# cond selects the training rows; everything from June 1999 onwards becomes the test set
cond = (df_1.Fake_Year<1999) | (df_1.Fake_Month<6)

train_idx = np.where( cond)[0]
test_idx = np.where(~cond)[0]

X_train = df_1.iloc[train_idx]
y_train = X_train[dep_var]
X_train = X_train[cols]
X_test = df_1.iloc[test_idx]
y_test = X_test[dep_var]
X_test = X_test[cols]
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((1413, 35), (47, 35), (1413,), (47,))

We have now covered the three most common ways to split your data. Let's put them into a function.

def split_df(df, x_cols, dep_var, test_size, split_mode='random', split_var=None, cond=None):
    '''
    split_mode can take three values: random, on_split_id, on_condition
    '''
    if split_mode == 'random':
        X_train, X_test, y_train, y_test = train_test_split(df[x_cols], df[dep_var], test_size=test_size)
    elif split_mode == 'on_split_id':
        if split_var is None:
            raise ValueError('Give the name of split_var')
        # list of unique ids
        unique_id_array = list(df[split_var].unique())

        # split into train and test data based on uid
        cnt_uid = len(unique_id_array)
        len_test = np.round(cnt_uid * test_size).astype(int)

        test_idx = list(np.random.choice(unique_id_array, len_test, replace=False))
        train_idx = list(set(unique_id_array) - set(test_idx))

        X_train = df[df[split_var].isin(train_idx)].copy()
        y_train = X_train[dep_var]
        X_train = X_train[x_cols]
        X_test = df[df[split_var].isin(test_idx)].copy()
        y_test = X_test[dep_var]
        X_test = X_test[x_cols]
    elif split_mode == 'on_condition':
        if cond is None:
            raise ValueError('You have to specify cond, for example: cond = (df_1.Fake_Year<1999) | (df_1.Fake_Month<6)')
        train_idx = np.where(cond)[0]
        test_idx = np.where(~cond)[0]

        X_train = df.iloc[train_idx]
        y_train = X_train[dep_var]
        X_train = X_train[x_cols]
        X_test = df.iloc[test_idx]
        y_test = X_test[dep_var]
        X_test = X_test[x_cols]
    else:
        raise ValueError("split_mode must be 'random', 'on_split_id' or 'on_condition'")

    return X_train, X_test, y_train, y_test

Let's check our function for all three split_modes:

X_train, X_test, y_train, y_test   = split_df(df=df_1, x_cols=cols, dep_var=dep_var, test_size=0.33, split_mode='random')
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((978, 35), (482, 35), (978,), (482,))
X_train, X_test, y_train, y_test   = split_df(df=df_1, x_cols=cols, dep_var=dep_var, test_size=0.33, split_mode='on_split_id', split_var=split_var)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((1022, 35), (438, 35), (1022,), (438,))
X_train, X_test, y_train, y_test   = split_df(df=df_1, x_cols=cols, dep_var=dep_var, test_size=0.33, split_mode='on_condition', cond=cond)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((1413, 35), (47, 35), (1413,), (47,))

Absolutely brilliant. And we're almost done now. The last piece of the puzzle is the transformation and standardization of our data. We convert the categorical variables into numbers and save the mapping in a dictionary, and we standardize the continuous variables with a scaler of our choice and save that as well. IMPORTANT: we build the dictionaries and fit the scaler on the training set only and then apply them to the test set, to avoid any leakage.

First, we build a function which transforms all of our categorical variables and saves the mappings to path. For this we reuse the save_obj helper defined above (load_obj will read them back later).

def cat_transform(X_train, X_test, cat_type, path=''):
    dict_list = []
    dict_inv_list = []
    for i in cat_type:
        # mapping code -> category, built on the training set only
        dict_ = dict(enumerate(X_train[i].cat.categories))
        dict_inv_ = {v: k for k, v in dict_.items()}
        X_train[i] = X_train[i].map(dict_inv_)
        X_test[i] = X_test[i].map(dict_inv_)
        dict_list.append(dict_)
        dict_inv_list.append(dict_inv_)
    save_obj(dict_list, f'{path}dict_list_cat')
    save_obj(dict_inv_list, f'{path}dict_inv_list_cat')
    return X_train, X_test, dict_list, dict_inv_list
X_train, X_test, y_train, y_test = split_df(df=df_1, x_cols=cols, dep_var=dep_var, test_size=0.33, split_mode='random')
X_train, X_test, dict_list, dict_inv_list = cat_transform(X_train, X_test, categorical_type)
X_train[categorical_type].head()
MSZoning Street OverallQual SaleType CentralAir
744 3 1 7 8 1
492 3 1 5 6 1
307 4 1 5 8 1
985 3 1 4 3 0
484 3 1 4 8 1
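
If you want to see the original labels again, the returned dict_list maps the codes back. For example for the first categorical column, MSZoning:

# map the integer codes of MSZoning back to the original labels
X_train['MSZoning'].map(dict_list[0]).head()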

With this function we automatically save our dictionaries and mappings to the corresponding path. Next, we take the continuous variables, scale them with a standardizer of our choice and save the fitted standardizer to path. We have to indicate the ID column, because we do not want to standardize it, and we also have to decide whether we want to standardize our target variable.

def cont_standardize(X_train, X_test, y_train, y_test, cat_type=None, id_type=None, transform_y=True, path='', standardizer='StandardScaler'):
    if standardizer == 'StandardScaler':
        scaler_cls = StandardScaler
    elif standardizer == 'MinMaxScaler':
        scaler_cls = MinMaxScaler
    else:
        raise ValueError('standardizer can either be StandardScaler or MinMaxScaler')

    # everything that is neither categorical nor the ID column is treated as continuous
    cont_type = list(X_train.columns)
    if cat_type is not None:
        cont_type = [c for c in cont_type if c not in cat_type]
    if id_type is not None:
        cont_type.remove(id_type)

    # fit the scaler on the training set only and apply it to both sets
    scaler = scaler_cls()
    X_train[cont_type] = scaler.fit_transform(X_train[cont_type])
    X_test[cont_type] = scaler.transform(X_test[cont_type])
    save_obj(scaler, f'{path}{standardizer}')

    if transform_y:
        scaler_y = scaler_cls()
        y_train = scaler_y.fit_transform(y_train.values.reshape(-1, 1))
        y_test = scaler_y.transform(y_test.values.reshape(-1, 1))
        save_obj(scaler_y, f'{path}{standardizer}_y')
        return X_train, X_test, y_train, y_test, scaler, scaler_y

    return X_train, X_test, y_train, y_test, scaler
id_type='Id'
X_train, X_test, y_train, y_test, scaler, scaler_y = cont_standardize(X_train, X_test, y_train, y_test, cat_type=categorical_type, id_type='Id', transform_y=True)
X_train.head()
Id MSZoning LotFrontage Fake_date_2Week Fake_Week LotArea Street OverallQual YearBuilt CentralAir ... Fake_date_2Day Fake_date_2Dayofweek Fake_date_2Dayofyear Fake_date_2Is_month_end Fake_date_2Is_month_start Fake_date_2Is_quarter_end Fake_date_2Is_quarter_start Fake_date_2Is_year_end Fake_date_2Is_year_start Fake_date_2Elapsed
744 0.043062 3 -1.147645 1.484951 1.160387 -0.511627 1 7 0.716393 1 ... -0.455404 1.448463 1.544753 -0.15861 -0.186871 -0.101639 -0.111456 0.0 -0.064084 -1.213687
492 -0.554919 3 1.391675 -0.384396 -0.232891 0.521756 1 5 1.155090 1 ... 0.557602 -0.500088 -0.401376 -0.15861 -0.186871 -0.101639 -0.111456 0.0 -0.064084 -1.640805
307 -0.993913 4 NaN -1.252306 -1.029050 -0.255387 1 5 -1.747058 1 ... 0.670158 1.448463 -1.250422 -0.15861 -0.186871 -0.101639 -0.111456 0.0 -0.064084 -0.513396
985 0.614941 3 -0.076369 -0.117346 0.032495 0.044997 1 4 -0.734681 0 ... 0.895271 1.448463 -0.086561 -0.15861 -0.186871 -0.101639 -0.111456 0.0 -0.064084 0.043000
484 -0.573903 3 NaN 0.149703 -0.630971 -0.271827 1 4 -0.329730 1 ... 0.107377 -1.474364 0.132856 -0.15861 -0.186871 -0.101639 -0.111456 0.0 -0.064084 0.887644

5 rows × 35 columns

y_train[:5]
array([[-0.02988161],
       [-0.11953991],
       [-1.1544938 ],
       [-0.71334758],
       [-0.62014767]])

We have now saved our StandardScaler to path and transformed the data the way we wanted it. To see the "real" values of our y_train we simply invert our standardizer:

scaler_y.inverse_transform(y_train[:5])
array([[180000.],
       [172785.],
       [ 89500.],
       [125000.],
       [132500.]])
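
Because everything was written to disk with save_obj, the fitted preprocessing can be reloaded later, for example to score new data. A minimal sketch, assuming a new dataframe df_new with the same columns and the default (empty) path used above:

# reload the saved artifacts
dict_list = load_obj('dict_list_cat')    # one {code: category} dict per categorical column
scaler = load_obj('StandardScaler')      # scaler fitted on the training set
scaler_y = load_obj('StandardScaler_y')  # scaler fitted on the target variable

# re-apply the categorical mapping (category -> code) to the new data
for col, mapping in zip(categorical_type, dict_list):
    df_new[col] = df_new[col].map({v: k for k, v in mapping.items()})

The continuous columns of df_new can then be scaled with scaler.transform, just like the test set in cont_standardize.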

Beautiful! Now I want to show you how incredibly quickly we can set up our data with these functions. I deleted everything in memory and started from here.

Complete run-through

df = pd.read_csv('train.csv')
to_keep = ['Id', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'OverallQual', 'YearBuilt', 'CentralAir', 'SaleType', 'SalePrice']
df = df[to_keep]
df["Fake_date"] = np.random.choice(pd.date_range('1980-01-01', '2000-01-01'), len(df)).astype('str')
df.head()
Id MSZoning LotFrontage LotArea Street OverallQual YearBuilt CentralAir SaleType SalePrice Fake_date
0 1 RL 65.0 8450 Pave 7 2003 Y WD 208500 1988-10-01T00:00:00.000000000
1 2 RL 80.0 9600 Pave 6 1976 Y WD 181500 1989-05-17T00:00:00.000000000
2 3 RL 68.0 11250 Pave 7 2001 Y WD 223500 1992-09-20T00:00:00.000000000
3 4 RL 60.0 9550 Pave 7 1915 Y WD 140000 1984-08-11T00:00:00.000000000
4 5 RL 84.0 14260 Pave 8 2000 Y WD 250000 1986-04-07T00:00:00.000000000
date_type = ['Fake_date']
continuous_type = ['LotFrontage', 'LotArea', 'YearBuilt', 'SalePrice']
categorical_type = ['MSZoning', 'Street', 'OverallQual', 'SaleType','CentralAir']
df_1 = df_to_type(df, date_type, continuous_type, categorical_type)
dep_var = 'SalePrice'
cols = list(df_1.columns)
cols.remove(dep_var)
X_train, X_test, y_train, y_test = split_df(df=df_1, x_cols=cols, dep_var=dep_var, test_size=0.33, split_mode='random')
X_train, X_test, dict_list, dict_inv_list = cat_transform(X_train, X_test, categorical_type)
X_train, X_test, y_train, y_test, scaler, scaler_y = cont_standardize(X_train, X_test, y_train, y_test, cat_type=categorical_type, id_type='Id', transform_y=True)
X_train.head()
Id MSZoning LotFrontage Fake_Week LotArea Street OverallQual YearBuilt CentralAir SaleType ... Fake_Day Fake_Dayofweek Fake_Dayofyear Fake_Is_month_end Fake_Is_month_start Fake_Is_quarter_end Fake_Is_quarter_start Fake_Is_year_end Fake_Is_year_start Fake_Elapsed
253 254 3 0.570755 -0.515555 -0.113494 1 5 -0.242016 1 1 ... -1.189170 -1.484321 -0.556901 -0.171679 -0.180928 -0.111456 -0.106655 -0.064084 -0.045268 -1.648408
1185 1186 3 -0.416423 0.870733 -0.079594 1 4 -1.588959 1 8 ... -0.623669 1.480279 0.922893 -0.171679 -0.180928 -0.111456 -0.106655 -0.064084 -0.045268 -1.231187
515 516 3 0.926139 0.540664 0.137262 1 9 1.273295 1 6 ... 1.525232 -0.496121 0.536450 -0.171679 -0.180928 -0.111456 -0.106655 -0.064084 -0.045268 -0.907465
433 434 3 1.163061 0.012555 0.016602 1 5 0.869212 1 8 ... -0.962969 0.986179 0.036902 -0.171679 -0.180928 -0.111456 -0.106655 -0.064084 -0.045268 -0.932367
1406 1407 3 -0.021552 -0.911637 -0.192566 1 4 0.027373 1 8 ... 1.412132 0.986179 -0.924493 -0.171679 -0.180928 -0.111456 -0.106655 -0.064084 -0.045268 1.422486

5 rows × 22 columns

y_train[:5]
array([[-0.2703821 ],
       [-0.96272526],
       [ 2.92223258],
       [ 0.02950289],
       [-0.59634404]])

Absolutely brilliant. Look how easily and quickly we pre-processed our data. We can now feed it into any machine learning algorithm we want, and the results are easy to reproduce because we saved the mappings and the scalers. The split function covers the most common ways to create train and test sets, so we stay flexible in how we split our data. I hope this will come in handy for your next machine learning project.

Stay tuned for the next blogpost!

Lasse