How to make use of the ml-prepare package

import mlprepare as mlp
import pandas as pd
import numpy

Load Data

df = pd.read_csv('TrainAndValid.csv', low_memory=False)
df.head()
SalesID SalePrice MachineID ModelID datasource auctioneerID YearMade MachineHoursCurrentMeter UsageBand saledate ... Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer Grouser_Type Backhoe_Mounting Blade_Type Travel_Controls Differential_Type Steering_Controls
0 1139246 66000.0 999089 3157 121 3.0 2004 68.0 Low 11/16/2006 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN Standard Conventional
1 1139248 57000.0 117657 77 121 3.0 1996 4640.0 Low 3/26/2004 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN Standard Conventional
2 1139249 10000.0 434808 7009 121 3.0 2001 2838.0 High 2/26/2004 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1139251 38500.0 1026470 332 121 3.0 2001 3486.0 High 5/19/2011 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1139253 11000.0 1057373 17311 121 3.0 2007 722.0 Medium 7/23/2009 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 53 columns

to_keep = ['SalePrice', 'MachineID', 'saledate', 'MachineHoursCurrentMeter', 'UsageBand']
df = df[to_keep]
df.head()
SalePrice MachineID saledate MachineHoursCurrentMeter UsageBand
0 66000.0 999089 11/16/2006 0:00 68.0 Low
1 57000.0 117657 3/26/2004 0:00 4640.0 Low
2 10000.0 434808 2/26/2004 0:00 2838.0 High
3 38500.0 1026470 5/19/2011 0:00 3486.0 High
4 11000.0 1057373 7/23/2009 0:00 722.0 Medium

mlp Functions

df_to_type

date_type = ['saledate']
continuous_type = ['SalePrice', 'MachineHoursCurrentMeter']
categorical_type = ['MachineID', 'UsageBand']
result = mlp.df_to_type(df, date_type, continuous_type, categorical_type)
result.head()
SalePrice MachineID saleWeek MachineHoursCurrentMeter UsageBand saleYear saleMonth saleDay saleDayofweek saleDayofyear saleIs_month_end saleIs_month_start saleIs_quarter_end saleIs_quarter_start saleIs_year_end saleIs_year_start saleElapsed
0 66000.0 999089 46 68.0 Low 2006 11 16 3 320 False False False False False False 1163635200
1 57000.0 117657 13 4640.0 Low 2004 3 26 4 86 False False False False False False 1080259200
2 10000.0 434808 9 2838.0 High 2004 2 26 3 57 False False False False False False 1077753600
3 38500.0 1026470 20 3486.0 High 2011 5 19 3 139 False False False False False False 1305763200
4 11000.0 1057373 30 722.0 Medium 2009 7 23 3 204 False False False False False False 1248307200

We automatically extracted some extra information from the date variable and transformed the categorical variables to the correct type.

result.dtypes
SalePrice                    float64
MachineID                   category
saleWeek                      UInt32
MachineHoursCurrentMeter     float64
UsageBand                   category
saleYear                       int64
saleMonth                      int64
saleDay                        int64
saleDayofweek                  int64
saleDayofyear                  int64
saleIs_month_end                bool
saleIs_month_start              bool
saleIs_quarter_end              bool
saleIs_quarter_start            bool
saleIs_year_end                 bool
saleIs_year_start               bool
saleElapsed                   object
dtype: object

Let's only keep the saleYear and saleMonth from our date variable.

to_keep = ['SalePrice', 'MachineID', 'MachineHoursCurrentMeter', 'UsageBand', 'saleYear', 'saleMonth']
continuous_type = ['SalePrice', 'MachineHoursCurrentMeter']
categorical_type = ['MachineID', 'UsageBand']
result = result[to_keep]
result.head()
SalePrice MachineID MachineHoursCurrentMeter UsageBand saleYear saleMonth
0 66000.0 999089 68.0 Low 2006 11
1 57000.0 117657 4640.0 Low 2004 3
2 10000.0 434808 2838.0 High 2004 2
3 38500.0 1026470 3486.0 High 2011 5
4 11000.0 1057373 722.0 Medium 2009 7

split_df

Now, let's split the data into train and test, first randomly, then by a variable, then by a condition.

X_train, X_test, y_train, y_test = mlp.split_df(result, dep_var='SalePrice', test_size=0.3, split_mode='random')
X_train.shape, X_test.shape
((288888, 5), (123810, 5))
X_train, X_test, y_train, y_test = mlp.split_df(result, dep_var='SalePrice', test_size=0.3, split_mode='on_split_id', split_var='MachineID')
X_train.shape, X_test.shape
((276523, 5), (136175, 5))
#every row that fulfills this condition will be in the trainset
cond = (result.saleYear<2009)
X_train, X_test, y_train, y_test = mlp.split_df(result, dep_var='SalePrice', test_size=0.3, split_mode='on_condition', cond=cond)
X_train.shape, X_test.shape
((288689, 5), (124009, 5))
X_train.head()
MachineID MachineHoursCurrentMeter UsageBand saleYear saleMonth
0 999089 68.0 Low 2006 11
1 117657 4640.0 Low 2004 3
2 434808 2838.0 High 2004 2
5 1001274 508.0 Low 2008 12
6 772701 11540.0 High 2004 8

cat_transform

X_train_, X_test_, dict_list, dict_inv_list = mlp.cat_transform(X_train, X_test, cat_type = categorical_type)
X_train_.head()
MachineID MachineHoursCurrentMeter UsageBand saleYear saleMonth
0 62273 68.0 2 2006 11
1 9581 4640.0 2 2004 3
2 28730 2838.0 1 2004 2
5 62837 508.0 2 2008 12
6 48637 11540.0 1 2004 8

We changed the defined categorical types to int and saved the corresponding dictionaries. Also, we added a special token for NaN values.

dict_list[1]
{0: '#NaN', 1: 'High', 2: 'Low', 3: 'Medium'}

cont_standardize

Let's standardize the data. If we want specific columns to not be standardized, we can put them into the cat_type argument. If we have an ID to later match the results to, put it into the id_type argument and it will not be standardized. If you don't want the dependend variable to be standardized, set transform_y to False (also realize that you will not get the scaler_y object as an output).

categorical_type = ['MachineID', 'UsageBand', 'saleYear', 'saleMonth']
X_train_2, X_test_2, y_train_2, y_test_2, scaler, scaler_y = mlp.cont_standardize(X_train_, X_test_, y_train, y_test, cat_type=categorical_type, transform_y=True, path='', standardizer='StandardScaler')
y_train[:5]
0    66000.0
1    57000.0
2    10000.0
5    26500.0
6    21000.0
Name: SalePrice, dtype: float64
y_train_2[:5]
array([[ 1.53656444],
       [ 1.14173911],
       [-0.92012647],
       [-0.19628004],
       [-0.43756218]])
X_train_2.head()
MachineID MachineHoursCurrentMeter UsageBand saleYear saleMonth
0 62273 -0.372846 2 2006 11
1 9581 0.893532 2 2004 3
2 28730 0.394404 1 2004 2
5 62837 -0.250972 2 2008 12
6 48637 2.804732 1 2004 8

saleYear and saleMonth didn't get standardized, also the categorical variables didn't get standardized.