How to make use of the ml-prepare package

import mlprepare as mlp
import pandas as pd
import numpy

Load Data

df = pd.read_csv('TrainAndValid.csv', low_memory=False)

df.head()

to_keep = ['SalePrice', 'MachineID', 'saledate', 'MachineHoursCurrentMeter', 'UsageBand']
df = df[to_keep]
df.head()

mlp Functions

df_to_type

date_type = ['saledate']
continuous_type = ['SalePrice', 'MachineHoursCurrentMeter']
categorical_type = ['MachineID', 'UsageBand']

result = mlp.df_to_type(df, date_type, continuous_type, categorical_type)

result.head()

We automatically extracted some extra information from the date variable and transformed the categorical variables to the correct type.

result.dtypes

SalePrice                    float64
MachineID                   category
saleWeek                      UInt32
MachineHoursCurrentMeter     float64
UsageBand                   category
saleYear                       int64
saleMonth                      int64
saleDay                        int64
saleDayofweek                  int64
saleDayofyear                  int64
saleIs_month_end                bool
saleIs_month_start              bool
saleIs_quarter_end              bool
saleIs_quarter_start            bool
saleIs_year_end                 bool
saleIs_year_start               bool
saleElapsed                   object
dtype: object

Let's only keep the saleYear and saleMonth from our date variable.

to_keep = ['SalePrice', 'MachineID', 'MachineHoursCurrentMeter', 'UsageBand', 'saleYear', 'saleMonth']
continuous_type = ['SalePrice', 'MachineHoursCurrentMeter']
categorical_type = ['MachineID', 'UsageBand']
result = result[to_keep]
result.head()

split_df

Now, let's split the data into train and test, first randomly, then by a variable, then by a condition.

X_train, X_test, y_train, y_test = mlp.split_df(result, dep_var='SalePrice', test_size=0.3, split_mode='random')

X_train.shape, X_test.shape

((288888, 5), (123810, 5))

X_train, X_test, y_train, y_test = mlp.split_df(result, dep_var='SalePrice', test_size=0.3, split_mode='on_split_id', split_var='MachineID')

X_train.shape, X_test.shape

((276523, 5), (136175, 5))

#every row that fulfills this condition will be in the trainset
cond = (result.saleYear<2009)

X_train, X_test, y_train, y_test = mlp.split_df(result, dep_var='SalePrice', test_size=0.3, split_mode='on_condition', cond=cond)

X_train.shape, X_test.shape

((288689, 5), (124009, 5))

X_train.head()

cat_transform

X_train_, X_test_, dict_list, dict_inv_list = mlp.cat_transform(X_train, X_test, cat_type = categorical_type)

X_train_.head()

We changed the defined categorical types to int and saved the corresponding dictionaries. Also, we added a special token for NaN values.

dict_list[1]

{0: '#NaN', 1: 'High', 2: 'Low', 3: 'Medium'}

cont_standardize

Let's standardize the data. If we want specific columns to not be standardized, we can put them into the cat_type argument. If we have an ID to later match the results to, put it into the id_type argument and it will not be standardized. If you don't want the dependend variable to be standardized, set transform_y to False (also realize that you will not get the scaler_y object as an output).

categorical_type = ['MachineID', 'UsageBand', 'saleYear', 'saleMonth']
X_train_2, X_test_2, y_train_2, y_test_2, scaler, scaler_y = mlp.cont_standardize(X_train_, X_test_, y_train, y_test, cat_type=categorical_type, transform_y=True, path='', standardizer='StandardScaler')

y_train[:5]

0    66000.0
1    57000.0
2    10000.0
5    26500.0
6    21000.0
Name: SalePrice, dtype: float64

y_train_2[:5]

array([[ 1.53656444],
       [ 1.14173911],
       [-0.92012647],
       [-0.19628004],
       [-0.43756218]])

X_train_2.head()

saleYear and saleMonth didn't get standardized, also the categorical variables didn't get standardized.

	SalesID	SalePrice	MachineID	ModelID	datasource	auctioneerID	YearMade	MachineHoursCurrentMeter	UsageBand	saledate	...	Undercarriage_Pad_Width	Stick_Length	Thumb	Pattern_Changer	Grouser_Type	Backhoe_Mounting	Blade_Type	Travel_Controls	Differential_Type	Steering_Controls
0	1139246	66000.0	999089	3157	121	3.0	2004	68.0	Low	11/16/2006 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Standard	Conventional
1	1139248	57000.0	117657	77	121	3.0	1996	4640.0	Low	3/26/2004 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Standard	Conventional
2	1139249	10000.0	434808	7009	121	3.0	2001	2838.0	High	2/26/2004 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	1139251	38500.0	1026470	332	121	3.0	2001	3486.0	High	5/19/2011 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	1139253	11000.0	1057373	17311	121	3.0	2007	722.0	Medium	7/23/2009 0:00	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

	SalePrice	MachineID	saledate	MachineHoursCurrentMeter	UsageBand
0	66000.0	999089	11/16/2006 0:00	68.0	Low
1	57000.0	117657	3/26/2004 0:00	4640.0	Low
2	10000.0	434808	2/26/2004 0:00	2838.0	High
3	38500.0	1026470	5/19/2011 0:00	3486.0	High
4	11000.0	1057373	7/23/2009 0:00	722.0	Medium

	MachineID	MachineHoursCurrentMeter	UsageBand	saleYear	saleMonth
0	999089	68.0	Low	2006	11
1	117657	4640.0	Low	2004	3
2	434808	2838.0	High	2004	2
5	1001274	508.0	Low	2008	12
6	772701	11540.0	High	2004	8

	MachineID	MachineHoursCurrentMeter	UsageBand	saleYear	saleMonth
0	62273	68.0	2	2006	11
1	9581	4640.0	2	2004	3
2	28730	2838.0	1	2004	2
5	62837	508.0	2	2008	12
6	48637	11540.0	1	2004	8

	MachineID	MachineHoursCurrentMeter	UsageBand	saleYear	saleMonth
0	62273	-0.372846	2	2006	11
1	9581	0.893532	2	2004	3
2	28730	0.394404	1	2004	2
5	62837	-0.250972	2	2008	12
6	48637	2.804732	1	2004	8