MLP Tutorial
import mlprepare as mlp
import pandas as pd
import numpy
df = pd.read_csv('TrainAndValid.csv', low_memory=False)
df.head()
to_keep = ['SalePrice', 'MachineID', 'saledate', 'MachineHoursCurrentMeter', 'UsageBand']
df = df[to_keep]
df.head()
date_type = ['saledate']
continuous_type = ['SalePrice', 'MachineHoursCurrentMeter']
categorical_type = ['MachineID', 'UsageBand']
result = mlp.df_to_type(df, date_type, continuous_type, categorical_type)
result.head()
We automatically extracted some extra information from the date variable and transformed the categorical variables to the correct type.
result.dtypes
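For intuition, the kind of transformation described above can be sketched in plain pandas. This is not mlprepare's implementation, only a rough equivalent on a made-up toy frame. The column names saleYear and saleMonth follow the tutorial; everything else here is an assumption for illustration.

```python
import pandas as pd

# Toy frame; column names mirror the tutorial's, the values are made up.
toy = pd.DataFrame({
    "saledate": ["2006-11-16", "2004-03-26", "2011-05-19"],
    "UsageBand": ["Low", None, "High"],
})

# Parse the date column, then derive year and month parts from it.
toy["saledate"] = pd.to_datetime(toy["saledate"])
toy["saleYear"] = toy["saledate"].dt.year
toy["saleMonth"] = toy["saledate"].dt.month

# Convert the categorical column to pandas' category dtype.
toy["UsageBand"] = toy["UsageBand"].astype("category")

print(toy.dtypes)
```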
Let's keep only saleYear and saleMonth from the features extracted from our date variable.
to_keep = ['SalePrice', 'MachineID', 'MachineHoursCurrentMeter', 'UsageBand', 'saleYear', 'saleMonth']
continuous_type = ['SalePrice', 'MachineHoursCurrentMeter']
categorical_type = ['MachineID', 'UsageBand']
result = result[to_keep]
result.head()
Now, let's split the data into train and test, first randomly, then by a variable, then by a condition.
X_train, X_test, y_train, y_test = mlp.split_df(result, dep_var='SalePrice', test_size=0.3, split_mode='random')
X_train.shape, X_test.shape
X_train, X_test, y_train, y_test = mlp.split_df(result, dep_var='SalePrice', test_size=0.3, split_mode='on_split_id', split_var='MachineID')
X_train.shape, X_test.shape
# Every row that satisfies this condition goes into the training set
cond = (result.saleYear<2009)
X_train, X_test, y_train, y_test = mlp.split_df(result, dep_var='SalePrice', test_size=0.3, split_mode='on_condition', cond=cond)
X_train.shape, X_test.shape
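The three split strategies can be illustrated with plain pandas and NumPy on a toy frame. This is a minimal sketch, not what mlp.split_df does internally. In particular, it assumes that splitting on an ID means holding out whole IDs so that no MachineID appears in both sets; the toy data and the helper variable names are made up.

```python
import numpy as np
import pandas as pd

# Toy data; values are made up for illustration.
toy = pd.DataFrame({
    "MachineID": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "saleYear": [2005, 2010, 2006, 2011, 2007, 2008, 2009, 2004, 2010, 2003],
    "SalePrice": [10.0, 12.0, 20.0, 22.0, 30.0, 31.0, 40.0, 41.0, 50.0, 52.0],
})
rng = np.random.default_rng(0)

# 1) Random split: shuffle row positions and hold out 30% of them.
shuffled = rng.permutation(len(toy))
n_test = int(round(0.3 * len(toy)))
random_test = toy.iloc[shuffled[:n_test]]
random_train = toy.iloc[shuffled[n_test:]]

# 2) Split on an ID (assumed semantics): hold out whole machines,
#    so that no MachineID ends up in both sets.
ids = toy["MachineID"].unique()
n_test_ids = max(1, int(round(0.3 * len(ids))))
test_ids = rng.choice(ids, size=n_test_ids, replace=False)
in_test = toy["MachineID"].isin(test_ids)
id_train, id_test = toy[~in_test], toy[in_test]

# 3) Split on a condition: rows where the mask is True form the training set.
cond = toy["saleYear"] < 2009
cond_train, cond_test = toy[cond], toy[~cond]
```

Splitting on an ID or on a date condition is usually preferred over a random split when rows from the same machine, or from the future, would otherwise leak into the training data.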
X_train.head()
X_train_, X_test_, dict_list, dict_inv_list = mlp.cat_transform(X_train, X_test, cat_type = categorical_type)
X_train_.head()
The columns listed in categorical_type are now encoded as integers, and the corresponding dictionaries were saved. A special token was also added for NaN values.
dict_list[1]
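The idea of integer-encoding a categorical column with a dedicated token for missing values can be sketched as follows. This is an illustration only: the token string "#NaN#" and the variable names are hypothetical, and mlprepare's actual token and dictionary layout may differ.

```python
import pandas as pd

train = pd.Series(["Low", "High", None, "Medium", "Low"])
test = pd.Series(["High", None, "Low"])

NAN_TOKEN = "#NaN#"  # hypothetical placeholder; the library's real token may differ

# Build the mapping from the training data only, after filling missing values.
train_filled = train.fillna(NAN_TOKEN)
categories = sorted(train_filled.unique())
cat_to_int = {cat: i for i, cat in enumerate(categories)}

# The inverse dictionary lets us map codes back to the original labels.
int_to_cat = {i: cat for cat, i in cat_to_int.items()}

# Apply the same mapping to both sets.
train_codes = train_filled.map(cat_to_int)
test_codes = test.fillna(NAN_TOKEN).map(cat_to_int)
```

In this sketch, a category that appears only in the test set would map to NaN, so a real pipeline needs a policy for unseen values.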
Let's standardize the data. To exclude specific columns from standardization, pass them in the cat_type argument. If there is an ID column that you later need to match the results to, pass it in the id_type argument and it will not be standardized. If you don't want the dependent variable to be standardized, set transform_y to False; note that in that case the scaler_y object is not returned.
categorical_type = ['MachineID', 'UsageBand', 'saleYear', 'saleMonth']
X_train_2, X_test_2, y_train_2, y_test_2, scaler, scaler_y = mlp.cont_standardize(X_train_, X_test_, y_train, y_test, cat_type=categorical_type, transform_y=True, path='', standardizer='StandardScaler')
y_train[:5]
y_train_2[:5]
X_train_2.head()
saleYear and saleMonth were not standardized, and neither were the categorical variables.
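What standardization does to the continuous columns can be sketched by hand in pandas. This is not mlprepare's code. It assumes the 'StandardScaler' option corresponds to scikit-learn's StandardScaler, which subtracts the training mean and divides by the population standard deviation (ddof=0). The toy values and the skip list are made up.

```python
import pandas as pd

X_train = pd.DataFrame({
    "MachineHoursCurrentMeter": [0.0, 10.0, 20.0],
    "saleYear": [2005, 2006, 2007],
})
X_test = pd.DataFrame({
    "MachineHoursCurrentMeter": [30.0],
    "saleYear": [2010],
})
skip = ["saleYear"]  # columns to leave untouched, like cat_type in the tutorial

cont_cols = [c for c in X_train.columns if c not in skip]

# Fit the statistics on the training set only, then reuse them for the test set.
mean = X_train[cont_cols].mean()
std = X_train[cont_cols].std(ddof=0)

X_train_std = X_train.copy()
X_test_std = X_test.copy()
X_train_std[cont_cols] = (X_train[cont_cols] - mean) / std
X_test_std[cont_cols] = (X_test[cont_cols] - mean) / std

# The target can be scaled the same way; keeping its statistics
# makes it possible to map predictions back to the original scale.
y_train = pd.Series([100.0, 200.0, 300.0])
y_mean, y_std = y_train.mean(), y_train.std(ddof=0)
y_train_std = (y_train - y_mean) / y_std
y_back = y_train_std * y_std + y_mean
```

Fitting on the training set only keeps information from the test set out of the scaling step.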