Lab 5B - Machine Learning in Practice¶

1. Load the Data¶

1. Load all required libraries.

import os
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

2. Set the working directory to "C:\Workshop\Data".

os.chdir("C:\Workshop\Data")

3. Read the Risk.csv file into a data frame called policies.

policies = pd.read_csv("Risk.csv")

2. Explore the Data¶

1. Inspect the policies data using the head function.

policies.head()

2. Summarize the columns in the data frame using the info function.

policies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1942 entries, 0 to 1941
Data columns (total 8 columns):
Gender        1942 non-null object
State         1942 non-null object
State_Rate    1942 non-null float64
Height        1942 non-null int64
Weight        1942 non-null float64
BMI           1942 non-null float64
Age           1942 non-null int64
Risk          1942 non-null object
dtypes: float64(3), int64(2), object(3)
memory usage: 121.5+ KB

3. Sumarize the data in the data frame using the describe function.

policies.describe(
    include = "all")

4. Create a correlation matrix using the corr function.

correlations = policies.corr()

5. Create a correlogram using the seaborn heatmap function.

sns.heatmap(
    data = correlations,
    cmap = sns.diverging_palette(
        h_neg = 10, 
        h_pos = 220, 
        as_cmap = True));

6. Inspect missing values with the isnull and sum functions.

policies.isnull().sum()

Gender        0
State         0
State_Rate    0
Height        0
Weight        0
BMI           0
Age           0
Risk          0
dtype: int64

3. Transform the Data¶

1. Assign the following features to a data frame named X: Gender, State Rate, Height, Weight, BMI, and Age.

X = policies[["Gender", "State_Rate", "Height", "Weight", "BMI", "Age"]]

2. Encode the categorical Gender variable {Female, Male} as an integer {0, 1}.

X.Gender.replace(("Female", "Male"), (0, 1), inplace = True)

C:\Users\Matthew\Anaconda3\lib\site-packages\pandas\core\generic.py:4619: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)

3. Inspect the transfomred data with the head function.

X.head()

4. Create a new series for the labels named y.

y = policies.Risk

5. Scale the feature data using the standard scaler.

scaler = StandardScaler()

scaler.fit(X)

X_scaled = scaler.transform(X)

4. Create the Training and Test Set¶

1. Set the random number seed to 42.

np.random.seed(42)

2. Create stratified training and test sets (80/20).

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    stratify = y,
    train_size = 0.80,
    test_size = 0.20)

5. Create KNN Classifier Models¶

1. Create a KNN model.

knn_model = KNeighborsClassifier()

2. Define the KNN hyperparameters to test (i.e. k = {2, 7, 9, 11, 13})

knn_params = [5, 7, 9, 11, 13]

knn_param_grid = {"n_neighbors" : knn_params }

3. Create 10 KNN models for each of the five hyper-parameters using 10-fold cross validation.

knn_models = GridSearchCV(
    estimator = knn_model, 
    param_grid = knn_param_grid,
    scoring = "accuracy",
    cv = 10,
    verbose = 1)

4. Train all 50 models using the training set.

knn_models.fit(
    X = X_train, 
    y = y_train)

Fitting 10 folds for each of 5 candidates, totalling 50 fits

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:    4.3s finished

GridSearchCV(cv=10, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [5, 7, 9, 11, 13]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

5. Get the average accuracy for each of the five hyperparameters.

knn_avg_scores = knn_models.cv_results_["mean_test_score"]

6. Display the average accuracy for each hyper-parameter.

for i in range(0, 5):
    print("{:>3} : {:0.3f}"
        .format(knn_params[i], knn_avg_scores[i]))

  5 : 0.965
  7 : 0.967
  9 : 0.968
 11 : 0.965
 13 : 0.968

7. Plot the change in accuracy over each hyper-parameter.

plt.plot(
    knn_params, 
    knn_avg_scores)
plt.xlabel("k (neighbors)")
plt.ylabel("Accuracy")
plt.show()

8. Get the hyper-parameter, average accuracy, and standard error of the top performing model.

knn_top_index = np.argmax(knn_avg_scores)
knn_top_param = knn_params[knn_top_index]
knn_top_score = knn_avg_scores[knn_top_index]
knn_top_error = knn_models.cv_results_["std_test_score"][knn_top_index]

9. Inspect the top performing model.

print("Top knn model is k = {:d} at {:0.2f} +/- {:0.3f} accuracy"
    .format(knn_top_param, knn_top_score, knn_top_error))

Top knn model is k = 9 at 0.97 +/- 0.015 accuracy

6. Create Decision Tree Classifier Models.¶

1. Create a decision tree model.

tree_model = DecisionTreeClassifier()

2. Define the hyper-parameters to test (i.e. max_depth = {3, 4, 5, 6, 7}).

tree_params = [3, 4, 5, 6, 7]

tree_param_grid = {"max_depth" : tree_params }

3. Create 10 tree models for each of the 5 hyper-parameters using 10-fold cross validation.

tree_models = GridSearchCV(
    estimator = tree_model, 
    param_grid = tree_param_grid,
    scoring = "accuracy",
    cv = 10,
    verbose = 1)

4. Train all 50 models using the training set.

tree_models.fit(
    X = X_train, 
    y = y_train)

Fitting 10 folds for each of 5 candidates, totalling 50 fits

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:    2.3s finished

GridSearchCV(cv=10, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [3, 4, 5, 6, 7]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring='accuracy',
       verbose=1)

5. Get the average accuracy for each hyper-parameter.

tree_avg_scores = tree_models.cv_results_["mean_test_score"]

6. Display the average accuracy for each hyper-parameter.

for i in range(0, 5):
    print("{:>3} : {:0.3f}"
        .format(tree_params[i], tree_avg_scores[i]))

  3 : 0.972
  4 : 0.980
  5 : 0.979
  6 : 0.977
  7 : 0.979

7. Plot the change in accuracy over each hyper-parameter.

plt.plot(
    tree_params, 
    tree_avg_scores)
plt.xlabel("Max Depth (nodes)")
plt.ylabel("Accuracy")
plt.show()

8. Get the hyper-parameter, average accuracy, and standard error for the top-performing model.

tree_top_index = np.argmax(tree_avg_scores)
tree_top_param = tree_params[tree_top_index]
tree_top_score = tree_avg_scores[tree_top_index]
tree_top_error = tree_models.cv_results_["std_test_score"][tree_top_index]

9. Inspect the top-performing model.

print("Top tree model is k = {:d} at {:0.2f} +/- {:0.3} accuracy"
    .format(tree_top_param, tree_top_score, tree_top_error))

Top tree model is k = 4 at 0.98 +/- 0.0197 accuracy

7. Create Neural Network Classifier Models.¶

1. Create a neural network model with tanh activation functions and 5000 max iterations.

neural_model = MLPClassifier(
    activation = "tanh",
    solver = "sgd",
    max_iter = 5000)

2. Define hyper-parameters to test (i.e. hidden_layer_sizes = {3, 4, 5, 6, 7}).

neural_params = [3, 4, 5, 6, 7]

neural_param_grid = {"hidden_layer_sizes" : neural_params }

3. Create 10 models for each of the 5 hyper-parameters using 10-fold cross validation.

neural_models = GridSearchCV(
    estimator = neural_model, 
    param_grid = neural_param_grid,
    scoring = "accuracy",
    cv = 10,
    verbose = 1)

4. Train all 50 models using the training set.
Note: This could take a few minutes.

neural_models.fit(
    X = X_train, 
    y = y_train)

Fitting 10 folds for each of 5 candidates, totalling 50 fits

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:  1.2min finished

GridSearchCV(cv=10, error_score='raise',
       estimator=MLPClassifier(activation='tanh', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=5000, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='sgd', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'hidden_layer_sizes': [3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

5. Get the average accuracy for each hyper-parameter.

neural_avg_scores = neural_models.cv_results_["mean_test_score"]

6. Display the average accuracy for each hyper-parameter.

for i in range(0, 5):
    print("{:>3} : {:0.3f}"
        .format(neural_params[i], neural_avg_scores[i]))

  3 : 0.797
  4 : 0.771
  5 : 0.818
  6 : 0.840
  7 : 0.787

7. Plot the change in accuracy over each hyper-parameter.

plt.plot(
    neural_params, 
    neural_avg_scores)
plt.xlabel("Hidden Layer Nodes")
plt.ylabel("Accuracy")
plt.show()

8. Get the hyper-parameter, average accuracy, and standard error for the top-performing model.

neural_top_index = np.argmax(neural_avg_scores)
neural_top_param = neural_params[neural_top_index]
neural_top_score = neural_avg_scores[neural_top_index]
neural_top_error = neural_models.cv_results_["std_test_score"][neural_top_index]

9. Inspect the statistics of the top-performing 10 models.

print("Top nnet model is k = {:d} at {:0.2f} +/- {:0.3f} accuracy"
    .format(neural_top_param, neural_top_score, neural_top_error))

Top nnet model is k = 6 at 0.84 +/- 0.113 accuracy

8. Evaluate the Models¶

1. Compare the top three performers numerically.

print("KNN:  {:0.2f} +/- {:0.3f} accuracy"
    .format(knn_top_score, knn_top_error))
print("Tree: {:0.2f} +/- {:0.3f} accuracy"
    .format(tree_top_score, tree_top_error))
print("NNet: {:0.2f} +/- {:0.3f} accuracy"
    .format(neural_top_score, neural_top_error))

KNN:  0.97 +/- 0.015 accuracy
Tree: 0.98 +/- 0.020 accuracy
NNet: 0.84 +/- 0.113 accuracy

2. Compare the top-three performing models visually.

plt.errorbar(
    x = [knn_top_score, tree_top_score, neural_top_score],
    y = ["KNN", "Tree", "NNet"],
    xerr = [knn_top_error, tree_top_error, neural_top_error],
    linestyle = "none",
    marker = "o")
plt.xlim(0, 1)

(0, 1)

3. Question: Which model would you choose based on this information?

9. Test the Final Model¶

1. Create a final model based on the top-performing algorithm and hyper-parameter.

final_model = DecisionTreeClassifier(
    max_depth = 3)

2. Train the final model using the entire training set.

final_model.fit(
    X = X_train, 
    y = y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

6. Predict the labels of the test set using the hold-out test set.

final_predictions = final_model.predict(X_test)

7. Get the final prediction accuracy.

final_score = accuracy_score(
    y_true = y_test, 
    y_pred = final_predictions)

8. Inspect the final prediction accuracy.

print(final_score)

0.9794344473007712

10. Deploy the Model¶

Question to be answered: Is Jack (from the Titanic) a high risk or low risk policy?

1. Create an input feature for Jack.

X_jack = pd.DataFrame(
    columns = ["Gender", "State_Rate", "Height", "Weight", "BMI", "Age"],
    data = [[1, 0.09080315, 183, 75, 22.4, 20]])

2. Predict the risk class of Jack.

final_model.predict(X_jack)[0]

'Low'

3. Predict the probablility that Jack belongs to the above risk class.

final_model.predict_proba(X_jack)[0][1]

1.0

4. Question: Would you offer life insurance to Jack?

	Gender	State	State_Rate	Height	Weight	BMI	Age	Risk
0	Male	MA	0.100434	184	67.8	20.025992	77	High
1	Male	VA	0.141723	163	89.4	33.648237	82	High
2	Male	NY	0.090803	170	81.2	28.096886	31	Low
3	Male	TN	0.119973	175	99.7	32.555102	39	Low
4	Male	FL	0.110345	184	72.1	21.296078	68	High

	Gender	State	State_Rate	Height	Weight	BMI	Age	Risk
count	1942	1942	1942.000000	1942.000000	1942.000000	1942.000000	1942.000000	1942
unique	2	51	NaN	NaN	NaN	NaN	NaN	2
top	Male	CA	NaN	NaN	NaN	NaN	NaN	Low
freq	986	191	NaN	NaN	NaN	NaN	NaN	1366
mean	NaN	NaN	0.138064	169.718847	81.155767	28.292804	50.841401	NaN
std	NaN	NaN	0.044180	9.571082	16.009041	5.808799	19.327130	NaN
min	NaN	NaN	0.001000	150.000000	44.100000	16.022174	18.000000	NaN
25%	NaN	NaN	0.110345	162.000000	68.600000	23.739705	34.000000	NaN
50%	NaN	NaN	0.127584	170.000000	81.300000	28.055706	51.000000	NaN
75%	NaN	NaN	0.144251	176.000000	93.800000	32.456822	68.000000	NaN
max	NaN	NaN	0.318100	190.000000	116.500000	46.796193	84.000000	NaN

	Gender	State_Rate	Height	Weight	BMI	Age
0	1	0.100434	184	67.8	20.025992	77
1	1	0.141723	163	89.4	33.648237	82
2	1	0.090803	170	81.2	28.096886	31
3	1	0.119973	175	99.7	32.555102	39
4	1	0.110345	184	72.1	21.296078	68