Lab 5A - Machine Learning in Practice

1. Load the Data

1. Load all required libraries.

In [1]:
import os
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

2. Set the working directory to "C:\Workshop\Data".

In [2]:
# Raw string so the backslashes are not read as escape sequences.
os.chdir(r"C:\Workshop\Data")

3. Read the Titanic CSV file into a data frame called titanic.

In [3]:
titanic = pd.read_csv("Titanic.csv")

2. Explore the Data

1. Inspect the data using the head function.

In [4]:
titanic.head()
Out[4]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 NaN St Louis, MO
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 NaN Montreal, PQ / Chesterville, ON
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S NaN 135.0 Montreal, PQ / Chesterville, ON
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON

2. Summarize the columns in the data frame using the info function.

In [5]:
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
pclass       1309 non-null int64
survived     1309 non-null int64
name         1309 non-null object
sex          1309 non-null object
age          1046 non-null float64
sibsp        1309 non-null int64
parch        1309 non-null int64
ticket       1309 non-null object
fare         1308 non-null float64
cabin        295 non-null object
embarked     1307 non-null object
boat         486 non-null object
body         121 non-null float64
home.dest    745 non-null object
dtypes: float64(3), int64(4), object(7)
memory usage: 143.2+ KB

3. Summarize the data in the data frame using the describe function.

In [6]:
titanic.describe(
    include = "all")
Out[6]:
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body home.dest
count 1309.000000 1309.000000 1309 1309 1046.000000 1309.000000 1309.000000 1309 1308.000000 295 1307 486 121.000000 745
unique NaN NaN 1307 2 NaN NaN NaN 929 NaN 186 3 27 NaN 369
top NaN NaN Connolly, Miss. Kate male NaN NaN NaN CA. 2343 NaN C23 C25 C27 S 13 NaN New York, NY
freq NaN NaN 2 843 NaN NaN NaN 11 NaN 6 914 39 NaN 64
mean 2.294882 0.381971 NaN NaN 29.881135 0.498854 0.385027 NaN 33.295479 NaN NaN NaN 160.809917 NaN
std 0.837836 0.486055 NaN NaN 14.413500 1.041658 0.865560 NaN 51.758668 NaN NaN NaN 97.696922 NaN
min 1.000000 0.000000 NaN NaN 0.166700 0.000000 0.000000 NaN 0.000000 NaN NaN NaN 1.000000 NaN
25% 2.000000 0.000000 NaN NaN 21.000000 0.000000 0.000000 NaN 7.895800 NaN NaN NaN 72.000000 NaN
50% 3.000000 0.000000 NaN NaN 28.000000 0.000000 0.000000 NaN 14.454200 NaN NaN NaN 155.000000 NaN
75% 3.000000 1.000000 NaN NaN 39.000000 1.000000 0.000000 NaN 31.275000 NaN NaN NaN 256.000000 NaN
max 3.000000 1.000000 NaN NaN 80.000000 8.000000 9.000000 NaN 512.329200 NaN NaN NaN 328.000000 NaN

4. Create a correlation matrix using the corr function.

In [7]:
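# Correlate the numeric columns only; text columns such as name cannot be correlated.
correlations = titanic.select_dtypes(include = "number").corr()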

5. Create a correlogram using the seaborn heatmap function.

In [8]:
sns.heatmap(
    data = correlations,
    cmap = sns.diverging_palette(
        h_neg = 10, 
        h_pos = 220, 
        as_cmap = True));

6. Inspect missing values with the isnull and sum functions.

In [9]:
titanic.isnull().sum()
Out[9]:
pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

3. Transform the Data

Note: It may be helpful to inspect the result of each transformation using the head function.

1. Assign the raw data to a temporary data frame.

In [10]:
# Copy the frame so the transformations below leave the raw data untouched.
temp = titanic.copy()

2. Encode the categorical variable sex as one-hot dummy variables for female and male.

In [11]:
dummies = pd.get_dummies(temp.sex)
temp = pd.concat([temp, dummies], axis = 1)
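The encoding step can be seen in isolation on a three-row toy frame. This is a minimal sketch: the frame toy and the name toy_dummies are made up for illustration, and dtype = int is passed only so the columns print as 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical toy data standing in for the sex column.
toy = pd.DataFrame({"sex": ["female", "male", "male"]})

# One indicator column per category, in sorted category order.
toy_dummies = pd.get_dummies(toy.sex, dtype = int)

# Attach the indicator columns next to the original column.
toy = pd.concat([toy, toy_dummies], axis = 1)
print(toy)
```

Each row has exactly one indicator set, so the female and male columns always sum to 1.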

3. Impute missing values for age using the mean age.

In [12]:
meanAge = temp.age.mean()

temp.age = temp.age.fillna(meanAge)
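A small worked example of mean imputation, assuming a made-up series of three ages (the names ages, mean_age, and filled are illustrative only). The mean function skips missing values by default, so the mean is taken over the known ages.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value.
ages = pd.Series([20.0, np.nan, 40.0])

# mean() ignores NaN by default: (20 + 40) / 2 = 30.
mean_age = ages.mean()

# Replace every missing age with that mean.
filled = ages.fillna(mean_age)
print(filled.tolist())
```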

4. Engineer a new feature named family as the total siblings, spouses, parents, and children.

In [13]:
temp["family"] = temp.sibsp + temp.parch

5. Encode the integer survived variable as a categorical variable with levels "Yes" and "No".

In [14]:
temp["survived"] = temp.survived.map({1: "Yes", 0: "No"})

6. Select only the following features: pclass, male, female, age, family, and survived.

In [15]:
temp = temp.loc[:, ["pclass", "male", "female", "age", "family", "survived"]]

7. Rename the selected columns: Class, Male, Female, Age, Family, and Survived.

In [16]:
temp.columns = ["Class", "Male", "Female", "Age", "Family", "Survived"]

8. Inspect the transformed data with the head function.

In [17]:
temp.head()
Out[17]:
   Class  Male  Female      Age  Family Survived
0      1     0       1  29.0000       0      Yes
1      1     1       0   0.9167       3      Yes
2      1     0       1   2.0000       3       No
3      1     1       0  30.0000       3       No
4      1     0       1  25.0000       3       No

9. Create a data frame of the features named X.

In [18]:
X = temp.iloc[:, 0:5]

10. Create a new series for the labels named y.

In [19]:
y = temp.Survived

11. Scale the feature data using the standard scaler.

In [20]:
scaler = StandardScaler()

scaler.fit(X)

X_scaled = scaler.transform(X)
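One common way to make sure scaled features are what a model actually sees is to bundle the scaler and the classifier in a Pipeline, so the scaler is fit only on the rows the model trains on. The sketch below uses synthetic data from make_classification as a stand-in for the Titanic features; X_demo, y_demo, pipe, and demo_accuracy are illustrative names, not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 5 features, 2 classes.
X_demo, y_demo = make_classification(
    n_samples = 200,
    n_features = 5,
    random_state = 42)

# Exaggerate the scale of one feature so that scaling matters for KNN.
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    stratify = y_demo,
    test_size = 0.20,
    random_state = 42)

# The scaler is fit on the training rows only, then applied to any data
# passed to predict or score.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors = 5))])

pipe.fit(X_tr, y_tr)
demo_accuracy = pipe.score(X_te, y_te)
print("{:0.3f}".format(demo_accuracy))
```

A pipeline like this can also be passed to GridSearchCV as the estimator, in which case the grid keys take the form "knn__n_neighbors".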

4. Create the Training and Test Set

1. Set the random number seed to 42.

In [21]:
np.random.seed(42)

2. Create stratified training and test sets (80/20).

In [22]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    stratify = y,
    train_size = 0.80,
    test_size = 0.20)
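The effect of stratify can be checked by comparing the share of each label in the two sets. A minimal sketch on made-up labels (60 "No" and 40 "Yes"); the names features, labels, train_share, and test_share are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows with a 60/40 label split.
features = pd.DataFrame({"x": range(100)})
labels = pd.Series(["No"] * 60 + ["Yes"] * 40)

f_train, f_test, l_train, l_test = train_test_split(
    features,
    labels,
    stratify = labels,
    train_size = 0.80,
    test_size = 0.20,
    random_state = 42)

# With stratification both sets keep roughly the original 40% "Yes" share.
train_share = (l_train == "Yes").mean()
test_share = (l_test == "Yes").mean()
print(train_share, test_share)
```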

5. Create KNN Classifier Models

1. Create a KNN model.

In [23]:
knn_model = KNeighborsClassifier()

2. Define the KNN hyper-parameters to test (i.e. k = {5, 7, 9, 11, 13}).

In [24]:
knn_params = [5, 7, 9, 11, 13]

knn_param_grid = {"n_neighbors" : knn_params }

3. Create 10 KNN models for each of the five hyper-parameters using 10-fold cross validation.

In [25]:
knn_models = GridSearchCV(
    estimator = knn_model, 
    param_grid = knn_param_grid,
    scoring = "accuracy",
    cv = 10,
    verbose = 1)

4. Train all 50 models using the training set.

In [26]:
knn_models.fit(
    X = X_train, 
    y = y_train)
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:    3.0s finished
Out[26]:
GridSearchCV(cv=10, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [5, 7, 9, 11, 13]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

5. Get the average accuracy for each of the five hyper-parameters.

In [27]:
knn_avg_scores = knn_models.cv_results_["mean_test_score"]

6. Display the average accuracy for each hyper-parameter.

In [28]:
for i in range(0, 5):
    print("{:>3} : {:0.3f}"
        .format(knn_params[i], knn_avg_scores[i]))
  5 : 0.756
  7 : 0.771
  9 : 0.780
 11 : 0.772
 13 : 0.761

7. Plot the change in accuracy over each hyper-parameter.

In [29]:
plt.plot(
    knn_params, 
    knn_avg_scores)
plt.xlabel("k (neighbors)")
plt.ylabel("Accuracy")
plt.show()

8. Get the hyper-parameter, average accuracy, and standard deviation of the fold accuracies for the top-performing model.

In [30]:
knn_top_index = np.argmax(knn_avg_scores)
knn_top_param = knn_params[knn_top_index]
knn_top_score = knn_avg_scores[knn_top_index]
knn_top_error = knn_models.cv_results_["std_test_score"][knn_top_index]
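The fitted search object also records the winning candidate directly, and the std_test_score entry is the standard deviation of the fold scores, which can be divided by the square root of the number of folds to get a rough standard error of the mean. The sketch below runs on synthetic data from make_classification as a stand-in for the lab's training set; search, top_std, and top_sem are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data.
X_demo, y_demo = make_classification(
    n_samples = 200,
    n_features = 5,
    random_state = 42)

n_folds = 10

search = GridSearchCV(
    estimator = KNeighborsClassifier(),
    param_grid = {"n_neighbors" : [5, 7, 9]},
    scoring = "accuracy",
    cv = n_folds)

search.fit(X_demo, y_demo)

# Attributes describing the best candidate found by the search.
top_index = search.best_index_
top_param = search.best_params_["n_neighbors"]
top_score = search.best_score_

# Standard deviation across folds, and a rough standard error of the mean.
top_std = search.cv_results_["std_test_score"][top_index]
top_sem = top_std / np.sqrt(n_folds)

print("k = {:d} at {:0.3f} (sd {:0.3f}, se {:0.3f})"
    .format(top_param, top_score, top_std, top_sem))
```

With the default refit = True, search.best_estimator_ is the winning model retrained on all of the data passed to fit.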

9. Inspect the top performing model.

In [31]:
print("Top knn model is k = {:d} at {:0.2f} +/- {:0.3f} accuracy"
    .format(knn_top_param, knn_top_score, knn_top_error))
Top knn model is k = 9 at 0.78 +/- 0.030 accuracy

6. Create Decision Tree Classifier Models

1. Create a decision tree model.

In [32]:
tree_model = DecisionTreeClassifier()

2. Define the hyper-parameters to test (i.e. max_depth = {3, 4, 5, 6, 7}).

In [33]:
tree_params = [3, 4, 5, 6, 7]

tree_param_grid = {"max_depth" : tree_params }

3. Create 10 tree models for each of the 5 hyper-parameters using 10-fold cross validation.

In [34]:
tree_models = GridSearchCV(
    estimator = tree_model, 
    param_grid = tree_param_grid,
    scoring = "accuracy",
    cv = 10,
    verbose = 1)

4. Train all 50 models using the training set.

In [35]:
tree_models.fit(
    X = X_train, 
    y = y_train)
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:    1.3s finished
Out[35]:
GridSearchCV(cv=10, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [3, 4, 5, 6, 7]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring='accuracy',
       verbose=1)

5. Get the average accuracy for each hyper-parameter.

In [36]:
tree_avg_scores = tree_models.cv_results_["mean_test_score"]

6. Display the average accuracy for each hyper-parameter.

In [37]:
for i in range(0, 5):
    print("{:>3} : {:0.3f}"
        .format(tree_params[i], tree_avg_scores[i]))
  3 : 0.810
  4 : 0.810
  5 : 0.797
  6 : 0.784
  7 : 0.778

7. Plot the change in accuracy over each hyper-parameter.

In [38]:
plt.plot(
    tree_params, 
    tree_avg_scores)
plt.xlabel("Max Depth (levels)")
plt.ylabel("Accuracy")
plt.show()

8. Get the hyper-parameter, average accuracy, and standard deviation of the fold accuracies for the top-performing model.

In [39]:
tree_top_index = np.argmax(tree_avg_scores)
tree_top_param = tree_params[tree_top_index]
tree_top_score = tree_avg_scores[tree_top_index]
tree_top_error = tree_models.cv_results_["std_test_score"][tree_top_index]

9. Inspect the top-performing model.

In [40]:
print("Top tree model is max_depth = {:d} at {:0.2f} +/- {:0.3f} accuracy"
    .format(tree_top_param, tree_top_score, tree_top_error))
Top tree model is max_depth = 3 at 0.81 +/- 0.030 accuracy

7. Create Neural Network Classifier Models

1. Create a neural network model with tanh activation functions and 5000 max iterations.

In [41]:
neural_model = MLPClassifier(
    activation = "tanh",
    max_iter = 5000)

2. Define hyper-parameters to test (i.e. hidden_layer_sizes = {3, 4, 5, 6, 7}).

In [42]:
neural_params = [3, 4, 5, 6, 7]

neural_param_grid = {"hidden_layer_sizes" : neural_params }
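In this grid each integer describes a single hidden layer with that many nodes. Networks with more than one hidden layer are described with tuples, one entry per layer. The sketch below only enumerates the candidates with ParameterGrid and trains nothing; demo_grid and its values are made up for illustration.

```python
from sklearn.model_selection import ParameterGrid

# An integer means one hidden layer; a tuple lists the size of each hidden layer.
# 3 and (3,) describe the same architecture; (5, 3) has two hidden layers.
demo_grid = {"hidden_layer_sizes" : [3, (3,), (5, 3)]}

# ParameterGrid expands the grid into the settings GridSearchCV would try.
candidates = list(ParameterGrid(demo_grid))

for candidate in candidates:
    print(candidate)
```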

3. Create 10 models for each of the 5 hyper-parameters using 10-fold cross validation.

In [43]:
neural_models = GridSearchCV(
    estimator = neural_model, 
    param_grid = neural_param_grid,
    scoring = "accuracy",
    cv = 10,
    verbose = 1)

4. Train all 50 models using the training set.
Note: This could take a few minutes.

In [44]:
neural_models.fit(
    X = X_train, 
    y = y_train)
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:  2.0min finished
Out[44]:
GridSearchCV(cv=10, error_score='raise',
       estimator=MLPClassifier(activation='tanh', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=5000, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'hidden_layer_sizes': [3, 4, 5, 6, 7]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

5. Get the average accuracy for each hyper-parameter.

In [45]:
neural_avg_scores = neural_models.cv_results_["mean_test_score"]

6. Display the average accuracy for each hyper-parameter.

In [46]:
for i in range(0, 5):
    print("{:>3} : {:0.3f}"
        .format(neural_params[i], neural_avg_scores[i]))
  3 : 0.721
  4 : 0.709
  5 : 0.772
  6 : 0.736
  7 : 0.795

7. Plot the change in accuracy over each hyper-parameter.

In [47]:
plt.plot(
    neural_params, 
    neural_avg_scores)
plt.xlabel("Hidden Layer Nodes")
plt.ylabel("Accuracy")
plt.show()

8. Get the hyper-parameter, average accuracy, and standard deviation of the fold accuracies for the top-performing model.

In [48]:
neural_top_index = np.argmax(neural_avg_scores)
neural_top_param = neural_params[neural_top_index]
neural_top_score = neural_avg_scores[neural_top_index]
neural_top_error = neural_models.cv_results_["std_test_score"][neural_top_index]

9. Inspect the top-performing model.

In [49]:
print("Top nnet model is hidden_layer_sizes = {:d} at {:0.2f} +/- {:0.3f} accuracy"
    .format(neural_top_param, neural_top_score, neural_top_error))
Top nnet model is hidden_layer_sizes = 7 at 0.79 +/- 0.027 accuracy

8. Evaluate the Models

1. Compare the top three performers numerically.

In [50]:
print("KNN:  {:0.2f} +/- {:0.3f} accuracy"
    .format(knn_top_score, knn_top_error))
print("Tree: {:0.2f} +/- {:0.3f} accuracy"
    .format(tree_top_score, tree_top_error))
print("NNet: {:0.2f} +/- {:0.3f} accuracy"
    .format(neural_top_score, neural_top_error))
KNN:  0.78 +/- 0.030 accuracy
Tree: 0.81 +/- 0.030 accuracy
NNet: 0.79 +/- 0.027 accuracy

2. Compare the three top-performing models visually.

In [51]:
plt.errorbar(
    x = [knn_top_score, tree_top_score, neural_top_score],
    y = ["KNN", "Tree", "NNet"],
    xerr = [knn_top_error, tree_top_error, neural_top_error],
    linestyle = "none",
    marker = "o")
plt.xlim(0, 1)
Out[51]:
(0, 1)

3. Question: Which model would you choose based on this information?

9. Test the Final Model

1. Create a final model based on the top-performing algorithm and hyper-parameter.

In [52]:
final_model = DecisionTreeClassifier(
    max_depth = 3)

2. Train the final model using the entire training set.

In [53]:
final_model.fit(
    X = X_train, 
    y = y_train)
Out[53]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
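A shallow tree is small enough to read as a set of rules. The sketch below prints the rules of a depth-3 tree with export_text, assuming synthetic data from make_classification and placeholder feature names f0 to f4; none of these names come from the lab.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with placeholder feature names.
X_demo, y_demo = make_classification(
    n_samples = 200,
    n_features = 5,
    random_state = 42)
demo_names = ["f0", "f1", "f2", "f3", "f4"]

demo_tree = DecisionTreeClassifier(
    max_depth = 3,
    random_state = 42)
demo_tree.fit(X_demo, y_demo)

# One line per split or leaf, indented by depth.
rules = export_text(
    demo_tree,
    feature_names = demo_names)
print(rules)
```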

3. Predict the labels of the hold-out test set using the final model.

In [54]:
final_predictions = final_model.predict(X_test)

4. Get the final prediction accuracy.

In [55]:
final_score = accuracy_score(
    y_true = y_test,
    y_pred = final_predictions)

5. Inspect the final prediction accuracy.

In [56]:
print(final_score)
0.8320610687022901
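Accuracy is a single number; a confusion matrix shows which kind of mistakes make up the remainder. A minimal sketch on six made-up labels (truth and predicted are hypothetical, not taken from the Titanic test set):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels.
truth     = ["No", "No",  "No", "Yes", "Yes", "No"]
predicted = ["No", "Yes", "No", "Yes", "No",  "No"]

# Rows are true labels, columns are predicted labels, both ordered No, Yes.
matrix = confusion_matrix(
    truth,
    predicted,
    labels = ["No", "Yes"])

accuracy = accuracy_score(
    y_true = truth,
    y_pred = predicted)

print(matrix)
print(accuracy)
```

The diagonal holds the correct predictions, so accuracy equals the sum of the diagonal divided by the total count.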

10. Deploy the Model

Question to be answered: How likely is it that Jack will survive the Titanic?

1. Create an input feature data frame for Jack.

In [57]:
X_jack = pd.DataFrame(
    columns = ["Class", "Male", "Female", "Age", "Family"],
    data = [[3, 1, 0, 20, 0]])

2. Predict if Jack survives.

In [58]:
final_model.predict(X_jack)[0]
Out[58]:
'No'

3. Estimate the likelihood that Jack survives.

In [59]:
final_model.predict_proba(X_jack)[0][1]
Out[59]:
0.13253012048192772
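The columns returned by predict_proba follow the order of the model's classes_ attribute, so the column for "Yes" can be looked up by name rather than assumed to be index 1. A sketch on a tiny made-up data set (demo_X, demo_y, and passenger are hypothetical):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical one-feature training data.
demo_X = pd.DataFrame({"Age": [5, 8, 30, 40, 50, 60]})
demo_y = pd.Series(["Yes", "Yes", "No", "No", "No", "Yes"])

demo_model = DecisionTreeClassifier(
    max_depth = 1,
    random_state = 42)
demo_model.fit(demo_X, demo_y)

# classes_ gives the label behind each predict_proba column.
yes_column = list(demo_model.classes_).index("Yes")

passenger = pd.DataFrame({"Age": [35]})
p_yes = demo_model.predict_proba(passenger)[0][yes_column]
print(demo_model.classes_, p_yes)
```

For a decision tree this probability is the share of "Yes" training rows in the leaf the passenger falls into.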

4. Question: Would you take that ticket?