1. Import the OS library.
import os
2. Set the working directory to "C:\Workshop\Data".
os.chdir(r"C:\Workshop\Data")
3. Import the pandas library as "pd".
import pandas as pd
4. Read the Risk.csv file into a data frame named policies.
policies = pd.read_csv("Risk.csv")
1. Inspect the policies data set with the head function.
policies.head()
2. Import the matplotlib.pyplot library as "plt".
import matplotlib.pyplot as plt
3. Create a color palette containing two colors for Low and High risk.
palette = {
'Low':'#fb8072',
'High':'#80b1d3'}
4. Map the colors to each risk category.
colors = policies.Risk.apply(lambda x:palette[x])
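The apply call above can equivalently be written with Series.map, which takes the dictionary directly. A minimal sketch on a stand-in Series (the values here are illustrative, not the workshop data):

```python
import pandas as pd

# Illustrative stand-in for the policies.Risk column
risk = pd.Series(["Low", "High", "Low"])
palette = {"Low": "#fb8072", "High": "#80b1d3"}

via_apply = risk.apply(lambda x: palette[x])  # dictionary lookup per element
via_map = risk.map(palette)                   # same result via Series.map
```

One difference worth knowing: map returns NaN for a value missing from the dictionary, while the apply version raises a KeyError, which can be the more useful failure mode when an unexpected category appears.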
5. Create a scatterplot matrix of the policies data set colored by risk.
Note: The trailing semicolon suppresses the textual return value, so only the plot is displayed.
pd.plotting.scatter_matrix(
frame = policies,
color = colors,
alpha = 1,
s = 100,
diagonal = "none");
6. Create a scatterplot of BMI (on the y-axis) vs Age (on the x-axis) colored by risk.
plt.scatter(
x = policies.Age,
y = policies.BMI,
color = colors)
plt.xlabel("Age")
plt.ylabel("BMI")
plt.show()
1. Create a data frame named X containing the features (i.e., Age, BMI, Gender, State_Rate).
X = policies.loc[:, ["Age", "BMI", "Gender", "State_Rate"]]
2. Inspect the features data frame X using the head function.
X.head()
3. Encode the categorical Gender variable {Female, Male} as integers {0, 1}.
X.Gender = X.Gender.apply(lambda x: 0 if x == "Female" else 1)
4. Inspect the new Gender encoding using the head function.
X.Gender.head()
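If Gender can ever contain a value other than the two expected categories, the if/else lambda above silently codes it as 1. A hedged alternative using Series.map, shown on a small synthetic frame (the values are illustrative):

```python
import pandas as pd

# Synthetic stand-in for the Gender column
df = pd.DataFrame({"Gender": ["Female", "Male", "Male", "Female"]})

# map() encodes only the listed categories; anything else becomes NaN,
# which surfaces bad data instead of silently coding it as 1
df["Gender"] = df["Gender"].map({"Female": 0, "Male": 1})
```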
5. Create a series named y containing the Risk labels.
y = policies.Risk
6. Inspect the series of labels y using the head function.
y.head()
1. Import the numpy library as "np".
import numpy as np
2. Set the random number seed to 42.
np.random.seed(42)
3. Import the train_test_split function from sklearn.
from sklearn.model_selection import train_test_split
4. Randomly sample 80% of the rows for the training set and 20% for the test set.
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
train_size = 0.8,
test_size = 0.2)
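The split above is reproducible only because of the global numpy seed set earlier; train_test_split also accepts a random_state argument that pins this one call regardless of global state. A sketch on ten synthetic rows:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten synthetic rows standing in for the policies features and labels
X_demo = pd.DataFrame({"Age": range(10), "BMI": range(10)})
y_demo = pd.Series(["Low", "High"] * 5)

# random_state fixes the shuffle for this call, independent of np.random.seed
Xa, Xb, ya, yb = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=42)
```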
5. Inspect the shape of the training and test sets using their shape property.
print("X_train: ", X_train.shape)
print("y_train: ", y_train.shape)
print("X_test: ", X_test.shape)
print("y_test: ", y_test.shape)
6. Question: How do you interpret these shapes in terms of columns and rows?
1. Import the KNN classifier class from sklearn.
from sklearn.neighbors import KNeighborsClassifier
2. Create a KNN model with k = 3.
knn_model = KNeighborsClassifier(
n_neighbors = 3)
3. Train the model using the training data.
knn_model.fit(
X = X_train,
y = y_train)
4. Predict the labels of the test set using the model.
knn_predictions = knn_model.predict(X_test)
5. Create a confusion matrix for the predictions.
pd.crosstab(
y_test,
knn_predictions,
rownames = ['Reference'],
colnames = ['Predicted'])
6. Import the accuracy_score function from sklearn.
from sklearn.metrics import accuracy_score
7. Get the prediction accuracy.
knn_score = accuracy_score(
y_true = y_test,
y_pred = knn_predictions)
8. Inspect the prediction accuracy.
print(knn_score)
9. Visualize the KNN predictions, with correct predictions in black and incorrect predictions in red.
plt.scatter(
x = X_test.Age,
y = X_test.BMI,
color = np.where(
y_test == knn_predictions,
'black',
'red'))
plt.xlabel("Age")
plt.ylabel("BMI")
plt.show()
10. Question: Why do you think these data points were misclassified?
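The value k = 3 above was fixed by choice; a common next step is to compare several values of k on held-out data and keep the best. A minimal sketch on synthetic two-class blobs (the data layout is an assumption, not the workshop data set):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
# Two synthetic blobs standing in for the Low/High risk classes
X_demo = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_demo = np.array(["Low"] * 50 + ["High"] * 50)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Fit one model per candidate k and score each on the held-out fold
scores = {}
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr)
    scores[k] = accuracy_score(yte, model.predict(Xte))
best_k = max(scores, key=scores.get)
```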
1. Import the decision tree classifier from sklearn.
from sklearn.tree import DecisionTreeClassifier
2. Create a decision tree classifier with max_depth = 3.
tree_model = DecisionTreeClassifier(
max_depth = 3)
3. Train the model using the training data.
tree_model.fit(
X = X_train,
y = y_train)
4. Import the tree visualizer from sklearn.
from sklearn.tree import export_graphviz
5. Visualize the decision tree.
import graphviz
tree_graph = export_graphviz(
decision_tree = tree_model,
feature_names = list(X_train.columns.values),
class_names = list(tree_model.classes_),
out_file = None)
graphviz.Source(tree_graph)
6. Question: Are you able to read and follow the logic of this decision tree?
7. Predict the labels of the test set with the model.
tree_predictions = tree_model.predict(X_test)
8. Get the prediction accuracy.
tree_score = accuracy_score(
y_true = y_test,
y_pred = tree_predictions)
9. Inspect the prediction accuracy.
print(tree_score)
10. Visualize the prediction errors (in red).
plt.scatter(
x = X_test.Age,
y = X_test.BMI,
color = np.where(
y_test == tree_predictions,
'black',
'red'))
plt.xlabel("Age")
plt.ylabel("BMI")
plt.show()
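Beyond the graphviz plot, a fitted tree also reports how much each feature contributed to its splits via the feature_importances_ attribute. A sketch on synthetic data where, by construction, the label depends only on the first feature:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)  # label depends only on feature 0

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Importances sum to 1; here nearly all weight falls on feature 0
importances = model.feature_importances_
```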
1. Import the standard scaler from sklearn.
from sklearn.preprocessing import StandardScaler
2. Create a standard scaler.
scaler = StandardScaler()
3. Fit the scaler to the training data only (X_train), so that no information from the test set leaks into the scaling parameters.
scaler.fit(X_train)
4. Scale the training and test set.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
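Note that transform returns a plain numpy array, so column names and index are lost; if DataFrame access (e.g. X_test.Age) is still needed after scaling, the result can be wrapped back. A sketch on a small synthetic frame:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "BMI": [18.0, 25.0, 32.0]})

scaler = StandardScaler().fit(df)
# Re-wrap the ndarray so column names and index survive scaling
scaled = pd.DataFrame(scaler.transform(df), columns=df.columns, index=df.index)
```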
5. Import the neural network classifier from sklearn.
from sklearn.neural_network import MLPClassifier
6. Create a neural network classifier with a single hidden layer of 4 tanh units. (Note: hidden_layer_sizes takes a tuple; (4,) means one hidden layer of four neurons, whereas (4) is just the integer 4.)
neural_model = MLPClassifier(
hidden_layer_sizes = (4,),
activation = "tanh",
max_iter = 2000)
7. Train the model using the training data.
neural_model.fit(
X = X_train_scaled,
y = y_train)
8. Predict the test set labels using the model.
neural_predictions = neural_model.predict(X_test_scaled)
9. Get the prediction accuracy.
neural_score = accuracy_score(
y_true = y_test,
y_pred = neural_predictions)
10. Inspect the prediction accuracy.
print(neural_score)
11. Visualize the prediction errors (in red).
plt.scatter(
x = X_test.Age,
y = X_test.BMI,
color = np.where(
y_test == neural_predictions,
'black',
'red'))
plt.xlabel("Age")
plt.ylabel("BMI")
plt.show()
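For reference, because hidden_layer_sizes takes a tuple, deeper networks are written as e.g. (8, 4) for two hidden layers. A sketch on synthetic data (the layer sizes here are arbitrary, not a recommendation):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(80, 3))
y_demo = (X_demo.sum(axis=1) > 0).astype(int)

# (8, 4) means two hidden layers: 8 neurons, then 4
model = MLPClassifier(hidden_layer_sizes=(8, 4), activation="tanh",
                      max_iter=2000, random_state=0)
model.fit(X_demo, y_demo)

# n_layers_ counts input + hidden + output layers
n_layers = model.n_layers_
```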
1. Compare the accuracy of all three models.
print("KNN: ", knn_score)
print("Tree:", tree_score)
print("NNet:", neural_score)
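A single 80/20 split can make any of these accuracies look better or worse by luck of the draw; cross-validation averages the estimate over several splits. A hedged sketch using KNN on synthetic blobs (not the workshop data):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
# Two synthetic blobs standing in for the Low/High risk classes
X_demo = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_demo = np.array(["Low"] * 50 + ["High"] * 50)

# Five folds -> five accuracy estimates; their mean is less split-dependent
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=3),
                            X_demo, y_demo, cv=5)
```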
2. Question: Which of these three classifiers would you choose? Why?