Lab 2A - Classification

1. Load the Data

1. Import the os library.

In [1]:
import os

2. Set the working directory to "C:\Workshop\Data".

In [2]:
os.chdir(r"C:\Workshop\Data")  # raw string avoids backslash escape issues on Windows paths

3. Import the pandas library as "pd".

In [3]:
import pandas as pd

4. Read the Iris CSV file into a data frame named iris.

In [4]:
iris = pd.read_csv("Iris.csv")

2. Explore the Data

1. Inspect the iris data set with the head function.

In [5]:
iris.head()
Out[5]:
   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

2. Import the matplotlib.pyplot library as "plt".

In [6]:
import matplotlib.pyplot as plt

3. Create a color palette containing three colors for setosa, versicolor, and virginica.

In [7]:
palette = {
    'setosa':'#fb8072', 
    'versicolor':'#80b1d3', 
    'virginica':'#b3de69'}

4. Map the colors to each species of iris flower.

In [8]:
colors = iris.Species.apply(lambda x:palette[x])
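
Note: an equivalent one-liner uses pandas' built-in map method, which looks each species up in the palette dictionary directly:

# Equivalent to the lambda above
colors = iris.Species.map(palette)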

5. Create a scatterplot matrix of the iris data set colored by species.
Note: the trailing semicolon suppresses the cell's text output so that only the plot is displayed.

In [9]:
pd.plotting.scatter_matrix(
    frame = iris,
    color = colors,
    alpha = 1,
    s = 100,
    diagonal = "none");

6. Create a scatterplot of petal width (on the y-axis) vs. petal length (on the x-axis) colored by species.

In [10]:
plt.scatter(
    x = iris.Petal_Length,
    y = iris.Petal_Width,
    color = colors)
plt.xlabel("Petal Length")
plt.ylabel("Petal Width")
plt.show()

3. Transform the Data

1. Create a data frame named X containing all features (i.e. the first four columns).

In [11]:
X = iris.iloc[:, 0:4]

2. Inspect the features data frame X using the head function.

In [12]:
X.head()
Out[12]:
   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2

3. Create a series named y containing the Species labels.

In [13]:
y = iris.Species

4. Inspect the series of labels y using the head function.

In [14]:
y.head()
Out[14]:
0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: Species, dtype: object

4. Create the Training and Test Set

1. Import the numpy library as "np".

In [15]:
import numpy as np

2. Set the random number seed to 123.

In [16]:
np.random.seed(123)

3. Import the train_test_split function from sklearn.

In [17]:
from sklearn.model_selection import train_test_split

4. Randomly split the 150 rows into a training set of 100 rows (67%) and a test set of 50 rows (33%).

In [18]:
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    train_size = 0.67,
    test_size = 0.33)
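
Note: because no random_state is passed, the split above is driven by the NumPy seed set earlier. If you want each species represented proportionally in both sets, train_test_split also accepts a stratify argument (an optional variation not used in this lab):

# Optional variation: preserve class proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    train_size = 0.67,
    test_size = 0.33,
    stratify = y)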

5. Inspect the shape of the training and test sets using their shape property.

In [19]:
print("X_train: ", X_train.shape)
print("y_train: ", y_train.shape)
print("X_test:  ", X_test.shape)
print("y_test:  ", y_test.shape)
X_train:  (100, 4)
y_train:  (100,)
X_test:   (50, 4)
y_test:   (50,)

6. Question: How do you interpret these shapes in terms of rows and columns?

5. Predict with K-Nearest Neighbors

1. Import the KNN classifier class from sklearn.

In [20]:
from sklearn.neighbors import KNeighborsClassifier

2. Create a KNN model with k = 3.

In [21]:
knn_model = KNeighborsClassifier(
    n_neighbors = 3)

3. Train the model using the training data.

In [22]:
knn_model.fit(
    X = X_train,
    y = y_train)
Out[22]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

4. Predict the labels of the test set using the model.

In [23]:
knn_predictions = knn_model.predict(X_test)

5. Create a confusion matrix for the predictions.

In [24]:
pd.crosstab(
    y_test, 
    knn_predictions, 
    rownames = ['Reference'], 
    colnames = ['Predicted'])
Out[24]:
Predicted   setosa  versicolor  virginica
Reference
setosa          20           0          0
versicolor       0          10          1
virginica        0           1         18
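
Note: scikit-learn can produce the same table with sklearn.metrics.confusion_matrix; this optional sketch returns the counts as an array, with rows and columns ordered by knn_model.classes_:

from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predicted labels
confusion_matrix(y_test, knn_predictions, labels = knn_model.classes_)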

6. Import the accuracy_score function from sklearn.

In [25]:
from sklearn.metrics import accuracy_score

7. Get the prediction accuracy.

In [26]:
knn_score = accuracy_score(
    y_true = y_test,
    y_pred = knn_predictions)

8. Inspect the prediction accuracy.

In [27]:
print(knn_score)
0.96
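
Note: accuracy is simply the fraction of test rows whose predicted label matches the true label, so you can verify the score by hand:

# Manual check: proportion of correct predictions (matches knn_score)
print(np.mean(y_test == knn_predictions))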

9. Visualize the KNN predictions, with correct predictions in black and incorrect predictions in red.

In [28]:
plt.scatter(
    x = X_test.Petal_Length,
    y = X_test.Petal_Width,
    color = np.where(
        y_test == knn_predictions, 
        'black', 
        'red'))
plt.xlabel("Petal Length")
plt.ylabel("Petal Width")
plt.show()

10. Question: Why do you think these two points were misclassified?

6. Predict with a Decision Tree Classifier

1. Import the decision tree classifier from sklearn.

In [29]:
from sklearn.tree import DecisionTreeClassifier

2. Create a decision tree classifier with max_depth = 3.

In [30]:
tree_model = DecisionTreeClassifier(
    max_depth = 3)

3. Train the model using the training data.

In [31]:
tree_model.fit(
    X = X_train, 
    y = y_train)
Out[31]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

4. Import the tree visualizer from sklearn.

In [32]:
from sklearn.tree import export_graphviz

5. Visualize the decision tree.

In [33]:
import graphviz
tree_graph = export_graphviz(
    decision_tree = tree_model,
    feature_names = list(X_train.columns.values),  
    class_names = list(tree_model.classes_),  # class names must follow the model's sorted class order
    out_file = None) 
graphviz.Source(tree_graph)
Out[33]:
[Decision tree diagram: the root splits on Petal_Width <= 0.8, sending 30 pure setosa samples down the True branch; the False branch splits on Petal_Width <= 1.75, whose left side (Petal_Length <= 5.35) is almost entirely versicolor and whose right side (Petal_Length <= 4.85) is almost entirely virginica.]
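
Note: graphviz must be installed separately on your system. Newer versions of scikit-learn (0.21 and later) can draw the same tree with matplotlib alone; a minimal sketch, assuming such a version is available:

from sklearn.tree import plot_tree  # requires scikit-learn 0.21+

plot_tree(
    tree_model,
    feature_names = list(X_train.columns),
    class_names = list(tree_model.classes_),
    filled = True)
plt.show()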

6. Question: Are you able to read and follow the logic of this decision tree?

7. Predict the labels of the test set with the model.

In [34]:
tree_predictions = tree_model.predict(X_test)

8. Get the prediction accuracy.

In [35]:
tree_score = accuracy_score(
    y_true = y_test, 
    y_pred = tree_predictions)

9. Inspect the prediction accuracy.

In [36]:
print(tree_score)
0.94

10. Visualize the prediction errors (in red).

In [37]:
plt.scatter(
    x = X_test.Petal_Length,
    y = X_test.Petal_Width,
    color = np.where(
        y_test == tree_predictions, 
        'black', 
        'red'))
plt.xlabel("Petal Length")
plt.ylabel("Petal Width")
plt.show()

7. Predict with a Neural Network

1. Import the standard scaler from sklearn.

In [38]:
from sklearn.preprocessing import StandardScaler

2. Create a standard scaler.

In [39]:
scaler = StandardScaler()

3. Fit the scaler to the full feature data frame X.

In [40]:
scaler.fit(X)
Out[40]:
StandardScaler(copy=True, with_mean=True, with_std=True)

4. Scale the training and test set.

In [41]:
X_train_scaled = scaler.transform(X_train)

X_test_scaled = scaler.transform(X_test)
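
Note: because the scaler above was fit on all of X, the test rows influence the scaling parameters. A stricter alternative fits the scaler on the training set only, which avoids this form of leakage:

# Stricter alternative: learn the scaling from the training set only
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)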

5. Import the neural network classifier from sklearn.

In [42]:
from sklearn.neural_network import MLPClassifier

6. Create a neural network classifier with one hidden layer of 4 tanh units.

In [43]:
neural_model = MLPClassifier(
    hidden_layer_sizes = (4,),  # one hidden layer with 4 units
    activation = "tanh",
    max_iter = 2000)
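
Note: hidden_layer_sizes takes a tuple with one entry per hidden layer, so (4,) means a single hidden layer of 4 units. A deeper network is a hypothetical variation you could experiment with, for example:

# Hypothetical variation: two hidden layers of 4 units each
deeper_model = MLPClassifier(
    hidden_layer_sizes = (4, 4),
    activation = "tanh",
    max_iter = 2000)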

7. Train the model using the training data.

In [44]:
neural_model.fit(
    X = X_train_scaled, 
    y = y_train)
Out[44]:
MLPClassifier(activation='tanh', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=4, learning_rate='constant',
       learning_rate_init=0.001, max_iter=2000, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

8. Predict the test set labels using the model.

In [45]:
neural_predictions = neural_model.predict(X_test_scaled)

9. Get the prediction accuracy.

In [46]:
neural_score = accuracy_score(
    y_true = y_test, 
    y_pred = neural_predictions)

10. Inspect the prediction accuracy.

In [47]:
print(neural_score)
0.98

11. Visualize the prediction errors (in red).

In [48]:
plt.scatter(
    x = X_test.Petal_Length,
    y = X_test.Petal_Width,
    color = np.where(
        y_test == neural_predictions, 
        'black', 
        'red'))
plt.xlabel("Petal Length")
plt.ylabel("Petal Width")
plt.show()

8. Evaluate the Classifiers

1. Compare the accuracy of all three models.

In [49]:
print("KNN: ", knn_score)
print("Tree:", tree_score)
print("NNet:", neural_score)
KNN:  0.96
Tree: 0.94
NNet: 0.98
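
Note: a single 100/50 split can be noisy on a data set this small, so these rankings may change with a different seed. For a more stable comparison, cross_val_score averages accuracy over several splits; a minimal sketch for the KNN model:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy on the full data set
cv_scores = cross_val_score(knn_model, X, y, cv = 5)
print(cv_scores.mean())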

2. Question: Which of these three classifiers would you choose? Why?