Lab 4A - Clustering

1. Load the Data

1. Import the OS library.

In [1]:
import os

2. Set the working directory.

In [2]:
os.chdir("C:\\Workshop\\Data")

3. Import the pandas library as "pd".

In [3]:
import pandas as pd

4. Read the Iris CSV file into a data frame called iris.

In [4]:
iris = pd.read_csv("Iris.csv")

2. Explore the Data

1. Inspect the iris data frame using the head function.

In [5]:
iris.head()
Out[5]:
   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
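
Optionally, a quick numeric summary of the measurement columns can complement head; describe is a standard pandas method (this cell is an optional aside, not one of the lab steps):

In [ ]:
iris.describe()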

2. Import pyplot from matplotlib as "plt".

In [6]:
import matplotlib.pyplot as plt

3. Create a scatterplot matrix of the data set.

In [7]:
pd.plotting.scatter_matrix(
    frame = iris,
    alpha = 1,
    s = 100,
    diagonal = 'none');

4. Question: Do you see any natural clusters in these data? How many?
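
One optional way to probe this question visually, assuming the Species labels may be consulted, is to redraw the scatterplot matrix with points colored by species:

In [ ]:
# Optional: color points by the known species labels.
species_palette = {'setosa': '#fb8072', 'versicolor': '#80b1d3', 'virginica': '#b3de69'}
species_colors = iris.Species.apply(lambda s: species_palette[s])

pd.plotting.scatter_matrix(
    frame = iris,
    alpha = 1,
    s = 100,
    diagonal = 'none',
    color = species_colors);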

3. Transform the Data

1. Create a data frame of features for clustering (i.e. all columns except Species).

In [8]:
X = iris.iloc[:, 0:4]

2. Inspect the features using the head function.

In [9]:
X.head()
Out[9]:
   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2
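
Selecting by position works because the features occupy the first four columns; an equivalent optional alternative selects by name, which is robust to column reordering:

In [ ]:
# Optional: equivalent feature selection by column name.
X = iris.drop(columns = 'Species')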

3. Import the numpy library as "np".

In [10]:
import numpy as np

4. Set the random number seed.

In [11]:
np.random.seed(42)

4. Cluster with k-Means

1. Import the k-means class from sklearn.

In [12]:
from sklearn.cluster import KMeans

2. Create a k-means model with k = 3 and 10 random initializations.

In [13]:
k_model = KMeans(
    n_clusters = 3,
    n_init = 10)
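
With n_init = 10 the algorithm runs from ten random centroid initializations and keeps the best fit. The global seed set above makes this reproducible; an optional equivalent is to seed the model directly:

In [ ]:
# Optional: the same model with an explicit per-model seed.
k_model = KMeans(
    n_clusters = 3,
    n_init = 10,
    random_state = 42)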

3. Fit the model to the data.

In [14]:
k_model.fit(X)
Out[14]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
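
After fitting, the learned quantities are available as model attributes; a quick optional inspection:

In [ ]:
k_model.labels_[:10]      # cluster assignments of the first ten observations
k_model.cluster_centers_  # one row of four feature means per cluster
k_model.inertia_          # total within-cluster sum of squared distances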

4. Create a palette with one color for each of the three clusters.

In [15]:
palette = {0:'#fb8072', 1:'#80b1d3', 2:'#b3de69'}

5. Map the colors to each of the clusters.

In [16]:
k_colors = pd.Series(k_model.labels_) \
    .apply(lambda x:palette[x])

6. Plot a scatterplot of petal width (y) vs. petal length (x) colored by the clusters.
Superimpose the centroids of each cluster as X's on the scatterplot.

In [17]:
plt.scatter(
    x = iris.Petal_Length,
    y = iris.Petal_Width,
    color = k_colors)
plt.scatter(
    x = k_model.cluster_centers_[:,2],
    y = k_model.cluster_centers_[:,3],
    marker = 'x',
    color = "black",
    s = 100)
plt.xlabel("Petal Length")
plt.ylabel("Petal Width")
plt.show()

7. Question: How do you interpret this plot?
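
Because the true species labels are known for this data set, an optional cross-tabulation shows how well the clusters recover them (cluster numbers are arbitrary, so read the table by row and column patterns rather than the diagonal):

In [ ]:
# Optional: compare cluster assignments against the known species.
pd.crosstab(iris.Species, k_model.labels_)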

5. Cluster with Hierarchical Clustering

1. Import the agglomerative clustering class from sklearn.

In [18]:
from sklearn.cluster import AgglomerativeClustering

2. Create a hierarchical clustering model with three clusters.

In [19]:
h_model = AgglomerativeClustering(
    n_clusters = 3)

3. Fit the model to the data.

In [20]:
h_model.fit(X)
Out[20]:
AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
            connectivity=None, linkage='ward', memory=None, n_clusters=3,
            pooling_func=<function mean at 0x000001D44ED140D0>)

4. Import the dendrogram function from scipy.

In [21]:
from scipy.cluster.hierarchy import dendrogram

5. Plot the dendrogram.

In [22]:
# Each row of children_ records which two clusters were merged at that step.
children = h_model.children_

# AgglomerativeClustering does not store merge distances, so substitute
# uniformly increasing placeholder heights for plotting purposes.
distance = np.arange(children.shape[0])

# Placeholder counts of observations in each newly formed cluster.
observations = np.arange(2, children.shape[0] + 2)

# Assemble the linkage matrix format expected by scipy's dendrogram.
linkage_matrix = np.column_stack([children, distance, observations]).astype(float)

dendrogram(
    Z = linkage_matrix,
    leaf_font_size = 8,
    color_threshold = 147);

6. Question: How do you interpret this dendrogram?
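
The linkage matrix above uses evenly spaced placeholder heights, since AgglomerativeClustering does not record merge distances. An optional alternative is to let scipy compute the ward linkage, with true merge distances, directly from the features:

In [ ]:
# Optional: ward linkage with real merge distances, computed by scipy.
from scipy.cluster.hierarchy import linkage

Z = linkage(X, method = 'ward')
dendrogram(
    Z = Z,
    leaf_font_size = 8);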

7. Map the previous three colors to each cluster.

In [23]:
h_colors = pd.Series(h_model.labels_) \
    .apply(lambda x:palette[x])

8. Plot a scatterplot of petal width (y) vs. petal length (x) colored by cluster.

In [24]:
plt.scatter(
    x = iris.Petal_Length,
    y = iris.Petal_Width,
    color = h_colors)
plt.xlabel("Petal Length")
plt.ylabel("Petal Width")
plt.show()

9. Question: What is the difference between these two methods of clustering?
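
To compare the two partitions quantitatively, an optional check is scikit-learn's adjusted Rand index, which equals 1.0 when two labelings agree exactly, regardless of how the cluster numbers are assigned:

In [ ]:
# Optional: agreement between the k-means and hierarchical partitions.
from sklearn.metrics import adjusted_rand_score

adjusted_rand_score(k_model.labels_, h_model.labels_)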