Lab 4A - Clustering

1. Load the Data

1. Import the OS library.

In [1]:
import os

2. Set the working directory.

In [2]:
os.chdir("C:\\Workshop\\Data")

3. Import the pandas library as "pd".

In [3]:
import pandas as pd

4. Read the Iris CSV file into a data frame called iris.

In [4]:
iris = pd.read_csv("Iris.csv")

2. Explore the Data

1. Inspect the iris data frame using the head function.

In [5]:
iris.head()
Out[5]:
   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
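
Optionally, a quick numeric summary of the measurement columns can complement head; describe is a standard pandas method (this cell is an optional aside, not one of the lab steps):

In [ ]:
iris.describe()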

2. Import pyplot from matplotlib as "plt".

In [6]:
import matplotlib.pyplot as plt

3. Create a scatterplot matrix of the data set.

In [7]:
pd.plotting.scatter_matrix(
    frame = iris,
    alpha = 1,
    s = 100,
    diagonal = 'none');

4. Question: Do you see any natural clusters in these data? How many?
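
One optional way to probe this question visually, assuming the Species labels may be consulted, is to redraw the scatterplot matrix with points colored by species:

In [ ]:
# Optional: color points by the known species labels.
species_palette = {'setosa': '#fb8072', 'versicolor': '#80b1d3', 'virginica': '#b3de69'}
species_colors = iris.Species.apply(lambda s: species_palette[s])

pd.plotting.scatter_matrix(
    frame = iris,
    alpha = 1,
    s = 100,
    diagonal = 'none',
    color = species_colors);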

3. Transform the Data

1. Create a data frame of features for clustering (i.e. all columns except Species).

In [8]:
X = iris.iloc[:, 0:4]

2. Inspect the features using the head function.

In [9]:
X.head()
Out[9]:
   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2
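
Selecting by position works because the features occupy the first four columns; an equivalent optional alternative selects by name, which is robust to column reordering:

In [ ]:
# Optional: equivalent feature selection by column name.
X = iris.drop(columns = 'Species')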

3. Import the numpy library as "np".

In [10]:
import numpy as np

4. Set the random number seed.

In [11]:
np.random.seed(42)

4. Cluster with k-Means

1. Import the k-means class from sklearn.

In [12]:
from sklearn.cluster import KMeans

2. Create a k-means model with k = 3 and 10 random initializations.

In [13]:
k_model = KMeans(
    n_clusters = 3,
    n_init = 10)
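
With n_init = 10 the algorithm runs from ten random centroid initializations and keeps the best fit. The global seed set above makes this reproducible; an optional equivalent is to seed the model directly:

In [ ]:
# Optional: the same model with an explicit per-model seed.
k_model = KMeans(
    n_clusters = 3,
    n_init = 10,
    random_state = 42)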

3. Fit the model to the data.

In [14]:
k_model.fit(X)
Out[14]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
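
After fitting, the learned quantities are available as model attributes; a quick optional inspection:

In [ ]:
k_model.labels_[:10]      # cluster assignments of the first ten observations
k_model.cluster_centers_  # one row of four feature means per cluster
k_model.inertia_          # total within-cluster sum of squared distances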

4. Create a palette with one color for each of the three clusters.

In [15]:
palette = {0:'#fb8072', 1:'#80b1d3', 2:'#b3de69'}

5. Map the colors to each of the clusters.

In [16]:
k_colors = pd.Series(k_model.labels_) \
    .apply(lambda x:palette[x])

6. Plot a scatterplot of petal width (y) vs. petal length (x) colored by the clusters.
Superimpose the centroids of each cluster as X's on the scatterplot.

In [17]:
plt.scatter(
    x = iris.Petal_Length,
    y = iris.Petal_Width,
    color = k_colors)
plt.scatter(
    x = k_model.cluster_centers_[:,2],
    y = k_model.cluster_centers_[:,3],
    marker = 'x',
    color = "black",
    s = 100)
plt.xlabel("Petal Length")
plt.ylabel("Petal Width")
plt.show()

7. Question: How do you interpret this plot?
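
Because the true species labels are known for this data set, an optional cross-tabulation shows how well the clusters recover them (cluster numbers are arbitrary, so read the table by row and column patterns rather than the diagonal):

In [ ]:
# Optional: compare cluster assignments against the known species.
pd.crosstab(iris.Species, k_model.labels_)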

5. Cluster with Hierarchical Clustering

1. Import the agglomerative clustering class from sklearn.

In [18]:
from sklearn.cluster import AgglomerativeClustering

2. Create a hierarchical clustering model with three clusters.

In [19]:
h_model = AgglomerativeClustering(
    n_clusters = 3)

3. Fit the model to the data.

In [20]:
h_model.fit(X)
Out[20]:
AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
            connectivity=None, linkage='ward', memory=None, n_clusters=3,
            pooling_func=<function mean at 0x000001D44ED140D0>)

4. Import the dendrogram function from scipy.

In [21]:
from scipy.cluster.hierarchy import dendrogram

5. Plot the dendrogram.

In [22]:
# Each row of children_ records which two clusters were merged at that step.
children = h_model.children_

# AgglomerativeClustering does not store merge distances, so substitute
# uniformly increasing placeholder heights for plotting purposes.
distance = np.arange(children.shape[0])

# Placeholder counts of observations in each newly formed cluster.
observations = np.arange(2, children.shape[0] + 2)

# Assemble the linkage matrix format expected by scipy's dendrogram.
linkage_matrix = np.column_stack([children, distance, observations]).astype(float)

dendrogram(
    Z = linkage_matrix,
    leaf_font_size = 8,
    color_threshold = 147);

6. Question: How do you interpret this dendrogram?
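
The linkage matrix above uses evenly spaced placeholder heights, since AgglomerativeClustering does not record merge distances. An optional alternative is to let scipy compute the ward linkage, with true merge distances, directly from the features:

In [ ]:
# Optional: ward linkage with real merge distances, computed by scipy.
from scipy.cluster.hierarchy import linkage

Z = linkage(X, method = 'ward')
dendrogram(
    Z = Z,
    leaf_font_size = 8);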

7. Map the previous three colors to each cluster.

In [23]:
h_colors = pd.Series(h_model.labels_) \
    .apply(lambda x:palette[x])

8. Plot a scatterplot of petal width (y) vs. petal length (x) colored by cluster.

In [24]:
plt.scatter(
    x = iris.Petal_Length,
    y = iris.Petal_Width,
    color = h_colors)
plt.xlabel("Petal Length")
plt.ylabel("Petal Width")
plt.show()

9. Question: What is the difference between these two methods of clustering?
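
To compare the two partitions quantitatively, an optional check is scikit-learn's adjusted Rand index, which equals 1.0 when two labelings agree exactly, regardless of how the cluster numbers are assigned:

In [ ]:
# Optional: agreement between the k-means and hierarchical partitions.
from sklearn.metrics import adjusted_rand_score

adjusted_rand_score(k_model.labels_, h_model.labels_)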