1. Import the OS library.
import os
2. Set the working directory.
# Workshop data directory; Windows-specific path — adjust when running elsewhere.
os.chdir("C:\\Workshop\\Data")
3. Import the pandas library as "pd".
import pandas as pd
4. Read the Rates.csv file into a data frame called policies.
# Load the policy rates data set; expects Rates.csv in the working directory.
policies = pd.read_csv("Rates.csv")
1. Inspect the policy rates data set using the head function.
Note: Notice this data set has a numeric Rate variable instead of a categorical Risk variable.
policies.head()  # preview the first rows; Rate is a numeric column here
2. Import the matplotlib pyplot library as "plt".
import matplotlib.pyplot as plt
3. Create a scatterplot matrix of the policies data set.
Note: The semicolon at the end prevents text output from being displayed with the plot.
# Pairwise scatterplots of every pair of numeric columns. The trailing
# semicolon suppresses the textual return value in a notebook.
pd.plotting.scatter_matrix(policies, alpha=1, s=100, diagonal='none');
4. Create a correlation matrix of the policies data set.
# Pairwise Pearson correlations of the numeric columns.
# numeric_only=True (pandas >= 1.5) keeps this working on pandas >= 2.0,
# where non-numeric columns (Gender is a string column) are no longer
# dropped silently and .corr() would raise instead.
correlations = policies.corr(numeric_only=True)
print(correlations)
5. Import the seaborn library as "sns".
import seaborn as sns
6. Create a correlogram using the correlation matrix.
# Correlogram of the correlation matrix: hue 10 (red) marks negative,
# hue 220 (blue) marks positive correlations.
diverging_cmap = sns.diverging_palette(10, 220, as_cmap=True)
sns.heatmap(correlations, cmap=diverging_cmap);
7. Question: Which variable is most strongly correlated with Rate?
8. Get the correlation between Age and Rate.
# Pearson correlation between the Age and Rate columns.
policies["Age"].corr(policies["Rate"])
9. Create a scatterplot of Rate (on the y-axis) vs Age (on the x-axis).
# Scatterplot of Rate (y) against Age (x), with axis labels.
plt.scatter(policies.Age, policies.Rate)
plt.xlabel("Age")
plt.ylabel("Rate")
plt.show()
1. Inspect the policies data set.
policies.head()  # re-inspect the raw data before building features
2. Create a data frame named X containing the feature variables Gender, Age, State_Rate, and BMI.
# Feature matrix: the four predictor columns (the Rate label is excluded).
feature_columns = ["Gender", "Age", "State_Rate", "BMI"]
X = policies[feature_columns]
3. Inspect the features X.
X.head()  # preview the feature matrix
4. Convert the categorical variable Gender into a set of one-hot-encoding variables.
# One-hot encode Gender: one indicator column per category value.
dummies = pd.get_dummies(X["Gender"])
5. Inspect the one-hot encoded variables.
dummies.head()  # preview the one-hot indicator columns
6. Append the one-hot-encoded gender variables to the features data set X.
# Append the gender indicator columns to the feature matrix (column-wise).
X = pd.concat([X, dummies], axis="columns")
7. Drop the Gender column from the features data frame X.
# Drop the raw Gender column now that it is one-hot encoded.
# Bug fix: the positional axis argument (drop("Gender", 1)) was deprecated
# in pandas 1.0 and removed in pandas 2.0; the columns= keyword is the
# supported, equivalent spelling.
X = X.drop(columns="Gender")
8. Inspect the features data frame X.
X.head()  # confirm Gender is gone and the indicator columns remain
9. Create a series named y containing just the labels (i.e. Rate).
# Label series: the numeric Rate values we want to predict.
y = policies["Rate"]
10. Inspect the series of labels y.
y.head()  # preview the label series
### 4. Create the Training and Test Set
1. Import the numpy library as "np".
import numpy as np
2. Set the random number seed to 42.
# Fix the random seed so the train/test split below is reproducible.
np.random.seed(42)
3. Import the train_test_split function from sklearn.
from sklearn.model_selection import train_test_split
4. Randomly sample 80% of the rows for the training set and 20% of the rows for the test set.
# Randomly partition rows: 80% training, 20% test (seeded above).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, train_size=0.80)
5. Inspect the shape of the training and test sets using the shape property.
# Confirm the split: train/test row counts should be 80% / 20% of the data.
for part_name, part in (("X_train", X_train), ("y_train", y_train),
                        ("X_test", X_test), ("y_test", y_test)):
    print(f"{part_name}: ", part.shape)
1. Import the linear regression class from sklearn.
from sklearn.linear_model import LinearRegression
2. Create a simple linear regression model.
# Ordinary least squares model for the single-feature (Age) regression.
simple_model = LinearRegression()
3. Create a data frame named x1_train containing only the Age feature from the training set.
# Single-column DataFrame (not a Series) so sklearn gets 2-D input.
x1_train = X_train[["Age"]]
4. Create a data frame named x1_test containing only the Age feature from the test set.
# Matching single-column test DataFrame for the Age-only model.
x1_test = X_test[["Age"]]
5. Train the model using the training data.
Note: You should be using x1_train as your training data.
# Fit the one-feature model: Rate ~ Age.
simple_model.fit(x1_train, y_train)
6. Draw the regression line on top of a scatterplot of Rate (y-axis) vs Age (x-axis).
# Grey scatter of all the data with the fitted regression line drawn in
# blue over the test-set Age values.
plt.scatter(policies.Age, policies.Rate, color="grey")
plt.plot(x1_test, simple_model.predict(x1_test),
         color="blue", linewidth=3)
plt.xlabel("Age")
plt.ylabel("Rate")
plt.show()
7. Inspect the slope (m) and y-intercept (b) parameter estimates.
# Fitted parameters of Rate ≈ m * Age + b.
b, m = simple_model.intercept_, simple_model.coef_[0]
print("y-intercept (b): ", b)
print("Slope (m): ", m)
8. Question: How do you interpret these two values?
9. Predict the labels of the test set using the model.
# Predicted Rate for each test-set row, from the Age-only model.
simple_predictions = simple_model.predict(x1_test)
10. Visualize the prediction error.
# Visualize the prediction error on the test set:
#   grey    = training points
#   blue x  = model predictions
#   green   = actual test values
#   red     = vertical error segments (prediction -> truth)
# Plot the training set (grey dots)
plt.scatter(
    x = x1_train.Age,
    y = y_train,
    color = "grey",
    facecolor = "none")
# Plot the predictions (blue x marks)
plt.scatter(
    x = x1_test.Age,
    y = simple_predictions,
    color = "blue",
    marker = 'x')
# Plot the correct answer (green dots)
plt.scatter(
    x = x1_test.Age,
    y = y_test,
    color = "green")
# Plot the error (red lines); zorder=0 draws them beneath the points
plt.plot(
    [x1_test.Age, x1_test.Age],
    [simple_predictions, y_test],
    color = "red",
    zorder = 0)
# Finish the plot
plt.xlabel("Age")
# Bug fix: label was "Risk"; this data set's variable is the numeric Rate,
# matching every other plot in the file.
plt.ylabel("Rate")
plt.show()
11. How do you interpret this graph?
12. Compute the root mean squared error (RMSE) of these predictions.
# Root mean squared error of the Age-only model on the test set.
simple_residuals = y_test - simple_predictions
simple_rmse = np.sqrt(np.mean(simple_residuals ** 2))
print(simple_rmse)
13. Question: Was simple linear regression a good choice for modeling this relationship? Why or why not?
1. Create a linear regression model.
# Ordinary least squares model using ALL features this time.
multiple_model = LinearRegression()
2. Train the model using all features of the training data.
# Fit the multiple regression on the full feature matrix.
multiple_model.fit(X_train, y_train)
3. Inspect the parameter estimates.
# Print the intercept and each coefficient: name left-aligned in 12
# characters, value with 3 decimal places.
param_names = ["y-intercept"] + list(X_train.columns)
param_values = [multiple_model.intercept_] + list(multiple_model.coef_)
for param_name, param_value in zip(param_names, param_values):
    print("{:<12}: {: .3f}".format(param_name, param_value))
4. Question: How do you interpret these values?
5. Predict output values for the input values in the test set.
# Predicted Rate for each test-set row, from the all-features model.
multiple_predictions = multiple_model.predict(X_test)
6. Visualize the prediction error.
# Prediction-error plot for the multiple regression:
# black = training points, blue x = predictions, green = actual test
# values, red = error segments (drawn beneath the points via zorder=0).
plt.scatter(X_train.Age, y_train, color="black", facecolor="none")
plt.scatter(X_test.Age, multiple_predictions, color="blue", marker='x')
plt.scatter(X_test.Age, y_test, color="green")
plt.plot([X_test.Age, X_test.Age],
         [multiple_predictions, y_test],
         color="red", zorder=0)
plt.xlabel("Age")
plt.ylabel("Rate")
plt.show()
7. Question: How do you interpret this graph?
8. Compute the root mean squared error (RMSE) of these predictions.
# Root mean squared error of the multiple regression on the test set.
multiple_residuals = y_test - multiple_predictions
multiple_rmse = np.sqrt(np.mean(multiple_residuals ** 2))
print(multiple_rmse)
9. Question: Is this a better predictive model of the data?
1. Import the standard scaler from sklearn.
from sklearn.preprocessing import StandardScaler
2. Create standard scalers for training and test data.
# Separate z-score standardizers for the features and for the labels
# (the label scaler is needed later to un-scale the predictions).
X_scaler = StandardScaler()
y_scaler = StandardScaler()
3. Fit the scaler to all training data.
# Fit the scalers on the TRAINING data only, as the instruction says.
# Bug fix: the original fitted on the full X and y, which leaks test-set
# statistics (mean/std) into the preprocessing step.
# StandardScaler expects 2-D input, hence the reshape of the label series.
X_scaler.fit(X_train)
y_scaler.fit(y_train.values.reshape(-1, 1))
4. Scale the training and test data.
# Apply the fitted scalers to both splits. StandardScaler works on 2-D
# arrays, so the 1-D label series are reshaped to single-column matrices.
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
y_train_scaled = y_scaler.transform(y_train.values.reshape(-1, 1))
y_test_scaled = y_scaler.transform(y_test.values.reshape(-1, 1))
5. Import the neural network regressor class from sklearn.
from sklearn.neural_network import MLPRegressor
6. Create a neural network regressor with 4 hidden nodes, a tanh activation function, an LBFGS solver, and 1000 maximum iterations.
# One hidden layer of 4 tanh units; L-BFGS converges well on small data.
neural_model = MLPRegressor(hidden_layer_sizes=(4,),
                            activation="tanh",
                            solver="lbfgs",
                            max_iter=1000)
7. Train the model with the training set.
# Train on the scaled features; ravel() flattens the (n, 1) scaled label
# column back to the 1-D vector sklearn expects for y.
neural_model.fit(X_train_scaled, y_train_scaled.ravel())
8. Predict output values for the test set.
# Predictions are on the standardized Rate scale; un-scaled below.
scaled_predictions = neural_model.predict(X_test_scaled)
9. Unscale the predictions.
# Map the predictions back to the original Rate scale.
# Bug fix: inverse_transform requires a 2-D array on scikit-learn >= 1.0,
# so reshape the 1-D prediction vector to a column and flatten the result
# back to 1-D afterwards (same shape the rest of the code expects).
neural_predictions = y_scaler.inverse_transform(
    scaled_predictions.reshape(-1, 1)).ravel()
10. Visualize the prediction error.
# Prediction-error plot for the neural network:
# black = training points, blue x = predictions, green = actual test
# values, red = error segments (drawn beneath the points via zorder=0).
plt.scatter(X_train.Age, y_train, color="black", facecolor="none")
plt.scatter(X_test.Age, neural_predictions, color="blue", marker='x')
plt.scatter(X_test.Age, y_test, color="green")
plt.plot([X_test.Age, X_test.Age],
         [neural_predictions, y_test],
         color="red", zorder=0)
plt.xlabel("Age")
plt.ylabel("Rate")
plt.show()
11. Compute the root mean squared error (RMSE) of these predictions.
# Root mean squared error of the neural network on the test set.
neural_rmse = np.sqrt(np.mean((y_test - neural_predictions) ** 2))
12. Inspect the RMSE of these predictions.
print(neural_rmse)  # lower is better; compare against the linear models
1. Compare all three results.
# Side-by-side RMSE comparison of the three models (lower is better).
print("Simple RMSE: ", simple_rmse)
print("Multiple RMSE: ", multiple_rmse)
print("Neural RMSE: ", neural_rmse)
2. Question: Which of these models would you choose? Why?