15  Advanced Topics

%precision 3

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import datasets
'%.3f'

K-Nearest Neighbors (KNN)

iris = datasets.load_iris()
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(iris['data'], iris['target'])
KNeighborsClassifier(n_neighbors=6)

Task: Split the iris data into training and test sets. Predict on the test set.

# Split into training and test set
X = iris['data']
y = iris['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

# Create a k-NN classifier with 6 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))
1.0

A simple idea with no modeling assumptions at all! Think about the following:

  • What is “the model”, i.e. what needs to be stored? (coefficients, functions, …)
  • What is the model complexity?
  • Does this only work for classification? What would be the regression analogy? (See the sketch after this list.)
  • What improvements could we make to the simple idea?
  • In the modeling world:
    • linear?
    • local vs. global
    • memory/CPU requirements
    • wide versus tall data?
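
As a possible answer to the regression-analogy question above: scikit-learn provides KNeighborsRegressor, which predicts by averaging the target values of the k nearest neighbors instead of taking a majority vote. A minimal sketch (the dataset and k are our choices, not part of the original notes):

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Regression analogue of k-NN classification: average the neighbors' targets
X_reg, y_reg = load_diabetes(return_X_y=True)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42)

knn_reg = KNeighborsRegressor(n_neighbors=6)
knn_reg.fit(X_reg_train, y_reg_train)

# For regressors, score() returns R^2 rather than accuracy
print(knn_reg.score(X_reg_test, y_reg_test))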

Handwritten Digits

digits = datasets.load_digits()
import matplotlib.pyplot as plt

#how can I improve the plots? (i.e. no margins, box around plot...); one approach is sketched after this cell
plt.figure(1)

for i in np.arange(10)+1:
    plt.subplot(2, 5, i)
    plt.axis('off')
    #plt.gray() 
    #plt.matshow(digits.images[i-1]) 
    plt.imshow(digits.images[i-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show() 
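
One possible answer to the layout question in the comment above (a sketch; the figure size and spacing values are our own choices): create the axes explicitly and remove the padding between them.

# Tighter layout for the digit thumbnails, reusing `digits` loaded above
fig, axes = plt.subplots(2, 5, figsize=(8, 3.5))
for ax, img in zip(axes.ravel(), digits.images[:10]):
    ax.imshow(img, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_axis_off()
plt.subplots_adjust(left=0, right=1, top=1, bottom=0, wspace=0.05, hspace=0.05)
plt.show()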

# Create feature and target arrays
X = digits.data
y = digits.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=7)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))
0.9833333333333333
# Big Confusion Matrix
preds = knn.predict(X_test)

pd.crosstab(preds,y_test)
col_0 0 1 2 3 4 5 6 7 8 9
row_0
0 36 0 0 0 0 0 0 0 0 0
1 0 36 0 0 0 0 0 0 2 0
2 0 0 35 0 0 0 0 0 0 0
3 0 0 0 37 0 0 0 0 0 0
4 0 0 0 0 36 0 0 0 0 1
5 0 0 0 0 0 37 0 0 0 0
6 0 0 0 0 0 0 35 0 0 0
7 0 0 0 0 0 0 0 36 1 0
8 0 0 0 0 0 0 1 0 32 1
9 0 0 0 0 0 0 0 0 0 34

Task

Construct a model complexity curve for the digits dataset! In this exercise, you will compute and plot the training and testing accuracy scores for a variety of different neighbor values. By observing how the accuracy scores differ for the training and testing sets with different values of k, you will develop your intuition for overfitting and underfitting.

# # Setup arrays to store train and test accuracies
# neighbors = np.arange(1, 20)
# train_accuracy = np.empty(len(neighbors))
# test_accuracy = np.empty(len(neighbors))
# 
# # Loop over different values of k
# for i, k in enumerate(neighbors):
#     # Setup a k-NN Classifier with k neighbors: knn
#     knn = KNeighborsClassifier(___)
# 
#     # Fit the classifier to the training data
#     ___
#     
#     #Compute accuracy on the training set
#     train_accuracy[i] = ___
# 
#     #Compute accuracy on the testing set
#     test_accuracy[i] = ___
# 
# # Generate plot
# plt.title('k-NN: Varying Number of Neighbors')
# plt.plot(___, label = 'Testing Accuracy')
# plt.plot(___, label = 'Training Accuracy')
# plt.legend()
# plt.xlabel('Number of Neighbors')
# plt.ylabel('Accuracy')
# plt.show()
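
One possible completion of the skeleton above (a sketch; the range of k and the plotting details are our choices):

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Recreate the digits train/test split used above
digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42, stratify=digits.target)

# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 20)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label='Testing Accuracy')
plt.plot(neighbors, train_accuracy, label='Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()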

Multinomial Logistic Regression

from sklearn.linear_model import LogisticRegression
iris = datasets.load_iris()

log_reg = LogisticRegression(multi_class='multinomial',solver='sag', max_iter=100, random_state=42)
log_reg.fit(iris["data"][:,3:],iris["target"])
    
preds = log_reg.predict_proba(iris["data"][:,3:])

preds[1:5,:]
array([[9.259e-01, 7.396e-02, 1.788e-04],
       [9.259e-01, 7.396e-02, 1.788e-04],
       [9.259e-01, 7.396e-02, 1.788e-04],
       [9.259e-01, 7.396e-02, 1.788e-04]])

Task: Compute a confusion matrix
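
A possible way to do this (a sketch using pd.crosstab, as in the digits example above; we predict on the same data the model was fit on):

# Hard class predictions from the fitted multinomial model
class_preds = log_reg.predict(iris["data"][:, 3:])

# Confusion matrix: predicted classes (rows) vs. true classes (columns)
pd.crosstab(class_preds, iris["target"])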

Further Reading and Explorations:

  • Read about “one versus all” versus the multinomial approach. Check out this notebook.
  • Read about marginal probabilities.

ROC Curves

It is simplest to go back to the two-label setting from lesson 7:

iris = datasets.load_iris()
X = iris["data"][:,3:]
y = (iris["target"]==2).astype(int)
# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

log_reg = LogisticRegression()
log_reg.fit(X_train,y_train)

y_pred = log_reg.predict_proba(X_test) 
y_pred_prob = y_pred[:,1]

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred_prob > 0.25 ))

print(confusion_matrix(y_test, y_pred_prob > 0.5 ))

print(confusion_matrix(y_test, y_pred_prob > 0.75 ))
[[24  8]
 [ 0 13]]
[[32  0]
 [ 0 13]]
[[32  0]
 [ 4  9]]

The need for more sophisticated metrics than accuracy at a single threshold

This is particularly relevant for imbalanced classes; for example, emails:

  • Spam classification
    • 99% of emails are real; 1% of emails are spam
  • Could build a classifier that predicts ALL emails as real
    • 99% accurate!
    • But horrible at actually classifying spam
    • Fails at its original purpose

Metrics from the Confusion Matrix

  • Precision: \(\frac{TP}{TP+FP}\)
  • Recall: \(\frac{TP}{TP+FN}\)
  • F1 score: \(2 \cdot \frac{precision \cdot recall}{precision + recall}\)

The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0.

  • High precision: not many real emails predicted as spam
  • High recall: predicted most spam emails correctly
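
For example, taking class 1 as the positive class in the threshold-0.25 confusion matrix above (TN = 24, FP = 8, FN = 0, TP = 13):

\[ \text{precision} = \frac{13}{13+8} \approx 0.62, \qquad \text{recall} = \frac{13}{13+0} = 1.00, \qquad F_1 = 2 \cdot \frac{0.62 \cdot 1.00}{0.62 + 1.00} \approx 0.76 \]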

#Example:
print(classification_report(y_test, y_pred_prob > 0.5 ))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        32
           1       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45

But we still need to fix a threshold for any of the metrics above to work!

Wouldn’t it be best to communicate the prediction quality over a wide range (all!) of thresholds and let the user choose the most suitable one for their application? That is the idea of the Receiver Operating Characteristic (ROC) curve!

#The pima-indians- diabetes data:
#  https://www.kaggle.com/uciml/pima-indians-diabetes-database#diabetes.csv

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
pima = pd.read_csv("../data/diabetes.csv", header=None, names=col_names,skiprows=[0])
pima.head()
pregnant glucose bp skin insulin bmi pedigree age label
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
pima.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pregnant  768 non-null    int64  
 1   glucose   768 non-null    int64  
 2   bp        768 non-null    int64  
 3   skin      768 non-null    int64  
 4   insulin   768 non-null    int64  
 5   bmi       768 non-null    float64
 6   pedigree  768 non-null    float64
 7   age       768 non-null    int64  
 8   label     768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
#split dataset in features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train,y_train)

y_pred = log_reg.predict_proba(X_test) 
y_pred_prob = y_pred[:,1]
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show();

Area under the ROC curve (AUC)

from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, y_pred_prob)
0.7969370860927152

AUC using cross-validation

from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(log_reg, X, y, cv=10, scoring='roc_auc')

print(cv_scores)
[0.789 0.83  0.838 0.773 0.812 0.848 0.819 0.921 0.857 0.833]

Regularized Regression

We will illustrate the concepts on the Boston housing data set.

boston = pd.read_csv('../data/boston.csv')
#print(boston.head())
X = boston.drop('medv', axis=1).values
y = boston['medv'].values

Variable Selection

ISLR slides on model selection

L2 Regression

Recall: Linear regression minimizes a loss function

  • It chooses a coefficient for each feature variable
  • Large coefficients can lead to overfitting
  • Penalizing large coefficients: regularization

Detour: \(L_p\) norms

http://mathworld.wolfram.com/VectorNorm.html

Our new penalty term in finding the coefficients \(\beta_j\) is the minimization

\[ \sum_{i=1}^n{\left( y_i - \beta_0 - \sum_{j=1}^p{\beta_j x_{ij}} \right)^2} + \lambda \sum_{j=1}^p{\beta_j^2} = RSS + \lambda \sum_{j=1}^p{\beta_j^2} \]

Instead of obtaining one set of coefficients we have a dependency of \(\beta_j\) on \(\lambda\):
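
Before fixing a single value of the penalty, here is a minimal sketch of that dependency on the Boston data (same boston.csv file as above; the grid of \(\alpha\) values is our choice):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

# Same data as above
boston = pd.read_csv('../data/boston.csv')
Xb = boston.drop('medv', axis=1).values
yb = boston['medv'].values

# Fit a ridge regression for each penalty value and store the coefficients
alphas = np.logspace(-3, 4, 50)
coefs = []
for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(Xb, yb)
    coefs.append(ridge.coef_)

# Coefficient paths: each curve is one beta_j as a function of the penalty
plt.semilogx(alphas, coefs)
plt.xlabel('alpha (lambda)')
plt.ylabel('coefficients')
plt.title('Ridge coefficient paths')
plt.show()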

from sklearn.linear_model import Ridge
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size = 0.3, random_state=42)

# Note: `normalize=True` was deprecated and removed in later scikit-learn releases;
# with newer versions, scale the features explicitly (e.g. StandardScaler in a Pipeline).
ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(X_train, y_train)
# predict() does take care of the normalization: after fitting, the coefficients are
# rescaled back to the original feature scale, so raw X_test can be passed in.
ridge_pred = ridge.predict(X_test)
# Returns the coefficient of determination R^2 of the prediction.
ridge.score(X_test, y_test)
ridge_pred[0:5]
array([27.97 , 35.383, 16.826, 25.145, 18.749])
ridge2 = Ridge(alpha=0.1, normalize=False)
ridge2.fit(X_train, y_train)
ridge2_pred = ridge2.predict(X_test)
ridge2_pred[0:5]
ridge2.coef_
array([-1.325e-01,  3.599e-02,  4.385e-02,  3.096e+00, -1.406e+01,
        4.061e+00, -1.200e-02, -1.365e+00,  2.395e-01, -8.817e-03,
       -8.955e-01,  1.183e-02, -5.498e-01])

L1 Regression

Our penalty term looks slightly different (with big consequences for sparsity):

\[ \sum_{i=1}^n{\left( y_i - \beta_0 - \sum_{j=1}^p{\beta_j x_{ij}} \right)^2} + \lambda \sum_{j=1}^p{| \beta_j |} = RSS + \lambda \sum_{j=1}^p{| \beta_j |} \]

(Comment: LASSO = “least absolute shrinkage and selection operator”)

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1, normalize=True)
lasso.fit(X_train, y_train)
lasso_pred = lasso.predict(X_test)
#Returns the coefficient of determination R^2 of the prediction.
lasso.score(X_test, y_test)
lasso.coef_
array([-0.   ,  0.   , -0.   ,  0.   , -0.   ,  3.189, -0.   , -0.   ,
       -0.   , -0.   , -0.307,  0.   , -0.487])

Feature Selection Property of the LASSO

names = boston.drop('medv', axis=1).columns
lasso_coef = lasso.fit(X, y).coef_
_ = plt.plot(range(len(names)), lasso_coef)
_ = plt.xticks(range(len(names)), names, rotation=60)
_ = plt.ylabel('Coefficients')
plt.show()

Tuning the shrinkage parameter with CV

(In scikit-learn the shrinkage parameter \(\lambda\) is referred to as \(\alpha\).)

From sklearn.linear_model:

  • LassoCV
  • RidgeCV

and from sklearn.model_selection:

  • GridSearchCV
from sklearn.linear_model import RidgeCV

reg = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1],store_cv_values = True)
reg.fit(X, y)
reg.score(X, y)

reg.cv_values_
#Open Questions: 
#1. how to extract all scores, possibly even the individual folds?
#2. What is the optimal alpha ??
#reg.cv_values_
array([[ 37.303,  37.347,  37.766,  40.434],
       [ 11.997,  11.974,  11.757,  10.501],
       [ 17.488,  17.492,  17.532,  17.786],
       ...,
       [ 14.642,  14.676,  15.003,  17.073],
       [ 17.862,  17.902,  18.281,  20.683],
       [113.709, 113.813, 114.793, 120.997]])
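
Regarding the open questions above, a sketch (attribute names as in scikit-learn's RidgeCV; with the default cv=None, cv_values_ holds the per-sample leave-one-out squared errors, one column per candidate alpha):

# The selected penalty (the alpha with the best leave-one-out performance)
print(reg.alpha_)

# Mean leave-one-out squared error for each candidate alpha (lower is better)
print(reg.cv_values_.mean(axis=0))
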
from sklearn.linear_model import LassoCV

reg = LassoCV(cv=5, random_state=0).fit(X, y)
reg.score(X, y)
0.7024437179872696
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]


lasso = Lasso(random_state=0,max_iter=2000)
#logarithmically spaced sequence
alphas = np.logspace(-4, -0.5, 20)

tuned_parameters = [{'alpha': alphas}]
n_folds = 5

reg = GridSearchCV(lasso, tuned_parameters, cv=n_folds, refit=False)
reg.fit(X, y)
GridSearchCV(cv=5, estimator=Lasso(max_iter=2000, random_state=0),
             param_grid=[{'alpha': array([1.000e-04, 1.528e-04, 2.336e-04, 3.570e-04, 5.456e-04, 8.338e-04,
       1.274e-03, 1.947e-03, 2.976e-03, 4.549e-03, 6.952e-03, 1.062e-02,
       1.624e-02, 2.482e-02, 3.793e-02, 5.796e-02, 8.859e-02, 1.354e-01,
       2.069e-01, 3.162e-01])}],
             refit=False)
scores = reg.cv_results_['mean_test_score']
scores_std = reg.cv_results_['std_test_score']
plt.figure().set_size_inches(8, 6)
plt.semilogx(alphas, scores)

# plot error lines showing +/- std. errors of the scores
std_error = scores_std / np.sqrt(n_folds)

plt.semilogx(alphas, scores + std_error, 'b--')
plt.semilogx(alphas, scores - std_error, 'b--')

# alpha=0.2 controls the translucency of the fill color
plt.fill_between(alphas, scores + std_error, scores - std_error, alpha=0.2)

plt.ylabel('CV score +/- std error')
plt.xlabel('alpha')
plt.axhline(np.max(scores), linestyle='--', color='.5')
plt.xlim([alphas[0], alphas[-1]])
(0.0001, 0.31622776601683794)

How much can you trust the selection of alpha?

Task: Find the optimal alpha parameter (maximising the generalization score) on different subsets of the data.

ElasticNet

The mystery of the additional \(\alpha\) parameter

  • The elastic net is designed for correlated variables; it uses a penalty that is part L1, part L2.
  • It is a compromise between the ridge regression penalty \((\alpha = 0)\) and the lasso penalty \((\alpha = 1)\).
  • This penalty is particularly useful in the \(p \gg N\) situation, or in any situation with many correlated predictor variables.

\(RSS + \lambda \sum_{j=1}^p{ \left( \frac{1}{2} (1-\alpha) \beta_j^2 + \alpha | \beta_j | \right)}\)

The right hand side can be written as

\(\sum_{j=1}^p{ \frac{1}{2} \lambda (1-\alpha) \beta_j^2 + \alpha \lambda | \beta_j |} = \sum_{j=1}^p{ \lambda_R \beta_j^2 + \lambda_L | \beta_j |}\)

with the Ridge penalty parameter \(\lambda_R \equiv \frac{1}{2} \lambda (1-\alpha)\) and the lasso penalty parameter \(\lambda_L \equiv \alpha \lambda\).

Conversely, any pair of penalties \((\lambda_R, \lambda_L)\) corresponds to

\(\alpha = \frac{\lambda_L}{\lambda_L + 2 \lambda_R}, \quad \text{and} \quad \lambda = \lambda_L + 2 \lambda_R,\)

so the two parameterizations are equivalent.
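
In scikit-learn the same mixture is exposed through ElasticNet: in our reading of its documented objective, l1_ratio plays the role of the mixing parameter \(\alpha\) above and alpha plays the role of \(\lambda\) (up to how the residual term is scaled). A minimal sketch on the Boston data (parameter values are our choices):

import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

# Same Boston data as in the regularization examples above
boston = pd.read_csv('../data/boston.csv')
Xb = boston.drop('medv', axis=1).values
yb = boston['medv'].values
Xb_train, Xb_test, yb_train, yb_test = train_test_split(Xb, yb, test_size=0.3, random_state=42)

# l1_ratio ~ mixing parameter alpha in the formulas above; alpha ~ overall penalty lambda
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000)
enet.fit(Xb_train, yb_train)

print(enet.score(Xb_test, yb_test))  # R^2 on the test set
print(enet.coef_)                    # some coefficients may be exactly zero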

Further Reading:

  • This notebook shows how to plot the entire “path” of coefficients.

Kaggle

Housing Data

This notebook (despite its annoying “humor”) is a good start. (Get it directly from Kaggle.)

Your first submission

df_train = pd.read_csv('../data/kaggle/HousePrices/train.csv')
df_test = pd.read_csv('../data/kaggle/HousePrices/test.csv')
df_train
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 1456 60 RL 62.0 7917 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 8 2007 WD Normal 175000
1456 1457 20 RL 85.0 13175 Pave NaN Reg Lvl AllPub ... 0 NaN MnPrv NaN 0 2 2010 WD Normal 210000
1457 1458 70 RL 66.0 9042 Pave NaN Reg Lvl AllPub ... 0 NaN GdPrv Shed 2500 5 2010 WD Normal 266500
1458 1459 20 RL 68.0 9717 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2010 WD Normal 142125
1459 1460 20 RL 75.0 9937 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 6 2008 WD Normal 147500

1460 rows × 81 columns

# Submission:
# pred_df = pd.DataFrame(y_pred, index=df_test.index, columns=["SalePrice"])
# pred_df.to_csv('../data/kaggle/HousePrices/submissions/en1.csv', header=True, index_label='Id')
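
A minimal sketch of a first submission building on the template above (the model choice, the numeric-only feature handling, and the use of the Id column are our assumptions, not prescribed by the notebook):

import pandas as pd
from sklearn.linear_model import Ridge

df_train = pd.read_csv('../data/kaggle/HousePrices/train.csv')
df_test = pd.read_csv('../data/kaggle/HousePrices/test.csv')

# Keep it simple for a first submission: numeric features only, median-impute missing values
num_cols = df_train.drop(columns=['Id', 'SalePrice']).select_dtypes(include='number').columns
X_train = df_train[num_cols].fillna(df_train[num_cols].median())
y_train = df_train['SalePrice']
X_test = df_test[num_cols].fillna(df_train[num_cols].median())

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Submission file in the (Id, SalePrice) format expected by Kaggle
pred_df = pd.DataFrame({'SalePrice': y_pred}, index=df_test['Id'])
pred_df.to_csv('../data/kaggle/HousePrices/submissions/en1.csv', header=True, index_label='Id')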