15  Advanced Topics

%precision 3

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import datasets
'%.3f'

K-Nearest Neighbors (KNN)

iris = datasets.load_iris()
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(iris['data'], iris['target'])
KNeighborsClassifier(n_neighbors=6)

Task: Split the iris data into training and test sets. Predict on the test set.

# Split into training and test set
X = iris['data']
y = iris['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

# Create a k-NN classifier with 6 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))
1.0

A simple idea with no modeling assumptions at all! Think about the following:

  • What is “the model”, i.e. what needs to be stored? (coefficients, functions, …)
  • What is the model complexity?
  • Does this only work for classification? What would be the regression analogy? (See the sketch after this list.)
  • What improvements could we make to the simple idea?
  • In the modeling world:
    • linear?
    • local vs. global
    • memory/CPU requirements
    • wide versus tall data?
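
As a possible answer to the regression-analogy question above: scikit-learn provides KNeighborsRegressor, which predicts by averaging the target values of the k nearest neighbors instead of taking a majority vote. A minimal sketch (the dataset and k are our choices, not part of the original notes):

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Regression analogue of k-NN classification: average the neighbors' targets
X_reg, y_reg = load_diabetes(return_X_y=True)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42)

knn_reg = KNeighborsRegressor(n_neighbors=6)
knn_reg.fit(X_reg_train, y_reg_train)

# For regressors, score() returns R^2 rather than accuracy
print(knn_reg.score(X_reg_test, y_reg_test))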

Handwritten Digits

digits = datasets.load_digits()
import matplotlib.pyplot as plt

#how can I improve the plots? (i.e. no margins, box around plot...); one approach is sketched after this cell
plt.figure(1)

for i in np.arange(10)+1:
    plt.subplot(2, 5, i)
    plt.axis('off')
    #plt.gray() 
    #plt.matshow(digits.images[i-1]) 
    plt.imshow(digits.images[i-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show() 
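
One possible answer to the layout question in the comment above (a sketch; the figure size and spacing values are our own choices): create the axes explicitly and remove the padding between them.

# Tighter layout for the digit thumbnails, reusing `digits` loaded above
fig, axes = plt.subplots(2, 5, figsize=(8, 3.5))
for ax, img in zip(axes.ravel(), digits.images[:10]):
    ax.imshow(img, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_axis_off()
plt.subplots_adjust(left=0, right=1, top=1, bottom=0, wspace=0.05, hspace=0.05)
plt.show()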

# Create feature and target arrays
X = digits.data
y = digits.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=7)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))
0.9833333333333333
# Big Confusion Matrix
preds = knn.predict(X_test)

pd.crosstab(preds,y_test)
col_0 0 1 2 3 4 5 6 7 8 9
row_0
0 36 0 0 0 0 0 0 0 0 0
1 0 36 0 0 0 0 0 0 2 0
2 0 0 35 0 0 0 0 0 0 0
3 0 0 0 37 0 0 0 0 0 0
4 0 0 0 0 36 0 0 0 0 1
5 0 0 0 0 0 37 0 0 0 0
6 0 0 0 0 0 0 35 0 0 0
7 0 0 0 0 0 0 0 36 1 0
8 0 0 0 0 0 0 1 0 32 1
9 0 0 0 0 0 0 0 0 0 34

Task

Construct a model complexity curve for the digits dataset! In this exercise, you will compute and plot the training and testing accuracy scores for a variety of different neighbor values. By observing how the accuracy scores differ for the training and testing sets with different values of k, you will develop your intuition for overfitting and underfitting.

# # Setup arrays to store train and test accuracies
# neighbors = np.arange(1, 20)
# train_accuracy = np.empty(len(neighbors))
# test_accuracy = np.empty(len(neighbors))
# 
# # Loop over different values of k
# for i, k in enumerate(neighbors):
#     # Setup a k-NN Classifier with k neighbors: knn
#     knn = KNeighborsClassifier(___)
# 
#     # Fit the classifier to the training data
#     ___
#     
#     #Compute accuracy on the training set
#     train_accuracy[i] = ___
# 
#     #Compute accuracy on the testing set
#     test_accuracy[i] = ___
# 
# # Generate plot
# plt.title('k-NN: Varying Number of Neighbors')
# plt.plot(___, label = 'Testing Accuracy')
# plt.plot(___, label = 'Training Accuracy')
# plt.legend()
# plt.xlabel('Number of Neighbors')
# plt.ylabel('Accuracy')
# plt.show()
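
One possible completion of the skeleton above (a sketch; the range of k and the plotting details are our choices):

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Recreate the digits train/test split used above
digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42, stratify=digits.target)

# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 20)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# Loop over different values of k
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label='Testing Accuracy')
plt.plot(neighbors, train_accuracy, label='Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()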

Multinomial Logistic Regression

from sklearn.linear_model import LogisticRegression
iris = datasets.load_iris()

log_reg = LogisticRegression(multi_class='multinomial',solver='sag', max_iter=100, random_state=42)
log_reg.fit(iris["data"][:,3:],iris["target"])
    
preds = log_reg.predict_proba(iris["data"][:,3:])

preds[1:5,:]
array([[9.259e-01, 7.396e-02, 1.788e-04],
       [9.259e-01, 7.396e-02, 1.788e-04],
       [9.259e-01, 7.396e-02, 1.788e-04],
       [9.259e-01, 7.396e-02, 1.788e-04]])

Task: Compute a confusion matrix
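
A possible way to do this (a sketch using pd.crosstab, as in the digits example above; we predict on the same data the model was fit on):

# Hard class predictions from the fitted multinomial model
class_preds = log_reg.predict(iris["data"][:, 3:])

# Confusion matrix: predicted classes (rows) vs. true classes (columns)
pd.crosstab(class_preds, iris["target"])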

Further Reading and Explorations:

  • Read about “one versus all” versus the multinomial approach. Check out this notebook.
  • Read about marginal probabilities.

ROC Curves

It is simplest to go back to the two-label setting from lesson 7:

iris = datasets.load_iris()
X = iris["data"][:,3:]
y = (iris["target"]==2).astype(int)
# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

log_reg = LogisticRegression()
log_reg.fit(X_train,y_train)

y_pred = log_reg.predict_proba(X_test) 
y_pred_prob = y_pred[:,1]

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred_prob > 0.25 ))

print(confusion_matrix(y_test, y_pred_prob > 0.5 ))

print(confusion_matrix(y_test, y_pred_prob > 0.75 ))
[[24  8]
 [ 0 13]]
[[32  0]
 [ 0 13]]
[[32  0]
 [ 4  9]]

The need for more sophisticated metrics than accuracy at a single threshold

This is particularly relevant for imbalanced classes; for example, emails:

  • Spam classification
    • 99% of emails are real; 1% of emails are spam
  • Could build a classifier that predicts ALL emails as real
    • 99% accurate!
    • But horrible at actually classifying spam
    • Fails at its original purpose

Metrics from the Confusion Matrix

  • Precision: \(\frac{TP}{TP+FP}\)
  • Recall: \(\frac{TP}{TP+FN}\)
  • F1 score: \(2 \cdot \frac{precision \cdot recall}{precision + recall}\)

The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0.

  • High precision: not many real emails predicted as spam
  • High recall: predicted most spam emails correctly
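
For example, taking class 1 as the positive class in the threshold-0.25 confusion matrix above (TN = 24, FP = 8, FN = 0, TP = 13):

\[ \text{precision} = \frac{13}{13+8} \approx 0.62, \qquad \text{recall} = \frac{13}{13+0} = 1.00, \qquad F_1 = 2 \cdot \frac{0.62 \cdot 1.00}{0.62 + 1.00} \approx 0.76 \]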

#Example:
print(classification_report(y_test, y_pred_prob > 0.5 ))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        32
           1       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45

But we still need to fix a threshold for any of the metrics above to work!

Wouldn’t it be best to communicate the prediction quality over a wide range (all!) of thresholds and let the user choose the most suitable one for their application? That is the idea of the Receiver Operating Characteristic (ROC) curve!

#The pima-indians- diabetes data:
#  https://www.kaggle.com/uciml/pima-indians-diabetes-database#diabetes.csv

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
pima = pd.read_csv("../data/diabetes.csv", header=None, names=col_names,skiprows=[0])
pima.head()
pregnant glucose bp skin insulin bmi pedigree age label
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
pima.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pregnant  768 non-null    int64  
 1   glucose   768 non-null    int64  
 2   bp        768 non-null    int64  
 3   skin      768 non-null    int64  
 4   insulin   768 non-null    int64  
 5   bmi       768 non-null    float64
 6   pedigree  768 non-null    float64
 7   age       768 non-null    int64  
 8   label     768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
#split dataset in features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train,y_train)

y_pred = log_reg.predict_proba(X_test) 
y_pred_prob = y_pred[:,1]
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show();

Area under the ROC curve (AUC)

from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, y_pred_prob)
0.7969370860927152

AUC using cross-validation

from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(log_reg, X, y, cv=10, scoring='roc_auc')

print(cv_scores)
[0.789 0.83  0.838 0.773 0.812 0.848 0.819 0.921 0.857 0.833]

Regularized Regression

We will illustrate the concepts on the Boston housing data set.

boston = pd.read_csv('../data/boston.csv')
#print(boston.head())
X = boston.drop('medv', axis=1).values
y = boston['medv'].values

Variable Selection

ISLR slides on model selection

L2 Regression

Recall: Linear regression minimizes a loss function

  • It chooses a coefficient for each feature variable
  • Large coefficients can lead to overfitting
  • Penalizing large coefficients: regularization

Detour: \(L_p\) norms

http://mathworld.wolfram.com/VectorNorm.html

Our new penalty term in finding the coefficients \(\beta_j\) is the minimization

\[ \sum_{i=1}^n{\left( y_i - \beta_0 - \sum_{j=1}^p{\beta_j x_{ij}} \right)^2} + \lambda \sum_{j=1}^p{\beta_j^2} = RSS + \lambda \sum_{j=1}^p{\beta_j^2} \]

Instead of obtaining one set of coefficients we have a dependency of \(\beta_j\) on \(\lambda\):
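
Before fixing a single value of the penalty, here is a minimal sketch of that dependency on the Boston data (same boston.csv file as above; the grid of \(\alpha\) values is our choice):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

# Same data as above
boston = pd.read_csv('../data/boston.csv')
Xb = boston.drop('medv', axis=1).values
yb = boston['medv'].values

# Fit a ridge regression for each penalty value and store the coefficients
alphas = np.logspace(-3, 4, 50)
coefs = []
for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(Xb, yb)
    coefs.append(ridge.coef_)

# Coefficient paths: each curve is one beta_j as a function of the penalty
plt.semilogx(alphas, coefs)
plt.xlabel('alpha (lambda)')
plt.ylabel('coefficients')
plt.title('Ridge coefficient paths')
plt.show()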

from sklearn.linear_model import Ridge
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size = 0.3, random_state=42)

# Note: `normalize=True` was deprecated and removed in later scikit-learn releases;
# with newer versions, scale the features explicitly (e.g. StandardScaler in a Pipeline).
ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(X_train, y_train)
# predict() does take care of the normalization: after fitting, the coefficients are
# rescaled back to the original feature scale, so raw X_test can be passed in.
ridge_pred = ridge.predict(X_test)
# Returns the coefficient of determination R^2 of the prediction.
ridge.score(X_test, y_test)
ridge_pred[0:5]
array([27.97 , 35.383, 16.826, 25.145, 18.749])
ridge2 = Ridge(alpha=0.1, normalize=False)
ridge2.fit(X_train, y_train)
ridge2_pred = ridge2.predict(X_test)
ridge2_pred[0:5]
ridge2.coef_
array([-1.325e-01,  3.599e-02,  4.385e-02,  3.096e+00, -1.406e+01,
        4.061e+00, -1.200e-02, -1.365e+00,  2.395e-01, -8.817e-03,
       -8.955e-01,  1.183e-02, -5.498e-01])

L1 Regression

Our penalty term looks slightly different (with big consequences for sparsity):

\[ \sum_{i=1}^n{\left( y_i - \beta_0 - \sum_{j=1}^p{\beta_j x_{ij}} \right)^2} + \lambda \sum_{j=1}^p{| \beta_j |} = RSS + \lambda \sum_{j=1}^p{| \beta_j |} \]

(Comment: LASSO = “least absolute shrinkage and selection operator”)

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1, normalize=True)
lasso.fit(X_train, y_train)
lasso_pred = lasso.predict(X_test)
#Returns the coefficient of determination R^2 of the prediction.
lasso.score(X_test, y_test)
lasso.coef_
array([-0.   ,  0.   , -0.   ,  0.   , -0.   ,  3.189, -0.   , -0.   ,
       -0.   , -0.   , -0.307,  0.   , -0.487])

Feature Selection Property of the LASSO

names = boston.drop('medv', axis=1).columns
lasso_coef = lasso.fit(X, y).coef_
_ = plt.plot(range(len(names)), lasso_coef)
_ = plt.xticks(range(len(names)), names, rotation=60)
_ = plt.ylabel('Coefficients')
plt.show()

Tuning the shrinkage parameter with CV

(In scikit-learn the shrinkage parameter \(\lambda\) is referred to as \(\alpha\).)

From sklearn.linear_model:

  • LassoCV
  • RidgeCV

and from sklearn.model_selection:

  • GridSearchCV
from sklearn.linear_model import RidgeCV

reg = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1],store_cv_values = True)
reg.fit(X, y)
reg.score(X, y)

reg.cv_values_
#Open Questions: 
#1. how to extract all scores, possibly even the individual folds?
#2. What is the optimal alpha ??
#reg.cv_values_
array([[ 37.303,  37.347,  37.766,  40.434],
       [ 11.997,  11.974,  11.757,  10.501],
       [ 17.488,  17.492,  17.532,  17.786],
       ...,
       [ 14.642,  14.676,  15.003,  17.073],
       [ 17.862,  17.902,  18.281,  20.683],
       [113.709, 113.813, 114.793, 120.997]])
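
Regarding the open questions above, a sketch (attribute names as in scikit-learn's RidgeCV; with the default cv=None, cv_values_ holds the per-sample leave-one-out squared errors, one column per candidate alpha):

# The selected penalty (the alpha with the best leave-one-out performance)
print(reg.alpha_)

# Mean leave-one-out squared error for each candidate alpha (lower is better)
print(reg.cv_values_.mean(axis=0))
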
from sklearn.linear_model import LassoCV

reg = LassoCV(cv=5, random_state=0).fit(X, y)
reg.score(X, y)
0.7024437179872696
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]


lasso = Lasso(random_state=0,max_iter=2000)
#logarithmically spaced sequence
alphas = np.logspace(-4, -0.5, 20)

tuned_parameters = [{'alpha': alphas}]
n_folds = 5

reg = GridSearchCV(lasso, tuned_parameters, cv=n_folds, refit=False)
reg.fit(X, y)
GridSearchCV(cv=5, estimator=Lasso(max_iter=2000, random_state=0),
             param_grid=[{'alpha': array([1.000e-04, 1.528e-04, 2.336e-04, 3.570e-04, 5.456e-04, 8.338e-04,
       1.274e-03, 1.947e-03, 2.976e-03, 4.549e-03, 6.952e-03, 1.062e-02,
       1.624e-02, 2.482e-02, 3.793e-02, 5.796e-02, 8.859e-02, 1.354e-01,
       2.069e-01, 3.162e-01])}],
             refit=False)
scores = reg.cv_results_['mean_test_score']
scores_std = reg.cv_results_['std_test_score']
plt.figure().set_size_inches(8, 6)
plt.semilogx(alphas, scores)

# plot error lines showing +/- std. errors of the scores
std_error = scores_std / np.sqrt(n_folds)

plt.semilogx(alphas, scores + std_error, 'b--')
plt.semilogx(alphas, scores - std_error, 'b--')

# alpha=0.2 controls the translucency of the fill color
plt.fill_between(alphas, scores + std_error, scores - std_error, alpha=0.2)

plt.ylabel('CV score +/- std error')
plt.xlabel('alpha')
plt.axhline(np.max(scores), linestyle='--', color='.5')
plt.xlim([alphas[0], alphas[-1]])
(0.0001, 0.31622776601683794)

How much can you trust the selection of alpha?

Task: Find the optimal alpha parameter (maximising the generalization score) on different subsets of the data.

ElasticNet

The mystery of the additional \(\alpha\) parameter

  • The elastic net is designed for correlated variables; it uses a penalty that is part L1, part L2.
  • It is a compromise between the ridge regression penalty \((\alpha = 0)\) and the lasso penalty \((\alpha = 1)\).
  • This penalty is particularly useful in the \(p \gg N\) situation, or in any situation with many correlated predictor variables.

\(RSS + \lambda \sum_{j=1}^p{ \left( \frac{1}{2} (1-\alpha) \beta_j^2 + \alpha | \beta_j | \right)}\)

The right hand side can be written as

\(\sum_{j=1}^p{ \frac{1}{2} \lambda (1-\alpha) \beta_j^2 + \alpha \lambda | \beta_j |} = \sum_{j=1}^p{ \lambda_R \beta_j^2 + \lambda_L | \beta_j |}\)

with the Ridge penalty parameter \(\lambda_R \equiv \frac{1}{2} \lambda (1-\alpha)\) and the lasso penalty parameter \(\lambda_L \equiv \alpha \lambda\).

Conversely, any pair of penalties \((\lambda_R, \lambda_L)\) corresponds to

\(\alpha = \frac{\lambda_L}{\lambda_L + 2 \lambda_R}, \quad \text{and} \quad \lambda = \lambda_L + 2 \lambda_R,\)

so the two parameterizations are equivalent.
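
In scikit-learn the same mixture is exposed through ElasticNet: in our reading of its documented objective, l1_ratio plays the role of the mixing parameter \(\alpha\) above and alpha plays the role of \(\lambda\) (up to how the residual term is scaled). A minimal sketch on the Boston data (parameter values are our choices):

import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

# Same Boston data as in the regularization examples above
boston = pd.read_csv('../data/boston.csv')
Xb = boston.drop('medv', axis=1).values
yb = boston['medv'].values
Xb_train, Xb_test, yb_train, yb_test = train_test_split(Xb, yb, test_size=0.3, random_state=42)

# l1_ratio ~ mixing parameter alpha in the formulas above; alpha ~ overall penalty lambda
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000)
enet.fit(Xb_train, yb_train)

print(enet.score(Xb_test, yb_test))  # R^2 on the test set
print(enet.coef_)                    # some coefficients may be exactly zero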

Further Reading:

  • This notebook shows how to plot the entire “path” of coefficients.

Kaggle

Housing Data

This notebook (despite its annoying “humor”) is a good start. (Get it directly from Kaggle.)

Your first submission

df_train = pd.read_csv('../data/kaggle/HousePrices/train.csv')
df_test = pd.read_csv('../data/kaggle/HousePrices/test.csv')
df_train
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 1456 60 RL 62.0 7917 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 8 2007 WD Normal 175000
1456 1457 20 RL 85.0 13175 Pave NaN Reg Lvl AllPub ... 0 NaN MnPrv NaN 0 2 2010 WD Normal 210000
1457 1458 70 RL 66.0 9042 Pave NaN Reg Lvl AllPub ... 0 NaN GdPrv Shed 2500 5 2010 WD Normal 266500
1458 1459 20 RL 68.0 9717 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2010 WD Normal 142125
1459 1460 20 RL 75.0 9937 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 6 2008 WD Normal 147500

1460 rows × 81 columns

# Submission:
# pred_df = pd.DataFrame(y_pred, index=df_test.index, columns=["SalePrice"])
# pred_df.to_csv('../data/kaggle/HousePrices/submissions/en1.csv', header=True, index_label='Id')
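
A minimal sketch of a first submission building on the template above (the model choice, the numeric-only feature handling, and the use of the Id column are our assumptions, not prescribed by the notebook):

import pandas as pd
from sklearn.linear_model import Ridge

df_train = pd.read_csv('../data/kaggle/HousePrices/train.csv')
df_test = pd.read_csv('../data/kaggle/HousePrices/test.csv')

# Keep it simple for a first submission: numeric features only, median-impute missing values
num_cols = df_train.drop(columns=['Id', 'SalePrice']).select_dtypes(include='number').columns
X_train = df_train[num_cols].fillna(df_train[num_cols].median())
y_train = df_train['SalePrice']
X_test = df_test[num_cols].fillna(df_train[num_cols].median())

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Submission file in the (Id, SalePrice) format expected by Kaggle
pred_df = pd.DataFrame({'SalePrice': y_pred}, index=df_test['Id'])
pred_df.to_csv('../data/kaggle/HousePrices/submissions/en1.csv', header=True, index_label='Id')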