12  Multiple Linear Regression

(Chapter 3 of the ISL book) The simple regression analysis used TV alone, but the Advertising data also contain radio and newspaper budgets. How can we extend our analysis of the advertising data in order to accommodate these two additional predictors? One option is to run three separate simple linear regressions, each of which uses a different advertising medium as a predictor.

Importing Standard Libs

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
import seaborn as sns
import scipy.stats  # needed later for scipy.stats.norm.rvs

from sklearn.preprocessing import scale
import sklearn.linear_model as skl_lm
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline
sns.set_style('white')

Warmup Exercise

Beer Chips

Load Datasets

Datasets are available at http://faculty.marshall.usc.edu/gareth-james/ISL/

%run importData.py

Tables 3.1 & 3.3

est = smf.ols('Sales ~ Radio', advertising).fit()
est.summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 9.3116 0.563 16.542 0.000 8.202 10.422
Radio 0.2025 0.020 9.921 0.000 0.162 0.243
est = smf.ols('Sales ~ Newspaper', advertising).fit()
est.summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 12.3514 0.621 19.876 0.000 11.126 13.577
Newspaper 0.0547 0.017 3.300 0.001 0.022 0.087
est = smf.ols('Sales ~ TV', advertising).fit()
est.summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 7.0326 0.458 15.360 0.000 6.130 7.935
TV 0.0475 0.003 17.668 0.000 0.042 0.053

However, the approach of fitting a separate simple linear regression model for each predictor is not entirely satisfactory. First of all, it is unclear how to make a single prediction of sales given levels of the three advertising media budgets, since each of the budgets is associated with a separate regression equation. Second, each of the three regression equations ignores the other two media in forming estimates for the regression coefficients. We will see shortly that if the media budgets are correlated with each other in the 200 markets that constitute our data set, then this can lead to very misleading estimates of the individual media effects on sales. Instead of fitting a separate simple linear regression model for each predictor, a better approach is to extend the simple linear regression model (3.5) so that it can directly accommodate multiple predictors. We can do this by giving each predictor a separate slope coefficient in a single model. In general, suppose that we have p distinct predictors. Then the multiple linear regression model takes the form

\[ Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon \]

We interpret \(\beta_j\) as the average effect on Y of a one unit increase in \(X_j\), holding all other predictors fixed.

est = smf.ols('Sales ~ TV + Radio + Newspaper', advertising).fit()
#print(est.summary())
est.summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 2.9389 0.312 9.422 0.000 2.324 3.554
TV 0.0458 0.001 32.809 0.000 0.043 0.049
Radio 0.1885 0.009 21.893 0.000 0.172 0.206
Newspaper -0.0010 0.006 -0.177 0.860 -0.013 0.011

Table 3.4 & 3.6

We interpret these results as follows: for a given amount of TV and newspaper advertising, spending an additional $1,000 on radio advertising leads to an increase in sales of approximately 189 units. Comparing these coefficient estimates to those displayed in Tables 3.1 and 3.3, we notice that the multiple regression coefficient estimates for TV and radio are quite similar to the simple linear regression coefficient estimates. However, while the newspaper regression coefficient estimate in Table 3.3 was significantly non-zero, the coefficient estimate for newspaper in the multiple regression model is close to zero, and the corresponding p-value is no longer significant, with a value around 0.86. This illustrates that the simple and multiple regression coefficients can be quite different. This difference stems from the fact that in the simple regression case, the slope term represents the average effect of a $1,000 increase in newspaper advertising, ignoring other predictors such as TV and radio. In contrast, in the multiple regression setting, the coefficient for newspaper represents the average effect of increasing newspaper spending by $1,000 while holding TV and radio fixed.
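To see the contrast at a glance, the simple and multiple regression slopes can be collected side by side. A small sketch (simple_slopes and multi_slopes are illustrative names):

# Slope of each simple regression vs. the corresponding slope in the joint model
media = ['TV', 'Radio', 'Newspaper']
simple_slopes = pd.Series({m: smf.ols(f'Sales ~ {m}', advertising).fit().params[m] for m in media})
multi_slopes = smf.ols('Sales ~ TV + Radio + Newspaper', advertising).fit().params[media]
pd.DataFrame({'simple': simple_slopes, 'multiple': multi_slopes})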

Table 3.5 - Correlation Matrix

Does it make sense for the multiple regression to suggest no relationship between sales and newspaper while the simple linear regression implies the opposite? In fact it does. Consider the correlation matrix for the three predictor variables and response variable, displayed in Table 3.5. Notice that the correlation between radio and newspaper is 0.35. This reveals a tendency to spend more on newspaper advertising in markets where more is spent on radio advertising. Now suppose that the multiple regression is correct and newspaper advertising has no direct impact on sales, but radio advertising does increase sales. Then in markets where we spend more on radio our sales will tend to be higher, and as our correlation matrix shows, we also tend to spend more on newspaper advertising in those same markets. Hence, in a simple linear regression which only examines sales versus newspaper, we will observe that higher values of newspaper tend to be associated with higher values of sales, even though newspaper advertising does not actually affect sales. So newspaper sales are a surrogate for radio advertising; newspaper gets “credit” for the effect of radio on sales. This slightly counterintuitive result is very common in many real life situations. Consider an absurd example to illustrate the point. Running a regression of shark attacks versus ice cream sales for data collected at a given beach community over a period of time would show a positive relationship, similar to that seen between sales and newspaper. Of course no one (yet) has suggested that ice creams should be banned at beaches to reduce shark attacks. In reality, higher temperatures cause more people to visit the beach, which in turn results in more ice cream sales and more shark attacks. A multiple regression of attacks versus ice cream sales and temperature reveals that, as intuition implies, the former predictor is no longer significant after adjusting for temperature.

advertising.corr()
Unnamed: 0 TV Radio Newspaper Sales
Unnamed: 0 1.000000 0.017715 -0.110680 -0.154944 -0.051616
TV 0.017715 1.000000 0.054809 0.056648 0.782224
Radio -0.110680 0.054809 1.000000 0.354104 0.576223
Newspaper -0.154944 0.056648 0.354104 1.000000 0.228299
Sales -0.051616 0.782224 0.576223 0.228299 1.000000
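Note that the Unnamed: 0 row and column above come from the CSV's row index, which was read in as an ordinary column; Table 3.5 uses only the three media budgets and Sales. Restricting the columns reproduces the table (a small sketch):

# Correlation matrix of the predictors and the response only, matching Table 3.5
advertising[['TV', 'Radio', 'Newspaper', 'Sales']].corr()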
Figure 3.5 - Multiple Linear Regression

Sklearn’s way

regr = skl_lm.LinearRegression()

X = advertising[['Radio', 'TV']].values
y = advertising.Sales

regr.fit(X,y)
print(regr.coef_)
print(regr.intercept_)
[0.18799423 0.04575482]
2.9210999124051398
# What are the min/max values of Radio & TV?
# Use these values to set up the grid for plotting.
advertising[['Radio', 'TV']].describe()
Radio TV
count 200.000000 200.000000
mean 23.264000 147.042500
std 14.846809 85.854236
min 0.000000 0.700000
25% 9.975000 74.375000
50% 22.900000 149.750000
75% 36.525000 218.825000
max 49.600000 296.400000
# Create a coordinate grid
Radio = np.arange(0,50)
TV = np.arange(0,300)

B1, B2 = np.meshgrid(Radio, TV, indexing='xy')

# Evaluate the fitted regression plane on the grid (vectorised; element-wise identical to looping over each cell)
Z = regr.intercept_ + regr.coef_[0]*B1 + regr.coef_[1]*B2
# Create plot
fig = plt.figure(figsize=(10,6))
fig.suptitle('Regression: Sales ~ Radio + TV Advertising', fontsize=20)

ax = fig.add_subplot(projection='3d')  # attach a 3D axes to the figure (Axes3D(fig) no longer adds itself in recent Matplotlib)

ax.plot_surface(B1, B2, Z, rstride=10, cstride=5, alpha=0.4)
ax.scatter3D(advertising.Radio, advertising.TV, advertising.Sales, c='r')

ax.set_xlabel('Radio')
ax.set_xlim(0,50)
ax.set_ylabel('TV')
ax.set_ylim(ymin=0)
ax.set_zlabel('Sales');

Other Considerations in the Regression Model

Collinearity

“Exact Collinearity”

Tasks

  1. You try to impress your investors with the number of advertising channels you have investigated. So you make a copy of the TV column, call it facebook, and refit the multiple linear regression.

  2. Multi-collinearity: Having learned from that foolish mistake, you now fake data more ambitiously: create a new column instagram by adding TV and Radio and subtracting 3 times Newspaper, then refit the multiple linear regression.

advertising["facebook"] = advertising.TV
#advertising.info()

est = smf.ols('Sales ~ TV + Radio + Newspaper + facebook', advertising).fit()
es = est.summary()
es.tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 2.9389 0.312 9.422 0.000 2.324 3.554
TV 0.0229 0.001 32.809 0.000 0.022 0.024
Radio 0.1885 0.009 21.893 0.000 0.172 0.206
Newspaper -0.0010 0.006 -0.177 0.860 -0.013 0.011
facebook 0.0229 0.001 32.809 0.000 0.022 0.024

Multicollinearity is treated VERY differently in statsmodels than in R, as discussed in posts on Medium as well as on Stack Overflow. It is worthwhile to inspect the condition number:

es.tables[2]
Omnibus: 60.414 Durbin-Watson: 2.084
Prob(Omnibus): 0.000 Jarque-Bera (JB): 151.241
Skew: -1.327 Prob(JB): 1.44e-33
Kurtosis: 6.332 Cond. No. 2.61e+16
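Besides the condition number, the variance inflation factor (VIF) is a common collinearity diagnostic. A minimal sketch with statsmodels (X_vif is an illustrative name); for the exactly collinear facebook column the VIF blows up, so expect a divide-by-zero warning or an infinite value:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on all the other predictors
X_vif = sm.add_constant(advertising[['TV', 'Radio', 'Newspaper', 'facebook']])
for i, name in enumerate(X_vif.columns[1:], start=1):
    print(name, variance_inflation_factor(X_vif.values, i))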
advertising["instagram"] = advertising.TV +advertising.Radio - 3*advertising.Newspaper
#advertising.info()

est = smf.ols('Sales ~ TV + Radio + Newspaper + instagram', advertising).fit()
es = est.summary()
es.tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 2.9389 0.312 9.422 0.000 2.324 3.554
TV 0.0260 0.002 11.660 0.000 0.022 0.030
Radio 0.1687 0.008 22.469 0.000 0.154 0.184
Newspaper 0.0583 0.002 27.223 0.000 0.054 0.063
instagram 0.0198 0.002 10.679 0.000 0.016 0.023
es.tables[2]
Omnibus: 60.414 Durbin-Watson: 2.084
Prob(Omnibus): 0.000 Jarque-Bera (JB): 151.241
Skew: -1.327 Prob(JB): 1.44e-33
Kurtosis: 6.332 Cond. No. 2.32e+16

Figure 3.6

What about non-exact collinearities?

credit = pd.read_csv('../data/Credit.csv', usecols=list(range(1,12)))
credit['Student2'] = credit.Student.map({'No':0, 'Yes':1})
credit.head()
Income Limit Rating Cards Age Education Gender Student Married Ethnicity Balance Student2
0 14.891 3606 283 2 34 11 Male No Yes Caucasian 333 0
1 106.025 6645 483 3 82 15 Female Yes Yes Asian 903 1
2 104.593 7075 514 4 71 11 Male No No Asian 580 0
3 148.924 9504 681 3 36 11 Female No No Asian 964 0
4 55.882 4897 357 2 68 16 Male No Yes Caucasian 331 0
sns.pairplot(credit[['Balance','Age','Cards','Education','Income','Limit','Rating']]);

Figure 3.14

fig, (ax1,ax2) = plt.subplots(1,2, figsize=(12,5))

# Left plot
ax1.scatter(credit.Limit, credit.Age, facecolor='None', edgecolor='r')
ax1.set_ylabel('Age')

# Right plot
ax2.scatter(credit.Limit, credit.Rating, facecolor='None', edgecolor='r')
ax2.set_ylabel('Rating')

for ax in fig.axes:
    ax.set_xlabel('Limit')
    ax.set_xticks([2000,4000,6000,8000,12000])

“High Collinearity”

Collinearity refers to the situation in which two or more predictor variables are closely related to one another. The concept of collinearity is illustrated in Figure 3.14 using the Credit data set. In the left-hand panel of Figure 3.14, the two predictors limit and age appear to have no obvious relationship. In contrast, in the right-hand panel of Figure 3.14, the predictors limit and rating are very highly correlated with each other, and we say that they are collinear. The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response. In other words, since limit and rating tend to increase or decrease together, it can be difficult to determine how each one separately is associated with the response, balance.
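The visual impression from Figure 3.14 can be checked numerically; a quick sketch of the pairwise correlations:

# Age vs Limit is only weakly correlated, while Rating vs Limit is almost perfectly correlated
credit[['Age', 'Rating', 'Limit']].corr()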

Figure 3.15

y = credit.Balance

# Regression for left plot
X = credit[['Age', 'Limit']].values
regr1 = skl_lm.LinearRegression()
regr1.fit(scale(X.astype('float'), with_std=False), y)
print('Age/Limit\n',regr1.intercept_)
print(regr1.coef_)

# Regression for right plot
X2 = credit[['Rating', 'Limit']].values
regr2 = skl_lm.LinearRegression()
regr2.fit(scale(X2.astype('float'), with_std=False), y)
print('\nRating/Limit\n',regr2.intercept_)
print(regr2.coef_)
Age/Limit
 520.0150000000001
[-2.29148553  0.17336497]

Rating/Limit
 520.015
[2.20167217 0.02451438]

Figure 3.15

Figure 3.15 illustrates some of the difficulties that can result from collinearity. The left-hand panel of Figure 3.15 is a contour plot of the RSS (3.22) associated with different possible coefficient estimates for the regression of balance on limit and age. Each ellipse represents a set of coefficients that correspond to the same RSS, with ellipses nearest to the center taking on the lowest values of RSS. The dots and associated dashed lines represent the coefficient estimates that result in the smallest possible RSS—in other words, these are the least squares estimates. The axes for limit and age have been scaled so that the plot includes possible coefficient estimates that are up to four standard errors on either side of the least squares estimates. Thus the plot includes all plausible values for the coefficients. For example, we see that the true limit coefficient is almost certainly somewhere between 0.15 and 0.20. In contrast, the right-hand panel of Figure 3.15 displays contour plots of the RSS associated with possible coefficient estimates for the regression of balance onto limit and rating, which we know to be highly collinear. Now the contours run along a narrow valley; there is a broad range of values for the coefficient estimates that result in equal values for RSS. Hence a small change in the data could cause the pair of coefficient values that yield the smallest RSS—that is, the least squares estimates—to move anywhere along this valley. This results in a great deal of uncertainty in the coefficient estimates. Notice that the scale for the limit coefficient now runs from roughly −0.2 to 0.2; this is an eight-fold increase over the plausible range of the limit coefficient in the regression with age.

Table 3.11 compares the coefficient estimates obtained from two separate multiple regression models. The first is a regression of balance on age and limit, and the second is a regression of balance on rating and limit. In the first regression, both age and limit are highly significant with very small p-values. In the second, the collinearity between limit and rating has caused the standard error for the limit coefficient estimate to increase by a factor of 12 and the p-value to increase to 0.701. In other words, the importance of the limit variable has been masked due to the presence of collinearity. To avoid such a situation, it is desirable to identify and address potential collinearity problems while fitting the model.
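Table 3.11 itself is not reproduced above, but the two fits behind it are easy to run. A short sketch (est_age and est_rating are illustrative names); compare the std err and p-value of Limit across the two summaries:

# Model 1: Balance on Age and Limit; Model 2: Balance on the collinear pair Rating and Limit
est_age = smf.ols('Balance ~ Age + Limit', credit).fit()
est_rating = smf.ols('Balance ~ Rating + Limit', credit).fit()
print(est_age.summary().tables[1])
print(est_rating.summary().tables[1])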

Qualitative/Categorical Variables

  • Dummy Coding
  • Interpretation of coefficients

Table 3.7

est = smf.ols('Balance ~ Gender ', credit).fit()
est.summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 509.8031 33.128 15.389 0.000 444.675 574.931
Gender[T.Female] 19.7331 46.051 0.429 0.669 -70.801 110.267
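The Gender[T.Female] label reflects the 0/1 dummy coding that patsy (the formula engine behind statsmodels) applies to qualitative variables, with Male absorbed into the intercept as the baseline level. A minimal sketch inspecting the design matrix it builds:

import patsy

# The design matrix contains an intercept column plus a single Female dummy (1 = Female, 0 = Male)
patsy.dmatrix('Gender', credit, return_type='dataframe').head()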

Table 3.8

np.unique(credit["Ethnicity"])
credit.groupby("Ethnicity")["Balance"].mean()
Ethnicity
African American    531.000000
Asian               512.313725
Caucasian           518.497487
Name: Balance, dtype: float64
est = smf.ols('Balance ~ Ethnicity', credit).fit()
est.summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 531.0000 46.319 11.464 0.000 439.939 622.061
Ethnicity[T.Asian] -18.6863 65.021 -0.287 0.774 -146.515 109.142
Ethnicity[T.Caucasian] -12.5025 56.681 -0.221 0.826 -123.935 98.930
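Here the intercept is simply the average balance of the baseline level (African American, 531.00), and each coefficient is the difference between a group mean and that baseline: 531.00 - 18.69 ≈ 512.31 for Asian and 531.00 - 12.50 ≈ 518.50 for Caucasian, matching the group means computed above. Neither difference is statistically significant.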

Interactions

Table 3.9

est = smf.ols('Sales ~ TV + Radio + TV*Radio', advertising).fit()
est.summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 6.7502 0.248 27.233 0.000 6.261 7.239
TV 0.0191 0.002 12.699 0.000 0.016 0.022
Radio 0.0289 0.009 3.241 0.001 0.011 0.046
TV:Radio 0.0011 5.24e-05 20.727 0.000 0.001 0.001
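With the interaction term, the effect of TV advertising depends on the radio budget:

\[ \frac{\partial\, \text{Sales}}{\partial\, \text{TV}} = 0.0191 + 0.0011 \times \text{Radio} \]

so an additional $1,000 of TV advertising is associated with a larger increase in sales in markets that already spend heavily on radio.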

Interaction between qualitative and quantitative variables

credit["Income2"] = credit["Income"] + scipy.stats.norm.rvs(400)
est1 = smf.ols('Balance ~ Income + Income2 + C(Student)', credit).fit()
regr1 = est1.params
est2 = smf.ols('Balance ~ Income + Income*C(Student)', credit).fit()
regr2 = est2.params

print('Regression 1 - without interaction term')
est1.summary().tables[1]
Regression 1 - without interaction term
coef std err t P>|t| [0.025 0.975]
Intercept -0.0123 0.002 -7.143 0.000 -0.016 -0.009
C(Student)[T.Yes] 382.6705 65.311 5.859 0.000 254.272 511.069
Income 5.4557 0.621 8.779 0.000 4.234 6.677
Income2 0.5287 0.081 6.506 0.000 0.369 0.688
np.mean(credit["Income"])
45.218885000000036
# A richer three-way interaction model; fit it under a new name so that est2 (used for the plot below) is not overwritten
est3 = smf.ols('Balance ~ Income + Income*C(Ethnicity)*C(Student)', credit).fit()
est3.summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 145.1346 64.922 2.236 0.026 17.492 272.777
C(Ethnicity)[T.Asian] -15.5992 94.976 -0.164 0.870 -202.332 171.133
C(Ethnicity)[T.Caucasian] 122.8200 80.930 1.518 0.130 -36.297 281.937
C(Student)[T.Yes] 427.5858 246.321 1.736 0.083 -56.706 911.877
C(Ethnicity)[T.Asian]:C(Student)[T.Yes] 281.3665 297.683 0.945 0.345 -303.906 866.639
C(Ethnicity)[T.Caucasian]:C(Student)[T.Yes] -109.7842 309.373 -0.355 0.723 -718.041 498.473
Income 7.1950 1.069 6.728 0.000 5.093 9.298
Income:C(Ethnicity)[T.Asian] 0.0963 1.667 0.058 0.954 -3.181 3.374
Income:C(Ethnicity)[T.Caucasian] -2.0995 1.371 -1.531 0.127 -4.796 0.597
Income:C(Student)[T.Yes] -0.0693 3.716 -0.019 0.985 -7.375 7.236
Income:C(Ethnicity)[T.Asian]:C(Student)[T.Yes] -4.6311 4.475 -1.035 0.301 -13.429 4.167
Income:C(Ethnicity)[T.Caucasian]:C(Student)[T.Yes] -0.9514 5.431 -0.175 0.861 -11.630 9.727
print('\nRegression 2 - with interaction term')
est2.summary().tables[1]

Regression 2 - with interaction term
# Income (x-axis)
income = np.linspace(0,150)

# Balance without interaction term (y-axis)
student1 = np.linspace(regr1['Intercept']+regr1['C(Student)[T.Yes]'],
                       regr1['Intercept']+regr1['C(Student)[T.Yes]']+150*regr1['Income'])
non_student1 =  np.linspace(regr1['Intercept'], regr1['Intercept']+150*regr1['Income'])

# Balance with interaction term (y-axis)
student2 = np.linspace(regr2['Intercept']+regr2['C(Student)[T.Yes]'],
                       regr2['Intercept']+regr2['C(Student)[T.Yes]']+
                       150*(regr2['Income']+regr2['Income:C(Student)[T.Yes]']))
non_student2 =  np.linspace(regr2['Intercept'], regr2['Intercept']+150*regr2['Income'])

# Create plot
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(12,5))
ax1.plot(income, student1, 'r', income, non_student1, 'k')

ax2.plot(income, student2, 'r', income, non_student2, 'k')

for ax in fig.axes:
    ax.legend(['student', 'non-student'], loc=2)
    ax.set_xlabel('Income')
    ax.set_ylabel('Balance')
    ax.set_ylim(ymax=1550)
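The lines above are assembled by hand from the parameter vector; the same fitted values can also be obtained directly from the results object. A small sketch using statsmodels' predict with the interaction model est2 (the grid DataFrame is constructed here purely for illustration):

# Predict Balance on an income grid for students and non-students with the interaction model
grid = pd.DataFrame({'Income': np.tile(np.linspace(0, 150, 50), 2),
                     'Student': np.repeat(['Yes', 'No'], 50)})
grid['Balance_hat'] = est2.predict(grid)

for label, df in grid.groupby('Student'):
    plt.plot(df.Income, df.Balance_hat, label=f'Student = {label}')
plt.xlabel('Income'); plt.ylabel('Balance'); plt.legend();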

Non-linear relationships

Figure 3.8

auto = pd.read_csv('./data/Auto.csv', na_values='?').dropna()
sns.regplot(x='weight', y='mpg', data=auto,fit_reg=True, order=2);

# With Seaborn's regplot() you can easily plot higher order polynomials.
plt.scatter(x=auto.horsepower, y=auto.mpg, facecolors='None', edgecolors='k', alpha=.5) 
sns.regplot(x=auto.horsepower, y=auto.mpg, ci=None, label='Linear', scatter=False, color='orange')
sns.regplot(x=auto.horsepower, y=auto.mpg, ci=None, label='Degree 2', order=2, scatter=False, color='lightblue')
sns.regplot(x=auto.horsepower, y=auto.mpg, ci=None, label='Degree 5', order=5, scatter=False, color='g')
plt.legend()
plt.ylim(5,55)
plt.xlim(40,240);

Table 3.10

auto['horsepower2'] = auto.horsepower**2
auto.head(3)
mpg cylinders displacement horsepower weight acceleration year origin name Manufacturer horsepower2
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu chevrolet 16900
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320 buick 27225
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite plymouth 22500
est = smf.ols('mpg ~ horsepower + horsepower2', auto).fit()
est.summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 56.9001 1.800 31.604 0.000 53.360 60.440
horsepower -0.4662 0.031 -14.978 0.000 -0.527 -0.405
horsepower2 0.0012 0.000 10.080 0.000 0.001 0.001
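Creating the squared column by hand is not required; the same model can be written directly in the formula using patsy's I() protection. A sketch (est_poly is an illustrative name), which should reproduce Table 3.10:

# I(...) makes patsy evaluate horsepower**2 as arithmetic rather than as formula syntax
est_poly = smf.ols('mpg ~ horsepower + I(horsepower**2)', auto).fit()
est_poly.summary().tables[1]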