Linear Regression
- A linear regression model assumes that the regression function \(E(Y|X)\) is linear in the inputs \(X_1, \ldots, X_p\)
- Linear models were largely developed in the pre-computer age of statistics, but even in today's computer era there are still good reasons to study and use them
Linear Regression Models and Least Squares
Basic Settings
- We have an input vector \(X^T = (X_1, X_2, \ldots, X_p)\) and want to predict a real-valued output \(Y\)
- The linear regression model has the form \(f(X) = \beta_0 + \sum_{j=1}^p X_j \beta_j\)
- The linear model either assumes that the regression function \(E(Y|X)\) is linear, or that the linear model is a reasonable approximation
- Here the \(\beta_j\)'s are unknown parameters or coefficients, and the variables \(X_j\) can come from different sources (see the sketch after this list):
    - quantitative inputs
    - transformations of quantitative inputs, such as log, square-root or square
    - basis expansions, such as \(X_2 = X_1^2, X_3 = X_1^3\), leading to a polynomial representation
    - numeric or "dummy" coding of the levels of qualitative inputs
    - interactions between variables, for example, \(X_3 = X_1 \cdot X_2\)
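A minimal NumPy sketch of how such columns might be assembled before fitting; the variable names (income, age, region) and values are hypothetical, chosen only to illustrate each source of \(X_j\) listed above:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
income = rng.uniform(20, 100, size=N)        # quantitative input
age = rng.uniform(18, 70, size=N)            # quantitative input
region = rng.integers(0, 3, size=N)          # qualitative input with 3 levels

features = np.column_stack([
    income,                                  # quantitative input
    np.log(income),                          # transformation (log)
    age, age**2, age**3,                     # basis expansion -> polynomial in age
    (region == 1).astype(float),             # dummy coding of the levels
    (region == 2).astype(float),             #   (level 0 is the baseline)
    income * age,                            # interaction between variables
])

# Whatever the source of each column, the model stays linear in beta:
# f(x) = beta_0 + sum_j x_j * beta_j
X = np.column_stack([np.ones(N), features])  # prepend the intercept column
print(X.shape)                               # (N, p + 1)
```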
- No matter the source of the \(X_j\), the model is linear in the parameters
- Typically we have a set of training data \((x_1, y_1), \ldots, (x_N, y_N)\) from which to estimate the parameters \(\beta\)
- Each \(x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T\) is a vector of feature measurements for the i-th case
- \(\beta_0, \beta_1, \ldots, \beta_p\): model parameters
Probabilistic Model
- Assume that \(Y_i \sim P(Y|X = x_i)\) and \(E(Y|X = x_i) = f(x_i) = \beta_0 + \sum_{j=1}^p x_{ij}\beta_j\)
- \(Y_i = \beta_0 + \sum_{j=1}^p x_{ij}\beta_j + \epsilon_i\), where the \(\epsilon_i\) are i.i.d. with \(E(\epsilon_i) = 0\) and \(\mathrm{Var}(\epsilon_i) = \sigma^2\), for \(i = 1, 2, \ldots, N\)
- i.e., \(\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)\)
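A small simulation sketch of this probabilistic model: fix a \(\beta\) and \(\sigma\) (the values below are illustrative, not from the text), draw i.i.d. \(\epsilon_i \sim N(0, \sigma^2)\), and generate \(Y_i = \beta_0 + \sum_j x_{ij}\beta_j + \epsilon_i\):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 2
beta = np.array([1.0, 2.0, -0.5])            # (beta_0, beta_1, beta_2), illustrative
sigma = 0.3

x = rng.normal(size=(N, p))                  # feature measurements x_i
f_x = beta[0] + x @ beta[1:]                 # E(Y | X = x_i) = f(x_i)
eps = rng.normal(0.0, sigma, size=N)         # i.i.d. errors, mean 0, variance sigma^2
y = f_x + eps                                # Y_i = f(x_i) + eps_i

print(y[:5])
```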
Least Squares
- The most popular estimation method is least squares, in which we pick the coefficients \(\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T\) to minimize the residual sum of squares \(RSS(\beta) = \sum_{i=1}^N (y_i - f(x_i))^2 = \sum_{i=1}^N \big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\big)^2\)
Probabilistic view
- Least squares is equivalent to maximum likelihood estimation if we assume a Normal distribution for the errors!
- When the input values are given (fixed), the \(y_i\) become normally distributed
- \(\log P(y_i|x_i;\beta) = -\frac{1}{2\sigma^2}(y_i - f(x_i))^2 + \text{const}\)
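A quick numerical check of this equivalence: under Gaussian errors the log-likelihood of \(\beta\) is a constant minus \(RSS(\beta)/(2\sigma^2)\), so ranking candidate \(\beta\)'s by likelihood is the same as ranking them by RSS. The data-generating values below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma = 100, 0.5
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(0.0, sigma, size=N)

def rss(b0, b1):
    return np.sum((y - (b0 + b1 * x)) ** 2)

def log_likelihood(b0, b1):
    # sum_i log N(y_i | b0 + b1 * x_i, sigma^2)
    resid = y - (b0 + b1 * x)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

for b0, b1 in [(0.0, 0.0), (1.0, 2.0), (1.5, 1.5)]:
    print(f"beta=({b0}, {b1})  RSS={rss(b0, b1):8.2f}  logL={log_likelihood(b0, b1):9.2f}")
# The beta with the smallest RSS has the largest log-likelihood.
```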
Visualization
- Figure 3.1 illustrates the geometry of least-squares fitting in the \((p+1)\)-dimensional space occupied by the pairs \((X, Y)\)
Optimization
- How do we minimize RSS?
- Denote by \(X\) the \(N \times (p+1)\) matrix with each row an input vector (with a 1 in the first position), and similarly let \(y\) be the N-vector of outputs in the training set. Then we can write the residual sum of squares as \(RSS(\beta) = (y-X\beta)^T(y-X\beta)\)
- This is a quadratic function in the \(p+1\) parameters. Differentiating with respect to \(\beta\) we obtain \(\frac{\partial RSS}{\partial \beta} = -2X^T(y-X\beta)\) and \(\frac{\partial^2 RSS}{\partial \beta \partial \beta^T} = 2X^TX\)
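A sanity-check sketch of the first derivative: compare the analytic gradient \(-2X^T(y - X\beta)\) with a finite-difference approximation of \(RSS(\beta)\). The \(X\), \(y\), and \(\beta\) here are random stand-ins, not the training data from the text:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # N x (p+1), first column is 1
y = rng.normal(size=N)
beta = rng.normal(size=p + 1)

def rss(b):
    r = y - X @ b
    return r @ r

analytic = -2.0 * X.T @ (y - X @ beta)       # the formula derived above

eps = 1e-6
numeric = np.array([
    (rss(beta + eps * np.eye(p + 1)[j]) - rss(beta - eps * np.eye(p + 1)[j])) / (2 * eps)
    for j in range(p + 1)
])

print(np.max(np.abs(analytic - numeric)))    # ~0 up to floating-point error
```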
- Assuming (for the moment) that \(X\) has full column rank, and hence \(X^TX\) is positive definite, we set the first derivative to zero
- \(X^T(y - X\beta) = 0\)
- to obtain the unique solution
- \(\hat\beta = (X^TX)^{-1}X^Ty\)
- The predicted values at an input vector \(x_{te}\) are given by \(\hat f(x_{te}) = (1 : x_{te})^T\hat\beta\); the fitted values at the training inputs are
- \(\hat y = X\hat\beta = X(X^TX)^{-1}X^Ty\)
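A minimal sketch of the closed-form estimate on simulated data: solve the normal equations \(X^TX\beta = X^Ty\) (here with `np.linalg.solve` rather than forming the inverse explicitly), then compute the fitted values and a prediction at a new input. The simulated data and coefficient values are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 100, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # N x (p+1) design matrix
true_beta = np.array([1.0, 2.0, -0.5])
y = X @ true_beta + rng.normal(0.0, 0.3, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # beta_hat = (X^T X)^{-1} X^T y
y_hat = X @ beta_hat                            # fitted values at the training inputs

# Prediction at a new input x_te: prepend a 1 and take the inner product with beta_hat
x_te = np.array([0.5, -1.0])
f_te = np.concatenate(([1.0], x_te)) @ beta_hat

print(beta_hat, f_te)
```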
Geometrical representation
- Figure 3.2 shows a different geometrical representation of the least squares estimate, this time in \(\mathbb{R}^N\)