Linear Regression

  • A linear regression model assumes that the regression function \(E(Y|X)\) is linear in the inputs \(X_1, \ldots, X_p\)

  • Linear models were largely developed in the pre-computer age of statistics, but even in today’s computer era there are still good reasons to study and use them

Linear Regression Models and Least Squares

Basic Settings

  • We have an input vector \(X^T = (X_1, X_2, \ldots, X_p)\) and want to predict a real-valued output \(Y\)

  • The linear regression model has the form \(f(X) = \beta_0 + \sum_{j=1}^p X_j \beta_j\)

  • The linear model either assumes that the regression function \(E(Y|X)\) is linear, or that the linear model is a reasonable approximation.

  • Here the \(\beta_j\)’s are unknown parameters or coefficients, and the variables \(X_j\) can come from different sources (illustrated in the sketch at the end of this section):

    • quantitative inputs

    • transformations of quantitative inputs, such as log, square-root or square

    • basis expansions, such as \(X_2 = X_1^2, X_3 = X_1^3\), leading to a polynomial representation

    • numeric or “dummy” coding of the levels of qualitative inputs

    • interactions between variables, for example, \(X_3 = X_1 \cdot X_2\)

  • No matter the source of the \(X_j\), the model is linear in the parameters

  • Typically we have a set of training data \((x_1, y_1), \ldots, (x_N, y_N)\) from which to estimate the parameters \(\beta\)

  • Each \(x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T\) is a vector of feature measurements for the i-th case

  • \(\beta_0, \beta_1, \ldots, \beta_p\): model parameters
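
  • A minimal numpy sketch (with hypothetical data and column choices) of how these different sources of \(X_j\) can be assembled into one design matrix; whatever the columns are, the model remains linear in the parameters:

```python
import numpy as np

# Hypothetical raw data: one quantitative input and one 3-level qualitative input.
income = np.array([30.0, 52.0, 47.0, 81.0, 65.0])                 # quantitative
region = np.array(["north", "south", "west", "south", "north"])   # qualitative

log_income = np.log(income)                # transformation of a quantitative input

# "Dummy" coding of the qualitative input ("north" is the baseline level).
dummies = np.column_stack([(region == lvl).astype(float) for lvl in ["south", "west"]])

interaction = income * dummies[:, 0]       # interaction between two variables

# Design matrix: leading column of 1s for the intercept beta_0, then the derived inputs.
X = np.column_stack([np.ones_like(income), income, log_income, dummies, interaction])
print(X.shape)                             # (5, 6): N = 5 cases, p + 1 = 6 columns
```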

Probabilistic Model

  • Assume that \(Y_i \sim P(Y|X=x_i)\) and \(E(Y|X=x_i) = f(x_i) = \beta_0 + \sum_{j=1}^p x_{ij}\beta_j\)

  • \(Y_i = \beta_0 + \sum_{j=1}^p x_{ij}\beta_j + \epsilon_i\), where the \(\epsilon_i\) are i.i.d. with \(E(\epsilon_i) = 0\) and \(Var(\epsilon_i) = \sigma^2\), for \(i = 1, 2, \ldots, N\)

  • i.e., \(\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)\)
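
  • A short simulation sketch of this data-generating model (the parameter values below are hypothetical): draw i.i.d. Normal errors and add them to the linear predictor:

```python
import numpy as np

rng = np.random.default_rng(0)

N, p = 200, 3
beta0 = 1.5                                     # hypothetical intercept
beta = np.array([2.0, -1.0, 0.5])               # hypothetical coefficients beta_1..beta_p
sigma = 0.8                                     # noise standard deviation

X = rng.normal(size=(N, p))                     # inputs x_i, treated as given (fixed)
eps = rng.normal(loc=0.0, scale=sigma, size=N)  # epsilon_i i.i.d. N(0, sigma^2)
y = beta0 + X @ beta + eps                      # Y_i = beta_0 + sum_j x_ij beta_j + epsilon_i
```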

Least Squares

  • The most popular estimation method is least squares, in which we pick the coefficients \(\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T\) to minimize the residual sum of squares (RSS)
\[RSS(\beta) = \sum^N_{i=1}(y_i - f(x_i))^2 = \sum^N_{i=1}(y_i - \beta_0 - \sum^p_{j=1}x_{ij}\beta_j)^2\]
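
  • As a sketch, \(RSS(\beta)\) can be evaluated directly for any candidate coefficient vector (the data below are illustrative):

```python
import numpy as np

def rss(beta_full, X, y):
    """Residual sum of squares for beta_full = (beta_0, beta_1, ..., beta_p)."""
    fitted = beta_full[0] + X @ beta_full[1:]        # f(x_i) = beta_0 + sum_j x_ij beta_j
    residuals = y - fitted
    return float(residuals @ residuals)              # sum of squared residuals

# Tiny illustrative example: N = 3 cases, p = 2 inputs.
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
y = np.array([4.0, 0.5, 6.0])
print(rss(np.array([1.0, 1.5, 0.3]), X, y))          # RSS at one candidate beta
```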

Probabilistic view

  • Least squares is equivalent to maximum likelihood estimation if we assume a Normal distribution for the errors!

  • Since the input values are given (fixed), the \(y_i\) become normally distributed

  • \(\log P(y_i|x_i;\beta) = -\frac{1}{2\sigma^2}(y_i - f(x_i))^2 + \text{const}\)
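
  • Sketch of the equivalence: summing the per-observation log-densities under \(y_i \sim N(f(x_i), \sigma^2)\) gives the log-likelihood
\[\ell(\beta) = \sum^N_{i=1}\log P(y_i|x_i;\beta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum^N_{i=1}(y_i - f(x_i))^2\]
so maximizing \(\ell(\beta)\) over \(\beta\) is the same as minimizing \(RSS(\beta)\)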

Visualization

  • Figure 3.1 illustrates the geometry of least-squares fitting in the \((p+1)\)-dimensional space occupied by the pairs \((X, Y)\)

Optimization

  • How do we minimize RSS?

  • Denote by \(X\) the \(N \times (p+1)\) matrix with each row an input vector (with a 1 in the first position), and similarly let \(y\) be the \(N\)-vector of outputs in the training set. Then we can write the residual sum of squares as \(RSS(\beta) = (y-X\beta)^T(y-X\beta)\)

  • This is a quadratic function in the \(p+1\) parameters. Differentiating with respect to \(\beta\) we obtain \(\frac{\partial RSS}{\partial \beta} = -2X^T(y-X\beta)\) and \(\frac{\partial^2 RSS}{\partial \beta \partial \beta^T} = 2X^TX\)

  • Assuming (for the moment) that \(X\) has full column rank, and hence \(X^TX\) is positive definite, we set the first derivative to zero

  • \(X^T(y - X\beta) = 0\)

  • To obtain the unique solution

  • \(\hat\beta = (X^TX)^{-1}X^Ty\)

  • The predicted values at an input vector \(x_{te}\) are given by \(\hat f(x_{te}) = (1 : x_{te})^T\hat\beta\); the fitted values at the training inputs are

  • \(\hat y = X\hat\beta = X(X^TX)^{-1}X^Ty\)
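
  • A minimal numpy sketch of these formulas on simulated data (in practice one would use np.linalg.lstsq or a QR decomposition rather than forming \((X^TX)^{-1}\) explicitly):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # N x (p+1), 1s in first column
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=N)

# Normal equations X^T (y - X beta) = 0  =>  beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta_hat                          # fitted values at the training inputs

# Prediction at a new input x_te: f_hat(x_te) = (1 : x_te)^T beta_hat
x_te = rng.normal(size=p)
pred = np.concatenate(([1.0], x_te)) @ beta_hat
```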

Geometrical representation

  • Figure 3.2 shows a different geometrical representation of the least squares estimate, this time in \(\mathbb{R}^N\)
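
  • A quick numerical check of this picture on simulated data: the residual vector \(y - \hat y\) is orthogonal to every column of \(X\), so \(\hat y\) is the orthogonal projection of \(y\) onto the column space of \(X\):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])   # 50 x 3 design matrix
y = rng.normal(size=50)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

# X^T (y - y_hat) = 0: the residual is orthogonal to the column space of X.
print(np.allclose(X.T @ (y - y_hat), 0.0))                     # True (up to rounding)
```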