Linear Regression

  • A linear regression model assumes that the regression function $E(Y \vert X)$ is linear in the inputs $X_1, \dots ,X_p$

  • Linear models were largely developed in the pre-computer age of statistics, but even in today’s computer era there are still good reasons to study and use them

Linear Regression Models and Least Squares

Basic Settings

  • We have an input vector $X^T = (X_1, X_2, \dots , X_p)$ and want to predict a real-valued output $Y$

  • The linear regression model has the form \(f(X) = \beta_0 + \sum_{j=1}^p X_j \beta_j\)

  • The linear model either assumes that the regression function $E(Y \vert X)$ is linear, or that the linear model is a reasonable approximation.

  • Here the $\beta_j$’s are unknown parameters or coefficients, and the variables $X_j$ can come from different sources (see the sketch after this list):

    • quantitative inputs

    • transformations of quantitative inputs, such as log, square-root or square

    • basis expansions, such as $X_2 = X_1^2, X_3 = X_1^3$, leading to a polynomial representation

    • numeric or “dummy” coding of the levels of qualitative inputs

    • interactions between variables, for example, $X_3 = X_1 \cdot X_2$

  • No matter the source of the $X_j$, the model is linear in the parameters

  • Typically we have a set of training data $(x_1,y_1) \dots (x_N, y_N)$ from which to estimate the parameters $\beta$

  • Each $x_i = (x_{i1}, x_{i2}, \dots ,x_{ip})^T$ is a vector of feature measurements for the $i$-th case

  • $\beta_0, \beta_1, \dots ,\beta_p$ : model parameters
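
  • As a concrete illustration (not from the original notes), the sketch below builds such a design matrix in Python; the variables income, age, and group are made-up examples, and each source of inputs listed above simply becomes another column while the model stays linear in the parameters

```python
# A minimal sketch (assumed example data, not from the notes): quantitative,
# transformed, dummy-coded, and interaction inputs all become columns of one
# design matrix, so the model remains linear in the parameters beta.
import numpy as np

rng = np.random.default_rng(0)
N = 100

income = rng.uniform(20, 200, size=N)      # quantitative input
age = rng.uniform(18, 80, size=N)          # quantitative input
group = rng.integers(0, 3, size=N)         # qualitative input with 3 levels

X = np.column_stack([
    np.ones(N),                            # intercept column for beta_0
    np.log(income),                        # transformation of a quantitative input
    age,                                   # quantitative input as-is
    age ** 2,                              # basis expansion (polynomial term)
    (group == 1).astype(float),            # dummy coding of the levels
    (group == 2).astype(float),            # (level 0 serves as the baseline)
    np.log(income) * age,                  # interaction between variables
])
print(X.shape)                             # (N, p+1) with p = 6 derived inputs
```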

Probabilistic Model

  • Assume that $Y_i \sim P(Y\vert X = x_i)$ and $E(Y\vert X = x_i) = f(x_i) = \beta_0 + \sum^p_{j=1}x_{ij}\beta_j$

  • $Y_i = \beta_0 + \sum^p_{j=1}x_{ij}\beta_j + \epsilon_i$ where the $\epsilon_i$ are i.i.d. with $E(\epsilon_i) = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2$ for $i = 1, 2, \dots, N$

  • Often we further assume $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$
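
  • For concreteness, a minimal simulation sketch (sizes and parameter values chosen arbitrarily for illustration) that generates data from this model:

```python
# Simulate Y_i = beta_0 + sum_j x_ij * beta_j + eps_i with eps_i ~ iid N(0, sigma^2).
# The sizes and coefficient values below are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 3
sigma = 0.5
beta = np.array([1.0, 2.0, -1.0, 0.5])     # (beta_0, beta_1, ..., beta_p)

x = rng.normal(size=(N, p))                # feature measurements x_i
X = np.column_stack([np.ones(N), x])       # prepend the constant 1 for beta_0
eps = rng.normal(0.0, sigma, size=N)       # i.i.d. Gaussian errors
y = X @ beta + eps                         # so E(Y | X = x_i) = f(x_i)
```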

Least Squares

  • The most popular estimation method is least squares, in which we pick the coefficients $\beta = (\beta_0, \beta_1, \dots , \beta_p)^T$ to minimize the residual sum of squares (RSS)
\[RSS(\beta) = \sum^N_{i=1}(y_i - f(x_i))^2 = \sum^N_{i=1}(y_i - \beta_0 - \sum^p_{j=1}x_{ij}\beta_j)^2\]
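
  • A direct translation of this criterion into code (a sketch; the usage comment reuses the simulated X, y, and beta from the snippet above):

```python
# Residual sum of squares RSS(beta) = sum_i (y_i - f(x_i))^2 with f(x_i) = x_i^T beta
# (here each x_i includes the leading 1, so beta_0 is beta[0]).
import numpy as np

def rss(beta, X, y):
    residuals = y - X @ beta
    return float(residuals @ residuals)

# e.g. with the simulated data above, rss(beta, X, y) is roughly N * sigma^2
# at the true beta, and larger for any other coefficient vector.
```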

Probabilistic view

  • Least squares is equivalent to maximum likelihood estimation if we assume a Normal distribution for the errors!

  • Since the input values are given (fixed), each $y_i$ is normally distributed with mean $f(x_i)$ and variance $\sigma^2$

  • $\log P(y_i \vert x_i;\beta) = -\frac{1}{2\sigma^2}(y_i - f(x_i))^2 + \text{const}$
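
  • Spelling out the equivalence (a standard derivation, added here for completeness): under i.i.d. Gaussian errors the log-likelihood of the training data is \[\log L(\beta, \sigma^2) = \sum^N_{i=1} \log P(y_i \vert x_i;\beta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum^N_{i=1}(y_i - f(x_i))^2\] so, for fixed $\sigma^2$, maximizing the likelihood over $\beta$ is exactly minimizing $RSS(\beta)$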

Visualization

  • Figure 3.1 illustrates the geometry of least-squares fitting in the $(p+1)$-dimensional space occupied by the pairs $(X,Y)$

Optimization

  • How do we minimize RSS?

  • Denote by $X$ the $N \times (p+1)$ matrix with each row an input vector (with a 1 in the first position), and similarly let $y$ be the $N$-vector of outputs in the training set. Then we can write the residual sum of squares as \(RSS(\beta) = (y-X\beta)^T(y-X\beta)\)

  • This is a quadratic function in the $p+1$ parameters. Differentiating with respect to $\beta$ we obtain \(\frac{\partial RSS}{\partial \beta} = -2X^T(y-X\beta)\) and \(\frac{\partial^2 RSS}{\partial \beta \partial \beta^T} = 2X^TX\)

  • Assuming (for the moment) that $X$ has full column rank, and hence $X^TX$ is positive definite, we set the first derivative to zero \[X^T(y-X\beta)=0\] to obtain the unique solution \[\hat\beta = (X^TX)^{-1}X^Ty\]

  • The predicted values at an input vector $x_{te}$ are given by $\hat f (x_{te}) = (1 : x_{te})^T \hat\beta$ ; the fitted values at the training inputs are

  • $\hat y = X\hat\beta = X(X^TX)^{-1}X^Ty$
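
  • A minimal numerical sketch of this closed-form fit (the helper names fit_least_squares and predict are made up for illustration; in practice np.linalg.lstsq or a QR decomposition is preferred for numerical stability):

```python
# Solve the normal equations X^T X beta = X^T y, assuming X has full column rank.
import numpy as np

def fit_least_squares(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)   # beta_hat = (X^T X)^{-1} X^T y

def predict(beta_hat, x_new):
    # Predicted value at an input vector x_new: (1 : x_new)^T beta_hat
    return np.concatenate(([1.0], np.asarray(x_new, dtype=float))) @ beta_hat

# usage with the simulated X, y from the earlier sketch:
# beta_hat = fit_least_squares(X, y)   # close to the true beta
# y_hat = X @ beta_hat                 # fitted values at the training inputs
```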

Geometrical representation

  • Figure 3.2 shows a different geometrical representation of the least squares estimate, this time in $R^N$
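
  • A quick numerical check of this picture (illustration only): the fitted vector $\hat y$ is the orthogonal projection of $y$ onto the column space of $X$, so the residual $y - \hat y$ is orthogonal to every column of $X$

```python
# Verify numerically that X^T (y - y_hat) = 0, i.e. the residual is orthogonal
# to the column space of X (random data used purely for illustration).
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = rng.normal(size=50)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
print(np.allclose(X.T @ (y - y_hat), 0.0, atol=1e-10))    # True
```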