Linear Regression
A linear regression model assumes that the regression function $E(Y \vert X)$ is linear in the inputs $X_1, \dots ,X_p$
Linear models were largely developed in the pre-computer age of statistics, but even in today’s computer era there are still good reasons to study and use them
Linear Regression Models and Least Squares
Basic Settings
We have an input vector $X^T = (X_1, X_2, \dots , X_p)$ and want to predict a real-valued output $Y$
The linear regression model has the form \(f(X) = \beta_0 + \sum_{j=1}^p X_j \beta_j\)
The linear model either assumes that the regression function $E(Y \vert X)$ is linear, or that the linear model is a reasonable approximation.
Here the $\beta_j$’s are unknown parameters or coefficients, and the variables $X_j$ can come from different sources:
quantitative inputs
transformations of quantitative inputs, such as log, square-root or square
basis expansions, such as $X_2 = X_1^2, X_3 = X_1^3$, leading to a polynomial representation
numeric or “dummy” coding of the levels of qualitative inputs
interactions between variables, for example, $X_3 = X_1 \cdot X_2$
No matter the source of the $X_j$, the model is linear in the parameters
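To make the list of sources concrete, here is a minimal NumPy sketch that assembles a design matrix from one quantitative input and one qualitative input. The variable names (`x`, `color`), the data values, and the choice of baseline level are all made up for illustration; this is one possible coding, not a prescribed one.

```python
import numpy as np

# Hypothetical raw inputs: one quantitative variable and one
# qualitative variable with three levels.
x = np.array([1.0, 2.0, 4.0, 8.0])
color = np.array(["red", "green", "blue", "green"])

# Transformation and basis expansion of the quantitative input.
log_x = np.log(x)
x_sq = x ** 2

# Dummy coding: one 0/1 column per level except a baseline.
# np.unique sorts the levels, so "blue" is the baseline here.
levels = np.unique(color)                      # ['blue', 'green', 'red']
dummies = (color[:, None] == levels[None, 1:]).astype(float)

# Interaction between x and the "green" indicator.
x_green = x * dummies[:, 0]

# Stack everything next to a column of ones for the intercept beta_0.
X = np.column_stack([np.ones_like(x), x, log_x, x_sq, dummies, x_green])
print(X.shape)
```

Every column is a fixed function of the raw inputs, so the fitted function is still a linear combination of the columns with coefficients $\beta_j$.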
Typically we have a set of training data $(x_1,y_1) \dots (x_N, y_N)$ from which to estimate the parameters $\beta$
Each $x_i = (x_{i1}, x_{i2}, \dots ,x_{ip})^T$ is a vector of feature measurements for the $i$-th case
$\beta_0, \beta_1, \dots ,\beta_p$ : model parameters
Probabilistic Model
Assume that $Y_i \sim P(Y\vert X = x_i)$ and $E(Y\vert X = x_i) = f(x_i) = \beta_0 + \sum^p_{j=1}x_{ij}\beta_j$
$Y_i = \beta_0 + \sum^p_{j=1}x_{ij}\beta_j + \epsilon_i$ where the $\epsilon_i$ are i.i.d. with $E(\epsilon_i) = 0$ and $Var(\epsilon_i) = \sigma^2$ for $i = 1, 2, \dots, N$
In particular, we often assume Gaussian errors: $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$
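As a rough illustration of this data-generating model, the sketch below simulates responses with Gaussian errors. The coefficient values, the noise level `sigma`, the sample size, and the seed are arbitrary assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, chosen arbitrarily

N, p = 200, 3
beta = np.array([1.0, 2.0, -1.0, 0.5])    # hypothetical (beta_0, ..., beta_p)
sigma = 0.3                               # hypothetical error standard deviation

inputs = rng.normal(size=(N, p))          # the x_i, treated as fixed once drawn
eps = rng.normal(loc=0.0, scale=sigma, size=N)   # i.i.d. N(0, sigma^2) errors

# Y_i = beta_0 + sum_j x_ij * beta_j + eps_i
y = beta[0] + inputs @ beta[1:] + eps
```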
Least Squares
- The most popular estimation method is least squares, in which we pick the coefficients $\beta = (\beta_0, \beta_1, \dots , \beta_p)^T$ to minimize the residual sum of squares (RSS)
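Written out as a sum over the training cases, with $f(x_i) = \beta_0 + \sum_{j=1}^p x_{ij}\beta_j$ as in the model above, the criterion is
\(RSS(\beta) = \sum_{i=1}^N (y_i - f(x_i))^2 = \sum_{i=1}^N \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2\)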
Probabilistic view
Least squares is equivalent to maximum likelihood estimation if we assume a Normal distribution for the errors!
Since the input values are given (fixed), each $y_i$ is normally distributed with mean $f(x_i)$ and variance $\sigma^2$
$\log P(y_i \vert x_i;\beta) = -\frac{1}{2\sigma^2}(y_i - f(x_i))^2 + const$
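Summing over the $N$ independent observations fills in the step from one term to the full log-likelihood (the constant is written out here under the Gaussian assumption):
\(\ell(\beta) = \sum_{i=1}^N \log P(y_i \vert x_i;\beta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - f(x_i))^2\)
The first term does not depend on $\beta$, and $\sigma^2 > 0$, so maximizing $\ell(\beta)$ over $\beta$ is the same as minimizing $\sum_{i=1}^N (y_i - f(x_i))^2$, which is exactly the residual sum of squares.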
- Figure 3.1 illustrates the geometry of least-squares fitting in the $p+1$ dimensional space occupied by the pairs $(X,Y)$
How do we minimize RSS?
Denote by $X$ the $N \times (p+1)$ matrix with each row an input vector (with a 1 in the first position), and similarly let $y$ be the $N$-vector of outputs in the training set. Then we can write the residual sum of squares as \(RSS(\beta) = (y-X\beta)^T(y-X\beta)\)
This is a quadratic function in the $p+1$ parameters. Differentiating with respect to $\beta$ we obtain \(\frac{\partial RSS}{\partial \beta} = -2X^T(y-X\beta)\) \(\frac{\partial^2 RSS}{\partial \beta \partial \beta^T} = 2X^TX\)
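One way to sanity-check these derivatives is to compare the analytic gradient with central finite differences on random data. The sketch below does that; the data, the step size `h`, and the helper `rss` are assumptions made only for this check.

```python
import numpy as np

rng = np.random.default_rng(1)            # arbitrary seed
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # N x (p+1)
y = rng.normal(size=N)
beta = rng.normal(size=p + 1)             # arbitrary point at which to differentiate

def rss(b):
    """Residual sum of squares (y - Xb)^T (y - Xb)."""
    r = y - X @ b
    return r @ r

# Analytic gradient and Hessian from the formulas above.
grad = -2 * X.T @ (y - X @ beta)
hess = 2 * X.T @ X

# Central finite differences along each coordinate direction.
h = 1e-6
num_grad = np.array([
    (rss(beta + h * e) - rss(beta - h * e)) / (2 * h)
    for e in np.eye(p + 1)
])

grad_matches = np.allclose(grad, num_grad, rtol=1e-5, atol=1e-5)
min_eig = np.linalg.eigvalsh(hess).min()  # positive when X has full column rank
```

Because RSS is quadratic in $\beta$, central differences agree with the analytic gradient up to rounding error, and a strictly positive smallest eigenvalue of the Hessian indicates that the stationary point is a unique minimum.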
Assuming (for the moment) that $X$ has full column rank, and hence $X^TX$ is positive definite, we set the first derivative to zero, \(X^T(y-X\beta) = 0\)
to obtain the unique solution
$\hat\beta = (X^TX)^{-1}X^Ty$
The predicted values at an input vector $x_{te}$ are given by $\hat f (x_{te}) = (1 : x_{te})^T \hat\beta$ ; the fitted values at the training inputs are
$\hat y = X\hat\beta = X(X^TX)^{-1}X^Ty$
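The closed-form solution can be sketched in NumPy as follows. The simulated data, the true coefficients, and the test point `x_te` are illustrative assumptions. The code solves the normal equations with `np.linalg.solve` instead of forming $(X^TX)^{-1}$ explicitly, which is a common numerical preference rather than anything the derivation requires, and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(2)            # arbitrary seed
N, p = 100, 3
inputs = rng.normal(size=(N, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])   # hypothetical coefficients

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(N), inputs])
y = X @ beta_true + rng.normal(scale=0.3, size=N)

# Normal equations: X^T X beta = X^T y (requires full column rank).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values at the training inputs.
y_hat = X @ beta_hat

# Prediction at a new input: prepend 1, then take the inner product.
x_te = np.array([0.5, -1.0, 2.0])
f_te = np.concatenate([[1.0], x_te]) @ beta_hat

# At the solution the residual satisfies X^T (y - X beta_hat) = 0.
resid = y - y_hat
```

When `X` is close to rank deficient, `np.linalg.lstsq` is usually the safer of the two routes, since it does not rely on $X^TX$ being well conditioned.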
Geometrical representation
- Figure 3.2 shows a different geometrical representation of the least squares estimate, this time in $\mathbb{R}^N$