In the simple linear regression case $y=\beta_0+\beta_1x$, you can derive the least squares estimator $\hat\beta_1=\frac{\sum(x_i-\bar x)(y_i-\bar y)}{\sum(x_i-\bar x)^2}$, so that you don't need to know $\hat\beta_0$ to estimate $\hat\beta_1$.

Suppose I have $y=\beta_1x_1+\beta_2x_2$: how do I derive $\hat\beta_1$ without estimating $\hat\beta_2$? Or is this not possible?

It is possible to estimate just one coefficient in a multiple regression without estimating the others.

The estimate of $\beta_1$ is obtained by removing the effects of $x_2$ from the other variables and then regressing the residuals of $y$ against the residuals of $x_1$. This is explained and illustrated in How exactly does one control for other variables? and How to normalize (a) regression coefficient? The beauty of this approach is that it requires no calculus, no linear algebra, can be visualized using just two-dimensional geometry, is numerically stable, and exploits just one fundamental idea of multiple regression: that of taking out (or "controlling for") the effects of a single variable.

In the present case the multiple regression can be done using three ordinary regression steps:

1. Regress $y$ on $x_2$ (without a constant term!). Let the fit be $y = \alpha_{y,2}x_2 + \delta$. The estimate is $$\alpha_{y,2} = \frac{\sum_i y_i x_{2i}}{\sum_i x_{2i}^2}.$$ Therefore the residuals are $$\delta = y - \alpha_{y,2}x_2.$$ Geometrically, $\delta$ is what is left of $y$ after its projection onto $x_2$ is subtracted.

2. Regress $x_1$ on $x_2$ (without a constant term). Let the fit be $x_1 = \alpha_{1,2}x_2 + \gamma$. The estimate is $$\alpha_{1,2} = \frac{\sum_i x_{1i} x_{2i}}{\sum_i x_{2i}^2}.$$ The residuals are $$\gamma = x_1 - \alpha_{1,2}x_2.$$ Geometrically, $\gamma$ is what is left of $x_1$ after its projection onto $x_2$ is subtracted.

3. Regress $\delta$ on $\gamma$ (without a constant term). The estimate is $$\hat\beta_1 = \frac{\sum_i \delta_i \gamma_i}{\sum_i \gamma_i^2}.$$ The fit will be $\delta = \hat\beta_1 \gamma + \varepsilon$. Geometrically, $\hat\beta_1$ is the component of $\delta$ (which represents $y$ with $x_2$ taken out) in the $\gamma$ direction (which represents $x_1$ with $x_2$ taken out).

Notice that $\beta_2$ has not been estimated. It can easily be recovered from what has been obtained so far (just as $\hat\beta_0$ in the ordinary regression case is easily obtained from the slope estimate $\hat\beta_1$). The $\varepsilon$ are the residuals for the multiple regression of $y$ on $x_1$ and $x_2$.

The parallel with ordinary regression is strong: steps (1) and (2) are analogs of subtracting the means in the usual formula. If you let $x_2$ be a vector of ones, you will in fact recover the usual formula.

This generalizes in the obvious way to regression with more than two variables: to estimate $\hat\beta_1$, regress $y$ and $x_1$ separately against all the other variables, then regress their residuals against each other. At that point none of the other coefficients in the multiple regression of $y$ have yet been estimated.
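
To see the procedure in action, here is a small numerical sketch in Python with NumPy (made-up data; the variable names mirror the steps above). It carries out the three no-constant regressions and checks that the resulting $\hat\beta_1$ matches the first coefficient from the full two-regressor least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data for the model y = beta_1 * x1 + beta_2 * x2 + noise (no intercept).
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

# Step 1: regress y on x2 (no constant) and keep the residuals delta.
alpha_y2 = np.sum(y * x2) / np.sum(x2**2)
delta = y - alpha_y2 * x2

# Step 2: regress x1 on x2 (no constant) and keep the residuals gamma.
alpha_12 = np.sum(x1 * x2) / np.sum(x2**2)
gamma = x1 - alpha_12 * x2

# Step 3: regress delta on gamma (no constant); the slope is the estimate of beta_1.
beta1_hat = np.sum(delta * gamma) / np.sum(gamma**2)

# Check against the full two-regressor least-squares fit.
X = np.column_stack([x1, x2])
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta1_hat, beta_full[0])  # the two estimates of beta_1 agree
```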

The ordinary least squares estimate of $\beta$ is a linear function of the response variable. Simply put, the OLS estimate of the coefficients, the $\beta$'s, can be written using only the dependent variable ($Y_i$'s) and the independent variables ($X_{ki}$'s).

To explain this fact for a general regression model, you need to understand a little linear algebra. Suppose you would like to estimate the coefficients $(\beta_0, \beta_1, ...,\beta_k)$ in a multiple regression model,

$$
Y_i = \beta_0+\beta_1X_{1i}+...+\beta_kX_{ki}+\epsilon_i
$$

where $\epsilon_i \overset{iid}{\sim} N(0,\sigma^2)$ for $i=1,...,n$. The design matrix $\mathbf{X}$ is an $n\times (k+1)$ matrix whose first column is all ones (for the intercept) and whose remaining columns contain the $n$ observations of each independent variable $X_1,\dots,X_k$. You can find many explanations and derivations here of the formula used to calculate the estimated coefficients $\boldsymbol{\hat{\beta}}=(\hat{\beta}_0, \hat{\beta}_1, ..., \hat{\beta}_k)$, which is

$$\boldsymbol{\hat{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}.$$
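
As an illustration of this formula, here is a minimal sketch with made-up data: the design matrix gets a leading column of ones for the intercept, and the coefficients are obtained by solving the normal equations $\mathbf{X}'\mathbf{X}\boldsymbol{\hat\beta} = \mathbf{X}'\mathbf{Y}$ (numerically preferable to forming the inverse explicitly).

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: two predictors plus an intercept.
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 2.0 * x2 + rng.normal(scale=0.3, size=n)

# Design matrix: a column of ones for beta_0, then one column per predictor.
X = np.column_stack([np.ones(n), x1, x2])

# beta_hat = (X'X)^{-1} X'Y, computed by solving the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # roughly [1.0, 0.5, 2.0]
```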

I have a follow-up question. In the simple regression case, you make $y_i=\beta_0+\beta_1\bar x+\beta_1(x_i-\bar x)+e_i$, then $X$ becomes a matrix with columns $(1,...,1)$ and $(x_1-\bar x,...,x_n-\bar x)$, and then you follow through with $\hat\beta=(X'X)^{-1}X'Y$. How should I rewrite the equation in my case?
– Saber CN, Dec 18 '12 at 8:20

And one more question: does this apply to cases where $x_1$ and $x_2$ are not linear, but the model is still linear? For example, for the decay curve $y=\beta_1 e^{x_1t}+\beta_2 e^{x_2t}$, can I substitute the exponentials with $x_1'$ and $x_2'$ so it becomes my original question?
– Saber CN, Dec 18 '12 at 8:52

Regarding your first comment: you can center the variable (subtract its mean from it) and use that as your independent variable. Search for "standardized regression". The formula you wrote in terms of matrices is not correct. For your second question: yes, you may do that. A linear model is one that is linear in $\beta$, so as long as $y$ is equal to a linear combination of the $\beta$'s, you are fine.
– caburke, Dec 18 '12 at 9:01
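
To make that last point concrete, here is a minimal sketch with made-up data. It assumes the rates $x_1$ and $x_2$ in the decay curve are known constants, so the exponentials become ordinary regressor columns and the model is linear in the $\beta$'s.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical decay curve y = beta_1 * exp(x1 * t) + beta_2 * exp(x2 * t),
# where the rates x1 and x2 are assumed known and t is observed.
t = np.linspace(0.0, 5.0, 60)
x1_rate, x2_rate = -0.5, -2.0
y = 3.0 * np.exp(x1_rate * t) + 1.0 * np.exp(x2_rate * t) + rng.normal(scale=0.05, size=t.size)

# Substitute the exponentials as new regressors; the model is now linear in the betas.
Z = np.column_stack([np.exp(x1_rate * t), np.exp(x2_rate * t)])
beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(beta_hat)  # should be close to [3.0, 1.0]
```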

We want to minimize the total squared error, so the following expression should be as small as possible:

$$ E'E = (Y-\hat{Y})' (Y-\hat{Y}) $$

This is equal to:

$$ E'E = (Y-X\beta)' (Y-X\beta)$$

The rewriting might seem confusing, but it follows from linear algebra: in some respects, matrices behave like ordinary variables when we multiply them.

We want to find the values of $\beta$ for which this expression is as small as possible, so we differentiate with respect to $\beta$ and set the derivative equal to zero, using the chain rule:

$$ \frac{dE'E}{d\beta} = - 2 X'Y + 2 X'X\beta = 0$$

This gives:

$$ X'X\beta = X'Y $$

Such that finally:
$$ \beta = (X'X)^{-1} X'Y $$

So mathematically we have found a solution. There is one problem, though: computing $(X'X)^{-1}$ explicitly is hard when the matrix $X$ is very large, and it can run into numerical accuracy issues. Another way to find the optimal values for $\beta$ in this situation is to use a gradient descent type of method. The function that we want to optimize is convex (and defined over all of $\beta$-space), so in practice we could also use a gradient method if need be.
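
As a sketch of that alternative (the step size and iteration count below are arbitrary illustrative choices, not tuned recommendations), a plain gradient descent on $E'E$ looks like this:

```python
import numpy as np

def ols_gradient_descent(X, y, lr=0.01, n_iter=5000):
    """Minimize (y - X @ beta)' (y - X @ beta) by gradient descent.

    The gradient is -2 X'(y - X beta), as derived above.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ beta)
        beta -= lr * grad / len(y)   # scale by n to keep the step size stable
    return beta

# Small made-up example; compare with the closed-form solution.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 3.0]) + rng.normal(scale=0.1, size=50)

print(ols_gradient_descent(X, y))
print(np.linalg.solve(X.T @ X, X.T @ y))  # both should be close to [1.0, 3.0]
```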

Valid point. One could also use the Gram-Schmidt process, but I just wanted to remark that finding the optimal values for the $\beta$ vector can also be done numerically because of the convexity.
– Vincent Warmerdam, Dec 19 '12 at 17:44