Naive Bayes reaches its asymptotic error very quickly with respect to the number of training examples. As the training set grows, logistic regression will outperform naive Bayes and achieve a lower asymptotic error rate.

How do you evaluate a model? If one classifier gives you a worse result, is it a bad classifier?

Which classifier to choose?

Loss function?

For a difference (between A/B tests, two models, etc.), how do you know if it's significant? If this week's CTR is 5% better than last week's, can we conclude we've done better?

A paired difference test is used to assess whether two population means differ. For normally distributed differences we can use a t-test/z-test; otherwise we can use the Wilcoxon signed-rank test.

Student's t-test: for unequal variances (Welch's t-test), we construct $t=\frac{\bar{X}_1-\bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}}$, which is approximately standard normal for large samples. Then we can look up the standard Gaussian table and get its $p$-value (usually should be < 0.05 or 0.01 to be significant).
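As a sketch, SciPy's `ttest_ind` with `equal_var=False` runs the Welch version of this test; the CTR numbers below are made up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical daily CTRs for last week and this week (7 observations each).
last_week = rng.normal(loc=0.050, scale=0.004, size=7)
this_week = rng.normal(loc=0.0525, scale=0.004, size=7)

# Welch's t-test (unequal variances); scipy handles the degrees of freedom.
t_stat, p_value = stats.ttest_ind(this_week, last_week, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # significant if p < 0.05

# The same statistic by hand, matching the formula above (sample variances s^2):
s1, s2 = this_week.var(ddof=1), last_week.var(ddof=1)
t_manual = (this_week.mean() - last_week.mean()) / np.sqrt(s1 / 7 + s2 / 7)
print(f"t (manual) = {t_manual:.3f}")
```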

Use a sigmoid function to map $\mathbb{R} \rightarrow (0,1)$; then regression becomes a probabilistic model (this is logistic regression).
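A minimal sketch of that mapping (the weights and input are made up):

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The linear score w^T x + b becomes a class probability.
w, b = np.array([0.8, -0.4]), 0.1
x = np.array([1.5, 2.0])
p = sigmoid(w @ x + b)  # P(y = 1 | x) under the logistic model
print(p)
```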

Stochastic Gradient Descent vs. Gradient Descent

SGD updates the parameters on every single training example, so if the training set is large, SGD takes much less time per update; it then starts oscillating, since it minimizes the error on a single example instead of the total error. You can picture the path SGD takes to the optimum: instead of following the overall gradient, it makes a decision on each single example, with the expected direction being the true gradient, so in reality it can stray off the optimal course, but eventually it gets there.

More practically, if the training data is really big, with batch updating you might not even be able to complete a single pass.

"This means that we have a trade-off of fast computation per iteration and slow convergence for SGD versus slow computation per iteration and fast convergence for gradient descent" http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips26/238.pdf

Gradient Descent vs. Gradient Ascent

Maximize a quantity (likelihood) vs. minimize a quantity (loss). Maximizing $f$ with gradient ascent is the same as minimizing $-f$ with gradient descent.

Maximizing Likelihood vs. Minimizing Loss

The loss function can really be anything as long as it measures the correctness of the model. You can directly define a loss function (hinge, logistic, etc.), or use the negative log-likelihood for probabilistic models.
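A minimal sketch of the losses named above on a single example (the function names are mine; labels are in $\{-1,+1\}$ for hinge/logistic and $\{0,1\}$ for the likelihood view):

```python
import numpy as np

def hinge(y, score):          # y in {-1, +1}, score = w^T x
    return max(0.0, 1.0 - y * score)

def logistic_loss(y, score):  # y in {-1, +1}
    return np.log1p(np.exp(-y * score))

def nll(y01, p):              # y01 in {0, 1}, p = predicted P(y = 1)
    return -(y01 * np.log(p) + (1 - y01) * np.log(1 - p))

score = 0.7
p = 1 / (1 + np.exp(-score))  # sigmoid of the score
print(hinge(+1, score), logistic_loss(+1, score), nll(1, p))
```

Note that for $y=1$ the logistic loss and the negative log-likelihood of the sigmoid probability print the same number, which is exactly the "negative likelihood as loss" connection.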

For a Gaussian mixture: each $X_i$ comes from one of the Gaussian components; the problem is you don't know which one. Randomly assign membership to each point. Then we can estimate the responsibilities $p(z \mid x, \mu, \Sigma) = \frac{p(x \mid z, \mu, \Sigma)\,p(z)}{\sum_z p(x \mid z, \mu, \Sigma)\,p(z)}$ (E-step), followed by updating $\mu_k$ and $\Sigma_k$ as responsibility-weighted means and covariances (M-step), iterating until convergence.
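A minimal 1-D, two-component EM sketch of those two steps (synthetic data, my own variable names, standard deviations in place of full covariances):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# 1-D data drawn from two Gaussians; the component labels are unobserved.
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])

# Random initial parameters: means, std devs, mixing weights.
mu, sigma, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

for _ in range(50):
    # E-step: responsibilities p(z = k | x_i, mu, sigma), shape (n, 2).
    dens = pi * norm.pdf(x[:, None], mu, sigma)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted parameter updates.
    nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)

print(mu, sigma, pi)  # should recover roughly (-2, 3), (1, 1), (0.5, 0.5)
```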

Likelihood ratio

Ratio between the likelihoods under the null and alternative hypotheses: $\frac{p(X \mid \theta_0)}{p(X \mid \theta_1)}$. The null hypothesis with parameter $\theta_0$ will be rejected if the ratio $< c$, where $c$ is determined by a chosen significance level $\alpha$.
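A minimal sketch for a simple-vs-simple test on a Gaussian mean (the data, $\theta_0$, $\theta_1$, and the threshold $c$ are all made up; in practice $c$ comes from the distribution of the ratio under $H_0$ at level $\alpha$):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=0.4, scale=1.0, size=20)  # observed data

theta0, theta1 = 0.0, 0.5  # null and alternative means, unit variance
ratio = np.prod(norm.pdf(X, theta0, 1.0)) / np.prod(norm.pdf(X, theta1, 1.0))

c = 0.5  # hypothetical threshold for some significance level alpha
print(f"likelihood ratio = {ratio:.4f}, reject H0: {ratio < c}")
```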