What Least Squares Assumes About Your Data
In my college linear algebra course I learned about least squares regression, where the choice of squared error was never motivated much beyond "squaring makes the error non-negative, heavily penalizes outliers, and is easy to compute." Recently, I learned about maximum likelihood estimation, where we pick the parameters of a model that maximize the likelihood of the observed data. For example, suppose we have a random variable

    Y = a X + b + E

where

    X ~ U(0, 1)      (X uniformly sampled from [0, 1])
    E ~ N(0, s^2)    (error sampled from a Gaussian distribution)

Given a set of samples (X1, Y1), (X2, Y2), ..., (Xn, Yn), and omitting some constants for clarity, we can find the parameters (a, b) that maximize the likelihood via:

    argmax[a,b] of product[i=1...n] of exp(-((Yi - a Xi - b) / s)^2)
      { taking logarithms, an increasing function }
    = argmax[a,b] of sum[i=1...n] of -((Yi - a Xi - b) / s)^2

...
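To make the equivalence concrete, here is a minimal sketch (mine, not from the derivation above) that fits (a, b) two ways on synthetic data from this model: once by numerically maximizing the Gaussian log-likelihood, and once by ordinary least squares. The true parameter values, the sample size, and the use of scipy.optimize.minimize are all arbitrary choices for illustration.

```python
# Sketch: maximizing the Gaussian likelihood of Y = a X + b + E
# gives the same (a, b) as ordinary least squares.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
a_true, b_true, s = 2.0, -1.0, 0.5   # arbitrary "true" parameters for the demo
n = 1000

X = rng.uniform(0.0, 1.0, size=n)                      # X ~ U(0, 1)
Y = a_true * X + b_true + rng.normal(0.0, s, size=n)   # E ~ N(0, s^2)

def neg_log_likelihood(params):
    a, b = params
    # Up to additive and multiplicative constants (which don't move the argmax),
    # the negative Gaussian log-likelihood is sum(((Yi - a Xi - b) / s)^2).
    return np.sum(((Y - a * X - b) / s) ** 2)

# Maximize the likelihood by minimizing its negative log.
mle_a, mle_b = minimize(neg_log_likelihood, x0=[0.0, 0.0]).x

# Closed-form least squares on the design matrix [X, 1].
A = np.column_stack([X, np.ones(n)])
(lsq_a, lsq_b), *_ = np.linalg.lstsq(A, Y, rcond=None)

print("MLE estimate:           a =", mle_a, " b =", mle_b)
print("Least-squares estimate: a =", lsq_a, " b =", lsq_b)
```

Both estimates should agree to numerical precision, since minimizing the sum of squared residuals is exactly what the (log-)likelihood maximization reduces to under Gaussian noise.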