IML L2.2 Loss functions
Loss functions
Loss functions are used to quantify how well, or how poorly, a model reproduces the values of the training set.
The appropriate loss function depends on the type of problem and on the algorithm we use.
Let’s denote with $\hat{y}$ the prediction of the model and $y$ the true value.
Gaussian noise
Let’s assume that the relationship between the features $X$ and the label $Y$ is given by
\[Y=\hat y(X)+\epsilon\]where $\hat y$ is the model whose parameters we want to fix and $\epsilon$ is some random noise with zero mean and variance $\sigma^2$.
The likelihood of measuring $y$ for feature values $x$ is given by
\[L\sim\exp\left(-\frac{(y-\hat y(x))^2}{2\sigma^2}\right)\]If we have a set of examples $x^{(i)}$ the likelihood becomes
\[L\sim\prod_i\exp\biggl(-\frac{(y^{(i)}-\hat y(x^{(i)}))^2}{2\sigma^2}\biggr)\]We now want to choose the parameters of $\hat y$ so as to maximize the likelihood that our data was generated by the model.
It is more convenient to work with the log of the likelihood. Maximizing the likelihood is equivalent to minimizing the negative log-likelihood,
\(NLL=-\log(L)=\frac{1}{2\sigma^2}\sum_{i}\left(y^{(i)}-\hat y(x^{(i)})\right)^{2}\) up to an additive constant. So assuming Gaussian noise for the difference between the model and the data leads to the least-squares rule.
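The proportionality between the Gaussian NLL and the sum of squared errors can be checked numerically. The sketch below (plain NumPy, with made-up data and a hypothetical one-parameter model $\hat y(x)=wx$; the noise level and candidate values of $w$ are arbitrary) shows that whichever parameter gives the smaller sum of squared errors also gives the smaller NLL.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data: y = 2x + Gaussian noise (illustrative values only)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + rng.normal(scale=0.3, size=50)

sigma2 = 0.3 ** 2  # assumed noise variance

def sum_squared_errors(w):
    """Sum of squared errors for the model y_hat(x) = w * x."""
    return np.sum((y - w * x) ** 2)

def gaussian_nll(w):
    """Negative log-likelihood under Gaussian noise, dropping the additive constant."""
    return sum_squared_errors(w) / (2 * sigma2)

# The ordering of candidate parameters is the same under both criteria.
for w in (1.5, 2.0, 2.5):
    print(f"w={w:.1f}  SSE={sum_squared_errors(w):8.3f}  NLL={gaussian_nll(w):8.3f}")
```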
We can use the squared-error loss
\(J(\hat y)=\sum_{i}\left(y^{(i)}-\hat y(x^{(i)})\right)^{2}\) to train our machine learning algorithm.
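As an illustration (not part of the notes themselves), here is a minimal gradient-descent fit of a one-parameter linear model using the squared-error loss; the synthetic data, learning rate, and number of iterations are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training set: y = 3x + noise (illustrative only)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + rng.normal(scale=0.2, size=100)

w = 0.0    # single model parameter, y_hat(x) = w * x
lr = 0.1   # learning rate (arbitrary)

for step in range(200):
    y_hat = w * x
    # Gradient of J(w) = sum_i (y_i - w * x_i)^2 with respect to w
    grad = -2.0 * np.sum((y - y_hat) * x)
    w -= lr * grad / len(x)  # average the gradient for a stable step size

print(f"fitted w = {w:.3f} (true slope 3.0)")
```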
Two-class model
If we have two classes, we call one the positive class $(c=1)$ and the other the negative class $(c=0)$. If the probability of belonging to class 1 is
\(p(c=1)=p\) we also have \(p(c=0)=1-p\). The likelihood for a single measurement is $p$ if the outcome is in the positive class and $1-p$ if it is in the negative class. For a set of measurements with outcomes $y_i$ the likelihood is given by
\(L=\prod_{y_i=1}p\prod_{y_i=0}(1-p)\) So the negative log-likelihood is:
\(NLL=-\sum_{y_{i}=1}\log(p)-\sum_{y_{i}=0}\log(1-p)\) Given that $y_i=0$ or $y_i=1$ we can rewrite it as
\(NLL=-\sum_i\bigl(y_i\log(p)+(1-y_i)\log(1-p)\bigr)\) So if we have a model $\hat y=p(X)$ for the probability, we can maximize the likelihood of the training data by minimizing
\(J=-\sum_i\bigl(y_i\log(\hat y_i)+(1-y_i)\log(1-\hat y_i)\bigr)\) where $\hat y_i=\hat y(x_i)$. This loss is called the cross-entropy.
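A minimal sketch of the cross-entropy loss in NumPy, assuming the model outputs a probability $\hat y_i$ in $(0,1)$ for each example; the small clipping constant is a common numerical safeguard and not part of the derivation above.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy J = -sum_i [ y_i log(y_hat_i) + (1 - y_i) log(1 - y_hat_i) ]."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
# Confident, correct predictions give a small loss...
print(cross_entropy(y_true, np.array([0.9, 0.1, 0.8, 0.95])))
# ...while a confident mistake on the last example is penalized much more heavily.
print(cross_entropy(y_true, np.array([0.9, 0.1, 0.8, 0.05])))
```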