IML L2.2 Loss functions
Loss functions
Loss functions are used to quantify how well, or how poorly, a model reproduces the values of the training set.
The appropriate loss function depends on the type of problem and on the algorithm we use.
Let’s denote with $\hat{y}$ the prediction of the model and $y$ the true value.
Gaussian noise
Let’s assume that the relationship between the features $X$ and the label $Y$ is given by
\[Y=\hat y(X)+\epsilon\]where $\hat y$ is the model whose parameters we want to fix and $\epsilon$ is some random noise with zero mean and variance $\sigma^2$.
The likelihood of measuring $y$ for feature values $x$ is given by
\[L\sim\exp\left(-\frac{(y-\hat y(x))^2}{2\sigma^2}\right)\]If we have a set of examples $x^{(i)}$ the likelihood becomes
\[L\sim\prod_i\exp\biggl(-\frac{(y^{(i)}-\hat y(x^{(i)}))^2}{2\sigma^2}\biggr)\]We now want to choose the parameters of $\hat y$ so as to maximize the likelihood that our data was generated by the model.
It is more convenient to work with the log of the likelihood. Maximizing the likelihood is equivalent to minimizing the negative log-likelihood,
\(NLL=-\log(L)=\frac{1}{2\sigma^2}\sum_{i}\left(y^{(i)}-\hat y(x^{(i)})\right)^{2}\) up to an additive constant. So assuming Gaussian noise for the difference between the model and the data leads to the least-squares rule.
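The proportionality between the Gaussian NLL and the sum of squared errors can be checked numerically. The sketch below (plain NumPy, with made-up data and a hypothetical one-parameter model $\hat y(x)=wx$; the noise level and candidate values of $w$ are arbitrary) shows that whichever parameter gives the smaller sum of squared errors also gives the smaller NLL.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data: y = 2x + Gaussian noise (illustrative values only)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + rng.normal(scale=0.3, size=50)

sigma2 = 0.3 ** 2  # assumed noise variance

def sum_squared_errors(w):
    """Sum of squared errors for the model y_hat(x) = w * x."""
    return np.sum((y - w * x) ** 2)

def gaussian_nll(w):
    """Negative log-likelihood under Gaussian noise, dropping the additive constant."""
    return sum_squared_errors(w) / (2 * sigma2)

# The ordering of candidate parameters is the same under both criteria.
for w in (1.5, 2.0, 2.5):
    print(f"w={w:.1f}  SSE={sum_squared_errors(w):8.3f}  NLL={gaussian_nll(w):8.3f}")
```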
We can use the squared-error loss
\(J(\hat y)=\sum_{i}\left(y^{(i)}-\hat y(x^{(i)})\right)^{2}\) to train our machine learning algorithm.
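As an illustration (not part of the notes themselves), here is a minimal gradient-descent fit of a one-parameter linear model using the squared-error loss; the synthetic data, learning rate, and number of iterations are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training set: y = 3x + noise (illustrative only)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + rng.normal(scale=0.2, size=100)

w = 0.0    # single model parameter, y_hat(x) = w * x
lr = 0.1   # learning rate (arbitrary)

for step in range(200):
    y_hat = w * x
    # Gradient of J(w) = sum_i (y_i - w * x_i)^2 with respect to w
    grad = -2.0 * np.sum((y - y_hat) * x)
    w -= lr * grad / len(x)  # average the gradient for a stable step size

print(f"fitted w = {w:.3f} (true slope 3.0)")
```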
Two-class model
If we have two classes, we call one the positive class $(c=1)$ and the other the negative class $(c=0)$. If the probability of belonging to class 1 is
\(p(c=1)=p\) we also have \(p(c=0)=1-p\). The likelihood for a single measurement is $p$ if the outcome is in the positive class and $1-p$ if it is in the negative class. For a set of measurements with outcomes $y_i$ the likelihood is given by
\(L=\prod_{y_i=1}p\prod_{y_i=0}(1-p)\) So the negative log-likelihood is:
\(NLL=-\sum_{y_{i}=1}\log(p)-\sum_{y_{i}=0}\log(1-p)\) Given that $y_i=0$ or $y_i=1$ we can rewrite it as
\(NLL=-\sum_i\bigl(y_i\log(p)+(1-y_i)\log(1-p)\bigr)\) So if we have a model $\hat y=p(X)$ for the probability, we can maximize the likelihood of the training data by minimizing
\(J=-\sum_i\bigl(y_i\log(\hat y_i)+(1-y_i)\log(1-\hat y_i)\bigr)\) where $\hat y_i=\hat y(x_i)$. This loss is called the cross-entropy.
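A minimal sketch of the cross-entropy loss in NumPy, assuming the model outputs a probability $\hat y_i$ in $(0,1)$ for each example; the small clipping constant is a common numerical safeguard and not part of the derivation above.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy J = -sum_i [ y_i log(y_hat_i) + (1 - y_i) log(1 - y_hat_i) ]."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
# Confident, correct predictions give a small loss...
print(cross_entropy(y_true, np.array([0.9, 0.1, 0.8, 0.95])))
# ...while a confident mistake on the last example is penalized much more heavily.
print(cross_entropy(y_true, np.array([0.9, 0.1, 0.8, 0.05])))
```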