
IML L2.1

Logistic regression

The problems we had with the perceptron were:

  • only converged for linearly separable problems
  • stopped in places that did not look like they would generalise well

img

We need an algorithm that takes a more balanced approach:

  • finds a “middle ground” decision boundary
  • can make decisions even when the data is not separable

The logistic model

  • for binary classification of the positive $(y=1)$ vs negative $(y=0)$ class

  • for a probability $p$ to belong to the positive class, the odds ratio is given by \(r = \frac{p}{1-p}\)

  • if $r>1$ we are more likely to be in the positive class

  • if $r<1$ we are more likely to be in the negative class

  • to make it more symmetric we consider the log of $r$

  • use a linear model for $z = \log(r)$

\[z=w_{0}+\sum_{j=1}^{D}x_{j}w_{j}=w_{0}+(x_{1},\ldots,x_{D})\cdot(w_{1},\ldots,w_{D})\]

If we set $x_0=1$ we can write this as \(z=\sum_{j=0}^{D}x_{j}w_{j}=(x_{0},x_{1},\ldots,x_{D})\cdot(w_{0},w_{1},\ldots,w_{D})\). We can invert to obtain the probability as a function of $z$:

\[p=\frac{1}{1+e^{-z}}\equiv\phi(z)\]

img
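To make the mapping concrete, here is a minimal NumPy sketch of the sigmoid and its inverse, the log-odds (the function names are my own):

```python
import numpy as np

def sigmoid(z):
    """phi(z) = 1 / (1 + exp(-z)): maps the log-odds z to a probability p."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    """z = log(r) = log(p / (1 - p)): the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

p = 0.8
z = log_odds(p)          # log(0.8 / 0.2) ≈ 1.386, r > 1: more likely positive
print(sigmoid(z))        # recovers p = 0.8
```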

Logistic regression optimisation

  • the typical loss function optimised for logistic regression is
\[J=-\sum_{i}\left[y^{(i)}\log(\phi(z^{(i)}))+(1-y^{(i)})\log(1-\phi(z^{(i)}))\right]\]
  • only one of the log terms is activated per training data point

    • if $y^{(i)}=0$ the loss has a term $-\log(1-\phi(z^{(i)}))$

      • the contribution to the loss is small if $z^{(i)}$ is negative (assigned correctly)
      • the contribution to the loss is large if $z^{(i)}$ is positive (assigned incorrectly)
    • if $y^{(i)}=1$ the loss has a term $-\log(\phi(z^{(i)}))$

      • the contribution to the loss is large if $z^{(i)}$ is negative (assigned incorrectly)
      • the contribution to the loss is small if $z^{(i)}$ is positive (assigned correctly)

img
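As a minimal sketch of this loss (my own helper, assuming labels in $\{0,1\}$ and predicted probabilities strictly between 0 and 1):

```python
import numpy as np

def logistic_loss(y, phi):
    """J = -sum_i [ y_i log(phi_i) + (1 - y_i) log(1 - phi_i) ]."""
    y, phi = np.asarray(y, dtype=float), np.asarray(phi, dtype=float)
    return -np.sum(y * np.log(phi) + (1.0 - y) * np.log(1.0 - phi))

# correctly assigned points contribute little, misassigned points a lot
print(logistic_loss([1, 0], [0.95, 0.05]))   # ≈ 0.10
print(logistic_loss([1, 0], [0.05, 0.95]))   # ≈ 5.99
```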

Gradient descent

To optimise the parameters $w$ of the logistic regression fit to the log-odds, we calculate the gradient of the loss function $J$ and move in the opposite direction.

\(w_j\to w_j-\eta\frac{\partial J}{\partial w_j}\) where $\eta$ is the learning rate; it sets the speed at which the parameters are adapted.

It has to be set empirically.

Finding a suitable $\eta$ is not always easy.

img

Too small a learning rate leads to slow convergence.

img

Too large a learning rate might spoil convergence altogether!

img
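To make the effect of $\eta$ concrete, a toy sketch (a 1-D quadratic rather than the logistic loss, purely illustrative):

```python
# Minimise J(w) = w^2 (gradient dJ/dw = 2w) starting from w = 5,
# to show how the learning rate eta affects convergence.
def gradient_descent(eta, n_steps=20, w=5.0):
    for _ in range(n_steps):
        w -= eta * 2 * w                  # w <- w - eta * dJ/dw
    return w

for eta in (0.01, 0.1, 1.1):
    print(f"eta={eta}: w = {gradient_descent(eta):.4g}")
# eta=0.01: w ≈ 3.3   (too small: slow convergence)
# eta=0.1 : w ≈ 0.058 (reasonable)
# eta=1.1 : w ≈ 192   (too large: the iteration diverges)
```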

Gradient descent for logistic regression

\[\frac{\partial\phi(z)}{\partial z}=-\frac{1}{(1+e^{-z})^{2}}(-e^{-z})=\phi(z)\frac{e^{-z}}{1+e^{-z}}=\phi(z)(1-\phi(z))\]
\[\frac{\partial\phi}{\partial w_{j}}=\frac{\partial\phi(z)}{\partial z}\frac{\partial z}{\partial w_{j}}=\phi(z)(1-\phi(z))x_{j}\]
\[\begin{aligned}
\frac{\partial J}{\partial w_{j}} &= -\sum_{i}\left[y^{(i)}\frac{1}{\phi(z^{(i)})}\frac{\partial\phi}{\partial w_{j}}-(1-y^{(i)})\frac{1}{1-\phi(z^{(i)})}\frac{\partial\phi}{\partial w_{j}}\right] \\
&= -\sum_{i}\left[y^{(i)}(1-\phi(z^{(i)}))-(1-y^{(i)})\phi(z^{(i)})\right]x_{j}^{(i)} \\
&= -\sum_{i}\left(y^{(i)}-\phi(z^{(i)})\right)x_{j}^{(i)}
\end{aligned}\]

where $i$ runs over all data samples, $1\leq i\leq n_{\mathrm{d}}$, and $j$ runs from 0 to $D$, the number of features.
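Putting the pieces together, a minimal NumPy sketch of batch gradient descent for logistic regression using the gradient derived above (the function name, $\eta$, and the number of epochs are my own choices):

```python
import numpy as np

def fit_logistic(X, y, eta=0.1, n_epochs=1000):
    """X: (n_d, D) feature matrix, y: (n_d,) labels in {0, 1}."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x_0 = 1 so w[0] is w_0
    w = np.zeros(Xb.shape[1])
    for _ in range(n_epochs):
        phi = 1.0 / (1.0 + np.exp(-Xb @ w))         # phi(z^(i)) for all samples
        grad = -Xb.T @ (y - phi)                    # dJ/dw_j = -sum_i (y^(i) - phi(z^(i))) x_j^(i)
        w -= eta * grad                             # gradient descent step
    return w
```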

Loss minimization

Algorithms:

  • gradient descent method
  • Newton method: a second-order method

Training:

  • use all the training data to compute the gradient
  • use only part of the training data: stochastic gradient descent (see the sketch below)
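A minimal sketch of the stochastic variant: instead of summing the gradient over all training samples, each update uses a single randomly chosen data point (a mini-batch works the same way):

```python
import numpy as np

def fit_logistic_sgd(X, y, eta=0.1, n_epochs=100, seed=0):
    """Stochastic gradient descent: one sample per parameter update."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # x_0 = 1 trick as before
    w = np.zeros(Xb.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):           # visit samples in random order
            phi_i = 1.0 / (1.0 + np.exp(-Xb[i] @ w))
            w += eta * (y[i] - phi_i) * Xb[i]       # single-sample gradient step
    return w
```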

Example

Two features relevant to the discrimination of benign and malignant tumors:

img

The data is not linearly separable.

We can train a sigmoid model to discriminate between the two types of tumors. It will assign the output class according to the value of

\(z=w_{0}+\sum_{j}w_{j}x_{j}=w_{0}+(x_{1},x_{2})(w_{1},w_{2})^{T}\) where $w_0$, $w_1$ and $w_2$ are chosen to minimise the loss.

img

The decision boundary is linear because $z$ is a linear function of the features.

The logistic regression gives us an estimate of the probability that an example belongs to the first class.

img
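The post does not say which dataset or features it uses; as an illustrative stand-in, here is a sketch with scikit-learn's built-in breast-cancer data, keeping only two features:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data.data[:, :2]                      # 'mean radius' and 'mean texture' only
y = data.target                           # 0 = malignant, 1 = benign

X = StandardScaler().fit_transform(X)     # normalise the features (see the next section)
clf = LogisticRegression().fit(X, y)

print(clf.intercept_, clf.coef_)          # fitted w_0 and (w_1, w_2)
print(clf.predict_proba(X[:3]))           # class probability estimates
```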

Normalising inputs

It is often important to normalise the features. We want the argument of the sigmoid function to be of order one.

  • too large or too small values push the problem into the “flat” parts of the function

    • the gradient is small
    • convergence is slow

It is useful to normalise features such that their mean is 0 and their variance is 1.
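A minimal sketch of this standardisation (scikit-learn's StandardScaler does the same thing):

```python
import numpy as np

def standardise(X):
    """Shift each feature to mean 0 and scale it to variance 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1000.0, 0.01],
              [2000.0, 0.02],
              [3000.0, 0.03]])
Xn = standardise(X)
print(Xn.mean(axis=0), Xn.std(axis=0))    # ≈ [0, 0] and [1, 1]
```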

This post is licensed under CC BY 4.0 by the author.
