
IML L2.1

Logistic regression

The problems we had with the perceptron were:

  • only converged for linearly separable problems
  • stopped in places that did not look like they would generalise well

img

We need an algorithm that takes a more balanced approach:

  • finds a “middle ground” decision boundary
  • can make decisions even when the data is not separable

The logistic model

  • for binary classification of the positive $(y=1)$ vs negative $(y=0)$ class

  • for a probability $p$ to belong to the positive class, the odds ratio is given by \(r = \frac{p}{1-p}\)

  • if $r>1$ we are more likely to be in the positive class

  • if $r<1$ we are more likely to be in the negative class

  • to make it more symmetric we consider the log of $r$

  • use a linear model for $z = \log(r)$

\[z=w_{0}+\sum_{j=1}^{D}x_{j}w_{j}=w_{0}+(x_{1},\ldots,x_{D})\cdot(w_{1},\ldots,w_{D})\]

If we set $x_0=1$ we can write this as \(z=\sum_{j=0}^{D}x_{j}w_{j}=(x_{0},x_{1},\ldots,x_{D})\cdot(w_{0},w_{1},\ldots,w_{D})\). We can invert to obtain the probability as a function of $z$:

\[p=\frac{1}{1+e^{-z}}\equiv\phi(z)\]

img
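To make the mapping concrete, here is a minimal NumPy sketch of the sigmoid and its inverse, the log-odds (the function names are my own):

```python
import numpy as np

def sigmoid(z):
    """phi(z) = 1 / (1 + exp(-z)): maps the log-odds z to a probability p."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    """z = log(r) = log(p / (1 - p)): the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

p = 0.8
z = log_odds(p)          # log(0.8 / 0.2) ≈ 1.386, r > 1: more likely positive
print(sigmoid(z))        # recovers p = 0.8
```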

Logistic regression optimisation

  • the typical loss function optimised for logistic regression is
\[J=-\sum_{i}\left[y^{(i)}\log(\phi(z^{(i)}))+(1-y^{(i)})\log(1-\phi(z^{(i)}))\right]\]
  • only one of the log terms is activated per training data point

    • if $y^{(i)}=0$ the loss has a term $-\log(1-\phi(z^{(i)}))$

      • the contribution to the loss is small if $z^{(i)}$ is negative (assigned correctly)
      • the contribution to the loss is large if $z^{(i)}$ is positive (assigned incorrectly)
    • if $y^{(i)}=1$ the loss has a term $-\log(\phi(z^{(i)}))$

      • the contribution to the loss is large if $z^{(i)}$ is negative (assigned incorrectly)
      • the contribution to the loss is small if $z^{(i)}$ is positive (assigned correctly)

img
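As a minimal sketch of this loss (my own helper, assuming labels in $\{0,1\}$ and predicted probabilities strictly between 0 and 1):

```python
import numpy as np

def logistic_loss(y, phi):
    """J = -sum_i [ y_i log(phi_i) + (1 - y_i) log(1 - phi_i) ]."""
    y, phi = np.asarray(y, dtype=float), np.asarray(phi, dtype=float)
    return -np.sum(y * np.log(phi) + (1.0 - y) * np.log(1.0 - phi))

# correctly assigned points contribute little, misassigned points a lot
print(logistic_loss([1, 0], [0.95, 0.05]))   # ≈ 0.10
print(logistic_loss([1, 0], [0.05, 0.95]))   # ≈ 5.99
```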

Gradient descent

To optimise the parameters $w$ of the logistic regression fit to the log-odds, we calculate the gradient of the loss function $J$ and move in the opposite direction.

\(w_j\to w_j-\eta\frac{\partial J}{\partial w_j}\) where $\eta$ is the learning rate; it sets the speed at which the parameters are adapted.

It has to be set empirically.

Finding a suitable $\eta$ is not always easy.

img

Too small a learning rate leads to slow convergence.

img

Too large a learning rate might spoil convergence altogether!

img
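To make the effect of $\eta$ concrete, a toy sketch (a 1-D quadratic rather than the logistic loss, purely illustrative):

```python
# Minimise J(w) = w^2 (gradient dJ/dw = 2w) starting from w = 5,
# to show how the learning rate eta affects convergence.
def gradient_descent(eta, n_steps=20, w=5.0):
    for _ in range(n_steps):
        w -= eta * 2 * w                  # w <- w - eta * dJ/dw
    return w

for eta in (0.01, 0.1, 1.1):
    print(f"eta={eta}: w = {gradient_descent(eta):.4g}")
# eta=0.01: w ≈ 3.3   (too small: slow convergence)
# eta=0.1 : w ≈ 0.058 (reasonable)
# eta=1.1 : w ≈ 192   (too large: the iteration diverges)
```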

Gradient descent for logistic regression

\[\frac{\partial\phi(z)}{\partial z}=-\frac{1}{(1+e^{-z})^{2}}(-e^{-z})=\phi(z)\frac{e^{-z}}{1+e^{-z}}=\phi(z)(1-\phi(z))\]
\[\frac{\partial\phi}{\partial w_{j}}=\frac{\partial\phi(z)}{\partial z}\frac{\partial z}{\partial w_{j}}=\phi(z)(1-\phi(z))x_{j}\]
\[\begin{aligned}
\frac{\partial J}{\partial w_{j}} &= -\sum_{i}\left[y^{(i)}\frac{1}{\phi(z^{(i)})}\frac{\partial\phi}{\partial w_{j}}-(1-y^{(i)})\frac{1}{1-\phi(z^{(i)})}\frac{\partial\phi}{\partial w_{j}}\right] \\
&= -\sum_{i}\left[y^{(i)}(1-\phi(z^{(i)}))-(1-y^{(i)})\phi(z^{(i)})\right]x_{j}^{(i)} \\
&= -\sum_{i}\left(y^{(i)}-\phi(z^{(i)})\right)x_{j}^{(i)}
\end{aligned}\]

where $i$ runs over all data samples, $1\leq i\leq n_{\mathrm{d}}$, and $j$ runs from 0 to $D$, the number of features.
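Putting the pieces together, a minimal NumPy sketch of batch gradient descent for logistic regression using the gradient derived above (the function name, $\eta$, and the number of epochs are my own choices):

```python
import numpy as np

def fit_logistic(X, y, eta=0.1, n_epochs=1000):
    """X: (n_d, D) feature matrix, y: (n_d,) labels in {0, 1}."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x_0 = 1 so w[0] is w_0
    w = np.zeros(Xb.shape[1])
    for _ in range(n_epochs):
        phi = 1.0 / (1.0 + np.exp(-Xb @ w))         # phi(z^(i)) for all samples
        grad = -Xb.T @ (y - phi)                    # dJ/dw_j = -sum_i (y^(i) - phi(z^(i))) x_j^(i)
        w -= eta * grad                             # gradient descent step
    return w
```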

Loss minimization

Algorithms:

  • gradient descent method
  • Newton method: a second-order method

Training:

  • use all the training data to compute the gradient
  • use only part of the training data: stochastic gradient descent (see the sketch below)
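A minimal sketch of the stochastic variant: instead of summing the gradient over all training samples, each update uses a single randomly chosen data point (a mini-batch works the same way):

```python
import numpy as np

def fit_logistic_sgd(X, y, eta=0.1, n_epochs=100, seed=0):
    """Stochastic gradient descent: one sample per parameter update."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # x_0 = 1 trick as before
    w = np.zeros(Xb.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):           # visit samples in random order
            phi_i = 1.0 / (1.0 + np.exp(-Xb[i] @ w))
            w += eta * (y[i] - phi_i) * Xb[i]       # single-sample gradient step
    return w
```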

Example

Two features relevant to the discrimination of benign and malignant tumors:

img

The data is not linearly separable.

We can train a sigmoid model to discriminate between the two types of tumors. It will assign the output class according to the value of

\(z=w_{0}+\sum_{j}w_{j}x_{j}=w_{0}+(x_{1},x_{2})(w_{1},w_{2})^{T}\) where $w_0$, $w_1$ and $w_2$ are chosen to minimise the loss.

img

The decision boundary is linear because $z$ is a linear function of the features.

The logistic regression gives us an estimate of the probability that an example belongs to the first class.

img
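The post does not say which dataset or features it uses; as an illustrative stand-in, here is a sketch with scikit-learn's built-in breast-cancer data, keeping only two features:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data.data[:, :2]                      # 'mean radius' and 'mean texture' only
y = data.target                           # 0 = malignant, 1 = benign

X = StandardScaler().fit_transform(X)     # normalise the features (see the next section)
clf = LogisticRegression().fit(X, y)

print(clf.intercept_, clf.coef_)          # fitted w_0 and (w_1, w_2)
print(clf.predict_proba(X[:3]))           # class probability estimates
```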

Normalising inputs

It is often important to normalise the features. We want the argument of the sigmoid function to be of order one.

  • too large or too small values push the problem into the “flat” parts of the function

    • the gradient is small
    • convergence is slow

It is useful to normalise features such that their mean is 0 and their variance is 1.
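A minimal sketch of this standardisation (scikit-learn's StandardScaler does the same thing):

```python
import numpy as np

def standardise(X):
    """Shift each feature to mean 0 and scale it to variance 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1000.0, 0.01],
              [2000.0, 0.02],
              [3000.0, 0.03]])
Xn = standardise(X)
print(Xn.mean(axis=0), Xn.std(axis=0))    # ≈ [0, 0] and [1, 1]
```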

This post is licensed under CC BY 4.0 by the author.
