Perceptron
Perceptron is a kind of simple binary classification algorithm, which can only be applied on binary dataset with linearly separable. The essence of perceptron is training a hyper plane to separate the data, the process is minimizing loss function by gradient descent algorithm. Although perceptron has plenty of limitations, it is the foundation of neural network and support vector machine.
The function of perceptron is extremely simple:
\[f(x) = sign(wx+b)\]
\[sign(x)=\begin{cases} +1 & x \ge 0 \\ -1 & x < 0 \\ \end{cases}\]
The loss function
The loss function of perceptron is the sum of the distance of all the misclassification points. It is easy to come out, the value of loss function will become to zero when there is no misclassification points by the current hyperplane.
According to the formula of Distance From a Point a Line, the distance should be: \[ distance = {1 \over ||w||}|wx+b| \]
For the misclassification points, since the results of \(wx_i+b\) are always opposite with \(y_i\), therefore, we have: \[ -y_i(wx_i+b) > 0 \] So for the misclassification points, the total distance can be changed to: \[ distance = -{1 \over ||w||} \sum_{i=1}^N y_i(wx_i+b) \] Since the term \(||w||\) is a constant in the formula, we can define the loss function as: \[ L(w, b) = -\sum_{i=1}^N y_i(wx_i+b) \] Which means the value of loss function will be zero if there is no misclassification, and the loss will become
lower when the misclassification points are getting closer to the hyperplane.
Algorithm
Here we need to minimize the above loss function by using gradient descent approach. The gradient of the two variables are shown below: \[ \nabla_w L(w,b) = -\sum_{i=1}^N y_ix_i \\ \nabla_b L(w,b) = -\sum_{i=1}^N y_i \] In the process of minimizing, we only select one misclassification point randomly to adjust the variables: \[ w = w + \eta y_ix_i \\ b = b + \eta y_i \] To sum up, the steps of the algorithm can be described as:
- initialize \(w\) and \(b\)
- for each training points, calculate \(y_i(wx_i+b)\), if the result is less or equal than zero, adjust the two variables
- loop on step 2 until there is no misclassification points
The duality format
During this approach, we are going to solve the problem on another side. We have known that: \[ w = w + \eta y_ix_i \\ b = b + \eta y_i \] In this process, we are trying to modify the value of the two variables, assume that the variables become to \(\alpha_iy_ix_i\) and \(\alpha_iy_i\) for the misclassification point \((x_i, y_i)\) after \(n\) times changes, and \(\alpha_i = n_i \eta\), therefore, the final results of two variables can be represented as: \[ w = \sum_{i-1}^N\alpha_iy_ix_i \\ b = \sum_{i=1}^N\alpha_iy_i \] The variable \(\alpha_i\) represents the number of changes due to the misclassification point \((x_i, y_i)\). According to this method, the algorithm will minimize the variable \(\alpha\) instead of \(w\). The model has been changed to this format: \[ f(x) = sign(\sum_{i=1}^N \alpha_iy_ix_i·x + b) \] The steps can be described as:
\(\alpha = (\alpha_1, \alpha_2, ... , \alpha_N)^T\), \(\alpha=0\), \(b=0\)
for each training points, calculate \(y_i(\sum_{j_1}^N \alpha_j y_j x_j · x_i + b)\), if the result is less or equal than zero, adjust the variables as: \[ \alpha_i = \alpha_i + \eta \\ b = b + \eta y_i \]
loop on step 2 until there is no misclassification points
From the above formula, it can be seen that the training data \(x_i\) and \(x_j\) only occur in the inner product format, for the convenience of computation, we can calculate the inner product of the training data and store them in a gram matrix first: \[ G = [x_i · x_j]_{N*N} \]