1 Forward Propagation

This section is a summary rather than a full derivation.

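We set $a^{[0]} = x$ as the input to the network and let $\ell = 1, 2, \dots, N$, where $N$ is the number of layers in the network. Each layer $\ell$ then computes its activation $a^{[\ell]}$ from $a^{[\ell-1]}$ using an activation function $g^{[\ell]}$.

The activation function $g^{[\ell]}$ is the same for all layers except the last. For the last layer, the choice depends on the task: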

    1 regression: $g(x) = x$

    2 binary classification: $g(x) = \mathrm{sigmoid}(x)$

    3 multi-class classification: $g(x) = \mathrm{softmax}(x)$
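The forward pass described above can be sketched in plain Python. The sketch assumes each layer computes $z^{[\ell]} = W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}$ followed by $a^{[\ell]} = g^{[\ell]}(z^{[\ell]})$. The names `weights`, `biases`, `affine`, and `forward`, the ReLU hidden activation, and the parameter values are illustrative assumptions, not taken from the text.

```python
import math

def identity(z):
    """Output activation for regression: g(x) = x."""
    return list(z)

def sigmoid(z):
    """Element-wise logistic function, used for binary classification."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def softmax(z):
    """Softmax over the whole vector; subtracting the max avoids overflow."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def relu(z):
    """Hidden-layer activation, chosen only for illustration."""
    return [max(0.0, v) for v in z]

def affine(W, b, a):
    """Compute W a + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + bi for row, bi in zip(W, b)]

def forward(x, weights, biases, hidden_g, output_g):
    """Return a^[N], using hidden_g on layers 1..N-1 and output_g on layer N."""
    a = list(x)  # a^[0] = x
    n_layers = len(weights)
    for ell, (W, b) in enumerate(zip(weights, biases), start=1):
        z = affine(W, b, a)
        g = output_g if ell == n_layers else hidden_g
        a = g(z)
    return a

# Tiny 2-layer network with made-up parameters.
weights = [[[1.0, -1.0], [0.5, 0.5]], [[1.0, 1.0], [-1.0, 2.0]]]
biases = [[0.0, 0.0], [0.0, 0.0]]
x = [2.0, 1.0]

y_regression = forward(x, weights, biases, relu, identity)  # [2.5, 2.0]
y_multiclass = forward(x, weights, biases, relu, softmax)   # entries sum to 1
```

Only the last layer's activation changes between tasks, so the same `forward` function covers all three cases by swapping `output_g`.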

Finally, we take the output of the network, $\hat{y} = a^{[N]}$, and compute its loss.

For regression, we have:
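For a scalar target $y$ and prediction $\hat{y} = a^{[N]}$, a common choice is the squared error. The factor of $\tfrac{1}{2}$ is a convention assumed here, not something fixed by the text:

$$
L(\hat{y}, y) = \frac{1}{2}\,(\hat{y} - y)^2
$$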

For binary classification, we have:
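Assuming a label $y \in \{0, 1\}$ and a sigmoid output $\hat{y} \in (0, 1)$, the standard choice is the logistic (binary cross-entropy) loss:

$$
L(\hat{y}, y) = -\big[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\big]
$$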

For multi-class classification, we have:
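Assuming a label $y \in \{1, \dots, k\}$ and a softmax output $\hat{y}$ with components $\hat{y}_j$, the usual form is the negative log-probability assigned to the correct class:

$$
L(\hat{y}, y) = -\sum_{j=1}^{k} \mathbf{1}\{y = j\} \log \hat{y}_j
$$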

Note that in the multi-class case, if $\hat{y}$ is a $k$-dimensional vector, its loss can be computed as the cross-entropy:
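With a one-hot label vector $y$ (an assumption about the label encoding), the cross-entropy is $-\sum_{j=1}^{k} y_j \log \hat{y}_j$. The three losses can be sketched in plain Python as follows. The function names are illustrative, and the inputs are assumed to be valid probabilities strictly inside $(0, 1)$ wherever a logarithm is taken.

```python
import math

def squared_error(y_hat, y):
    """Regression loss for a scalar prediction, with the conventional 1/2 factor."""
    return 0.5 * (y_hat - y) ** 2

def binary_cross_entropy(y_hat, y):
    """Binary classification loss; y is 0 or 1, y_hat is a sigmoid output."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def cross_entropy(y_hat, y_onehot):
    """Multi-class loss; y_hat is a softmax output, y_onehot a one-hot label."""
    # Terms with a zero label contribute nothing, so they are skipped.
    return -sum(t * math.log(p) for t, p in zip(y_onehot, y_hat) if t > 0)

loss_reg = squared_error(3.0, 1.0)                      # 2.0
loss_bin = binary_cross_entropy(0.5, 1)                 # log(2)
loss_mc = cross_entropy([0.7, 0.2, 0.1], [1, 0, 0])     # -log(0.7)
```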

2 Backpropagation

We define:
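In the usual notation, the quantity defined at this point is the gradient of the loss with respect to a layer's pre-activation. The symbols $\delta^{[\ell]}$ and $z^{[\ell]}$ (with $a^{[\ell]} = g^{[\ell]}(z^{[\ell]})$) are assumed here rather than taken from the text:

$$
\delta^{[\ell]} = \nabla_{z^{[\ell]}} L(\hat{y}, y)
$$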

This gives three steps for computing the gradient at any layer:

1 For the output layer $N$, we have:

Softmax is not applied element-wise, so its gradient can be calculated directly for the output vector as a whole. Sigmoid is applied element-wise, so we need to:

Note that this is an element-wise operation.

2 For $\ell = N-1,N-2,\dots,1$, we have:

3 For each layer, we have:
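Written out under assumed notation, the three steps take the standard form below. Here $W^{[\ell]}$ and $b^{[\ell]}$ are the layer parameters, $z^{[\ell]} = W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}$, $\delta^{[\ell]} = \nabla_{z^{[\ell]}} L$, and $\odot$ is the element-wise product; none of these symbols are fixed by the text above.

$$
\begin{aligned}
\text{1.}\quad & \delta^{[N]} = \nabla_{\hat{y}} L \odot {g^{[N]}}'(z^{[N]}) && \text{(element-wise output activation, e.g. sigmoid)} \\
\text{2.}\quad & \delta^{[\ell]} = \big( (W^{[\ell+1]})^{\top} \delta^{[\ell+1]} \big) \odot {g^{[\ell]}}'(z^{[\ell]}) && \ell = N-1, \dots, 1 \\
\text{3.}\quad & \nabla_{W^{[\ell]}} L = \delta^{[\ell]} (a^{[\ell-1]})^{\top}, \qquad \nabla_{b^{[\ell]}} L = \delta^{[\ell]} && \ell = 1, \dots, N
\end{aligned}
$$

For a softmax output paired with the cross-entropy loss and a one-hot $y$, step 1 simplifies to $\delta^{[N]} = \hat{y} - y$.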

These steps can be used directly as formulas when writing code.
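As a minimal sketch of how the three steps translate into code, the example below uses plain Python lists. It assumes sigmoid activations on every layer (so $g'(z) = a(1 - a)$) and a squared-error loss; the function names, network shape, and parameter values are all illustrative.

```python
import math

def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def forward(x, weights, biases):
    """Return the activations a^[0], ..., a^[N]; every layer uses sigmoid here."""
    activations = [list(x)]
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        z = [sum(w * a for w, a in zip(row, a_prev)) + bi for row, bi in zip(W, b)]
        activations.append(sigmoid(z))
    return activations

def loss(y_hat, y):
    """Squared error summed over the output units."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(y_hat, y))

def backward(x, y, weights, biases):
    """Return (grad_W, grad_b), each a list with one entry per layer."""
    activations = forward(x, weights, biases)
    n_layers = len(weights)
    grad_W = [None] * n_layers
    grad_b = [None] * n_layers

    # Step 1: output layer. The loss gradient (p - t) is multiplied
    # element-wise by the sigmoid derivative p * (1 - p).
    a_out = activations[-1]
    delta = [(p - t) * p * (1.0 - p) for p, t in zip(a_out, y)]

    for ell in range(n_layers, 0, -1):
        a_prev = activations[ell - 1]
        # Step 3: parameter gradients for layer ell (outer product delta a_prev^T).
        grad_W[ell - 1] = [[d * a for a in a_prev] for d in delta]
        grad_b[ell - 1] = list(delta)
        if ell > 1:
            # Step 2: push delta back through W^T, then multiply by g'.
            W = weights[ell - 1]
            back = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                    for j in range(len(a_prev))]
            delta = [s * a * (1.0 - a) for s, a in zip(back, a_prev)]
    return grad_W, grad_b

# Made-up 2-3-1 network and a single training example.
weights = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [[0.3, -0.1, 0.2]]]
biases = [[0.0, 0.1, -0.1], [0.05]]
x, y = [1.0, 2.0], [1.0]
grad_W, grad_b = backward(x, y, weights, biases)
```

A quick way to gain confidence in such an implementation is to compare each entry of `grad_W` against a finite-difference estimate of the loss.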