Why do we need the activation function \(g\)? What happens if we just use a linear function, like \(g(z) = z\)?

Consider a network with multiple layers, but no non-linear activation functions between them. The output of one layer is just a linear transformation of its input. If we stack these linear layers:

Let the first layer be \(h_1 = W_1 x + b_1\). Let the second layer be \(h_2 = W_2 h_1 + b_2\).

Substituting the first into the second:

\[h_2 = W_2 (W_1 x + b_1) + b_2\]
\[h_2 = W_2 W_1 x + W_2 b_1 + b_2\]

This can be rewritten as:

\[h_2 = (W_2 W_1) x + (W_2 b_1 + b_2)\]

Let \(W_{eq} = W_2 W_1\) and \(b_{eq} = W_2 b_1 + b_2\). Then:

\[h_2 = W_{eq} x + b_{eq}\]

This is just another linear transformation! No matter how many linear layers we stack, the entire network will only be able to compute a single linear function of the input.
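This collapse is easy to check numerically. Below is a minimal NumPy sketch (not part of the original material; the layer sizes and variable names are illustrative) that builds two random linear layers, applies them in sequence, and confirms the result equals the single equivalent layer \(W_{eq} x + b_{eq}\).

```python
# Illustrative sketch: two stacked linear layers collapse into one linear map.
# All shapes and names (W1, b1, W2, b2, W_eq, b_eq) are chosen for this example.
import numpy as np

rng = np.random.default_rng(0)

# Random weights and biases for two linear layers: 3 -> 4 -> 2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Apply the two layers in sequence, with no activation in between
h1 = W1 @ x + b1
h2 = W2 @ h1 + b2

# Collapse into a single equivalent linear layer
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2

print(np.allclose(h2, W_eq @ x + b_eq))  # True: the stack is one linear map
```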

Tip:

A linear network can only learn \(y = mx + b\) (a straight line in 1D), but \(\sin(\pi x)\) is curved, so it cannot be represented by linear transformations alone.

To approximate complex, non-linear functions like \(\sin(\pi x)\), we must introduce non-linearity using activation functions between the layers.
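As one illustration (an assumption-laden sketch, not part of the original material), the following code trains a tiny one-hidden-layer network with a \(\tanh\) activation to fit samples of \(\sin(\pi x)\), using full-batch gradient descent written by hand in NumPy. The hidden width, learning rate, and iteration count are arbitrary illustrative choices.

```python
# Hedged sketch: a 1 -> 16 -> 1 network with a tanh non-linearity fit to
# sin(pi * x) by plain gradient descent. Hyperparameters are untuned choices.
import numpy as np

rng = np.random.default_rng(0)

# Training data: x in [-1, 1], target y = sin(pi * x)
x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
y = np.sin(np.pi * x)

# Parameters of the two layers
W1 = rng.normal(size=(1, 16))
b1 = np.zeros(16)
W2 = rng.normal(size=(16, 1))
b2 = np.zeros(1)

lr = 0.2
for step in range(10_000):
    # Forward pass: the tanh between the layers supplies the non-linearity
    z1 = x @ W1 + b1          # (200, 16)
    h1 = np.tanh(z1)          # (200, 16)
    y_hat = h1 @ W2 + b2      # (200, 1)

    # Mean-squared-error loss and its gradients (backpropagation by hand)
    err = y_hat - y
    grad_y_hat = 2 * err / len(x)        # d(MSE)/d(y_hat)
    grad_W2 = h1.T @ grad_y_hat
    grad_b2 = grad_y_hat.sum(axis=0)
    grad_h1 = grad_y_hat @ W2.T
    grad_z1 = grad_h1 * (1 - h1 ** 2)    # tanh'(z) = 1 - tanh(z)^2
    grad_W1 = x.T @ grad_z1
    grad_b1 = grad_z1.sum(axis=0)

    # Gradient-descent update
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    W2 -= lr * grad_W2; b2 -= lr * grad_b2

# A purely linear model cannot drive this error toward zero on a curved target
print("final MSE:", np.mean((y_hat - y) ** 2))
```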
