Activation functions are applied element-wise to the output of a linear transformation within a neuron or layer. They introduce the non-linearity required for neural networks to learn complex mappings.
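For concreteness, here is a minimal NumPy sketch of a layer's linear transformation followed by an element-wise activation. The shapes, the random values, and the choice of tanh are illustrative assumptions, not part of the material above:

```python
import numpy as np

# Minimal sketch: all shapes and values here are illustrative.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # weight matrix: 4 output units, 3 input features
b = np.zeros(4)                   # bias vector
x = rng.standard_normal(3)        # a single input vector

z = W @ x + b                     # linear transformation (pre-activation)
a = np.tanh(z)                    # activation applied element-wise to each entry of z
print(z.shape, a.shape)           # both (4,): the activation does not change the shape
```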

Some common activation functions include:

Sigmoid: Squashes input to \((0, 1)\). Useful for binary classification output. Can suffer from vanishing gradients.

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]
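A minimal sketch of this formula, assuming NumPy (the helper name `sigmoid` is an illustrative choice, not from the text above):

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # outputs lie in (0, 1)
```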

Tanh: Squashes input to \((-1, 1)\). Zero-centered and often preferred over Sigmoid for hidden layers. Can also suffer from vanishing gradients.

\[\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\]
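A minimal sketch, again assuming NumPy. Computing the exponentials directly mirrors the formula above, though `np.tanh` is the more numerically robust choice in practice:

```python
import numpy as np

def tanh_from_formula(x):
    """Element-wise tanh computed straight from the definition."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(tanh_from_formula(x))   # outputs lie in (-1, 1)
print(np.tanh(x))             # NumPy's built-in agrees (and is more stable)
```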

ReLU (Rectified Linear Unit): Outputs the input directly if it is positive and zero otherwise. Computationally efficient, and helps mitigate vanishing gradients for positive inputs. Can suffer from 'dead neurons' when a unit's pre-activation is always negative, since the gradient there is zero.

\[f(x) = \max(0, x)\]
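A minimal NumPy sketch of this formula (the helper name `relu` is an illustrative choice):

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: max(0, x)."""
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, 0.0, 2.0])))  # negative inputs map to 0
```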

LeakyReLU: Similar to ReLU, but allows a small gradient for negative inputs, preventing dead neurons.

\[f(x) = \max(\alpha x, x) \quad (\alpha \text{ is a small positive constant, e.g., 0.01})\]
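A minimal NumPy sketch, using \(\alpha = 0.01\) as in the note above (the helper name `leaky_relu` is an illustrative choice):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Element-wise Leaky ReLU: max(alpha * x, x) for a small alpha > 0."""
    return np.maximum(alpha * x, x)

print(leaky_relu(np.array([-2.0, 0.0, 2.0])))  # negative inputs keep a small slope
```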
 