Activation Functions in Neural Networks: What are they? How do they work? And where to use them?

Vishnu Kakaraparthi
8 min read · Feb 8, 2019

Activation functions are essential for an Artificial Neural Network to learn complicated, non-linear mappings between the inputs and the response/output variable. They introduce non-linear properties to the network. Their main purpose is to convert the input signal of a node in an Artificial Neural Network into an output signal, which is then used as an input by the next layer in the stack. An activation function is also known as a Transfer Function.

An activation function decides whether a neuron should be “fired” or not. Its input is a “weighted sum” of the inputs plus the bias.

The input to the neuron is:

Y = Σ (weight × input) + bias

Y can be anything ranging from -inf to +inf. The activation function checks the Y value produced by a neuron and decides whether outside connections should consider this neuron as “fired” or not, i.e. “activated” or not.
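As a minimal sketch (the function and variable names here are illustrative, not from any particular library), a neuron computes this weighted sum and then passes it through an activation function:

```python
import numpy as np

def neuron_output(inputs, weights, bias, activation):
    # Y = sum(weight * input) + bias -- can range from -inf to +inf
    y = np.dot(weights, inputs) + bias
    # The activation function decides whether/how strongly the neuron "fires"
    return activation(y)
```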

An activation function can be linear or non-linear.

Types of activation functions:

1) Step function:

Also known as the binary step function.

If the value of Y is above a certain threshold, the neuron is declared activated; if it is below the threshold, it is not.
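For instance, a binary step with a hypothetical threshold of 0 could be sketched as:

```python
import numpy as np

def binary_step(y, threshold=0.0):
    # Outputs 1 ("activated") when y is at or above the threshold, else 0
    return np.where(y >= threshold, 1, 0)

print(binary_step(np.array([-2.0, 0.5, 3.0])))  # -> [0 1 1]
```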

Advantages:

  • Great for binary classification.

Disadvantages:

  • The gradient of the step function is zero. This makes the step function of little use for training, since back-propagation relies on the gradients of the activation functions to compute the error updates that improve and optimize the results.
  • It cannot be used for multi-class classification.

2) Linear function

Advantages:

  • The linear function might be ideal for simple tasks where interpretability is highly desired.

Disadvantages:

  • The derivative of a linear function f(x) = ax is the constant a; it does not depend on the input value x. This means that every backpropagation pass produces the same gradient, so we are not really improving the error.
  • If each layer applies a linear transformation, then no matter how many layers we stack, the final output is nothing but a linear transformation of the input (see the sketch after this list).
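A quick sketch of that second point, using random placeholder weights: two stacked linear layers collapse into a single equivalent linear transformation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # layer 1: W1 @ x + b1
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # layer 2: W2 @ h + b2

two_layers = W2 @ (W1 @ x + b1) + b2
# Same mapping as one linear layer with W = W2 @ W1 and b = W2 @ b1 + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, one_layer))  # True
```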

3) Sigmoid Function

Also known as the logistic function, σ(x) = 1 / (1 + e^(−x)). It is an S-shaped curve that squashes its input into the range (0, 1).
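A minimal sketch of the sigmoid and its derivative:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x)), squashing x into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative: sigma(x) * (1 - sigma(x)); it approaches 0 for large |x|,
    # which is the saturation / vanishing-gradient issue noted below
    s = sigmoid(x)
    return s * (1.0 - s)
```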

Advantages:

  • This is a smooth function and is continuously differentiable.
  • It is non-linear. Hence, the output is non-linear as well.
  • It is easy to understand and apply.
  • Its derivative is easy to compute: σ′(x) = σ(x)(1 − σ(x)).

Disadvantages:

  • Vanishing gradient problem. Sigmoids saturate and kill gradients.
  • The output is not zero-centered (0 < output < 1), so the gradients on the weights tend to be all positive or all negative, which makes optimization harder.
  • Sigmoids have slow convergence.

4) Tanh Function

The tanh (hyperbolic tangent) function is very similar to the sigmoid function. It is actually just a scaled and shifted version of the sigmoid: tanh(x) = 2σ(2x) − 1, so its output lies in the range (−1, 1).
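A short sketch checking that relationship numerically:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
# tanh is a scaled, shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True
```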

Advantages:

  • It is continuous and differentiable at all points.
  • Because its output is zero-centered (ranging from −1 to 1), it solves the sigmoid's problem of all the outputs having the same sign.
  • The function is non-linear, so we can easily backpropagate the errors.

Disadvantages:

  • Vanishing gradient problem.
  • Away from zero the gradients become small, so learning can still be slow.

Therefore, in practice the tanh non-linearity is always preferred to the sigmoid non-linearity.

Sigmoid Vs Tanh

5) Rectified Linear Unit (ReLU)

ReLU is defined as f(x) = max(0, x): it passes positive inputs through unchanged and outputs zero for negative inputs.
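A minimal sketch of ReLU and its gradient:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs and 0 for negative inputs
    # (undefined at exactly 0; in practice either 0 or 1 is used there)
    return (x > 0).astype(float)
```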

Advantages:

  • ReLU function is non-linear, which means we can easily backpropagate the errors and have multiple layers of neurons being activated by the ReLU function.
  • It was found to greatly accelerate the convergence of stochastic gradient descent compared to the sigmoid and tanh functions.
  • It does not activate all the neurons at the same time. Since the output of some neurons is zero, only a few neurons are activated making the network sparse, efficient and easy for computation.

Disadvantages:

  • It is non-differentiable at zero, and its output is unbounded on the positive side.
  • The gradients for negative input are zero, which means for activations in that region, the weights are not updated during backpropagation. This can create dead neurons that never get activated. This can be handled by reducing the learning rate and bias.
  • ReLU output is not zero-centered, and this hurts neural network performance. The gradients of the weights during backpropagation will be either all positive or all negative, which can introduce undesirable zig-zagging dynamics in the gradient updates for the weights. This can be handled by batchnorm; because the gradients are added up across a batch of data, the final update for the weights can have variable signs, somewhat mitigating this issue.
  • The mean value of activation is not zero. From ReLU, there is a positive bias in the network for subsequent layers, as the mean activation is larger than zero. Though they are less computationally expensive compared to sigmoid and tanh because of simpler computations, the positive mean shift in the next layers slows down learning.

Keep in mind that the ReLU function is typically used only in the hidden layers.

6) Leaky ReLU or Maxout function and Parametric ReLU (PReLU)

If the output of a ReLU is consistently 0 (for example, if the ReLU has a large negative bias), then the gradient through it will consistently be 0. The error signal backpropagated from later layers gets multiplied by this 0, so no error signal ever passes to earlier layers. The ReLU has died. To overcome this Leaky ReLU and PReLU are introduced.

Leaky ReLU keeps a small slope for negative inputs: f(x) = x for x > 0 and f(x) = a·x for x ≤ 0. If a = 0.01 it is Leaky ReLU; if a is a learnable parameter it is PReLU.
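A sketch of both variants (in PReLU the slope a would be learned by backpropagation; here it is simply passed in):

```python
import numpy as np

def leaky_relu(x, a=0.01):
    # f(x) = x for x > 0, a * x otherwise -- the small slope keeps gradients alive
    return np.where(x > 0, x, a * x)

def prelu(x, a):
    # Same form as Leaky ReLU, but `a` is a learnable parameter
    return np.where(x > 0, x, a * x)
```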

Advantages:

  • No dying ReLU units.
  • It speeds up training. There is evidence that having the “mean activation” be close to 0 makes training faster.

Disadvantages:

  • It saturates for large negative values, allowing them to be essentially inactive.

The result is not always consistent. Leaky ReLU isn’t always superior to ReLU and should be considered only as an alternative when you see a lot of dead neurons.

A ReLU whose input is perturbed with Gaussian noise is called a noisy ReLU.

7) Randomized Leaky ReLU(RReLU)

In RReLU, the negative-slope coefficient is not fixed but sampled randomly from a uniform distribution during training, and fixed to its average at test time.

8) Exponential Linear Unit (ELU) and Scaled Exponential Linear Unit (SELU)

ELU:

The ELU function is defined as f(x) = x for x > 0, and f(x) = α(e^x − 1) for x ≤ 0.

SELU:

Scaling the ELU output by λ = 1.0507 with α = 1.67326 gives the Scaled Exponential Linear Unit (SELU).
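A minimal sketch of both, following the definitions above:

```python
import numpy as np

def elu(x, alpha=1.0):
    # f(x) = x for x > 0, alpha * (exp(x) - 1) otherwise
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, lam=1.0507, alpha=1.67326):
    # SELU is the ELU scaled by lambda, using the constants above
    return lam * elu(x, alpha)
```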

9) Concatenated ReLU (CReLU)

Concatenated ReLU has two outputs: one ReLU of the input and one ReLU of the negated input, concatenated together. In other words, for positive x it produces [x, 0], and for negative x it produces [0, −x]. Because it has two outputs, CReLU doubles the output dimension.
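A sketch of the concatenation (note the doubled output dimension):

```python
import numpy as np

def crelu(x, axis=-1):
    # Concatenate ReLU applied to x and to -x, doubling the output dimension
    return np.concatenate([np.maximum(0.0, x), np.maximum(0.0, -x)], axis=axis)

print(crelu(np.array([2.0, -3.0])))  # -> [2. 0. 0. 3.]
```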

CReLU has been shown to perform very well in Convolutional Neural Networks. It preserves both positive and negative phase information while enforcing non-saturated non-linearity. The unique nature of CReLU allows a mathematical characterization of convolution layers in terms of a reconstruction property, which is an important indicator of how expressive and generalizable the corresponding CNN features are.

CReLU naturally reduces the redundancy of learning separate filters that differ in phase only.

CReLU is an activation scheme. The element-wise ReLU non-linearity after concatenation can be substituted by other activation functions (e.g., Leaky ReLU).

10) ReLU-6

It is a ReLU but capped at the value of 6 thus making it bounded.

The value 6 is an arbitrary choice that worked well. The upper bound encourages the model to learn sparse features early.
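A one-line sketch of the capped function:

```python
import numpy as np

def relu6(x):
    # ReLU capped at 6: f(x) = min(max(0, x), 6)
    return np.minimum(np.maximum(0.0, x), 6.0)
```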

11) Maxout

Maxout generalizes the ReLU and its leaky version: a maxout unit computes f(x) = max(w1·x + b1, …, wk·x + bk).

Maxout is equivalent to ReLU when w1 =0 and b1 = 0.

The wi and bi are the learnable parameters, and k is the number of linear pieces we want to take the maximum over.

The Maxout neuron enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU). However, unlike the ReLU neurons, it doubles the number of parameters for every single neuron, leading to a high total number of parameters.
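A sketch of a maxout unit with k = 2 linear pieces (the weights here are random placeholders, not learned values):

```python
import numpy as np

def maxout(x, W, b):
    # W has shape (k, d), b has shape (k,): take the max over the k affine pieces
    return np.max(W @ x + b, axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=5)
W = rng.normal(size=(2, 5))   # k = 2 linear pieces
b = rng.normal(size=2)
print(maxout(x, W, b))

# With w1 = 0 and b1 = 0, maxout reduces to ReLU of the remaining piece
W[0], b[0] = 0.0, 0.0
print(np.isclose(maxout(x, W, b), max(0.0, W[1] @ x + b[1])))  # True
```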

Maxout with k=2.

Maxout with k=4.

12) Softmax

Softmax is not a traditional activation function. Other activation functions produce a single output for a single input. In contrast, softmax produces a vector of outputs for an input array. Thus softmax can be used to build neural network models that classify among more than two classes, rather than only binary problems.

The softmax function squeezes the output for each class into the range (0, 1), and the sum of the outputs is always 1.
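A minimal, numerically stable sketch:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the probabilities are unchanged
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)
print(p, p.sum())  # outputs lie in (0, 1) and sum to 1
```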
