The output of a neuron in a neural network is obtained by combining the input to that neuron with the weights (trainable parameters) applied to that input. Sometimes a bias term is added – a constant vector that doesn’t interact with the training data; instead, it indicates a preference for some classes over others (e.g. training data with more images of cats than dogs would mean the bias term for cats takes a higher value). The weights contain information learned by being adjusted during training in response to feedback signals; this takes place within a training loop, where the following steps are repeated for as long as required:
1. Take a batch of training samples, $x$, and their labels, $y$;
2. Do a forward pass through the network to generate the predictions, $y_{pred}$;
3. Compute the network’s loss on that batch (quantification of the divergence between the predictions, $y_{pred}$, and the expected labels, $y$);
4. Update all the weights in the network in such a manner that reduces the loss on the batch.
Networks with a minimal loss on their training data have “learned” to map inputs to their correct labels. Provided that all the operations used in the network are differentiable, we can compute the gradient (a generalisation of the derivative to functions of several variables) of the loss with respect to the weight coefficients. By moving these coefficients in the opposite direction from the gradient, we can reduce the loss [1, pp.46-47].
Taking a continuous function of the form $f(x) = y$, which maps the real number $x$ to a new real number $y$, it can be noted that a small change in $x$ will cause a small change in $y$. The slope of the function at any given point is the rate of change of $y$ with respect to $x$ – this is known as the derivative. The derivative describes how the value of an operation changes as you modify $x$; moving $x$ in the opposite direction of the derivative decreases the value of the operation (and vice versa) [1, pp.47-48].
We can compute the gradient of a loss function numerically or analytically. Using a finite difference approximation (an operation that iterates over all dimensions of our loss function, makes a small change along each, and works out the derivative along that dimension from the difference in the function value) we can calculate the numerical gradient easily, but it is only approximate and expensive to compute. Calculus is used to derive the analytic gradient, which is exact and fast to evaluate, but more error-prone to implement; therefore, we normally implement the analytic gradient and compare it to the numerical gradient (called a ‘gradient check’).
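As a rough illustration of the gradient check described above, the sketch below compares a hand-derived gradient with a finite-difference estimate. The quadratic `loss` function and the helper names are hypothetical and chosen only for the example; centred differences are used here, which is one common variant of the finite difference approximation.

```python
import numpy as np

def loss(w):
    # Hypothetical loss used only for illustration: a simple quadratic.
    return np.sum(w ** 2) + np.sum(w[:-1] * w[1:])

def analytic_grad(w):
    # Gradient of the loss above, derived by hand with calculus.
    g = 2 * w
    g[:-1] += w[1:]
    g[1:] += w[:-1]
    return g

def numerical_grad(f, w, h=1e-5):
    # Centred finite differences: perturb one dimension at a time.
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = h
        grad[i] = (f(w + step) - f(w - step)) / (2 * h)
    return grad

w = np.array([0.5, -1.2, 2.0])
num = numerical_grad(loss, w)
ana = analytic_grad(w)
# Relative error; a small value suggests the analytic gradient is implemented correctly.
rel_error = np.max(np.abs(num - ana) / np.maximum(1e-8, np.abs(num) + np.abs(ana)))
```

The numerical version needs two loss evaluations per dimension, which is why it is normally kept for checking rather than for training.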
For loss functions that are non-differentiable, we can use the subgradient (a generalisation of the derivative to functions that are not differentiable everywhere) instead.
A minimum of a function is a point where the derivative is 0; the lowest minimum can be located by checking all the points where the derivative is 0 and finding the point for which the function has the lowest value. Applying this principle to a neural network, we can amend the training loop:
1. Take a random (stochastic) batch of training samples, $x$, and their labels, $y$;
2. Do a forward pass through the network to generate the predictions, $y_{pred}$;
3. Compute the network’s loss on that batch (quantification of the divergence between the predictions, $y_{pred}$, and the expected labels, $y$);
4. Calculate the gradient of the loss on the current batch with respect to the network’s weights (a backward pass);
5. Slightly move the parameters in the opposite direction of the gradient (reducing the loss on the batch).
The update in the final step can be written as $W \leftarrow W - \text{step} \times \text{gradient}$, where $W$ is the weights and $\text{step}$ is a scaling factor (also called the learning rate).
It’s important to mention that the $\text{step}$ factor shouldn’t have a value that’s too small – the descent down the curve will take many iterations – or too large – updates to the weights could lead to essentially random positions on the curve. Also note that true SGD uses a single sample at each iteration, while batch SGD uses all the data at each iteration; each batch SGD update is more accurate but far more expensive, which is why mini-batch SGD is the usual compromise [1, pp.48-50].
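The amended training loop can be sketched as follows. This is a minimal example under stated assumptions: the data are synthetic, the "network" is a single linear unit with a mean-squared-error loss, and the names `w`, `b`, `learning_rate` and `batch_size` are illustrative rather than taken from the cited sources.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 3x - 1 plus a little noise.
X = rng.normal(size=(512, 1))
y = 3.0 * X[:, 0] - 1.0 + 0.01 * rng.normal(size=512)

w, b = 0.0, 0.0          # trainable parameters
learning_rate = 0.1      # the scaling ("step") factor applied to the gradient
batch_size = 32

for step in range(300):
    # 1. Take a random batch of samples and their labels.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx, 0], y[idx]
    # 2. Forward pass: generate predictions.
    pred = w * xb + b
    # 3. Loss on the batch (mean squared error).
    loss = np.mean((pred - yb) ** 2)
    # 4. Backward pass: gradient of the loss with respect to each parameter.
    grad_w = np.mean(2 * (pred - yb) * xb)
    grad_b = np.mean(2 * (pred - yb))
    # 5. Move the parameters against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```

Setting `batch_size` to 1 would give true SGD and setting it to `len(X)` would give batch SGD.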
Backpropagation is the recursive application of the chain rule to the calculation of the gradient values of a neural network; it starts with the final loss value and does a backward pass through the network (from top layers to bottom layers), employing the chain rule to compute each parameter’s influence on the loss value [1, p.51].
In a forward pass through the network, we compute the outcome of an operation on each neuron and retain any intermediates required for calculating the gradient in memory, as the local inputs and the direct outputs of each neuron are only known to that neuron during backpropagation. At this neuron, we can compute the local gradient(s) – that of the output of that node with respect to its input(s).
During backpropagation, at each neuron, we have an upstream gradient, taken with respect to the immediate output of that neuron, coming back. Using the chain rule, $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial x}$, we take the upstream gradient and multiply it by the local gradient in order to get the gradient with respect to the input, which we then send back to the connected neurons. For instance, if we had a neuron of input $x$ and output $z$, and a loss of $L$, our upstream gradient would be $\frac{\partial L}{\partial z}$, which we would multiply by the local gradient, $\frac{\partial z}{\partial x}$, using the chain rule to give us the gradient with respect to the input, $\frac{\partial L}{\partial x}$, which we would pass back. $\frac{\partial L}{\partial x}$ is also said to be the effect of $x$ on $L$.
It helps to write any function out as a computational graph; we can express the nodes in the graph at any level of detail that we want (e.g. the number of inputs at each node). Provided we can work out its gradient, we can group several nodes into one more complex node. Given the values for each term in the function, we can perform a forward pass through the graph and calculate the value at each stage in the computation. Using known expressions for derivatives, we can substitute the inputs to each computational node into the corresponding derivative equation to work out the local gradient at each node, which we chain with the upstream value (the gradient arriving from the node that follows the current one in the forward pass) during a backward pass.
It’s worth noting that with a max gate the gradient is routed to one of the nodes (as opposed to an addition node passing the same gradient back to all the incoming branches); only the branch with the maximum incoming value affected the rest of the computational graph, so it’s the one that gets the gradient. If there’s more than one gradient flowing into a node, we sum up the gradients to find the total upstream gradient – both connected nodes would be affected by a change to the node during a forward pass; therefore, that node would be affected by both gradients during a backward pass.
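The behaviour of the addition node, the max gate and the summing of gradients can be traced by hand on a tiny graph. The function $f = (x + y) \cdot \max(x, z)$ and its input values below are hypothetical, chosen so that $x$ feeds two nodes.

```python
x, y, z = 2.0, 1.0, -4.0

# Forward pass: compute and keep every intermediate value.
q = x + y            # addition node
m = max(x, z)        # max gate
f = q * m            # multiplication node

# Backward pass: start from df/df = 1 and apply the chain rule node by node.
df = 1.0
dq = m * df          # local gradient of q*m with respect to q is m
dm = q * df          # local gradient of q*m with respect to m is q
# Addition node: passes the same upstream gradient to both inputs.
dx_from_add = 1.0 * dq
dy = 1.0 * dq
# Max gate: routes the gradient only to the larger input.
dx_from_max = dm if x >= z else 0.0
dz = dm if z > x else 0.0
# x feeds two nodes, so its gradients are summed.
dx = dx_from_add + dx_from_max
```

Near these inputs the function reduces to $x^2 + xy$, whose derivative with respect to $x$ is $2x + y = 5$, which agrees with the value of `dx`.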
When working with vectors, the local gradient becomes a Jacobian matrix (the derivative of each element of the output with respect to each element of the input). In practice, we rarely need to form the full Jacobian; we only need the effect each element of the input has on the output when calculating the gradient. A vector and its gradient will always be the same size; each element of the gradient tells us how much the corresponding element of the vector affects the final function output.
A broad interpretation is that neural networks are a class of functions in which simpler functions are hierarchically stacked on top of each other to form a more complicated non-linear function; this gives the network the ability to learn more than one template per class for classification.
We divide our dataset of images into a training set, a validation set, and a test set, and evaluate our model on the validation set while tuning it (using the other two sets as normal). The validation set is included to avoid optimising our model to perform well on our test data; if we only had training and test sets, we would be left with no measure of how well our model performs on unseen data [1, p.97].
After reserving some of our dataset (randomly shuffled so that it remains representative) for our test and (additional) validation sets, we partition our remaining data samples into K equally-sized groups. In each of the K folds, we select the next partition that hasn’t yet been the validation set as the group of data to evaluate on, training the model on the left-over samples. We average the scores of each fold to give us the final score for our model. When we want to fine-tune our model, we use the validation set we put aside. If our full dataset contains data points that appear more than once, shuffling and splitting can place copies of the same point on both sides of the split, so we would effectively be testing on part of our training data; the sets must be kept disjoint [1, pp.99-100].
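The K-fold procedure above can be sketched as below. The helper `k_fold_scores`, the callback signature and the `mean_baseline` stand-in model are assumptions made for this example; any function that trains on one split and returns a score on the other could be passed in.

```python
import numpy as np

def k_fold_scores(data, targets, k, train_and_score):
    # Split the samples into k equally sized partitions (any remainder is dropped).
    fold_size = len(data) // k
    scores = []
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        val_x, val_y = data[start:stop], targets[start:stop]
        # Train on everything that is not in the current validation partition.
        train_x = np.concatenate([data[:start], data[stop:k * fold_size]])
        train_y = np.concatenate([targets[:start], targets[stop:k * fold_size]])
        scores.append(train_and_score(train_x, train_y, val_x, val_y))
    # The final score is the average over the k folds.
    return float(np.mean(scores))

def mean_baseline(train_x, train_y, val_x, val_y):
    # Hypothetical stand-in "model": predicts the training mean; score is validation MSE.
    return float(np.mean((val_y - train_y.mean()) ** 2))

rng = np.random.default_rng(0)
order = rng.permutation(100)                 # shuffle once before splitting
data = np.arange(100, dtype=float)[order].reshape(-1, 1)
targets = 2.0 * data[:, 0]
score = k_fold_scores(data, targets, k=4, train_and_score=mean_baseline)
```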
We can perceive the layers within a CNN as having neurons in a 3D arrangement; the dimensions/axes being a height and width, equivalent to those of an input image, and a ‘depth’ that corresponds to the colour channels in the image. The depth for an RGB image would be 3, as there are three colour channels (red, green, and blue), and 1 for a greyscale picture, which has only one colour channel (levels of grey) [1, p.123]. The neurons in each layer are connected to only a section of the previous layer (as opposed to a fully-connected network, where each neuron in a layer connects to every neuron in the subsequent layer). Typically, CNNs consist of a few types of layers:
- INPUT – receives the raw data as pixels; different input neurons conventionally receive one of the colour channel values;
- CONV – produces an output feature map based on a transformation performed on extracted patches from the input feature map (performs a convolution with learned coefficients);
- RELU – applies an element-wise activation function (doesn’t affect the width and height of the input);
- POOL – down-samples regions of the input, resulting in a reduced width and height of the output;
- FC – fully-connected; computes the class scores and hence returns the computed class of the network’s input if it’s the final layer in a CNN.
As convolution layers learn local patterns, the patterns they learn are ‘translation invariant’ – a CNN can identify a specific learned pattern anywhere in the image, so it needs fewer training samples to learn representations that can be generalised – and they can learn spatial hierarchies of patterns (e.g. the first layer learns edges and the second learns larger patterns made up of edges) [1, pp.122-123].
In a fully-connected layer, we stretch all the pixels in an input 3D image into one vector and multiply this vector by a weight matrix to obtain the activations (output) for this layer. For example, if we stretch a 32x32x3 image to 3072×1 and take a dot product between this input and one row of weights, we get one number, which is the value of that neuron (we’d have as many neuron outputs as rows in the weight matrix).
We want to preserve the 3D structure in a convolutional layer, so we don’t want to stretch out the image into one long vector. In this case, our weights are going to be filters which we convolve with the input image. Filters cover a smaller spatial area but will always go through the full depth of the input image – their depth matches that of the input, e.g. a 32x32x3 input with a filter of 5x5x3. We slide the filter over the image spatially, multiplying each element of the filter with the corresponding element in the region of the image that the filter overlays at a given spatial location and summing the results. To slide the filter over all the spatial locations, we start at the upper left-hand corner and place our filter at every position where it fits in the image, computing the dot product at each position, and we write each result into the corresponding point of our output activation map.
Each filter encodes a specific template/concept in the input data, e.g. the presence of a face; hence we have numerous filters when dealing with a convolutional layer (as many as desired). The depth of the output activation map is equivalent to how many filters we used. As we convolve a filter over an input image, areas of the input that appear to match the aspects the filter is looking for would result in a higher (more white) activation value.
As stated earlier, a CNN is a sequence of layers stacked on top of each other, where the output of each layer is the input to the next. Each layer has its own filters that produce activation maps, resulting in the CNN learning a hierarchy of filters, where those at the earlier layers typically represent low-level features such as edges. A conventional CNN starts with a convolutional layer, then alternates between non-linear layers (like ReLU) and convolutional layers; occasionally a pooling layer will follow the non-linear layer. Finally, at the top, a fully-connected layer connected to all the convolutional outputs (stretched into 1D to be input) computes the final scores for each class in the network.
After convolving our filters and input image, the output dimensions may differ from those of the input. Let’s assume we have a 7×7 input and a 3×3 filter. Using a ‘stride’ of 1, we’re going to slide the filter along the image by one pixel each time. The output will be a 5×5 activation map as we could only slide the filter along five spatial locations horizontally and vertically (see Figure 1); if we used a stride of 2, our output activation map would be 3×3 (see Figure 2). We can compute our output size by substituting our input width/height, $N$, our filter width/height, $F$, and our stride into the equation: $(N - F)/\text{stride} + 1$. In practice, we don’t do convolutions where this doesn’t give a whole number, as this leads to asymmetric outputs.
If we would like to maintain the spatial dimensions of our input image in our output activation map, we can use padding (zero-padding works reasonably in practice) where we add a border of one or more pixels to our input image; this allows us to centre our filter at each pixel in the input. Commonly, with a stride of 1, we zero-pad with $(F - 1)/2$ pixels on each side, where $F$ is the filter width/height. When we have multiple layers without padding, the size of our outputs will rapidly shrink, causing us to lose information (from the corners and edges of the image, and because we’re using fewer values to represent the original image).
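The output-size arithmetic and the sliding dot product can be checked with a small sketch. The function names are hypothetical, the padded form $(N + 2P - F)/\text{stride} + 1$ is the standard extension of the formula to a padding of $P$ pixels (an addition not stated in the text above), and the convolution is written for a single square channel only.

```python
import numpy as np

def conv_output_size(n, f, stride=1, pad=0):
    # (N + 2P - F) / stride + 1; reject settings that do not fit evenly.
    span = n + 2 * pad - f
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not tile the input with this stride")
    return span // stride + 1

def conv2d_single(image, kernel, stride=1, pad=0):
    # Naive single-channel convolution (no kernel flip, as is usual in CNNs).
    image = np.pad(image, pad)               # zero-padding on every side
    f = kernel.shape[0]
    out = conv_output_size(image.shape[0], f, stride)
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i * stride:i * stride + f, j * stride:j * stride + f]
            result[i, j] = np.sum(patch * kernel)   # dot product of filter and patch
    return result

unpadded = conv2d_single(np.ones((7, 7)), np.ones((3, 3)))          # 5x5 output
padded = conv2d_single(np.ones((7, 7)), np.ones((3, 3)), pad=1)     # 7x7 output
```

With a 3×3 filter, a padding of $(3 - 1)/2 = 1$ keeps the 7×7 size, while a stride of 3 on the unpadded 7×7 input is rejected because $(7 - 3)/3$ is not a whole number.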
Each activation map is operated on individually by the pooling layer. Pooling is only done spatially (not depth-wise) and makes the representations smaller and more manageable; reduced input sizes result in fewer required calculations in later layers. Smaller representations are slightly more translation invariant – small changes in the input don’t result in large alterations in the output activation map; we can detect whether an image contains an aspect regardless of its position.
Our pooling layer also has a filter size, which is the region we pool over – it’s typical to choose a stride for which there’s no overlap. In the max-pooling operation, instead of calculating the dot product in the region overlaid by our filter, we take the maximum value of that area of the input. Recently, strided convolutions are often being used to down-sample instead of pooling.
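A minimal sketch of max-pooling on one activation map follows; the function name and the example values are assumptions made for illustration.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    # Pool one activation map; each map in a volume would be pooled independently.
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = window.max()      # keep only the largest value in the window
    return pooled

activation = np.array([[1., 1., 2., 4.],
                       [5., 6., 7., 8.],
                       [3., 2., 1., 0.],
                       [1., 2., 3., 4.]])
pooled = max_pool2d(activation)   # 2x2 windows, stride 2, no overlap
```

Here the 4×4 map is reduced to 2×2, with values 6 and 8 in the top row and 3 and 4 in the bottom row.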
The sigmoid function takes each input number and squashes it into the range [0,1]. High (positive) input values will output a number near 1, and the outputs of very negative inputs will be near 0. We can think of output values between 0 and 1 as the “firing rate” of a neuron.
For large positive and very negative input values, the region of the sigmoid they’re in is virtually flat, so the gradient at this point is (effectively) 0. We get a minuscule gradient flowing back to downstream neurons when we chain this local gradient with any upstream gradient; killing the gradient and its flow.
Sigmoid outputs are not zero-centred, which matters because we want zero-mean data so that our gradient updates don’t all move in the same direction – if the input to a neuron is always positive then the chained gradients (on the weights) are always going to have the same sign as the upstream gradient coming down. If we can’t make updates in the direction we would ideally take, we have to follow a zig-zag sequence of gradient updates in the permitted directions to reach an optimal weight vector, which is very inefficient.
Similarly to the sigmoid function, tanh takes the input values and squashes them into the range [-1,1]; therefore, we get the same problem where the gradients at ‘saturated’ neurons are killed. However, its outputs are zero-centred, so the following layer doesn’t receive inputs that are all of the same sign.
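The saturation described above can be seen by evaluating the local gradients of both functions at a few inputs. This is only an illustrative sketch; the input values are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # local gradient, at most 0.25 (at x = 0)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2  # local gradient, at most 1 (at x = 0)

inputs = np.array([-10.0, 0.0, 10.0])
sig_out, sig_local = sigmoid(inputs), sigmoid_grad(inputs)
tanh_out, tanh_local = np.tanh(inputs), tanh_grad(inputs)
```

At ±10 both local gradients are close to zero, so almost nothing is passed back; the sigmoid outputs are all positive, whereas the tanh outputs are spread around zero.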
4.I.c ReLU (Rectified Linear Unit)
ReLU in practice converges much faster than sigmoid and tanh, and it’s very computationally inexpensive. This function operates element-wise on the input and computes $f(x) = \max(0, x)$, meaning it sets all negative inputs to 0 but positive values are unaffected.
There’s no saturation in this operation’s positive region, but the gradients are killed in the negative half of the regime; in addition, because its outputs are never negative, ReLU is not zero-centred. In the negative half of the regime, a phenomenon of ‘dead ReLUs’ can arise – the training data never causes these neurons to fire, and no updates occur as there’s no gradient flow coming back. Despite this, ReLU still works well for training networks; we can also initialise our neurons with slightly positive biases (e.g. 0.01) to increase the probability of them being active at initialisation, resulting in some updates.
4.I.d Leaky ReLU
This function is a modification of ReLU; instead of the negative half of the regime being flat, it now has a slight slope: $f(x) = \max(0.01x, x)$. Unlike ReLU, Leaky ReLU doesn’t saturate, so we don’t get this phenomenon of dead neurons.
4.I.e PReLU (Parametric Rectifier)
PReLU also has a slope in the negative region; however, the slope isn’t hardcoded but learned through back-propagation – giving us more flexibility: $f(x) = \max(\alpha x, x)$, where $\alpha$ is the learned slope.
In practice, we tend to use ReLU (or try some variation of it).
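The three rectifier variants can be written side by side as below. This is a sketch only: the function names are illustrative, and `prelu_grad_alpha` shows one way the gradient for the learned slope could be computed, given an upstream gradient.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # Fixed, small slope for negative inputs (valid for slopes below 1).
    return np.maximum(slope * x, x)

def prelu(x, alpha):
    # Same shape as Leaky ReLU, but alpha is a parameter learned in training.
    return np.where(x > 0, x, alpha * x)

def prelu_grad_alpha(x, upstream):
    # Gradient of the loss with respect to alpha: only non-positive inputs contribute.
    return np.sum(upstream * np.where(x > 0, 0.0, x))

x = np.array([-2.0, -0.5, 0.0, 3.0])
relu_out = relu(x)
leaky_out = leaky_relu(x)
prelu_out = prelu(x, alpha=0.25)
alpha_grad = prelu_grad_alpha(x, upstream=np.ones_like(x))
```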
Data input to a neural network should be in the form of floating-point tensors. For images, we would read the picture files, decode their content into RGB pixel grids, and then convert these into floating-point tensors [1, p.135].
- Zero-centring: the transformation of the data to centre it at the origin, e.g. subtracting the mean image (array of numbers computed from the training data) from each input image.
- Normalising: usually done so that all aspects are in the same range and therefore contribute equally (in practice, we don’t normalise pixel values as they already have similar scale and distribution).
We pre-process in both the training and testing stages. Generally, we determine our pre-processing values (e.g. the mean image) from the training data and then apply those same values to the test data.
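A short sketch of zero-centring with a mean image is given below. The random arrays are hypothetical stand-ins for real image sets, and the shapes are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 8-bit RGB images, shape (samples, height, width, channels).
train_images = rng.integers(0, 256, size=(100, 8, 8, 3)).astype(np.float32)
test_images = rng.integers(0, 256, size=(20, 8, 8, 3)).astype(np.float32)

# The mean image is computed on the training set only...
mean_image = train_images.mean(axis=0)
# ...and the same values are applied to both sets.
train_centred = train_images - mean_image
test_centred = test_images - mean_image
```

The test images are deliberately not used to compute `mean_image`, so no information from the test set leaks into training.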
We don’t want to initialise all our weights to the same number as each neuron will have basically the same output and therefore gradient, so our neurons won’t learn different things.
For small networks, we can set all the weights to be small random numbers to break the symmetry; this doesn’t work for deeper networks as the standard deviation of the activations shrinks from layer to layer and collapses to 0 – resulting in all our activations becoming 0, and therefore we’d get tiny gradients during back-propagation, and our parameters wouldn’t update.
Initialising our weights to be too big would cause the activations to grow and saturate as we multiply by our weight matrices during a forward pass, resulting in no learning as our gradients would be (effectively) 0.
Xavier initialisation maintains good activation distributions throughout the network – we want to choose the scale of the weights so that the variance of the output is the same as that of the input (if we have many inputs, to get the same spread at the output, we would want smaller weights, and vice versa).
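The three initialisation regimes can be compared empirically by pushing random data through a stack of layers and recording the spread of each layer’s activations. The sketch below assumes tanh layers, 10 layers of width 256, and the common Xavier form that scales unit-variance weights by $1/\sqrt{\text{fan\_in}}$; all names and sizes are illustrative.

```python
import numpy as np

def activation_stds(init_scale, layers=10, width=256, seed=0):
    # Push random data through a stack of tanh layers; record each layer's output spread.
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(1000, width))
    stds = []
    for _ in range(layers):
        w = rng.normal(size=(width, width)) * init_scale(width)
        h = np.tanh(h @ w)
        stds.append(float(h.std()))
    return stds

small_random = activation_stds(lambda fan_in: 0.01)              # collapses towards 0
too_large = activation_stds(lambda fan_in: 1.0)                  # saturates near +/-1
xavier = activation_stds(lambda fan_in: 1.0 / np.sqrt(fan_in))   # stays in a usable range
```

Comparing the last entry of each list shows the collapse, the saturation, and the better-behaved Xavier case respectively.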
After pre-processing the data, we choose the architecture (e.g. one hidden layer with 50 neurons). Once we’ve chosen our architecture, we initialise our network and do a forward pass through it, checking that our loss is reasonable compared to our expectations. First, we disable regularisation by setting it to 0 to see our loss; then we add regularisation to check that the loss changes as we predicted. Before we train over the full dataset, we train over a smaller set to check that we can overfit it (an overfitted model works well on training data but doesn’t make good predictions for unseen data [9]) and reach a very low training loss. Then we can train over the full set of data (with a small amount of regularisation) to find a good learning rate – if the loss isn’t going down then the learning rate is too low; it’s too high if the loss is exploding.
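One way to make "reasonable compared to our expectations" concrete is sketched below, under the assumption (not stated above) that the network ends in a softmax classifier with a cross-entropy loss. With small random weights and no regularisation, every class receives roughly equal probability, so the initial loss should be close to $\ln(C)$ for $C$ classes. The data, sizes and names are hypothetical.

```python
import numpy as np

def softmax_loss(scores, labels):
    # Mean cross-entropy of softmax probabilities (numerically stabilised).
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

rng = np.random.default_rng(0)
num_classes, num_features = 10, 3072
x = rng.normal(size=(64, num_features))
labels = rng.integers(0, num_classes, size=64)
weights = 1e-4 * rng.normal(size=(num_features, num_classes))  # small random init

initial_loss = softmax_loss(x @ weights, labels)
expected_loss = float(np.log(num_classes))   # about 2.30 for 10 classes
```

A first loss far from `expected_loss` would suggest a bug in the forward pass or the loss computation.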
If we reach a point where the gradient is zero (such as a saddle point or local minimum), SGD gets stuck: the update is the learning rate multiplied by a zero gradient, so we wouldn’t advance through the loss function. Around a saddle point the loss increases in some directions and decreases in others, as opposed to a local minimum where the loss increases in all directions; because the slope around a point of zero gradient is very small, SGD also makes very slow progress near, as well as on, the saddle point.
As we use a mini-batch of samples to estimate the loss and gradient of a function, we don’t know the true gradient at the current point but rather a ‘noisy’ approximation of it, causing SGD to potentially take a long time to get towards the minima in the function.
5.I.a SGD + Momentum
SGD + Momentum is the idea that we maintain a velocity as we move through the loss function, adding our gradient estimates to the velocity to build it up. Rather than stepping based on the value of the current gradient, we use the velocity vector to make our steps; we obtain this vector by decaying our previous velocity by a friction factor (typically 0.9 or 0.99) and adding our current gradient. Momentum keeps our point moving even if there’s no gradient at a certain point, similar to a ball rolling down a hill. SGD + Momentum tends to overshoot the minimum before it corrects and comes back on itself.
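The velocity update can be sketched as follows. The quadratic toy loss, the learning rate of 0.1 and the function names are assumptions made for the example.

```python
import numpy as np

def sgd_momentum(grad_fn, w, learning_rate=0.1, friction=0.9, steps=200):
    velocity = np.zeros_like(w)
    for _ in range(steps):
        # Decay the old velocity by friction, then add the current gradient.
        velocity = friction * velocity + grad_fn(w)
        # Step along the velocity rather than the raw gradient.
        w = w - learning_rate * velocity
    return w

# Hypothetical loss for illustration: an elongated quadratic bowl, minimum at the origin.
grad = lambda w: np.array([2.0 * w[0], 0.2 * w[1]])
w_final = sgd_momentum(grad, np.array([1.0, 1.0]))
```

Recording `w` at every step would show the overshoot-and-return behaviour around the minimum.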
With Adagrad, during optimisation in the training phase, we keep a running total of the squared gradients; when we make updates to our parameters, we divide the step by the square root of this total (adding a small number to ensure we don’t divide by 0).
RMSProp is a variation of Adagrad where we decay our running total of squared gradients (typically by a rate of 0.9 or 0.99), resulting in an update that is reminiscent of SGD + Momentum. RMSProp adjusts its course towards the minimum in such a way that we’re progressing approximately equally across all dimensions.
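A single update step of each method might look like the sketch below. The function names, the learning rate and the value of `eps` are illustrative assumptions.

```python
import numpy as np

def adagrad_step(w, grad, grad_squared, learning_rate=0.1, eps=1e-7):
    grad_squared = grad_squared + grad ** 2          # running total, never decays
    w = w - learning_rate * grad / (np.sqrt(grad_squared) + eps)
    return w, grad_squared

def rmsprop_step(w, grad, grad_squared, learning_rate=0.1, decay=0.9, eps=1e-7):
    # Same idea, but the running total is an exponentially decaying average.
    grad_squared = decay * grad_squared + (1 - decay) * grad ** 2
    w = w - learning_rate * grad / (np.sqrt(grad_squared) + eps)
    return w, grad_squared

grad = np.array([10.0, 0.1])      # one steep and one shallow dimension
w_ada, cache_ada = adagrad_step(np.zeros(2), grad, np.zeros(2))
w_rms, cache_rms = rmsprop_step(np.zeros(2), grad, np.zeros(2))
```

Although the two gradient components differ by a factor of 100, both dimensions move by almost the same amount, which is the equalising effect described above.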
With Adam, we maintain two moments: a running weighted average of our gradients (as in momentum) and a running average of our squared gradients (as in Adagrad/RMSProp); we make our step based on the first moment (our velocity) divided by the square root of the second moment. We initialise our moments with 0, so the second moment is very small at first; to avoid making overly large steps in the beginning, we bias-correct both estimates using the current time step before we apply the update. Using decay rates of 0.9 and 0.999 for the first and second moments respectively, and a learning rate of 1e-3 or 5e-4, is a good starting point for various models.
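The Adam update with bias correction can be sketched as below, using the decay rates and learning rate quoted above. The toy loss `sum(w**2)`, the value of `eps` and the function name are assumptions made for the example.

```python
import numpy as np

def adam_step(w, grad, m, v, t, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)                  # bias correction; t starts at 1
    v_hat = v / (1 - beta2 ** t)
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 3001):
    grad = 2.0 * w                                # gradient of the toy loss sum(w**2)
    w, m, v = adam_step(w, grad, m, v, t)
```

On the very first step the corrected moments equal the gradient and its square, so each parameter moves by almost exactly one learning rate regardless of the gradient’s magnitude.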
To improve our network’s performance on unseen data, we add some form of regularisation (normally batch normalisation, but we can add more if we see our model overfitting) to avoid fitting the training data too well.
5.II.a Batch Normalisation
Batch normalisation is a type of layer that adaptively (as the mean and variance change during training) normalises data, helping with gradient propagation and thus allowing for deeper networks [1, pp.260-261]. In the forward pass, we calculate a mean and standard deviation from a mini-batch, using these to normalise our data.
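The normalisation step of the forward pass can be sketched as follows. This is deliberately minimal: the learnable scale and shift parameters and the running averages that full batch normalisation layers typically keep for test time are omitted, and `eps` is an assumed small constant for numerical stability.

```python
import numpy as np

def batchnorm_forward(x, eps=1e-5):
    # x has shape (batch, features); statistics are taken over the batch axis.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
normalised = batchnorm_forward(batch)
```

After this step each feature has approximately zero mean and unit standard deviation within the mini-batch.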
In dropout, we set a (different) random subset of activations to 0 in each forward pass through the network – after computing the value of the current layer. Once we’ve applied dropout to our network, it looks like a smaller version of itself where we’re only using some (varying) subset of the neurons. Sometimes, in convolutional layers, we might drop entire channels rather than random elements. Dropout forces the network to build redundant representations rather than rely on any single feature: we want our network to distribute the knowledge of what constitutes a class across many different aspects (helping to prevent overfitting).
An alternative interpretation is that dropout is similar to model ensembling (training, e.g., 10, different networks and averaging their scores during testing) but we’ve only a single model. After dropout, we’re calculating a subnetwork using some set of the activations – each potential dropout mask leads to a (different) possible subnetwork, all of which share the same parameters and would be learned simultaneously.
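At test time, we multiply our output by our dropout probability (here, the probability of keeping an activation; a hyperparameter, commonly 0.5) so that the expected output matches that seen during training, removing any stochasticity.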
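The training-time masking and test-time scaling can be sketched as below. Here `keep_prob` denotes the probability of keeping an activation, and the function names and array sizes are assumptions made for illustration.

```python
import numpy as np

def dropout_train(activations, keep_prob, rng):
    # Zero a different random subset of activations on every forward pass.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask

def dropout_test(activations, keep_prob):
    # No randomness at test time: scale by the keep probability instead.
    return activations * keep_prob

rng = np.random.default_rng(0)
acts = np.ones((1000, 100))
train_out = dropout_train(acts, keep_prob=0.5, rng=rng)
test_out = dropout_test(acts, keep_prob=0.5)
```

Averaged over many units, the training-time output and the scaled test-time output agree, which is the purpose of the multiplication.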
5.II.c Data Augmentation
Data augmentation performs random transformations (flips, shifts, rotations, contrast/brightness changes, crops/scales, etc.) on our training samples to generate images our model hasn’t seen before while maintaining the original sample labels. We then train on these transformed samples, which helps our model generalise better (perform better on unseen data).
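Two of the listed transformations, a random horizontal flip and a random shift, can be sketched in NumPy as below. The function name, the `max_shift` parameter and the zero-filled border are assumptions made for the example; libraries offer richer augmentation pipelines.

```python
import numpy as np

def augment(image, rng, max_shift=2):
    # image has shape (height, width, channels); the label is left unchanged.
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                    # horizontal flip
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    padded = np.pad(image, ((max_shift, max_shift), (max_shift, max_shift), (0, 0)))
    top, left = max_shift + dy, max_shift + dx
    h, w = image.shape[:2]
    return padded[top:top + h, left:left + w, :]     # random shift via a padded crop

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
augmented = augment(image, rng)
```

Calling `augment` afresh for every sample in every epoch means the model rarely sees exactly the same image twice.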
It’s effective to use a pretrained CNN when datasets are small. Provided that the original dataset was large and broad enough, the learned features (and their spatial hierarchy) of the saved CNN can model the visual world, meaning they can be used on a diverse range of (new) problems [1, p.143].
In feature extraction, features of significance are extracted from the new dataset using the representations the pretrained CNN has already learned. We take the convolutional base [1, p.143] (everything before the last fully-connected layer, the trained classifier, that outputs the final class scores), freeze (prevent from updating during training) all of its parameters, and run the new data through it; we then reinitialise the matrix of the classifier randomly and train a linear classifier on the output of the base.
Fine-tuning aims to make conceptual representations from the pretrained CNN more applicable to a given challenge by unfreezing the top few layers (encoding more problem-specific aspects) of a frozen convolutional base. Before we can fine-tune, we need to have already trained the classifier (the last layer of our CNN) for our problem; we need to have performed feature extraction before we fine-tune and simultaneously train the unfrozen layers and our classifier [1, pp.152-155].
[1] F. Chollet. Deep Learning with Python. Shelter Island, NY: Manning Publications, 2018, pp. 46-261.
[2] A. Karpathy. “CS231n Convolutional Neural Networks for Visual Recognition”, Cs231n.github.io. [Online]. Available: http://cs231n.github.io/optimization-1/. [Accessed: Dec. 10, 2018].
[3] F. Li, J. Johnson and S. Yeung. Online Lecture, Topic: “Lecture 4 | Introduction to Neural Networks.” School of Engineering, Stanford University, Stanford, California, Aug. 11, 2017.
[4] A. Clark. Class Lecture, Topic: “Neural Networks.” LTB02, School of Computer Science and Electronic Engineering, University of Essex, Colchester, Dec. 14, 2018.
[5] F. Li, J. Johnson and S. Yeung. Online Lecture, Topic: “Lecture 5 | Convolutional Neural Networks.” School of Engineering, Stanford University, Stanford, California, Aug. 11, 2017.
[6] I. Goodfellow, Y. Bengio and A. Courville. Deep Learning. Cambridge, MA: MIT Press, 2016. [E-book]. Available: http://www.deeplearningbook.org/contents/intro.html [Accessed: Dec. 11, 2018].
[7] F. Li, J. Johnson and S. Yeung. Online Lecture, Topic: “Lecture 6 | Training Neural Networks I.” School of Engineering, Stanford University, Stanford, California, Aug. 11, 2017.
[8] F. Li, J. Johnson and S. Yeung. Online Lecture, Topic: “Lecture 7 | Training Neural Networks II.” School of Engineering, Stanford University, Stanford, California, Aug. 11, 2017.
[9] J. Fernandes. Neural Networks and Convolutional Neural Networks Essential Training – Neural network visualization. 2018. [Streaming video]. Available: https://www.lynda.com/Keras-tutorials/Neural-network-visualization/689777/738644-4.html [Accessed: Dec. 11, 2018].