The details of definitions vary, but the following is a good catch-all:
An algorithm that provides high-level abstraction and modelling of data based on large training sets
This requires explanation in itself.
Abstraction implies that the input data and the outcome are significantly different in kind: the outcomes might be, for example, an image classification, a behavioural prediction or even a language translation. It also means that there is no simple relationship between input and output; in most cases the relationship is unknown, which is why such systems are described as a ‘black box’.
Modelling means that we are trying to represent a real-world scenario of some kind, so that a real-world classification or result is produced.
The references to data and large training sets imply that the data may be diverse and that there is some variability in the input. Usually, the ‘learning’ part of deep learning or machine learning means that the important features are detected as part of the learning process.
The origin of many of the terms in this document is in the literature for neural networks and it forms a good basis for the whole white paper. A neural network is some kind of a software or hardware model of the brain, where simple decision-making or logic units (neurons, perceptrons) are combined in their inputs, outputs and decisions to make a large, complex decision-making system (network, the brain).
Originally these were called Artificial Neural Networks (ANNs) to distinguish them from biological systems. Typically they have a number of inputs and outputs, large interconnectivity between neurons and a number of intermediate layers. Without intermediate layers, the system can manage only relatively simple problems. The number of ‘hidden layers’ of neurons is a critical part of the structure, as data can be combined in these intermediate neurons to allow complex decision-making. Hence the network has a depth, and the concepts that can be learned are non-trivial and therefore ‘deep’ in all meanings of the word.
In the example above there are three inputs, one output and two hidden layers. Notice that the neurons are ‘highly interconnected’; this is an important feature of neural networks. It is what allows the complex relationships, functions or decisions: without it, the input-to-output relationships would be relatively simple.
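As a sketch of the structure just described, the following pushes an input vector through a 3-input, two-hidden-layer, 1-output network with random, untrained weights. NumPy is used, and the hidden-layer width of 4 is an arbitrary choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    # Squashing activation: maps any input to the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# Topology from the example: 3 inputs, two hidden layers (4 neurons each,
# an arbitrary choice) and 1 output.  Each layer is fully connected to the
# next - the 'high interconnectivity' described above.
layer_sizes = [3, 4, 4, 1]
weights = [rng.standard_normal((m, n)) for m, n in zip(layer_sizes, layer_sizes[1:])]
biases = [rng.standard_normal(n) for n in layer_sizes[1:]]

def forward(x):
    # Propagate an input vector through every layer in turn
    for w, b in zip(weights, biases):
        x = sigmoid(x @ w + b)
    return x

print(forward(np.array([0.5, -1.0, 2.0])))  # untrained, so the output is arbitrary
```

Until the weights are trained, the output carries no meaning; training is what turns this structure into a classifier.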
The intention is not to create an accurate model of the brain, but to replicate the brain’s learning ability and complex recognition capability. An average human might have 100 billion neurons running at around 1kHz, compared to a modern CPU with maybe 2 billion transistors running at 3GHz.
Note that there are a number of prior conditions that must be defined before a neural network can reach the desired solution; it is not just a case of ‘start learning and it will reach the correct solution eventually’. In the examples on the page linked above, the most difficult is a spiral classification, where the pre-conditions are critical.
The idea of an artificial neuron is rather vague. In a CPU, a logic device is created using transistors; if the neural network were ‘fixed’ or ‘hard-wired’ this might also be possible, but generally the point of using a neural network is its ‘learning capability’. This means that a single neuron’s response to inputs must be able to change as it learns. This is generally called the ‘weighting’ of a neuron, where it puts emphasis on different inputs to generate the desired output. This is more easily achieved in software than in hardware, so neurons are usually a mathematical function relating inputs to outputs.
The change in the weighting, and therefore the tuning of outputs against inputs, is the neuron’s learning phase. This means that there must be some feedback from the overall result that affects individual neurons. As a whole, this means that the neural network’s inputs are known, as are the outputs, but the neurons’ values, especially in the hidden layers, are not known. Hence, it is a ‘black box’.
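A minimal illustration of a single neuron adjusting its weights from feedback is the classic perceptron update rule learning the logical AND function. The learning rate and epoch count here are arbitrary choices:

```python
# A single artificial neuron learning logical AND with the perceptron
# update rule.  Its 'knowledge' lives entirely in the weights, which are
# adjusted by feedback from the error at the output.
def train_perceptron(samples, epochs=20, lr=0.1):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if x1 * w[0] + x2 * w[1] + b > 0 else 0
            err = target - out           # feedback from the desired output
            w[0] += lr * err * x1        # shift each weight towards a better answer
            w[1] += lr * err * x2
            b += lr * err
    return w, b

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data)

def predict(x1, x2):
    return 1 if x1 * w[0] + x2 * w[1] + b > 0 else 0
```

After training, `predict(1, 1)` returns 1 and the other input combinations return 0; the learned behaviour is encoded only in `w` and `b`.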
At the simple level, an untrained neural network ‘knows’ nothing and gives random or chaotic results until the neurons and the network have been ‘trained’ to give the desired output. For a simple problem, this could be achieved with a simpler (and easier to debug) architecture, so neural networks are typically used on complex problems – hence the need for large data sets for training. Individual neurons can give complex outputs, allowing linear and non-linear responses to inputs. This is rather a subtle point – the neurons have to be able to cope with the necessary possibilities to give a good outcome. This implies that either the neural network designer has some clue as to the desired internal workings, or the network is very complex to allow ‘all’ possibilities.
Neural networks are typically used to cope with problems that show some variability, like a human dealing with a real-world image. The human has learnt to identify parts of an image based on ‘experience’ – exposure to a large number of similar sets of data and feedback about whether the decision was correct. The same follows for artificial neural networks.
Google has access to huge data-sets of images, translations and more. It also has access to massive computing power. In late 2016 Google announced that it had been testing Machine Learning techniques for the Google Translate service. Compared to the previous ‘recipe-style’ phrase-based translations, Google found that relatively small language translation data-sets using neural networks could give similar results. With further testing and unsupervised learning, it was able to reduce translation errors by 55-85% based on expert feedback. Google has also made its internal SDK ‘TensorFlow’ available to the public in an open-source form.
Facebook’s AI group say that about 1000 objects per category is necessary to identify a brand of car, a type of plant or a dog breed. There are some shortcuts which involve either creating synthetic data from real training data (by modifying real data, maybe by scaling or rotating) or by creating some useful start-points (for example regarding the scale of features the neurons should pay attention to, or how many hidden layers the network should have).
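A toy sketch of the synthetic-data shortcut, assuming simple geometric transforms (mirroring and 90-degree rotations via NumPy) are valid variations for the problem at hand:

```python
import numpy as np

def augment(image):
    # Generate synthetic variants of one training image by simple
    # geometric transforms - a toy version of the 'shortcut' above.
    variants = [image]
    variants.append(np.fliplr(image))          # mirror image
    for k in (1, 2, 3):
        variants.append(np.rot90(image, k))    # 90/180/270 degree rotations
    return variants

img = np.arange(9).reshape(3, 3)   # stand-in for a real image
print(len(augment(img)))           # 1 original + 4 synthetic variants = 5
```

In practice, the transforms must reflect variation the classifier will actually see: rotating an image of a digit ‘6’ by 180 degrees, for example, would create a mislabelled ‘9’.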
Neural networks are not the only machine learning algorithms, just the most widely known; this chapter explores some of the other methods. This section is based partly around this handy guide; I’d also recommend this YouTube playlist from Google developers as an introduction to Machine Learning and a demonstration of what is freely available.
Supervised learning methods use labelled input data so that the algorithm generates a function mapping the inputs to the desired output. What none of these methods can tell you is the correct inputs or variables to use in designing the algorithm! This means that there is often an iterative approach to get a feeling for the likely success of the outcome before a full ‘learn’ takes place.
At its simplest, this can be a straight line fitted to a set of data-points (linear regression). This gives the relationship between two variables (at the simplest level) so that for a value of one variable, we can deduce the other. This method can also offer some error values to the regression which help to give a confidence to any result. From this description it is fairly easy to see that you could extend it from linear to multi-linear or polynomial data-fitting. Typically the fitting part of these methods would be a least-squares fitting to minimise the distance of the curve from the data.
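A minimal example of a least-squares straight-line fit, using NumPy’s `polyfit` on invented data points:

```python
import numpy as np

# Fit a straight line y = m*x + c to noisy points by least squares,
# then use the fitted relationship to deduce y for a new x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])   # roughly y = 2x + 1, with noise

m, c = np.polyfit(x, y, deg=1)   # deg=1 is linear; higher degrees give
                                 # the polynomial fits mentioned above
y_pred = m * 5.0 + c             # deduce y for x = 5
```

The residuals of the fit can also be inspected to attach a confidence value to any prediction, as noted above.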
Another commonly-used fitting technique is logistic regression. This fits the data to a sigmoid (S-shaped) curve and returns a probability (0 to 1). As it returns a single probability, it suits two-class problems (member of class x or not). At its heart, this is a statistical relationship based on the compactness of a class in feature space and its closeness to other classes.
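A sketch of logistic regression on an invented one-feature, two-class data set, trained by plain gradient descent (the learning rate and iteration count are arbitrary):

```python
import math

def sigmoid(z):
    # The S-shaped curve: maps any score to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Toy two-class problem in one feature: small values belong to class 0,
# large values to class 1.  Learn the weight and bias by gradient descent.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (3.5, 1), (4.0, 1), (4.5, 1)]
w, b = 0.0, 0.0
lr = 0.5
for _ in range(2000):
    for x, target in data:
        p = sigmoid(w * x + b)
        w -= lr * (p - target) * x   # gradient of the log-loss
        b -= lr * (p - target)

print(sigmoid(w * 1.0 + b))   # probability of class 1 for x = 1.0 (low)
print(sigmoid(w * 4.0 + b))   # probability of class 1 for x = 4.0 (high)
```

The output is a probability rather than a hard label, so the decision threshold (usually 0.5) can be tuned to the cost of each kind of error.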
Compared to the regression methods mentioned earlier, this method is designed for cases where there is not perfect information – what is known as an ‘ill-posed problem’. Compared to neural networks, this also implies that less training data is available. It means that, from the training data, there is no single, satisfactory line or curve that adequately classifies the data. Forcing one of the earlier methods onto such a problem would lead to either over-fitting or under-fitting (i.e. a bad classification), so some generalisation is required.
In this method, a ‘regularisation’ (modification) is applied to the fitting algorithm, so that the favoured result has some other quality, such as simplicity of calculation or smoothness of the output curve. The regularisation is known as the Tikhonov matrix (what it involves depends on the desired output). To find the Tikhonov matrix (i.e. learning or teaching the algorithm), it is necessary to make some assumptions about the input data. Assumptions might be that the data belongs to a ‘normal’ (Gaussian) statistical distribution and we can see enough of the variation to roughly determine the mean and standard deviation, or that the standard deviations are the same in the different variables.
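A minimal sketch of the simplest Tikhonov regularisation, ridge regression, where the Tikhonov matrix reduces to a multiple of the identity and the solution has a closed form. The data here is invented:

```python
import numpy as np

# Ridge regression: ordinary least squares plus a Tikhonov penalty
# lam * ||w||^2 that favours 'simple' (small-weight) solutions.
# Closed form: w = (X^T X + lam*I)^-1 X^T y
def ridge_fit(X, y, lam):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])          # exactly y = 1*x1 + 2*x2
w_plain = ridge_fit(X, y, lam=0.0)          # ordinary least squares
w_reg = ridge_fit(X, y, lam=1.0)            # regularised: weights shrink towards 0
```

Increasing `lam` pulls the weights towards zero, trading fidelity to the training data for the generalisation described above.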
CVB Polimago uses this type of algorithm. It allows Polimago to be used as a search tool for variable objects, or as a classifier where there is variation in the classes; in both cases the classes are not completely defined, so there has to be some generalisation. Borderline cases are a useful way to train the algorithm to delimit the classes.
A decision tree is a little like a neural network, except that the decision nodes are generally known. CVB’s Minos tool is an example of this, where each decision node is a binary decision. This makes a very fast classifier, as it is possible to exclude 50% of the possible outcomes at each decision (if the classifier is a balanced tree). For Minos it allows very fast OCR and searching based on trained features.
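To illustrate the idea (this is not Minos itself), here is a hand-built binary decision tree with invented features; each node tests one feature and halves the remaining possibilities, which is what makes tree classifiers so fast:

```python
# A hand-built binary decision tree.  In practice the tree structure is
# learned from training data; here it is written out by hand to show how
# classification walks the tree.
tree = {
    "test": lambda f: f["has_wheels"],
    "yes": {
        "test": lambda f: f["wheel_count"] > 2,
        "yes": "car",
        "no": "bicycle",
    },
    "no": "pedestrian",
}

def classify(features, node=tree):
    # Walk the tree until a leaf (a plain string label) is reached
    if isinstance(node, str):
        return node
    branch = "yes" if node["test"](features) else "no"
    return classify(features, node[branch])

print(classify({"has_wheels": True, "wheel_count": 4}))  # car
```

Each test discards an entire subtree, so a balanced tree with N leaves needs only about log2(N) decisions per classification.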
In a simple case, the data are plotted in a feature space; the data points that lie closest to the decision boundary are known as ‘Support Vectors’. In a two-class case, the SVM splits the classes by defining a line with the maximum distance from each class. Again, the design of the SVM defines what form the line can take – linear, polynomial, logarithmic and so on. CVB’s Manto tool is an example of an SVM.
The graphs overleaf show that there are multiple lines that can completely separate the classes, but by putting limits on the allowable solutions (smoothness of the decision surface, for example), it is possible to change the characteristics of the solution – generalisation versus over-fitting, for example.
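A minimal linear SVM sketch (not Manto itself): gradient descent on the hinge loss over invented, well-separated 2-D data. The regularisation weight `lam` trades margin width against training error, which is exactly the constraint on allowable solutions described above:

```python
# A minimal linear SVM trained by gradient descent on the hinge loss.
# Labels are +1/-1; the learned line w.x + b = 0 separates the classes.
data = [((1.0, 1.0), -1), ((1.5, 0.5), -1), ((2.0, 1.5), -1),
        ((4.0, 4.0), +1), ((4.5, 3.5), +1), ((5.0, 4.5), +1)]

w = [0.0, 0.0]
b = 0.0
lam = 0.01    # regularisation: prefers a wider margin
lr = 0.01     # learning rate (arbitrary choice)
for _ in range(5000):
    for (x1, x2), y in data:
        if y * (w[0] * x1 + w[1] * x2 + b) < 1:   # inside the margin: push out
            w[0] += lr * (y * x1 - lam * w[0])
            w[1] += lr * (y * x2 - lam * w[1])
            b += lr * y
        else:                                     # correct side: only shrink w
            w[0] -= lr * lam * w[0]
            w[1] -= lr * lam * w[1]

def predict(x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
```

Swapping the linear score for a kernel function gives the polynomial and other non-linear variants mentioned above, at the cost of a more expensive fit.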
The term ‘Bayesian’ refers to probability; in this case there is an assumption that a class has a number of unrelated features (e.g. colour, shape, size), and there is a probability that a measured value is related to an ideal set of values for a certain class. By combining the probabilities, you get the probability that a certain measurement of colour, shape and size relates to a certain class. In a simple example, a white, spherical object around 220mm in diameter would be a good match to the class ‘football’. Changing one of those descriptors would make it much less likely to be a football.
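A toy naive Bayes sketch of the football example; all likelihood and prior values are invented for illustration:

```python
# Naive Bayes: the features are assumed independent, so the score for a
# class is simply the product of the per-feature likelihoods and the
# class prior.  All numbers here are invented.
likelihoods = {
    "football":   {"white": 0.8, "spherical": 0.9, "size_ok": 0.9},
    "rugby_ball": {"white": 0.6, "spherical": 0.1, "size_ok": 0.7},
}
priors = {"football": 0.5, "rugby_ball": 0.5}

def score(cls, features):
    p = priors[cls]
    for f in features:
        p *= likelihoods[cls][f]   # combine the independent probabilities
    return p

obs = ["white", "spherical", "size_ok"]
best = max(likelihoods, key=lambda c: score(c, obs))
print(best)  # football
```

Note how a single low per-feature likelihood (here `spherical` for a rugby ball) collapses the whole product, mirroring the point that changing one descriptor makes the match much less likely.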
This is a voting system that uses the nearest neighbours in a feature space (the ‘K’ nearest neighbours, in fact) to decide which class a test feature belongs to. Choosing the number K is a surprisingly difficult part of the design.
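A minimal K-nearest-neighbour sketch with K = 3 on invented 2-D data:

```python
from collections import Counter

# k-nearest-neighbour voting in a 2-D feature space.  K is a free design
# choice: too small is noisy, too large blurs the class boundaries.
training = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
            ((4.0, 4.0), "b"), ((4.2, 3.9), "b"), ((3.8, 4.1), "b")]

def knn_classify(point, k=3):
    # squared Euclidean distance is enough for ranking neighbours
    dist = lambda p, q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    nearest = sorted(training, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 1.0)))  # 'a'
```

There is no training phase at all: the entire data set is consulted at classification time, which is simple but slow for large data sets.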
In these cases there isn’t any previously known or labelled data. This is often what is meant by ‘deep learning’: the algorithm is, in some sense, self-learning. The tools are trying to find classifications within the data without prior knowledge.
Compared to K-nearest neighbours, this ‘automatically’ creates K clusters of data, where the clusters are fairly homogeneous and the gaps between clusters show significant differences. It can be a fairly simple iterative centroiding problem. However, it is entirely dependent on the features that are being measured – the resulting clusters might have no relation to a ‘human-perceived’ cluster.
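A minimal k-means sketch in NumPy, alternating assignment and centroid update on invented data (the naive ‘first k points’ initialisation used here is a real weakness in practice):

```python
import numpy as np

# Minimal k-means: alternate between assigning points to the nearest
# centroid and moving each centroid to the mean of its assigned points.
points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                   [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

def kmeans(points, k, iters=10):
    centroids = points[:k].copy()   # naive initialisation: first k points
    for _ in range(iters):
        # assign each point to its nearest centroid
        d = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

centroids, labels = kmeans(points, k=2)
```

No labels are supplied at any point: the two clusters emerge purely from the geometry of the chosen features, which is why the result may not match a human-perceived grouping.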
This is an extension of the decision tree method, in that there are many trees (making a ‘forest’) and the number of trees that vote for an outcome gives a statistical probability that an input is related to an output. In more complex forests the internal algorithms can be any of those mentioned elsewhere! So it becomes a voting method based on the assumption that ‘most of the methods are right most of the time’. Compared to decision trees, random forests suffer less from over-fitting to the training data.
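A toy ‘forest’ of one-level decision trees (stumps) on an invented one-feature problem: each stump trains on a bootstrap sample of the data, and the final class is decided by majority vote:

```python
import random

# A toy random forest: many decision stumps, each trained on a bootstrap
# sample (drawn with replacement), with the class decided by majority vote.
random.seed(1)
data = [(x, 0) for x in (1.0, 1.5, 2.0, 2.5)] + [(x, 1) for x in (6.0, 6.5, 7.0, 7.5)]

def train_stump(dataset):
    # bootstrap: sample the training data with replacement
    sample = [random.choice(dataset) for _ in range(len(dataset))]
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not zeros or not ones:
        return 4.0                            # degenerate sample: fall back to a midpoint
    return (max(zeros) + min(ones)) / 2       # threshold between the classes seen

forest = [train_stump(data) for _ in range(25)]

def predict(x):
    votes = sum(1 for t in forest if x > t)   # each stump votes for class 1 or 0
    return 1 if votes > len(forest) / 2 else 0
```

Because each stump sees a slightly different sample, their thresholds differ; the vote count itself can be read as a rough probability of class membership.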
Supervised learning means that there is some prior knowledge in the training set, and the problem is to create the function connecting the inputs to the desired outputs. The choice and combination of features affect the difficulty and success of doing so; in fact, the features are critical.
Unsupervised learning means there is no prior knowledge and all knowledge is inferred from the training set. This might imply some clustering of data-points in a feature-space, but this can only happen if the features and potential functions are chosen correctly. This suggests that unsupervised learning can be successful in two cases:
In all cases the biggest problem is conceiving the possible solutions, so that features can be extracted and combined, the solutions (mappings) can be designed and the complexity of the solution (e.g. the number of hidden layers) can be decided. The type of algorithm alone is not a solution.