Deep learning – a not-so-deep overview
So, what is this deep learning that is grabbing our attention and headlines? Let's turn to Wikipedia again to form a working definition: "Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, with complex structures or otherwise, composed of multiple nonlinear transformations." That sounds as if a lawyer wrote it. In plainer terms, deep learning is built on ANNs in which machine learning techniques, primarily unsupervised learning, are used to create new features from the input variables. We will dig into some unsupervised learning techniques in the next couple of chapters, but you can think of it as finding structure in data where no response variable is available.
A simple way to think of it is the periodic table of elements, which is a classic case of finding structure where no response is specified. Pull up this table online and you will see that it is organized based on atomic structure, with metals on one side and non-metals on the other. It was created based on latent classification/structure. This identification of latent structure/hierarchy is what separates deep learning from your run-of-the-mill ANN. Deep learning addresses the question of whether there is a representation of the data that predicts the outcome better than the raw inputs alone. In other words, can our model learn to classify pictures using features it has derived itself, rather than only the raw pixels? This can be of great help in a situation where you have a small set of labeled responses but a vast amount of unlabeled input data. You could train your deep learning model using unsupervised learning on the unlabeled data and then apply it in a supervised fashion to the labeled data, iterating back and forth.
Identifying these latent structures is not trivial mathematically, but one tool that helps is regularization, which we looked at in Chapter 4, Advanced Feature Selection in Linear Models. In deep learning, you can penalize weights with regularization methods such as L1 (penalize the sum of the absolute weights, which pushes many of them to exactly zero), L2 (penalize the sum of the squared weights, which shrinks large weights), and dropout (randomly ignore certain inputs or hidden nodes during training by zeroing their output). A basic ANN is often trained without any of these regularization methods.
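To make the three regularization methods above concrete, here is a minimal sketch in plain Python. It is not taken from any deep learning package: the function names, the example weights, and the penalty strength `lam` are all assumptions chosen for illustration, and the dropout shown is the common "inverted" variant that rescales the surviving values.

```python
import random

def l1_penalty(weights, lam):
    """L1: lam times the sum of absolute weights (encourages exact zeros)."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2: lam times the sum of squared weights (shrinks large weights)."""
    return lam * sum(w * w for w in weights)

def dropout(activations, rate, rng):
    """Inverted dropout: zero each value with probability `rate` and
    rescale the survivors by 1 / (1 - rate) so the expected sum is unchanged."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

# Hypothetical weights and penalty strength, for illustration only.
weights = [0.0, 0.5, -2.0, 3.0]
lam = 0.1

l1 = l1_penalty(weights, lam)  # 0.1 * (0 + 0.5 + 2 + 3)
l2 = l2_penalty(weights, lam)  # 0.1 * (0 + 0.25 + 4 + 9)

rng = random.Random(42)        # seeded so the run is repeatable
dropped = dropout([1.0] * 8, rate=0.5, rng=rng)

print("L1 penalty:", l1)
print("L2 penalty:", l2)
print("after dropout:", dropped)
```

In training, the L1 or L2 penalty would be added to the loss being minimized, while dropout would be applied to a layer's values on each training pass and switched off when predicting.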
Another way is to reduce the dimensionality of the data. One such method is the autoencoder. This is a neural network in which the inputs are transformed into a smaller set of hidden nodes, that is, a reduced-dimension representation, and the network is trained to reconstruct the inputs from that representation. In the following diagram, notice that Feature A is not connected to one of the hidden nodes:
This can be applied recursively, and learning can take place over many hidden layers. What happens in this case is that the network develops features of features as the layers are stacked on each other. Deep learning will first learn the weights between each pair of adjacent layers in sequence and then use backpropagation to fine-tune these weights across the whole network. Other feature-learning methods include restricted Boltzmann machines and sparse coding models.
Further resources are available at http://deeplearning.net/.
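The autoencoder idea can be sketched in a few lines of plain Python. This is a deliberately tiny, linear version under stated assumptions: the toy data, the starting weights, the learning rate, and the number of epochs are all made up for illustration, and a real autoencoder would normally use nonlinear activation functions and far more nodes. Two inputs are squeezed through a single hidden node and then reconstructed, with the weights learned by gradient descent on the reconstruction error.

```python
# Toy data: 2-D points that all lie on the line y = 2x, so a single
# hidden feature is enough to describe them.
data = [(t, 2.0 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]

# Hypothetical starting weights: small and non-zero so learning can begin.
w_enc = [0.1, 0.1]  # inputs -> hidden node (2 inputs, 1 hidden node)
w_dec = [0.1, 0.1]  # hidden node -> outputs (1 hidden node, 2 outputs)

def encode(x):
    """Compress the two inputs into one hidden value."""
    return sum(w * xi for w, xi in zip(w_enc, x))

def decode(h):
    """Reconstruct the two inputs from the hidden value."""
    return [w * h for w in w_dec]

def reconstruction_error():
    """Mean squared difference between inputs and their reconstructions."""
    total = 0.0
    for x in data:
        x_hat = decode(encode(x))
        total += sum((xh - xi) ** 2 for xh, xi in zip(x_hat, x))
    return total / len(data)

initial_error = reconstruction_error()

learning_rate = 0.05  # assumed value
for epoch in range(200):
    for x in data:
        h = encode(x)
        x_hat = decode(h)
        err = [xh - xi for xh, xi in zip(x_hat, x)]
        # Gradients of 0.5 * squared error, computed before any update.
        grad_h = sum(e * w for e, w in zip(err, w_dec))
        grad_dec = [e * h for e in err]
        grad_enc = [grad_h * xi for xi in x]
        for j in range(2):
            w_dec[j] -= learning_rate * grad_dec[j]
            w_enc[j] -= learning_rate * grad_enc[j]

final_error = reconstruction_error()
print("error before training:", initial_error)
print("error after training: ", final_error)
```

Stacking would repeat the same step on the hidden values: the output of `encode` becomes the input of a second autoencoder, giving features of features.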
Deep learning has performed well on many classification problems, including winning a Kaggle contest or two. It still suffers from the problems of ANNs, especially the black box problem. Try explaining to the uninformed what is happening inside a neural network, whichever in-vogue method you use. However, it is appropriate for problems where explaining how is not a concern and the important question is what. After all, do we really care why an autonomous car avoided running into a pedestrian, or do we care about the fact that it did not? Additionally, the Python community has a bit of a head start on the R community in deep learning usage and packages. As we will see in the practical exercise, the gap is closing.
While deep learning is an exciting undertaking, be aware that to achieve the full benefit of its capabilities, you will need a high degree of computational power, and you will need to take the time to train the best model by fine-tuning the hyperparameters. Here is a list of some things that you will need to consider:
- An activation function
- Size and number of the hidden layers
- Dimensionality reduction, that is, restricted Boltzmann machine versus autoencoder
- The number of epochs
- The gradient descent learning rate
- The loss function
- Regularization
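One way to keep track of these choices is to gather them in a single configuration object and then generate the combinations you want to try. The following is only a sketch in plain Python: the class name `DeepNetConfig`, its field names, and every default and grid value are hypothetical and do not correspond to the arguments of any particular package.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class DeepNetConfig:
    """Hypothetical container for the hyperparameters listed above."""
    activation: str = "tanh"           # activation function
    hidden_layers: tuple = (64, 32)    # size and number of hidden layers
    pretraining: str = "autoencoder"   # or "rbm" (restricted Boltzmann machine)
    epochs: int = 50                   # number of passes over the data
    learning_rate: float = 0.01        # gradient descent learning rate
    loss: str = "cross_entropy"        # loss function
    l1: float = 0.0                    # L1 penalty strength
    l2: float = 1e-4                   # L2 penalty strength
    dropout: float = 0.2               # share of nodes dropped per pass

# Assumed search grid: only the settings being varied are listed;
# everything else keeps its default value.
grid = {
    "activation": ["tanh", "relu"],
    "hidden_layers": [(32,), (64, 32)],
    "learning_rate": [0.01, 0.001],
}

# One configuration per combination of grid values (2 * 2 * 2 here).
candidates = [
    DeepNetConfig(**dict(zip(grid, combo)))
    for combo in product(*grid.values())
]

print(len(candidates), "configurations to train")
print(candidates[0])
```

Each added hyperparameter multiplies the number of models to train, which is part of why the computational cost climbs so quickly.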