Understanding Deep Neural Networks: Overview

Part 1 in a series on Artificial Neural Networks

In the words of the esteemed Mugatu, “Neural Networks… they’re so hot right now!” At least, I think that’s what Zoolander was about? Anyways, Deep Neural Networks (DNNs) have become the latest buzzword across the tech sector. If you don’t yet know much about them in detail, I can hardly blame you–they’re relatively new and spawned out of academia, which means most explanations place considerable emphasis on the mathematics and theory behind them. That emphasis is necessary for developing a deep understanding, but it is not at all requisite to gaining an appreciable knowledge of the subject, or even to using DNNs in your own projects.

You have to understand–when something is this groundbreaking and revolutionary, people want to keep the club exclusive. Keeps salaries high and egos elevated. How silly. Over the course of this series on Neural Networks, I hope to explain this groundbreaking technology clearly to readers from a wide range of backgrounds. The series will get progressively more in-depth the further we go, so feel free to stop whenever you find you’ve reached an adequate level of understanding for yourself. I believe I heard the following on one of Tim Ferriss‘s podcasts: “Learn as much as you need to, and then no more.” There is an infinite amount to learn in every imaginable field. Limit yourself and come back later if and when you need to learn more.

There will be certain concepts that are not directly related to DNNs that will be helpful to know in greater depth but are outside my desired scope. In these instances, I will provide what I believe to be the best resources for gaining that deeper understanding.

Now, onto the overview.

What exactly are DNNs?

So, you know they’re important (or at least you’ve been told), but allow me to develop some context. First, it may help to see exactly where DNNs fall in the scope of Computer Science.

Concentric circle diagram of the encapsulating superfields for neural networks.

Oh great, they’re even part of machine learning and artificial intelligence–you know these things are going to be trendy. And, well, yeah. They are. But let’s ignore for a moment any capacity to “learn” and instead discuss what they do. Neural Networks can be thought of as very intricate classifiers or predictors. You feed them an input, and they give you a highly-tailored output. Perhaps taking a step out of abstraction and into reality will help solidify what this means. The two most widely-used applications for neural networks are in Computer Vision and Natural Language Processing (NLP). Let’s look at an example from each field.

Computer Vision

Every year, there is a competition called the ImageNet Large Scale Visual Recognition Challenge in which teams try to create a model that can take as input an image the model has never seen before and correctly identify what the image is of. It could be a type of flower, a particular model of airplane, a species of elephant, etc. The point is, teams have to create a model that teaches itself to recognize these images. Now, if this sounds a bit mystical at first, that’s alright–it kind of is! I mean, until you realize that it is really just some relatively simple mathematics, but we’ll get back to that later. Let’s say you take one of these pre-trained (no longer learning, final product) models and feed it an image of a cat. It would be able to tell you with what probability it believes the image is a cat, and even which species it thinks it is.
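If you’re curious what that looks like in practice, here’s a minimal sketch using Keras and its pre-trained ResNet50 model (one popular model trained on the ImageNet data). The file name cat.jpg is just a stand-in for whatever image you have lying around; none of this is specific to the competition itself:

```python
# A minimal sketch of classifying an image with a pre-trained ImageNet model.
# Assumes Keras is installed and a file named "cat.jpg" exists locally.
import numpy as np
from keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
from keras.preprocessing import image

model = ResNet50(weights="imagenet")                # download the pre-trained weights

img = image.load_img("cat.jpg", target_size=(224, 224))  # ResNet50 expects 224x224 inputs
x = image.img_to_array(img)
x = preprocess_input(np.expand_dims(x, axis=0))           # add a batch dimension and normalize

preds = model.predict(x)
print(decode_predictions(preds, top=3))             # e.g. something like [('n02123045', 'tabby', 0.71), ...]
```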

A type of deep neural network called a Convolutional Neural Network (CNN) has dominated this arena for years now. I wouldn’t bother looking up what a convolution is. We’ll dive into the specifics in a later section, but for now just think of this as an overly-convoluted (pun intended) way of saying image recognition neural network.

But so what? Why should you care about identifying cats? Well, what we can do is take this pre-trained model, replace a few layers, and retrain it with images that the original model was not intended for–say, images of tumors and classifications of the type of cancer and what stage it is in. Maybe you are trying to identify a particular plant disease, but there are various nuances that make it quite difficult for a human to classify. If you retrain this model with new images and tell it what the correct classification is, it will steadily learn which patterns are relevant to each particular classification.
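To make the “replace a few layers and retrain” idea a bit more concrete, here’s a rough sketch of how it might look in Keras. The four classes and the commented-out training data are purely hypothetical placeholders; this is the general shape of the approach (often called transfer learning), not a recipe:

```python
# A rough transfer-learning sketch: keep the pre-trained layers, swap the classification head.
from keras.applications.resnet50 import ResNet50
from keras.layers import GlobalAveragePooling2D, Dense
from keras.models import Model

base = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False                    # freeze the pre-trained layers

x = GlobalAveragePooling2D()(base.output)      # replace the original classification head
outputs = Dense(4, activation="softmax")(x)    # e.g. 4 disease/tumor classes (hypothetical)

model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(new_images, new_labels, epochs=10)  # retrain on your own labeled images
```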

Natural Language Processing (NLP)

You may have noticed that since late 2016, Google Translate has become A LOT better. Like, exponentially better. Previously, there was a huge amount of effort and infrastructure around a field called computational linguistics. The thought was–“well, we teach school children nouns, verbs, sentence structure, etc. when they learn a new language, so that’s how we’ll teach computers as well!” But this is NOT a good way to teach languages. At all. Anyone who has learned a second language will tell you the best way to learn is to immerse yourself in the language and culture, not to be taught sentence structure. Yet another example of the school system focusing on depth first rather than breadth. Drives me nuts. But I digress.

Anyways, neural networks don’t care about any of that stuff. At least not explicitly. What you can do is feed what is called a Recurrent Neural Network (RNN) a bunch of sentences and use the correct human translation as a “label,” in much the same way that you would use the correct image label for a CNN as discussed in the previous section. Over time, this neural network will learn that certain words in one language equate to certain words in another. But in addition to this, it will learn proper syntax and sentence structure just by being fed a large number of correct translations. It’s actually quite amazing! I will go into detail in a later article, but if you want to learn more about them now, check out Andrej Karpathy’s excellent explanation.
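Just to give a flavor of what “sentences in, translations as labels” might look like, here’s a deliberately naive Keras sketch that predicts one target-language word per source-language word position. Real translation systems (including Google’s) use far more elaborate sequence-to-sequence architectures, and the vocabulary sizes and data below are hypothetical:

```python
# A naive word-per-position "translation" sketch, not a real translation architecture.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense

src_vocab, tgt_vocab = 5000, 5000            # hypothetical vocabulary sizes

model = Sequential([
    Embedding(src_vocab, 128),                               # word ids -> dense vectors
    LSTM(256, return_sequences=True),                        # one hidden state per word position
    TimeDistributed(Dense(tgt_vocab, activation="softmax")), # predict a target word per position
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(source_id_sequences, target_id_sequences, epochs=10)  # padded id sequences (hypothetical)
```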

So Google used an RNN to create near-colloquial translations, and it’s only getting better. There was a really great article about it in The New York Times if you’re interested in investigating further.

So how do they do it?!

This section is where things can start to get a bit technical. In this first overview, I don’t want to introduce too much math into the mix yet, so for now all that is necessary to know is that Neural Networks are just a series of matrix multiplications and transformation functions stacked in layers. That’s it. If you are unfamiliar with matrix multiplication, that is perfectly fine. It’s really not all that necessary to know, even for implementing a DNN, but if you have 20 minutes and want to learn about matrix multiplication, I suggest watching these Khan Academy videos. Now, let’s take a look at a general DNN architecture to get a better understanding.

Graphic showing Deep Neural Network visualization

While at first this may look complicated, every single layer is doing the same thing–passing values to every node in the next layer. Let’s break this down further.

First, there is an input layer. This is where you would feed the pixel values of an image or the words in a translation. Then there is a series of hidden layers of varying, arbitrary width. A neural network is “deep” if it has multiple hidden layers (though how many layers it takes to count as technically deep is up for debate). Finally, there is an output layer, which could be anything from a single node representing a prediction for the next word in a sentence, to a series of nodes detailing the probability it believes this image is a shoe, a football, or a saxophone.

You’ll notice that the number of nodes in the input layer and output layer are predetermined based upon the problem you are trying to solve, but the intermediate architecture is up to the neural network designer. Or engineer. I think they prefer the term engineer.

The input layer is going to pass its values, multiplied by a set of weights, to each node in the next layer. Let’s call these weights a level of importance (this is not strictly accurate, but it can help with comprehension). Now each node contains a new value. This value is then transformed, generally using a function called a ReLU, which is short for Rectified Linear Unit–yet another needlessly complicated phrase. It simply means the greater of 0 and the value, or max(0, x). So now layer two has received the input, performed matrix multiplication across a set of weights, and changed any negative values to zero. That’s it.
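Here’s that single step written out in plain NumPy, with made-up sizes: multiply the incoming values by a weight matrix, then zero out anything negative.

```python
# One layer's worth of work: matrix multiply, then ReLU.
import numpy as np

def relu(x):
    return np.maximum(0, x)              # the greater of 0 and the value, i.e. max(0, x)

inputs = np.array([0.5, -1.2, 3.0])      # values coming from the previous layer (3 nodes)
weights = np.random.randn(3, 4) * 0.01   # weights connecting 3 nodes to 4 nodes

layer_values = relu(inputs @ weights)    # multiply by the weights, zero out negatives
print(layer_values)                      # 4 values, none of them negative
```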

The first hidden layer does the exact same thing for the next layer, and so on until you get to the output layer. The “activation function”, which you’ll recall was previously ReLU, is going to be different for the output layer and tailored to the specific problem you are trying to solve. In the case of a classification problem, you would use something like softmax, which transforms the final output into a series of values that all add up to 1. These therefore represent the probabilities the network assigns to the image belonging to each possible classification. In the end, the output layer is just a function of a function of a function of a function of a… so on and so forth.
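And here’s the whole “function of a function of a function” idea as a toy NumPy forward pass, again with made-up layer sizes and random weights:

```python
# A toy forward pass: two ReLU hidden layers, then a softmax output layer.
import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    e = np.exp(x - np.max(x))            # subtract the max for numerical stability
    return e / e.sum()

x = np.random.randn(784)                 # e.g. flattened pixel values of an image
W1 = np.random.randn(784, 128) * 0.01    # input layer -> hidden layer 1
W2 = np.random.randn(128, 64) * 0.01     # hidden layer 1 -> hidden layer 2
W3 = np.random.randn(64, 10) * 0.01      # hidden layer 2 -> output layer

h1 = relu(x @ W1)                        # matrix multiply, then ReLU
h2 = relu(h1 @ W2)
probs = softmax(h2 @ W3)                 # output layer uses softmax instead of ReLU

print(probs.sum())                       # 1.0 -- one probability per possible class
```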

A truly groundbreaking discovery, and one of the critical reasons neural networks can be used so extensively, is that they can approximate essentially any function. The intuition behind this is outside my scope, but an excellent explanation can be found in Michael Nielsen’s chapter on the subject.

But wait, they also learn?

That’s right. Recall the different parts of a neural network. You’ve got an input layer that you have no control over, an output layer you have no control over, a hidden layer architecture that is arbitrarily designed, but static once picked. There is no learning to be done here. But what do the layers do? They multiply their input by a series of weights and then transform them with an activation function. Well, the activation function is also static. The only thing left is the weights, which until now I have not really explained. These weights are what the model is trying to learn–you can think of it as the model asking “what is the optimum level to multiply this value by before passing it on to the node for activation?” Let’s take our Computer Vision example from earlier to better understand the learning process.

These weights, when the model is first created, are randomly initialized to values close to zero, so the first predictions it makes will be complete nonsense. So you show it an image of a polar bear, and out of the 1,000 possible things the output layer could classify the polar bear as (the ImageNet competition has 1,000 classes, so we’ll use that for our example), the model will at first give roughly equal probability to every single one of the 1,000 output nodes. Ideally it would assign a very high probability to the polar bear node and very low probabilities to the other 999. So how does it change the weights so that eventually it recognizes with somewhere around 99.99999% certainty that it is a polar bear? It uses the chain rule in a process called backpropagation. Why exactly they didn’t just call it the chain rule, I have no idea. Another one of those things to help ensure job security, I suppose. The chain rule requires a bit of derivative calculus, and you can refresh your knowledge with these Khan Academy videos.
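To make “the chain rule” concrete, here’s a toy sketch with a network so small it has a single weight, using a simple squared-error loss rather than the cross-entropy discussed next. The loss is a function of the prediction, which is a function of the weight, so the gradient is just the product of the two derivatives:

```python
# Toy network: prediction = w * x, loss = (prediction - y)^2
# Chain rule: d(loss)/dw = d(loss)/d(prediction) * d(prediction)/dw
x, y = 2.0, 10.0     # a single input and its correct answer
w = 0.1              # small, nearly-random starting weight

prediction = w * x
dloss_dpred = 2 * (prediction - y)   # derivative of the squared error w.r.t. the prediction
dpred_dw = x                         # derivative of the prediction w.r.t. the weight
grad = dloss_dpred * dpred_dw        # chain rule: multiply the pieces together

# sanity check against a purely numerical estimate of the same gradient
eps = 1e-6
numerical = (((w + eps) * x - y) ** 2 - ((w - eps) * x - y) ** 2) / (2 * eps)
print(grad, numerical)               # both roughly -39.2
```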

We’re not going to worry about this too much right now, but with these classification problems, there is an associated cost function, one of the most popular being categorical cross-entropy, also known as log loss. Again, not that important right now, but know it exists. So the model got the prediction wrong. How wrong? Well, that’s where our “loss function” comes into play. Our model, if it were 100% accurate, would say that polar bear had a probability of 1 and a probability for 0 for all other 999 nodes. So there is some substantial loss on every single output node with the initial random weights.But if we were to change these weights on the previous layers, eventually those changes would permeate to the output layer. So you can see that the amount of loss is a function of the different weights, so if you change the weights a bit, you can change the loss. We want to minimize this level of loss.

With this in mind, imagine a contour plot where different weights equate to different levels of loss. If we take the derivative of our current position within this weight-space, we can start changing the weights slightly, thereby taking small steps towards minimizing the loss. This process is called gradient descent. Once the model has reached a minimum, it will have minimized its cost function, and the model has now LEARNED the correct weights to correctly identify the images. That’s all there is to it. You now have a trained model. Now, you’ll have been feeding it a LOT of images, and the loss function is really the sum of the losses over all of them. This means that the model is generalized. It doesn’t just recognize a polar bear, it recognizes corn, candles, and lollipops as well, assuming those images were used in training the network.
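Here’s gradient descent in miniature, reusing the one-weight toy example from above: repeatedly nudge the weight a small step against the gradient and watch the loss shrink. Real networks do exactly this, just with millions of weights and fancier update rules.

```python
# Gradient descent on the one-weight toy example: step downhill, over and over.
x, y = 2.0, 10.0
w = 0.1                                  # small starting weight
learning_rate = 0.05

for step in range(50):
    prediction = w * x
    grad = 2 * (prediction - y) * x      # the chain-rule gradient from before
    w -= learning_rate * grad            # take a small step against the gradient

print(w, (w * x - y) ** 2)               # w approaches 5.0, loss approaches 0
```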

Practical Considerations

The chain rule calculations required when starting with a fresh model can be very computationally expensive. In fact, it is probably impossible to train one of these sophisticated models on your home computer without it taking 3 lifetimes. But GPUs are particularly well-suited for the task, especially NVIDIA ones. The reasoning behind NVIDIA specifically is a bit in the weeds, but they provide a parallel computing platform (CUDA) that underlies most modern machine learning libraries and APIs (Theano, TensorFlow, Keras). Even with all of this computing power, it can take days or weeks to fully tune the weights of a model from scratch, depending on the sophistication of the model architecture and your computing capabilities.

Additionally, the amount of data needed is very large. You’d want at least 20,000 images just to train a model to reliably tell the difference between a cat and a dog. So these models are very data-dependent. You need access to a large public dataset, or be prepared to procure your own.

If you want to build and train a neural network on your own, you will probably want to use something like a P2 instance from AWS (Amazon Web Services). But really, if you want to learn this well enough to implement it, you should take the MOOC Practical Deep Learning for Coders, Part 1. It will guide you through AWS setup and have you building a CNN in the first lecture. I highly recommend it, though 1 year of coding experience is expected.

Concluding Thoughts

Deep Neural Networks are one of the most bleeding-edge techniques in machine learning today. They’re not easy to understand, but they don’t have to be nearly as difficult as everyone makes them out to be, either. I hope this explanation has been high-level enough to give you a good grasp of the concepts, so that you feel a bit more confident in the event you are a nerd like me and find yourself in conversations where DNNs come up. I’m not entirely sure how frequently I will be releasing the follow-up articles that go into greater depth, as I have many other topics of interest I am itching to discuss, but stay tuned if you’re interested in learning more!

Please let me know in the comments if anything is still unclear and I’ll do my best to explain.
