How to Configure the Number of Layers and Nodes in a Neural Network

Artificial neural networks have two main hyperparameters that control the architecture or topology of the network: the number of layers and the number of nodes in each hidden layer.

You must specify values for these parameters when configuring your network.

The most reliable way to configure these hyperparameters for your specific predictive modeling problem is via systematic experimentation with a robust test harness.

This can be a tough pill to swallow for beginners in the field of machine learning who are looking for an analytical way to calculate the optimal number of layers and nodes, or for easy rules of thumb to follow.

In this post, you will discover the roles of layers and nodes and how to approach the configuration of a multilayer perceptron neural network for your predictive modeling problem.

After reading this post, you will know:

The difference between single-layer and multiple-layer perceptron networks.

The value of having one hidden layer versus more than one hidden layer in a network.

Five approaches for configuring the number of layers and nodes in a network.

Let’s get started.


Overview

This post is divided into four sections; they are:

The Multilayer Perceptron

How to Count Layers?

Why Have Multiple Layers?

How Many Layers and Nodes to Use?

The Multilayer Perceptron

A node, also called a neuron or Perceptron, is a computational unit that has one or more weighted input connections, a transfer function that combines the inputs in some way, and an output connection.
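As a minimal sketch of a single node, assume the inputs are combined as a weighted sum plus a bias and then passed through a logistic sigmoid. Both are common choices rather than the only ones, and the name node_output is hypothetical.

```python
import math

def node_output(inputs, weights, bias):
    """Compute the output of one node: weighted sum of inputs, then a sigmoid."""
    # Combine the inputs: each input is scaled by its connection weight.
    activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Squash the combined value into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-activation))

# Example: a node with two weighted input connections (illustrative values).
out = node_output([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
```

Here the weighted sum is exactly zero, so the sigmoid returns its midpoint value.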

Nodes are then organized into layers to comprise a network.

A single-layer artificial neural network, also simply called a single-layer network, has a single layer of nodes, as its name suggests. Each node in the single layer connects directly to an input variable and contributes to an output variable.

Single-layer networks have just one layer of active units. Inputs connect directly to the outputs through a single layer of weights. The outputs do not interact, so a network with N outputs can be treated as N separate single-output networks.
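The claim that the outputs do not interact can be checked with a small sketch. It assumes a purely linear single layer (no bias or activation, for brevity) and uses made-up weights; the function name single_layer is hypothetical.

```python
def single_layer(inputs, weight_rows):
    """Linear single-layer network: one row of weights per output node."""
    return [sum(x * w for x, w in zip(inputs, row)) for row in weight_rows]

inputs = [1.0, 2.0, 3.0]
weights = [[0.1, 0.2, 0.3],    # weights feeding output 0
           [-1.0, 0.0, 1.0]]   # weights feeding output 1

# Evaluate both outputs in one network ...
joint = single_layer(inputs, weights)
# ... and as two separate single-output networks that share the same inputs.
separate = [single_layer(inputs, [row])[0] for row in weights]
```

Both approaches produce identical values because each output only depends on its own row of weights.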

A single-layer network can be extended to a multiple-layer network, referred to as a Multilayer Perceptron. A Multilayer Perceptron, or MLP for short, is an artificial neural network with more than a single layer.

It has an input layer that connects to the input variables, one or more hidden layers, and an output layer that produces the output variables.

The standard multilayer perceptron (MLP) is a cascade of single-layer perceptrons. There is a layer of input nodes, a layer of output nodes, and one or more intermediate layers. The interior layers are sometimes called “hidden layers” because they are not directly observable from the system's inputs and outputs.

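We can summarize the types of layers in an MLP as follows:

Input Layer: The input variables. These nodes are not active units; they simply pass the inputs into the network.

Hidden Layers: Layers of nodes between the input and output layers. There may be one or more of these layers.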

Output Layer: A layer of nodes that produce the output variables.
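To make the idea of a cascade of layers concrete, here is a rough sketch of a forward pass through a small network with two inputs, three hidden nodes, and one output node. It assumes fully connected layers with sigmoid nodes; the weights are arbitrary illustrative values, not trained ones, and the function names are hypothetical.

```python
import math

def layer_forward(inputs, weights, biases):
    """One layer: every node takes a weighted sum of all inputs, then a sigmoid."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total = sum(x * w for x, w in zip(inputs, node_weights)) + bias
        outputs.append(1.0 / (1.0 + math.exp(-total)))
    return outputs

def mlp_forward(inputs, layers):
    """Cascade the layers: the output of each layer is the input of the next."""
    values = inputs
    for weights, biases in layers:
        values = layer_forward(values, weights, biases)
    return values

# Hidden layer: three nodes, each with two input weights.
hidden = ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1])
# Output layer: one node with three input weights (one per hidden node).
output = ([[0.3, -0.1, 0.2]], [0.05])

prediction = mlp_forward([1.0, 0.5], [hidden, output])
```

The input layer does no computation here; it is just the list of input values handed to the first hidden layer.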

Finally, there are terms used to describe the shape and capability of a neural network; for example:

Size: The number of nodes in the model.

Width: The number of nodes in a specific layer.

Depth: The number of layers in a neural network.

Capacity: The type or structure of functions that can be learned by a network configuration. Sometimes called “representational capacity”.

Architecture: The specific arrangement of the layers and nodes in the network.

How to Count Layers?

Traditionally, there is some disagreement about how to count the number of layers.

The disagreement centers around whether or not the input layer is counted. There is an argument to suggest it should not be counted because the inputs are not active; they are simply the input variables. We will use this convention; this is also the convention recommended in the book “Neural Smithing”.

Therefore, an MLP that has an input layer, one hidden layer, and one output layer is a 2-layer MLP.

The structure of an MLP can be summarized using a simple notation.

This convenient notation summarizes both the number of layers and the number of nodes in each layer. The number of nodes in each layer is specified as an integer, in order from the input layer to the output layer, with the size of each layer separated by a forward-slash character (“/”).

For example, a network with two variables in the input layer, one hidden layer with eight nodes, and an output layer with one node would be described using the notation: 2/8/1.

I recommend using this notation when describing the layers and their size for a Multilayer Perceptron neural network.
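A small helper can turn this notation into numbers. The sketch below assumes a fully connected network with one bias per hidden and output node, and it does not count the input layer toward depth, following the convention above. The function names are hypothetical.

```python
def parse_mlp_notation(notation):
    """Parse a string such as "2/8/1" into a list of layer widths."""
    widths = [int(part) for part in notation.split("/")]
    if len(widths) < 2 or any(w < 1 for w in widths):
        raise ValueError("expected at least an input and an output layer, each with >= 1 node")
    return widths

def summarize(notation):
    """Report widths, depth, and parameter count for an MLP in slash notation."""
    widths = parse_mlp_notation(notation)
    # The input layer is not counted, so depth is hidden layers plus the output layer.
    depth = len(widths) - 1
    # Assumption: fully connected layers, so weights = product of adjacent widths.
    n_weights = sum(a * b for a, b in zip(widths, widths[1:]))
    # Assumption: one bias per hidden and output node.
    n_biases = sum(widths[1:])
    return {"widths": widths, "depth": depth, "parameters": n_weights + n_biases}

summary = summarize("2/8/1")
```

For 2/8/1 this gives a depth of 2 and 33 parameters under those assumptions: 16 + 8 weights and 8 + 1 biases.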

Why Have Multiple Layers?

Before we look at how many layers to specify, it is important to think about why we would want to have multiple layers.

A single-layer neural network can only be used to represent linearly separable functions. This means very simple problems where, say, the two classes in a classification problem can be neatly separated by a line. If your problem is relatively simple, perhaps a single layer network would be sufficient.
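The limitation can be seen on two tiny problems. The sketch below trains a single step-activation node with the classic perceptron learning rule (my choice for illustration; the text above does not prescribe a training method). AND is linearly separable, XOR is not. The function name and the epoch limit are assumptions.

```python
def train_perceptron(samples, epochs=100):
    """Train one step-activation node with the perceptron learning rule.

    Returns (weights, bias, converged), where converged is True if a full
    pass over the samples produced no misclassifications.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            predicted = 1 if (w[0] * x1 + w[1] * x2 + b) > 0 else 0
            delta = target - predicted
            if delta != 0:
                # Nudge the decision line toward the misclassified point.
                w[0] += delta * x1
                w[1] += delta * x2
                b += delta
                errors += 1
        if errors == 0:
            return w, b, True
    return w, b, False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, _, and_converged = train_perceptron(AND)
_, _, xor_converged = train_perceptron(XOR)
```

The AND problem reaches an error-free pass, while XOR never does, no matter how many epochs are allowed, because no single line separates its two classes.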

Most problems that we are interested in solving are not linearly separable.

A Multilayer Perceptron can be used to represent convex regions. This means that in effect, they can learn to draw shapes around examples in some high-dimensional space that can separate and classify them, overcoming the limitation of linear separability.
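As a small illustration, adding one hidden layer is enough to represent XOR, a function that is not linearly separable. The weights below are hand-picked for the example rather than learned, the nodes use a simple step activation, and the function names are hypothetical.

```python
def step(value):
    """Step activation: fire (1) when the weighted sum is positive."""
    return 1 if value > 0 else 0

def xor_mlp(x1, x2):
    """A 2/2/1 network with hand-set weights that computes XOR."""
    # Hidden node 1 fires when at least one input is on (logical OR).
    h_or = step(x1 + x2 - 0.5)
    # Hidden node 2 fires only when both inputs are on (logical AND).
    h_and = step(x1 + x2 - 1.5)
    # Output fires for "OR but not AND", which is XOR.
    return step(h_or - h_and - 0.5)

outputs = [xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
# outputs is [0, 1, 1, 0]
```

Each hidden node draws one line through the input space; the output node combines the two half-planes into the band between them.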

In fact, there is a theoretical finding by Lippmann in the 1987 paper “An introduction to computing with neural nets” that shows that an MLP with two hidden layers is sufficient for creating classification regions of any desired shape. This is instructive, although it should be noted that no indication of how many nodes to use in each layer or how to learn the weights is given.

A further theoretical finding and proof has shown that MLPs are universal approximators: with one hidden layer, an MLP can approximate any function that we require.

Specifically, the universal approximation theorem states that a feedforward network with a linear output layer and at least one hidden layer with any “squashing” activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units.

This is an often-cited theoretical finding and there is a ton of literature on it. In practice, we again have no idea how many nodes to use in the single hidden layer for a given problem nor how to learn or set their weights effectively. Further, many counterexamples have been presented of functions that cannot directly be learned via a single one-hidden-layer MLP or require an infinite number of nodes.

Even for those functions that can be learned via a sufficiently large one-hidden-layer MLP, it can be more efficient to learn them with two (or more) hidden layers.

Since a single sufficiently large hidden layer is adequate for approximation of most functions, why would anyone ever use more? One reason hangs on the words “sufficiently large”. Although a single hidden layer is optimal for some functions, there are others for which a single-hidden-layer-solution is very inefficient compared to solutions with more layers.

How Many Layers and Nodes to Use?

With the preamble of MLPs out of the way, let’s get down to your real question.

How many layers should you use in your Multilayer Perceptron and how many nodes per layer?

In this section, we will enumerate five approaches to solving this problem.

1) Experimentation

In general, when I’m asked how many layers and nodes to use for an MLP, I often reply:

I don’t know. Use systematic experimentation to discover what works best for your specific dataset.

I still stand by this answer.

In general, you cannot analytically calculate the number of layers or the number of nodes to use per layer in an artificial neural network to address a specific real-world predictive modeling problem.

The number of layers and the number of nodes in each layer are model hyperparameters that you must specify.

You are likely to be the first person to attempt to address your specific problem with a neural network. No one has solved it before you. Therefore, no one can tell you the answer of how to configure the network.

You must discover the answer using a robust test harness and controlled experiments.

Regardless of the heuristics you might encounter, all answers will come back to the need for careful experimentation to see what works best for your specific dataset.
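The following is a deliberately tiny sketch of what such an experiment could look like, using only the Python standard library. Everything in it is an assumption made for illustration: the synthetic "inside a circle" dataset, the hand-rolled sigmoid MLP trained with plain stochastic gradient descent, the candidate widths, and the function names. A real harness would use your own data, a proper library, and k-fold cross-validation rather than a single train/test split.

```python
import math
import random
from statistics import mean

def make_circle_data(n, rng):
    """Synthetic two-class problem: label 1 if the point lies inside a circle."""
    points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    return [((x1, x2), 1 if x1 * x1 + x2 * x2 < 0.5 else 0) for x1, x2 in points]

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_and_score(n_hidden, train, test, rng, epochs=50, lr=0.5):
    """Fit a 2/n_hidden/1 sigmoid MLP with plain SGD; return test accuracy."""
    # One [w1, w2, bias] triple per hidden node.
    hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n_hidden)]
    # One output weight per hidden node; the last entry is the output bias.
    out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(x):
        h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in hidden]
        return h, sigmoid(sum(o * v for o, v in zip(out, h)) + out[-1])

    for _ in range(epochs):
        for x, target in train:
            h, y = forward(x)
            # Squared-error gradient at the output node.
            d_out = (y - target) * y * (1 - y)
            for j in range(n_hidden):
                # Backpropagate through hidden node j before changing out[j].
                d_hid = d_out * out[j] * h[j] * (1 - h[j])
                out[j] -= lr * d_out * h[j]
                hidden[j][0] -= lr * d_hid * x[0]
                hidden[j][1] -= lr * d_hid * x[1]
                hidden[j][2] -= lr * d_hid
            out[-1] -= lr * d_out
    return mean(1.0 if (forward(x)[1] > 0.5) == (t == 1) else 0.0 for x, t in test)

def run_experiment(hidden_sizes, n_repeats=3, seed=1):
    """Score each candidate width on the same split, averaged over several initializations."""
    data = make_circle_data(200, random.Random(seed))
    train, test = data[:150], data[150:]
    results = {}
    for n_hidden in hidden_sizes:
        scores = [fit_and_score(n_hidden, train, test, random.Random(r))
                  for r in range(n_repeats)]
        results[n_hidden] = mean(scores)
    return results

results = run_experiment([1, 4, 8])
for n_hidden, accuracy in results.items():
    print(f"2/{n_hidden}/1: mean test accuracy {accuracy:.3f}")
```

The important parts are the controls, not the model: every configuration sees the same data split, each is repeated with several random initializations, and the comparison is made on the mean held-out score.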

2) Intuition

The network can be configured via intuition.

For example, you may have an intuition that a deep network is required to address a specific predictive modeling problem.

A deep model provides a hierarchy of layers that build up increasing levels of abstraction from the space of the input variables to the output variables.

Given an understanding of the problem domain, we may believe that a deep hierarchical model is required to sufficiently solve the prediction problem. In which case, we may choose a network configuration that has many layers of depth.

Choosing a deep model encodes a very general belief that the function we want to learn should involve composition of several simpler functions. This can be interpreted from a representation learning point of view as saying that we believe the learning problem consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation.

This intuition can come from experience with the domain, experience with modeling problems with neural networks, or some mixture of the two.

In my experience, intuitions are often invalidated via experiments.

3) Go For Depth

In their important textbook on deep learning, Goodfellow, Bengio, and Courville highlight that empirically, on problems of interest, deep neural networks appear to perform better.

Specifically, they frame the choice of using deep neural networks as a statistical argument in cases where depth may be intuitively beneficial.

Empirically, greater depth does seem to result in better generalization for a wide variety of tasks. […] This suggests that using deep architectures does indeed express a useful prior over the space of functions the model learns.

We may use this argument to suggest that using deep networks, those with many layers, may be a heuristic approach to configuring networks for challenging predictive modeling problems.

This is similar to the advice for starting with Random Forest and Stochastic Gradient Boosting on a predictive modeling problem with tabular data to quickly get an idea of an upper-bound on model skill prior to testing other methods.

4) Borrow Ideas

A simple, but perhaps time-consuming, approach is to leverage findings reported in the literature.

Find research papers that describe the use of MLPs on instances of prediction problems similar in some way to your problem. Note the configuration of the networks used in those papers and use them as a starting point for the configurations to test on your problem.

Transferability of model hyperparameters that result in skillful models from one problem to another is a challenging open problem and the reason why model hyperparameter configuration is more art than science.

Nevertheless, the network layers and number of nodes used on related problems are a good starting point for testing ideas.

5) Search

Design an automated search to test different network configurations.

You can seed the search with ideas from literature and intuition.

Some popular search strategies include:

Random: Try random configurations of layers and nodes per layer.

Grid: Try a systematic search across the number of layers and nodes per layer.

Heuristic: Try a directed search across configurations such as a genetic algorithm or Bayesian optimization.

Exhaustive: Try all combinations of layers and the number of nodes; it might be feasible for small networks and datasets.
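The random and grid strategies can be sketched in a few lines. In this sketch a configuration is a list of hidden-layer widths, the grid uses the same width for every hidden layer to keep it small, and evaluate is a placeholder for whatever function scores a configuration on your test harness. All of these are assumptions made for illustration, as are the function names.

```python
import itertools
import random

def grid_configs(layer_options, node_options):
    """Grid: every combination of layer count and a shared width per hidden layer."""
    for n_layers, n_nodes in itertools.product(layer_options, node_options):
        yield [n_nodes] * n_layers

def random_configs(n_trials, max_layers, max_nodes, seed=1):
    """Random: sample a layer count, then an independent width for each hidden layer."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        n_layers = rng.randint(1, max_layers)
        yield [rng.randint(1, max_nodes) for _ in range(n_layers)]

def search(configs, evaluate):
    """Score every configuration and return the best (score, config) pair.

    evaluate is a caller-supplied function (hypothetical here) that maps a
    list of hidden-layer widths to a score where higher is better.
    """
    return max(((evaluate(c), c) for c in configs), key=lambda pair: pair[0])

# Candidate configurations to hand to search() along with your own evaluate().
grid = list(grid_configs([1, 2, 3], [8, 16, 32]))
sampled = list(random_configs(5, max_layers=3, max_nodes=32))
```

The grid above yields nine configurations, from [8] to [32, 32, 32]. To seed the search with ideas from the literature, simply append those configurations to the candidate list before calling search.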

This can be challenging with large models, large datasets and combinations of the two. Some ideas to reduce or manage the computational burden include:

Fit models on a smaller subset of the training dataset to speed up the search.
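Taking a subset can be as simple as a seeded random sample. This sketch assumes the training data is held as a list of rows; the function name subsample and the 10 percent fraction are illustrative choices.

```python
import random

def subsample(rows, fraction, seed=1):
    """Return a random subset of the training rows, sampled without replacement."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, int(len(rows) * fraction))
    # A fixed seed keeps the subset identical across configurations being compared.
    return random.Random(seed).sample(rows, n_keep)

training_rows = list(range(1000))  # placeholder standing in for real training examples
small = subsample(training_rows, fraction=0.1)
```

For classification problems with imbalanced classes, it may be worth sampling each class separately so the subset keeps the same class proportions.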

More

I have seen countless heuristics of how to estimate the number of layers and either the total number of neurons or the number of neurons per layer.

I do not want to enumerate them; I’m skeptical that they add practical value beyond the special cases on which they are demonstrated.

If this area is interesting to you, perhaps start with “Section 4.4 Capacity versus Size” in the book “Neural Smithing”. It summarizes a ton of findings in this area. The book is dated from 1999, so there are another nearly 20 years of ideas to wade through in this area if you’re up for it.

Also, see some of the discussions linked in the Further Reading section (below).

Did I miss your favorite method for configuring a neural network? Or do you know a good reference on the topic?
Let me know in the comments below.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.
