Tag Archives: Publications

Quantum computing integrates the two largest technological revolutions of the last half century: information technology and quantum mechanics. If we compute using the rules of quantum mechanics, instead of binary logic, some intractable computational tasks become feasible. An important goal in the pursuit of a universal quantum computer is the determination of the smallest computational task that is prohibitively hard for today’s classical computers. This crossover point is known as the “quantum supremacy” frontier, and is a critical step on the path to more powerful and useful computations.

In “Characterizing quantum supremacy in near-term devices” published in Nature Physics (arXiv here), we present the theoretical foundation for a practical demonstration of quantum supremacy in near-term devices. The paper proposes the task of sampling bit-strings from the output of random quantum circuits, which can be thought of as the “hello world” program for quantum computers. The upshot of the argument is that the output of a random chaotic system (think butterfly effect) rapidly becomes harder to predict the longer the system runs. If one makes a random, chaotic qubit system and examines how long a classical system would take to emulate it, one gets a good measure of when a quantum computer could outperform a classical one. Arguably, this is the strongest theoretical proposal to prove an exponential separation between the computational power of classical and quantum computers.

Determining where exactly the quantum supremacy frontier lies for sampling random quantum circuits has rapidly become an exciting area of research. On the one hand, improvements in classical algorithms for simulating quantum circuits increase the size of the quantum circuits required to establish quantum supremacy. This in turn requires an experimental quantum device with a sufficiently large number of qubits and low enough error rates to implement circuits of sufficient depth (i.e., the number of layers of gates in the circuit) to achieve supremacy. On the other hand, we now understand better how the particular choice of the quantum gates used to build random quantum circuits affects the simulation cost, leading to improved benchmarks for near-term quantum supremacy (available for download here), which are in some cases quadratically more expensive to simulate classically than the original proposal.

Sampling from random quantum circuits is also an excellent calibration benchmark for quantum computers, an approach we call cross-entropy benchmarking. A successful quantum supremacy experiment with random circuits would demonstrate the basic building blocks for a large-scale fault-tolerant quantum computer. Furthermore, quantum physics has not yet been tested for highly complex quantum states such as these.
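To make the idea behind cross-entropy benchmarking more concrete, here is a minimal sketch. It is not the procedure from the paper. It assumes that a normalized complex Gaussian vector is a reasonable stand-in for the output state of a deep random circuit, and all function names are hypothetical. The score it computes is a cross-entropy difference: under these assumptions it should come out close to 1 when the bit-strings are sampled from the ideal output distribution and close to 0 when they are uniformly random, as they would be from a device dominated by noise.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def random_state_probabilities(num_qubits, rng):
    """Output probabilities of a stand-in "random circuit".

    Assumption: a normalized complex Gaussian vector mimics the output
    state of a deep random circuit.
    """
    dim = 2 ** num_qubits
    amps = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(dim)]
    norm = sum(abs(a) ** 2 for a in amps)
    return [abs(a) ** 2 / norm for a in amps]


def cross_entropy_difference(samples, probs):
    """Estimate log(N) + gamma - mean(-log p(x)) over sampled bit-strings x."""
    dim = len(probs)
    mean_neg_log = sum(-math.log(probs[x]) for x in samples) / len(samples)
    return math.log(dim) + EULER_GAMMA - mean_neg_log


rng = random.Random(0)
probs = random_state_probabilities(num_qubits=10, rng=rng)
outcomes = range(len(probs))  # bit-strings encoded as integers

# A perfect device samples from the ideal distribution ...
ideal_samples = rng.choices(outcomes, weights=probs, k=20000)
# ... while a fully noisy one returns uniformly random bit-strings.
noise_samples = rng.choices(outcomes, k=20000)

ideal_score = cross_entropy_difference(ideal_samples, probs)
noise_score = cross_entropy_difference(noise_samples, probs)
print(f"ideal sampler: {ideal_score:.2f}, uniform sampler: {noise_score:.2f}")
```

Note that the score needs the ideal probabilities of the sampled bit-strings, which is exactly the part that becomes classically expensive as circuits grow.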

Space-time volume of a quantum circuit computation. The computational cost for quantum simulation increases with the volume of the quantum circuit, and in general grows exponentially with the number of qubits and the circuit depth. For asymmetric grids of qubits, the computational space-time volume grows more slowly with depth than for symmetric grids, and can result in circuits that are exponentially easier to simulate.

In “A blueprint for demonstrating quantum supremacy with superconducting qubits” (arXiv here), we illustrate a blueprint towards quantum supremacy and experimentally demonstrate a proof-of-principle version for the first time. In the paper, we discuss two key ingredients for quantum supremacy: exponential complexity and accurate computations. We start by running algorithms on subsections of the device ranging from 5 to 9 qubits. We find that the classical simulation cost grows exponentially with the number of qubits. These results are intended to provide a clear example of the exponential power of these devices. Next, we use cross-entropy benchmarking to compare our results against those of an ordinary computer and show that our computations are highly accurate. In fact, the error rate is low enough to achieve quantum supremacy with a larger quantum processor.
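For a feel of what exponential growth means here, consider a back-of-the-envelope sketch. It assumes a full state-vector simulation, which is one common classical approach and not necessarily the one used in the paper, and it assumes 16 bytes per amplitude (a double-precision complex number). The function name is hypothetical.

```python
def statevector_bytes(num_qubits, bytes_per_amplitude=16):
    """Memory needed to hold all 2**n amplitudes of an n-qubit state."""
    return bytes_per_amplitude * 2 ** num_qubits


for n in (5, 9, 30, 50):
    print(f"{n:2d} qubits: {statevector_bytes(n):,} bytes")
```

Under these assumptions every added qubit doubles the memory: 9 qubits fit in 8 KiB, while 50 qubits would need 16 PiB.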

Beyond achieving quantum supremacy, a quantum platform should offer clear applications. In our paper, we apply our algorithms to computational problems in quantum statistical mechanics using complex multi-qubit gates (as opposed to the two-qubit gates designed for a digital quantum processor with surface code error correction). We show that our devices can be used to study fundamental properties of materials, e.g., microscopic differences between metals and insulators. By extending these results to next-generation devices with ~50 qubits, we hope to answer scientific questions that are beyond the capabilities of any other computing platform.

Photograph of two gmon superconducting qubits and their tunable coupler developed by Charles Neill and Pedram Roushan.

These two publications introduce a realistic proposal for near-term quantum supremacy, and demonstrate a proof-of-principle version for the first time. We will continue to decrease the error rates and increase the number of qubits in quantum processors to reach the quantum supremacy frontier, and to develop quantum algorithms for useful near-term applications.

Posted by Jeff Dean, Google Senior Fellow, Head of Google Research and Machine Intelligence

This week, Vancouver, Canada hosts the 6th International Conference on Learning Representations (ICLR 2018), a conference focused on how one can learn meaningful and useful representations of data for machine learning. ICLR includes conference and workshop tracks, with invited talks along with oral and poster presentations of some of the latest research on deep learning, metric learning, kernel learning, compositional models, non-linear structured prediction, and issues regarding non-convex optimization.

At the forefront of innovation in cutting-edge technology in neural networks and deep learning, Google focuses on both theory and application, developing learning approaches to understand and generalize. As Platinum Sponsor of ICLR 2018, Google will have a strong presence with over 130 researchers attending, contributing to and learning from the broader academic research community by presenting papers and posters, in addition to participating on organizing committees and in workshops.

If you are attending ICLR 2018, we hope you'll stop by our booth and chat with our researchers about the projects and opportunities at Google that go into solving interesting problems for billions of people. You can also learn more about our research being presented at ICLR 2018 in the list below (Googlers highlighted in blue).

Posted by Yuxuan Wang, Research Scientist and RJ Skerry-Ryan, Software Engineer, on behalf of the Machine Perception and Google Brain teams

At Google, we're excited about the recent rapid progress of neural network-based text-to-speech (TTS) research. In particular, end-to-end architectures, such as the Tacotron systems we announced last year, can both simplify voice building pipelines and produce natural-sounding speech. This will help us build better human-computer interfaces, like conversational assistants, audiobook narration, news readers, or voice design software. To deliver a truly human-like voice, however, a TTS system must learn to model prosody, the collection of expressive factors of speech, such as intonation, stress, and rhythm. Most current end-to-end systems, including Tacotron, don't explicitly model prosody, meaning they can't control exactly how the generated speech should sound. This may lead to monotonous-sounding speech, even when models are trained on very expressive datasets like audiobooks, which often contain character voices with significant variation. Today, we are excited to share two new papers that address these problems.

We augment Tacotron with a prosody encoder. The lower half of the image is the original Tacotron sequence-to-sequence model. For technical details, please refer to the paper.

This embedding captures characteristics of the audio that are independent of phonetic information and idiosyncratic speaker traits — these are attributes like stress, intonation, and timing. At inference time, we can use this embedding to perform prosody transfer, generating speech in the voice of a completely different speaker, but exhibiting the prosody of the reference.

Text: *Is* that Utah travel agency?

Reference prosody (Australian)

Synthesized without prosody embedding (American)

Synthesized with prosody embedding (American)

The embedding can also transfer fine time-aligned prosody from one phrase to a slightly different phrase, though this technique works best when the reference and target phrases are similar in length and structure.

Reference Text: For the first time in her life she had been danced tired.

Synthesized Text: For the last time in his life he had been handily embarrassed.

Reference prosody (American)

Synthesized without prosody embedding (American)

Synthesized with prosody embedding (American)

Excitingly, we observe prosody transfer even when the reference audio comes from a speaker whose voice is not in Tacotron's training data.

Despite their ability to transfer prosody with high fidelity, the embeddings from the paper above don't completely disentangle prosody from the content of a reference audio clip. (This explains why they transfer prosody best to phrases of similar structure and length.) Furthermore, they require a clip of reference audio at inference time. A natural question then arises: can we develop a model of expressive speech that alleviates these problems?

The model works by adding an extra attention mechanism to Tacotron, forcing it to represent the prosody embedding of any speech clip as the linear combination of a fixed set of basis embeddings. We call these embeddings Global Style Tokens (GSTs), and find that they learn text-independent variations in a speaker's style (soft, high-pitch, intense, etc.), without the need for explicit style labels.
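The core idea, an embedding expressed as a weighted combination of a fixed set of basis vectors, can be sketched in a few lines. This is a simplified illustration rather than the architecture from the paper: it assumes a single dot-product attention head, hand-picked two-dimensional tokens, and hypothetical function names.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def style_embedding(reference, tokens):
    """Attend over the tokens; return (weights, weighted sum of tokens)."""
    weights = softmax([dot(reference, t) for t in tokens])
    dim = len(tokens[0])
    combined = [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]
    return weights, combined


# Hypothetical values: three 2-dimensional tokens and one reference encoding.
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
reference = [2.0, 0.5]
weights, embedding = style_embedding(reference, tokens)
print(weights, embedding)
```

Because the weights come from a softmax, the resulting embedding always lies inside the span of the tokens, which is what lets a small set of tokens act as a shared vocabulary of styles.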

Model architecture of Global Style Tokens. The prosody embedding is decomposed into “style tokens” to enable unsupervised style control and transfer. For technical details, please refer to the paper.

At inference time, we can select or modify the combination weights for the tokens, allowing us to force Tacotron to use a specific speaking style without needing a reference audio clip. Using GSTs, for example, we can make different sentences of varying lengths sound more "lively", "angry", "lamenting", and so on:

Text: United Airlines five six three from Los Angeles to New Orleans has landed.

Style 1

Style 2

Style 3

Style 4

Style 5

The text-independent nature of GSTs makes them ideal for style transfer, which takes a reference audio clip spoken in a specific style and transfers its style to any target phrase we choose. To achieve this, we first run inference to predict the GST combination weights for an utterance whose style we want to imitate. We can then feed those combination weights to the model to synthesize completely different phrases (even those with very different lengths and structure) in the same style.
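The two inference modes described above, reusing weights predicted from a reference and setting the weights by hand, can be sketched as follows. This is a toy illustration under stated assumptions: the tokens and reference encoding are made-up numbers, the attention is a plain dot-product softmax, and `synthesize` is a hypothetical stand-in for the TTS decoder that only records which style it was given.

```python
import math


def predict_weights(reference, tokens):
    """Softmax over dot-product scores between the reference and each token."""
    scores = [sum(r * t for r, t in zip(reference, token)) for token in tokens]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine(weights, tokens):
    """Weighted sum of the tokens: the style embedding fed to the decoder."""
    return [sum(w * token[i] for w, token in zip(weights, tokens))
            for i in range(len(tokens[0]))]


def synthesize(text, style):
    """Stand-in for the TTS decoder: just pairs the text with its style."""
    return {"text": text, "style": style}


tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # hypothetical style tokens
reference_encoding = [0.2, 1.5]                  # hypothetical reference clip

# Step 1: infer the combination weights from the reference utterance.
weights = predict_weights(reference_encoding, tokens)

# Step 2: reuse the same weights for unrelated target phrases.
style = combine(weights, tokens)
outputs = [synthesize(t, style) for t in ("Hello.", "The flight has landed on time.")]

# Manual control needs no reference at all: pick the weights directly.
only_token_two = combine([0.0, 1.0, 0.0], tokens)
```

The style vector never looks at the target text, which is why the same weights can be applied to phrases of any length.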

Finally, our paper shows that Global Style Tokens can model more than just speaking style. When trained on noisy YouTube audio from unlabeled speakers, a GST-enabled Tacotron learns to represent noise sources and distinct speakers as separate tokens. This means that by selecting the GSTs we use in inference, we can synthesize speech free of background noise, or speech in the voice of a specific unlabeled speaker from the dataset. This exciting result provides a path towards highly scalable but robust speech synthesis. You can listen to the full set of demos for "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis" on this web page.

We are excited about the potential applications and opportunities that these two bodies of research enable. In the meantime, there are new important research problems to be addressed. We'd like to extend the techniques of the first paper to support prosody transfer in the natural pitch range of the target speaker. We'd also like to develop techniques to select appropriate prosody or speaking style automatically from context, using, for example, the integration of natural language understanding with TTS. Finally, while our first paper proposes an initial set of objective and subjective metrics for prosody transfer, we'd like to develop these further to help establish generally-accepted methods for prosodic evaluation.

The brain has evolved over a long time, from very simple worm brains 500 million years ago to a diversity of modern structures today. The human brain, for example, can accomplish a wide variety of activities, many of them effortlessly — telling whether a visual scene contains animals or buildings feels trivial to us, for example. To perform activities like these, artificial neural networks require careful design by experts over years of difficult research, and typically address one specific task, such as to find what's in a photograph, to call a genetic variant, or to help diagnose a disease. Ideally, one would want to have an automated method to generate the right architecture for any given task.

One approach to generating these architectures is through the use of evolutionary algorithms. Traditional research into neuro-evolution of topologies (e.g., Stanley and Miikkulainen 2002) has laid the foundations that allow us to apply these algorithms at scale today, and many groups are working on the subject, including OpenAI, Uber Labs, Sentient Labs and DeepMind. Of course, the Google Brain team has been thinking about AutoML too. In addition to learning-based approaches (e.g., reinforcement learning), we wondered if we could use our computational resources to programmatically evolve image classifiers at unprecedented scale. Can we achieve solutions with minimal expert participation? How good can today's artificially-evolved neural networks be? We address these questions through two papers.

In “Large-Scale Evolution of Image Classifiers,” presented at ICML 2017, we set up an evolutionary process with simple building blocks and trivial initial conditions. The idea was to "sit back" and let evolution at scale do the work of constructing the architecture. Starting from very simple networks, the process found classifiers comparable to hand-designed models at the time. This was encouraging because many applications may require little user participation. For example, some users may need a better model but may not have the time to become machine learning experts. A natural question to consider next was whether a combination of hand-design and evolution could do better than either approach alone. Thus, in our more recent paper, “Regularized Evolution for Image Classifier Architecture Search” (2018), we participated in the process by providing sophisticated building blocks and good initial conditions (discussed below). Moreover, we scaled up computation using Google's new TPUv2 chips. This combination of modern hardware, expert knowledge, and evolution worked together to produce state-of-the-art models on CIFAR-10 and ImageNet, two popular benchmarks for image classification.

A Simple Approach

The following is an example of an experiment from our first paper. In the figure below, each dot is a neural network trained on the CIFAR-10 dataset, which is commonly used to train image classifiers. Initially, the population consists of one thousand identical simple seed models (no hidden layers). Starting from simple seed models is important: if we had started from a high-quality model with initial conditions containing expert knowledge, it would have been easier to get a high-quality model in the end. Once seeded with the simple models, the process advances in steps. At each step, a pair of neural networks is chosen at random. The network with higher accuracy is selected as a parent and is copied and mutated to generate a child that is then added to the population, while the other neural network dies out. All other networks remain unchanged during the step. With the application of many such steps in succession, the population evolves.
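The step described above can be sketched in a few lines. This is a toy illustration, not the code from the paper: an "individual" is a single number rather than a neural network, "accuracy" is a made-up function that peaks at 3.0, and all names are hypothetical. In a real run, evaluating fitness would mean training and validating a network.

```python
import random


def evolve(population, fitness, mutate, steps, rng):
    """Pairwise-tournament evolution: the better of two random individuals
    is copied and mutated; the child takes the place of the worse one."""
    population = list(population)  # work on a copy
    for _ in range(steps):
        i, j = rng.sample(range(len(population)), 2)  # pick a random pair
        if fitness(population[i]) >= fitness(population[j]):
            parent, loser = i, j
        else:
            parent, loser = j, i
        child = mutate(population[parent], rng)  # copy and mutate the winner
        population[loser] = child                # the loser dies out
    return population


# Toy stand-ins (assumptions): fitness peaks at x = 3.0.
def toy_fitness(x):
    return -(x - 3.0) ** 2


def toy_mutate(x, rng):
    return x + rng.gauss(0.0, 0.5)


rng = random.Random(0)
seed_population = [0.0] * 100  # identical simple seed models
final_population = evolve(seed_population, toy_fitness, toy_mutate, 5000, rng)
best = max(final_population, key=toy_fitness)
print(f"best individual: {best:.3f}")
```

Since the winner of each pair always survives, the best fitness in the population can never decrease, and the population size stays constant throughout.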

Progress of an evolution experiment. Each dot represents an individual in the population. The four diagrams show examples of discovered architectures. These correspond to the best individual (rightmost; selected by validation accuracy) and three of its ancestors.

The mutations in our first paper are purposefully simple: remove a convolution at random, add a skip connection between arbitrary layers, or change the learning rate, to name a few. This way, the results show the potential of the evolutionary algorithm, as opposed to the quality of the search space. For example, if we had used a single mutation that transforms one of the seed networks into an Inception-ResNet classifier in one step, we would be incorrectly concluding that the algorithm found a good answer. Yet, in that case, all we would have done is hard-coded the final answer into a complex mutation, rigging the outcome. If instead we stick with simple mutations, this cannot happen and evolution is truly doing the job. In the experiment in the figure, simple mutations and the selection process cause the networks to improve over time and reach high test accuracies, even though the test set had never been seen during the process. In this paper, the networks can also inherit their parent's weights. Thus, in addition to evolving the architecture, the population trains its networks while exploring the search space of initial conditions and learning-rate schedules. As a result, the process yields fully trained models with optimized hyperparameters. No expert input is needed after the experiment starts.
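To illustrate how simple such mutations can be, here is a sketch of the three named above acting on a toy description of a network. The representation (a dictionary with named layers, skip connections as pairs of layer names, and a learning rate) and all function names are assumptions made for this example, not the encoding used in the paper.

```python
import copy
import random


def remove_convolution(genome, rng):
    """Delete one convolution at random, along with skips that touch it."""
    child = copy.deepcopy(genome)
    convs = [name for name, kind in child["layers"] if kind == "conv"]
    if convs:
        victim = rng.choice(convs)
        child["layers"] = [(n, k) for n, k in child["layers"] if n != victim]
        child["skips"] = [s for s in child["skips"] if victim not in s]
    return child


def add_skip_connection(genome, rng):
    """Connect an earlier layer to a later one, chosen at random."""
    child = copy.deepcopy(genome)
    if len(child["layers"]) >= 2:
        src, dst = sorted(rng.sample(range(len(child["layers"])), 2))
        edge = (child["layers"][src][0], child["layers"][dst][0])
        if edge not in child["skips"]:
            child["skips"].append(edge)
    return child


def change_learning_rate(genome, rng):
    """Scale the learning rate by a random factor between 0.5 and 2."""
    child = copy.deepcopy(genome)
    child["learning_rate"] *= rng.uniform(0.5, 2.0)
    return child


MUTATIONS = [remove_convolution, add_skip_connection, change_learning_rate]


def mutate(genome, rng):
    """Apply one randomly chosen mutation; the parent is left untouched."""
    return rng.choice(MUTATIONS)(genome, rng)


rng = random.Random(0)
parent = {"layers": [("conv1", "conv"), ("relu1", "relu"), ("conv2", "conv")],
          "skips": [], "learning_rate": 0.1}
child = mutate(parent, rng)
```

Each operator makes one small local change, so any large improvement has to be assembled by selection over many steps rather than handed over in a single mutation.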

In all the above, even though we were minimizing the researcher's participation by having simple initial architectures and intuitive mutations, a good amount of expert knowledge went into the building blocks those architectures were made of. These included important inventions such as convolutions, ReLUs and batch-normalization layers. We were evolving an architecture made up of these components. The term "architecture" is not accidental: this is analogous to constructing a house with high-quality bricks.

Combining Evolution and Hand Design

After our first paper, we wanted to reduce the search space to something more manageable by giving the algorithm fewer choices to explore. Using our architectural analogy, we removed all the possible ways of making large-scale errors, such as putting the wall above the roof, from the search space. Similarly, with neural network architecture searches, we can help the algorithm out by fixing the large-scale structure of the network. So how do we do this? The inception-like modules introduced in Zoph et al. (2017) for the purpose of architecture search proved very powerful. Their idea is to have a deep stack of repeated modules called cells. The stack is fixed but the architecture of the individual modules can change.

The building blocks introduced in Zoph et al. (2017). The diagram on the left is the outer structure of the full neural network, which parses the input data from bottom to top through a stack of repeated cells. The diagram on the right is the inside structure of a cell. The goal is to find a cell that yields an accurate network.

In our second paper, “Regularized Evolution for Image Classifier Architecture Search” (2018), we presented the results of applying evolutionary algorithms to the search space described above. The mutations modify the cell by randomly reconnecting the inputs (the arrows on the right diagram in the figure) or randomly replacing the operations (for example, they can replace the "max 3x3" in the figure, a max-pool operation, with an arbitrary alternative). These mutations are still relatively simple, but the initial conditions are not: the population is now initialized with models that must conform to the outer stack of cells, which was designed by an expert. Even though the cells in these seed models are random, we are no longer starting from simple models, which makes it easier to get to high-quality models in the end. If the evolutionary algorithm is contributing meaningfully, the final networks should be significantly better than the networks we already know can be constructed within this search space. Our paper shows that evolution can indeed find state-of-the-art models that either match or outperform hand-designs.

A Controlled ComparisonEven though the mutation/selection evolutionary process is not complicated, maybe an even more straightforward approach (like random search) could have done the same. Other alternatives, though not simpler, also exist in the literature (like reinforcement learning). Because of this, the main purpose of our second paper was to provide a controlled comparison between techniques.

Comparison between evolution, reinforcement learning, and random search for the purposes of architecture search. These experiments were done on the CIFAR-10 dataset, under the same conditions as Zoph et al. (2017), where the search space was originally used with reinforcement learning.

The figure above compares evolution, reinforcement learning, and random search. On the left, each curve represents the progress of an experiment, showing that evolution is faster than reinforcement learning in the earlier stages of the search. This is significant because with less compute power available, the experiments may have to stop early. Moreover evolution is quite robust to changes in the dataset or search space. Overall, the goal of this controlled comparison is to provide the research community with the results of a computationally expensive experiment. In doing so, it is our hope to facilitate architecture searches for everyone by providing a case study of the relationship between the different search algorithms. Note, for example, that the figure above shows that the final models obtained with evolution can reach very high accuracy while using fewer floating-point operations.

One important feature of the evolutionary algorithm we used in our second paper is a form of regularization: instead of letting the worst neural networks die, we remove the oldest ones — regardless of how good they are. This improves robustness to changes in the task being optimized and tends to produce more accurate networks in the end. One reason for this may be that since we didn't allow weight inheritance, all networks must train from scratch. Therefore, this form of regularization selects for networks that remain good when they are re-trained. In other words, because a model can be more accurate just by chance — noise in the training process means even identical architectures may get different accuracy values — only architectures that remain accurate through the generations will survive in the long run, leading to the selection of networks that retrain well. More details of this conjecture can be found in the paper.

The state-of-the-art models we evolved are nicknamed AmoebaNets, and are one of the latest results from our AutoML efforts. All these experiments took a lot of computation — we used hundreds of GPUs/TPUs for days. Much like a single modern computer can outperform thousands of decades-old machines, we hope that in the future these experiments will become household. Here we aimed to provide a glimpse into that future.

The brain has evolved over a long time, from very simple worm brains 500 million years ago to a diversity of modern structures today. The human brain, for example, can accomplish a wide variety of activities, many of them effortlessly — telling whether a visual scene contains animals or buildings feels trivial to us, for example. To perform activities like these, artificial neural networks require careful design by experts over years of difficult research, and typically address one specific task, such as to find what's in a photograph, to call a genetic variant, or to help diagnose a disease. Ideally, one would want to have an automated method to generate the right architecture for any given task.

One approach to generating these architectures is through the use of evolutionary algorithms. Traditional research into neuro-evolution of topologies (e.g., Stanley and Miikkulainen 2002) has laid the foundations that allow us to apply these algorithms at scale today, and many groups are working on the subject, including OpenAI, Uber Labs, Sentient Labs and DeepMind. Of course, the Google Brain team has been thinking about AutoML too. In addition to learning-based approaches (e.g., reinforcement learning), we wondered if we could use our computational resources to programmatically evolve image classifiers at unprecedented scale. Can we achieve solutions with minimal expert participation? How good can today's artificially evolved neural networks be? We address these questions through two papers.

In “Large-Scale Evolution of Image Classifiers,” presented at ICML 2017, we set up an evolutionary process with simple building blocks and trivial initial conditions. The idea was to "sit back" and let evolution at scale do the work of constructing the architecture. Starting from very simple networks, the process found classifiers comparable to hand-designed models at the time. This was encouraging because many applications may require little user participation. For example, some users may need a better model but may not have the time to become machine learning experts. A natural question to consider next was whether a combination of hand-design and evolution could do better than either approach alone. Thus, in our more recent paper, “Regularized Evolution for Image Classifier Architecture Search” (2018), we participated in the process by providing sophisticated building blocks and good initial conditions (discussed below). Moreover, we scaled up computation using Google's new TPUv2 chips. This combination of modern hardware, expert knowledge, and evolution worked together to produce state-of-the-art models on CIFAR-10 and ImageNet, two popular benchmarks for image classification.

A Simple Approach

The following is an example of an experiment from our first paper. In the figure below, each dot is a neural network trained on the CIFAR-10 dataset, which is commonly used to train image classifiers. Initially, the population consists of one thousand identical simple seed models (no hidden layers). Starting from simple seed models is important: if we had started from a high-quality model with initial conditions containing expert knowledge, it would have been easier to get a high-quality model in the end. Once seeded with the simple models, the process advances in steps. At each step, a pair of neural networks is chosen at random. The network with higher accuracy is selected as a parent and is copied and mutated to generate a child that is then added to the population, while the other neural network dies out. All other networks remain unchanged during the step. With the application of many such steps in succession, the population evolves.
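The step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the paper: `toy_accuracy` is a hypothetical stand-in for training a network and measuring its accuracy, and a "network" is reduced to a short list of integers.

```python
import random

def toy_accuracy(genome):
    # Hypothetical stand-in for "train the network and measure its accuracy".
    return 1.0 / (1.0 + sum((g - 3) ** 2 for g in genome))

def make_individual(genome):
    return {"genome": list(genome), "accuracy": toy_accuracy(genome)}

def mutate(parent, rng):
    # Copy the parent and nudge one randomly chosen entry by +1 or -1.
    genome = list(parent["genome"])
    k = rng.randrange(len(genome))
    genome[k] += rng.choice([-1, 1])
    return make_individual(genome)

def evolve_step(population, rng):
    # Choose a pair at random; the better one becomes the parent.
    i, j = rng.sample(range(len(population)), 2)
    if population[i]["accuracy"] < population[j]["accuracy"]:
        i, j = j, i  # after the swap, i is the parent and j is the loser
    # The loser dies out and the mutated child takes its place.
    population[j] = mutate(population[i], rng)

def run(num_steps=2000, population_size=50, seed=0):
    rng = random.Random(seed)
    # All individuals start as identical, simple seed models.
    population = [make_individual([0, 0, 0]) for _ in range(population_size)]
    best_history = []
    for _ in range(num_steps):
        evolve_step(population, rng)
        best_history.append(max(ind["accuracy"] for ind in population))
    return population, best_history
```

Because the parent always stays in the population, the best accuracy in this toy version can never decrease from one step to the next, while the population size stays constant.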

Progress of an evolution experiment. Each dot represents an individual in the population. The four diagrams show examples of discovered architectures. These correspond to the best individual (rightmost; selected by validation accuracy) and three of its ancestors.

The mutations in our first paper are purposefully simple: remove a convolution at random, add a skip connection between arbitrary layers, or change the learning rate, to name a few. This way, the results show the potential of the evolutionary algorithm, as opposed to the quality of the search space. For example, if we had used a single mutation that transforms one of the seed networks into an Inception-ResNet classifier in one step, we would have incorrectly concluded that the algorithm found a good answer. Yet, in that case, all we would have done is hard-code the final answer into a complex mutation, rigging the outcome. If instead we stick with simple mutations, this cannot happen and evolution is truly doing the job. In the experiment in the figure, simple mutations and the selection process cause the networks to improve over time and reach high test accuracies, even though the test set had never been seen during the process. In this paper, the networks can also inherit their parent's weights. Thus, in addition to evolving the architecture, the population trains its networks while exploring the search space of initial conditions and learning-rate schedules. As a result, the process yields fully trained models with optimized hyperparameters. No expert input is needed after the experiment starts.
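To make the three mutations named above concrete, here is a minimal sketch. The model representation (a dictionary with `layers`, `skips` and `learning_rate`) and the learning-rate factors of 0.5 and 2.0 are assumptions made for illustration; they are not taken from the paper.

```python
import copy
import random

def remove_convolution(model, rng):
    child = copy.deepcopy(model)
    conv_ids = [layer["id"] for layer in child["layers"] if layer["op"] == "conv"]
    if not conv_ids:
        return child  # no convolution to remove, so the mutation is a no-op
    victim = rng.choice(conv_ids)
    child["layers"] = [l for l in child["layers"] if l["id"] != victim]
    # Drop skip connections that referenced the removed layer.
    child["skips"] = {(s, d) for (s, d) in child["skips"] if victim not in (s, d)}
    return child

def add_skip_connection(model, rng):
    child = copy.deepcopy(model)
    if len(child["layers"]) < 2:
        return child
    # Pick two distinct positions; the connection goes from the earlier to the later one.
    a, b = sorted(rng.sample(range(len(child["layers"])), 2))
    child["skips"].add((child["layers"][a]["id"], child["layers"][b]["id"]))
    return child

def alter_learning_rate(model, rng):
    child = copy.deepcopy(model)
    child["learning_rate"] *= rng.choice([0.5, 2.0])  # assumed factors
    return child

MUTATIONS = [remove_convolution, add_skip_connection, alter_learning_rate]

def mutate(model, rng):
    # Apply one randomly chosen simple mutation; the parent is left untouched.
    return rng.choice(MUTATIONS)(model, rng)
```

Each mutation is a small local change, so no single step can jump from a seed model to a sophisticated hand-designed network.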

In all the above, even though we were minimizing the researcher's participation by having simple initial architectures and intuitive mutations, a good amount of expert knowledge went into the building blocks those architectures were made of. These included important inventions such as convolutions, ReLUs and batch-normalization layers. We were evolving an architecture made up of these components. The term "architecture" is not accidental: this is analogous to constructing a house with high-quality bricks.

Combining Evolution and Hand Design

After our first paper, we wanted to reduce the search space to something more manageable by giving the algorithm fewer choices to explore. Using our architectural analogy, we removed all the possible ways of making large-scale errors, such as putting the wall above the roof, from the search space. Similarly, in neural network architecture search, we can help the algorithm out by fixing the large-scale structure of the network. So how to do this? The Inception-like modules introduced in Zoph et al. (2017) for the purpose of architecture search proved very powerful. Their idea is to have a deep stack of repeated modules called cells. The stack is fixed but the architecture of the individual modules can change.

The building blocks introduced in Zoph et al. (2017). The diagram on the left is the outer structure of the full neural network, which parses the input data from bottom to top through a stack of repeated cells. The diagram on the right is the inside structure of a cell. The goal is to find a cell that yields an accurate network.

In our second paper, “Regularized Evolution for Image Classifier Architecture Search” (2018), we presented the results of applying evolutionary algorithms to the search space described above. The mutations modify the cell by randomly reconnecting the inputs (the arrows on the right diagram in the figure) or randomly replacing the operations (for example, they can replace the "max 3x3" in the figure, a max-pool operation, with an arbitrary alternative). These mutations are still relatively simple, but the initial conditions are not: the population is now initialized with models that must conform to the outer stack of cells, which was designed by an expert. Even though the cells in these seed models are random, we are no longer starting from simple models, which makes it easier to get to high-quality models in the end. If the evolutionary algorithm is contributing meaningfully, the final networks should be significantly better than the networks we already know can be constructed within this search space. Our paper shows that evolution can indeed find state-of-the-art models that either match or outperform hand-designs.
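The two cell mutations described above (reconnecting an input, replacing an operation) can be sketched as follows. The cell encoding and the list of candidate operations in `OPS` are hypothetical simplifications for illustration, not the exact search space of Zoph et al. (2017).

```python
import copy
import random

# Hypothetical operation vocabulary; the real search space has its own list.
OPS = ["identity", "max 3x3", "avg 3x3", "sep 3x3", "sep 5x5"]

def random_cell(num_nodes, rng):
    # Indices 0 and 1 are the cell's two inputs. Every later node combines
    # two earlier nodes, applying one operation to each.
    cell = []
    for idx in range(2, 2 + num_nodes):
        cell.append({
            "inputs": [rng.randrange(idx), rng.randrange(idx)],
            "ops": [rng.choice(OPS), rng.choice(OPS)],
        })
    return cell

def mutate_connection(cell, rng):
    # Rewire one input of one node to a random earlier node.
    # The new source may equal the old one, in which case nothing changes.
    child = copy.deepcopy(cell)
    n = rng.randrange(len(child))
    slot = rng.randrange(2)
    child[n]["inputs"][slot] = rng.randrange(n + 2)
    return child

def mutate_op(cell, rng):
    # Replace one operation with a different one from the vocabulary.
    child = copy.deepcopy(cell)
    n = rng.randrange(len(child))
    slot = rng.randrange(2)
    current = child[n]["ops"][slot]
    child[n]["ops"][slot] = rng.choice([op for op in OPS if op != current])
    return child
```

Only the inside of the cell changes here; the outer stack of repeated cells is fixed and is not part of what gets mutated.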

A Controlled Comparison

Even though the mutation/selection evolutionary process is not complicated, maybe an even more straightforward approach (like random search) could have done the same. Other alternatives, though not simpler, also exist in the literature (like reinforcement learning). Because of this, the main purpose of our second paper was to provide a controlled comparison between techniques.

Comparison between evolution, reinforcement learning, and random search for the purposes of architecture search. These experiments were done on the CIFAR-10 dataset, under the same conditions as Zoph et al. (2017), where the search space was originally used with reinforcement learning.

The figure above compares evolution, reinforcement learning, and random search. On the left, each curve represents the progress of an experiment, showing that evolution is faster than reinforcement learning in the earlier stages of the search. This is significant because with less compute power available, the experiments may have to stop early. Moreover, evolution is quite robust to changes in the dataset or search space. Overall, the goal of this controlled comparison is to provide the research community with the results of a computationally expensive experiment. In doing so, we hope to facilitate architecture searches for everyone by providing a case study of the relationship between the different search algorithms. Note, for example, that the figure above shows that the final models obtained with evolution can reach very high accuracy while using fewer floating-point operations.

One important feature of the evolutionary algorithm we used in our second paper is a form of regularization: instead of letting the worst neural networks die, we remove the oldest ones — regardless of how good they are. This improves robustness to changes in the task being optimized and tends to produce more accurate networks in the end. One reason for this may be that since we didn't allow weight inheritance, all networks must train from scratch. Therefore, this form of regularization selects for networks that remain good when they are re-trained. In other words, because a model can be more accurate just by chance — noise in the training process means even identical architectures may get different accuracy values — only architectures that remain accurate through the generations will survive in the long run, leading to the selection of networks that retrain well. More details of this conjecture can be found in the paper.
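The "remove the oldest" rule can be illustrated with a queue. The sketch below is a toy version under stated assumptions: models are bit-strings, `evaluate` is a noisy hypothetical stand-in for training from scratch, and choosing the parent as the best of a random sample of size `sample_size` is an assumption of this example rather than something stated in the text above.

```python
import collections
import random

def random_model(rng):
    return [rng.randint(0, 1) for _ in range(16)]

def mutate(model, rng):
    child = list(model)
    k = rng.randrange(len(child))
    child[k] = 1 - child[k]  # flip one bit
    return child

def evaluate(model, rng):
    # Noisy score: identical models can receive different accuracies.
    return sum(model) / len(model) + rng.gauss(0.0, 0.05)

def regularized_evolution(cycles, population_size, sample_size, rng):
    population = collections.deque()  # oldest individual sits on the left
    history = []                      # every individual ever evaluated
    while len(population) < population_size:
        model = random_model(rng)
        individual = (model, evaluate(model, rng))
        population.append(individual)
        history.append(individual)
    for _ in range(cycles):
        sample = rng.sample(list(population), sample_size)
        parent = max(sample, key=lambda ind: ind[1])
        child_model = mutate(parent[0], rng)
        # The child is evaluated from scratch; nothing is inherited but the architecture.
        child = (child_model, evaluate(child_model, rng))
        population.append(child)
        history.append(child)
        population.popleft()  # remove the oldest, regardless of how good it is
    return history, population
```

In this version a model that scored well only by luck cannot linger: it is removed once it becomes the oldest, and its architecture survives only if its re-evaluated descendants keep scoring well.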

The state-of-the-art models we evolved are nicknamed AmoebaNets, and are one of the latest results from our AutoML efforts. All these experiments took a lot of computation: we used hundreds of GPUs/TPUs for days. Much like a single modern computer can outperform thousands of decades-old machines, we hope that in the future these experiments will become commonplace. Here we aimed to provide a glimpse into that future.

Posted by Jonathan Shen and Ruoming Pang, Software Engineers, on behalf of the Google Brain and Machine Perception Teams

Generating very natural-sounding speech from text (text-to-speech, TTS) has been a research goal for decades. There has been great progress in TTS research over the last few years and many individual pieces of a complete TTS system have greatly improved. Incorporating ideas from past work such as Tacotron and WaveNet, we added more improvements to end up with our new system, Tacotron 2. Our approach does not use complex linguistic and acoustic features as input. Instead, we generate human-like speech from text using neural networks trained using only speech examples and corresponding text transcripts.

A full description of our new system can be found in our paper “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.” In a nutshell, it works like this: we use a sequence-to-sequence model optimized for TTS to map a sequence of letters to a sequence of features that encode the audio. These features, an 80-dimensional audio spectrogram with frames computed every 12.5 milliseconds, capture not only pronunciation of words, but also various subtleties of human speech, including volume, speed and intonation. Finally, these features are converted to a 24 kHz waveform using a WaveNet-like architecture.
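The numbers above imply some simple arithmetic about how much audio each spectrogram frame accounts for. The sketch below only works through those figures; it assumes one frame per 12.5 ms step with no padding or windowing effects, and the function names are made up for this example.

```python
SAMPLE_RATE_HZ = 24_000   # output waveform sample rate
FRAME_SHIFT_MS = 12.5     # one spectrogram frame every 12.5 ms
NUM_CHANNELS = 80         # dimensions per spectrogram frame

def samples_per_frame():
    # Waveform samples that each spectrogram frame has to account for.
    return int(round(SAMPLE_RATE_HZ * FRAME_SHIFT_MS / 1000))

def spectrogram_shape(duration_s):
    # (frames, channels) for an utterance of the given length, ignoring padding.
    num_frames = int(duration_s * 1000 / FRAME_SHIFT_MS)
    return (num_frames, NUM_CHANNELS)

def waveform_length(duration_s):
    return int(duration_s * SAMPLE_RATE_HZ)
```

Under these assumptions each frame corresponds to 300 waveform samples, so a two-second utterance is a 160 x 80 spectrogram that the final stage expands into 48,000 samples.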

A detailed look at Tacotron 2's model architecture. The lower half of the image describes the sequence-to-sequence model that maps a sequence of letters to a spectrogram. For technical details, please refer to the paper.

You can listen to some of the Tacotron 2 audio samples that demonstrate the results of our state-of-the-art TTS system. In an evaluation where we asked human listeners to rate the naturalness of the generated speech, we obtained a score that was comparable to that of professional recordings.

While our samples sound great, there are still some difficult problems to be tackled. For example, our system has difficulties pronouncing complex words (such as “decorum” and “merlot”), and in extreme cases it can even randomly generate strange noises. Also, our system cannot yet generate audio in real time. Furthermore, we cannot yet control the generated speech, such as directing it to sound happy or sad. Each of these is an interesting research problem on its own.

This week, Long Beach, California hosts the 31st annual Conference on Neural Information Processing Systems (NIPS 2017), a machine learning and computational neuroscience conference that includes invited talks, demonstrations and presentations of some of the latest in machine learning research. Google will have a strong presence at NIPS 2017, with over 450 Googlers attending to contribute to, and learn from, the broader academic research community via technical talks and posters, workshops, competitions and tutorials.

Google is at the forefront of machine learning, actively exploring virtually all aspects of the field from classical algorithms to deep learning and more. Focusing on both theory and application, much of our work on language understanding, speech, translation, visual processing, and prediction relies on state-of-the-art techniques that push the boundaries of what is possible. In all of those tasks and many others, we develop learning approaches to understand and generalize, providing us with new ways of looking at old problems and helping transform how we work and live.

If you are attending NIPS 2017, we hope you’ll stop by our booth and chat with our researchers about the projects and opportunities at Google that go into solving interesting problems for billions of people, and to see demonstrations of some of the exciting research we pursue. You can also learn more about our work being presented in the list below (Googlers highlighted in blue).