Amazon Alexa is the Founding Sponsor of Interspeech 2019!

The Alexa team looks forward to meeting you at Interspeech 2019! Come and visit our booth to learn more about our research and career opportunities. Below is more information about our technology and team.

Technologies We Focus On

The Alexa Science team made the magic of Alexa possible, but that was just the beginning. Our goal is to make voice interfaces ubiquitous and as natural as speaking to a human. We have a relentless focus on the customer experience and customer feedback. We use many real-world data sources, including customer interactions, and a variety of techniques, such as highly scalable deep learning. Learning at this massive scale requires new research and development. The team is responsible for cutting-edge research and development in virtually all fields of Human Language Technology: Automatic Speech Recognition (ASR), Artificial Intelligence (AI), Natural Language Understanding (NLU), Question Answering, Dialog Management, and Text-to-Speech (TTS). See an interview with VP Rohit Prasad here.

Alexa scientists and developers have large-scale impact on customers’ lives and on the industry-wide shift to voice user interfaces. Scientists and engineers on the Alexa team also invent new tools and APIs to accelerate development of voice services by empowering developers through the Alexa Skills Kit and the Alexa Voice Service. For example, developers can now create a new voice experience by simply providing a few sample sentences.

Alexa Research

Your discoveries in speech recognition, natural language understanding, deep learning, and other disciplines of machine learning can fuel new ideas and applications that have direct impact on people’s lives. We firmly believe that our team must engage deeply with the academic community and be part of the scientific discourse. There are many opportunities for presentations at internal machine learning conferences, which act as a springboard for publications at premier conferences. We also partner with universities through the Alexa Prize.

3 questions with Dilek Hakkani-Tür, the senior principal scientist leading Alexa's research on dialogue systems

You joined Alexa a year ago, and your team has three papers at this year’s Interspeech. What are they about?

Two of them are for multidomain task-oriented dialogues. The first paper is about unifying the dialogue-act schema of multiple task-oriented-dialogue data sets, with the goal of building a universal dialogue-act tagger. Dialogue acts have been studied for a very long time, and there are multiple data sets that are available, each annotated with a set of core dialogue acts that describe interactions at the level of intentions. Unfortunately, there are differences in the annotation schemas used for these data sets.

Our thinking was, to benefit from these data sets, we can try to map them into the same space. After making trivial mappings manually, we train a dialogue-act-tagging model on one corpus to label the other one, and vice versa, to automatically detect alignments between acts of different schemas.

Then we can train the universal dialogue-act tagger to annotate new data, such as human-human task-oriented conversations. Our ultimate goal is to be able to train a complete dialogue system from these human-human conversations.
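As a rough illustration of the cross-corpus alignment step described above, the sketch below takes the gold dialogue acts of one corpus and the acts that a tagger trained on the other corpus predicts for the same utterances, then maps each gold act to its most frequently predicted counterpart. The function name `align_acts`, the `min_share` threshold, and the toy labels are assumptions made for this example; the interview does not describe the actual alignment criterion.

```python
from collections import Counter, defaultdict


def align_acts(gold_acts, predicted_acts, min_share=0.6):
    """Map each gold act of corpus B to the corpus-A act most often predicted for it.

    min_share is a hypothetical cutoff: a mapping is kept only if the most
    frequent predicted act covers at least this fraction of the utterances.
    """
    counts = defaultdict(Counter)
    for gold, pred in zip(gold_acts, predicted_acts):
        counts[gold][pred] += 1

    mapping = {}
    for gold, preds in counts.items():
        best, n = preds.most_common(1)[0]
        if n / sum(preds.values()) >= min_share:
            mapping[gold] = best
    return mapping


# Toy data: gold labels from corpus B, and the labels a tagger trained on
# corpus A assigns to the same utterances.
gold_b = ["request", "request", "inform", "inform", "inform", "bye"]
pred_a = ["ask", "ask", "tell", "tell", "ask", "goodbye"]
alignment = align_acts(gold_b, pred_a)
print(alignment)
```

Running the same procedure in the opposite direction (a tagger trained on corpus B labeling corpus A) would give the "vice versa" half of the alignment.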

The other paper is about multidomain dialogue state tracking. After each user turn, state tracking aims to estimate what the user’s request is, given all the turns in the conversation up to that point. Let’s say you’re looking for restaurants, and the backend restaurant API has three slots, for cuisine, price, and area. You may say, “I want Indian food in the center.” And the system may say, “Oh, sorry, there’s no Indian restaurant in the center of town. How about north?”

When you agree to this utterance, the constraint for restaurant cuisine, Indian, is still valid, but the area constraint changes from center to north. Dialogue state tracking estimates which information stays the same and which is changed to a new value.
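The restaurant example can be written out as a minimal sketch of a state update. Here the per-turn constraints are supplied by hand; in a real tracker they would be estimated from the natural-language turns by a learned model. The function name `update_state` and the dictionary representation are assumptions for illustration only.

```python
def update_state(state, turn_constraints):
    """Return a new dialogue state: slots mentioned in this turn are
    overwritten, all other slots keep their previous value."""
    new_state = dict(state)
    new_state.update(turn_constraints)
    return new_state


# The three slots of the hypothetical restaurant API, initially unfilled.
state = {"cuisine": None, "price": None, "area": None}

# User: "I want Indian food in the center."
state = update_state(state, {"cuisine": "Indian", "area": "center"})

# System suggests north and the user agrees: only the area changes.
state = update_state(state, {"area": "north"})
print(state)
```

After the second turn the cuisine constraint is still Indian, while the area constraint has changed from center to north.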

Many previous approaches do not scale to real applications, where one can observe rich natural-language utterances that can include previously unseen slot-value mentions and a large, possibly unlimited space of dialogue states. Our proposed approach is a hybrid one. Building on the previous work, one model forms slot-value vocabularies from the training examples, and at inference time, given the dialogue context and each slot, it estimates a probability for each value in the slot vocabulary.

The other model is an open-vocabulary model, which estimates a list of targets from the dialogue context — for example, considering named-entity results or all possible n-grams in the context — and makes a binary decision for each target to estimate if that should be the value of the slot. We have observed that the first approach performs well on closed-vocabulary slots, and the other one performs well on slots that have a large set of values. The final decision is made by combining these two methods.
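To make the two-model idea concrete, here is a toy sketch. Both predictors are simple stand-ins for learned models, and the combination rule (prefer the fixed-vocabulary value when its score clears a threshold, otherwise fall back to the open-vocabulary candidate) is an assumption: the interview does not say how the two methods are actually combined. All names, including `hybrid_predict` and `closed_threshold`, are hypothetical.

```python
def closed_vocab_predict(context, vocabulary):
    """Stand-in for the fixed-vocabulary model: score every known slot value."""
    if not vocabulary:
        return None, 0.0
    tokens = context.lower().split()
    scores = {v: (1.0 if v.lower() in tokens else 0.0) for v in vocabulary}
    best = max(scores, key=scores.get)
    return best, scores[best]


def open_vocab_predict(context, is_value, max_n=3):
    """Stand-in for the open-vocabulary model: enumerate n-grams from the
    context and make a binary decision for each candidate."""
    tokens = context.split()
    candidates = [
        " ".join(tokens[i:j])
        for i in range(len(tokens))
        for j in range(i + 1, min(i + max_n, len(tokens)) + 1)
    ]
    accepted = [c for c in candidates if is_value(c)]
    return (accepted[0], 1.0) if accepted else (None, 0.0)


def hybrid_predict(context, vocabulary, is_value, closed_threshold=0.5):
    """Assumed combination rule: trust the closed-vocabulary value when it is
    confident, otherwise use the open-vocabulary candidate."""
    value, score = closed_vocab_predict(context, vocabulary)
    if score >= closed_threshold:
        return value
    return open_vocab_predict(context, is_value)[0]


context = "book a table at Golden Wok in the north"

# Small, closed slot: the value is in the training vocabulary.
area = hybrid_predict(context, ["center", "north", "south"], lambda s: False)

# Open slot: the restaurant name was never seen in training.
name = hybrid_predict(context, ["Pizza Hut"], lambda s: s == "Golden Wok")
print(area, name)
```

In this toy run the area slot is resolved from the fixed vocabulary, while the unseen restaurant name is recovered from the n-gram candidates.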

That covers the two papers on task-oriented dialogues. How about the third paper?

The third is related to the Alexa Prize. The Alexa Prize is a fantastic framework to engage universities and enable their systems to interact with real users. I’m quite proud that Amazon came up with this idea.

While we have seen significant progress over the last two years of the Alexa Prize Challenge, we are still far away from the grand challenge of 20-minute engaging conversations. We have observed common problems, such as lack of conversational depth (oftentimes, bots run out of things to say and try to change the topic of conversation) or loops in the conversation flows, where the same user can go through a similar flow in the same dialogue. Furthermore, we observed that university teams have difficulty publishing their original work, due to challenges related to repeatable research. To enable research on deeper conversations and enable teams to publish their findings, we decided to release a topical-conversation data set and a set of benchmarks.

We started with a set of entities that Alexa Prize participants like to talk about, and starting with the most common ones, we looked into groups of entities — in this case, three entities. We curated knowledge and reading sets consisting of articles and fun facts about all three entities. Then we paired crowd workers and gave them reading content. Sometimes we gave the two the exact same content; sometimes we created some differences in the reading content — for example, split fun facts between the workers, to create engaging and lively interactions.

Then we asked them to engage in a conversation on the topic of the reading set. We also asked them to write down where they found their facts, so that they are grounding their responses on the loosely structured knowledge in the reading set.

You’ve been a regular attendee at Interspeech since it launched, in 2000. What appeals to you so much about the conference?

I think I only missed one, the time I was pregnant.

One of the reasons I like Interspeech is that it has a breadth of topics about speech, language, and conversation from all kinds of perspectives. I don’t think that richness has changed much over the years. I can find papers interesting to me — for example, on language, dialogue, and spoken conversations — as well as on broader speech-processing areas.

I usually try to submit something to Interspeech so that I can get feedback from others. One of our papers got in as a poster; the other two are orals. I really like the posters because I see them as a way of learning more from other people through interactions when presenting the posters. Yes, you know the problem and solution better than anyone, as you’re the one who worked on it and came up with the ideas. But other participants have all kinds of new ideas or questions, and sometimes some of them are things that you didn’t think of before. I don’t see this as much at some other conferences.

The other thing is you also kind of measure the pulse of the audience. In the oral sessions you can see how many people participated. But when you enter a poster session, you literally see where the interesting ideas are by looking at where all the people are. And as an author, you also get that form of feedback. It’s quite overwhelming — you get tired after these two hours of presentation. But it’s a great experience. I still like it.

Connect with us at Interspeech!

If you would like to meet with us in person at the conference, please contact interspeech2019@amazon.com.

Are you ready for your next opportunity? Check out our open positions here, and learn more about the Alexa team here. We have global opportunities available, and speech and machine learning scientists will be available to meet at Interspeech.