[An edited transcript of the interview follows.]
What makes you say we are on the cusp of voice becoming the primary way we communicate with our devices?
Think about 1976, when [Apple co-founder] Steve Wozniak built the first PC with a monitor and a keyboard. Fast-forward to today, and people are still using a monitor and keyboard to interact with most of their devices. Even with smartphones, you either type on or touch a screen to get output. This is a problem because it actually immobilizes us. Even though you might be walking around, your attention is still focused on a screen. That’s changing with voice—for three reasons: increases in computing power in smaller devices; the ability to collect and analyze large amounts of data; and advances in machine learning, in particular deep learning. Those types of AI algorithms are making speech recognition and natural-language understanding more accurate.
What have been the biggest challenges to making voice interfaces that work well with consumer tech?
With regard to speech recognition, there are component-level challenges and user-experience challenges. If conditions are relatively quiet, recognition is very accurate. If there’s background noise or multiple people are speaking, however, that’s a challenge we still need to deal with. You want to be able to track different voices when multiple people are speaking at the same time. With regard to helping devices understand natural language, context is the critical challenge. If a digital personal assistant is limited to just a few domains or functions—it’s dedicated to playing music, for example—it’s easy to understand the user’s intent. Add to that the responsibility of sifting through data about movies, videos and audiobooks, and all of a sudden the command “Play X” becomes ambiguous. It could refer to content in any of those categories.
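[To make that ambiguity concrete, here is a minimal sketch of a toy intent resolver. The catalogs, titles and function name are invented for illustration; this is not Alexa’s actual code.]

```python
# Toy illustration of the "Play X" ambiguity described above.
# The catalogs and function name are hypothetical, not Alexa's APIs.

CATALOGS = {
    "music":     {"Hamilton", "Thriller"},
    "video":     {"Hamilton", "Stranger Things"},
    "audiobook": {"Hamilton", "Becoming"},
}

def resolve_play_command(title: str) -> list[str]:
    """Return every content domain whose catalog contains the requested title."""
    return [domain for domain, items in CATALOGS.items() if title in items]

matches = resolve_play_command("Hamilton")
if len(matches) > 1:
    # Several domains match, so the assistant must either guess or ask a
    # clarifying question ("Did you mean the album, the film or the audiobook?").
    print(f"'Play Hamilton' is ambiguous across: {matches}")
elif matches:
    print(f"Unambiguous request: play it from {matches[0]}")
else:
    print("Title not found in any catalog.")
```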
Why is context so important when interacting with smart devices?
If you and I are chatting right now, I might carry over information from the last time we spoke. We don’t need to repeat everything that we discussed previously in order to have a seamless conversation. That’s natural for people but not the case when speaking with machines, where you currently have to use precise wording to be understood. You would expect that if a machine is smart enough, it would be able to carry over information from an earlier conversation. If I ask, “Alexa, how is the weather in Seattle?” and then I ask, “How about this weekend?”, I expect to hear about the weather in Seattle this weekend without explicitly saying that in the second question. If I ask, “Alexa, what is my schedule for today?” the system responds using information stored in its calendar. If I ask, “How about this weekend?” I expect calendar information, not weather information, for this weekend. There is no right answer to that second question without context—there could be any number of answers. That’s referred to as “session context,” and it allows a machine to answer the question correctly based on the current conversation.
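[The weather-versus-calendar example can be sketched in a few lines of code. The sketch below is a hypothetical illustration of session context, not Alexa’s dialogue manager: the assistant keeps the previous request’s domain and slots so that an elliptical follow-up such as “How about this weekend?” inherits them.]

```python
# A minimal, hypothetical sketch of session context: carry the previous
# request's domain and slots forward so a follow-up question can be resolved.

class SessionContext:
    def __init__(self):
        self.domain = None   # e.g. "weather" or "calendar"
        self.slots = {}      # e.g. {"location": "Seattle", "time": "today"}

    def interpret(self, utterance: str) -> dict:
        text = utterance.lower()
        if "weather" in text:
            self.domain = "weather"
            self.slots = {"location": "Seattle" if "seattle" in text else "unknown",
                          "time": "today"}
        elif "schedule" in text:
            self.domain = "calendar"
            self.slots = {"time": "today"}
        elif "this weekend" in text:
            # Elliptical follow-up: keep the previous domain and slots,
            # update only the time reference.
            self.slots = {**self.slots, "time": "this weekend"}
        return {"domain": self.domain, **self.slots}

ctx = SessionContext()
print(ctx.interpret("Alexa, how is the weather in Seattle?"))
# -> {'domain': 'weather', 'location': 'Seattle', 'time': 'today'}
print(ctx.interpret("How about this weekend?"))
# -> {'domain': 'weather', 'location': 'Seattle', 'time': 'this weekend'}
```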
How does a machine learn context?
I can only speak to the machine-learning part of Alexa. Machine learning relies on data collected from Alexa users. We don’t use all of that data—we annotate certain types in order to teach Alexa to recognize different acoustic cues, tones (both male and female) and accents. Our customers are diverse, and we want Alexa to be able to recognize different users. We can’t build a technology that will work only for one type of voice.
How can you improve Alexa’s ability to pick out speech and understand words even when there is significant background noise?
That’s an open problem, although we are making progress. Having worked on developing voice technology in the past, I can say there are a few different approaches. One is focusing on cleaning, or removing, background noise and then performing speech recognition on the data that’s left. When you do that, though, a side effect is that you may remove some of the data related to the speech itself. Another technique is collecting as much of the sound in a particular environment as possible and having the system map, or identify, different sounds—whether they are background noise or speech. The challenge is that there are so many different noises it’s difficult to identify where each of them is coming from, especially when the TV is on.
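[The first approach described above, removing background noise before recognizing the speech that remains, is commonly illustrated with spectral subtraction. The sketch below is a generic textbook version of that idea, not the method Amazon uses; the over_subtract parameter also shows the side effect mentioned: subtracting noise more aggressively erodes the speech as well.]

```python
# A simplified, generic sketch of "remove the noise, then recognize":
# per-frame spectral subtraction with NumPy. Not Amazon's actual technique.

import numpy as np

def spectral_subtraction(noisy, noise_sample, frame_len=512, over_subtract=1.0):
    """Denoise a 1-D float signal by subtracting an average noise spectrum per frame."""
    # Estimate the average magnitude spectrum from a noise-only recording.
    usable = len(noise_sample) // frame_len * frame_len
    noise_mag = np.abs(
        np.fft.rfft(noise_sample[:usable].reshape(-1, frame_len), axis=1)
    ).mean(axis=0)

    cleaned = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, frame_len):
        spectrum = np.fft.rfft(noisy[start:start + frame_len])
        mag, phase = np.abs(spectrum), np.angle(spectrum)
        # Subtract the noise estimate and clip at zero. A larger over_subtract
        # removes more noise but also removes part of the speech, the side
        # effect mentioned in the answer above.
        new_mag = np.maximum(mag - over_subtract * noise_mag, 0.0)
        cleaned[start:start + frame_len] = np.fft.irfft(new_mag * np.exp(1j * phase), n=frame_len)
    return cleaned

# Synthetic usage: a tone standing in for speech, plus white noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 16000)
speech = np.sin(2 * np.pi * 220 * t)
noise = 0.3 * rng.standard_normal(t.size)
denoised = spectral_subtraction(speech + noise, noise)
```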