In the past couple of years speech-recognition software has quietly grown tendrils into every corner of our lives. It’s at the other end of customer-support hotlines and airline reservation systems. It’s built into Microsoft Windows. It’s an alternative text-input method for touch-screen phones such as the iPhone and the Android. But let’s face it: most people who use this software wish they didn’t have to.

That’s because speech recognition is usually plan B: a least terrible alternative to typing or actual human conversation. Corporations use it for their phone systems because it’s cheaper than hiring real people. Many people who dictate into their computers do it because they must, perhaps because of a disability. And speech recognition is cropping up on touch-screen phones because typing on an on-screen keyboard is slow and fussy.

So what would it take to make speech recognition more than a work-around? How close are we to the Star Trek ideal of conversational computers that never get it wrong?

Well, we’re getting there. It turns out that after a decade of buyouts, mergers and embezzlement scandals, there is only one major speech-recognition company left: Nuance Communications. It sells the only commercial dictation software for Windows, for Macintosh and for iPhone. Its technology drives the voice-command systems in cars from Audi, BMW, Ford and Mercedes and cell phones from Motorola, Nokia, Samsung, Verizon and T-Mobile. It powers voice-activated toys, GPS units and cash machines, and it answers the phone at AT&T, Bank of America, CVS and many others.

Every year Nuance releases another new version of its consumer dictation programs, such as Dragon NaturallySpeaking. Usually it doesn’t add many new features. Instead it devotes most of its resources to a single goal: improving accuracy.

In the beginning, you had to train these programs by reading a 45-minute script into your microphone so that the program could learn your voice. As the technology improved over the years, that training session fell to 20 minutes, to 10, to five—and now you don’t have to train the software at all. You just start dictating, and you get (by my testing) 99.9 percent accuracy. That’s still one word wrong every couple of pages, but it’s impressive.
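
Here is a quick back-of-the-envelope sketch of that arithmetic in Python. The words-per-page figure is an assumption for illustration; the result shifts with font and page layout.

```python
# Rough check of the "one word wrong every couple of pages" claim.
accuracy = 0.999               # 99.9 percent of words recognized correctly
words_per_page = 450           # assumed; varies with font and layout

errors_per_word = 1 - accuracy                    # one error per 1,000 words
pages_per_error = 1 / (errors_per_word * words_per_page)

print(f"About one misrecognized word every {pages_per_error:.1f} pages")
# -> About one misrecognized word every 2.2 pages
```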

Speech engineers use all kinds of tricks to boost accuracy. The earliest dictation programs required you to pause after each word; the software had no clue how to distinguish “their” from “there” and “they’re.” But in time, ever more powerful PC processors made continuous-speech analysis possible. Today you are encouraged to speak in longer phrases, so the software has more context to analyze for accuracy.
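
To give a feel for how extra context helps, here is a toy sketch, and only a toy, of choosing among the homophones “their,” “there” and “they’re” by counting how often each candidate appears next to its neighbors in a tiny, hand-made table. Real recognizers use vastly larger statistical language models, but the principle of letting the surrounding words vote is the same.

```python
# Toy homophone disambiguation: pick the candidate that fits its neighbors
# best, according to a tiny hand-made table of word-pair counts.
# (Illustrative only; not Nuance's actual method.)
from collections import Counter

bigram_counts = Counter({
    ("over", "there"): 9,
    ("know", "their"): 7,
    ("their", "names"): 8,
    ("say", "they're"): 6,
    ("they're", "late"): 5,
})

def context_score(prev_word, candidate, next_word):
    """How often the candidate has been seen next to these neighbors."""
    return (bigram_counts[(prev_word, candidate)]
            + bigram_counts[(candidate, next_word)])

candidates = ["their", "there", "they're"]
best = max(candidates, key=lambda w: context_score("know", w, "names"))
print(best)  # -> "their": "know their" and "their names" both score high
```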

Another trick: Last year Nuance offered a free dictation app for the iPhone, called Dragon Dictation. What you say is transmitted to the company’s servers, where it is analyzed, converted to text and zapped back to your screen within seconds.
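
The round trip looks roughly like the sketch below. The server address, headers and response format are invented placeholders for illustration; this is not Nuance’s actual service or API.

```python
# Hypothetical sketch of a cloud dictation round trip: upload recorded audio,
# get the recognized text back. Endpoint and response shape are made up.
import requests

def transcribe(wav_path: str) -> str:
    with open(wav_path, "rb") as f:
        audio_bytes = f.read()
    response = requests.post(
        "https://speech.example.com/v1/transcribe",  # placeholder URL
        data=audio_bytes,
        headers={"Content-Type": "audio/wav"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["text"]                   # assumed response shape

# print(transcribe("memo.wav"))
```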

What nobody knew, though, was that the company stored all those millions of speech samples, in effect creating an immense storehouse of different voices, ages, inflections and accents against which to test different recognition algorithms.

So, yes, the technology is improving. But readers often ask me: “If dictation software is so good, can I use it to transcribe phone calls and interviews?”

The answer is still no. The software isn’t much good unless you are speaking into a microphone, without background noise, preferably without an accent. You still have to speak all punctuation (“comma”), like this (“period”). And goodness knows, we humans have enough trouble understanding each other; it’s a bit much to ask for a computer to get it all right. No wonder today’s dictation apps still make mistakes such as “mode import” for “modem port,” “move eclipse” for “movie clips,” and “oak wrap” for—well, you get it.
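
The spoken-punctuation convention amounts to treating certain words as commands rather than text. Here is a toy sketch of the idea; the token list is illustrative, not complete, and real dictation software handles this far more carefully.

```python
# Toy post-processing pass: turn spoken punctuation words into symbols.
SPOKEN_PUNCTUATION = {
    "question mark": "?",
    "new paragraph": "\n\n",
    "period": ".",
    "comma": ",",
}

def apply_spoken_punctuation(transcript: str) -> str:
    # Replace longer phrases first so "question mark" is not split apart.
    for phrase in sorted(SPOKEN_PUNCTUATION, key=len, reverse=True):
        transcript = transcript.replace(" " + phrase, SPOKEN_PUNCTUATION[phrase])
    return transcript

print(apply_spoken_punctuation("it works comma mostly period"))
# -> "it works, mostly."
```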

So, no, the keyboard isn’t going away in our lifetime. Conversational-style Star Trek computing is still decades away. Sure, 99.9 percent accuracy is darned good—but until it reaches 100, speech-recognition technology is still plan B.