When it was revealed at the Seattle World’s Fair in 1962, the IBM Shoebox was the most advanced speech recognition machine, understanding 16 words spoken in English. Throughout the 1970s and ’80s, the development of speech recognition accelerated. As computing power grew, so too did the number of words recognized by these systems. Today, speech recognition software is available for a broad range of languages and can recognize a virtually limitless number of words.
From the beginning, IBM took a statistical approach to speech recognition technology, grouping sound into thousands of different units based on their characteristic combinations of frequencies.
Hidden Markov models, statistical language modeling, and the Viterbi and stack decoders, all now ubiquitous, were pioneered at IBM Research in the 1970s by Fred Jelinek and his team.
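To make the decoding idea concrete, here is a minimal sketch of the Viterbi algorithm on a toy two-state hidden Markov model. The states, observation symbols, and probabilities below are invented purely for illustration; a real recognizer would decode over thousands of acoustic units, not two.

```python
# Toy Viterbi decoder for a two-state HMM. All states, symbols, and
# probabilities here are illustrative assumptions, not real acoustic models.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            back[t][s] = prev
    # Trace the best path backwards from the most probable final state
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ("vowel", "consonant")
start_p = {"vowel": 0.5, "consonant": 0.5}
trans_p = {"vowel": {"vowel": 0.3, "consonant": 0.7},
           "consonant": {"vowel": 0.6, "consonant": 0.4}}
emit_p = {"vowel": {"low": 0.8, "high": 0.2},
          "consonant": {"low": 0.3, "high": 0.7}}

print(viterbi(("low", "high", "low"), states, start_p, trans_p, emit_p))
# → ['vowel', 'consonant', 'vowel']
```

The dynamic-programming trick is the same one IBM's decoders exploited at scale: the best path to any state at time t only depends on the best paths at time t−1, so the search stays tractable.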
The 1980s saw the development of real-time speech recognition systems built on these statistical methods. The IBM speech team demonstrated the first real-time large-vocabulary dictation system in 1984. At the time, it required an IBM minicomputer and three array processors that filled an entire room. Within only a couple of years, the team had ported the technology to special-purpose hardware running on an IBM PC AT.
In 1992, IBM released its first dictation system, the IBM Speech Server Series. The next year brought the IBM Personal Dictation System, the first dictation system for the personal computer. It was later renamed IBM VoiceType Dictation, and was capable of recognizing 32,000 words at a rate of approximately 70 to 100 words per minute, with 97 percent accuracy. Both systems were used mostly in the medical and legal fields, and in business and government.
In 1996, IBM released VoiceType Simply Speaking, a dictation product for consumer PCs.
In 1997, IBM introduced IBM ViaVoice, the first continuous dictation product offered in multiple languages. It was no surprise that the technology worked for languages such as German, Spanish, French, and Italian, but the team went on to demonstrate the power of the statistical methodologies by also building highly successful dictation systems for Mandarin and Japanese, in conjunction with colleagues from the China Research Lab and the Tokyo Research Lab. The Mandarin system was impressive enough to be demonstrated to the President of China, Jiang Zemin, at its launch.
Today, speech recognition technology appears in a broad range of applications that go well beyond the desktop. Speech analytics systems, automated speech self-service, mobile devices, automobile navigation systems, in-car infotainment with climate control and media players, hands-free phones, personal navigation devices and other smart devices all show how speech recognition has penetrated our lives, much of it originating from IBM’s early vision in this area.
A new way of listening
“Earlier attempts to make machines recognize spoken words have run into trouble because they tried to copy the human ear, which analyzes the complicated mixture of sound frequencies in human speech. IBM Engineer William C. Dersch, inventor of Shoebox, thinks that this is like designing an airplane by copying a bird’s feathers. His machine does not depend on sound frequencies; it recognizes words by listening for their ‘asymmetry,’ an esoteric quality of speech that human ears cannot distinguish but that Shoebox finds as clear as the beat of a bass drum.”
“Science: Shoebox Is Listening,” TIME Magazine, November 24, 1961
“We thought it was wrong to ask a machine to emulate people. After all, if a machine has to move, it does it with wheels - not by walking. If a machine has to fly, it does so as an airplane does - not by flapping its wings. Rather than exhaustively studying how people listen to and understand speech, we wanted to find the natural way for the machine to do it.”
“Once the computer distinguishes the sounds, it has to figure out how they are combined into meaningful words. Again statistics help in the IBM approach, since the computer knows the relative frequency of all groups of three words in the language being recognized. For example, it knows that in English ‘going to go’ is more frequent than ‘going, too, go’ and far more frequent than ‘going two go.’”
“Talking to Machines,” Think Research, IBM.com
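The trigram idea in the quote above can be sketched in a few lines: count every three-word group in a corpus, then score candidate transcriptions by how familiar their trigrams are. The tiny corpus below is invented for illustration; a production language model would be trained on millions of words and use smoothed probabilities rather than raw counts.

```python
# Minimal trigram scorer illustrating the quote's "going to go" example.
# The corpus is a made-up toy; real systems use vastly larger text.
from collections import Counter

corpus = (
    "i am going to go home . "
    "she is going to go there . "
    "he gave two dollars ."
).split()

# Count every consecutive three-word group in the corpus
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def trigram_count(phrase):
    """Sum of corpus counts for each trigram in the phrase."""
    words = phrase.split()
    return sum(trigrams[t] for t in zip(words, words[1:], words[2:]))

print(trigram_count("going to go"))   # → 2 (seen twice in the corpus)
print(trigram_count("going two go"))  # → 0 (never seen)
```

Given several acoustically plausible word sequences, the decoder prefers the one whose trigrams the model has seen most often, which is exactly why “going to go” beats “going two go.”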
“From my standpoint of having responsibility for generating over 400,000 radiology reports each year, I can only call the IBM MedSpeak/Radiology system a spectacular breakthrough.”
“Talking to Machines,” Think Research, IBM.com