From HAL in 2001: A Space Odyssey to C-3PO in Star Wars, humans have long imagined being able to talk to machines. Scientists have pursued voice recognition for almost as long as they have been building computers. Now, after almost half a century, millions of people routinely interact by voice with computers in cars, smartphones and customer service call centers.
The effectiveness of speech recognition today comes out of decades of research by hundreds of scientists and engineers working on statistics, linguistics, semantics, predictive algorithms and audio processing. As far back as the 1950s, IBMers such as Nathaniel Rochester, designer of the IBM 701, were exploring how machines might recognize human speech.
In 1962, William C. Dersch unveiled the Shoebox—a machine that could do simple math calculations via voice commands. Dersch, an engineer based at IBM’s laboratory in San Jose, California, demonstrated the Shoebox on television and at the 1962 World’s Fair in Seattle, Washington. The device recognized ten digits and six control words—including “plus,” “minus” and “total”—spoken to it through a microphone.
By 1971, IBM had developed its next experimental application of speech recognition. The Automatic Call Identification system enabled engineers anywhere in the US to talk to and receive “spoken” answers from a computer in Raleigh, North Carolina. It was IBM’s first speech recognition system to operate over telephone lines and respond to a range of different voices and accents.
IBM then commissioned a task force to investigate the long-term potential of speech recognition. It came back with a strongly positive recommendation for a multidisciplinary approach that would leverage IBM’s computing power to achieve breakthroughs.
Fred Jelinek, already a distinguished professor of information theory at Cornell, was brought in to lead the effort at the Thomas J. Watson Research Center in the 1970s and 1980s.
While others favored approaches based on human-derived expert knowledge, Jelinek believed that a data-driven approach based on statistical modeling was the way to push machine recognition of speech forward. As Jelinek told THINK magazine in 1987, “We thought it was wrong to ask a machine to emulate people. After all, if a machine has to move, it does it with wheels—not by walking. If a machine has to fly, it does so as an airplane does—not by flapping its wings. Rather than exhaustively studying how people listen to and understand speech, we wanted to find the natural way for the machine to do it.”
Jelinek and his team established the basic validity of the approach through a set of groundbreaking experiments in the 1970s, but that was not enough. The community criticized the techniques as completely impractical for actual implementation. Jelinek took this as a challenge and embarked upon an ambitious plan that resulted in a voice-activated typewriter in the 1980s. The experimental transcription system, called Tangora, used an IBM PC AT to recognize spoken words and type them on paper. Each speaker had to individually train the typewriter to recognize his or her voice, and pause briefly between words. By the mid-1980s, Tangora boasted a 20,000-word vocabulary, demonstrating the validity of the statistical approach.
However, a long road remained to transform this speech recognition innovation into commercially viable products. The journey would require leaps in processing power and reductions in the cost of computing.
The groundbreaking work by Jelinek was carried forward by David Nahamoo, who succeeded Jelinek in leading the effort. Nahamoo and many other IBMers paved the way for products such as the first packaged speech recognition product, the IBM Speech Server Series (1992), and the first large-vocabulary continuous speech recognition product, IBM MedSpeak (1996), which would become more widely available as IBM ViaVoice.
By the late 1990s, IBM had decided to focus on telephony and embedded speech offerings. In 2003, IBM licensed the exclusive marketing of ViaVoice to Nuance Communications, maker of Dragon NaturallySpeaking, and exited the consumer market for speech recognition.
Finally, the pioneering efforts of the last decades to help computers understand human language are reflected in the natural language processing capabilities of the Watson machine that competed on Jeopardy! against human champions in 2011. Watson “read” the written clues rather than “heard” them spoken, but drew on many of the same advancements in statistics and linguistics to make sense of the questions. Watson also spoke its answers, using speech synthesis technology developed by the IBM speech team that heavily leveraged statistical methodologies.
In 1993, Fred Jelinek became a computer engineering professor at Johns Hopkins University, where he taught in the school’s Center for Language and Speech Processing until his death in September 2010 at age 77. “He was not a pioneer of speech recognition,” noted Steve Young, chair of the IEEE Speech and Language Processing Technical Committee. “He was the pioneer of speech recognition.”