We are more comfortable with the idea of talking to machines than ever before. That is due mainly to the growing popularity of smart speakers and virtual assistants, such as Google Home, Amazon Echo and Apple’s Siri. Indeed, as they have proven capable of carrying out tasks such as playing music and setting alarms, we have been happy to ask them to do so.
But those are simple, carefully enunciated commands, and we have little trust in their ability to recognise normal words, in normal situations, spoken in a normal voice. Speech recognition software has, however, come on leaps and bounds.
Last week, Google launched an app for Android phones called Live Transcribe, which does exactly what it suggests: it automatically transcribes everyday conversations on to the screen of your smartphone in real time.
It is intended as a tool for the hard of hearing, but it is a perfect demonstration of how computers are becoming skilled at recognising even the most unusual phrases and converting them correctly into written text.
The rise of automatic transcription
Last year, there were signs of growing confidence in speech recognition technology from companies and consumers alike. Automatic transcription and delivery of voicemails started to become a standard part of mobile phone contracts in some countries, relieving customers of the drudgery of having to listen back to them. Along similar lines, some networks introduced a type of call screening, in which transcriptions of incoming calls were delivered as text. Amazon, Microsoft and Google all launched services that allowed developers to build speech recognition into their apps, and new uses for this technology – such as live subtitles for video conversations – are beginning to flourish.
There are three reasons for the recent improvements, says Nils Lenke, head of innovation management at Nuance, whose speech recognition technology has been used by Apple. “Firstly, there was the discovery of how to use artificial neural networks to learn from existing data,” he says. “This came hand in hand with advances on the hardware side, where powerful GPUs [graphical processing units] are being used for neural network training. And there’s a lot more data available, because of the number of cloud-based speech recognition services being used.”
In other words, our new-found willingness to speak to machines, such as smartphones, tablets and smart speakers, has contributed directly to their improvement. It is a snowballing effect: the more we use them, the better they get, the more we want to use them. “Ten years ago, you had to justify why you were working on speech recognition, but today it’s quite normal,” Lenke says.
The need for accurate speech transcription
A number of start-ups have capitalised on the trend, including Verbit, a company based in New York and Tel Aviv, which last month raised $23 million (Dh84.5m) in funding for its transcription services for legal and academic work. It is a priority for Silicon Valley, too: Facebook is reportedly joining the throng with a speech recognition service called “Aloha”, to be introduced into its messaging apps.
Virtual assistants aside, accurate speech transcription has many uses. One of the most significant is making audio and video material searchable by text phrases, which brings new accessibility to an enormous amount of knowledge. Microsoft has introduced such a feature for people using its OneDrive cloud service to store audio and video, while last year, a new app, Otter, was launched to store transcriptions of conversations and meetings with that precise aim: to make them easy to search. At Otter’s launch, its founder predicted various uses for the app, most notably in healthcare, where doctors would be relieved of the wearisome task of writing up patient visits.
The tools being developed
Many new transcription services focus on helping deaf users. A Dutch firm, Speak-See, last summer smashed a crowdfunding target for its app that transcribes multi-person conversations. It works in a similar way to Google’s Live Transcribe, but individual voices are recognised and highlighted in different colours on the screen.
A smart hearing aid called Livio boasts a translation facility which, in tandem with a smartphone, displays translations in real time from words being spoken, and then uses text-to-speech to transmit words back to the earpiece. Google has made a similar feature available for headphones compatible with Google Assistant. It is frequently misreported as an instant translator earpiece – a sci-fi dream in which words are spoken, with a translation played instantly into the ear – but there is little doubt that such a device will, one day, be possible.
Last month, a start-up called Timekettle launched just such a product, the WT2; it can only translate one phrase at a time currently, making it usable in only very formal settings, but it shows what capabilities are on the horizon.
As our hopes for speech recognition grow, so too do the challenges, not least the quality of recordings. “Previously, we’d be working with audio from someone sitting at their desk, dictating into a microphone,” Lenke says. “Now we have audio from smart speakers in the corner of a room, or a mobile phone being used in crowded and noisy places.”
Another knotty problem is punctuation, the gaps between words that contribute so much to their meaning. “But machine learning is making things easier,” Lenke says. “We no longer have to come up with the rules. We can show data to the systems and have them learn it themselves.”
The subtleties of language will continue to pose problems for computers. “Take Arabic,” says Lenke. “You need to have different data sets for Egypt, for Lebanon, for Syria, the Emirates and so on. It’s really multiple languages under the same name, and so you need to invest in the collection of all that data for speech recognition to improve.”
It is clear that our inventive use of language, including jokes and puns, will continue to confuse machines; context is everything, and the difference between Turkey (the country) and turkey (the bird) may send transcription algorithms into a tailspin.
But the message from the industry is clear: if we have patience, the words we speak will gain greater power.