Amazon’s recent announcement that it would reduce staff and budget for the Alexa department has led to the voice assistant being deemed “a colossal failure.” In its wake, there has been discussion that voice as an industry is stagnating (or even worse, on the decline).
I have to say, I disagree.
While it is true that voice has hit its use-case ceiling, that doesn’t equal stagnation. It simply means that the current state of the technology has a few limitations that are important to understand if we want it to evolve.
Simply put, today’s technologies do not perform in a way that meets the human standard. To do so requires three capabilities:
- Superior natural language understanding (NLU): There are lots of good companies out there that have conquered this aspect. Their technology can pick up on what you’re saying and knows the usual ways people might phrase what they want. For example, if you say, “I’d like a hamburger with onions,” it knows that you want the onions on the hamburger, not in a separate bag.
- Voice metadata extraction: Voice technology needs to be able to pick up whether a speaker is happy or frustrated, how far they are from the mic, and who they are and which accounts belong to them. It needs to recognize a voice well enough to know whether you or somebody else is talking.
- Overcoming crosstalk and untethered noise: The ability to understand speech even when other people are talking and when there are noises (traffic, music, babble) not independently accessible to noise cancellation algorithms.
There are companies that achieve the first two. These solutions are typically built to work in sound environments that assume there is a single speaker with background noise mostly canceled. However, in a typical public setting with multiple sources of noise, that is a questionable assumption.
Achieving the “holy grail” of voice technology
It is important to also take a moment and explain what I mean by noise that can and can’t be canceled. Noise to which you have independent access (tethered noise) can be canceled. For example, cars equipped with voice control have independent electronic access (via a streaming service) to the content being played on car speakers.
This access ensures that the acoustic version of that content as captured on the microphones can be canceled using well-established algorithms. However, the system does not have independent electronic access to content spoken by car passengers. This is what I call untethered noise, and it can’t be canceled.
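To make the distinction concrete, here is a minimal sketch of tethered-noise cancellation. It uses a normalized least-mean-squares (NLMS) adaptive filter, which is one well-established choice and not necessarily what any particular car system runs. The function name `nlms_cancel`, the three-tap `room_path`, the step size and the synthetic "music" and "talker" signals are all assumptions made for illustration.

```python
import math
import random


def nlms_cancel(mic, reference, taps=8, mu=0.1, eps=1e-8):
    """Subtract an adaptively filtered copy of `reference` from `mic`.

    Returns the residual: whatever in `mic` could not be predicted
    from the reference stream the system has electronic access to.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for mic_sample, ref_sample in zip(mic, reference):
        history = [ref_sample] + history[:-1]
        estimate = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - estimate
        norm = sum(x * x for x in history) + eps
        # NLMS update: step toward weights that better predict the mic signal
        weights = [w + mu * error * x / norm for w, x in zip(weights, history)]
        residual.append(error)
    return residual


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in signal) / len(signal)


random.seed(0)
n = 4000
room_path = [0.6, 0.3, 0.1]  # hypothetical speaker-to-microphone echo path

# Tethered noise: the system holds this stream electronically.
music = [random.uniform(-1.0, 1.0) for _ in range(n)]
# Untethered sound: a person talking, with no reference stream available.
talker = [0.3 * math.sin(2 * math.pi * 0.01 * i) for i in range(n)]

# What the microphone hears: music coloured by the room, plus the talker.
echo = [
    sum(room_path[k] * music[i - k] for k in range(len(room_path)) if i - k >= 0)
    for i in range(n)
]
mic = [e + t for e, t in zip(echo, talker)]

residual = nlms_cancel(mic, music)

# Compare powers over the second half, after the filter has adapted.
tail = slice(n // 2, n)
leftover_music = [r - t for r, t in zip(residual[tail], talker[tail])]
print("music power at the mic:     ", power(echo[tail]))
print("music power after canceling:", power(leftover_music))
print("residual power:             ", power(residual[tail]))
print("talker power:               ", power(talker[tail]))
```

In this toy setup the music largely disappears from the residual because the filter has the music stream to work from. The talker passes through essentially untouched, since there is no reference signal for it. That is helpful when the talker is the person giving the command and a problem when the talker is a passenger in the next seat.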
This is why the third capability — overcoming crosstalk and untethered noise — is the ceiling for current voice technology. Achieving this in tandem with the other two is the key to breaking through the ceiling.
Each on its own gives you important capabilities, but all three together — the holy grail of voice technology — give you functionality.
Talk of the town
With Alexa set to lose $10 billion this year, it’s natural that it will become a test case for what went wrong. Think about how people typically engage with their voice assistant:
“What time is it?”
“Set a timer for…”
“Remind me to…”
“Call mom—no CALL MOM.”
Voice assistants don’t meaningfully engage with you or provide much assistance that you couldn’t accomplish yourself in a few minutes. They save you some time, sure, but they don’t accomplish meaningful, or even slightly complicated, tasks.
Alexa was certainly a trailblazing pioneer in general voice assistance, but it had limitations when it came to specialized, futuristic commercial deployments. In these situations, it is critical for voice assistants or interfaces to have use-case-specific capabilities such as voice metadata extraction, human-like interaction with the user and crosstalk resistance in public places.
As Mark Pesce writes, “[Voice assistants] were never designed to serve user needs. The users of voice assistants aren’t its customers — they’re the product.”
There are a number of industries that can be transformed by high-quality interactions driven by voice. Take the restaurant and hospitality industries. We desire personalized experiences.
Yes, I do want to add fries to my order.
Yes, I do want a late check-in, thank you for reminding me that my flight gets in late on that day.
National fast-food chains like McDonald’s and Taco Bell are investing in conversational AI to streamline and personalize their drive-through ordering systems.
Once you have voice technology that meets the human standard, it can go into commercial and enterprise settings where voice technology is not just a luxury, but actually creates higher efficiencies and provides meaningful value.
Play it by ear
To enable intelligent control by voice in these scenarios, however, technology needs to overcome untethered noise and the challenges presented by crosstalk.
It needs not only to hear the voice of interest but also to extract metadata from that voice, such as certain biomarkers. If we can extract metadata, we can also start to open up voice technology’s ability to understand emotion, intent and mood.
Voice metadata will also allow for personalization. The kiosk will recognize who you are, pull up your rewards account and ask whether you want to put the charge on your card.
If you’re interacting with a restaurant kiosk to order food via voice, there will likely be another kiosk nearby with other people talking and ordering. Your kiosk needs to distinguish your voice from theirs so that it does not confuse your order with someone else’s.
This is what it means for voice technology to perform to the level of the human standard.
Hear me out
How do we ensure that voice breaks through this current ceiling?
I would argue that it is not a question of technological capabilities. We have the capabilities. Companies have developed incredible NLU. If you can box together the three most important capabilities for voice technology to meet the human standard, you’re 90% of the way there.
The final mile of voice technology demands a few things.
First, we need to demand that voice technology is tested in the real world. Too often, it’s tested in laboratory settings or with simulated noise. When you’re “in the wild,” you’re dealing with dynamic sound environments where different voices and sounds interrupt.
Voice technology that is not real-world tested will always fail when it is deployed in the real world. Furthermore, there should be standardized benchmarks that voice technology has to meet.
Second, voice technology needs to be deployed in specific environments where it can really be pushed to its limits, solve critical problems and create efficiencies. This will lead to wider adoption of voice technologies across the board.
We’re very nearly there. Alexa is in no way the signal that voice technology is on the decline. In fact, it was exactly what the industry needed to light a new path forward and fully realize all that voice technology has to offer.
Hamid Nawab, Ph.D. is cofounder and chief scientist at Yobe.