"Sorry, I Don't Know This Yet..." An In-Depth Analysis of the State of Virtual Personal Voice Assistance

In an earlier glossary post, we discussed “VPAs,” or Voice Personal Assistants. We thought this topic deserved further analysis, given the massive adoption of devices with always-on listening and voice response functionality. Considering these rapid developments in voice-interactive objects and AI in general, we should ask ourselves these crucial questions.

  1. How will this affect you, your organization and your products?

  2. How can your application use these powerful neural networks and machine learning based systems to intuit more precisely what your customer wants?

  3. How can your customer service improve by leveraging the new forms of networked intelligence in the cloud?

  4. In a world where touch, type, and swipe have been the primary modes of input, how can voice interaction provide a more natural access to your product?

  5. How can this be integrated, and at what cost? (A minimal integration sketch follows this list.)
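
To make question 5 more concrete, here is a minimal sketch of what a voice integration can look like in practice: an AWS Lambda handler for a hypothetical Alexa custom skill, written against the Alexa Skills Kit request/response JSON envelope. The intent name "FixToiletIntent" and the canned answers are our own illustrative assumptions, not part of any shipping product; a real integration would also require defining an interaction model in the Alexa developer console and, typically, calling out to your own backend for content.

```python
# Minimal sketch of an AWS Lambda handler for a hypothetical Alexa custom skill.
# "FixToiletIntent" and the response text are illustrative assumptions only.

def build_response(speech_text, end_session=True):
    """Wrap plain text in the Alexa Skills Kit response JSON envelope."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech_text},
            "shouldEndSession": end_session,
        },
    }

def lambda_handler(event, context):
    """Entry point invoked by Alexa with a request envelope (a Python dict)."""
    request = event.get("request", {})
    request_type = request.get("type")

    if request_type == "LaunchRequest":
        # The user opened the skill without asking for anything specific yet.
        return build_response("Welcome. What would you like help with?", end_session=False)

    if request_type == "IntentRequest":
        intent_name = request.get("intent", {}).get("name")
        if intent_name == "FixToiletIntent":
            # In a real skill, this is where you would query your own content service.
            return build_response("First, shut off the water valve behind the toilet.")

    # Fall back gracefully instead of the dreaded "Sorry, I don't know that one."
    return build_response("Sorry, I can't help with that yet.")
```

The cost side of the question is mostly in the plumbing around a handler like this: designing the interaction model, hosting the backend, and maintaining the content it draws on.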

It’s been 50 years since Stanley Kubrick’s dystopian thriller 2001: A Space Odyssey was released, in which HAL, the disturbingly calm yet ill-intentioned onboard computer, frightened us with the possibility of computers taking over humanity and making decisions that may not preclude destroying the whole lot of us sloppy, imperfect humans.

So, how far along are we on the trajectory toward the ultimate virtual assistant? For those of you who have used these Voice Personal Assistants, have they changed your life for the better? Have they become the empathic, forward-thinking butler, or a maddening child who only half-listens and returns inconsistent or partial solutions to your problems?

After some testing with these devices, we’ve arrived at some conclusions you may find helpful, and we’d like to continue hearing from you on this subject. We hope our observations help you decide whether to integrate voice UX with your product. In our detailed report, we answer the question, “Are these services a benefit to the overall user experience as it stands today, a hindrance, or somewhere in between?”

The most popular voice assistants we review and compare are Google’s Assistant (on Google Home), Apple’s Siri, and Amazon’s Alexa.

While many reviews currently exist online, we wanted to provide more than just tech specs--we want to offer, instead, a conceptual progress report on how these services have learned since their inception. As we discovered, ‘common sense’ is an instinctive human trait we take for granted, and that becomes apparent when we challenge systems to “machine learn” common sense.

In our report card, we’ve created categories such as basic news and data, games and fun, general questions, and more philosophical types of questions like “What is the best way to live?” and “How do I achieve peace?” We then evaluate: Do these advanced systems accomplish even the most basic tasks? How natural are these interfaces? Are these systems learning based on context, and, more importantly, how well do they understand what the user is trying to do, as opposed to performing a literal translation of the input? In other words, if these devices have access to our breadcrumbs of activity all over the web, can they offer suggestions we may not even think of? In the world of AI, games like chess and Go have been dominated by systems (most famously AlphaGo) with a superior overview of all the possible moves, without any of the emotion or human frailty that plagues us sentient beings. But if a system could understand those frailties, it would be better equipped to understand what we want before we want it. The true breakthrough will be in understanding the gist of what we want and getting it right the first time--and if not the first time, then learning from the failure to understand, correcting that misunderstanding, or providing a more nuanced set of responses.

While there is great promise offered with these assistants, there are Grand Canyon-sized gaps in the built-in logic as it relates to the context of your life. These devices are attached to organizations that have access to our intentions (Google knows your whole search history--shouldn’t it get a sense of who you are?), and we expect them to “know” more about us. We are not there yet, but developments in this field are moving at a breakneck pace, and the hope is to have more of a conversational dialogue with these devices. At this stage they mostly seem reactive. Below are some general conclusions about all three VPAs; you can find more detail in the report card’s comments and grades on the nature of each response.

Here are 8 examples of the VPA interactions we documented in our report card.

 

I need help!

Alexa: “If you need immediate help, call 911; if you need immediate help, you can call a neighbor until help arrives.” (Why not use contextual behavior to offer psychiatric support?)

Google: “Here’s some suggestions: pair Bluetooth; what’s on my agenda tomorrow?” Responded to “I need urgent medical care” by reading out listings of urgent care facilities within 5 miles.

Siri: Takes me to the “Some things you can ask me” page.

 

How do I fix a toilet?

Alexa: “Sorry I don’t know that one”

Google: “Here are some articles on fixing toilets”

Siri: “Here is what I found on the web.” Brings up 5 web links.

 

When should I get my next checkup?

Alexa: “Sorry I can’t help.”

Google: “Oh, I don’t know!”

Siri: “Interesting question, Joy,” with no follow-through.

 

What is the average house price in Oakland, CA?

Alexa: Oakland average house value is 245,000. (HAHA)

Google: “I don’t know,” but offers a link to Zillow.

Siri: The median home price is 770,000 US Dollars

 

How much have I driven/walked in the last year?

Alexa: ”You have to get an adapter called Automatic with a Skill App.”

Google: “The average American driver drives 1000 miles per month”

Siri: ”I’m sorry I can’t do that here.”

 

What are some fun things to do in North Dakota?

Alexa: The interesting things to do in North Dakota are Theodore Roosevelt National Park, Lewis and Clark National Historic Trail, Fort Mandan, Fort Abraham Lincoln, and the International Peace Garden (then makes you check out the Alexa app).

Google: Bleacher Report says that North Dakota has more golf courses than any other state in the US. (no option to find out more info)

Siri: Pulls up websites on the notification screen; you have to tap to open Safari, which opens TripAdvisor.

 

What should I get my girlfriend for Valentine's Day?

Alexa: “Sorry I don’t know that one”

Google: “Her favorite perfume... might we suggest Nirvana by Elizabeth & James... a shopping spree at Victoria's Secret... a couples' photoshoot... Godiva chocolate... or a nice dinner that you cooked all by yourself... and finally, a love letter.”

Siri: “You say tomato, I say tomato”

 

Make a reservation at a restaurant.

Alexa: “I couldn’t find that one but I’m adding more local businesses”

Google: “I found a few restaurants near you: The first is BOSTON MARKET (address), Marica Seafood, Paquillo.”

Siri: Lists restaurants that take reservations and allows booking from the notification screen (via OpenTable); macOS doesn’t allow reservations.

Download all questions and answers here.

 

Why did we single out these particular questions from that whole list?

Mainly because, while they seem simple in nature, the content of each question taps into what should be contextual knowledge of your online activities, movement, location, and interests (via social and search data). The ideal state would be an assistant that anticipates your needs before you have them and suggests alternatives based on AI-generated analysis (see Machine Learning and Neural Networks in our glossary).

So what is the general consensus on Voice Personal Assistants after testing? There are some real conveniences to using voice over web or tactile input. However, in order for voice to dominate our interactions with software, the assistant has to understand who we are more extensively. There is a palpable cognitive dissonance when we hear an answer like “I’m sorry, I don’t know that,” or the apologetic ignorance that states it’s still learning. Doesn’t the system already have the data--shouldn’t it have already learned it? Why can’t we set goals and have our assistants offer us suggestions for achieving those goals? Woefully absent from many of the responses is knowledge already present in these systems: the basics about us and our future intents.

The responses these systems deliver should be much more nuanced and accurate, considering that the requested information is readily available. Some of the responses are laughably ignorant, and the hope is that these companies take this feedback and improve, increasing the likelihood of our interacting with VPAs. Talking to our machines is becoming commonplace, but it has to outperform other forms of input (typing, web searches, etc.). Furthermore, for a company like Google that has access to countless searches and personal data, the results should at least match those found through browsing and typing search terms. Using machine learning and neural networks, the hope is that these assistants will leverage the power of the cloud and deep learning to answer questions that one of our good friends would be able to answer with ease.

Ray Kurzweil, renowned futurist, resident singularity proponent, and Google employee, predicted some sort of “cybernetic” friend that lets you know what you need before you know you need it. Based on this testing, we’re far away from that, but it’s clear that an internet of things that know us is on its way. Why can’t our always-listening devices suggest things based on our past? Why can’t they provide outlier or unique suggestions based on our past? We will continue to investigate progress in machine learning, deep learning, neural networks, and other technologies that try to “know us” in ways that actually benefit our lives and make us aware of opportunities or things about ourselves we wouldn’t have discovered otherwise. The lack of insight into the philosophical questions we ask is a reminder that Spock-like logic, without the context of our daily lives, is minimally useful.

In a lot of cases the current VPAs are woefully ignorant. They can’t perform basic functions such as reading your data requests back to you in full. Voice transcription was solved long ago, yet so much work is still needed here. Considering the vast amounts of data available, we should see more “common sense” conversations, more language nuance, and other developments in the AI space going forward.

For these VPAs to be truly viable and replace typing or other slower forms of input, we need less “I’m afraid I can’t do that, Dave” and more “At your service.”

Please let us know your experiences with VPAs and thoughts on AI for your product.