VAX, or Voice Augmented eXperience, is the concept of augmenting or enhancing an existing experience (usually a GUI and/or touch-based app) with voice.
Human-computer interfaces have come a long way since the early days of punch cards.
The voice assistant journey can be broken down into two phases:
Phase 1 was all about getting consumers introduced to the idea of using voice to perform tasks.
Phase 2 is about voice becoming a pervasive interaction mode with more capabilities, used more frequently across more devices, apps and contexts.
If you are reading this in 2019, chances are you have already experienced voice recognition in some form or another, or have read articles touting its imminent arrival. Here are some of the advantages that make Voice one of the best interfaces out there.
You think of something and you can get it done almost immediately. It’s magical when you can just speak out what you want, without worrying about how to convey your intent using a series of individual clicks. “Play raakamma kaiyatattu” (a popular Tamil song) is so much faster than “find the search icon, click, type r.. a.. a.. k.. a.. m.. m.. a” (hopefully auto-complete will kick-in, else) “.. k.. a.. (are we there yet) i.. y.. a.. t.. (are we there yet).. t.. u..”.
While most of us (the folks reading this blog) might feel comfortable with the modern UI paradigm — menus, buttons, text boxes, etc — for many others (folks who are mobile first, not so tech savvy, elderly, etc), these are actually strange concepts. It’s actually quite intimidating. A button does not naturally mean “click on it”. A menu icon does not naturally mean “discover or navigate to other capabilities”.
Thought experiment — how many of our parents can use the apps that we build?
Voice, on the other hand, does not intimidate. It does not need any additional training. You don’t need to teach your parents how to use Voice. They are already pros, just as you are.
The language of the user is a key constraint that Voice handles better. Even if the user is familiar with the UI elements and is able to navigate the app, they tend to be intimidated by the language on the screen.
Here is another thought experiment — Change the language setting of your phone to a language you don’t know. Reboot your phone. Try and change the setting back to English. Do you think you will be able to do this easily? If that phone had a Voice interface which would have understood English (“Change language to English”), would you have used that or still tried to use touch to carry out this action?
We (well, most of us) are all born with the ability to communicate using our voice. Imagine doing something as simple as getting your bank statement in your banking app. If you are like me, I assume this would be your journey — “hmm.. which menu item should I choose?.. okay let me try that.. nice.. a date selector.. click.. click.. shit.. how do I change the year?.. click.. click.. click.. click.. one done.. submit.. damn.. forgot to set the end date.. another yummy date selector.. today’s date (duh.. shouldn’t you have already done that).. submit (finally).. oh.. Did I really spend another 2000Rs on Swiggy last week? Wonder if I can claim stake to be an investor?”.
Contrast this with being able to do the same via “Show my statement from Jan 1st”.
Unless you are trying to exercise your fingers, presumably the latter feels like “Hey, why was it not like this all the time?”
For most of us, Voice is associated with the notion of generic, intelligent assistants that users can speak to (or optionally type to). Brands or services that users want are placed behind these assistants. The assistants understand the intent of the user, explicitly or implicitly connect with a service provider, and provide an output. Typically the assistants prefer to keep both the input (what the user is asking — “Set an alarm for 6 am”) and the response (what the service provider generates — the actual setting of the alarm and the response text “Sure. Alarm set for 6 am”) inside their own surface. In some cases, the response could also be delivered by deep linking into the service provider’s app and opening it up.
So if there is a rise of the generic assistants with their own interface (voice-first or in some cases voice-only) and surface, does that mean the end of special purpose, brand-specific surfaces and other interfaces like touch and GUI?
I think not. Let me try and articulate my thoughts on why.
Purpose-built, custom, direct-to-brand experiences, delivered using mobile apps, web apps/sites, PWAs, etc, are also important and are in fact growing in usage.
Another thought experiment to match the data shown above — imagine your usage through the day on your phone. Do you spend most of the time with an assistant or directly with one or more apps?
Here are some of the key advantages that mobile apps (and their likes) provide, as compared to generic voice assistants:
App discovery was a challenge during the early days of apps, but the idea of app stores caught on quickly and standardized distribution. Discovering and adding capabilities (skills or actions) to the assistants is currently a challenge. There are 30K-plus voice apps, but most people are oblivious to their existence. Maybe over time, as brands start advertising their skills/actions and the assistants come up with an easy way to discover and add them, this will become less of a problem, but today it is a major challenge.
When using a mobile app, the context and capabilities of the app are typically well understood by the user. The visual structure (buttons, menus, etc) also gives enough clues as to what is possible. Contrast that with the experience of talking to a specific Alexa skill and then struggling to know what it’s capable of.
There are lots of use-cases where visuals beat voice outright. One such example is anything that needs a list. Imagine a user trying to book a flight or order a pizza. They need to see the available choices in order to actually complete the transaction. Imagine doing that with Voice:
You: “I want to order a pizza”
Assistant: “Sure. What type of a Pizza would you like?”
You: “What are my vegetarian choices?”
Assistant: “You can order a garden veggie, a farm fresh, a Mediterranean… “
You have probably zoned out and don’t remember the first one by the time the third one is being spoken.
This is where UI lists make perfect sense. Imagine the response to the request above was instead something like this -
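As a rough sketch (the field names and structure below are hypothetical, not any particular assistant’s API), a multimodal response could carry the choices as structured data alongside a short spoken prompt, letting the app render them as a tappable list:

```python
# Hypothetical multimodal response: a brief spoken prompt plus
# structured choices that the app's UI renders as a tappable list.
def build_pizza_response(choices):
    return {
        "speech": "Here are your vegetarian choices.",  # spoken aloud
        "display": {                                    # rendered on screen
            "type": "list",
            "items": [{"title": c, "action": f"order:{c}"} for c in choices],
        },
    }

response = build_pizza_response(["Garden Veggie", "Farm Fresh", "Mediterranean"])
```

The user now *sees* all three options at once instead of holding them in working memory, and can tap one to continue the transaction.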
When building apps, the pieces of the puzzle needed to make everything happen are all well understood. How does someone log in? How do you personalize? How do you store the details of the user? But when building voice-only experiences under assistants, many of these ingredients need to be rediscovered or rebuilt. For example: how do you log in to a voice-only assistant? How do you maintain context across sessions?
Last but not least is the notion of privacy, both for the customer of the app/brand and for the brand itself. When a customer is interacting with an app, they are placing their trust in the app. Both their input (their intent) and the response from the brand rest on a trusted relationship between the two. But if there is an assistant in the middle, one that processes both the input and the output and also knows who you are (because you are logged into the assistant), it triggers privacy concerns: for the user, because they are trusting their data to someone who understands a lot of details about them (across brands); and for the brand, because they are sharing details about their customer with someone else.
Voice is great and overcomes the traditional problems of apps. And apps are great and don’t have the newer problems of voice.
What if you could have them both in the same medium? What if we could augment the strength of Apps using the power of Voice instead of trying to replace apps with voice?
VAX or Voice Augmented eXperience is what you get when you add a multi-modal and multi-lingual voice interface to mobile apps, thereby allowing the user to interact with them via voice or touch.
Imagine being able to do things like the following —
The possibilities are quite vast and limited only by how a brand wants to use the idea of having access to a smart microphone (one which *understands* the intent, as opposed to just recognizing voice signals).
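As a minimal sketch of the idea (the intent names, slots, and handlers below are invented for illustration; in practice a speech/NLU SDK would supply the recognized intent), the app’s side of VAX largely reduces to routing a recognized intent to the same in-app action a tap would trigger:

```python
# Minimal VAX-style dispatcher: a recognized intent (name + slots)
# is routed to the same in-app action a button tap would trigger.
# All intent names and handlers here are illustrative, not a real SDK's API.

def show_statement(slots):
    return f"Showing statement from {slots.get('from_date', 'the start')}"

def order_item(slots):
    return f"Added {slots.get('item', 'nothing')} to cart"

HANDLERS = {
    "show_statement": show_statement,
    "order_item": order_item,
}

def dispatch(intent, slots):
    handler = HANDLERS.get(intent)
    if handler is None:
        # Unknown intent: fall back gracefully, touch UI still works.
        return "Sorry, I can't do that yet."
    return handler(slots)

# "Show my statement from Jan 1st" -> NLU yields an intent and slots:
print(dispatch("show_statement", {"from_date": "Jan 1st"}))
```

The key design point is that voice does not replace the app’s logic; it becomes another entry point into actions the app already exposes.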
This article was written to explain the rationale behind the idea of adding voice experiences inside apps: the What and the Why. As for the How, there are potentially multiple ways to do this. My company built an entire platform to simplify the process. The platform takes care of all the elements needed to add a smart microphone to your app and lets you focus on your core business-logic actions.