This article takes you through a project by Connected Labs — Connected’s dedicated R&D function — that explores the applications of computer vision in developing more intuitive, frictionless conversational interfaces.
A smart speaker is just a vessel: a conduit for us to engage with the disembodied voice of our always-on, at-your-service virtual personal assistants. The on-call nature of these wake-word-invoked assistants has unlocked a paradigm shift in ambient computing. It’s only been 4 years since the launch of the Amazon Echo, the first smart speaker, and in those years we’ve witnessed one of the fastest adoption curves of any consumer technology. Today, over 50% of American households have at least one smart speaker, and oftentimes several. You’ll often see a household start with one speaker in the kitchen. Soon after, they find themselves shouting toward the kitchen from their bedroom at night: “OK Google, turn out the lights.” This leads to the purchase of additional speakers, and soon the entire living space is within range of various smart speakers and displays. In our homes, these devices let us engage with our own personalized, omnipresent assistant: a personal helper that knows your calendars, your interests, your network, your habits, and your home.
If we look back at how computer interfaces have evolved, we can see that each breakthrough has been a shift toward a more intuitive and accessible computing experience: from text-based operating systems, to object-oriented, mouse-controlled ‘windows’, to the multi-touch input of the modern smartphone, to the advent of conversational interfaces and smart speakers. Each step has moved us toward more frictionless and intuitive human-computer interaction.
Conversational interfaces are intuitive to us because the art of conversation is something we’re taught from the very start: the sound of a parent’s voice, the way they engage with others, the expressions on their faces when they speak with us, and the way they gesture with their hands. So much of what shapes us is learned and lived through natural language and the accumulation of the conversations we have while growing up with our family, friends, and peer groups. The ability to have this kind of instant verbal exchange of information has been a critical advantage to us as a species, and it has led to our brains being wired to intuitively adopt language and conversation skills at an early age. A well-designed voice interface should therefore be very simple for anyone to interact with, right? An interface we know by nature, one you speak with as you would any other conversational partner.
Unfortunately for computers, spoken language is only half of how humans communicate information. An entirely separate set of signals and cues is transmitted during a conversation: subtle physical micro-gestures, eye contact, cultural norms, tonality, and body language, among many others. As we strive to build conversational interfaces that are truly intuitive and empathetic to our needs, one thing is apparent: computers will have to not only understand our words and their meanings, but also interpret the intent behind the many non-verbal cues we make. Without that, they’ll continue to lack the context and understanding required to have intuitive, cooperative, and successful conversational exchanges with their humans.
Let’s think through an example where a computer’s inability to interpret non-verbal cues causes conversational friction and confusion. If you’ve ever interacted with a smart speaker, you’ll know that ‘invoking’ the assistant requires you to use its name in a “wake word” phrase. This often feels forced and gets tedious when you’re making frequent requests. Now think about how we hold the attention of our peers: we don’t use their “wake words” (or, ahem, names) at every single turn in a dialog; we use our bodies and our eyes to signal our desire to communicate. When we want to have a conversation with someone, we don’t always just start speaking. We sometimes capture their attention with a slight gesture, or even just a gaze that loiters in their direction until eye contact is made. Then, voila. No need to say “Andrew” to get their attention. The requirement to preface every single exchange with “OK Google” or “Hey Siri” introduces friction that prevents us from realizing the true value of ambient computing interfaces.
When developing digital products, it’s always important to know which features will provide the most value and impact for your users, so that you have a roadmap of where to invest your time and what to build first. When it comes to conversational interfaces, our research labs team felt that detecting one particular non-verbal cue, the intent to instigate conversation, would make the most impact and provide the most value. Being able to intuitively understand a user’s intent to engage would help reduce wake-word friction and improve the fluidity of voice-based conversational interactions. This is how we framed our hypothesis in this research project, and over a 4-week period, with 1 product designer and 2 engineers, we developed 4 functional prototypes that allowed us to validate these ideas as technically feasible, performant, data-secure, and desirable.
Prototype 1: A bot that can see
So how do you go about building a voice assistant that’s not only always-listening but also always-looking? And what are the key product risks and assumptions that we need to identify in order to validate our concept? We asked ourselves these questions in our first week to help us identify some of our key objectives and constraints, set goals, and define experiments.
Whenever you’re designing and developing an ambient interface, one thing that is non-negotiable to get right is privacy. Our homes are intimate spaces where we need to feel safe and private. It’s taken several years for the consumer market to get comfortable with always-listening smart speakers, and they only became possible once privacy concerns were addressable. The technology that addressed those concerns is on-device machine learning, and it’s the same technology that will drive the always-looking interfaces of tomorrow. Because wake-word detection runs on-device, Amazon’s and Google’s smart speakers don’t need to record continuously or send anything to the cloud; they wait silently until the user displays their intent to engage by saying the device’s wake word, and only then use the microphone’s audio to support a request.
What we wanted to explore in our prototyping for this project was whether determining a user’s intent-to-engage is now possible with always-looking cameras and on-device machine learning models. Achieving this, we hypothesized, would help eliminate the tedium of constantly using wake words and unlock a more fluid and natural voice-first conversational interface.
In the first week of our exploration, we built a proof-of-concept Android application that constantly analyzes the frames of the device’s forward-facing camera until it detects a user’s intent to engage by satisfying a condition we define; this then triggers the launch of Google Assistant. What we were trying to detect is what can be described as a ‘loitering gaze’: a non-verbal cue that denotes the intent to engage in conversation. In this first prototype, the non-verbal cue we detected was a gaze where the eyes are held open longer than usual. While this is not a very natural non-verbal cue, it served as a proof of concept and gave us a starting point.
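At its core, a condition like this is a simple per-frame state machine: count consecutive frames in which both eyes are confidently open, and fire once the streak is long enough. Here is a minimal Python sketch of that logic, assuming an on-device face detector (such as ML Kit’s face detection, which reports per-eye open probabilities) feeds it each frame; the threshold and frame-count values are illustrative, not the prototype’s actual numbers.

```python
from dataclasses import dataclass


@dataclass
class FrameObservation:
    # Probability (0..1) that each eye is open, as reported by an
    # on-device face detector (e.g. ML Kit's eyeOpenProbability).
    left_eye_open: float
    right_eye_open: float


class EyesHeldOpenDetector:
    """Fires once both eyes stay open for `hold_frames` consecutive frames.

    Illustrative values: 0.8 open probability, ~1.5 s at 30 fps.
    """

    def __init__(self, open_threshold=0.8, hold_frames=45):
        self.open_threshold = open_threshold
        self.hold_frames = hold_frames
        self._streak = 0

    def update(self, obs: FrameObservation) -> bool:
        if (obs.left_eye_open >= self.open_threshold
                and obs.right_eye_open >= self.open_threshold):
            self._streak += 1
        else:
            self._streak = 0  # any blink or lost face resets the count
        return self._streak >= self.hold_frames
```

When the detector returns `True`, the app would hand off to the Google Assistant, exactly as a wake word would.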
Prototype 2: When our eyes meet
In our second week, we wanted to improve the user experience of our intent-to-engage condition and make it more intuitive. This time, instead of requiring users to hold their eyes open, we wanted to detect a direct gaze and reject off-angle eye positions.
In this video demonstration, you can see the interaction in context, used as a desktop-based, home office voice assistant. The user is able to work on their computer and only trigger their assistant when looking at the device.
At closer distances, this approach of detecting a loitering directional gaze works well. However, when users were at a distance greater than 2 meters, the camera’s resolution was not sufficient to accurately resolve gaze direction. We would need a new non-verbal cue.
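Conceptually, the direct-gaze condition is the same streak-counting state machine as before, applied to head orientation rather than eye openness: trigger only when the face points at the camera (small yaw and pitch angles, as face detectors like ML Kit report) and that direct gaze dwells for a while. A hedged Python sketch, with illustrative angle and dwell values:

```python
class DirectGazeDetector:
    """Fires when the head faces the camera (|yaw| and |pitch| small)
    for `dwell_frames` consecutive frames.

    Illustrative values: 10 degrees off-axis tolerance, ~1 s at 30 fps.
    """

    def __init__(self, max_angle_deg=10.0, dwell_frames=30):
        self.max_angle_deg = max_angle_deg
        self.dwell_frames = dwell_frames
        self._streak = 0

    def update(self, yaw_deg: float, pitch_deg: float) -> bool:
        if (abs(yaw_deg) <= self.max_angle_deg
                and abs(pitch_deg) <= self.max_angle_deg):
            self._streak += 1
        else:
            self._streak = 0  # off-angle gaze resets the dwell timer
        return self._streak >= self.dwell_frames
```

The dwell requirement is what makes the gaze “loitering”: a passing glance at the device never accumulates enough consecutive frames to trigger.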
Prototype 3: Wave to me
Realizing that we needed a new solution for farther distances, in the 6–15 ft range, we began exploring gesture and pose detection. We used an open-source vision model called PoseNet that estimates the pose of a person in an image or video by detecting the positions of key body parts. For example, the model can estimate the position of a person’s elbow and/or knee in an image. The pose estimation model does not identify who is in an image; it only locates key body parts. It’s also a machine learning model that we can run locally on an Android phone, meaning the image data being analyzed, and all biometric and positional data, stays on-device. If you’re curious, you can try a demo of PoseNet using your laptop’s front-facing camera.
With PoseNet, we’re able to monitor the position of specific body parts while keeping all of this tracking on-device, without recording any images or identifying any users. We used this ability to detect the wave of a hand by monitoring the position of the user’s arm. A hand wave is a fairly universal gesture, and it was relatively easy to create the conditional requirements to detect it. It works well in the 6–15 ft range, as you can see in the video below.
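PoseNet returns named keypoints (wrist, elbow, and so on) with image coordinates and a confidence score, so a wave can be approximated as: the wrist held above the elbow while the wrist’s horizontal position changes direction several times. The following Python sketch shows the idea; the keypoint format mirrors PoseNet’s output, but the thresholds and reset rules are our own illustrative guesses, not the prototype’s exact conditions.

```python
class WaveDetector:
    """Detects a hand wave from PoseNet-style keypoints:
    wrist above the elbow while the wrist's x position oscillates.

    Keypoints are dicts with 'x', 'y', 'score' (image coords,
    y grows downward). Thresholds are illustrative.
    """

    def __init__(self, min_direction_changes=3, min_score=0.5, min_dx=2.0):
        self.min_direction_changes = min_direction_changes
        self.min_score = min_score
        self.min_dx = min_dx  # ignore jitter smaller than this (pixels)
        self._reset()

    def _reset(self):
        self._last_x = None
        self._last_dir = 0
        self._changes = 0

    def update(self, wrist: dict, elbow: dict) -> bool:
        # Low-confidence keypoints or a lowered hand cancel the gesture.
        if wrist['score'] < self.min_score or elbow['score'] < self.min_score:
            self._reset()
            return False
        if wrist['y'] > elbow['y']:  # wrist below elbow: not a wave posture
            self._reset()
            return False
        if self._last_x is not None:
            dx = wrist['x'] - self._last_x
            if abs(dx) >= self.min_dx:
                direction = 1 if dx > 0 else -1
                if self._last_dir and direction != self._last_dir:
                    self._changes += 1  # wrist reversed: one wave swing
                self._last_dir = direction
        self._last_x = wrist['x']
        return self._changes >= self.min_direction_changes
```

Requiring several direction reversals, rather than a single arm movement, is what keeps ordinary reaching or stretching from triggering the assistant.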
We believe that, with more time, it would be possible to build more complex conditions utilizing more subtle body and head position detections that denote intent to engage, and that a secondary local natural language understanding (NLU) model could help filter out the additional false positives this more intuitive approach would likely introduce.
Prototype 4: The Virtual Gaze
Excited by our last experiment, we wanted to see where else we could apply these insights, and we decided to bring them into the virtual realm. Virtual reality is also a great use case for conversational interfaces, because its hands-free nature makes input with something like a keyboard rather difficult. Applying gaze detection to virtual reality experiences seemed like a great fit, given the amount of real-time positional data that VR headsets produce and the detailed environmental context and metadata that VR simulations make available.
To build this prototype we used a web-based game engine called Sumerian. Amazon Sumerian is a service that lets you create and run 3D, augmented reality (AR), and virtual reality (VR) applications. You can design scenes directly within your browser and, because Sumerian is a web-based application, you can quickly connect your scenes to existing AWS services.
One of the services you can integrate into your Sumerian application is Lex. Amazon Lex is a service for building conversational interfaces into any application using voice and text. It provides a conversational manager that uses speech-to-text (STT), and natural language understanding (NLU) to recognize the intent of the user’s conversational inputs and helps to find the appropriate responses, allowing product developers to build 3D applications with engaging and lifelike conversational experiences.
The problem with the default chatbot integration is that initiating a conversation requires a button-press event, with the default set to wait for a spacebar press. We felt this was a great opportunity to apply what we’d learned about detecting non-verbal intent-to-engage cues, allowing us to initiate our chatbot without requiring the user to use a keyboard.
To accomplish this, we used Sumerian’s In View transition action (primarily used to trigger narrative events in games) and reappropriated it to help us detect the ‘loitering gaze’, the non-verbal cue that would prompt our avatar shopkeeper to engage with our user. In View is a state machine action that can trigger an event when certain conditions are met: it lets you monitor the objects or entities in a virtual camera’s frustum. We used it to observe the user’s and the chatbot’s virtual camera perspectives in order to detect when the two were making eye contact, determine whether that contact was within close range, and check whether it was held for more than 3 seconds. If these 3 conditions are satisfied, the shopkeeper is triggered to engage. This allows our chatbot to interact with users in a more intuitive way, without the friction introduced by requiring keyboard input.
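Outside of Sumerian’s editor, the same three-condition check reduces to a bit of vector math plus a timer: both cameras’ forward vectors must point at each other, the distance between them must be within range, and that state must persist for 3 seconds. An engine-agnostic Python sketch of the logic, with illustrative range and angle thresholds:

```python
import math


def _norm(v):
    m = math.sqrt(sum(c * c for c in v))
    return tuple(c / m for c in v)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mutual_gaze(user_pos, user_fwd, npc_pos, npc_fwd,
                max_range=3.0, min_cos=0.95):
    """True when the user and NPC cameras face each other within range.

    Positions and forward vectors are 3-tuples; min_cos=0.95 allows
    roughly 18 degrees of off-axis tolerance (illustrative values).
    """
    to_npc = tuple(b - a for a, b in zip(user_pos, npc_pos))
    dist = math.sqrt(sum(c * c for c in to_npc))
    if dist == 0 or dist > max_range:
        return False
    to_npc_n = tuple(c / dist for c in to_npc)
    user_looking = _dot(_norm(user_fwd), to_npc_n) >= min_cos
    npc_looking = _dot(_norm(npc_fwd), tuple(-c for c in to_npc_n)) >= min_cos
    return user_looking and npc_looking


class HoldTimer:
    """Tracks how long a condition has been continuously true."""

    def __init__(self, hold_seconds=3.0):
        self.hold_seconds = hold_seconds
        self._since = None

    def update(self, now, condition):
        if not condition:
            self._since = None  # eye contact broken: restart the clock
            return False
        if self._since is None:
            self._since = now
        return now - self._since >= self.hold_seconds
```

Each frame, the game loop would call `timer.update(t, mutual_gaze(...))` and start the shopkeeper’s dialog when it first returns true.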
Within our virtual simulation, we have the additional semantic context of the scene’s various objects, as well as the ability to monitor both the user’s and the chatbot’s viewpoints. This allows us to set up an additional condition that gives our shopkeeper a contextual understanding of which items in the shop the customer might be interested in, based on their gaze. In the following video, the user moves closer to inspect an orchid, triggering our shopkeeper to engage with them about this particular plant.
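One way to approximate that object-interest condition is to score each scene object by its distance from the user and its angular offset from the user’s gaze direction, and pick the best candidate within both thresholds. A small illustrative Python sketch; the object names, positions, and thresholds here are hypothetical, not taken from our Sumerian scene.

```python
import math


def item_of_interest(user_pos, user_fwd, items,
                     max_dist=1.5, max_angle_deg=20.0):
    """Return the name of the item the user appears to be inspecting, or None.

    `items` maps name -> (x, y, z) position. An item qualifies if it is
    close to the user and near the centre of their gaze; the item with
    the smallest angular offset wins. Thresholds are illustrative.
    """
    fm = math.sqrt(sum(c * c for c in user_fwd))
    fwd = tuple(c / fm for c in user_fwd)
    best_name, best_angle = None, max_angle_deg
    for name, pos in items.items():
        to_item = tuple(p - u for u, p in zip(user_pos, pos))
        dist = math.sqrt(sum(c * c for c in to_item))
        if dist == 0 or dist > max_dist:
            continue  # too far away to count as "inspecting"
        cos_a = sum(f * t for f, t in zip(fwd, to_item)) / dist
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))
        if angle <= best_angle:
            best_name, best_angle = name, angle
    return best_name
```

When this returns an item name while the mutual-gaze condition holds, the shopkeeper can open the conversation with context about that specific product.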
It’s clear that when these new sources of vision and positional data are combined with always-on cameras and on-device predictive models, they can provide powerful conversational context and more subtle intent understanding. The data that computer vision products produce is also very sensitive: it can reveal our emotional state, predict our personal motivations and intentions, and even be used for biometric identification. It would probably be a digital marketer’s dream. Can you imagine the sentiment-analysis sales chatbots? Constantly analyzing our body language and behaviour, optimized for persuasive manipulation and maximizing profits.
We’re talking about connected, always-on devices with cameras, which consumers are being asked to bring into their homes, and in some cases even to wear on their faces in the form of camera-laden smart glasses. While these experiments demonstrated the value of this technology to our team, the critical nature of the privacy considerations is also clear. These considerations will have to be made by product developers when developing these interfaces, and by consumers when adopting them.
With the introduction of a range of camera-laden smart displays from the likes of Facebook, Google, and Amazon — as well as the mainstreaming of virtual reality headsets from Oculus (Facebook) — we’re about to see whether consumers are willing to trust these devices and bring them into their lives. Just as with the always-listening smart speakers we have in our homes today, these companies will need to educate consumers about data governance and privacy as it pertains to the always-looking devices of tomorrow. At this point in time, many people are rightfully distrustful of ‘big tech’ companies and their ceaseless hunger for data and its power to drive profits. These technologies are advancing quickly, but the challenge here is not just in technological capability; it is also in building trust between people, tech companies, and their ambassadors: our always-on, always-listening, always-looking virtual personal assistants.