Inside Facebook Reality Labs: The Next Era of Human-Computer Interaction

by Facebook Reality Labs

In today’s post, the first in a series exploring the future of human-computer interaction (HCI), we’ll begin to unpack the 10-year vision of a contextually-aware, AI-powered interface for augmented reality (AR) glasses that can use the information you choose to share to infer what you want to do, when you want to do it.

Next week, we’ll share some nearer-term research: wrist-based input combined with usable but limited contextualized AI, which dynamically adapts to you and your environment. And later in the year, we’ll pull back the curtain on some groundbreaking work in soft robotics to build comfortable, all-day wearable devices and share an update on our haptic glove research.

Imagine a world where a lightweight, stylish pair of glasses could replace your need for a computer or smartphone. You’d have the ability to feel physically present with friends and family, no matter where in the world they happened to be, along with contextually-aware AI to help you navigate the world around you and rich 3D virtual information within arm’s reach. Best of all, they’d let you look up and stay present in the world around you rather than pulling your attention away to a device in the palm of your hand. This is a device that wouldn’t force you to choose between the real world and the digital world.

It may sound like science fiction, but it’s a future that Facebook is building inside our labs. And today, we’ll share our vision for how people will interact with that future.

The AR Interaction Challenge

Facebook Reality Labs (FRL) Chief Scientist Michael Abrash has called AR interaction “one of the hardest and most interesting multi-disciplinary problems around,” because it’s a complete paradigm shift in how humans interact with computers. The last great shift began in the 1960s when Doug Engelbart’s team invented the mouse and helped pave the way for the graphical user interfaces (GUIs) that dominate our world today. The invention of the GUI fundamentally changed HCI for the better — and it’s a sea change that’s held for decades.

Doug Engelbart’s Mother of All Demos, 1968 (video courtesy of SRI International: www.sri.com)

But all-day wearable AR glasses require a new paradigm because they will be able to function in every situation you encounter in the course of a day. They need to be able to do what you want them to do and tell you what you want to know when you want to know it, in much the same way that your own mind works — seamlessly sharing information and taking action when you want it, and not getting in your way otherwise.

“In order for AR to become truly ubiquitous, you need low-friction, always-available technology that’s so intuitive to use that it becomes an extension of your body,” says Abrash. “That’s a far cry from where HCI is today. So, like Engelbart, we need to invent a completely new type of interface — one that places us at the center of the computing experience.”

This AR interface will need to be proactive rather than reactive. It will be an interface that turns intention into action seamlessly, giving us more agency in our own lives and allowing us to stay present with those around us.

Importantly, it will need to be socially acceptable in every respect — secure, private, unobtrusive, easy to learn, easy to use, comfortable/all-day wearable, effortless, and reliable.

As we build the next computing platform centered around people, we’re committed to driving this innovation forward in a responsible, privacy-centric way. That’s why we’ve crafted a set of principles for responsible innovation that guide all our work in the lab and help ensure we build products that are designed with privacy, safety, and security at the forefront.

In short, the AR interface will require a complete rethinking of how humans and computers interact, and it will transform our relationship with the digital world every bit as much as the GUI has.

The Problem Space, Explored

Say you decide to walk to your local cafe to get some work done. You’re wearing a pair of AR glasses and a soft wristband. As you head out the door, your Assistant asks if you’d like to listen to the latest episode of your favorite podcast. A small movement of your finger lets you click “play.”

As you enter the cafe, your Assistant asks, “Do you want me to put in an order for a 12-ounce Americano?” Not in the mood for your usual, you again flick your finger to click “no.”

You head to a table, but instead of pulling out a laptop, you pull out a pair of soft, lightweight haptic gloves. When you put them on, a virtual screen and keyboard show up in front of you and you begin to edit a document. Typing is just as intuitive as typing on a physical keyboard and you’re on a roll, but the noise from the cafe makes it hard to concentrate.

Recognizing what you’re doing and detecting that the environment is noisy, the Assistant uses special in-ear monitors (IEMs) and active noise cancellation to soften the background noise. Now it’s easy to focus. A server passing by your table asks if you want a refill. The glasses know to let their voice through, even though the ambient noise is still muted, and proactively enhance their voice using beamforming. The two of you have a normal conversation while they refill your coffee despite the noisy environment — and all of this happens automatically.

A friend calls, and your Assistant automatically sends it to voicemail so as not to interrupt your current conversation. And when it’s time to leave to pick up the kids based on your calendared event, you get a gentle visual reminder so you won’t be late due to the current traffic conditions.

Building the AR Interface

FRL Research has brought together a highly interdisciplinary team made up of research scientists, engineers, neuroscientists, and more, led by Research Science Director Sean Keller, all striving to solve the AR interaction problem and arrive at computing’s next great paradigm shift.

“We classically think of input and output from the computer’s perspective, but AR interaction is a special case where we’re building a new type of wearable computer that’s sensing, learning, and acting in concert with users as they go about their day,” says Keller, who joined FRL Research to build a five-person team which has since grown to a team of hundreds of world-class experts in the span of just six years. “We want to empower people, enabling each and every one of us to do more and to be more — so our AR interaction models are human-centric.”

At Facebook Connect in 2020, Abrash explained that an always-available, ultra-low-friction AR interface will be built on two technological pillars:

The first is ultra-low-friction input, so when you need to act, the path from thought to action is as short and intuitive as possible.

Facebook Connect: The AR Angle

You might gesture with your hand, make voice commands, or select items from a menu by looking at them — actions enabled by hand-tracking cameras, a microphone array, and eye-tracking technology. But ultimately, you’ll need a more natural, unobtrusive way of controlling your AR glasses. We’ve explored a range of neural input options, including electromyography (EMG). While several directions have potential, wrist-based EMG is the most promising. This approach uses electrical signals that travel from the spinal cord to the hand, in order to control the functions of a device based on signal decoding at the wrist. The signals through the wrist are so clear that EMG can detect finger motion of just a millimeter. That means input can be effortless — as effortless as clicking a virtual, always-available button — and ultimately it may even be possible to sense just the intention to move a finger.
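To make the idea of decoding a "click" from a wrist signal more concrete, here is a toy sketch. It is not FRL's decoder, and real EMG decoding is far more sophisticated. It assumes a hypothetical single-channel signal already normalized to roughly the range -1 to 1, rectifies it, smooths it with a short moving average, and reports a click each time the smoothed envelope rises above a threshold. The function name, window size, and threshold are all illustrative assumptions.

```python
from typing import List


def detect_clicks(samples: List[float], window: int = 5, threshold: float = 0.5) -> List[int]:
    """Return the sample indices where the smoothed signal envelope first
    crosses the threshold (one index per burst of activity).

    Toy illustration only: `samples` is a hypothetical, pre-normalized
    single-channel signal, not a real EMG recording format.
    """
    clicks = []
    above = False  # True while we are inside a burst that was already reported
    for i in range(len(samples)):
        start = max(0, i - window + 1)
        segment = samples[start:i + 1]
        # Rectify (absolute value) and average to get a simple envelope.
        envelope = sum(abs(x) for x in segment) / len(segment)
        if envelope >= threshold and not above:
            clicks.append(i)
            above = True
        elif envelope < threshold:
            above = False
    return clicks


# A quiet signal with one short burst of activity in the middle.
signal = [0.0] * 10 + [1.0, -1.0, 1.0, -1.0, 1.0] + [0.0] * 10
print(detect_clicks(signal))
```

Tracking whether the envelope is already above the threshold means one burst of activity counts as one click rather than several, which is the behavior you would want from a virtual button.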

The second pillar is the use of AI, context, and personalization to scope the effects of your input actions to your needs at any given moment. This is about building an interface that can adapt to you, and it will require building powerful AI models that can make deep inferences about what information you might need or things you might want to do in various contexts, based on an understanding of you and your surroundings, and that can present you with the right set of choices. Ideally, you’ll only have to click once to do what you want to do or, even better, the right thing may one day happen without you having to do anything at all. Our goal is to keep you in control of the experience, even when things happen automatically.

While the fusion of contextually-aware AI with ultra-low-friction input has tremendous potential, important challenges remain — like how to pack the technology into a comfortable, all-day wearable form factor and how to provide the rich haptic feedback needed to manipulate virtual objects. Haptics also let the system communicate back to the user (think about the vibration of a mobile phone).

To address these challenges, we need soft, all-day wearable systems. In addition to their deep work across ultra-low-friction input and contextualized AI, Keller’s team is leveraging soft, wearable electronics — devices worn close to or on the skin’s surface where they detect and transmit data — to develop a wide range of technologies that can be comfortably worn all day on the hand and wrist, and that will give us a much richer bi-directional path for communication. These include EMG sensors and wristbands.

AR glasses interaction will ultimately benefit from a novel integration of multiple new and/or improved technologies, including neural input, hand tracking and gesture recognition, voice recognition, computer vision, and several new input technologies like IMU finger-click and self-touch detection. It will require a broad range of contextual AI capabilities, from scene understanding to visual search, all with the goal of making it easier and faster to act on the instructions that you’d already be sending to your device.

And to truly center human needs in these new interactions, they will need to be built responsibly from the ground up, with a focus on the user’s needs for privacy and security. These devices will change the way we interact with the world and each other, and we will need to give users total control over those interactions.

Building the AR interface is a difficult, long-term undertaking, and there are years of research yet to do. But by planting the seeds now, we believe we can get to AR’s Engelbart moment and then get that interface into people’s hands over the next 10 years, even as it continues to evolve for decades to come.

More Context

The biggest difference between the future AR interface and everything that’s come before is that there will be much more contextual information available to our AR devices. The glasses will see and hear the world from your perspective, just as you do, so they will have vastly more personal context than any previous interface has ever had. Coupled with powerful AI inference models, this context will give them the ability to help you in an ever-increasing variety of personalized ways and free your mind up to do other things.

Imagine having a pair of glasses that could feed you key statistics in a business meeting, guide you to destinations, translate signs on the fly, tell you where you’ve left your car keys, or even help you with almost any sort of task. Asking what else this interface will enable is kind of like asking what the GUI would enable back in 1967 — the possibilities are vast and open-ended.

Another difference is that most existing interfaces are modal. You pick the mode by running an app, and your set of choices is then altered to match that mode. And as you switch from one app to another, the context of what you’re doing at any given moment is lost as you move to your next task. But AR glasses don’t have that luxury. They will work best if they operate seamlessly in all the contexts you encounter in a day, contexts that change constantly and often overlap. This means that the interface will treat every interaction as an intent inference problem. And it can then use its predictions to present you with a simple set of choices, without having you navigate through menu after menu of options to find the information you might be looking for, as today’s interfaces do.
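One way to picture intent inference feeding a short list of choices is the minimal sketch below. It assumes some upstream model (not shown, and entirely hypothetical here) has already assigned a raw score to each candidate intent. The sketch converts those scores to probabilities with a softmax, then offers a single suggestion when the model is confident and a short list otherwise. The intent names, the 0.8 confidence cutoff, and the three-choice limit are illustrative assumptions, not values from FRL's research.

```python
import math
from typing import Dict, List, Tuple


def rank_intents(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Turn raw intent scores into probabilities (softmax), highest first."""
    peak = max(scores.values())
    # Subtracting the peak keeps the exponentials numerically stable.
    weights = {name: math.exp(score - peak) for name, score in scores.items()}
    total = sum(weights.values())
    ranked = [(name, weight / total) for name, weight in weights.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def choices_to_present(scores: Dict[str, float],
                       confident: float = 0.8,
                       max_choices: int = 3) -> List[str]:
    """Offer one suggestion when confident, otherwise a short list to pick from."""
    ranked = rank_intents(scores)
    top_name, top_probability = ranked[0]
    if top_probability >= confident:
        return [top_name]
    return [name for name, _ in ranked[:max_choices]]


# Hypothetical scores for two situations.
leaving_home = {"play_podcast": 5.0, "call_friend": 0.0, "open_document": 0.0}
ambiguous = {"play_podcast": 1.0, "call_friend": 0.9, "open_document": 0.1, "set_timer": -2.0}
print(choices_to_present(leaving_home))
print(choices_to_present(ambiguous))
```

In the first case the single suggestion can be accepted with one click. In the second, the interface falls back to a short menu instead of guessing.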

Critically, the interface of the future will be amplified by a key feedback loop. Not only can the AI learn from you, but because the input is ultra-low-friction (and only requires an “intelligent click”), the AI will ask questions to improve its understanding of you and your needs more quickly. The ability to instruct the system in real time will be hugely valuable and will leapfrog systems that rely on traditional data collection and training.
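The feedback loop can be illustrated with a deliberately simple sketch. It assumes each yes/no click is logged against a hypothetical (context, suggestion) pair and uses a smoothed acceptance rate as the estimate of whether to offer that suggestion again. The class and its names are invented for illustration. A real system would rely on much richer models, but the principle of learning directly from low-friction answers is the same.

```python
from collections import defaultdict


class SuggestionFeedback:
    """Track how often a suggestion is accepted in a given context.

    Illustrative only: contexts and suggestions are plain strings here.
    """

    def __init__(self) -> None:
        # Maps (context, suggestion) -> [accepted_count, rejected_count].
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, context: str, suggestion: str, accepted: bool) -> None:
        """Log one yes/no answer from the user."""
        counts = self._counts[(context, suggestion)]
        counts[0 if accepted else 1] += 1

    def acceptance_estimate(self, context: str, suggestion: str) -> float:
        """Smoothed acceptance rate; 0.5 when nothing has been observed yet."""
        accepted, rejected = self._counts[(context, suggestion)]
        return (accepted + 1) / (accepted + rejected + 2)


feedback = SuggestionFeedback()
print(feedback.acceptance_estimate("cafe", "order_americano"))
feedback.record("cafe", "order_americano", accepted=False)
print(feedback.acceptance_estimate("cafe", "order_americano"))
```

Each answer shifts the estimate immediately, which is the sense in which instructing the system in real time can be faster than waiting for a separate round of data collection and training.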

The ultimate goal is to build an interface that accurately adapts to you and meets your needs — and is able to ask a simple question to disambiguate when it isn’t sure — but this system is years off. That’s partly because the sensing technology and egocentric data needed to train the AI inference models simply do not exist. By collecting first-person perspective data, our recently launched Project Aria will move us one step closer to this goal.

In the nearer term, we’ll see usable but limited contextual AI with predictive features like the ability to proactively suggest a playlist you might want to listen to on your daily jog. Stay tuned to the blog next week, when we’ll pull back the curtain on some of our work with HCI at the wrist and what we call an adaptive interface.

People at the Center

Today’s devices have allowed us to connect with people far away from us, unconstrained by time and space, but too often, these connections have come at the expense of the people physically next to us. We tell ourselves that if only we had more willpower we would put down our smartphone and focus on the conversation in front of us. That’s a false choice. Our world is both digital and physical, and we shouldn’t have to sacrifice one to truly embrace the other.

We need to build devices that won’t force us to choose between people and our devices. These future devices will let us look up and stay in the world so that we can do more of what we are built to do as humans — to connect and collaborate.

But for this next great wave of human-oriented computing to come to fruition, we need a paradigm shift that truly places people at the center. That means our devices will need to adapt to us, rather than the other way around. It means AR needs its own Engelbart moment.