Ever since Ray-Ban Meta Smartglasses rolled out, the term multimodal AI has gained currency in the AR world. It describes technology that lets cameras sense one's surroundings (visual input) to inform object identification and other AI assistant functions (audible output).
The reason this is notable is that it helps sidestep a design challenge that's plagued consumer smartglasses for years. The dilemma is that you can't achieve a graphically rich UX with dimensional graphics and holography… while also having stylistically viable eyewear.
In other words, the computing, battery capacity, heat dissipation, and several other requirements of SLAM-based AR (the kind that interacts dimensionally with physical spaces) necessitate heftier headgear whose bulk makes it commercially viable only in enterprise contexts.
Back to AI – Ray-Ban Meta's multimodal AI, to be specific – it lessens the reliance on horsepower-hungry optics as smartglasses' main draw. Instead, personalized and relevant information in "lighter" forms (in this case, audio) replaces graphical richness as the core value driver.
Or as Mark Zuckerberg said in an underplayed comment during Ray-Ban Meta Smartglasses’ release:
Before the last year’s AI breakthroughs, I kind of thought that smartglasses were only really going to become ubiquitous once we really dialed in the holograms and the displays, which we are making progress on but is somewhat longer. But now I think that the AI part of this is going to be just as important in smartglasses being widely adopted as any of the augmented reality features.
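To make that "AI part" concrete, here is a minimal sketch of the camera-in, audio-out loop described above. It is illustrative only: the names and the canned answer are placeholders rather than Meta's actual stack, but it shows how a single frame plus a spoken question can yield a short, speech-ready reply with no display involved.

```python
# Minimal sketch of the "see something, hear an answer" loop.
# All names here are placeholders for illustration, not Meta's APIs;
# the model call is stubbed so the sketch runs on its own.
from dataclasses import dataclass


@dataclass
class Frame:
    """A single still captured by the glasses' camera."""
    jpeg: bytes
    timestamp_ms: int


def answer_about_scene(frame: Frame, question: str) -> str:
    """Stand-in for a multimodal model call: one image plus a spoken
    question in, a short speech-ready answer out. A real device would
    call a vision-language model here; this stub returns a canned reply."""
    return "That looks like a ficus tree."


if __name__ == "__main__":
    frame = Frame(jpeg=b"...", timestamp_ms=0)
    # The returned string would be handed to text-to-speech for playback.
    print(answer_about_scene(frame, "What am I looking at right now?"))
```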
Parallel Paths
Meta isn’t the only company dabbling in multimodal AI as an AR accelerant. Google has been doing it for longer, given its work in visual search and other multimodal inputs. This includes blending voice, visual, and text search as a way to contextualize physical-world items.
Backing up, these search modalities have been on separate, parallel paths for years. We all know what text and voice search are, but visual search – for those unfamiliar – means using your camera to identify and contextualize things, a la Google Lens. It's an underrated flavor of AR.
The idea behind all three search formats is optionality: to accommodate any or all inputs when situationally relevant. For example, use text search when you have full keyboard access; voice search while driving; and visual search to identify objects (think: fashion items) with your camera.
More recently, Google has combined them into the same workflow. This was first seen in multisearch, unveiled at Google I/O in 2022. In short, it lets you perform visual searches for objects you encounter, then refine the results using text or voice (e.g., "the same jacket in blue").
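To illustrate the mechanic (and only the mechanic; this is not Google's implementation), here is a toy sketch in Python: a visual search ranks catalog items against an image embedding, and a follow-up text refinement such as "in blue" re-ranks those candidates. The embedding vectors are random stand-ins for what a learned image/text encoder would produce.

```python
# Toy sketch of the multisearch-style flow: visual search first, then a
# text refinement nudges the ranking. Embeddings are random stand-ins.
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def visual_search(image_vec: np.ndarray,
                  catalog: dict[str, np.ndarray],
                  top_k: int = 10) -> list[str]:
    """Step 1: rank catalog items by similarity to the query image alone."""
    ranked = sorted(catalog, key=lambda name: cosine(image_vec, catalog[name]),
                    reverse=True)
    return ranked[:top_k]


def refine_with_text(candidates: list[str],
                     text_vec: np.ndarray,
                     catalog: dict[str, np.ndarray]) -> list[str]:
    """Step 2: re-rank the visual candidates against the text refinement
    (e.g., "the same jacket in blue")."""
    return sorted(candidates,
                  key=lambda name: cosine(text_vec, catalog[name]),
                  reverse=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    catalog = {f"jacket_{i}": rng.normal(size=16) for i in range(20)}
    image_query = rng.normal(size=16)  # stand-in for the photo's embedding
    blue_text = rng.normal(size=16)    # stand-in for the "in blue" text
    shortlist = visual_search(image_query, catalog, top_k=5)
    print(refine_with_text(shortlist, blue_text, catalog))
```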
What Am I Looking At?
This concept advanced more recently when Google added a feature that integrates voice and text inputs more deeply into a visual search query. Specifically, users can now long-press the shutter button in Google Lens and speak while doing so, letting spoken input refine the query in real time.
For example, point Google Lens at a landscaping shrub in your neighborhood while long-pressing and saying, "What kind of plant is this, and what local nurseries carry it?" Similarly, point it at a new restaurant and say, "Does this place require a reservation?"
To put that another way, Google is getting an integrated – rather than sequential – mix of visual and voice inputs to compute the best result. It’s a tighter and more intuitive UX, similar to Ray-Ban Meta Smartglasses’ feature that lets you ask “What am I looking at right now?”
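Continuing the toy sketch from above, the integrated version would fuse both inputs into a single query before any ranking happens, rather than retrieving on the image and filtering with text afterward. One simple, hypothetical way to express that is a weighted blend of the two embeddings:

```python
# Continuation of the earlier toy sketch: instead of searching on the image
# and then re-ranking with text, blend both signals into one query vector
# up front. Illustrative only, not any vendor's actual method.
import numpy as np


def fused_query(image_vec: np.ndarray,
                text_vec: np.ndarray,
                text_weight: float = 0.4) -> np.ndarray:
    """Combine image and text embeddings into a single query vector."""
    return (1.0 - text_weight) * image_vec + text_weight * text_vec


def integrated_search(image_vec: np.ndarray,
                      text_vec: np.ndarray,
                      catalog: dict[str, np.ndarray],
                      top_k: int = 5) -> list[str]:
    """Rank the catalog once, against the fused image + text query."""
    query = fused_query(image_vec, text_vec)

    def score(name: str) -> float:
        vec = catalog[name]
        return float(query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec)))

    return sorted(catalog, key=score, reverse=True)[:top_k]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    catalog = {f"jacket_{i}": rng.normal(size=16) for i in range(20)}
    photo = rng.normal(size=16)   # stand-in for the captured image embedding
    spoken = rng.normal(size=16)  # stand-in for the spoken refinement
    print(integrated_search(photo, spoken, catalog))
```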
And that brings us back to Meta. Though it’s not the most prominent topic in the world of AR, multimodal AI may be an unsung hero that sparks a much-needed jolt of utility and appeal. Given Ray-Ban Meta Smartglasses’ surprise-hit status, it may already be well on its way.