
AR and AI continue to converge. This includes everything from “generative AR” in creative workflows to AI’s user-facing role in smart glasses, such as Ray-Ban Metas. The latter uses multimodal AI to “see” one’s surroundings and return audible intelligence.
Google is working on something similar with visual search. The idea is that the smartphone camera – or smart glasses eventually – can contextualize the world around you. This happens through a combination of computer vision and object recognition, which is in turn powered by AI.
Google is in a prime position to do this, having been the world’s go-to search engine for the past 25 years. Through that, it has built a knowledge graph that’s essentially the best AI training set you could ask for. For example, Google Images provides the training data that lets Google Lens recognize objects.
Similarly, Street View provides the training data for object recognition and device localization in Google’s AR navigation efforts. The device has to see and understand where it is before AR can do its thing, such as surfacing information about storefronts or displaying 3D wayfinding arrows.
AI Arms Race
With that backdrop, Google’s continued convergence of AR and AI took a step forward recently with AI Mode. For those unfamiliar, this brings Google into the realm of AI assistants like ChatGPT, including natural-language interaction and query threads that remember and build on previous questions.
This has several implications for Google’s evolution as a company and how it has to disrupt itself to stay competitive in the AI arms race. But saving that broader story for another day, we’ll focus here on AI Mode’s latest update – its integration with images and visual content.
Specifically, Google is combining AI Mode with multimodal search – its longstanding play around searching across media formats like text, images, and video. For example, you can launch a search using an image, then use text to refine the search and zero in on what you want.
For example, using Google Lens you can capture an image of a jacket that you see someone wearing on the street. You can then launch a product search and type to refine (think: “the same jacket in red”). It’s all about increasing the surface area of search to as many inputs as possible.
Now, bringing AI Mode into the mix, you can do the same thing but with all of the natural language enhancements and multi-part questions that the feature is known for. So with the same jacket example, you can continue asking questions until you zero in on the right item.
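To make that flow concrete, here’s a rough sketch of the pattern – not Google’s actual pipeline, just an illustration using an open CLIP-style model that maps images and text into the same vector space, so a photo can start the search and follow-up text can steer it.

```python
# Minimal sketch of the "search with an image, refine with text" pattern.
# This is NOT Google's implementation -- it's an assumed illustration using an
# open CLIP model via sentence-transformers.
# pip install sentence-transformers pillow

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # images and text share one vector space

# Toy product catalog; in practice this would be an index of millions of items.
catalog = [
    "black leather bomber jacket",
    "red wool bomber jacket",
    "blue denim jacket",
    "red windbreaker jacket",
    "green parka",
]
catalog_emb = model.encode(catalog, convert_to_tensor=True)

def visual_search(image_path, refinements=()):
    """Start from a photo, then narrow results with each text refinement."""
    query_emb = model.encode(Image.open(image_path), convert_to_tensor=True)
    # Fold each follow-up ("the same jacket in red") into the query vector.
    for text in refinements:
        query_emb = (query_emb + model.encode(text, convert_to_tensor=True)) / 2
    scores = util.cos_sim(query_emb, catalog_emb)[0]
    ranked = sorted(zip(catalog, scores.tolist()), key=lambda x: -x[1])
    return ranked[:3]

# First pass: just the street photo of the jacket.
print(visual_search("jacket_photo.jpg"))
# Conversational refinement, as in the jacket example above.
print(visual_search("jacket_photo.jpg", refinements=["the same jacket in red"]))
```

Each refinement simply nudges the query vector; a production system does far more, but the conversational shape is the same.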
Force Multiplier
Going under the hood, Google accomplishes this using a technique called “query fan-out.” This relies on object recognition that Google has developed through years of knowledge graph training and image search. This lets it understand images and their elements holistically and contextually.
Google provides the example of a bookshelf. Using Google Lens, you can capture an image of a row of books then ask, “If I enjoyed these, what are some similar books that are highly rated?” You could then follow up to say, “I’m looking for a quick read… what book should I buy next?”
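Here’s a hedged sketch of what a fan-out step might look like in code. The object-recognition and search calls are hypothetical placeholders, but the shape of the pattern – split one question into many sub-queries, run them in parallel, merge the results – is the point.

```python
# Illustrative sketch of a "query fan-out" pattern, assuming it works roughly
# like this: one multimodal question is split into many sub-queries, issued in
# parallel, and the results are merged before an answer is synthesized.
# `recognize_objects` and `search_index` are hypothetical stand-ins, not real APIs.

from concurrent.futures import ThreadPoolExecutor

def recognize_objects(image_path):
    # Placeholder for the object-recognition step (e.g., spines on a bookshelf).
    return ["The Martian", "Project Hail Mary", "Recursion"]

def search_index(query):
    # Placeholder for a backend search call; returns dummy hits here.
    return [f"result for: {query}"]

def fan_out(image_path, question):
    books = recognize_objects(image_path)
    # One sub-query per recognized book, plus one for the overall intent.
    sub_queries = [f"books similar to {title}, highly rated" for title in books]
    sub_queries.append(question)
    with ThreadPoolExecutor() as pool:
        result_sets = list(pool.map(search_index, sub_queries))
    # Merge and de-duplicate before the language model writes the answer.
    merged = {hit for results in result_sets for hit in results}
    return sorted(merged)

print(fan_out("bookshelf.jpg", "If I enjoyed these, what are some similar books?"))
```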
The non-AI-Mode comparison is what you’ve likely done historically – starting over with a new search after each query. That method could eventually zero in on a given need, but AI Mode makes the process more cohesive and progressive, remembering past questions as it goes.
As all the examples above suggest – books, jackets, and so on – user intent for visually driven AI Mode will be commercial in nature. Outcomes include e-commerce or nearby/IRL impulse shopping. That boosts monetization potential for all of the above… and thus Google’s motivations.
Coming full circle, AI Mode is a big deal for Google. And visual/multimodal search is likewise a key endeavor, with strong ties to AR. Put them together, and it’s one of the places where AI will be a force multiplier for AR. And based on AI’s pace of advancement, it will be a moving target.
