How often do you see an artwork, a plant, or an item of clothing that you’d like to look up on the web but can’t put into words? Probably not all that often. But when you do come across something you’d like to search for and words fail you, it’s frustrating and ultimately a lost cause. Text-based search is imperfect. That’s the idea behind visual search: to fill in during those moments when words won’t help.
Visual search is supposedly the next great frontier in building upon our current search engines – Amazon, Google, Pinterest, etc. – and many efforts have been made over the last few years to make it a hit.
Visual Search Giants
Indexing the world’s visual knowledge is a big undertaking because of the large variability in physical objects. For instance, think about how many different models of chairs exist in the world. That’s a very basic example, yet the complexity of training an algorithm to spot every type of chair and then do something with that information is enormous.
Now apply that to every object, plant, building, and visual input you can imagine… It’s a nearly insurmountable task.
Indexing the world’s visual knowledge is far harder than indexing its written knowledge.
Naturally, Google is heavily involved here with Google Lens. They want you to be able to “Search what you see”.
But Pinterest is equally invested in this idea. Pinterest Lens recently crossed the 2.5 billion mark for objects it can recognize. I’d give you a reference, but I don’t have one. There might be 5 billion or 50 trillion unique objects in the world. I don’t know. My guess would be closer to trillions.
Pinterest Lens is mainly focused on recognizing things you can buy – clothing, furniture, household items, etc. – ideally to pair visual search with partnered retailers. Google Lens, on the other hand, is focused on the breadth of visual knowledge – books, business cards, landmarks, buildings, artwork, plants.
Visual search lacks a compelling use case, though. Most of the applications seem to impact only a small subset of people. The exception is retail, which is the visual search use case marketed more heavily than any other.
Visual Search in Retail
Pinterest and Google both heavily promote their Lenses’ ability to make style recommendations based on a user’s visual searches of their own clothing. Snap a pic of a shirt and they’ll show you how others styled it (or recommend a pair of pants to buy).
Similarly, companies like Syte and Markable are white-labeling visual search for clothing retailers to use in their apps. For instance, Syte gave Pretty Little Thing (PLT) a visual search feature on their mobile site. Shoppers can snap a pic of any outfit, upload it to the visual search feature, and see if PLT has any similar items of clothing to purchase.
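Under the hood, features like this are generally some flavor of image-similarity search: embed the shopper’s photo and the retailer’s catalog into the same vector space, then rank catalog items by how close they sit to the query. The sketch below illustrates that general idea with a pretrained ResNet as the feature extractor and cosine similarity for ranking – the file names are made up, and this is not Syte’s actual pipeline.

```python
# Minimal sketch of content-based retrieval: embed catalog images with a
# pretrained CNN, then rank them by cosine similarity to the shopper's photo.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained backbone with the classification head removed -> generic feature extractor.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> torch.Tensor:
    """Return an L2-normalized embedding for a single image file."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        vec = backbone(img).squeeze(0)
    return vec / vec.norm()

# Hypothetical file names: the shopper's photo and a tiny stand-in catalog.
query = embed("outfit_photo.jpg")
catalog = {name: embed(name) for name in ["dress_01.jpg", "top_14.jpg", "jeans_07.jpg"]}

# Cosine similarity = dot product of normalized vectors; higher means more similar.
ranked = sorted(catalog.items(), key=lambda kv: float(query @ kv[1]), reverse=True)
for name, vec in ranked:
    print(name, round(float(query @ vec), 3))
```

Production systems presumably add a garment detector to crop the item of interest first, plus an approximate nearest-neighbor index so the ranking scales beyond a toy catalog.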
I ran a few tests on the Syte x PLT visual search feature, using pictures taken from the FashionNova Instagram. PLT’s visual search found similar matches in its online inventory for only about ¼ of the outfits. Not very impressive, in my opinion.
Google Lens was only marginally better at this task. Imagine searching for something on Google Search and only getting a good result 25% of the time. That would be awful, and you’d stop using Google altogether.
It’s important to note, though, that Amazon has already made two major attempts at visual search in retail – the first being the Echo Look.
Why Didn’t the Echo Look Work?
In 2017, Amazon released the Echo Look, the latest installment in the Echo lineup. It was a smart camera that took full-body outfit selfies and helped you choose what looked best on you via a machine learning style picker.
Two years later, the average rating of the Echo Look on Amazon is 3.8 stars (horrible for an Amazon product). Dissatisfied buyers cited:
- Poor image quality and selfie effects
- Unreliable outfit suggestions
- Weak execution of the app
Additionally, the cultural appeal was severely lacking, which deterred potential buyers for these reasons:
- Worries about spying – Amazon peering into our homes, wardrobes, etc.
- Buying a device that is just going to sell more stuff to you.
- Why buy a $200 selfie camera? I’ll just buy a stand for my phone and set it on a timer.
Ultimately, it’s extremely difficult to train an algorithm to detect and suggest style. It’s far too ambiguous.
Amazon recently launched the Amazon Personal Shopper, which seems doomed to the same mediocre reception.
If the biggest company with the most resources and best talent can’t make visual search in retail a hit… then who can?
Visual Search: A Lost Cause?
It won’t take much for you to realize that visual search of any kind is lacking. Go ahead, play around with Google Lens and see how much simple stuff it fails to identify.
The limitations of visual search are entirely technical. Both Sridhar Mahadevan – Director of Data Science at Adobe Research – and Alan Yuille – Bloomberg Distinguished Professor of visual cognition at Johns Hopkins – sum it up:
Mahadevan: “The first test I did was point it to my living room; the network classified it as a ‘barbershop’. Repeated tests showed performance accuracy was lower than 20%. Only with great difficulty were even simple objects like cups or plants recognized for what they were. Most often, the classifications produced were hilarious.”
Yuille: “Deep Nets perform well on benchmarked datasets, but can fail badly on real world images outside the dataset. Deep Nets are overly sensitive to changes in the image which would not fool a human observer.”
Basically, visual search can be boiled down to:
This stuff works great in a lab, but horribly in the real world.
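If you want to see that gap for yourself, the sketch below reproduces the kind of test Mahadevan describes: feed one ordinary photo (the file name here is hypothetical) to an off-the-shelf ImageNet classifier and look at its top guesses. Because the network can only answer with one of the 1,000 classes it was trained on, a scene it never saw gets squeezed into whichever label happens to look closest – which is how a living room ends up as a ‘barbershop’.

```python
# Rough reproduction of the "point it at my living room" experiment:
# run a pretrained ImageNet classifier on one photo and print its top guesses.
# The image path is hypothetical; any snapshot of a room will do.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet50_Weights.DEFAULT          # pretrained on ImageNet-1k
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()                  # resize/crop/normalize used at training time

img = preprocess(Image.open("living_room.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    probs = model(img).softmax(dim=1).squeeze(0)

# The model has no "I don't know" option -- it must pick from its fixed label set.
top5 = probs.topk(5)
for p, idx in zip(top5.values, top5.indices):
    print(f"{weights.meta['categories'][idx.item()]}: {p.item():.2%}")
```

On curated benchmark images the top answer is usually sensible; on cluttered, real-world scenes it often isn’t, which is exactly Yuille’s point about models that shine on datasets and stumble outside them.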
I don’t want to completely disregard the possibility of visual search being influential in our lifetimes. However, I just can’t see a route to building a visual search engine that’s actually acceptable.
What are we supposed to do – run up to everything we want to index, take a billion photos of it, and still have a computer barely understand it?
Sorry. I’m not buying in. The solution to this problem is far over my head. Although, from what I understand, Alan Yuille has a good proposal.