Google just combined DeepMind and Google Brain into one big AI team, and on Wednesday, the new Google DeepMind shared details on how one of its visual language models (VLMs), Flamingo, is being used to generate descriptions for YouTube Shorts, which can help with discoverability.
“Shorts are created in just a few minutes and often don’t include descriptions and helpful titles, which makes them harder to find through search,” DeepMind wrote in the post. Flamingo can make those descriptions by analyzing the initial frames of a video to explain what’s going on. (DeepMind gives the example of “a dog balancing a stack of crackers on its head.”) The text descriptions will be stored as metadata to “better categorize videos and match search results to viewer queries.”
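The pipeline DeepMind describes — sample the initial frames, run them through the model, and stash the resulting text as behind-the-scenes metadata — can be sketched roughly like this. To be clear, this is a hypothetical illustration, not Google's actual code: Flamingo isn't publicly callable, so `caption_frames` here is a stub standing in for the VLM, and the record layout is invented.

```python
def caption_frames(frames):
    """Stand-in for a visual language model like Flamingo.

    A real VLM would attend over the frame contents; this stub just
    returns DeepMind's own example caption to illustrate the interface.
    """
    return "a dog balancing a stack of crackers on its head"

def describe_short(video_frames, max_frames=8):
    # Per DeepMind, only the initial frames of the Short are analyzed.
    initial = video_frames[:max_frames]
    return caption_frames(initial)

def attach_metadata(video_record, description):
    # The generated text is stored as metadata behind the scenes --
    # it is never shown to creators or viewers.
    video_record.setdefault("metadata", {})["generated_description"] = description
    return video_record

if __name__ == "__main__":
    video = {"id": "abc123", "frames": ["frame%d" % i for i in range(300)]}
    video = attach_metadata(video, describe_short(video["frames"]))
    print(video["metadata"]["generated_description"])
```

Search systems could then match viewer queries against that stored description field rather than relying on creator-supplied titles.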
I really recommend watching DeepMind’s video explaining how it works, which I’ve embedded below. It’s only about a minute long, and it breaks things down in a digestible way.
This solves a real problem, Google DeepMind’s chief business officer Colin Murdoch tells The Verge: for Shorts, creators sometimes don’t add metadata because the process of creating a video is more streamlined than it is for a longer-form video. Todd Sherman, the director of product management for Shorts, added that because Shorts are mostly watched on a feed where people are just swiping to the next video instead of actively browsing for them, there isn’t as much incentive to add the metadata.
“This Flamingo model — the ability to understand these videos and provide us descriptive text — is just really so valuable for helping our systems that are already looking for this metadata,” Sherman says. “It allows them to more effectively understand these videos so that we can make that match for users when they’re searching for them.”
The generated descriptions won’t be user-facing. “We’re talking about metadata that’s behind the scenes,” Sherman says. “We don’t present it to creators, but there’s a lot of effort going into making sure that it’s accurate.” As for how Google is making sure these descriptions are accurate, “all of the descriptive text is going to align with our responsibility standards,” Sherman says. “It’s very unlikely that a descriptive text is generated that somehow frames a video in a bad light. That’s not an outcome that we anticipate at all.”
Let’s hope that’s true, given AI’s occasional tendency to make things up or tag things incorrectly: eight years after Google Photos tagged two Black people as gorillas, the service still won’t label anything as a monkey because of potential harm. Any serious mistakes from Flamingo could be hurtful to creators and open Google up to significant criticism.
Flamingo is already applying auto-generated descriptions to new Shorts uploads, and it has done so for “a large corpus of existing videos, including the most viewed videos,” according to DeepMind spokesperson Duncan Smith.
I had to ask if Flamingo would be applied to longer-form YouTube videos down the line. “I think it’s completely conceivable that it could,” Sherman says. “I think that the need is probably a little bit less, though.” He notes that for a longer-form video, a creator might spend hours on things like pre-production, filming, and editing, so adding metadata is a relatively small piece of the process of making a video. And because people often watch longer-form videos based on things like a title and a thumbnail, creators making those have an incentive to add metadata that helps with discoverability.
So I guess the answer there is that we’ll have to wait and see. But given Google’s major push to infuse AI into nearly everything it offers, applying something like Flamingo to longer-form YouTube videos doesn’t feel outside the realm of possibility, which could have a huge impact on YouTube search in the future.