Meta is open-sourcing an AI tool called ImageBind that predicts connections between data in a way similar to how people perceive or imagine an environment. While image generators such as Midjourney, Stable Diffusion and DALL-E 2 pair words with images, letting you create visual scenes from nothing more than a text description, ImageBind casts a wider net. It can link text, images/video, audio, 3D measurements (depth), temperature data (thermal) and motion data (from inertial measurement units), and it does so without having to be trained on every possible combination first. It's an early stage of a framework that could eventually generate complex environments from an input as simple as a text prompt, image or audio recording (or some combination of the three).
You can think of ImageBind as moving machine learning closer to human learning. For example, when you stand in a stimulating environment like a busy city street, your brain (largely unconsciously) absorbs the sights, sounds and other sensory input to glean information about passing cars and pedestrians, tall buildings, the weather and much more. Humans and other animals evolved to process this data for genetic advantage: survival and passing on our DNA. (The more aware you are of your surroundings, the better you can avoid danger and adapt to your environment.) As computers get closer to mimicking animals' multi-sensory connections, they can use those links to generate fully realized scenes from only limited chunks of data.
So, while you could use Midjourney to prompt “a basset hound wearing a Gandalf outfit while balancing on a beach ball” and get a fairly realistic photo of that improbable scene, a multimodal AI tool like ImageBind could eventually create a video of the dog with corresponding sounds, including a detailed suburban living room, the room's temperature and the precise locations of the dog and anyone else in the scene. “This creates unique opportunities to create animations from static images simply by combining them with audio prompts,” Meta researchers said today in a developer-focused blog post. “For example, a creator could combine an image with an alarm clock and a rooster crowing, and use a crowing audio prompt to segment the rooster or an alarm sound to segment the clock and animate both into a video sequence.”
As for what else this new toy could do, it points clearly to one of Meta's core ambitions: VR, mixed reality and the metaverse. For example, imagine a future headset that can construct fully realized 3D scenes (with sound, motion and so on) on the fly. Or, virtual game developers could eventually use it to take much of the legwork out of their design process. Likewise, content creators could make immersive videos with realistic soundscapes and motion based only on text, image or audio input. It's also easy to imagine a tool like ImageBind opening new doors in accessibility, generating real-time multimedia descriptions to help people with vision or hearing impairments better perceive their immediate environments.
“In typical AI systems, there is a specific embedding (that is, vectors of numbers that can represent data and their relationships in machine learning) for each modality,” Meta said. “ImageBind shows that it is possible to create a unified embedding space across multiple modalities without having to train on data for every different combination of modalities. This is important because researchers cannot create datasets with samples containing, for example, audio data and thermal data from a busy city street, or depth data and a text description of a beach cliff.”
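To make the idea of a shared embedding space concrete, here is a toy sketch in PyTorch. It is emphatically not Meta's implementation: the two stand-in encoders and feature sizes are invented for illustration. The point is that once every modality is projected into the same vector space, a simple cosine similarity is enough to match, say, audio clips to captions, even if those two modalities were never explicitly paired during training.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in encoders: each maps its own modality into the same
# 1024-dimensional space. In ImageBind's approach, aligning each modality to
# images is enough for the other modalities to line up with one another.
DIM = 1024
text_encoder = torch.nn.Linear(300, DIM)    # placeholder for a real text tower
audio_encoder = torch.nn.Linear(128, DIM)   # placeholder for a real audio tower

captions = torch.randn(3, 300)     # features for, say, "dog barking", "car horn", "rain"
audio_clips = torch.randn(3, 128)  # features for three audio clips

# Project into the shared space and L2-normalize, so dot products are cosine similarities.
text_emb = F.normalize(text_encoder(captions), dim=-1)
audio_emb = F.normalize(audio_encoder(audio_clips), dim=-1)

# Cross-modal retrieval: score every clip against every caption, pick the best match.
similarity = audio_emb @ text_emb.T   # shape: (3 clips, 3 captions)
print(similarity.argmax(dim=-1))      # index of the best caption for each clip
```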
Meta sees the technology eventually expanding beyond its current six “senses,” so to speak. “While we have explored six modalities in our current research, we believe that introducing new modalities that link as many senses as possible, such as touch, speech, smell and brain signals from fMRI, will enable richer human-centered AI models.” Developers interested in exploring this new sandbox can start by diving into Meta’s open-source code.
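For anyone looking for a starting point, the basic usage pattern in the ImageBind repository looks roughly like the sketch below. Treat it as an approximation rather than a definitive recipe: the module paths and helper names shown here (imagebind_model.imagebind_huge, ModalityType, the data.load_and_transform_* functions) reflect the repo's published example at release and may differ in later versions, and the file paths are placeholders.

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the pretrained ImageBind model (downloads weights on first use).
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Prepare inputs from three different modalities (placeholder file paths).
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(["a dog", "a car", "a bird"], device),
    ModalityType.VISION: data.load_and_transform_vision_data(["dog.jpg", "car.jpg", "bird.jpg"], device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(["dog.wav", "car.wav", "bird.wav"], device),
}

# One forward pass produces embeddings for every modality in the same space.
with torch.no_grad():
    embeddings = model(inputs)

# Compare modalities directly, e.g. how well each image matches each text prompt.
print(torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1))
```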