What is multimodal AI? It's a question many people have been asking since the recent release of GPT-4, which boasts a range of advanced capabilities, particularly in the realm of multimodal AI.
Our world comprises multiple modalities – we see objects, hear sounds, feel textures, smell odours, and so on. Generally, a modality refers to the way in which something happens or is experienced. Multimodal research is a subfield of AI that aims to train models to process and find relationships between different data types (modalities) – typically, but not limited to, images, video, audio, and text. To truly understand our world, AI must learn to interpret these modalities together and understand how they interrelate. Multimodal AI is a step towards AGI (Artificial General Intelligence) by developing systems that can process information from diverse sources and replicate human cognition.
Real-world data is usually multimodal, with videos being an excellent example as they often come with an audio track and may even have text subtitles. Social media posts, news articles, and internet-published content routinely mix text with images, videos, and audio recordings.
In the past, AI research focused on one modality at a time. For instance, a facial recognition system on a mobile device takes a video feed and uses images of the owner's face to verify them and unlock the phone. However, such systems can be fooled with a photo or video of the person. What if we also fed the person's voice into the system along with the image, since a voice is unique to each individual? This would create a more robust and secure phone-unlocking system. This is what we mean by a multimodal system.
Most Well-Known Multimodal AI Systems
Automatic Speech Recognition (ASR) systems are one of the oldest multimodal approaches, dating back to the 1980s. ASR takes audio as input and generates text as output. Virtual assistants such as Amazon Alexa, Google Home, and Apple Siri use ASR to identify and act on a user's spoken instructions.
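As a rough illustration of the audio-in, text-out flow, here is a minimal sketch using OpenAI's open-source Whisper library; the model size and audio file name are placeholders, and production voice assistants rely on far more elaborate pipelines.

```python
# Minimal ASR sketch using the open-source Whisper package (pip install openai-whisper).
# "base" is one of the smaller pretrained models; "command.wav" is a placeholder audio file.
import whisper

model = whisper.load_model("base")        # load a pretrained speech-recognition model
result = model.transcribe("command.wav")  # audio in -> text out
print(result["text"])                     # e.g. the user's spoken instruction as text
```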
AI image generation has been around for a while now. DALL-E, developed by OpenAI, was one of the earliest text-to-image models to gain significant traction. It was soon followed by models like DALL-E 2, Midjourney, and Stable Diffusion, which were much better at generating images. These models take text as input and create images that match it. The generated images can be realistic or abstract and convey a specific theme or message. Incremental development in these models has brought in capabilities to edit photos, such as removing a particular person from an image, fixing red eyes, or replacing an object with a different one.
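To make the text-in, image-out flow concrete, here is a minimal sketch using Hugging Face's diffusers library with a public Stable Diffusion checkpoint; the checkpoint ID and prompt are only examples.

```python
# Minimal text-to-image sketch with Hugging Face diffusers (pip install diffusers transformers torch).
# The checkpoint ID and prompt are illustrative; any compatible Stable Diffusion checkpoint works.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")  # move to a GPU if one is available

image = pipe("an astronaut riding a horse, digital art").images[0]  # text prompt -> PIL image
image.save("generated.png")
```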
Big Competitors and Recent Advancements:
OpenAI, Microsoft, and Google are leading the way in multimodal AI with several recent developments.
GPT-4, OpenAI's latest large language model, can work with both text and images, unlike GPT-3 and GPT-3.5, which could only process text. With the addition of one more modality, the range of things that can be achieved expands tremendously.
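As a rough sketch of what a text-plus-image prompt can look like, the snippet below uses the OpenAI Python client's chat API; the model name and image URL are placeholders for whichever vision-capable model and image you actually use.

```python
# Rough sketch of a text + image prompt via the OpenAI Python client (pip install openai).
# The model name and image URL are placeholders; pick a vision-capable model you have access to.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder for a vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this picture?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```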
Microsoft researchers released a Multimodal Large Language Model (MLLM) named Kosmos-1, trained on text and image data. Apart from generating text like ChatGPT, this model can handle tasks such as generating captions for photos we have taken, answering questions about images, and solving nonverbal IQ-test questions.
Google has developed a model named PaLM-E that can control different robots to perform tasks such as moving around an environment, picking up objects, and carrying them to a destination. The model exhibits capabilities like visual chain-of-thought reasoning, breaking down its answering process into smaller steps.
Stanford University is developing cutting-edge technology for understanding how people respond to traumatic events or adverse healthcare experiences, such as heart attacks. Researchers are using a variety of advanced tools, including IoT sensors, audio, images, and video, to gather rich multimodal data that can be used to improve the consultation process. With this approach, Stanford aims to offer faster and more effective care to patients in need.
Figure: PaLM-E controlling a mobile robot tasked with fetching a bag of chips.
Figure: PaLM-E controlling a tabletop robot to successfully complete a long-horizon task.
What the Future Holds:
Multimodal AI is a relatively new field that is not yet fully understood, and its use cases and potential benefits are still being explored. Developing multimodal AI poses several challenges, one of the most important being the computational power required to train and run these large models. Another major challenge is successfully transferring knowledge between modalities, a problem known as co-learning.
One area that has been overlooked in AI research is the development of self-learning AI. Because multimodal models can understand data from various sources, they are better suited to consuming real-world data, which makes accurate self-learning AI more feasible than ever. This will push the boundaries of AI capabilities and provide a more comprehensive understanding of AI's potential for real-world applications.
Multimodal models can also open up new avenues for innovative use cases. Imagine a fridge with a camera inside that takes a picture of its contents and analyzes it to list all the food items. It then asks what cuisine you want to make today, such as Italian, generates a list of dishes that can be cooked with the ingredients in the fridge, suggests additional ingredients you need to buy, and can even provide step-by-step instructions for making the dish. Awesome, right?
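One way such a feature could be prototyped is by chaining two model calls: a vision call that inventories a fridge photo, and a text call that turns the inventory into recipe suggestions. The sketch below is purely hypothetical; the model name, image URL, and the suggest_dishes helper are all assumptions, not a real product's API.

```python
# Purely hypothetical sketch of the smart-fridge idea: one vision call lists the ingredients,
# a second text call suggests dishes. Model name, image URL, and suggest_dishes are assumptions.
from openai import OpenAI

client = OpenAI()

def suggest_dishes(fridge_photo_url: str, cuisine: str) -> str:
    # Step 1: ask a vision-capable model to inventory the fridge photo.
    inventory = client.chat.completions.create(
        model="gpt-4o",  # placeholder for a vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "List the food items visible in this fridge."},
                {"type": "image_url", "image_url": {"url": fridge_photo_url}},
            ],
        }],
    ).choices[0].message.content

    # Step 2: turn the inventory into cuisine-specific dishes plus a shopping list.
    prompt = (
        f"Ingredients on hand: {inventory}\n"
        f"Suggest {cuisine} dishes I can cook with them, with step-by-step instructions "
        f"and a list of any missing ingredients I should buy."
    )
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

print(suggest_dishes("https://example.com/fridge.jpg", "Italian"))
```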
Multimodal AI is undoubtedly the way forward in AI, and as the field continues to develop, we can expect to see increasingly innovative ways to use such AI systems.