Multimodal
AI systems that can process and generate multiple types of content: text, images, audio, and video.
AI-generated
A multimodal AI model can work with more than just text. GPT-4, Claude, and Gemini can all process both text and images. Some models can also handle audio and video. "Multimodal" means the model understands and can reason across different types of media.
Multimodal capability dramatically expands what AI can do. You can show it a screenshot and ask what is wrong with the UI. You can photograph a math problem and get a solution. You can upload a chart and ask for analysis. As models become natively multimodal (trained on all modalities from the start), their cross-modal reasoning improves.
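In practice, sending an image alongside text usually means packaging both into one structured message. Below is a minimal sketch of the content-parts shape used by OpenAI-style chat APIs (other providers use similar but not identical structures; the helper function here is illustrative, not part of any SDK):

```python
import base64


def build_multimodal_message(prompt: str, image_bytes: bytes,
                             mime: str = "image/png") -> dict:
    """Pair a text prompt with an image in one user message,
    using the content-parts format accepted by OpenAI-style
    chat APIs. Check your provider's docs for the exact shape."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # The image travels inline as a base64 data URL.
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }


# Example: pair a question with (placeholder) screenshot bytes.
msg = build_multimodal_message("What is wrong with this UI?", b"\x89PNG...")
print(msg["content"][0]["type"], msg["content"][1]["type"])
# → text image_url
```

The same message would then be passed to the provider's chat endpoint; models without vision support will reject or ignore the image part.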
Google DeepMind: Gemini (native multimodal) - https://deepmind.google/technologies/gemini/
OpenAI: GPT-4 vision capabilities - https://openai.com/research/gpt-4