Language Is Not All You Need: Aligning Perception with Language Models

By Shaohan Huang et al.
Published on March 1, 2023

Table of Contents

Abstract: A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. ...
Introduction: From LLMs to MLLMs
KOSMOS-1: A Multimodal Large Language Model
Model Training
...

Summary

The document argues that aligning perception with language models is essential for progress toward artificial general intelligence. It introduces KOSMOS-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, follow instructions, and learn in context (i.e., few-shot learning). The model is trained from scratch on web-scale multimodal corpora, including text data, image-caption pairs, and arbitrarily interleaved image-text data. Experimental results show strong performance on language understanding and generation, perception-language tasks such as image captioning and visual question answering, and vision tasks. The document frames the shift from Large Language Models (LLMs) to MLLMs as a crucial step in advancing AI capabilities.