The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
By Zhengyuan Yang et al.
Table of Contents
List of Figures
1 Introduction
1.1 Motivation and Overview
1.2 Our Approach in Exploring GPT-4V
1.3 How to Read this Report
2 GPT-4V’s Input Modes
2.1 Text-only Inputs
2.2 Single Image-text Pair
2.3 Interleaved Image-text Inputs
3 GPT-4V’s Working Modes and Prompting Techniques
3.1 Following Text Instructions
3.2 Visual Pointing and Visual Referring Prompting
3.3 Visual + Text Prompting
3.4 In-context Few-shot Learning
4 Vision-Language Capability
4.1 Image Description on Diverse Domains
4.2 Object Localization, Counting, and Dense Captioning
4.3 Multimodal Knowledge and Commonsense
4.4 Scene Text, Table, Chart, and Document Reasoning
4.5 Multilingual Multimodal Understanding
4.6 Coding Capability with Vision
5 Interaction with Humans: Visual Referring Prompting
5.1 Understand Pointing Inputs
5.2 Visual Referring Prompting
5.3 Generate Pointing Outputs
6 Temporal and Video Understanding
6.1 Multi-image Sequencing
6.2 Video Understanding
6.3 Visual Referring Prompting for Grounded Temporal Understanding
7 Abstract Visual Reasoning and Intelligence Quotient Test
7.1 Abstract Visual Stimuli
7.2 Discovery and Association of Parts and Objects
7.3 Wechsler Adult Intelligence Scale
7.4 Raven’s Progressive Matrices
8 Emotional Quotient Test
8.1 Read Emotion from Facial Expressions
8.2 Understand How Visual Content Arouses Emotions
8.3 Emotion Conditioned Output
9 Emerging Application Highlights
9.1 Spot the Difference
9.2 Industry
9.3 Medical
9.4 Auto Insurance
9.5 Customized Captioner
9.6 Image Generation
9.7 Embodied Agent
9.8 GUI Navigation
10 LMM Powered Agents
10.1 Multimodal Plugins
10.2 Multimodal Chains
10.3 Self-Reflection
10.4 Self-Consistency
10.5 Retrieval-Augmented LMMs
11 Conclusions
11.1 Summary and Conclusions
11.2 Towards Future LMMs
Abstract
Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest such model, GPT-4V(ision), to deepen the understanding of LMMs. The analysis focuses on the intriguing tasks that GPT-4V can perform, and includes test samples to probe the quality and genericity of GPT-4V’s capabilities, its supported inputs and working modes, and the effective ways to prompt the model. Observations from these samples demonstrate that GPT-4V’s unprecedented ability to process arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make GPT-4V a powerful multimodal generalist system. Furthermore, GPT-4V’s unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods such as visual referring prompting. We conclude the report with in-depth discussions on the emerging application scenarios and the future research directions for GPT-4V-based systems. We hope that this preliminary exploration will inspire future research on next-generation multimodal task formulation, new ways to exploit and enhance LMMs to solve real-world problems, and a better understanding of multimodal foundation models. Finally, we acknowledge that the model under our study is solely the product of OpenAI’s innovative work, and they should be fully credited for its development.