Table of Contents
1 Introduction
2 Related Work
3 GPT-assisted Visual Instruction Data Generation
4 Visual Instruction Tuning
4.1 Architecture
4.2 Training
5 Experiments
5.1 Multimodal Chatbot
Summary
Instruction tuning large language models (LLMs) on machine-generated instruction-following data has been shown to improve zero-shot capabilities on new tasks. This paper presents the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. The authors introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder with an LLM for general-purpose visual and language understanding. The paper discusses the challenges of creating multimodal instruction-following data and presents a method that uses ChatGPT/GPT-4 for data collection. The architecture and training process of LLaVA are detailed, highlighting the two-stage instruction-tuning procedure. Experimental results demonstrate LLaVA's instruction-following and visual reasoning capabilities, particularly in the context of a multimodal chatbot.
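
To make the vision-to-language connection concrete, the sketch below shows how patch features from a vision encoder can be projected into the LLM's token-embedding space with a single trainable linear layer, the simple connector used in LLaVA's first version. The class name, dimensions, and usage are illustrative assumptions for this summary, not the authors' code.

# Minimal sketch of a LLaVA-style connector, assuming a CLIP-like vision
# encoder with 1024-d patch features and an LLM with 4096-d token embeddings.
# Names and dimensions are illustrative, not taken from the released code.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Projects vision-encoder patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # LLaVA's first version uses a single trainable linear projection.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        # Returns visual "tokens" that are concatenated with the text
        # embeddings before being fed to the language model.
        return self.proj(patch_features)

# Usage: fake patch features for one image (e.g., a 16x16 grid of patches).
features = torch.randn(1, 256, 1024)
visual_tokens = VisionLanguageConnector()(features)
print(visual_tokens.shape)  # torch.Size([1, 256, 4096])

In the two-stage procedure summarized above, a connector like this is trained first for feature alignment while the vision encoder and LLM stay frozen, and then fine-tuned together with the LLM on the instruction-following data.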