AVIS: Autonomous Visual Information Seeking with Large Language Model Agent

By Ziniu Hu et al.
Published on Nov. 10, 2023

Table of Contents

1. Introduction
2. Related Work
3. Method
3.1 General Framework
3.2 Tools and their APIs

Summary

The paper proposes AVIS, an autonomous information-seeking visual question answering framework that leverages a Large Language Model to dynamically strategize the use of external tools for knowledge acquisition. Guided by examples of human decision-making, the method achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks. The system comprises a planner, a working memory, and a reasoner, which together enable dynamic decision-making and effective tool usage. The authors conducted a user study to collect examples of human decision-making and built a structured framework from this data. The workflow proceeds in iterative cycles of decision-making, tool execution, and reasoning until a satisfactory answer is obtained.
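To make the planner / working-memory / reasoner cycle concrete, the sketch below shows one way such a loop could be structured in Python. It is an illustration only, assuming hypothetical interfaces (WorkingMemory, planner.next_action, reasoner.assess, and a dict of callable tools) rather than the authors' actual implementation or API.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    """Accumulates the question, the image, and the tool outputs and decisions made so far."""
    question: str
    image: object
    history: list = field(default_factory=list)

    def add(self, entry: dict) -> None:
        self.history.append(entry)


def answer_question(question, image, planner, reasoner, tools, max_steps=10):
    """Iterate: plan a tool call, execute it, reason over the output,
    and stop once the reasoner judges the answer satisfactory."""
    memory = WorkingMemory(question=question, image=image)

    for _ in range(max_steps):
        # Planner picks the next tool/API and its arguments from the current state,
        # e.g. {"tool": "web_search", "args": {"query": "..."}} (names are illustrative).
        action = planner.next_action(memory)
        output = tools[action["tool"]](**action["args"])

        # Reasoner judges whether the output is informative and whether enough
        # evidence has been gathered to produce a final answer.
        verdict = reasoner.assess(memory, action, output)
        memory.add({"action": action, "output": output, "verdict": verdict})

        if verdict.get("answer") is not None:
            return verdict["answer"]

    return None  # no satisfactory answer within the step budget
```

The loop mirrors the workflow described in the summary: the working memory persists intermediate results across iterations, the planner selects tools dynamically rather than following a fixed pipeline, and the reasoner decides when to stop.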