Summary
This paper presents a comparative study of synthetic versus human-written documents as training data for cross-encoder re-rankers. The authors introduce the ChatGPT-RetrievalQA dataset and evaluate re-rankers fine-tuned on ChatGPT-generated versus human-written responses. The study shows that models trained on ChatGPT responses are more effective zero-shot re-rankers, while models trained on human responses perform better in the supervised setting. A domain-level analysis further indicates that human-trained models are stronger in specific domains such as Medicine. The study also examines the effectiveness of BM25 on human- and ChatGPT-generated responses across several datasets, as well as the performance of the cross-encoder re-rankers on unseen documents from the human-generated collection. Overall, the findings highlight the potential of generative LLMs for augmenting training data for neural retrieval models.
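For readers unfamiliar with the setup being evaluated, the following is a minimal sketch of cross-encoder re-ranking, where candidates (typically retrieved by a first-stage ranker such as BM25) are scored jointly with the query. This is an illustration of the general technique, not the authors' exact pipeline; the checkpoint name, query, and candidate texts are assumptions for demonstration only.

```python
# Minimal sketch of cross-encoder re-ranking (illustrative only).
# Assumes the sentence-transformers package and a public MS MARCO
# cross-encoder checkpoint; texts below are made up for the example.
from sentence_transformers import CrossEncoder

query = "What are the side effects of ibuprofen?"
# Hypothetical candidate responses, e.g., the top-k from BM25.
candidates = [
    "Ibuprofen can cause stomach upset, heartburn, and dizziness.",
    "Ibuprofen is a nonsteroidal anti-inflammatory drug (NSAID).",
    "Aspirin was first synthesized in 1897.",
]

# The cross-encoder scores each (query, candidate) pair jointly;
# higher scores indicate higher estimated relevance.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([(query, c) for c in candidates])

# Re-rank candidates by descending relevance score.
for score, text in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {text}")
```

In the paper's zero-shot condition, a model like this is applied to a target collection without further fine-tuning on it; in the supervised condition, it is first fine-tuned on labeled query-response pairs from either the ChatGPT-generated or the human-written data.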