Vector Search with OpenAI Embeddings: Lucene Is All You Need
By Jimmy Lin et al.
Table of Contents
1. Introduction
2. From Architecture to Implementation
3. Experiments
4. Discussion
Summary
This paper challenges the prevailing narrative that a dedicated vector store is necessary for vector search with OpenAI embeddings. It demonstrates that Lucene, specifically its hierarchical navigable small-world (HNSW) indexes, provides sufficient vector search capabilities in a standard bi-encoder architecture. The paper argues that existing infrastructure, such as platforms built on Lucene like Elasticsearch and Solr, already provides strong support for search applications, and its cost-benefit analysis concludes that the benefits of a separate vector store do not justify the added complexity in modern enterprise AI stacks.

Experiments on the MS MARCO passage ranking test collection show that OpenAI embeddings indexed with Lucene achieve effectiveness comparable to state-of-the-art models. The paper details the process of encoding the corpus and queries with OpenAI's ada2 model, indexing the resulting dense vectors with Lucene, and the performance metrics achieved on a high-end server. While acknowledging some implementation challenges, it concludes that building a vector search prototype with OpenAI embeddings and Lucene is feasible today.
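The bi-encoder architecture described above reduces retrieval to nearest-neighbor search over dense vectors: documents and queries are embedded separately, and documents are ranked by vector similarity. The following is a minimal Python sketch of that idea using exact brute-force cosine similarity; the function names and toy three-dimensional vectors are illustrative only (real ada2 embeddings have 1536 dimensions), and an HNSW index such as Lucene's approximates this same top-k ranking with sub-linear search cost.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus, k=2):
    """Exact top-k retrieval by brute force; an HNSW index
    approximates this ranking without scoring every document."""
    scored = sorted(corpus.items(),
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy "embeddings" standing in for ada2 vectors of the corpus.
corpus = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.1, 0.9, 0.0],
    "doc3": [0.8, 0.2, 0.1],
}

print(top_k([1.0, 0.0, 0.0], corpus, k=2))  # → ['doc1', 'doc3']
```

In the setup the paper describes, the brute-force loop is replaced by Lucene's HNSW index, which stores the document vectors at indexing time and answers approximate top-k queries at search time.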