BM𝒳: A Freshly Baked Take on BM25
We are proud to announce that researchers from Mixedbread and the Hong Kong Polytechnic University have developed a new lexical search algorithm, BMX, that outperforms the current standard BM25 across the board and is easy to use via Mixedbread's open-source Baguetter library.
Read on to learn about BMX, how it works, our benchmarks, and how to use it in practice. If you want to jump right in, you can look into the paper and the library right here:
- BMX Paper: Our research paper covering the approach in more academic detail.
- Baguetter: Our reference implementation of BMX and an easy-to-hack testing framework for information retrieval.
TLDR: Our new BMX search algorithm iterates on the long-standing industry standard BM25 and can be accessed via our fully open-source Baguetter library.
What is BMX?
Most text search engines are powered by lexical (keyword) search algorithms, the most prominent of which is BM25. It powers web search, e-commerce, legal search, and many other applications. A key strength of BM25 is that it performs well in out-of-distribution settings, meaning it generalizes well to data it has not seen before. However, the keyword search approach comes with its own limitations:
- BM25 does not consider the overall similarity between the query and a given document, a signal that would enable a more accurate assessment of the document's relevance to the query.
- Lexical search algorithms lack semantic understanding and are therefore unable to deal with linguistic nuances like synonyms and homonyms. This limitation is a key factor in the underperformance of lexical search compared to domain-specific text embedding-based semantic search.
To tackle these challenges, we propose BMX, a lexical search algorithm that incorporates both similarity and semantics. It comes with these key innovations:
- Entropy-weighted similarity: We adjust the similarity scores between query tokens and related documents based on the entropy of each token. This reduces the influence of common tokens, allowing rarer but more meaningful tokens to have a greater impact on the similarity score (a toy sketch follows this list).
- Weighted query augmentation (WQA): To incorporate semantics into lexical search, we developed a method that scores both the original query and its augmented versions simultaneously within BMX. This removes the need for multiple retrieval and reranking rounds, improving efficiency.
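To build intuition for the entropy weighting, here is a toy sketch in Python. It illustrates only the general idea: the per-document term-frequency distribution and the 1 / (1 + H) down-weighting are our own simplifications for illustration, not the exact formulas from the BMX paper.

```python
import math
from collections import Counter

# Toy corpus of three tiny "documents".
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the quantum entanglement of particles",
]

def token_entropy(token: str, corpus: list[str]) -> float:
    """Shannon entropy (in nats) of a token's occurrence distribution
    across documents. Tokens spread evenly over many documents (like
    "the") get high entropy; tokens concentrated in a single document
    (like "quantum") get low entropy."""
    tfs = [Counter(doc.split())[token] for doc in corpus]
    total = sum(tfs)
    if total == 0:
        return 0.0
    return -sum((tf / total) * math.log(tf / total) for tf in tfs if tf > 0)

# Illustrative down-weighting: the more evenly spread (i.e. more common)
# a token is, the smaller its weight. 1 / (1 + H) is our own toy choice,
# not the paper's formula.
for tok in ("the", "cat", "quantum"):
    h = token_entropy(tok, docs)
    print(f"{tok:>8}: entropy={h:.3f}  weight={1.0 / (1.0 + h):.3f}")
```

Running this prints the highest weight for "quantum" and the lowest for "the", which is exactly the behavior we want: ubiquitous tokens say little about relevance.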
How does BMX work?
Now, we will explore the core scoring algorithm in BMX as well as weighted query augmentation (WQA). This will be a math-heavy section, but please bear with us!
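As a refresher, BM25 scores a document $D$ against a query $Q$ by summing, over the query tokens $q_i$, a saturated term-frequency component weighted by inverse document frequency:

$$\text{BM25}(Q, D) = \sum_{q_i \in Q} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

where $f(q_i, D)$ is the frequency of $q_i$ in $D$, $|D|$ is the document length, $\text{avgdl}$ is the average document length in the collection, and $k_1$ and $b$ are free parameters.

Schematically (this is our paraphrase; the exact formulation and normalization are given in the BMX paper), BMX augments each token's contribution with an entropy-weighted query-document similarity term:

$$\text{BMX}(Q, D) \approx \sum_{q_i \in Q} \text{IDF}(q_i) \cdot \left(\frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)} + \alpha \cdot \hat{E}(q_i) \cdot S(Q, D)\right)$$

Here $S(Q, D)$ captures the token overlap between query and document, $\hat{E}(q_i)$ is the normalized entropy of $q_i$ over the collection (so evenly spread, uninformative tokens contribute less), and $\alpha$ controls the strength of the similarity term.

For weighted query augmentation, given augmented queries $Q_1, \dots, Q_m$ (for example, paraphrases of $Q$) with weights $w_1, \dots, w_m$, the final score combines the original and augmented queries in a single retrieval pass, again schematically:

$$\text{score}_{\text{WQA}}(Q, D) = \text{BMX}(Q, D) + \sum_{j=1}^{m} w_j \cdot \text{BMX}(Q_j, D)$$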
Performance and Benchmarking
To validate our ideas and assess the performance of BMX, we evaluated it extensively on several benchmarks, including BEIR, BRIGHT, and a set of multilingual retrieval datasets. You can find the full benchmarking overview in the BMX paper. Our results show that BMX outperforms BM25 across the board and can offer a significant improvement in retrieval quality.
BEIR
BEIR is a benchmark focused on out-of-domain information retrieval. It covers datasets spanning different domains, document and query lengths, and tasks. After evaluating on the 15 available datasets, we compare BMX against the BM25 implementations in Baguetter and the bm25s library (scores are nDCG@10):
Model | AA | SD | SF | NFC | TCV | TCH | FQA | HQA | MM | FVR | NQ | DBP | QRA | CF | CQA | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BM25S, k1=1.2, b=0.75 | ||||||||||||||||
Robertson | 49.2 | 15.5 | 68.3 | 31.9 | 59.0 | 33.8 | 25.4 | 58.5 | 22.6 | 50.3 | 29.2 | 30.3 | 80.4 | 13.7 | 29.9 | 39.87 |
ATIRE | 48.7 | 15.6 | 68.1 | 31.8 | 61.0 | 33.2 | 25.3 | 58.5 | 22.6 | 50.3 | 29.1 | 30.3 | 80.5 | 13.7 | 30.1 | 39.92 |
BM25+ | 48.7 | 15.6 | 68.1 | 31.8 | 61.0 | 33.2 | 25.3 | 58.5 | 22.6 | 50.3 | 29.1 | 30.3 | 80.5 | 13.7 | 30.1 | 39.92 |
BM25L | 49.6 | 15.8 | 68.7 | 32.2 | 62.9 | 33.0 | 25.0 | 55.9 | 21.4 | 46.6 | 28.1 | 29.4 | 80.3 | 13.5 | 29.8 | 39.48 |
BM25 | 48.7 | 15.6 | 68.0 | 31.8 | 61.0 | 33.2 | 25.3 | 58.5 | 22.6 | 50.3 | 29.1 | 30.3 | 80.5 | 13.7 | 30.1 | 39.91 |
Baguetter, k1=1.2, b=0.75 | ||||||||||||||||
Robertson | 49.55 | 15.69 | 68.71 | 32.85 | 64.36 | 30.95 | 25.27 | 59.13 | 23.19 | 50.93 | 28.55 | 31.69 | 78.73 | 13.54 | 30.73 | 40.26 |
ATIRE | 49.23 | 15.66 | 68.73 | 32.71 | 65.94 | 31.15 | 25.30 | 59.13 | 23.23 | 50.98 | 28.59 | 31.56 | 78.72 | 13.55 | 31.00 | 40.37 |
BM25+ | 49.23 | 15.66 | 68.73 | 32.71 | 65.94 | 31.15 | 25.30 | 59.13 | 23.23 | 50.98 | 28.59 | 31.56 | 78.72 | 13.55 | 31.00 | 40.37 |
BM25L | 50.32 | 15.88 | 69.37 | 33.00 | 67.05 | 31.52 | 24.79 | 56.35 | 22.01 | 47.37 | 27.35 | 30.87 | 78.21 | 13.29 | 30.87 | 39.88 |
BM25 | 49.32 | 15.65 | 68.67 | 32.68 | 65.78 | 31.11 | 25.26 | 59.09 | 23.24 | 50.98 | 28.85 | 31.56 | 78.73 | 13.55 | 30.99 | 40.36 |
BMX | 50.46 | 15.91 | 69.42 | 32.84 | 68.10 | 34.34 | 25.39 | 61.73 | 24.21 | 55.75 | 29.84 | 32.20 | 78.56 | 13.32 | 30.77 | 41.52 |
As the table shows, BMX generally outperforms all BM25 variants, achieving the best results on 11 out of 15 datasets. In our view, this underscores the importance of incorporating query-document similarity in our search algorithm.
We also observe that embedding-based models consistently outperform lexical models, which can be attributed to their semantic understanding. However, because lexical search algorithms require far fewer computational resources, they remain much more widely used.
BRIGHT
BRIGHT was introduced as the first text retrieval benchmark that requires intensive reasoning to retrieve the relevant documents. It includes 1,385 real-world queries across diverse domains such as StackExchange, LeetCode, and math competitions. These queries are paired with, for example, relevant web pages linked in StackExchange answers or theorems tagged in math olympiad questions, and identifying the connection between query and document requires deliberate reasoning. For this evaluation, we also measured the performance of BMX with WQA.
Model | BIO | ES | ECO | PSY | ROB | SO | SL | LC | PONY | AOPS | TQ | TT | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Embedding Models | |||||||||||||
E5-Mistral-7B | 18.80 | 26.00 | 15.50 | 15.80 | 16.40 | 9.80 | 18.50 | 28.70 | 4.80 | 7.10 | 23.90 | 25.10 | 17.50 |
Proprietary Models | |||||||||||||
OpenAI | 23.70 | 26.30 | 20.00 | 27.50 | 12.90 | 12.50 | 20.30 | 23.60 | 2.50 | 8.50 | 22.20 | 10.80 | 17.57 |
Cohere | 19.00 | 27.50 | 20.20 | 21.80 | 16.20 | 16.50 | 17.70 | 26.80 | 1.80 | 6.50 | 15.10 | 7.10 | 16.35 |
Voyage | 23.60 | 25.10 | 19.80 | 24.80 | 11.20 | 15.00 | 15.60 | 30.60 | 1.50 | 7.40 | 26.10 | 11.10 | 17.65 |
Lexical Models | |||||||||||||
BM25 | 19.25 | 29.67 | 16.89 | 15.77 | 14.19 | 15.22 | 15.30 | 25.05 | 4.62 | 8.28 | 6.57 | 2.34 | 14.43 |
BM25 + WQA | 22.90 | 33.86 | 17.20 | 24.33 | 14.54 | 20.21 | 16.01 | 26.61 | 6.03 | 8.41 | 10.50 | 3.29 | 16.99 |
BMX | 26.40 | 33.55 | 13.29 | 16.06 | 14.68 | 13.80 | 15.47 | 25.35 | 10.91 | 9.18 | 8.17 | 2.38 | 15.77 |
BMX + WQA | 31.57 | 40.24 | 17.40 | 23.53 | 16.67 | 18.32 | 15.05 | 28.34 | 10.82 | 9.22 | 9.98 | 2.74 | 18.66 |
The table shows that BMX with weighted query augmentation outperforms all baselines on average, including both the embedding models and the other lexical approaches. We attribute this to BMX's ability to generalize across domains, whereas embedding models often struggle with out-of-domain data. The results also highlight that the WQA mechanism successfully enhances BMX's semantic understanding, enabling it to handle realistic, reasoning-heavy retrieval scenarios more effectively. Notably, BMX achieves this without the costly training on massive, high-quality datasets that embedding models require.
Multilingual Benchmark
To verify the performance of BMX on multilingual data, we also performed benchmarking on datasets for Chinese, Japanese, Korean, German, and French retrieval tasks.
Model | MMarcoRetrieval (Chinese) | JaQuAD (Japanese) | StrategyQA (Korean) | GermanDPR (German) | FQuAD (French) | Avg. |
---|---|---|---|---|---|---|
BM25 | 47.41 | 54.26 | 33.89 | 52.27 | 91.80 | 55.93 |
BMX | 47.94 | 54.63 | 35.72 | 53.58 | 91.92 | 56.76 |
As the table shows, BMX consistently outperforms BM25 across all five datasets, suggesting that its improvements carry over to diverse multilingual information retrieval tasks.
Try it out
Installation
Before you start, make sure to install the Baguetter library:
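```bash
pip install baguetter
```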
Usage
The following examples show how to use BMX in practice:
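Below is a minimal sketch adapted from the Baguetter README. The index class name `BMXSparseIndex` and the `add_many` / `search` calls reflect the library at the time of writing; please check the repository for the current API.

```python
from baguetter.indices import BMXSparseIndex

# Create an in-memory BMX index.
idx = BMXSparseIndex()

doc_ids = ["doc_1", "doc_2", "doc_3"]
docs = [
    "We all love baguettes and sourdough.",
    "BMX extends BM25 with entropy-weighted similarity.",
    "Lexical search generalizes well to unseen domains.",
]

# Tokenize and index the documents.
idx.add_many(doc_ids, docs)

# Retrieve the documents most relevant to a query.
results = idx.search("entropy-weighted lexical search")
print(results)
```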
Why does this matter?
Improving the quality of lexical search without significantly increasing complexity or computational cost has wide-reaching benefits. BMX can improve the user experience of any application that relies on keyword search, and we also expect gains for natural language processing systems that use retrieval as part of their pipeline.
We're excited to hear the community's thoughts on BMX and where this algorithm could be used in the future. If you want to give us feedback or discuss anything search- or machine-learning-related, please join our Discord community.