Except now the semantic capabilities are much stronger: the transformer lets the model pick up meaning from words that are far apart from each other.
You are talking about English, right? And only for searches without any special technical terms or abbreviations?
Also, my use case includes more than 20 languages. Finding usable embeddings for all of them is next to impossible, while Solr and Elasticsearch have keyword-search plugins for most languages.
Btw, in my benchmarks the results look something like this in English (MAP = mean average precision):
BM25 (keyword search) -> MAP = 45%
Embedding (Ada-002) -> MAP = 49%
Hybrid (BM25 + Embedding) -> MAP = 57%
Hybrid (Embedding + BM25) -> MAP = 57%
And those BM25 numbers are before you add synonym dictionaries to the keyword search.
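To make the synonym point concrete, here's roughly what that looks like in Elasticsearch (the index name, synonym pairs, and local URL are placeholders; Solr has the equivalent SynonymGraphFilterFactory):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local instance

    # Hypothetical index: a standard analyzer plus a synonym token filter,
    # so e.g. "tv" and "television" match the same documents.
    es.indices.create(
        index="articles",
        settings={
            "analysis": {
                "filter": {
                    "en_synonyms": {
                        "type": "synonym",
                        "synonyms": ["tv, television", "car, automobile"],
                    }
                },
                "analyzer": {
                    "en_with_synonyms": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "en_synonyms"],
                    }
                },
            }
        },
        mappings={
            "properties": {"body": {"type": "text", "analyzer": "en_with_synonyms"}}
        },
    )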
BM25+Embedding and Embedding+BM25 give exactly the same result, which shows the fusion is commutative: it doesn't matter whether you start from keyword search or from semantic search.
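I won't claim this is exactly the fusion I ran, but reciprocal rank fusion is a common way to combine the two result lists, and it makes the commutativity obvious: each document's fused score is just a sum of per-list rank contributions. A sketch (k=60 is the usual default from the RRF paper):

    def rrf_scores(rankings, k=60):
        """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
        scores = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return scores

    bm25_top = ["d3", "d1", "d7"]   # toy example rankings
    embed_top = ["d1", "d9", "d3"]

    # Addition commutes, so the order of the input lists is irrelevant:
    assert rrf_scores([bm25_top, embed_top]) == rrf_scores([embed_top, bm25_top])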
For my tests I used Ada-002. The data was small news articles, with no chunking and no preprocessing; the query is embedded directly.
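For concreteness, with this setup embedding is just one API call per article and one per query; a minimal sketch using the current OpenAI Python SDK (my tests may have used an older SDK version, but the model is the same):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def embed(text: str) -> list[float]:
        """Return the 1536-dimensional Ada-002 embedding for one string."""
        resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
        return resp.data[0].embedding

    article_text = "Short news article text goes here."
    article_vec = embed(article_text)                  # one vector per article, no chunking
    query_vec = embed("ecb raises interest rates")     # query embedded directly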
Of course, both approaches can be improved. This should just illustrate what you might expect from hybrid search.