caio.co/de/cerberus


Experimental bi-gram minhash-based similarity, by Caio 7 years ago
This patch exposes a `Searcher.findSimilar(String)` method that
expects a plaintext recipe (name + ingredients + instructions) as
input.

In a nutshell, what this does is:

 - When indexing, write to the MINHASH field the minhashes of the
   bi-grams of the recipe text

 - When querying, extract the minhashes of the bi-grams of the given
   text, rank recipes by the size of the minhash intersection
   (larger intersection -> higher ranking), and yield every doc
   that matches >= 20% of the hashes (threshold pulled out of thin air).
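
To make the two steps above concrete, here is a minimal, dependency-free sketch of bi-gram minhashing. It is not the code in this patch and it does not reproduce Lucene's MinHashFilter: it assumes the textbook variant with several seeded hash functions, and the names `MinHashSketch`, `NUM_HASHES`, `signature` and `overlap` are made up for illustration. Only the 20% threshold comes from the description above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a simplified stand-in for what the MINHASH field stores.
class MinHashSketch {
    static final int NUM_HASHES = 64;     // assumption, not the patch's setting
    static final double MIN_MATCH = 0.20; // threshold named in the commit message

    /** Word bi-grams of the lower-cased text. */
    static Set<String> bigrams(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\W+");
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 1 < words.length; i++) {
            grams.add(words[i] + " " + words[i + 1]);
        }
        return grams;
    }

    /** Seeded 64-bit hash (FNV-1a style with a final mixing step). */
    static long hash(String s, int seed) {
        long h = 0xcbf29ce484222325L ^ (seed * 0x9E3779B97F4A7C15L);
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        return h;
    }

    /** For each hash function, keep the smallest hash seen over all bi-grams. */
    static long[] signature(String text) {
        long[] sig = new long[NUM_HASHES];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (String gram : bigrams(text)) {
            for (int i = 0; i < NUM_HASHES; i++) {
                sig[i] = Math.min(sig[i], hash(gram, i));
            }
        }
        return sig;
    }

    /** Fraction of signature slots on which the two texts agree. */
    static double overlap(long[] a, long[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) same++;
        }
        return (double) same / a.length;
    }

    public static void main(String[] args) {
        long[] indexed = signature("Tomato soup: chop the tomatoes and simmer with basil");
        long[] query = signature("Tomato soup: chop the tomatoes and simmer with garlic");
        double score = overlap(indexed, query);
        System.out.println("overlap=" + score + " similar=" + (score >= MIN_MATCH));
    }
}
```

The agreeing fraction estimates the Jaccard similarity of the two bi-gram sets, which is why ranking by intersection size and cutting off at a fixed fraction is a reasonable first approximation.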

Details:

 - Bi-grams are being used mostly for performance for now; I should
   experiment with larger n-grams later.

 - I'm using the default MinHashFilter configuration, but with a
   small average document size I'm thinking I can tune this to be
   a lot more efficient. Right now we end up with a boolean query
   with 512 clauses.
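
For a sense of where the 512 comes from and what the 20% cutoff means in clause terms, here is a small arithmetic sketch. It assumes, from memory of Lucene's MinHashFilter defaults, hashCount = 1, bucketCount = 512 and hashSetSize = 1, and it assumes the cutoff is rounded up to a whole number of clauses; neither detail is confirmed by this patch. `MinHashClauseBudget` and its methods are hypothetical names.

```java
// Illustrative only: back-of-the-envelope numbers for the query size.
class MinHashClauseBudget {

    /** One clause per emitted minhash (assumed: hashCount * bucketCount * hashSetSize). */
    static int clauseCount(int hashCount, int bucketCount, int hashSetSize) {
        return hashCount * bucketCount * hashSetSize;
    }

    /** Smallest whole number of clauses that is at least the given fraction. */
    static int minimumShouldMatch(int clauses, double fraction) {
        return (int) Math.ceil(clauses * fraction);
    }

    public static void main(String[] args) {
        int clauses = clauseCount(1, 512, 1); // assumed default configuration
        int needed = minimumShouldMatch(clauses, 0.20);
        System.out.println(clauses + " clauses, at least " + needed + " must match");
    }
}
```

Under these assumptions, shrinking the bucket count would cut the number of clauses proportionally, at the cost of a coarser similarity estimate.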

The added test is pretty much useless: it just prints stuff to stdout
so I can eyeball what's going on.

One alternative is to use MoreLikeThis as it can be super fast. So
far I prefer what I'm getting with the minhash approach, but I
haven't actually tweaked it nor created a separate index field for
it, so it's hard to compare. This will come next :-)



Cerberus

Cerberus is the Search and Metadata Retrieval library for gula.recipes and makes use of Lucene for searching and Chronicle-Map for metadata persistence / memory-mapping.

Build

mvn install

Test

mvn test