Experimental bi-gram minhash-based similarity
by Caio, 7 years ago
This patch exposes a `Searcher.findSimilar(String)` method that expects a plaintext recipe (name + ingredients + instructions) as input.

In short, what this does is:

- When indexing, write the minhashes of the bi-grams of the recipe text to the MINHASH field
- When querying, extract the minhashes of the bi-grams of the given text, rank recipes by the size of the minhash intersection (larger intersection -> higher ranking) and yield every doc that matches >= 20% of the hashes (threshold pulled out of thin air)

Details:

- Bi-grams are being used mostly for performance for now; I should experiment with larger n-grams later.
- I'm using the default MinHashFilter configuration, but with a small average document size I think I can tune this to be a lot more efficient. Right now we end up with a boolean query with 512 clauses.

The added test is pretty much useless: it just prints stuff to stdout so I can eyeball what's going on.

One alternative is to use MoreLikeThis, as it can be super fast. So far I prefer what I'm getting with the minhash approach, but I haven't actually tweaked it nor created a separate index field for it, so it's hard to compare. This will come next :-)
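For readers unfamiliar with the technique, the index/query flow described above can be sketched in plain Java without Lucene. This is an illustration only, not the code in this patch and not how MinHashFilter computes its hashes: the class name `BigramMinHashSketch`, the choice of 64 seeded hash functions, the use of word (rather than character) bi-grams and the hash mixing are all assumptions made for the example. Only the 20% cutoff comes from the description above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: classic k-hash-function minhash over word bi-grams.
final class BigramMinHashSketch {
    static final int NUM_HASHES = 64;     // assumption, not Lucene's default
    static final double MIN_MATCH = 0.20; // the ">= 20% of hashes" cutoff

    // Word bi-grams (pairs of adjacent tokens) of the lowercased text.
    static Set<String> bigrams(String text) {
        List<String> words = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) words.add(t);
        }
        Set<String> out = new HashSet<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            out.add(words.get(i) + " " + words.get(i + 1));
        }
        return out;
    }

    // Seeded 64-bit hash: FNV-1a over the UTF-8 bytes, then a splitmix64-style mix.
    static long hash(String s, int seed) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        long z = h + seed * 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // One minimum per hash function; stays Long.MAX_VALUE if there are no bi-grams.
    static long[] signature(String text) {
        long[] sig = new long[NUM_HASHES];
        java.util.Arrays.fill(sig, Long.MAX_VALUE);
        for (String gram : bigrams(text)) {
            for (int i = 0; i < NUM_HASHES; i++) {
                sig[i] = Math.min(sig[i], hash(gram, i));
            }
        }
        return sig;
    }

    // Fraction of signature positions that agree (an estimate of Jaccard similarity).
    static double matchRatio(long[] a, long[] b) {
        int same = 0;
        for (int i = 0; i < NUM_HASHES; i++) {
            if (a[i] == b[i]) same++;
        }
        return (double) same / NUM_HASHES;
    }

    // Indices of docs matching at least MIN_MATCH of the hashes, best match first.
    static List<Integer> findSimilar(String query, List<String> docs) {
        List<Integer> hits = new ArrayList<>();
        if (bigrams(query).isEmpty()) return hits; // nothing to compare against
        long[] q = signature(query);
        double[] ratio = new double[docs.size()];
        for (int i = 0; i < docs.size(); i++) {
            ratio[i] = matchRatio(q, signature(docs.get(i)));
            if (ratio[i] >= MIN_MATCH) hits.add(i);
        }
        hits.sort((x, y) -> Double.compare(ratio[y], ratio[x]));
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("Tomato soup: tomatoes, onion, garlic. Simmer and blend.");
        docs.add("Chocolate cake: flour, sugar, cocoa, eggs. Bake for 30 minutes.");
        System.out.println(findSimilar("Tomato soup with onion, garlic and basil", docs));
    }
}
```

In the actual patch the signatures live in the index (the MINHASH field) and the comparison is done by a boolean query, so nothing is recomputed per document at query time as it is in this sketch.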