Experimental MoreLikeThis-based similarity
by Caio
7 years ago
This patch exposes a `Searcher.findSimilar(String, int)` method
that expects a plaintext recipe (name + ingredients + instructions)
as input.
I wasn't happy with the MoreLikeThis results because, ever since
I split the FULLTEXT index field into separate fields, the logic
in MoreLikeThis couldn't be applied properly through the String-based
interface I decided to use. Elaborating:
- MoreLikeThis works by looking at the term statistics of the
queried text and comparing them against the term frequencies of the
documents stored in the index
- Passing a single string to represent the recipe meant that
MoreLikeThis would need a way to know which of the given terms
should only occur in the name (or only in the ingredients, or only
in the instructions)
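To make the term-statistics point concrete, here is a rough sketch in
plain Java of the general idea: score each term of the queried text by
its frequency in that text times a simplified inverse document
frequency, then keep the best ones. This is not Lucene's code; the
class `InterestingTerms`, the method `pick`, and the exact idf formula
are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InterestingTerms {
    /**
     * Hypothetical helper: ranks the terms of the queried text by
     * tf * idf, using the document frequencies of ONE index field,
     * and returns at most maxTerms of them.
     */
    public static List<String> pick(List<String> queryTerms,
                                    Map<String, Integer> docFreq,
                                    int numDocs,
                                    int maxTerms) {
        // Term frequency within the queried text
        Map<String, Integer> tf = new HashMap<>();
        for (String term : queryTerms) {
            tf.merge(term, 1, Integer::sum);
        }

        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            if (df == 0) {
                continue; // term never appears in this field, so it can't match
            }
            // Simplified idf (an assumption, not Lucene's exact formula)
            double idf = Math.log((double) numDocs / df) + 1.0;
            score.put(e.getKey(), e.getValue() * idf);
        }

        // Highest score first; ties broken alphabetically for determinism
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(score.entrySet());
        ranked.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });

        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(maxTerms, ranked.size()); i++) {
            result.add(ranked.get(i).getKey());
        }
        return result;
    }
}
```

Since `docFreq` describes a single field, the same input text can
yield different "interesting" terms depending on which field's
statistics are used, which is why one plain string is awkward to
match against an index split into name, ingredients and instructions.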
So I had two options for tackling this:
1. Manually build a boolean query with a `MoreLikeThis.like()`
clause for each field we care about and boost accordingly
2. Add back a "full text" index field and query on it.
Alternative #1 sounded annoying to tweak; #2 sounded simple, at the
cost of a larger index. I chose #2 because we have RAM to spare, so
`vmtouch`-ing the index and letting Lucene fly is still a perfectly
viable approach.
This patch adds a QA-type test where we just verify that searching
for a recipe that's in the index actually returns said recipe in
the top 5 results.
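
The shape of that check, as a minimal sketch: the `SimilarSearcher`
interface below is a hypothetical stand-in, since this message doesn't
state what `Searcher.findSimilar(String, int)` returns (a list of
recipe ids is assumed here), and `missingFromOwnResults` is a made-up
helper name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SimilarityQa {
    /** Hypothetical stand-in for the searcher; returning recipe ids is an assumption. */
    public interface SimilarSearcher {
        List<Long> findSimilar(String recipeText, int maxResults);
    }

    /**
     * Queries with the text of every indexed recipe and collects the ids
     * of the recipes that do NOT show up among their own topN results.
     * An empty list means the check passed.
     */
    public static List<Long> missingFromOwnResults(Map<Long, String> indexedRecipes,
                                                   SimilarSearcher searcher,
                                                   int topN) {
        List<Long> failures = new ArrayList<>();
        for (Map.Entry<Long, String> recipe : indexedRecipes.entrySet()) {
            List<Long> similar = searcher.findSimilar(recipe.getValue(), topN);
            if (!similar.contains(recipe.getKey())) {
                failures.add(recipe.getKey());
            }
        }
        return failures;
    }
}
```

Calling it with `topN = 5` mirrors the "top 5 results" criterion; the
check deliberately doesn't require the recipe to be the first hit.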