caio.co/de/cerberus


Experimental MoreLikeThis-based similarity (by Caio, 7 years ago)
This patch exposes a `Searcher.findSimilar(String, int)` method
that expects a plaintext recipe (name + ingredients + instructions)
as input.

I wasn't happy with the MoreLikeThis results because, after I split
the FULLTEXT index field into separate fields, the MoreLikeThis logic
could no longer be applied properly through the String-based
interface I decided to use. Elaborating:

 - MoreLikeThis works by collecting term statistics for the
   queried text and comparing them against the term frequencies of
   the documents stored in the index

 - With a single string representing the whole recipe, MoreLikeThis
   would need a way to know which of the given terms should only
   match in the name (or only in the ingredients, or instructions),
   and a plain string doesn't carry that information
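
To make the first point concrete, here is a rough, stdlib-only sketch of
MoreLikeThis-style term selection. It is not Lucene code: the class name
`InterestingTerms`, the tokenizer and the exact tf * idf formula are
assumptions made for illustration, and Lucene's real implementation has
more knobs (minimum frequencies, stop words, per-field handling).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Toy illustration of MoreLikeThis-style term selection; not Lucene code. */
final class InterestingTerms {

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /**
     * Returns up to maxTerms terms of queryText, best first, scored as
     * tf * idf: tf is counted in the query text, idf comes from the corpus.
     */
    static List<String> select(String queryText, List<String> corpus, int maxTerms) {
        // Document frequency: in how many corpus documents each term occurs.
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : corpus) {
            for (String term : new HashSet<>(tokenize(doc))) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }

        // Term frequency within the queried text itself.
        Map<String, Integer> termFreq = new HashMap<>();
        for (String term : tokenize(queryText)) {
            termFreq.merge(term, 1, Integer::sum);
        }

        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> entry : termFreq.entrySet()) {
            int df = docFreq.getOrDefault(entry.getKey(), 0);
            if (df == 0) {
                continue; // a term the corpus has never seen cannot match anything
            }
            // Assumed idf formula: rarer terms get a larger weight.
            double idf = Math.log((double) corpus.size() / df) + 1.0;
            score.put(entry.getKey(), entry.getValue() * idf);
        }

        // Highest score first; ties are broken alphabetically for determinism.
        List<String> terms = new ArrayList<>(score.keySet());
        terms.sort((a, b) -> {
            int byScore = Double.compare(score.get(b), score.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(maxTerms, terms.size())));
    }
}
```

Note that nothing in this sketch knows whether a term came from the
name, the ingredients or the instructions, which is the limitation
described in the second point.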

So I had two approaches to tackle this:

 1. Manually build a boolean query with a `MoreLikeThis.like()`
    clause for each field we care about, boosted accordingly

 2. Add back a "full text" index and query on it.

Alternative #1 sounded annoying to tweak; #2 sounded simple, at the
cost of a larger index. I chose #2 because we have RAM to spare, so
`vmtouch`-ing the index and letting Lucene fly is still a viable
option.

This patch adds a QA-type test where we just verify that searching
for a recipe that's in the index actually returns said recipe in
the top 5 results.
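
A minimal sketch of that check, under assumptions: the
`SimilarityFinder` interface below is a hypothetical stand-in for
`Searcher.findSimilar(String, int)`, and returning a list of recipe ids
is a guess, since the actual return type isn't shown here.

```java
import java.util.List;

/** Hypothetical stand-in for the searcher; the real return type may differ. */
interface SimilarityFinder {
    List<Long> findSimilar(String recipeText, int maxResults);
}

final class SimilarityQa {

    /** True when the indexed recipe shows up in its own top 5 similar results. */
    static boolean findsItself(SimilarityFinder finder, long recipeId, String recipeText) {
        return finder.findSimilar(recipeText, 5).contains(recipeId);
    }
}
```

A test would then loop over a sample of indexed recipes and assert
`findsItself` for each of them.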


Cerberus

Cerberus is the Search and Metadata Retrieval library for gula.recipes and makes use of Lucene for searching and Chronicle-Map for metadata persistence / memory-mapping.

Build

mvn install

Test

mvn test